Microsoft's Introduction to Data Science
In the summer of 2018, I took a short module from Microsoft on Coursera called DAT101x. The course was an introduction to the fundamental concepts in Data Science, with a brief preview of some ideas in Machine Learning such as linear regression.
In addition to teaching Data Science Fundamentals, the course also introduced various ways to do Data Science in Microsoft products such as Excel or Azure Data Science Studio.
I found it a great course for both reasons. If you are interested in getting a fundamental understanding of Data Science, I recommend it to you.
This blog post includes the summary notes I took for the entire course. Enjoy!
Module 1: An Introduction to Data
Intro: Exploring Data
Why do you like being a data scientist?

The world’s data is set to double every couple of years

Working with terabytes of data and making interpretations

Exciting time to be a data scientist / Working with people in different fields
What does a data scientist do?

Cleaning data (good enough to be trustworthy) — biggest challenge

Understanding and prepping the data

Seeing what kind of data formulations

Parsing the data
What kind of person succeeds?

You need to like data (dream about it!)

Curiosity

Collaborative and a problem solver (working with various individuals)

Quick Learner

New methodologies coming up every day
What skills does a data scientist need?

Handle on the statistics

Communicate your findings to people

Programming Languages (R, Python)

Transform (using R, Azure etc)
Aspiring Data Scientist?

Take some courses

Look at available datasets (data.gov)

Create models, write blogs (Machine Learning, get noticed by big companies)

Work on projects that you are excited about (with people that inspire you, and seek to solve problems)
“Your Experience and your work is your currency”
1.1 Getting Started with Data
“It is important to understand the fields of data and the kinds of information they hold”
Fields of Data

Temporal: Relating to time (can do calculations but are not the same as numeric calculations)

Text: They are strings and are usually represented as categorical data (ex. Sunday, Monday, Tuesday etc.)

Numeric Data: Data that can be represented by a number

Continuous Data: Can be put on a scale and compared with highest and lowest

Discrete Data: Usually counting numbers
Sorting and Filtering Data

Need to be aware of outliers (and filter them if needed)

Cut off data at chosen points to get a better comparison
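A minimal sketch of sorting and filtering in Python, to go with the notes above. The temperature values and the `CUTOFF` threshold are made up for illustration and are not from the course data.

```python
# Made-up temperature readings; 212.0 plays the role of an outlier.
temperatures = [61.2, 58.9, 212.0, 64.4, 59.7]

# Sorting makes the lowest and highest values easy to spot.
sorted_temps = sorted(temperatures)
print("lowest:", sorted_temps[0], "highest:", sorted_temps[-1])

# Filtering with a cutoff keeps the outlier from distorting comparisons.
CUTOFF = 120.0  # hypothetical threshold chosen for this example
filtered = [t for t in temperatures if t <= CUTOFF]
print("after filtering:", filtered)
```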
Derived Data
 Making new fields based on fields that already exist (ex. average, sum etc)
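As a rough illustration of a derived field, the sketch below builds a `revenue` field from `price` and `sales`. The rows are invented and the field names are my assumption about how such a table could look.

```python
# Invented rows; the field names are assumptions for this example.
rows = [
    {"day": "Monday", "price": 0.5, "sales": 10},
    {"day": "Tuesday", "price": 0.5, "sales": 14},
    {"day": "Wednesday", "price": 0.3, "sales": 20},
]

# Derive a new field from two fields that already exist.
for row in rows:
    row["revenue"] = row["price"] * row["sales"]

# Derived summary values such as a sum or an average work the same way.
total_revenue = sum(row["revenue"] for row in rows)
average_sales = sum(row["sales"] for row in rows) / len(rows)
print(total_revenue, average_sales)
```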
Highlighting

Using Color intensity to visualize the value (heatmap data)

Relative size using data bars
Lab 1
Weekday with the lowest temperature: Date 12/5/2017, Temperature 22, Rainfall 1.82, Flyers 11, Price 0.3, Sales 10
Number of flyers distributed in the month of Jan: 727
Module 2: Data Analysis Fundamentals
2.1 Aggregating Data

First step to perform when you summarize the data

Categorical Data: Distinct Count

Sum: Common Aggregation of Interest (Numerical Data)
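The two aggregations above (distinct count for categorical data, sum for numeric data) could be sketched in Python like this. The records are made up, and grouping the sum by day is my own addition to show a typical summary.

```python
from collections import defaultdict

# Made-up (weekday, sales) records.
records = [
    ("Monday", 10),
    ("Tuesday", 14),
    ("Monday", 12),
    ("Wednesday", 20),
    ("Tuesday", 9),
]

# Distinct count: how many different categories appear.
distinct_days = len({day for day, _ in records})

# Sum: the common aggregation for a numeric field.
total_sales = sum(sales for _, sales in records)

# Sum per category, similar to what a pivot table would show.
sales_by_day = defaultdict(int)
for day, sales in records:
    sales_by_day[day] += sales

print(distinct_days, total_sales, dict(sales_by_day))
```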
Visual Representation

Line Graph: Changes in Values over time

Pie Chart: Shows the percentage (Data Scientists don’t use it often)

Bar Chart: Multiple Categories (Categorical Variables), you can add multiple series

Scatterplot: Comparing two numeric Values
Lab
Total Sum of Revenue for August: 470.5
Hottest Saturday in July: 102.9
Lowest Number of Flyers distributed in a day during November: 26
Most flyers distributed: Wednesday
Least Flyers distributed: Saturday
Temperature vs. Rainfall: As rainfall increases, log temperature decreases
Date vs. temperature and rainfall: As the date approaches midyear, temperature increases and rainfall decreases
Module 3: Getting Started with Statistics
Measuring Central Tendency

Understand the common descriptive statistics

Mean: sum / n

Median: The middle term

Mode: The most frequently occurring value
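A quick way to check these three statistics is Python's built-in `statistics` module. The values below are made up; the single large value (70) is there to show how the mean gets pulled away from the median.

```python
import statistics

# Made-up values with one large value at the high end.
values = [22, 25, 25, 28, 30, 31, 70]

mean = statistics.mean(values)      # sum / n
median = statistics.median(values)  # the middle term of the sorted values
mode = statistics.mode(values)      # the most frequently occurring value

print(mean, median, mode)
```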
Visualizing Data

Histogram: Portrays the data in frequency

Box and whiskers chart
Measuring Variance

Variance: Sum of the squared differences between each value and the mean, divided by the number of observations

VAR.P(range): variance of a population

VAR.S(range): variance of a sample

Standard Deviation: Square root of the variance

STDEV.S(range): standard deviation of a sample (the higher the standard deviation, the more variance)

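Within 1 standard deviation of the mean: 68.27%

Within 2 standard deviations of the mean: 95.45%

Within 3 standard deviations of the mean: 99.73%
These percentages apply to a normal distribution. To use them, you need to know the mean and the standard deviation.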
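The sketch below mirrors the population and sample variance functions with Python's `statistics` module, and checks the percentages for a normal distribution. The values are made up, and `share_within` is a helper name I introduced for this example.

```python
import statistics
from statistics import NormalDist

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data with mean 5

# Population variance divides by n; sample variance divides by n - 1.
population_variance = statistics.pvariance(values)
sample_variance = statistics.variance(values)

# Standard deviation is the square root of the variance.
population_sd = statistics.pstdev(values)
sample_sd = statistics.stdev(values)

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 2))
```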
Skewed Distributions

Where most of the data is centered

Mean is higher or lower than median in the direction of the skewness
Working with Samples

Generally work with samples of data vs. population

How to make sure the samples are representative of the population?

Take the largest sample possible

Take a lot of random samples (by the Central Limit Theorem, the distribution of the sample means approaches a normal distribution, and the mean of the sample means approaches the mean of the population)
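Here is a small simulation of the idea, under my own assumptions: a skewed made-up population, 1,000 random samples of 50 values each, and a fixed seed so the run is repeatable. It only checks that the mean of the sample means lands close to the population mean.

```python
import random
from statistics import fmean

random.seed(0)  # fixed seed so the sketch is repeatable

# A deliberately skewed (exponential) made-up population.
population = [random.expovariate(1.0) for _ in range(10_000)]
population_mean = fmean(population)

SAMPLE_SIZE = 50     # assumed sample size for this example
NUM_SAMPLES = 1_000  # assumed number of random samples

# Take many random samples and record the mean of each one.
sample_means = [
    fmean(random.sample(population, SAMPLE_SIZE)) for _ in range(NUM_SAMPLES)
]
mean_of_sample_means = fmean(sample_means)

print(round(population_mean, 3), round(mean_of_sample_means, 3))
```

Plotting `sample_means` as a histogram should show a roughly bell-shaped curve even though the population itself is skewed.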
Correlation

Quantifying the relationship between two numeric fields

CORREL(range_x, range_y)

+1: positive correlation; 0: no correlation; -1: negative correlation
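For reference, this is a minimal sketch of the Pearson correlation coefficient, which is the kind of value a correlation function returns. The `pearson` helper and the rainfall and sales numbers are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)

# Made-up data: sales fall as rainfall rises.
rainfall = [0.5, 0.8, 1.0, 1.5, 2.0]
sales = [30, 26, 24, 18, 12]

r = pearson(rainfall, sales)
print(round(r, 4))  # close to -1, a strong negative correlation
```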
Hypothesis Testing

Null Hypothesis (H0): Mean for hot sales is not greater than mean sales for population

Alternative Hypothesis (H1): Mean for hot sales > mean sales for population

Significance Level: 0.05

z-test: used if we know the standard deviation of the population

t-test: returns the p-value

If the p-value is smaller than the significance level, reject the null hypothesis

The p-value represents the probability of getting a result at least as extreme as the observed mean purely by random chance, assuming the null hypothesis is true
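To make the decision rule concrete, here is a sketch of a one-tailed z-test in Python. The function name, the hot-day sales figures, and the population mean and standard deviation are all assumptions made up for this example.

```python
import math
from statistics import NormalDist

def one_tailed_z_test(sample, population_mean, population_sd):
    """Return the p-value for H1: the sample mean is greater than the population mean."""
    n = len(sample)
    sample_mean = sum(sample) / n
    standard_error = population_sd / math.sqrt(n)
    z = (sample_mean - population_mean) / standard_error
    # Probability of a z-score at least this large if the null hypothesis holds.
    return 1.0 - NormalDist().cdf(z)

SIGNIFICANCE_LEVEL = 0.05

# Made-up sales on hot days, compared with an assumed population.
hot_day_sales = [31, 29, 35, 33, 30, 34, 32, 36]
p_value = one_tailed_z_test(hot_day_sales, population_mean=25, population_sd=7)

reject_null = p_value < SIGNIFICANCE_LEVEL
print(round(p_value, 4), reject_null)
```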
Mean of the temperature sampling dist: 60.39628
Correlation between rainfall and sales: -0.90921 (indicates that there is an inverse correlation)
Cannot reject the null hypothesis as p-value 0.935 > 0.05
Module 4: Machine Learning Basics

Need a way to train a computer to perform calculations

x → values that we have

y → values that we are trying to predict
Types of Machine Learning
Supervised

Start with observations with known values (labels)

Use a machine learning algorithm to fit a function that predicts the label

We compare the value predicted to the actual value that we had in the first place

Add unknown values to make predictions about new values
Unsupervised

Don’t have a known label in the training data set

Each new observation is assigned to a cluster of observations with similar characteristics
Regression

RMSE (Root Mean Squared Error): Square root of the average squared error; comparable to the standard deviation of the model errors

MAE (Mean Absolute Error): Average of the absolute values of all the model errors

Relative Absolute Error: Total absolute error relative to the total absolute deviation of the label from its mean (0 is perfect; used to evaluate performance)

Relative Squared Error: Sum of the squared errors divided by the sum of the squared deviations of the label from its mean

R-squared: Coefficient of determination (represents the predictive power of the model, typically as a value between 0 and 1); 1 is perfect
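The metrics above can be computed by hand, as in this sketch. The formulas follow the common textbook definitions, which I am assuming match what the course tooling reports. The `regression_metrics` name and the data are made up.

```python
import math

def regression_metrics(actual, predicted):
    """Common regression metrics from lists of actual and predicted labels."""
    n = len(actual)
    mean_actual = sum(actual) / n
    errors = [a - p for a, p in zip(actual, predicted)]

    sum_abs_error = sum(abs(e) for e in errors)
    sum_sq_error = sum(e ** 2 for e in errors)
    # Deviations of the label from its own mean (the "predict the mean" baseline).
    sum_abs_dev = sum(abs(a - mean_actual) for a in actual)
    sum_sq_dev = sum((a - mean_actual) ** 2 for a in actual)

    return {
        "mae": sum_abs_error / n,
        "rmse": math.sqrt(sum_sq_error / n),
        "rae": sum_abs_error / sum_abs_dev,
        "rse": sum_sq_error / sum_sq_dev,
        "r2": 1 - sum_sq_error / sum_sq_dev,
    }

# Made-up labels and predictions.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]

metrics = regression_metrics(actual, predicted)
print(metrics)
```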
Classification

Which class or category something belongs to

Binary classification (T or F)

False Positives / False Negatives: When the model predicts 0 or 1 but the actual label is the opposite
Confusion Matrix: Calculating performance metrics for a classifier

Accuracy alone is not always a useful measure of a model’s performance (for example, when one class is much more common than the other)

Precision: The fraction of cases that are classified as positive that are actually positive.

Recall (true positive rate): The number of true positives divided by the combined number of true positives and false negatives
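The sketch below counts the four cells of a confusion matrix and derives accuracy, precision, and recall from them. The labels are made up and deliberately imbalanced, so accuracy looks good even though half the positives are missed. The function name is hypothetical.

```python
def classification_metrics(actual, predicted):
    """Confusion-matrix counts and metrics for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a denominator is empty.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Made-up, imbalanced labels: only 2 of 10 cases are positive.
actual = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

results = classification_metrics(actual, predicted)
print(results)
```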
Receiver Operating Characteristic (ROC):
 The area under the curve (AUC) is an indication of how well the model predicts (it ranges from 0 to 1; 0.5 is guessing, 1 is perfect)
Clustering (Unsupervised)

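K-means clustering

Plot the data in an n-dimensional space (numeric values)

Decide how many clusters (k) you want to create

Place k centroids at random points and assign each observation to its nearest centroid

Reposition each centroid at the mean of the observations assigned to it

Reallocate each observation to its nearest centroid, and repeat until the clusters stop changing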
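The steps above can be sketched in plain Python as follows. This is a simplified version under my own assumptions: the starting centroids are picked from the data points, the `k_means` name and its parameters are hypothetical, and the 2-D points are made up.

```python
import math
import random

def k_means(points, k, max_iterations=100, seed=0):
    """Simplified k-means; returns (centroids, cluster index for each point)."""
    rng = random.Random(seed)
    # Start with k centroids chosen at random from the data points.
    centroids = rng.sample(points, k)
    assignments = None

    for _ in range(max_iterations):
        # Allocate each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clusters are stable
        assignments = new_assignments

        # Reposition each centroid at the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))

    return centroids, assignments

# Made-up 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

Real libraries use smarter starting positions and handle empty clusters more carefully, so treat this only as an illustration of the loop.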