Microsoft's Introduction to Data Science

In the summer of 2018, I took a short module from Microsoft on Coursera called DAT101x. The course was an introduction to the fundamental concepts in Data Science, with a brief preview of some ideas in Machine Learning, such as linear regression.

In addition to teaching Data Science Fundamentals, the course also introduced various ways to do Data Science in Microsoft products such as Excel or Azure Data Science Studio.

I found it a great course for both reasons. If you are interested in getting a fundamental understanding of Data Science, I recommend it to you.

This blog post includes the summary notes I took for the entire course. Enjoy!

Module 1: An Introduction to Data

Intro: Exploring Data

Why do you like being a data scientist?

  • The world’s data is set to double every couple of years

  • Working with terabytes of data and making interpretations

  • Exciting time to be a data scientist / Working with people in different fields

What does a data scientist do?

  • Cleaning data (good enough to be trustworthy) — biggest challenge

  • Understanding and prepping the data

  • Seeing what kind of data formulations

  • Parsing the data

What kind of person succeeds?

  • You need to like data (dream about it!)

  • Curiosity

  • Collaborative and a problem solver (working with various individuals)

  • Quick Learner

  • New methodologies coming up every day

What skills does a data scientist need?

  • Handle on the statistics

  • Communicate your findings to people

  • Programming Languages (R, Python)

  • Transform (using R, Azure etc)

Aspiring Data Scientist?

  • Take some courses

  • Look at available data sets (data.gov)

  • Create models, write blogs (Machine Learning, get noticed by big companies)

  • Work on projects that you are excited about (with people that inspire you, and seek to solve problems)

“Your Experience and your work is your currency”

1.1 Getting Started with Data

“It is important to understand the fields of data and the kinds of information they hold”

Fields of Data

  • Temporal: Relating to time (you can do calculations on it, but they are not the same as numeric calculations)

  • Text: They are strings and are usually represented as categorical data (ex. Sunday, Monday, Tuesday etc.)

  • Numeric Data: Data that can be represented by a number

  • Continuous Data: Can be put on a scale and compared with highest and lowest

  • Discrete Data: Usually counting numbers

Sorting and Filtering Data

  • Need to be aware of outliers (and filter them if needed)

  • Cut off data at chosen points to get a better comparison

Derived Data

  • Making new fields based on fields that already exist (ex. average, sum etc)

Highlighting

  • Using Color intensity to visualize the value (heatmap data)

  • Relative size using data bars

Lab 1

Weekday with the lowest temperature: 12/5/2017 (Temperature 22, Rainfall 1.82, Flyers 11, Price 0.3, Sales 10)

Number of flyers distributed in the month of Jan: 727

Module 2: Data Analysis Fundamentals

2.1 Aggregating Data

  • First step to perform when you summarize the data

  • Categorical Data: Distinct Count

  • Sum: Common Aggregation of Interest (Numerical Data)
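
The course did these aggregations in Excel. As a rough equivalent, here is a minimal Python sketch of a distinct count and a grouped sum. The rows and field names are made up for illustration and are not the course's data set.

```python
from collections import defaultdict

# Hypothetical rows: (day of week, flyers distributed, revenue)
rows = [
    ("Monday", 30, 10.5),
    ("Tuesday", 25, 8.0),
    ("Monday", 40, 12.0),
    ("Saturday", 20, 6.5),
]

# Distinct count: the usual aggregation for a categorical field
distinct_days = len({day for day, _, _ in rows})

# Sum of a numeric field, grouped by a categorical field
flyers_by_day = defaultdict(int)
for day, flyers, _ in rows:
    flyers_by_day[day] += flyers

# Plain sum over a numeric field
total_revenue = sum(revenue for _, _, revenue in rows)

print(distinct_days)
print(dict(flyers_by_day))
print(total_revenue)
```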

Visual Representation

  • Line Graph: Changes in Values over time

  • Pie Chart: Shows the percentage (Data Scientists don’t use it often)

  • Bar Chart: Multiple Categories (Categorical Variables), you can add multiple series

  • Scatterplot: Comparing two numeric Values

Lab

Total Sum of Revenue for August: 470.5

Hottest Saturday in July: 102.9

Lowest Number of Flyers distributed in a day during November: 26

Most flyers distributed: Wednesday

Least Flyers distributed: Saturday

Temperature vs. Rainfall: As rainfall increases, log temperature decreases

Date vs. temperature and rainfall: As the date approaches mid-year, temperature increases and rainfall decreases

Module 3: Getting Started with Statistics

Measuring Central Tendency

  • Understand the common descriptive statistics

  • Mean: sum / n

  • Median: The middle term

  • Mode: The most frequently occurring value
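
To make the three measures concrete, here is a small sketch using Python's standard `statistics` module. The temperature values are invented for the example.

```python
import statistics

# Hypothetical daily temperatures
temperatures = [22, 25, 25, 28, 30, 31, 25, 40]

mean = statistics.mean(temperatures)      # sum of the values / number of values
median = statistics.median(temperatures)  # middle of the sorted values
mode = statistics.mode(temperatures)      # most frequently occurring value

print(mean, median, mode)
```

With an even number of values, the median is the average of the two middle terms.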

Visualizing Data

  • Histogram: Shows how frequently values fall into each range (bin)

  • Box and whiskers chart

Measuring Variance

  • Variance: The sum of the squared differences between each value and the mean, divided by the number of observations

  • var.p(range): variance of a population

  • var.s(range): variance of a sample

  • Standard Deviation: Square root of the variance

  • Stdev.s - standard deviation of a sample (The higher the standard deviation, the more variance)

  • 1 standard deviation: 68.27% (of normally distributed values fall within this distance of the mean)

  • 2 standard deviations: 95.45%

  • 3 standard deviations: 99.73%

Need to know the mean and the standard deviation.
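
Here is a short sketch of the same calculations in Python's standard `statistics` module, which I believe line up with Excel's VAR.P, VAR.S and STDEV.S. The sales numbers are made up. The last part checks the standard deviation percentages against a standard normal distribution.

```python
import statistics
from statistics import NormalDist

# Hypothetical daily sales
sales = [10, 12, 15, 18, 20]

pop_var = statistics.pvariance(sales)    # divide by n (population)
sample_var = statistics.variance(sales)  # divide by n - 1 (sample)
sample_std = statistics.stdev(sales)     # square root of the sample variance

# Share of a normal distribution within k standard deviations of the mean
standard_normal = NormalDist(mu=0, sigma=1)
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

print(pop_var, sample_var, sample_std)
print(coverage)
```

The sample variance is always a little larger than the population variance on the same data, because it divides by n - 1 instead of n.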

Skewed Distributions

  • Where most of the data is centered

  • The mean is pulled away from the median in the direction of the skew (higher for a right skew, lower for a left skew)
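
A tiny illustration of how a skew moves the mean relative to the median. The numbers are invented, and the left-skewed list is just the mirror image of the right-skewed one.

```python
import statistics

# One large value creates a long tail to the right
right_skewed = [1, 2, 2, 3, 3, 3, 4, 50]
# Mirror image: a long tail to the left
left_skewed = [50 - value for value in right_skewed]

right_mean = statistics.mean(right_skewed)
right_median = statistics.median(right_skewed)
left_mean = statistics.mean(left_skewed)
left_median = statistics.median(left_skewed)

print(right_mean, right_median)  # mean is above the median
print(left_mean, left_median)    # mean is below the median
```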

Working with Samples

  • Generally work with samples of data vs. population

  • How to make sure the samples are representative of the population?

  • Take the largest sample possible

  • Take a lot of random samples (by the Central Limit Theorem, the means of those samples will be approximately normally distributed, and the mean of the sample means will approach the mean of the population)
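
The sampling idea can be simulated in a few lines. This is only a sketch: the population is randomly generated (and deliberately skewed), and the sample size of 30 and the 1,000 repetitions are arbitrary choices of mine.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical, deliberately skewed population with a mean near 50
population = [random.expovariate(1 / 50) for _ in range(10_000)]

# Take many random samples and record the mean of each one
sample_means = [
    statistics.mean(random.sample(population, 30))
    for _ in range(1_000)
]

pop_mean = statistics.mean(population)
mean_of_means = statistics.mean(sample_means)

print(pop_mean, mean_of_means)
print(statistics.stdev(population), statistics.stdev(sample_means))
```

Even though the population is skewed, the sample means cluster tightly around the population mean, with a much smaller spread than the raw values.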

Correlation

  • Quantifying relation?

  • CORREL(range_x, range_y)

  • +1: perfect positive correlation; 0: no correlation; -1: perfect negative correlation
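
For reference, here is a minimal sketch of the Pearson correlation coefficient, which is what I understand Excel's correlation function to compute. The rainfall and sales values are hypothetical, not the lab data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# Hypothetical data: more rain, fewer sales
rainfall = [0.5, 0.8, 1.0, 1.3, 2.0]
sales = [40, 32, 30, 25, 12]

r = pearson(rainfall, sales)
print(r)  # close to -1, a strong inverse correlation
```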

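Hypothesis Testing

  • Null Hypothesis (H0): Mean sales on hot days are no higher than the mean sales for the population

  • Alternative Hypothesis (H1): Mean sales on hot days > mean sales for the population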

  • Significance Level: 0.05

  • z-test: if we know the standard deviation of the population

  • t-test: returns the p-value

  • if p-value is smaller than significance level -> reject null hypothesis

  • p-value represents the probability of getting a result at least as extreme as the observed mean purely by random chance, assuming the null hypothesis is true

Mean of the temperature sampling dist: 60.39628

Correlation between rainfall and sales: -0.90921 (indicates that there is a strong inverse correlation)

Cannot reject the null hypothesis as p-value 0.935 > 0.05
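
The decision rule can be sketched in Python. This example uses a one-sided z-test, which assumes the population standard deviation is known. The function name and all the numbers are my own illustration and are not taken from the lab.

```python
import math
from statistics import NormalDist

def one_sided_z_test(sample_mean, pop_mean, pop_std, n):
    """Test H1: the sample mean is greater than the population mean."""
    standard_error = pop_std / math.sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    # Probability of a z value at least this large if the null hypothesis holds
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# Hypothetical numbers: 20 hot days averaging 30 sales,
# against a population mean of 25 with standard deviation 7
z, p = one_sided_z_test(sample_mean=30.0, pop_mean=25.0, pop_std=7.0, n=20)

alpha = 0.05  # significance level
reject_null = p < alpha

print(z, p, reject_null)
```

When the p-value is above the significance level, as with the 0.935 from my lab, the same comparison comes out `False` and the null hypothesis stays.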

Module 4: Machine Learning Basics

  • Need a way to train a computer to perform calculations

  • x → values that we have

  • y → values that we are trying to predict

Types of Machine Learning

Supervised

  • Start with observations with known values (labels)

  • Use a machine learning algorithm to fit a function that predicts the label

  • We compare the value predicted to the actual value that we had in the first place

  • Apply the trained model to new observations with unknown labels to make predictions
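
As a small illustration of the supervised workflow, here is a sketch that fits a straight line with ordinary least squares, checks it against held-back labels, and then predicts for a new value. The course used Azure for this; the data and the `fit_line` helper here are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Observations with known labels, split into training and test sets
train_x, train_y = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
test_x, test_y = [5, 6], [10.1, 11.8]

# Fit on the training data
slope, intercept = fit_line(train_x, train_y)

# Compare predicted values with the actual labels we held back
predictions = [slope * x + intercept for x in test_x]
errors = [actual - predicted for actual, predicted in zip(test_y, predictions)]

# Predict the label for a new observation
new_prediction = slope * 7 + intercept

print(slope, intercept)
print(errors)
print(new_prediction)
```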

Unsupervised

  • Don’t have a known label in the training data set

  • Each new observation is assigned to a cluster of observations with similar characteristics

Regression

  • RMSE (Root Mean Squared Error): Measure of the standard deviation of the prediction errors

  • MAE (Mean Absolute Error): Average of the absolute values of all the model errors

  • Relative Absolute Error: Total absolute error relative to the total absolute deviation of the label from its mean (typically between 0 and 1, lower is better; useful for comparing models)

  • Relative Squared Error: Sum of the squared errors divided by the sum of the squared differences between the label and its mean

  • R-squared: Coefficient of determination (represents the predictive power of the model as a value between 0 and 1, where 1 is perfect)
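
To see how these metrics relate to each other, here is a sketch that computes them from a list of actual and predicted labels using the standard textbook definitions. The function name and the numbers are made up.

```python
import math

def regression_metrics(actual, predicted):
    """Common regression metrics, using their standard definitions."""
    n = len(actual)
    mean_actual = sum(actual) / n
    errors = [a - p for a, p in zip(actual, predicted)]

    squared_error = sum(e * e for e in errors)
    absolute_error = sum(abs(e) for e in errors)
    # Baseline: error of always predicting the mean of the label
    baseline_squared = sum((a - mean_actual) ** 2 for a in actual)
    baseline_absolute = sum(abs(a - mean_actual) for a in actual)

    return {
        "rmse": math.sqrt(squared_error / n),
        "mae": absolute_error / n,
        "rae": absolute_error / baseline_absolute,
        "rse": squared_error / baseline_squared,
        "r2": 1 - squared_error / baseline_squared,
    }

# Hypothetical labels and predictions
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]

metrics = regression_metrics(actual, predicted)
print(metrics)
```

Note that with these definitions R-squared is simply 1 minus the relative squared error.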

Classification

  • Which class or category something belongs to

  • Binary classification (T or F)

  • False Positives / False Negatives: When the model predicts 1 or 0 but the actual label is the opposite

Confusion Matrix: Calculating performance metrics for a classifier

  • Accuracy alone can be a misleading measure of a model’s performance (for example, when one class is much more common than the other)

  • Precision: The fraction of cases that are classified as positive that are actually positive.

  • Recall (true positive rate): The number of true positives divided by the combined number of true positives and false negatives
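
Here is a minimal sketch of how the confusion matrix counts turn into these metrics. The labels and predictions are hypothetical, with 1 as the positive class.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

# Hypothetical actual labels and model predictions
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # of the predicted positives, how many are real
recall = tp / (tp + fn)     # of the real positives, how many were found

print(tp, fp, fn, tn)
print(accuracy, precision, recall)
```

In this example the accuracy looks decent at 0.7, yet the model only finds half of the actual positives.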

Receiver Operating Characteristic (ROC):

  • Area under the curve (AUC) is an indication of how well the model predicts (ranges from 0 to 1; 0.5 is guessing, 1 is perfect)
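
One way to get a feel for AUC: it equals the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative case. The sketch below computes it that way by comparing every positive/negative pair. The scores are invented, and the pairwise approach is only practical for small data.

```python
def auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

model_auc = auc(labels, scores)
print(model_auc)
```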

Clustering (Unsupervised)

  • K-means clustering

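  • Plot the data in an n-dimensional space (numeric values)

  • Decide how many clusters you want to create

  • Place the centroids at random starting positions

  • Reposition each centroid at the mean of the points closest to it

  • Reallocate each point to its nearest centroid, and repeat until the clusters stop changing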
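
The steps can be sketched in plain Python. This is a simplified version under my own assumptions: it starts the centroids on randomly chosen data points, runs a fixed number of iterations instead of checking for convergence, and uses made-up 2D points.

```python
import random

def k_means(points, k, iterations=10, seed=0):
    """Simplified k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Start with k randomly chosen data points as centroids
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        # Allocate each point to its nearest centroid (squared distance)
        clusters = [[] for _ in range(k)]
        for point in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Reposition each centroid at the mean of its assigned points
        for i, cluster in enumerate(clusters):
            if cluster:  # leave an empty cluster's centroid where it is
                centroids[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return centroids, clusters

# Hypothetical data with two obvious groups
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]

centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

Because the starting centroids are random, a different seed can label the clusters in a different order, and on messier data it can settle on different clusters altogether.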

© Swopnil Shrestha 2022