TensorFlow: Introduction to Machine Learning


The Introduction to Machine Learning Crash Course from Google, which teaches machine learning fundamentals using TensorFlow, is a popular and brilliant way to begin machine learning. I took this self-paced course over a few weeks in 2018 and took some notes as I went.

This blog post contains my summary notes for that course. In addition to taking notes, I also put together a playlist of all the videos from the course, for those who prefer to learn solely from YouTube.

You can access the playlist here.

Here is the first video from the course, which is a trailer for what the course is about. Enjoy!

1.0 Supervised ML / Labels / Features

  • What is supervised ML?
    • Creating models that combine inputs to produce useful predictions
    • The predictions should hold up even on never-before-seen data
  • Label
    • We provide the model with the label (spam or not spam)
    • It is what we are predicting (the y variable in linear regression)
  • Features
    • Input variables (the x variables in linear regression)
    • Like the columns of a table (email address, words in the email, time it was sent)
  • What are labeled and unlabeled examples?
    • A labeled example contains both the features and the label (spam or not spam)
    • An unlabeled example contains the features but not the label
    • Unlabeled examples are new examples that humans haven’t labeled; the model predicts their labels
  • Models
    • A model defines the relationship between the features and the label
    • Training (learning) means showing the model labeled examples so that it learns the relationship between the features and the label
    • Inference is applying the trained model to unlabeled examples (making predictions)
  • Regression vs. Classification
    • Regression: predicts continuous values (more likely to work with numbers)
    • Classification: predicts discrete values (sorts into categories)

2.0 Descending into ML: Linear Regression

  • Linear regression is like finding the line that best fits the data:
  • $y' = b + w_1 x_1$, where:
    • $y'$ - predicted label
    • $b$ - bias (y-intercept)
    • $w_1$ - weight of feature 1 (the slope of the line)
    • $x_1$ - feature (input)

2.1 Training and Loss

  • Empirical Risk Minimization is the process a machine learning model uses for training: it examines many examples and attempts to find a model that minimizes loss
  • Loss is a number indicating how bad the model’s prediction was on a single example
  • The squared loss is a popular loss function for linear regression: $(\text{observation} - \text{prediction}(x))^2$
  • Mean Squared Error (MSE) - the sum of the squared losses of all examples in the dataset, divided by the number of examples
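
A minimal sketch of the prediction and loss formulas above, in plain Python. The function names and the toy dataset are assumptions made for illustration; they do not come from the course.

```python
def predict(x1, w1, b):
    """Linear model: y' = b + w1 * x1."""
    return b + w1 * x1


def squared_loss(observation, prediction):
    """Loss on a single example."""
    return (observation - prediction) ** 2


def mean_squared_error(xs, ys, w1, b):
    """Average of the squared losses over the whole dataset."""
    losses = [squared_loss(y, predict(x, w1, b)) for x, y in zip(xs, ys)]
    return sum(losses) / len(losses)


# Toy data generated by y = 1 + 2x (illustrative only).
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

perfect_fit = mean_squared_error(xs, ys, w1=2.0, b=1.0)
biased_fit = mean_squared_error(xs, ys, w1=2.0, b=0.0)
print(perfect_fit, biased_fit)
```

With w1 = 2 and b = 1 the MSE is 0.0. With b = 0 every prediction is off by exactly 1, so the MSE is 1.0.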

3.0 Reducing Loss: Iterative Approach

  • Like the “hot and cold” kids’ game for finding a hidden object
  • The iterative process:
    • Data comes in
    • Make a prediction
    • Calculate the squared loss (the derivative of the squared loss is easy to compute)
    • Compute the gradient of the loss function on the data
    • The gradient is negative or positive, which tells us in which direction to update the model parameters to get a new version of the model
    • Repeat
  • How fast we move depends on how big a step we decide to take
  • Theory (math) tells us to compute the gradient over all examples in our dataset
  • Empirically - computing the gradient of the loss function over a single example also works (stochastic gradient descent)
  • Practically - compute the gradient over a small batch of examples (mini-batch gradient descent)

3.1 Gradient Descent

  • Calculating the loss for every possible value of w would be an inefficient way to find the minimum, even for a convex problem
  • We start gradient descent by picking a starting value for w (a random value or 0, it doesn’t matter too much)
  • The gradient always points in the direction of the steepest increase in the loss function. The algorithm then takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible.
  • To determine the next point, gradient descent moves from the starting point by a fraction of the gradient’s magnitude, in the direction of the negative gradient
  • Gradient descent then repeats this process, getting closer to the minimum
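
The loop described above can be sketched as follows for a one-feature linear model trained with mean squared error. This is an illustrative sketch, not the course's code: the function name, the default learning rate, the step count and the toy data are all assumptions.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Full-batch gradient descent for y' = b + w * x with MSE loss."""
    w, b = 0.0, 0.0  # starting values; 0 works fine for this convex problem
    n = len(xs)
    for _ in range(steps):
        # Prediction error on every example in the dataset.
        errors = [(b + w * x) - y for x, y in zip(xs, ys)]
        # Gradient of the MSE with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        # Step in the direction of the negative gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated by y = 1 + 2x (illustrative only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))
```

On this noise-free data the parameters approach w = 2 and b = 1.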

3.2 Learning Rate

  • Gradient descent multiplies the gradient by a scalar known as the learning rate (step size) to determine the next point
  • Hyperparameters - values that are set before training takes place (the knobs programmers tweak); the learning rate is a hyperparameter
  • The ideal learning rate is related to how flat the loss function is: if the gradient is small, then we can use a larger learning rate (small gradient / larger step size)
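
To see why the learning rate matters, here is a small sketch that minimizes the toy loss loss(w) = w**2, whose gradient is 2 * w. The loss function, the function name and the three learning rates are assumptions picked for illustration.

```python
def descend(learning_rate, steps=20, w=1.0):
    """Run gradient descent on loss(w) = w**2 and return the final w."""
    for _ in range(steps):
        gradient = 2.0 * w  # derivative of w**2
        w -= learning_rate * gradient
    return w


too_small = descend(0.01)   # creeps slowly toward the minimum at 0
just_right = descend(0.5)   # lands on the minimum for this particular loss
too_large = descend(1.1)    # overshoots further on every step and diverges
print(too_small, just_right, too_large)
```

A rate that is too small wastes steps, and a rate that is too large bounces past the minimum with growing amplitude. The "just right" value here is specific to this toy loss.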

3.3 Stochastic Gradient Descent

  • A batch is the set of examples used to calculate the gradient in a single iteration; the batch size is the number of examples in it
  • A larger batch is more likely to contain redundant data, and the redundancy grows with the batch size
  • SGD uses only 1 example per iteration, chosen at random (very noisy, but works); “stochastic” means random
  • Mini-batch SGD is a middle ground: each batch typically takes between 10 and 1,000 examples
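
A hedged sketch of mini-batch SGD for the same one-feature linear model. Setting batch_size=1 gives plain SGD. The function name, the defaults, the fixed seed and the toy data are assumptions; a real batch would usually be much larger than the 4 examples used here.

```python
import random


def minibatch_sgd(xs, ys, batch_size=4, learning_rate=0.01, steps=5000, seed=0):
    """Mini-batch SGD for y' = b + w * x with squared loss."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(steps):
        # Pick a random batch of examples for this iteration.
        batch = rng.sample(indices, batch_size)
        grad_w = 0.0
        grad_b = 0.0
        for i in batch:
            error = (b + w * xs[i]) - ys[i]
            grad_w += 2.0 * error * xs[i] / batch_size
            grad_b += 2.0 * error / batch_size
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated by y = 1 + 2x (illustrative only).
xs = [float(i) for i in range(10)]
ys = [1.0 + 2.0 * x for x in xs]
w, b = minibatch_sgd(xs, ys)
print(round(w, 2), round(b, 2))
```

Because the toy data has no noise, the noisy per-batch updates still settle near w = 2 and b = 1.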

4.0 TensorFlow

  • Graph-based computational framework that has a lot of applications
  • Lower-level APIs - build models by defining a series of mathematical operations
  • Higher-level APIs (such as the Estimators API) - specify predefined architectures like linear regressors or neural networks
  • TensorFlow consists of two components:
    • A graph protocol buffer
    • A runtime that executes the graph
    • These are analogous to Python code and the Python interpreter: just as the interpreter runs on multiple hardware platforms, so does the TensorFlow runtime
  • We should use the highest level of abstraction that solves the problem

4.1.1 Steps

  • Define features and configure feature columns (these store a description of the feature data)
  • Define the target
  • Configure the LinearRegressor
  • Define the input function
    • Convert the pandas feature data into NumPy arrays
    • Use the TensorFlow Dataset API to construct a dataset object from the data, and break the data into batches of batch_size
  • Train the model
  • Evaluate the model

4.1.2 Hyperparameters

  • steps: the total number of training iterations. One step calculates the loss from one batch and uses that value to modify the model’s weights once.
  • batch size: the number of examples (chosen at random) for a single step. For example, the batch size for SGD is 1.
  • total number of trained examples = batch size × steps
  • periods: controls the granularity of reporting. If periods is 7 and steps is 70, the exercise will output the loss value every 10 steps (7 times).
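
The arithmetic behind these hyperparameters can be checked with a few lines of Python. The steps and periods values come from the example above; the batch size of 5 is a hypothetical value chosen only for this sketch.

```python
steps = 70        # total number of training iterations
batch_size = 5    # hypothetical value, chosen only for this example
periods = 7       # number of progress reports

# Total number of trained examples = batch size x steps.
examples_trained = batch_size * steps

# Loss is reported once per period, that is, every steps / periods steps.
steps_per_period = steps // periods
report_steps = [period * steps_per_period for period in range(1, periods + 1)]

print(examples_trained, steps_per_period, report_steps)
```

Here 350 examples are trained in total, and the loss is reported at steps 10, 20, and so on up to 70.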

5.0 Generalization

  • Generalization - how well the model makes predictions on new, previously unseen data
  • Overfitting - when the model fits the training examples very closely but does not fit new data
  • Occam’s Razor - a model should be as simple as possible
  • Test set methodology
    • Use one set of data as a training set
    • Use another set of data as a test set
  • Assumptions:
    • We draw examples independently and identically (i.i.d.) at random
    • The distribution is stationary (it does not change over time)
    • We always draw from the same distribution

6.0 Training and Test Sets

  • Divide a large dataset into two smaller sets, a training set and a test set (randomize before splitting)
  • Use cross-validation if the dataset is small
  • Do not train on your test data

7.0 Validation

  • Evaluating repeatedly on the same test set can lead to overfitting to that particular test set
  • Instead, iterate by training on the training data and validating on the validation data
  • Finally, test on the test data
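
A minimal sketch of the shuffle-then-split workflow from the last two sections. The function name, the 70/15/15 proportions and the seed are assumptions, not values from the course.

```python
import random


def split_dataset(examples, train_frac=0.7, validation_frac=0.15, seed=42):
    """Shuffle the examples, then cut them into train, validation and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # randomize before splitting
    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)
    train_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_validation]
    test_set = shuffled[n_train + n_validation:]  # whatever is left over
    return train_set, validation_set, test_set


train_set, validation_set, test_set = split_dataset(range(100))
print(len(train_set), len(validation_set), len(test_set))
```

Hyperparameters are tuned against the validation set, and the test set is touched only once at the end, which keeps it an honest estimate of performance on new data.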

8.0 Representation

  • Feature Engineering (the process of extracting features from raw data)
  • String values are converted to feature values by one-hot encoding
  • Raw data such as database records does not come to us in the form of feature vectors
  • Mapping Raw Data to Features:
    • Raw data: Data from an input source
    • Feature Vector: Floating point values comprising the examples in the dataset
    • Feature Engineering: Transforming raw data into a feature vector
  • Mapping Numeric Values
    • It is trivial to convert integers to features
  • Mapping Categorical Values
    • Strings are converted into numeric values
    • OOV bucket (out of vocabulary bucket) consists of the values that are not in the vocabulary of strings
    • One hot encoding extends to numeric data that you do not want to directly multiply by a weight (postal code)
  • Qualities of Good Features:
    • Should appear more than 5 or so times in a data set (ex. house_type)
    • Should have a clear and obvious meaning
    • Should not have noisy data
    • Do not mix “magic” values with actual data; convert them into two features (data_supplied, rating)
    • The definition of a feature should not change over time
  • Cleaning Data
    • Scaling Feature Values
      • Scaling is converting floating point feature values from their natural range (100 to 900) to (0 to 1)
      • Helps gradient descent converge more quickly
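      • Helps avoid the “NaN trap”, where one number in the model becomes NaN and, through further math operations, every other number eventually becomes NaN
      • Helps the model learn appropriate weights for each feature
        • Linearly map [min value, max value] to a small scale such as [-1, +1]
        • Or use Z-scores: scaled_value = (value - mean) / stddev
        • Scaling with Z-scores means that most scaled values will fall between -3 and +3, with a few values slightly outside that range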
    • Handling Extreme Outliers:
      • Log Scaling: Still leaves a tail
      • Clipping Features: all feature values beyond a chosen number are set to that number
    • Binning
      • For features that have no linear relationship with the label
      • Binning by quantile removes the need to worry about outliers
    • Scrubbing
      • Omitted Values: a person forgot to enter values
      • Duplicated Values: the same logs were uploaded twice
      • Bad Labels: a person labeled a picture of an oak tree as a maple
      • Bad feature values: an extra digit was typed, or a thermometer was left out in the sun
      • Use histograms, min and max, mean and median, and standard deviation to detect bad data
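
Several of the transformations in this section can be sketched in a few lines of plain Python. The helper names, the vocabulary and the sample values are assumptions for illustration. Placing the OOV bucket in the last slot is one possible layout, and the population standard deviation is used for the Z-score.

```python
from statistics import mean, pstdev


def one_hot(value, vocabulary):
    """One-hot encode a string; the last slot is the OOV bucket."""
    vector = [0.0] * (len(vocabulary) + 1)
    index = vocabulary.index(value) if value in vocabulary else len(vocabulary)
    vector[index] = 1.0
    return vector


def min_max_scale(values):
    """Linearly map [min value, max value] onto [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def z_score(values):
    """scaled_value = (value - mean) / stddev."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def clip(values, max_value):
    """Set every value above max_value to max_value."""
    return [min(v, max_value) for v in values]


streets = ["Main Street", "Oak Avenue", "Pine Road"]  # hypothetical vocabulary
print(one_hot("Oak Avenue", streets))    # known value
print(one_hot("Elm Lane", streets))      # unknown value goes to the OOV bucket
print(min_max_scale([100.0, 500.0, 900.0]))
print(z_score([100.0, 500.0, 900.0]))
print(clip([1.0, 5.0, 50.0], max_value=10.0))
```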

9.0 Feature Crosses

  • A linear problem is one where you can fit a line to separate spam from not spam (depending on the input variables)
  • For problems that a line cannot separate, define a synthetic feature by crossing (multiplying) two or more input features
  • Feature Crosses:
    • Bedrooms x
    • Linear learners scale to large datasets
  • Crossing One-Hot Vectors:
    • Linear learners scale well to massive data
  • FTRL Optimization Algorithm
    • Scales the learning rate differently for different coefficients (can be useful if there are a lot of 0 values)
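
As a rough illustration of feature crosses, the sketch below crosses two numeric features and two one-hot vectors. The function names and the four toy points are assumptions. The points are laid out so that no single straight line through the plane separates the two classes, yet the sign of the crossed feature does.

```python
def cross_numeric(x1, x2):
    """Synthetic feature formed by multiplying two numeric features."""
    return x1 * x2


def cross_one_hot(a, b):
    """Cross two one-hot vectors: one slot for every pair of categories."""
    return [ai * bi for ai in a for bi in b]


# Opposite quadrants share a class, so the raw features are not linearly separable.
points = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
labels = [1, 1, 0, 0]

# A simple linear rule on the crossed feature: positive product means class 1.
predictions = [1 if cross_numeric(x1, x2) > 0 else 0 for x1, x2 in points]
print(predictions == labels)

# Crossing a 2-category one-hot with a 3-category one-hot gives 6 slots.
crossed = cross_one_hot([0, 1], [0, 0, 1])
print(crossed)
```

The one-hot cross has a single 1 in the slot for the specific combination of categories, which lets a linear model learn a separate weight for each combination.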

10.0 Regularization for Simplicity

  • Minimizing Training Loss
  • Regularization - Not trusting your examples too much
  • When the red line (loss on data the model was not trained on) starts to go up, the model has stopped generalizing to the test examples
  • Early Stopping (stopping training before the training loss fully converges)
  • Penalize the model complexity:
    • While we are training
    • Empirical Risk Minimization: minimize(loss(data|model))
    • Structural Risk Minimization: minimize(loss(data|model) + complexity(model))
  • Model Complexity
    • Model complexity as a function of the weights of all the features in the model
    • Model complexity as a function of the total number of features with nonzero weights
  • We quantify complexity using the L2 Regularization formula
  • L2 Regularization - Sum of the squared values of the weights
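
A small sketch of the L2 term and of the structural risk objective. The function names, the lambda_ scaling factor, its default value and the example weights are assumptions for illustration.

```python
def l2_penalty(weights):
    """L2 regularization term: sum of the squared values of the weights."""
    return sum(w ** 2 for w in weights)


def structural_risk(data_loss, weights, lambda_=0.1):
    """loss(data|model) + complexity(model), with complexity scaled by lambda_.

    lambda_ is an assumed knob that controls how strongly complexity is penalized.
    """
    return data_loss + lambda_ * l2_penalty(weights)


weights = [0.2, 0.5, 5.0, 1.0, 0.25, 0.75]  # illustrative weights
penalty = l2_penalty(weights)
print(penalty)
print(structural_risk(data_loss=1.0, weights=weights))
```

In this example the single weight of 5.0 contributes 25 of the total penalty of about 26.9, so the L2 term mostly punishes large outlier weights.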

11.0 Logistic Regression

  • If we use Linear Regression to predict probabilities, we might end up with values outside the range 0 to 1
  • Logistic Regression provides a loss function and a prediction method whose output never goes below 0 or above 1
  • Useful for classification tasks, probability estimates, etc.
  • Linear Model -> Sigmoid
  • Training uses log loss (an entropy measure)
  • As a prediction gets closer to the wrong end of the range (the asymptotes at 0 and 1), the loss gets higher very quickly
  • Because of the asymptotes, we need to use regularization (L2 regularization)
  • Efficient to train and fast to use
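
The sigmoid and the log loss can be sketched as follows. The function names and the sample values are assumptions for illustration.

```python
import math


def sigmoid(z):
    """Squash the output of a linear model into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(label, probability):
    """Log loss for one example with a 0/1 label and a predicted probability."""
    return -(label * math.log(probability)
             + (1 - label) * math.log(1.0 - probability))


# The linear output z can be any real number; the sigmoid maps it to a probability.
for z in (-4.0, 0.0, 4.0):
    print(z, sigmoid(z))

# For a positive example, the loss grows quickly as the prediction approaches 0.
for p in (0.99, 0.5, 0.01):
    print(p, log_loss(1, p))
```

The loss is unbounded as the predicted probability for the true class approaches 0, which is the asymptotic behavior the notes refer to.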

12.0 Classification

  • Is it spam or not spam?
  • Pick a classification threshold to turn the predicted probability into a class; one way to evaluate classification performance is accuracy
  • Accuracy breaks down if we have class imbalance
  • True Positive, False Negative, True Negative, False Positive
  • Precision - When the boy cried wolf, how many times was he right?
  • Recall - Of all the times the wolf actually came, how many times did he cry wolf?
  • Precision and recall pull in opposite directions: raising one tends to lower the other
  • What classification threshold are you using?
    • Evaluate across many different classification thresholds
    • ROC curve (Receiver Operating Characteristic)
    • Area under the ROC curve (AUC) - when you pick a random positive and a random negative example, what is the probability that the model gives a higher score to the positive than to the negative? That probability is exactly equal to the area under the ROC curve
    • Prediction Bias - compares the sum of everything predicted with the sum of everything observed; ideally the two are equal
    • Bias is a simplistic check (easy to fool) | We want zero bias, but zero bias alone does not prove the model is good
    • Calibration Plot - take groups of data, bucket them, and compare the mean prediction with the mean observation in each bucket
      • Bucketing is needed because a single observation, like a coin flip, only comes up 1 or 0
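
The metrics above can be sketched in plain Python. The function names, the threshold and the toy labels and scores are assumptions. The auc function uses the pairwise-ranking interpretation from the notes rather than integrating an ROC curve, and it counts ties as half a win.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, TN, FN for a given classification threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 1, 0, 0, 0, 0, 0]                     # toy ground truth
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]    # toy model outputs

tp, fp, tn, fn = confusion_counts(labels, scores, threshold=0.5)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / len(labels)
print(precision, recall, accuracy, auc(labels, scores))
```

Raising the threshold in this sketch trades recall for precision, while the AUC stays the same because it does not depend on any single threshold.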

13.0 Regularization for Sparsity

  • Feature crosses of sparse features can cause problems
  • Sparse features
    • Words in a search query
    • Unique videos that we have to look up
  • The model size grows and takes up memory
  • Noisy Coefficients - we would like to lose them
  • Regularize to reduce model size and memory usage
  • L0 regularization - Zero out weights (not convex, hard to optimize)
  • L1 Regularization - Sum of the absolute value of the weights
  • L2 Regularization - Makes the weights small but won’t actually drive them to 0
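
One way to build intuition for the difference is to look only at the pull each penalty exerts on a single weight, ignoring the data loss. In this simplified sketch (function names, rates and step counts are assumptions), the L2 update removes a fixed fraction of the weight on each step, while the L1 update subtracts a fixed amount and stops at zero.

```python
def l1_penalty(weights):
    """L1 regularization term: sum of the absolute values of the weights."""
    return sum(abs(w) for w in weights)


def l2_shrink(w, rate=0.1, steps=100):
    """L2-style update: remove a fixed fraction of the weight on every step."""
    for _ in range(steps):
        w -= rate * w
    return w


def l1_shrink(w, amount=0.1, steps=100):
    """L1-style update: subtract a fixed amount, zeroing the weight at the end."""
    for _ in range(steps):
        if abs(w) <= amount:
            return 0.0  # a step that would cross 0 is zeroed out instead
        w -= amount if w > 0 else -amount
    return w


print(l1_penalty([0.5, -2.0, 0.0]))
print(l2_shrink(1.0))
print(l1_shrink(1.0))
```

Starting from 1.0, the L2-style update leaves a tiny but nonzero weight, while the L1-style update reaches exactly 0.0, which is what makes L1 useful for sparsity.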

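14.0 Neural Nets

  • Neural nets learn the nonlinearities themselves, which is useful for image, audio and video data
  • Model with structure
  • Stacking linear layers still gives a linear model; nonlinearity is added by a nonlinear activation layer (ReLU, Sigmoid, Tanh)
  • When training - non-convex optimization
  • Backpropagation is how gradient descent is carried out on multi-layer neural nets (it computes the gradients layer by layer)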
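
To illustrate what the nonlinear layer buys, here is a tiny two-unit network with hand-picked weights (an assumption for this sketch; nothing here is trained) that computes XOR, a function no purely linear model can represent.

```python
def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)


def xor_net(x1, x2):
    """Two hidden ReLU units followed by a linear output; weights are hand-picked."""
    h1 = relu(x1 + x2)          # fires when at least one input is on
    h2 = relu(x1 + x2 - 1.0)    # fires only when both inputs are on
    return h1 - 2.0 * h2        # linear output layer


inputs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
outputs = [xor_net(x1, x2) for x1, x2 in inputs]
print(outputs)
```

If the relu calls are removed, the whole network collapses into a single linear function of x1 and x2, and the XOR pattern can no longer be produced.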

15.0 Training Neural Nets

  • We need differentiable functions
  • Gradients can vanish (the signal-to-noise ratio gets worse with each additional layer)
  • Gradients can explode (e.g., when learning rates are too high)
  • ReLUs can die (if a unit’s input ends up below 0 for everything, it outputs 0 and stops learning)
  • Normalize features so that all of the inputs are on roughly the same scale
  • Dropout regularization - with probability p, remove a node for a single gradient step
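
A minimal sketch of dropout applied to a list of activations. The function name and the sample values are assumptions. Rescaling the surviving activations by 1 / (1 - p), often called inverted dropout, is a common convention assumed here; it is not something stated in these notes.

```python
import random


def dropout(activations, p, rng):
    """Zero each activation with probability p; rescale survivors by 1 / (1 - p).

    p must be less than 1. A fresh random mask is drawn on every call,
    which corresponds to one gradient step.
    """
    return [0.0 if rng.random() < p else a / (1.0 - p) for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0, 2.0, 3.0, 4.0]

dropped = dropout(activations, p=0.5, rng=rng)
unchanged = dropout(activations, p=0.0, rng=rng)
print(dropped)
print(unchanged)
```

With p = 0 nothing is dropped, and larger values of p mean stronger regularization. Dropout is applied only during training, not at inference time.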

16.0 Multi-Class Neural Nets

  • So far: classification for binary class problems
  • Multi-class: pick a label out of a range of classes
  • One vs. all multiclass classification
  • If each example belongs to only one class, make the outputs sum to 1 (using Softmax)
  • Multi-label classification problem (an image with 3 different dogs, or a dog and a person) - use the 1 vs. all classification strategy instead
  • Full Softmax computes a probability for every class, which is expensive to train
  • Candidate Sampling - train the output nodes for the classes the example belongs to, plus only a sample of the other classes
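
The difference between a single-label softmax output and independent per-class outputs can be sketched like this. The function names and the logits are assumptions for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (single-label case)."""
    largest = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def independent_sigmoids(logits):
    """One-vs-all outputs: each class gets its own probability (multi-label case)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three classes
probabilities = softmax(logits)
per_class = independent_sigmoids(logits)
print(probabilities, sum(probabilities))
print(per_class, sum(per_class))
```

The softmax outputs compete with each other and always sum to 1, while the independent sigmoid outputs can all be high at once, which is what a multi-label problem needs.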

17.0 Embeddings

  • Collaborative filtering
  • 1-dimensional embedding
  • Multi-dimensional embedding

© Swopnil Shrestha 2021