Notes on a great course by Geoff Hulten on Machine Learning.

  • Course web page
  • Textbooks
  • 1: Overview of Machine Learning
    • slides
    • D. Sculley et al: Machine Learning: The High-Interest Credit Card of Technical Debt (2014)
    • Supervised learning \(Y=f(X)\):
      • Discrete \(Y\): Classification
      • Continuous \(Y\): Regression
      • Probability Estimation: \(P(Y \vert X)\)
    • Unsupervised, semi-supervised learning
    • Reinforcement learning
    • ML algorithm components
      • Model structure
        • Linear
        • Decision trees
        • Ensembles of models
        • Instance based methods
        • Neural nets
        • Support vector machines
        • Graphical models (Bayes/Markov nets)
        • Etc
      • Loss function
        • Accuracy
        • Precision & Recall
        • Cost / Utility
        • Squared error
        • Likelihood
        • Posterior probability
        • Margin
        • L1 & L2 normalization
        • Entropy
        • KL divergence
        • Etc
      • Optimization - how the learning algo finds the best model
        • Greedy search
        • Gradient descent
        • Linear programming
        • Regularization
        • Look ahead, momentum, stochastic methods / batching, learning rates, termination conditions, etc
  • Top 5 Career Paths for Data Professionals
  • 2: Basics of Evaluating Models
    • slides
    • Training set, validation set, test set
    • Overfit, underfit
    • Confusion matrix

  • 3: Logistic Regression
    • slides
    • Linear model for classification and probability estimation
    • Can be effective when:
      • The problem is linearly separable
      • Or there are a lot of relevant features (10s-100s)
      • Need efficient runtime or need a baseline
    • Not the most accurate option, but used a lot in practice
    • Model: linear with sigmoid activation \(\hat{y} = sigmoid(w_0 + \sum_{i=1}^n w_i * x_i)\) where \(sigmoid(x) = \frac{1}{1+e^x}\)
    • Loss function: log loss \(Loss(\hat{y},y) = -y \, log(\hat{y}) - (1-y) \, log(1 - \hat{y})\)
    • Optimization: gradient descent
    • Parameters
      • Step size used in gradient descent
      • Convergence - minimum loss increase required to continue gradient descent
      • Threshold - value between 0-1 to convert \(\hat{y}\) into a classification
  • 4: Intro to Feature Engineering with Text
    • slides
    • Binary Features
      • Contains word?
      • Is SMS longer than X?
      • Contains (*#)?
    • Categorical Features
      • First word is {noun, verb, other…}
      • Message length is {short, medium, long, very long}
      • Token type is {Number, URL, Word, Phone #, Unknown}
      • Grammar Analysis is {Fragment, Simple Sentence, Complex Sentence}
    • Numerical Features
      • CountOf Word(call)
      • MessageLength
      • FirstNumberInMessage
      • WritingGradeLevel
    • Feature Engineering for words
      • Tokenize
      • Bag of words
      • N-grams
      • TF-IDF (term frequency - inverse document frequency)
      • Embeddings - Word2Vec and FastText
      • NLP
    • Normalization of numeric features
    • Feature engineering in other domains
      • Computer vision
        • Gradients
        • Histograms
        • Convolutions
      • Internet
        • IP Parts
        • Domains
        • Relationships
        • Reputation
      • Time series
        • Window aggregation
        • Frequency domain transformation
  • 5: Intro to Feature Selection
    • slides
    • Frequency
      • Take N most common features in the training set
    • Mutual information
      • Take N features that contain most information about the value of Y on the training set
      • Mutual information: \(MI(X,Y)=\sum_{x \in X, y \in Y}p(x,y) ln (\frac{p(x,y)}{p(x)p(y)})\)
    • Accuracy
      • Greedy search, add and remove features, evaluate on validation data
  • 6: ROC Curves and Operating Points
    • slides
    • Logistic regression produces score between 0-1.
      • Use threshold for classification
      • Vary threshold to get best result
      • Build Receiver Operating Characteristic (ROC) curve to plot False Negative Rate by False Positive Rate by threshold

  • AUC is area under curve, used for comparing performance between models. Operating point is point on ROC curve closest to origin. PR curve is same but for precision/recall. Update threshold to avoid drift. Threshold can be updated w/o having to update model.
  • 7: Bounds and Comparing Models
    • slides
    • Tom M. Mitchell: Machine Learning Chap. 7
    • Goal: predict how a model will behave once deployed
    • Central limit theorem: if you have a population with mean \(\mu\) and standard deviation \(\sigma\) and take sufficiently large random samples from the population with replacement , then the distribution of the sample means will be approximately normally distributed with the same mean \(\mu\) and standard deviation \(\sigma\).
    • \(x\%\) confidence interval defines a range of values that you can be \(x\%\) certain contains the population mean.
      • IF: Better_Model – Bound > Worse_Model + Bound
      • THEN: With \(x\%\) confidence the model that looks better on the sample is better
      • ELSE: there is less than a \((1 - x \%)\) chance the model that looks worse is actually better
    • Is new parameter or feature worth using?
      • Train with or without feature
      • Compare using one sided bounds
      • Cross validation
        • Split data into k folds
        • Train with & without feature on k-1 folds
        • Validate on remaining fold
        • Use all validation data when computing bounds
      • When to use cross validation? When you need to measure accuracy variance:
        • For feature selection
        • When data changes from under the model, causing variance
        • When ML algorithm includes internal randomization, causing variance
      • Be careful of data drift
        • Time series
        • Dependencies of data on other factors that may change (e.g. spam campaigns)
        • Other violations of interdependence assumptions
  • 8: Naive Bayes
    • slides
    • See Tom M. Mitchell: Machine Learning, Chap. 6.9
    • Discriminative vs Generative modes
    • Bayes Theorem: \(P(A \vert B) = \frac{P(B \vert A) P(A)}{P(B)}\)
    • Bayes model: \(y = \underset{y_j \in Y}{argmax} \frac{P(\lt x \gt \vert y_j) P(y_j)}{P(\lt x \gt)}\)
      • Same as \(y = \underset{y_j \in Y}{argmax} P(\lt x \gt \vert y_j) P(y_j)\)
      • Simplifying assumption: the features \(x_i\) of \(\lt x \gt\) are independent (where the name ‘naive’ comes from)
      • Naive Bayes model: \(y = \underset{y_j \in Y}{argmax} P(y_j) \underset{i}{\prod} P(x_i \vert y_j)\)
      • Of theoretical importance, almost never used in practice
      • For small number of features \(x_i\) and classes \(y_j\), build incidence table of how many times \(<x> \rightarrow y\), and compute \(y\) using naive Bayes
  • 9: Implementing with Machine Learning
  • 10: Decision Trees
  • 11: Defining Success with Machine Learning
  • 12: Overfitting and Underfitting
    • slides
    • Inductive bias - assumptions you make about how likely any particular concept is
      • Model structure
        • Linear model
        • Axis-aligned tree structure
        • Labels are clustered
      • Dataset selection
        • Train/test/validate
        • Cross validation
      • Concept complexity
        • Occam’s Razor - all other things being equal, simpler model is better
        • Regularization
      • Types of model errors
        • Bias - caused when the model cannot represent the concept
        • Variance - caused when the model overreacts to small changes in the data
        • Noise - data was a bit corrupted; labels were a bit wrong, etc
        • Total loss = bias + variance + noise
  • 13: Intelligent User Experiences
  • 14: Ensembles 1: Bagging & Random Forests
    • slides
    • Models can have bias or variance
      • When models are too simple, they can’t represent complex concepts (bias)
      • Complex models can overfit noise in data (variance/jitter)
    • Ensembles: combine multiple models
      • Can mitigate bias or variance
      • Often result in better accuracy
    • Approaches
      • Use multiple independent models to learn, and combine the result randomly (bagging, random forest), or by using a meta-model to select result (stacking)
        • Works for high variance, low bias base models
      • Chain models, and use output error of previous model as input to next model, then combine the results (boosting, gradient boosting machines)
        • Models in chain learn different parts of the concept
        • Can work with high bias base models
        • May overfit
    • Bagging
      • Create $k$ training sets by sampling with replacement from original set
      • Train $k$ independent models
      • Combine predictions by uniform voting
        • Each model learns concept a different way
        • Bootstrap sampling accentuates variance between $k$ models
        • Voting tends to cancel out variance without increasing bias
      • Good when base models have low bias / high variance
    • Boosted trees
      • Same as bagging, but randomly drop features from each of the $k$ independent models
      • Compared to bagging, feature restriction introduces higher variance
  • 15: Ensembles 2: Boosting
    • slides
    • Create chain of models
    • Model $i$ trained on residual error of model $i-1$
    • Reweighting training data: give bigger weight to samples that had higher error
    • Final result is weighted sum of intermediate results
    • Stop adding models when starting to overfit (check on holdout set)
    • Boosting can help with bias and variance, but can overfit.
    • Gradient Boosting Machines (GBM) often used in practice to get high accuracy
  • 16: Ensembles 3: Stacking & Intelligence Architectures
    • slides
    • Stacking: same as boosting, but train metalearner on holdout data to pick winner
    • Organizing Intelligence is critical for long term success with ML
      • Stacking (meta models)
      • Business Logic
      • Sequencing (sequence of models can eliminate junk according to own rules; simple to extend and maintain)
      • Partitioning (break down into subproblems independently solved by different models)
  • 17: ML Design Pattern: Adversarial Learning
  • 18: Basics of Computer Vision
  • 19: Neural Networks
  • 20: Neural Network Architectures
  • 21: ML Design Pattern - Corpus Centric
  • 22: Reinforcement Learning
  • 23: ML Design Pattern - Ranking

Other