Andrei’s notes:

2018

  • L3
    • Locally Weighted Regression
    • Parametric vs Non-Parametric Learning Algos
    • Maximum Likelihood Estimation for Linear Regression assuming IID Gaussian errors implies minimizing the squared error (worked out after this list)
    • Logistic Regression
    • Newton’s Method: converges in fewer iterations than gradient ascent, though each iteration is more expensive (sketch after this list)
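
    A worked version of the MLE-implies-least-squares item above: my own write-up of the standard argument, assuming the model \(y^{(i)} = \theta^\top x^{(i)} + \epsilon^{(i)}\) with IID Gaussian noise \(\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)\) (the \(\theta\), \(x^{(i)}\), \(\sigma\) notation is mine, not from these notes).

    ```latex
    \[
    \ell(\theta)
      = \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}
        \exp\!\left( -\frac{\bigl(y^{(i)} - \theta^\top x^{(i)}\bigr)^2}{2\sigma^2} \right)
      = n \log \frac{1}{\sqrt{2\pi}\,\sigma}
        - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl(y^{(i)} - \theta^\top x^{(i)}\bigr)^2 .
    \]
    % The first term and the factor 1/(2\sigma^2) do not depend on \theta, so maximizing
    % \ell(\theta) is exactly minimizing the squared error \sum_i (y^{(i)} - \theta^\top x^{(i)})^2.
    ```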
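
    A minimal sketch of Newton’s method for logistic regression, referenced above. This is my own illustrative code (NumPy assumed), not the lecture’s implementation; names like `newton_logistic` are made up.

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def newton_logistic(X, y, n_iters=10):
        """Fit logistic regression with Newton's method.

        X: (n, d) design matrix (include a column of ones for an intercept),
        y: (n,) labels in {0, 1}.
        """
        n, d = X.shape
        theta = np.zeros(d)
        for _ in range(n_iters):
            p = sigmoid(X @ theta)              # predicted probabilities
            grad = X.T @ (p - y)                # gradient of the negative log-likelihood
            W = p * (1.0 - p)                   # diagonal of the weight matrix
            H = X.T @ (X * W[:, None])          # Hessian: X^T diag(W) X
            theta -= np.linalg.solve(H, grad)   # Newton update
        return theta

    # Toy usage: recover parameters from synthetic data.
    rng = np.random.default_rng(0)
    X = np.c_[np.ones(200), rng.normal(size=(200, 2))]
    y = (rng.uniform(size=200) < sigmoid(X @ np.array([-0.5, 2.0, -1.0]))).astype(float)
    print(newton_logistic(X, y))
    ```

    A handful of Newton steps is usually enough on a problem like this, versus many more gradient-ascent steps, which is the trade-off noted above.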
  • L4
    • Perceptron
    • Exponential Family
      • See Berkeley 260 Lecture Notes, Chap 8 for
        • Exponential family
        • What the log-partition term \(A(\nu)\) is for the multinomial distribution written as an exponential family
    • Generalized Linear Model (Bernoulli worked example after this list)
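
    A small worked example for the exponential-family and GLM items above (standard material written out by me, keeping this document's \(\nu\), \(a(\nu)\) notation): the Bernoulli distribution in exponential-family form and its canonical response.

    ```latex
    \[
    p(y; \phi) = \phi^{y} (1-\phi)^{1-y}
               = \exp\!\left( y \log\tfrac{\phi}{1-\phi} + \log(1-\phi) \right),
    \]
    % so T(y) = y, h(y) = 1, the natural parameter is \nu = \log\frac{\phi}{1-\phi},
    % and the log-partition term is
    \[
    a(\nu) = -\log(1-\phi) = \log\bigl(1 + e^{\nu}\bigr).
    \]
    % Inverting the natural parameter gives the canonical response
    \[
    \phi = \frac{1}{1 + e^{-\nu}},
    \]
    % which is why the GLM for a Bernoulli output is logistic regression.
    ```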
  • L5
    • Generative vs Discriminative Algorithms
    • Gaussian Discriminant Analysis (GDA); sketch after this list
      • Comparison with Logistic Regression
    • Naive Bayes
    • Class Notes
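
    A minimal GDA sketch for the L5 items above; my own illustrative code (NumPy assumed), two classes with a shared covariance matrix.

    ```python
    import numpy as np

    def fit_gda(X, y):
        """MLE for Gaussian Discriminant Analysis with shared covariance.

        X: (n, d) features, y: (n,) labels in {0, 1}.
        Returns (phi, mu0, mu1, Sigma).
        """
        phi = y.mean()                                      # P(y = 1)
        mu0 = X[y == 0].mean(axis=0)
        mu1 = X[y == 1].mean(axis=0)
        centered = X - np.where(y[:, None] == 1, mu1, mu0)  # subtract each row's class mean
        Sigma = centered.T @ centered / len(y)
        return phi, mu0, mu1, Sigma

    def predict_proba(X, phi, mu0, mu1, Sigma):
        """P(y = 1 | x) via Bayes' rule; with a shared Sigma this is logistic in x."""
        Sinv = np.linalg.inv(Sigma)
        def quad(mu):
            diff = X - mu
            return -0.5 * np.einsum('ij,jk,ik->i', diff, Sinv, diff)
        log_odds = quad(mu1) - quad(mu0) + np.log(phi) - np.log(1 - phi)
        return 1.0 / (1.0 + np.exp(-log_odds))
    ```

    Comparing `predict_proba` against a logistic-regression fit on the same data is a quick way to see the GDA vs logistic regression comparison mentioned above.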

2019

  • L19
    • Topics
      • Definition of KL-Divergence: \(D_{KL}(P \vert\vert Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_P \left[ \log \frac{P}{Q} \right]\)
      • Definition of Entropy: \(H(P) = - \sum_x P(x) \log P(x) = \mathbb{E}_P \left[ \log \frac{1}{P} \right]\)
      • Definition of Cross Entropy: \(H(P,Q) = - \sum_x P(x) \log Q(x) = \mathbb{E}_P \left[ \log \frac{1}{Q} \right]\)
      • \[D_{KL}(P \vert\vert Q) = H(P,Q) - H(P)\] (checked numerically in the sketch at the end of these L19 notes)
      • Maximum Entropy and Exponential Family
        • Exponential Family: \(f(y \vert \nu) = h(y)\, e^{\nu \cdot T(y) - a(\nu)}\)
        • Maximum Likelihood Estimate (MLE) of Exponential Family: obtained when the gradient \(\nabla_\nu\) of the log-likelihood is zero. By calculation, this is when \(a^\prime(\nu) = \frac{1}{n} \sum_{i=1}^{n} T(y^{(i)})\)
        • Maximum Entropy Principle
          • We want to estimate the probability density \(p(y)\) with maximum entropy, given \(m\) constraints \(\sum_i T_j(y_i) p(y_i) = c_j\), \(j = 1, \ldots, m\)
          • By convention, \(p(y_i)\) is denoted \(p_i\) (vector notation)
          • Typically we want \(T_j(y) = y^k\), the \(k\)-th moment
          • \(p\) will turn out to be in the exponential family
          • Most times, \(c_j\) arises from observed data: \(c_j = \frac{1}{n} \sum_{i=1}^{n} T_j(y^{(i)})\)
          • Solve using the Lagrangian \(\mathcal{L}(p, \eta, \lambda) = H(p) + \langle \eta, Tp - c \rangle + \lambda (\langle \mathbf{1}, p \rangle - 1)\), zeroing the partial derivative with respect to \(p_i\) for all \(i\) (worked out at the end of these L19 notes)
      • KL-Divergence
      • Calibration and Proper Scoring Rules
    • Class notes
    • References
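
    A quick numerical check of the identity \(D_{KL}(P \vert\vert Q) = H(P,Q) - H(P)\) from the definitions above (my own sketch, NumPy, natural log, distributions assumed to have full support):

    ```python
    import numpy as np

    def entropy(p):
        return -np.sum(p * np.log(p))          # H(P)

    def cross_entropy(p, q):
        return -np.sum(p * np.log(q))          # H(P, Q)

    def kl(p, q):
        return np.sum(p * np.log(p / q))       # D_KL(P || Q)

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])

    print(kl(p, q))                            # ~0.0253
    print(cross_entropy(p, q) - entropy(p))    # same value
    ```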
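
    A worked version of the Lagrangian step in the Maximum Entropy Principle above (my own write-up of the standard argument, in the notes' notation):

    ```latex
    \[
    \mathcal{L}(p, \eta, \lambda)
      = -\sum_i p_i \log p_i
        + \sum_j \eta_j \Bigl( \sum_i T_j(y_i)\, p_i - c_j \Bigr)
        + \lambda \Bigl( \sum_i p_i - 1 \Bigr),
    \]
    \[
    \frac{\partial \mathcal{L}}{\partial p_i}
      = -\log p_i - 1 + \sum_j \eta_j T_j(y_i) + \lambda = 0
    \;\Longrightarrow\;
    p_i = e^{\lambda - 1} \exp\!\Bigl( \sum_j \eta_j T_j(y_i) \Bigr)
        \propto e^{\eta \cdot T(y_i)} .
    \]
    % The constant e^{\lambda - 1} is fixed by \sum_i p_i = 1 and plays the role of
    % e^{-a(\eta)}: the maximum-entropy distribution is in the exponential family.
    ```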
  • L20