Andrei’s notes:

2018

  • L3
    • Locally Weighted Regression
    • Parametric vs Non-Parametric Learning Algos
    • Maximum Likelihood Estimation for Linear Regression assuming IID Gaussian errors implies minimizing the squared error (worked out after this list)
    • Logistic Regression
    • Newton’s Method: converges in fewer iterations than gradient ascent, though each iteration is more expensive (sketch after this list)
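
    A worked version of the MLE-implies-least-squares item above: my own write-up of the standard argument, assuming the model \(y^{(i)} = \theta^\top x^{(i)} + \epsilon^{(i)}\) with IID Gaussian noise \(\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)\) (the \(\theta\), \(x^{(i)}\), \(\sigma\) notation is mine, not from these notes).

    ```latex
    \[
    \ell(\theta)
      = \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}
        \exp\!\left( -\frac{\bigl(y^{(i)} - \theta^\top x^{(i)}\bigr)^2}{2\sigma^2} \right)
      = n \log \frac{1}{\sqrt{2\pi}\,\sigma}
        - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl(y^{(i)} - \theta^\top x^{(i)}\bigr)^2 .
    \]
    % The first term and the factor 1/(2\sigma^2) do not depend on \theta, so maximizing
    % \ell(\theta) is exactly minimizing the squared error \sum_i (y^{(i)} - \theta^\top x^{(i)})^2.
    ```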
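
    A minimal sketch of Newton’s method for logistic regression, referenced above. This is my own illustrative code (NumPy assumed), not the lecture’s implementation; names like `newton_logistic` are made up.

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def newton_logistic(X, y, n_iters=10):
        """Fit logistic regression with Newton's method.

        X: (n, d) design matrix (include a column of ones for an intercept),
        y: (n,) labels in {0, 1}.
        """
        n, d = X.shape
        theta = np.zeros(d)
        for _ in range(n_iters):
            p = sigmoid(X @ theta)              # predicted probabilities
            grad = X.T @ (p - y)                # gradient of the negative log-likelihood
            W = p * (1.0 - p)                   # diagonal of the weight matrix
            H = X.T @ (X * W[:, None])          # Hessian: X^T diag(W) X
            theta -= np.linalg.solve(H, grad)   # Newton update
        return theta

    # Toy usage: recover parameters from synthetic data.
    rng = np.random.default_rng(0)
    X = np.c_[np.ones(200), rng.normal(size=(200, 2))]
    y = (rng.uniform(size=200) < sigmoid(X @ np.array([-0.5, 2.0, -1.0]))).astype(float)
    print(newton_logistic(X, y))
    ```

    A handful of Newton steps is usually enough on a problem like this, versus many more gradient-ascent steps, which is the trade-off noted above.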
  • L4
    • Perceptron
    • Exponential Family
      • See Berkeley 260 Lecture Notes, Chap 8 for
        • Exponential family
        • What the log-partition term \(A(\nu)\) is for the multinomial distribution written as an exponential family
    • Generalized Linear Model (Bernoulli worked example after this list)
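
    A small worked example for the exponential-family and GLM items above (standard material written out by me, keeping this document's \(\nu\), \(a(\nu)\) notation): the Bernoulli distribution in exponential-family form and its canonical response.

    ```latex
    \[
    p(y; \phi) = \phi^{y} (1-\phi)^{1-y}
               = \exp\!\left( y \log\tfrac{\phi}{1-\phi} + \log(1-\phi) \right),
    \]
    % so T(y) = y, h(y) = 1, the natural parameter is \nu = \log\frac{\phi}{1-\phi},
    % and the log-partition term is
    \[
    a(\nu) = -\log(1-\phi) = \log\bigl(1 + e^{\nu}\bigr).
    \]
    % Inverting the natural parameter gives the canonical response
    \[
    \phi = \frac{1}{1 + e^{-\nu}},
    \]
    % which is why the GLM for a Bernoulli output is logistic regression.
    ```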
  • L5
    • Generative vs Discriminative Algorithms
    • Gaussian Discriminant Analysis (GDA); sketch after this list
      • Comparison with Logistic Regression
    • Naive Bayes
    • Class Notes
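
    A minimal GDA sketch for the L5 items above; my own illustrative code (NumPy assumed), two classes with a shared covariance matrix.

    ```python
    import numpy as np

    def fit_gda(X, y):
        """MLE for Gaussian Discriminant Analysis with shared covariance.

        X: (n, d) features, y: (n,) labels in {0, 1}.
        Returns (phi, mu0, mu1, Sigma).
        """
        phi = y.mean()                                      # P(y = 1)
        mu0 = X[y == 0].mean(axis=0)
        mu1 = X[y == 1].mean(axis=0)
        centered = X - np.where(y[:, None] == 1, mu1, mu0)  # subtract each row's class mean
        Sigma = centered.T @ centered / len(y)
        return phi, mu0, mu1, Sigma

    def predict_proba(X, phi, mu0, mu1, Sigma):
        """P(y = 1 | x) via Bayes' rule; with a shared Sigma this is logistic in x."""
        Sinv = np.linalg.inv(Sigma)
        def quad(mu):
            diff = X - mu
            return -0.5 * np.einsum('ij,jk,ik->i', diff, Sinv, diff)
        log_odds = quad(mu1) - quad(mu0) + np.log(phi) - np.log(1 - phi)
        return 1.0 / (1.0 + np.exp(-log_odds))
    ```

    Comparing `predict_proba` against a logistic-regression fit on the same data is a quick way to see the GDA vs logistic regression comparison mentioned above.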

2019

  • L19
    • Topics
      • Definition of KL-Divergence: \(D_{KL}(P \vert\vert Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_P \left[ \log \frac{P}{Q} \right]\)
      • Definition of Entropy: \(H(P) = - \sum_x P(x) \log P(x) = \mathbb{E}_P \left[ \log \frac{1}{P} \right]\)
      • Definition of Cross Entropy: \(H(P,Q) = - \sum_x P(x) \log Q(x) = \mathbb{E}_P \left[ \log \frac{1}{Q} \right]\)
      • \[D_{KL}(P \vert\vert Q) = H(P,Q) - H(P)\] (checked numerically in the sketch at the end of these L19 notes)
      • Maximum Entropy and Exponential Family
        • Exponential Family: \(f(y \vert \nu) = h(y)\, e^{\nu \cdot T(y) - a(\nu)}\)
        • Maximum Likelihood Estimate (MLE) of Exponential Family: obtained when the gradient \(\nabla_\nu\) of the log-likelihood is zero. By calculation, this is when \(a^\prime(\nu) = \frac{1}{n} \sum_{i=1}^{n} T(y^{(i)})\)
        • Maximum Entropy Principle
          • We want to estimate the probability density \(p(y)\) with maximum entropy, given \(m\) constraints \(\sum_i T_j(y_i) p(y_i) = c_j\), \(j = 1, \ldots, m\)
          • By convention, \(p(y_i)\) is denoted \(p_i\) (vector notation)
          • Typically we want \(T_j(y) = y^k\), the \(k\)-th moment
          • \(p\) will turn out to be in the exponential family
          • Most times, \(c_j\) arises from observed data: \(c_j = \frac{1}{n} \sum_{i=1}^{n} T_j(y^{(i)})\)
          • Solve using the Lagrangian \(\mathcal{L}(p, \eta, \lambda) = H(p) + \langle \eta, Tp - c \rangle + \lambda (\langle \mathbf{1}, p \rangle - 1)\), zeroing the partial derivative with respect to \(p_i\) for all \(i\) (worked out at the end of these L19 notes)
      • KL-Divergence
      • Calibration and Proper Scoring Rules
    • Class notes
    • References
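
    A quick numerical check of the identity \(D_{KL}(P \vert\vert Q) = H(P,Q) - H(P)\) from the definitions above (my own sketch, NumPy, natural log, distributions assumed to have full support):

    ```python
    import numpy as np

    def entropy(p):
        return -np.sum(p * np.log(p))          # H(P)

    def cross_entropy(p, q):
        return -np.sum(p * np.log(q))          # H(P, Q)

    def kl(p, q):
        return np.sum(p * np.log(p / q))       # D_KL(P || Q)

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])

    print(kl(p, q))                            # ~0.0253
    print(cross_entropy(p, q) - entropy(p))    # same value
    ```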
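
    A worked version of the Lagrangian step in the Maximum Entropy Principle above (my own write-up of the standard argument, in the notes' notation):

    ```latex
    \[
    \mathcal{L}(p, \eta, \lambda)
      = -\sum_i p_i \log p_i
        + \sum_j \eta_j \Bigl( \sum_i T_j(y_i)\, p_i - c_j \Bigr)
        + \lambda \Bigl( \sum_i p_i - 1 \Bigr),
    \]
    \[
    \frac{\partial \mathcal{L}}{\partial p_i}
      = -\log p_i - 1 + \sum_j \eta_j T_j(y_i) + \lambda = 0
    \;\Longrightarrow\;
    p_i = e^{\lambda - 1} \exp\!\Bigl( \sum_j \eta_j T_j(y_i) \Bigr)
        \propto e^{\eta \cdot T(y_i)} .
    \]
    % The constant e^{\lambda - 1} is fixed by \sum_i p_i = 1 and plays the role of
    % e^{-a(\eta)}: the maximum-entropy distribution is in the exponential family.
    ```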
  • L20