Notes from reading K. Friston et al.’s Active Inference: The Free Energy Principle in Mind, Brain, and Behavior (2022)

  • Karl Friston
  • Chap 2
    • Bayes’ Rule \(P(x \vert y) = \frac{P(y \vert x) \cdot P(x)}{P(y)}\)
      • Likelihood model \(P(y \vert x)\), prior belief \(P(x)\), posterior belief \(P(x \vert y)\)
      • Posterior is proportional (\(\propto\)) to prior × likelihood
      • But the posterior needs to be normalized, and computing the normalizer is in general computationally intractable.
        • Computing the marginal likelihood \(P(y) = \int_x P(y \vert x) P(x) \, dx\) requires integrating over all hidden states, which is intractable.
    • Given the likelihood model \(P(y \vert x)\) and the prior belief \(P(x)\)
      • Compute the joint probability \(P(x, y)\) and the marginal likelihood \(P(y)\)
      • If event \(y\) is actually observed, compute posterior belief \(P(x \vert y)\)
      • Surprise is \(\Im(y) := -\ln P(y)\)
      • Bayesian surprise is \(D_{KL}[P(x \vert y) \vert\vert P(x)]\). This scores the amount of belief updating, as opposed to simply how unlikely the observation was.
        • For example, if the prior is already certain (\(P(x)=1\) for some state), then \(P(x \vert y) = 1\) as well, so the Bayesian surprise is \(0\), even though the surprise \(\Im(y)\) may be nonzero. (A numeric sketch of these quantities follows below.)
      • \(-\ln(\cdot)\) is convex, so we can use Jensen’s inequality: for any distribution \(Q(x)\) over the hidden states, we have
\[\begin{align*} \Im(y) &= -\ln P(y) = -\ln \int_x P(y, x)\, dx = - \ln \int_x Q(x) \frac{P(y, x)}{Q(x)} dx \\ &\le - \int_x Q(x) \ln \frac{P(y, x)}{Q(x)} dx =: F[Q,y] \end{align*}\]

We name \(F[Q,y]\) the Variational Free Energy.

  • \(Q\) is called an approximate posterior
  • Equality holds when the approximate distribution \(Q(x)\) matches the exact posterior \(P(x \vert y) = \frac{P(y, x)}{P(y)}\), with the observation \(y\) held fixed.
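
As a concrete illustration, here is a minimal numeric sketch of these quantities on a made-up two-state example (all numbers are illustrative, not from the book). It computes the posterior, the surprise, the Bayesian surprise, and shows that \(F[Q,y]\) upper-bounds the surprise, with equality when \(Q\) is the exact posterior:

```python
import numpy as np

# Illustrative toy model: two hidden states x, two possible observations y.
prior = np.array([0.8, 0.2])                   # P(x)
likelihood = np.array([[0.9, 0.1],             # P(y | x); rows index x, columns index y
                       [0.3, 0.7]])

y = 1                                          # suppose the second outcome is observed
joint_y = likelihood[:, y] * prior             # P(y, x) for the observed y
p_y = joint_y.sum()                            # marginal likelihood P(y) = sum_x P(y|x) P(x)
posterior = joint_y / p_y                      # Bayes' rule: P(x|y) = P(y|x) P(x) / P(y)

surprise = -np.log(p_y)                        # surprise = -ln P(y)
bayesian_surprise = np.sum(posterior * np.log(posterior / prior))   # D_KL[P(x|y) || P(x)]

def free_energy(q):
    """Variational free energy F[Q, y] = -sum_x Q(x) ln( P(y, x) / Q(x) )."""
    return -np.sum(q * np.log(joint_y / q))

# F upper-bounds the surprise for any Q; the bound is tight when Q is the exact posterior.
for q in (np.array([0.5, 0.5]), np.array([0.9, 0.1]), posterior):
    print(q, round(free_energy(q), 4), ">=", round(surprise, 4))
```

Here the Bayesian surprise measures how far the posterior moved from the prior, while the surprise depends only on how improbable the observation itself was.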

Can express the variational free energy as Energy minus Entropy (a numeric check of this identity follows the list below):

\[\begin{align*} F[Q,y] &= - \int_x Q(x) \ln \frac{P(y,x)}{Q(x)}dx \\ &= - \int_x Q(x) \ln P(y, x)dx + \int_x Q(x) \ln Q(x)dx \\ &= - \mathbb{E}_{Q(x)}[\ln P(y, x)] - H(Q(x)) \end{align*}\]
  • Entropy is the average surprise
\[\begin{align*} H[Q(x)] = - \int_x Q(x) \ln Q(x) dx = \mathbb{E}_{Q(x)}[{\Im}_Q(x)] \end{align*}\]
  • In the absence of data or precise prior beliefs (which only influence the energy term), we should adopt maximally uncertain beliefs about the hidden state of the world, in accordance with Jaynes’s maximum entropy principle.
  • Be uncertain (high entropy) when we have no information.
  • Here, energy has a statistical mechanics interpretation.
    • The Boltzmann distribution \(P(E) = \frac{1}{Z} \cdot e^{-\frac{E}{kT}}\) describes the statistical behavior of a system with energy \(E\) at thermal equilibrium temperature \(T\).
    • \(Z\) is the partition function (a normalization constant), \(k\) is the Boltzmann constant.
    • Taking logs, \(\ln P(E) = -\frac{E}{kT} - \ln Z\): at thermal equilibrium, the log probability of a configuration decreases linearly with the energy \(E\) required to move the system into that configuration from a baseline configuration.
  • The name variational free energy for \(F[Q,y]\) comes from this statistical mechanics interpretation as energy minus entropy.
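
A quick numeric check of the energy-minus-entropy identity, reusing the same illustrative toy numbers as above:

```python
import numpy as np

# Same illustrative toy model as in the earlier sketch.
prior = np.array([0.8, 0.2])
likelihood = np.array([[0.9, 0.1],
                       [0.3, 0.7]])
y = 1
joint_y = likelihood[:, y] * prior             # P(y, x) for the observed y

q = np.array([0.6, 0.4])                       # an arbitrary approximate posterior Q(x)

energy   = -np.sum(q * np.log(joint_y))        # -E_Q[ln P(y, x)]
entropy  = -np.sum(q * np.log(q))              # H[Q(x)]
f_direct = -np.sum(q * np.log(joint_y / q))    # F[Q, y] computed from its definition

print(energy - entropy, "==", f_direct)        # the two agree (up to floating point)
```

When the energy term is flat in \(Q\) (no data, uninformative priors), minimizing \(F\) reduces to maximizing the entropy of \(Q\), which is the maximum-entropy behavior noted above.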

Can express the variational free energy as Complexity minus Accuracy (numeric check after the list below):

\[\begin{align*} F[Q,y] &= - \int_x Q(x) \ln \frac{P(y,x)}{Q(x)}dx \\ &= - \int_x Q(x) \ln \frac{P(y \vert x) P(x)}{Q(x)}dx \\ &= - \int_x Q(x) \ln \frac{P(x)}{Q(x)}dx - \int_x Q(x) \ln P(y \vert x)dx \\ &= D_{KL}[Q(x) \vert\vert P(x)] - \mathbb{E}_{Q(x)}[\ln P(y \vert x)] \end{align*}\]
  • Complexity means how much the approximate posterior \(Q(x)\) deviates from the prior \(P(x)\) - how many extra bits of information are encoded in \(Q(x)\) relative to \(P(x)\)
  • Accuracy \(\mathbb{E}_{Q(x)}[\ln P(y \vert x)]\) is maximized when the density \(Q\) places its mass on configurations of the latent variables that explain the observed data.
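
The complexity-minus-accuracy form can be checked on the same illustrative toy numbers:

```python
import numpy as np

prior = np.array([0.8, 0.2])
likelihood = np.array([[0.9, 0.1],
                       [0.3, 0.7]])
y = 1
q = np.array([0.6, 0.4])                                  # approximate posterior Q(x)

complexity = np.sum(q * np.log(q / prior))                # D_KL[Q(x) || P(x)]
accuracy   = np.sum(q * np.log(likelihood[:, y]))         # E_Q[ln P(y | x)]
f_direct   = -np.sum(q * np.log(likelihood[:, y] * prior / q))

print(complexity - accuracy, "==", f_direct)              # identical up to floating point
```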

Can express the variational free energy as Divergence minus Evidence:

\[\begin{align*} F[Q,y] &= - \int_x Q(x) \ln \frac{P(y,x)}{Q(x)}dx \\ &= - \int_x Q(x) \ln \frac{P(x \vert y) P(y)}{Q(x)}dx \\ &= - \int_x Q(x) \ln \frac{P(x \vert y)}{Q(x)}dx - \int_x Q(x) \ln P(y)dx \\ &= D_{KL}[Q(x) \vert\vert P(x \vert y)] - \ln P(y) \end{align*}\]
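
And the divergence-minus-evidence form, again on the same illustrative toy numbers. Because the KL term is non-negative, this form makes explicit that \(F\) upper-bounds the surprise and that minimizing \(F\) over \(Q\) drives \(Q\) toward the exact posterior:

```python
import numpy as np

prior = np.array([0.8, 0.2])
likelihood = np.array([[0.9, 0.1],
                       [0.3, 0.7]])
y = 1
joint_y = likelihood[:, y] * prior
posterior = joint_y / joint_y.sum()                        # exact posterior P(x | y)
log_evidence = np.log(joint_y.sum())                       # ln P(y)

q = np.array([0.6, 0.4])

divergence = np.sum(q * np.log(q / posterior))             # D_KL[Q(x) || P(x | y)]
f_direct   = -np.sum(q * np.log(joint_y / q))

print(divergence - log_evidence, "==", f_direct)
print(f_direct, ">=", -log_evidence)   # F >= surprise; equality iff Q is the exact posterior
```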

Interpretation:

  • This inference procedure is a combination of top-down processes that encode predictions of the observations (generated from prior beliefs via the likelihood model \(P(y \vert x)\)), and bottom-up processes that convey the actual sensory observations \(y\).
    • This interplay of top-down and bottom-up processes distinguishes the inferential view from alternative approaches that only consider bottom-up processes.
  • Bayesian inference is optimal with respect to a cost function, namely the variational free energy.
    • Variational free energy is closely related to surprise \(-\ln P(y)\)
  • Bayesian inference is different from Maximum Likelihood Estimation, which simply selects the hidden state \(x\) most likely to have generated the data \(y\), ignoring the prior.
  • The results of inference are subjective, because
    • Biological creatures have limited computational and energetic resources, which make Bayesian inference intractable. They make approximations:
      • variational posteriors, e.g. based on mean-field approximations (a toy sketch follows this list)
    • The generative model may not correspond to the real generative process.
      • Even as the generative model is optimized with newly acquired experience, it may not converge to the generative process.
      • The generative process is in a true state \(x^*\), which generates an observation \(y\) that the organism senses; the true state \(x^*\) itself remains hidden from the organism.
    • Psychological claims about the optimality of inference are always contingent on the organism’s resources: its specific generative model and its bounded computational resources.
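
To make the mean-field point above concrete, here is a minimal sketch of a factorized (mean-field) approximate posterior fitted by coordinate ascent on a hypothetical two-variable discrete model; the model structure and numbers are made up for illustration and are not the book’s example:

```python
import numpy as np

# Hypothetical toy generative model with two binary hidden states x1, x2 and one
# observed binary outcome y: p(x1, x2, y) = p(x1) p(x2) p(y | x1, x2).
p_x1 = np.array([0.7, 0.3])
p_x2 = np.array([0.5, 0.5])
p_y_given_x = np.array([[[0.9, 0.1],      # p(y | x1=0, x2=0)
                         [0.6, 0.4]],     # p(y | x1=0, x2=1)
                        [[0.4, 0.6],      # p(y | x1=1, x2=0)
                         [0.2, 0.8]]])    # p(y | x1=1, x2=1)
y = 1

# log p(x1, x2, y) for the observed y, indexed [x1, x2]
log_joint = (np.log(p_x1)[:, None] + np.log(p_x2)[None, :]
             + np.log(p_y_given_x[:, :, y]))

# Mean-field approximation Q(x1, x2) = Q1(x1) Q2(x2), fitted by coordinate ascent:
# each factor is updated to Q_i(x_i) proportional to exp( E_{Q_-i}[ log p(x, y) ] ).
q1 = np.array([0.5, 0.5])
q2 = np.array([0.5, 0.5])
for _ in range(50):
    q1 = np.exp(log_joint @ q2); q1 /= q1.sum()
    q2 = np.exp(q1 @ log_joint); q2 /= q2.sum()

# Free energy of the factorized posterior vs. the exact surprise -ln P(y)
q_joint = np.outer(q1, q2)
free_energy = -np.sum(q_joint * (log_joint - np.log(q_joint)))
surprise = -np.log(np.exp(log_joint).sum())
print(q1, q2, round(free_energy, 4), ">=", round(surprise, 4))
```

The factorized \(Q(x_1)Q(x_2)\) cannot capture any posterior correlation between \(x_1\) and \(x_2\), so its free energy generally stays strictly above the surprise \(-\ln P(y)\); this is one sense in which the organism’s inference is approximate rather than exact.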
