Notes from reading K. Friston et al.’s Active Inference: The Free Energy Principle in Mind, Brain, and Behavior (2022)

  • Karl Friston
  • Chap 2
    • Bayes’ Rule \(P(x \vert y) = \frac{P(y \vert x) \cdot P(x)}{P(y)}\)
      • Likelihood model \(P(y \vert x)\), prior belief \(P(x)\), posterior belief \(P(x \vert y)\)
      • Posterior is proportional (\(\propto\)) to prior × likelihood
      • But the posterior needs to be normalized, and computing the normalizer is in general computationally intractable.
        • Computing the marginal likelihood \(P(y) = \int_x P(y \vert x) P(x) \, dx\) requires integrating over all hidden states, which is intractable.
    • Given the likelihood model \(P(y \vert x)\) and the prior belief \(P(x)\)
      • Compute the joint probability \(P(x, y)\) and the marginal likelihood \(P(y)\)
      • If event \(y\) is actually observed, compute posterior belief \(P(x \vert y)\)
      • Surprise is \(\Im(y) := -\ln P(y)\)
      • Bayesian surprise is \(D_{KL}[P(x \vert y) \vert\vert P(x)]\). This scores the amount of belief updating, as opposed to simply how unlikely the observation was.
        • For example, if the prior is already certain (\(P(x)=1\) for some state), then \(P(x \vert y) = 1\) as well, so the Bayesian surprise is \(0\), even though the surprise \(\Im(y)\) may be nonzero. (A numeric sketch of these quantities follows below.)
      • \(-\ln(\cdot)\) is convex, so we can use Jensen’s inequality: for any distribution \(Q(x)\) over the hidden states, we have
\[\begin{align*} \Im(y) &= -\ln P(y) = -\ln \int_x P(y, x)\, dx = - \ln \int_x Q(x) \frac{P(y, x)}{Q(x)} dx \\ &\le - \int_x Q(x) \ln \frac{P(y, x)}{Q(x)} dx =: F[Q,y] \end{align*}\]

We name \(F[Q,y]\) the Variational Free Energy.

  • \(Q\) is called an approximate posterior
  • Equality holds when the approximate distribution \(Q(x)\) matches the exact posterior \(P(x \vert y) = \frac{P(y, x)}{P(y)}\), with the observation \(y\) held fixed.
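
As a concrete illustration, here is a minimal numeric sketch of these quantities on a made-up two-state example (all numbers are illustrative, not from the book). It computes the posterior, the surprise, the Bayesian surprise, and shows that \(F[Q,y]\) upper-bounds the surprise, with equality when \(Q\) is the exact posterior:

```python
import numpy as np

# Illustrative toy model: two hidden states x, two possible observations y.
prior = np.array([0.8, 0.2])                   # P(x)
likelihood = np.array([[0.9, 0.1],             # P(y | x); rows index x, columns index y
                       [0.3, 0.7]])

y = 1                                          # suppose the second outcome is observed
joint_y = likelihood[:, y] * prior             # P(y, x) for the observed y
p_y = joint_y.sum()                            # marginal likelihood P(y) = sum_x P(y|x) P(x)
posterior = joint_y / p_y                      # Bayes' rule: P(x|y) = P(y|x) P(x) / P(y)

surprise = -np.log(p_y)                        # surprise = -ln P(y)
bayesian_surprise = np.sum(posterior * np.log(posterior / prior))   # D_KL[P(x|y) || P(x)]

def free_energy(q):
    """Variational free energy F[Q, y] = -sum_x Q(x) ln( P(y, x) / Q(x) )."""
    return -np.sum(q * np.log(joint_y / q))

# F upper-bounds the surprise for any Q; the bound is tight when Q is the exact posterior.
for q in (np.array([0.5, 0.5]), np.array([0.9, 0.1]), posterior):
    print(q, round(free_energy(q), 4), ">=", round(surprise, 4))
```

Here the Bayesian surprise measures how far the posterior moved from the prior, while the surprise depends only on how improbable the observation itself was.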

Can express the variational free energy as Energy minus Entropy (a numeric check of this identity follows the list below):

\[\begin{align*} F[Q,y] &= - \int_x Q(x) \ln \frac{P(y,x)}{Q(x)}dx \\ &= - \int_x Q(x) \ln P(y, x)dx + \int_x Q(x) \ln Q(x)dx \\ &= - \mathbb{E}_{Q(x)}[\ln P(y, x)] - H(Q(x)) \end{align*}\]
  • Entropy is the average surprise
\[\begin{align*} H[Q(x)] = - \int_x Q(x) \ln Q(x) dx = \mathbb{E}_{Q(x)}[{\Im}_Q(x)] \end{align*}\]
  • In the absence of data or precise prior beliefs (which only influence the energy term), we should adopt maximally uncertain beliefs about the hidden state of the world, in accordance with Jaynes’s maximum entropy principle.
  • Be uncertain (high entropy) when we have no information.
  • Here, energy has a statistical mechanics interpretation.
    • The Boltzmann distribution \(P(E) = \frac{1}{Z} \cdot e^{-\frac{E}{kT}}\) describes the statistical behavior of a system with energy \(E\) at thermal equilibrium temperature \(T\).
    • \(Z\) is the partition function (a normalization constant), \(k\) is the Boltzmann constant.
    • Taking logs, \(\ln P(E) = -\frac{E}{kT} - \ln Z\): at thermal equilibrium, the log probability of a configuration decreases linearly with the energy \(E\) required to move the system into that configuration from a baseline configuration.
  • The name variational free energy for \(F[Q,y]\) comes from this statistical mechanics interpretation as energy minus entropy.
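
A quick numeric check of the energy-minus-entropy identity, reusing the same illustrative toy numbers as above:

```python
import numpy as np

# Same illustrative toy model as in the earlier sketch.
prior = np.array([0.8, 0.2])
likelihood = np.array([[0.9, 0.1],
                       [0.3, 0.7]])
y = 1
joint_y = likelihood[:, y] * prior             # P(y, x) for the observed y

q = np.array([0.6, 0.4])                       # an arbitrary approximate posterior Q(x)

energy   = -np.sum(q * np.log(joint_y))        # -E_Q[ln P(y, x)]
entropy  = -np.sum(q * np.log(q))              # H[Q(x)]
f_direct = -np.sum(q * np.log(joint_y / q))    # F[Q, y] computed from its definition

print(energy - entropy, "==", f_direct)        # the two agree (up to floating point)
```

When the energy term is flat in \(Q\) (no data, uninformative priors), minimizing \(F\) reduces to maximizing the entropy of \(Q\), which is the maximum-entropy behavior noted above.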

Can express the variational free energy as Complexity minus Accuracy (numeric check after the list below):

\[\begin{align*} F[Q,y] &= - \int_x Q(x) \ln \frac{P(y,x)}{Q(x)}dx \\ &= - \int_x Q(x) \ln \frac{P(y \vert x) P(x)}{Q(x)}dx \\ &= - \int_x Q(x) \ln \frac{P(x)}{Q(x)}dx - \int_x Q(x) \ln P(y \vert x)dx \\ &= D_{KL}[Q(x) \vert\vert P(x)] - \mathbb{E}_{Q(x)}[\ln P(y \vert x)] \end{align*}\]
  • Complexity means how much the approximate posterior \(Q(x)\) deviates from the prior \(P(x)\) - how many extra bits of information are encoded in \(Q(x)\) relative to \(P(x)\)
  • Accuracy \(\mathbb{E}_{Q(x)}[\ln P(y \vert x)]\) is maximized when the density \(Q\) places its mass on configurations of the latent variables that explain the observed data.
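
The complexity-minus-accuracy form can be checked on the same illustrative toy numbers:

```python
import numpy as np

prior = np.array([0.8, 0.2])
likelihood = np.array([[0.9, 0.1],
                       [0.3, 0.7]])
y = 1
q = np.array([0.6, 0.4])                                  # approximate posterior Q(x)

complexity = np.sum(q * np.log(q / prior))                # D_KL[Q(x) || P(x)]
accuracy   = np.sum(q * np.log(likelihood[:, y]))         # E_Q[ln P(y | x)]
f_direct   = -np.sum(q * np.log(likelihood[:, y] * prior / q))

print(complexity - accuracy, "==", f_direct)              # identical up to floating point
```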

Can express the variational free energy as Divergence minus Evidence:

\[\begin{align*} F[Q,y] &= - \int_x Q(x) \ln \frac{P(y,x)}{Q(x)}dx \\ &= - \int_x Q(x) \ln \frac{P(x \vert y) P(y)}{Q(x)}dx \\ &= - \int_x Q(x) \ln \frac{P(x \vert y)}{Q(x)}dx - \int_x Q(x) \ln P(y)dx \\ &= D_{KL}[Q(x) \vert\vert P(x \vert y)] - \ln P(y) \end{align*}\]
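
And the divergence-minus-evidence form, again on the same illustrative toy numbers. Because the KL term is non-negative, this form makes explicit that \(F\) upper-bounds the surprise and that minimizing \(F\) over \(Q\) drives \(Q\) toward the exact posterior:

```python
import numpy as np

prior = np.array([0.8, 0.2])
likelihood = np.array([[0.9, 0.1],
                       [0.3, 0.7]])
y = 1
joint_y = likelihood[:, y] * prior
posterior = joint_y / joint_y.sum()                        # exact posterior P(x | y)
log_evidence = np.log(joint_y.sum())                       # ln P(y)

q = np.array([0.6, 0.4])

divergence = np.sum(q * np.log(q / posterior))             # D_KL[Q(x) || P(x | y)]
f_direct   = -np.sum(q * np.log(joint_y / q))

print(divergence - log_evidence, "==", f_direct)
print(f_direct, ">=", -log_evidence)   # F >= surprise; equality iff Q is the exact posterior
```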

Interpretation:

  • This inference procedure is a combination of top-down processes that encode predictions of the observations (generated from prior beliefs via the likelihood model \(P(y \vert x)\)), and bottom-up processes that convey the actual sensory observations \(y\).
    • This interplay of top-down and bottom-up processes distinguishes the inferential view from alternative approaches that only consider bottom-up processes.
  • Bayesian inference is optimal with respect to a cost function, namely the variational free energy.
    • Variational free energy is closely related to surprise \(-\ln P(y)\)
  • Bayesian inference is different from Maximum Likelihood Estimation, which simply selects the hidden state \(x\) most likely to have generated the data \(y\), ignoring the prior.
  • The results of inference are subjective, because
    • Biological creatures have limited computational and energetic resources, which make Bayesian inference intractable. They make approximations:
      • variational posteriors, e.g. based on mean-field approximations (a toy sketch follows this list)
    • The generative model may not correspond to the real generative process.
      • Even as the generative model is optimized with newly acquired experience, it may not converge to the generative process.
      • The generative process is in a true state \(x^*\), which generates an observation \(y\) that the organism senses; the true state \(x^*\) itself remains hidden from the organism.
    • Psychological claims about the optimality of inference are always contingent on the organism’s resources: its specific generative model and its bounded computational resources.
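
To make the mean-field point above concrete, here is a minimal sketch of a factorized (mean-field) approximate posterior fitted by coordinate ascent on a hypothetical two-variable discrete model; the model structure and numbers are made up for illustration and are not the book’s example:

```python
import numpy as np

# Hypothetical toy generative model with two binary hidden states x1, x2 and one
# observed binary outcome y: p(x1, x2, y) = p(x1) p(x2) p(y | x1, x2).
p_x1 = np.array([0.7, 0.3])
p_x2 = np.array([0.5, 0.5])
p_y_given_x = np.array([[[0.9, 0.1],      # p(y | x1=0, x2=0)
                         [0.6, 0.4]],     # p(y | x1=0, x2=1)
                        [[0.4, 0.6],      # p(y | x1=1, x2=0)
                         [0.2, 0.8]]])    # p(y | x1=1, x2=1)
y = 1

# log p(x1, x2, y) for the observed y, indexed [x1, x2]
log_joint = (np.log(p_x1)[:, None] + np.log(p_x2)[None, :]
             + np.log(p_y_given_x[:, :, y]))

# Mean-field approximation Q(x1, x2) = Q1(x1) Q2(x2), fitted by coordinate ascent:
# each factor is updated to Q_i(x_i) proportional to exp( E_{Q_-i}[ log p(x, y) ] ).
q1 = np.array([0.5, 0.5])
q2 = np.array([0.5, 0.5])
for _ in range(50):
    q1 = np.exp(log_joint @ q2); q1 /= q1.sum()
    q2 = np.exp(q1 @ log_joint); q2 /= q2.sum()

# Free energy of the factorized posterior vs. the exact surprise -ln P(y)
q_joint = np.outer(q1, q2)
free_energy = -np.sum(q_joint * (log_joint - np.log(q_joint)))
surprise = -np.log(np.exp(log_joint).sum())
print(q1, q2, round(free_energy, 4), ">=", round(surprise, 4))
```

The factorized \(Q(x_1)Q(x_2)\) cannot capture any posterior correlation between \(x_1\) and \(x_2\), so its free energy generally stays strictly above the surprise \(-\ln P(y)\); this is one sense in which the organism’s inference is approximate rather than exact.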
