Active Inference - Free Energy Principle (Andrei's study notes)
Notes from reading K. Friston et al.'s Active Inference: The Free Energy Principle in Mind, Brain, and Behavior (2022)
- Karl Friston
- Sean Carroll: Mindscape 87: Karl Friston on Brains, Predictions, and Free Energy (2020)
- Chap 2
- Bayes’ Rule \(P(x \vert y) = \frac{P(y \vert x) \cdot P(x)}{P(y)}\)
- Likelihood model \(P(y \vert x)\), prior belief \(P(x)\), posterior belief \(P(x \vert y)\)
- The posterior is proportional (\(\propto\)) to the prior \(\times\) the likelihood
- But the posterior needs to be normalized, and that normalization is computationally intractable in general.
- Computing the normalizer \(P(y) = \int_x P(y \vert x) P(x)dx\) is intractable.
- Given the likelihood model \(P(y \vert x)\) and the prior belief \(P(x)\)
- Compute joint probability \(P(x, y)\) and marginal likelihood \(P(y)\)
- If event \(y\) is actually observed, compute posterior belief \(P(x \vert y)\)
- Surprise is \(\Im(y) := -\ln P(y)\)
- Bayesian surprise is \(D_{KL}[P(x \vert y) \vert\vert P(x)]\). This scores the amount of belief updating, as opposed to simply how unlikely the observation was.
- For example, if \(P(x)=1\), then \(P(x \vert y) = 1\), and the Bayesian surprise is \(0\), but the surprise \(\Im(y)\) is not \(0\).
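A minimal numerical sketch of the above for a discrete hidden state, where the normalizer \(P(y)\) is just a small sum rather than an intractable integral (the two-state setup and the numbers are mine, purely for illustration):

```python
import numpy as np

# Toy two-state example (my own numbers, not from the book).
prior = np.array([0.9, 0.1])              # P(x)
likelihood = np.array([[0.2, 0.8],        # P(y | x): rows index y, columns index x
                       [0.8, 0.2]])

y = 0                                     # the observation actually made
joint = likelihood[y] * prior             # P(y, x) = P(y | x) P(x)
evidence = joint.sum()                    # P(y) = sum_x P(y | x) P(x)  (the normalizer)
posterior = joint / evidence              # P(x | y), Bayes' rule

surprise = -np.log(evidence)                                        # -ln P(y)
bayesian_surprise = np.sum(posterior * np.log(posterior / prior))   # D_KL[P(x|y) || P(x)]
print(posterior, surprise, bayesian_surprise)
```

With a degenerate prior like \((1, 0)\) the posterior equals the prior, so the Bayesian surprise is \(0\) while the surprise \(-\ln P(y)\) can still be large, matching the example above.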
- \(-\ln(\cdot)\) is convex, so we can use Jensen's inequality: for any distribution \(Q(x)\) over the hidden state, we have
\[\Im(y) = -\ln P(y) = -\ln \int_x Q(x) \frac{P(y,x)}{Q(x)}dx \le -\int_x Q(x) \ln \frac{P(y,x)}{Q(x)}dx =: F[Q,y]\]
We name \(F[Q,y]\) the Variational Free Energy.
- \(Q\) is called an approximate posterior
- Equality holds when the approximate distribution \(Q(x)\) matches the true posterior \(P(x \vert y) = \frac{P(y, x)}{P(y)}\), with \(y\) held fixed.
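A quick numerical check of this bound, reusing the same kind of toy two-state model (numbers are mine): \(F[Q,y]\) never falls below the surprise, and touches it exactly when \(Q\) equals the true posterior.

```python
import numpy as np

# Check F[Q, y] >= -ln P(y), with equality at Q = P(x | y)  (toy numbers, mine).
prior = np.array([0.9, 0.1])
likelihood = np.array([[0.2, 0.8],
                       [0.8, 0.2]])
y = 0
joint = likelihood[y] * prior             # P(y, x)
evidence = joint.sum()                    # P(y)
posterior = joint / evidence              # P(x | y)

def free_energy(Q):
    """F[Q, y] = -sum_x Q(x) ln( P(y, x) / Q(x) )."""
    return -np.sum(Q * np.log(joint / Q))

print(-np.log(evidence))                  # the surprise -ln P(y)
print(free_energy(np.array([0.5, 0.5])))  # an arbitrary Q: strictly larger
print(free_energy(posterior))             # equals the surprise: the bound is tight
```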
Can express the variational free energy as Energy minus Entropy:
\[\begin{align*} F[Q,y] &= - \int_x Q(x) \ln \frac{P(y,x)}{Q(x)}dx \\ &= - \int_x Q(x) \ln P(y, x)dx + \int_x Q(x) \ln Q(x)dx \\ &= - \mathbb{E}_{Q(x)}[\ln P(y, x)] - H(Q(x)) \end{align*}\]
- Entropy is the average surprise
- In the absence of data or precise prior beliefs (which only influence the energy term), we should adopt maximally uncertain beliefs about the hidden state of the world, in accordance with Jaynes's maximum entropy principle.
- Be uncertain (high entropy) when we have no information.
- Here, energy has a statistical mechanics interpretation.
- The Boltzmann distribution \(P(E) = \frac{1}{Z} \cdot e^{-\frac{E}{kT}}\) describes the statistical behavior of a system with energy \(E\) at thermal equilibrium temperature \(T\).
- \(Z\) is the partition function (a normalization constant), \(k\) is the Boltzmann constant.
- The average log probability \(\ln P(E)\) of a system at thermal equilibrium is inversely proportional to the energy \(E\) required to move the system into this configuration from a baseline configuration. [Andrei: "inversely" meant as negatively proportional, since \(\ln P(E) = -\frac{E}{kT} - \ln Z\).]
- The name variational free energy for \(F[Q,y]\) comes from this statistical mechanics interpretation as energy minus entropy.
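A sanity check (same toy setup, numbers mine) that the Energy minus Entropy form agrees with the direct definition of \(F[Q,y]\):

```python
import numpy as np

# F = -E_Q[ln P(y, x)] - H[Q]  vs. the direct definition (toy numbers, mine).
prior = np.array([0.9, 0.1])
likelihood = np.array([[0.2, 0.8],
                       [0.8, 0.2]])
y = 0
joint = likelihood[y] * prior             # P(y, x)
Q = np.array([0.7, 0.3])                  # some approximate posterior

direct = -np.sum(Q * np.log(joint / Q))   # -E_Q[ln (P(y,x)/Q(x))]
energy = -np.sum(Q * np.log(joint))       # -E_Q[ln P(y, x)]
entropy = -np.sum(Q * np.log(Q))          # H[Q]
assert np.isclose(direct, energy - entropy)
```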
Can express the variational free energy as Complexity minus Accuracy:
\[\begin{align*} F[Q,y] &= - \int_x Q(x) \ln \frac{P(y,x)}{Q(x)}dx \\ &= - \int_x Q(x) \ln \frac{P(y \vert x) P(x)}{Q(x)}dx \\ &= - \int_x Q(x) \ln \frac{P(x)}{Q(x)}dx - \int_x Q(x) \ln P(y \vert x)dx \\ &= D_{KL}[Q(x) \vert\vert P(x)] - \mathbb{E}_{Q(x)}[\ln P(y \vert x)] \end{align*}\]
- Complexity \(D_{KL}[Q(x) \vert\vert P(x)]\) measures how much the approximate posterior \(Q(x)\) deviates from the prior \(P(x)\) - how many extra bits of information are encoded in \(Q(x)\) relative to \(P(x)\)
- Accuracy \(\mathbb{E}_{Q(x)}[\ln P(y \vert x)]\) is maximized when the density \(Q\) places its mass on configurations of the latent variables that explain the observed data.
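The same kind of check for the Complexity minus Accuracy form (same toy setup, numbers mine):

```python
import numpy as np

# F = D_KL[Q || P(x)] - E_Q[ln P(y | x)]  vs. the direct definition (toy numbers, mine).
prior = np.array([0.9, 0.1])
likelihood = np.array([[0.2, 0.8],
                       [0.8, 0.2]])
y = 0
Q = np.array([0.7, 0.3])

direct = -np.sum(Q * np.log(likelihood[y] * prior / Q))
complexity = np.sum(Q * np.log(Q / prior))        # D_KL[Q || P(x)]
accuracy = np.sum(Q * np.log(likelihood[y]))      # E_Q[ln P(y | x)]
assert np.isclose(direct, complexity - accuracy)
```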
Can express the variational free energy as Divergence minus Evidence:
\[\begin{align*} F[Q,y] &= - \int_x Q(x) \ln \frac{P(y,x)}{Q(x)}dx \\ &= - \int_x Q(x) \ln \frac{P(x \vert y) P(y)}{Q(x)}dx \\ &= - \int_x Q(x) \ln \frac{P(x \vert y)}{Q(x)}dx - \int_x Q(x) \ln P(y)dx \\ &= D_{KL}[Q(x) \vert\vert P(x \vert y)] - \ln P(y) \end{align*}\]
- Classically, \(-F[Q,y]\) is called the Evidence Lower Bound (ELBO). The ELBO is always \(\le \ln P(y)\), with equality when \(Q(x)=P(x \vert y)\) (see wiki).
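And the Divergence minus Evidence form, which also shows the ELBO never exceeding the log evidence (same toy setup, numbers mine):

```python
import numpy as np

# F = D_KL[Q || P(x | y)] - ln P(y), so -F (the ELBO) <= ln P(y)  (toy numbers, mine).
prior = np.array([0.9, 0.1])
likelihood = np.array([[0.2, 0.8],
                       [0.8, 0.2]])
y = 0
joint = likelihood[y] * prior
evidence = joint.sum()
posterior = joint / evidence
Q = np.array([0.7, 0.3])

direct = -np.sum(Q * np.log(joint / Q))
divergence = np.sum(Q * np.log(Q / posterior))    # D_KL[Q || P(x | y)] >= 0
assert np.isclose(direct, divergence - np.log(evidence))
print(-direct <= np.log(evidence))                # ELBO <= ln P(y): True
```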
- The formula for the Expected Free Energy \(G(\pi)\) at p. 55 is incorrect, but a corrected version is given in Smith et al.: A Step-by-Step Tutorial on Active Inference and its Application to Empirical Data (2021), p. 55.
Interpretation:
- This inference procedure is a combination of top-down processes that encode predictions \(P(y)\), and bottom-up processes that encode sensory observations \(y\).
- This interplay of top-down and bottom-up processes distinguishes the inferential view from alternative approaches that only consider bottom-up processes.
- Bayesian inference is optimal with respect to a cost function, namely the variational free energy.
- Variational free energy is closely related to surprise \(-\ln P(y)\)
- Bayesian inference is different from Maximum Likelihood Estimation, which simply selects the hidden state \(x\) most likely to have generated the data \(y\).
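A tiny sketch of this difference (same toy two-state model, numbers mine): the likelihood alone favors one state, but a strong prior can flip the posterior.

```python
import numpy as np

# MLE picks argmax_x P(y | x); Bayesian inference weighs it by the prior (toy numbers, mine).
prior = np.array([0.9, 0.1])              # P(x): state 0 is a priori much more likely
likelihood = np.array([[0.2, 0.8],        # P(y | x)
                       [0.8, 0.2]])
y = 0

mle_state = np.argmax(likelihood[y])      # argmax_x P(y | x)   -> state 1
posterior = likelihood[y] * prior
posterior /= posterior.sum()              # P(x | y)
map_state = np.argmax(posterior)          # argmax_x P(x | y)   -> state 0 (the prior wins)
print(mle_state, map_state, posterior)
```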
- The results of inference are subjective, because:
- Biological creatures have limited computational and energetic resources, which make Bayesian inference intractable. They make approximations:
- A variational posterior, based on mean-field approximations
- The generative model may not correspond to the real generative process.
- The generative model, as it is optimized with new experiences acquired, may not even converge to the generative process.
- The generative process is in a true state \(x^*\), which generates an observation \(y\) that the organism senses. The true state \(x^*\) remains hidden from the organism; only \(y\) is accessible to it.
- Any psychological claim about the optimality of inference is always contingent on the organism's resources: its specific generative model and its bounded computational resources.
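The mean-field approximation mentioned above restricts \(Q\) to a factorized family,
\[Q(x) = \prod_i Q_i(x_i),\]
so one large joint optimization is replaced by several smaller, coupled ones (my one-line summary, not a quote from the book).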
References:
- ChatGPT about Bayesian Statistics
- Wikipedia: Gamma Function, Beta Function
- Emil Artin: The Gamma Function
- MIT RES.6-012 Intro to Probabilities L04.9: Multinomial Probabilities (2018)
- Jordan Boyd-Graber: INST414: Advanced Data Science at UMD’s School:
- Expectations and Entropy
- Multinomial and Poisson Distributions
- Continuous Distributions: Beta and Dirichlet Distributions. The inverse Beta function should be \(\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\).
- Harvard Stat 110 (2013), Joe Blitzstein, book
More references:
- S. Alexander: God Help Us, Let’s Try To Understand Friston On Free Energy (2018)
- Maxwell Ramstead: A tutorial on active inference (2020)
- Casper Hesp et al: Deeply Felt Affect: The Emergence of Valence in Deep Active Inference (2021)
- Friston et al: Variational ecology and the physics of sentient systems (2018)
- Friston et al: Knowing one’s place: a free-energy approach to pattern regulation (2015)
- ActInfLab ModelStream #001: A Step-by-Step Tutorial on Active Inference (2021)
- Ryan Smith et al: A Step-by-Step Tutorial on Active Inference and its Application to Empirical Data (2021)
- DaCosta et al: Active inference on discrete state-spaces: a synthesis (2020)
- Comparisons to RL:
- Sajid et al: Active inference: demystified and compared (2019)
- Sajid et al: Reward Maximisation through Discrete Active Inference (2020)
- Medium: O. Solopchuk:
- R. Bogacz: A tutorial on the free-energy framework for modelling perception and learning (2017)
- J.C.R. Whittington, R. Bogacz: An Approximation of the Error Backpropagation Algorithm in a Predictive Coding Network with Local Hebbian Synaptic Plasticity (2017)
- B. Lotter et al: PredNet (2016)
- S. Dora, C. Pennartz: A Deep Predictive Coding Network for Learning Latent Representations (2018)
- Wikipedia: Variational Bayesian methods.
- See Chap.4 in MacKay’s Information Theory, Inference, and Learning Algorithms
- MacKay: Course on Information Theory, Pattern Recognition, and Neural Networks
- UZH & ETH Zurich
- Computational Psychiatry Course 2019 (summer school)
- Active Inference lecture
- S. Levine: Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review (2018)
- T. Parr: Neuronal message passing using Mean-field, Bethe, and Marginal approximations (2019)
- K Friston: CCN Workshop: Predictive Coding (2016)
- T. Parr, K. Friston: Generalised free energy and active inference (2019)
- Lex Fridman #99: Neuroscience and the Free Energy Principle (2021)
- Machine Learning Street Talk: #033 Karl Friston - The Free Energy Principle (2020)
- Mathematical Consciousness Sciences: Markov blankets and Bayesian mechanics (Karl Friston) (2020)