Entropy
References:
- C. Shannon: A Mathematical Theory of Communication (1948)
- E. T. Jaynes: Probability Theory: The Logic of Science (2003)
- K. Mallick: Thermodynamics and Information Theory (video)
- O. Rioul: This is IT: A Primer on Shannon’s Entropy and Information (video)
- R. Rubinstein, D. P. Kroese: The Cross-Entropy Method (2004)
- A. Khare: Cross-entropy method for Reinforcement Learning, Towards Data Science (2020)
- Wikipedia: the Entropy and Entropy (information theory) pages
Suppose we start with a discrete probability distribution \((p_1, p_2, ..., p_n)\), where \(0 \le p_i \le 1\) and \(\sum_i p_i = 1\). How should we define the entropy \(H(p_1, ..., p_n)\), i.e. the amount of information carried by this probability distribution?
We can reason by analogy with thermodynamics (see Mallick): for a microcanonical system, \(\Omega(E, V)\) counts the total number of accessible microscopic states, and the Boltzmann formula states that the entropy is
\[\begin{align*} S = k_B \ln \Omega \,\,\, \mathrm{with} \,\,\, k_B \approx 1.38 \times 10^{-23} \, \mathrm{J/K} \end{align*}\]If the \(p_i\) are all rational numbers, we can write them as \(p_i = \frac{a_i}{N}\), with \(a_i\) nonnegative integers and \(N = \sum_i a_i\) a positive integer, and we can define by analogy the combinatorial entropy of \(a_1, ..., a_n\) as:
\[\begin{align*} H_N(a_1, ..., a_n) = \frac{1}{N} \ln {N \choose a_1, ..., a_n} = \frac{1}{N} \ln \frac{N!}{a_1! \, ... \, a_n!} \end{align*}\]Here, the choice of \(N\) is not unique; we can replace \(N\) with any of its multiples \(kN\), and \(a_1, ..., a_n\) with \(ka_1, ..., ka_n\). We define
\[\begin{align*} H(p_1, ..., p_n) := \underset{k \rightarrow \infty}{\lim} H_{kN}(ka_1, ..., ka_n) \end{align*}\]which, by the Stirling approximation \(\ln t! = t \ln t - t + O(\ln t)\), evaluates to
\[\begin{align*} H(p_1, ..., p_n) = \underset{k \rightarrow \infty}{\lim} \frac{1}{kN} \left(kN \, \ln kN - \sum_{i=1}^n k a_i \ln k a_i \right) = - \sum_{i=1}^n \frac{a_i}{N} \, \ln \frac{a_i}{N} = - \sum_{i=1}^n p_i \ln p_i \end{align*}\]The linear terms \(-kN\) and \(+\sum_{i=1}^n k a_i\) coming from the Stirling approximation cancel, since \(\sum_{i=1}^n a_i = N\). To be continued…
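In the meantime, here is a small numerical sanity check of the limit above. This is only a sketch (in Python, which is not part of the references; the function names and the distribution \(p = (1/2, 1/3, 1/6)\) are illustrative choices): it evaluates \(H_{kN}(ka_1, ..., ka_n)\) exactly via `math.lgamma` and compares it with \(-\sum_i p_i \ln p_i\) for increasing \(k\).

```python
# Sanity check: the combinatorial entropy H_{kN}(k*a_1, ..., k*a_n)
# approaches -sum_i p_i ln p_i as k grows.
import math

def combinatorial_entropy(counts):
    """H_N(a_1, ..., a_n) = (1/N) ln( N! / (a_1! ... a_n!) ), via log-Gamma."""
    n_total = sum(counts)
    log_multinomial = math.lgamma(n_total + 1) - sum(math.lgamma(a + 1) for a in counts)
    return log_multinomial / n_total

def shannon_entropy(probs):
    """H(p_1, ..., p_n) = -sum_i p_i ln p_i, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Illustrative example: p = (1/2, 1/3, 1/6) written as a_i / N with a = (3, 2, 1), N = 6.
a = (3, 2, 1)
N = sum(a)
p = [a_i / N for a_i in a]

for k in (1, 10, 100, 1000, 10000):
    h_k = combinatorial_entropy([k * a_i for a_i in a])
    print(f"k = {k:6d}   H_kN = {h_k:.6f}")

print(f"limit        -sum p ln p = {shannon_entropy(p):.6f}")
```

Using `math.lgamma(t + 1)` for \(\ln t!\) avoids the enormous factorials that would appear for large \(kN\); the printed values should approach the \(-\sum_i p_i \ln p_i\) line from below, since the neglected \(O(\ln t)\) Stirling corrections vanish only as \(k \rightarrow \infty\).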