Entropy
References:
- C. Shannon: A Mathematical Theory of Communication (1948)
- E. T. Jaynes: Probability Theory: The Logic of Science (2003)
- K. Mallick: Thermodynamics and Information Theory (video)
- O. Rioul: This is IT: A Primer on Shannon’s Entropy and Information (video)
- R. Rubinstein, D. P. Kroese: The Cross-Entropy Method (2004)
- A. Khare: Cross-entropy method for Reinforcement Learning, Towards Data Science (2020)
- Wikipedia: the Entropy and Entropy (information theory) pages
Suppose we start with a discrete probability distribution \((p_1, p_2, ..., p_n)\), where \(0 \le p_i \le 1\) and \(\sum_i p_i = 1\). How should we define the entropy \(H(p_1, ..., p_n)\), i.e. the amount of information carried by this probability distribution?
We can reason by analogy with thermodynamics (see Mallick): for a microcanonical system, \(\Omega(E, V)\) counts the total number of accessible microscopic states, and the Boltzmann formula states that the entropy is
\[\begin{align*} S = k_B \ln \Omega \,\,\, \mathrm{with} \,\,\, k_B \approx 1.38 \times 10^{-23} \, \mathrm{J/K} \end{align*}\]If the \(p_i\) are all rational numbers, we can write them as \(p_i = \frac{a_i}{N}\), with \(a_i\) nonnegative integers and \(N = \sum_i a_i\) a positive integer, and we can define by analogy the combinatorial entropy of \(a_1, ..., a_n\) as:
\[\begin{align*} H_N(a_1, ..., a_n) = \frac{1}{N} \ln {N \choose a_1, ..., a_n} = \frac{1}{N} \ln \frac{N!}{a_1! \, ... \, a_n!} \end{align*}\]Here, the choice of \(N\) is not unique; we can replace \(N\) with any of its multiples \(kN\), and \(a_1, ..., a_n\) with \(ka_1, ..., ka_n\). We define
\[\begin{align*} H(p_1, ..., p_n) := \underset{k \rightarrow \infty}{\lim} H_{kN}(ka_1, ..., ka_n) \end{align*}\]which, by the Stirling approximation \(\ln t! = t \ln t - t + O(\ln t)\), evaluates to
\[\begin{align*} H(p_1, ..., p_n) = \underset{k \rightarrow \infty}{\lim} \frac{1}{kN} \left(kN \, \ln kN - \sum_{i=1}^n k a_i \ln k a_i \right) = - \sum_{i=1}^n \frac{a_i}{N} \, \ln \frac{a_i}{N} = - \sum_{i=1}^n p_i \ln p_i \end{align*}\]The linear terms \(-kN\) and \(+\sum_{i=1}^n k a_i\) coming from the Stirling approximation cancel, since \(\sum_{i=1}^n a_i = N\). To be continued…
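In the meantime, here is a small numerical sanity check of the limit above. This is only a sketch (in Python, which is not part of the references; the function names and the distribution \(p = (1/2, 1/3, 1/6)\) are illustrative choices): it evaluates \(H_{kN}(ka_1, ..., ka_n)\) exactly via `math.lgamma` and compares it with \(-\sum_i p_i \ln p_i\) for increasing \(k\).

```python
# Sanity check: the combinatorial entropy H_{kN}(k*a_1, ..., k*a_n)
# approaches -sum_i p_i ln p_i as k grows.
import math

def combinatorial_entropy(counts):
    """H_N(a_1, ..., a_n) = (1/N) ln( N! / (a_1! ... a_n!) ), via log-Gamma."""
    n_total = sum(counts)
    log_multinomial = math.lgamma(n_total + 1) - sum(math.lgamma(a + 1) for a in counts)
    return log_multinomial / n_total

def shannon_entropy(probs):
    """H(p_1, ..., p_n) = -sum_i p_i ln p_i, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Illustrative example: p = (1/2, 1/3, 1/6) written as a_i / N with a = (3, 2, 1), N = 6.
a = (3, 2, 1)
N = sum(a)
p = [a_i / N for a_i in a]

for k in (1, 10, 100, 1000, 10000):
    h_k = combinatorial_entropy([k * a_i for a_i in a])
    print(f"k = {k:6d}   H_kN = {h_k:.6f}")

print(f"limit        -sum p ln p = {shannon_entropy(p):.6f}")
```

Using `math.lgamma(t + 1)` for \(\ln t!\) avoids the enormous factorials that would appear for large \(kN\); the printed values should approach the \(-\sum_i p_i \ln p_i\) line from below, since the neglected \(O(\ln t)\) Stirling corrections vanish only as \(k \rightarrow \infty\).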