Suppose we start with a discrete probability distribution \((p_1, p_2, \ldots, p_n)\), where \(0 \le p_i \le 1\) and \(\sum_i p_i = 1\). How would we define the entropy, or amount of information, \(H(p_1, \ldots, p_n)\) carried by this probability distribution?

We can reason by analogy with thermodynamics (see Mallick), where the entropy of a microcanonical system is determined by the total number of microscopic states \(\Omega(E, V)\): the Boltzmann formula states that the entropy is

\[\begin{align*} S = k_B \ln \Omega \,\,\, \mathrm{with} \,\,\, k_B \approx 1.38 \times 10^{-23} \, \mathrm{J/K} \end{align*}\]

If the \(p_i\) are all rational numbers, we can write them as \(p_i = \frac{a_i}{N}\), with \(a_i\) nonnegative integers, \(N\) a positive integer, and \(\sum_i a_i = N\). By analogy, we define the combinatorial entropy of \(a_1, \ldots, a_n\) as:

\[\begin{align*} H_N(a_1, \ldots, a_n) = \frac{1}{N} \ln {N \choose a_1, \ldots, a_n} = \frac{1}{N} \ln \frac{N!}{a_1! \cdots a_n!} \end{align*}\]

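As a quick sanity check on this definition, here is a minimal numerical sketch (Python, not part of the original argument; the helper name `combinatorial_entropy` is just illustrative) that evaluates \(H_N\) through log-factorials, so the multinomial coefficient never overflows even for large \(N\):

```python
import math

def combinatorial_entropy(counts):
    """H_N(a_1, ..., a_n) = (1/N) * ln(N! / (a_1! * ... * a_n!)), with N = sum(counts)."""
    N = sum(counts)
    # math.lgamma(m + 1) equals ln(m!), so we stay in log space throughout.
    log_multinomial = math.lgamma(N + 1) - sum(math.lgamma(a + 1) for a in counts)
    return log_multinomial / N

# Example: p = (1/2, 1/4, 1/4) represented with N = 4 and counts (2, 1, 1).
print(combinatorial_entropy([2, 1, 1]))  # ln(12)/4 ≈ 0.62, still below -sum p_i ln p_i ≈ 1.04
```
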
Here, the choice of \(N\) is not unique; we can replace \(N\) with any of its multiples \(kN\), and \(a_1, \ldots, a_n\) with \(ka_1, \ldots, ka_n\). We define

\[\begin{align*} H(p_1, \ldots, p_n) := \underset{k \rightarrow \infty}{\lim} H_{kN}(ka_1, \ldots, ka_n) \end{align*}\]

which, by the Stirling approximation \(\ln t! = t \ln t - t + O(\ln t)\), evaluates to

\[\begin{align*} H(p_1, \ldots, p_n) = \underset{k \rightarrow \infty}{\lim} \frac{1}{kN} \Big( kN \, \ln(kN) - \sum_{i=1}^n k a_i \ln(k a_i) \Big) = - \sum_{i=1}^n \frac{a_i}{N} \, \ln \frac{a_i}{N} = - \sum_{i=1}^n p_i \ln p_i \end{align*}\]

(The linear Stirling terms \(-kN\) and \(+\sum_i k a_i\) cancel because \(\sum_i a_i = N\), and the \(O(\ln t)\) corrections vanish after dividing by \(kN\); the resulting limit depends only on the \(p_i\), not on the choice of \(N\).)

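To see the limit numerically, the following sketch (again Python, purely illustrative; the helper is redefined so the snippet is self-contained) compares \(H_{kN}(ka_1, \ldots, ka_n)\) with \(-\sum_i p_i \ln p_i\) for increasing \(k\):

```python
import math

def combinatorial_entropy(counts):
    """(1/N) * ln of the multinomial coefficient, computed via log-gamma."""
    N = sum(counts)
    return (math.lgamma(N + 1) - sum(math.lgamma(a + 1) for a in counts)) / N

def shannon_entropy(probs):
    """-sum_i p_i ln p_i, skipping zero probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

a, N = [2, 1, 1], 4                      # p = (1/2, 1/4, 1/4)
p = [ai / N for ai in a]
for k in (1, 10, 100, 10_000):
    print(k, combinatorial_entropy([k * ai for ai in a]))
print("limit:", shannon_entropy(p))      # -sum p_i ln p_i ≈ 1.0397
```

The gap shrinks roughly like \(\ln(kN)/(kN)\), which is consistent with the \(O(\ln t)\) error term in the Stirling approximation.
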
To be continued…