Information theory
Review
Basics of information theory: Shannon and Kolmogorov
Published

July 15, 2025

In this note we informally discuss the foundations of information theory, from Shannon entropy to Kolmogorov complexity. These concepts allow us to reckon with the inherent complexity of observed data in a general sense, and they underpin much of my research. The presentation is adapted from my PhD thesis, alongside reviews of statistical inference and physics.

Encoding

Central to information theory is a thought experiment that aims to encode a message as efficiently as possible. Suppose we would like to transmit to another party the results of 10 coin flips as the message "HHTTHTTHTT." Suppose further that we are restricted in this communication to a binary channel, which can only carry a sequence of 0’s and 1’s of our choosing. We would then like the receiver on the other end of the channel to be able to decode our binary transmission back into our original message.

For example, we can encode these coin flips as the binary string "1100100100" where the digit "1" represents heads and "0" represents tails. This correspondence between meanings and binary strings is known as a codebook. So long as we and the receiver agree on the nature of the encoding, the receiver can use the codebook to decode the binary string back into the original message of outcomes. To determine the efficiency of our transmission we measure the length of our binary string in bits. In this case our encoding used 10 digits (bits) to transmit the message, one for each flip. A schematic of this encoding framework is given in Figure 1.

Figure 1: Encoding coin flips as a binary string
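To make the codebook concrete, here is a minimal sketch of this encoding in Python (the symbols and codewords follow the coin-flip example above):

```python
# Codebook for the coin-flip example: one bit per outcome.
codebook = {"H": "1", "T": "0"}

def encode(message, codebook):
    """Concatenate the codeword for each symbol of the message."""
    return "".join(codebook[symbol] for symbol in message)

bits = encode("HHTTHTTHTT", codebook)
print(bits, len(bits), "bits")   # 1100100100 10 bits
```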

This association of digits to outcomes is quite a natural encoding of two-sided coin flips. In a more general setting, however, we will need to be more creative in our transmission. Suppose we would instead like to send the message "ELEVENELVES" as a binary string. Since this string contains 5 distinct characters, we can no longer encode the message by assigning each character its own binary digit. Some characters must be represented by a codeword of multiple binary digits. If we are not careful, however, this can render our message ambiguous. If we assign "E" to the codeword "0" and "L" to the codeword "00," the binary message "00" could decode into either "EE" or "L."

To avoid this polysemy, we can ensure that our transmission is uniquely decodable by using a prefix-free (or "instantaneous") code. If no codeword in our codebook is a prefix of another codeword, as the receiver reads the message from left to right the divisions between the codewords will always be clear. Our earlier example was not prefix-free as the codeword "0" is a prefix of the codeword "00," leading to the double meaning.

Any such prefix-free binary code can be usefully represented as a binary tree whose leaves each correspond to a character (or "symbol") being transmitted. The codeword associated to each symbol is then represented by the path from the root of the tree down to that symbol. Figure 2a shows an example of such a tree used to encode the five characters. Following the tree paths, this "balanced" encoding represents the characters {"E", "L", "V", "N", "S"} with the codewords {"00", "01", "10", "110", "111"} respectively. With this codebook we can then encode the phrase "ELEVENELVES" into the pictured binary string of length 24 bits and ensure that it uniquely decodes back into our desired dispatch.

Figure 2: Examples of an (a) balanced code and (b) optimized Huffman code to convert the phrase "ELEVENELVES" into a binary string. The string associated to each letter is denoted by the path from the top of the tree down to the appropriate node. The Huffman code is able to produce a shorter overall message than the balanced code by representing the common letter "E" with a short string "0" despite representing the uncommon letters "N" and "S" with longer strings.
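To see unique decodability in action, here is a sketch of a left-to-right decoder for the balanced codebook of Figure 2a (the implementation is illustrative, not taken from the thesis):

```python
# Balanced prefix-free codebook from Figure 2a.
codebook = {"E": "00", "L": "01", "V": "10", "N": "110", "S": "111"}
inverse = {code: symbol for symbol, code in codebook.items()}

def decode(bits):
    """Read left to right; since no codeword prefixes another, boundaries are unambiguous."""
    message, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:       # a complete codeword has been read
            message.append(inverse[buffer])
            buffer = ""
    return "".join(message)

bits = "".join(codebook[c] for c in "ELEVENELVES")
print(len(bits))                    # 24 bits
assert decode(bits) == "ELEVENELVES"
```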

Encoding efficiently

Now, a key goal of information theory is not only to successfully transmit a message but also to do so using as few bits as possible. This objective can be seen as a formalization of Occam’s razor, the scientific principle that favors the simplest possible answer to a question. In this analogy, transmitting our binary string effectively "explains" to the receiver the data we have observed, making a shorter transmission a more succinct explanation. If we understand predictable patterns in our observations, we can exploit them to construct a more efficient encoding. There is a fundamental duality between compression and modeling: in this context, to compress is to understand.

In this spirit we can look for patterns in our message to try and come up with a clever way to shorten our encoding. The phrase "ELEVENELVES" has a rather lopsided distribution of characters at 5 E’s, 2 L’s, 2 V’s, 1 N, and 1 S. Some of our codewords are therefore being used in the transmission much more frequently than others. If symbol \(r = 1,...,q\) of the \(q\) symbols appears \(n_r\) times in our message as a codeword of length \(\ell_r\), the total message length is

\[\begin{aligned} \sum_{r=1}^q n_r \ell_r. \end{aligned} \tag{1}\]

To shorten the overall message we would therefore like the codewords to be as short as possible and to prioritize shortening the codewords of frequently occurring symbols. In our original encoding the symbol "E" is transmitted with the codeword "00" at a cost of two bits apiece. Since "E" appears so frequently in the message it may be wise to instead represent it with a shorter codeword like "0" and save 5 bits in our total transmission.

Yet this choice has a cost. If we assign "0" to represent "E," none of the other four characters can be represented with a codeword that begins with a "0," or else the code would no longer be prefix-free. When shortening one codeword we must necessarily lengthen other codewords. This tradeoff is the content of Kraft’s inequality, which states that the codeword lengths of any prefix-free code must satisfy

\[\begin{aligned} \sum_{r=1}^q 2^{-\ell_r} \leq 1, \end{aligned} \tag{2}\]

which follows from considering the fraction of the binary tree that each codeword occupies. Given the frequencies \(n_r\) at which each symbol appears, we would like to select codeword lengths that minimize the total message length Eq. 1 subject to the prefix-free constraint Eq. 2.

Huffman codes strike this balance and provably minimize the message length by assigning short codewords to frequent symbols while saturating Kraft’s inequality. Figure 2b contains an example of a Huffman code for our application. The code shortens the codeword for "E" from 2 bits to 1 while lengthening the codewords for "N" and "S" from 3 to 4 bits (and one of the two-bit codewords to 3 bits, to keep the code prefix-free). Since "E" appears much more frequently than "N" and "S," this change shortens the overall message from 24 to 23 bits. By adapting the encoding to the nature of the data, the Huffman code achieves a more parsimonious representation of the message.

More generally, given a frequency distribution \(\{n_r\}\) of symbols one can always construct a Huffman code with a simple recursive algorithm. The resulting optimal codeword lengths \(\{\ell_r\}\) follow a predictable pattern. If a symbol appears at a fraction \(p_r = \frac{n_r}{n}\) among the \(n\) total symbols, the length of its associated codeword satisfies

\[\begin{aligned} -\log_2(p_r) \leq \ell_r < -\log_2(p_r) + 1.\end{aligned}\]

Frequent symbols with high probability \(p_r\) are thus assigned short codewords, as \(-\log_2(p_r)\) is small, while infrequent symbols use longer codewords. In our "ELEVENELVES" example, the character "E" appears with probability \(p = 5/11\) and is accordingly assigned a codeword of length \(1 \approx \log_2(11/5)\), while the character "S" appears at the ratio \(1/11\) and is encoded using \(4 \approx \log_2(11)\) bits.
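For concreteness, here is a sketch of the standard priority-queue construction in Python (the particular tree it produces may differ from the one drawn in Figure 2b, but the total message length is the same optimal 23 bits):

```python
import heapq
from collections import Counter

def huffman_code(counts):
    """Repeatedly merge the two least frequent subtrees, prepending a bit to each side."""
    heap = [(count, i, {symbol: ""}) for i, (symbol, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        count1, _, codes1 = heapq.heappop(heap)
        count2, _, codes2 = heapq.heappop(heap)
        # Codewords in one subtree gain a leading "0", in the other a leading "1".
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (count1 + count2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

counts = Counter("ELEVENELVES")                        # E: 5, L: 2, V: 2, N: 1, S: 1
code = huffman_code(counts)
print(code)
print(sum(counts[s] * len(code[s]) for s in counts))   # 23 bits in total
```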

Shannon entropy

In most contexts we consider, symbol probabilities are small and codeword lengths are long. In this regime we can approximate lengths as \(\ell_r = -\log_2(p_r)\), which would in fact be the optimal choices if the lengths could be non-integral numbers of bits. Using these optimal codeword lengths, the minimum message length per symbol is

\[\begin{aligned} S[\{p_r\}] = \frac{1}{n} \sum_{r=1}^q n_r \ell_r = -\sum_{r=1}^q p_r \log_2 p_r, \end{aligned} \tag{3}\]

known as the Shannon entropy (or simply "entropy") of the distribution \(\{p_r\}\). For continuous distributions \(P(\boldsymbol{x})\) this entropy generalizes to

\[\begin{aligned} S[P] = -\int P(\boldsymbol{x}) \log_2 P(\boldsymbol{x}) d\boldsymbol{x}. \end{aligned} \tag{4}\]

By providing an information theoretic lower bound on transmission, the Shannon entropy captures the inherent information content of a probability distribution that no amount of clever encoding tricks can overcome.
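As a quick numerical check of this lower bound for the letter distribution of "ELEVENELVES" (a sketch mirroring Eq. 3):

```python
import math
from collections import Counter

def shannon_entropy(probabilities):
    """Entropy in bits: the minimum average codeword length per symbol (Eq. 3)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

counts = Counter("ELEVENELVES")
n = sum(counts.values())
print(shannon_entropy([c / n for c in counts.values()]))   # ~2.04 bits per symbol
print(23 / 11)                                             # ~2.09 bits per symbol for the Huffman code
```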

We can make contact between this information-theoretic entropy and the physical entropy described in this note. There the microcanonical ensemble is the uniform distribution \(P(\boldsymbol{s}) = \frac{1}{\Omega}\) over all possible configurations that conserve the total energy. The Shannon entropy Eq. 3 then agrees with the typical microcanonical entropy \(S = \log \Omega\) (up to the base of the logarithm). From this perspective, a macrostate with high physical entropy is one where a large amount of information is required to specify which of the many possible microstates it represents.

We also note that the uniform distribution has the highest entropy among all possible distributions \(\{p_r\}\) on \(q\) objects. By the concavity of the logarithm (Jensen’s inequality),

\[\begin{aligned} S[\{p_r\}] = -\sum_{r=1}^q p_r \log p_r \leq -\sum_{r=1}^q \frac{1}{q} \log \left(\frac{1}{q}\right) = \log q.\end{aligned}\]

This observation gives another motivation for the microcanonical ensemble: it is the maximum-entropy distribution over the possible configurations. Since the entropy measures how structured, and hence how compressible, a probability distribution is, this maximum-entropy distribution is structureless and maximally agnostic: reasonable properties of an equilibrium distribution in which any initial structure has been thermalized away.

The canonical ensemble \(P(\boldsymbol{s}) \propto e^{-\beta H(\boldsymbol{s})}\) can likewise be motivated as the maximum-entropy distribution with a given average energy, which determines the choice of \(\beta\). When designing priors for Bayesian inference, we will frequently appeal to this minimally assumptive principle and choose maximum-entropy priors subject to certain constraints we expect the system to provide. For example, a Gaussian distribution can be motivated as the maximum-entropy distribution of a real random variable with a fixed mean and variance. A nice list of common statistical distributions and their maximum-entropy motivations can be found on Wikipedia.
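As a sketch of how such maximum-entropy distributions arise (a standard variational argument, using the natural logarithm for convenience), extremizing the entropy of \(P(x)\) subject to normalization and fixed first and second moments gives

\[\begin{aligned} 0 = \frac{\delta}{\delta P(x)}\left[-\int P \ln P \, dx + \lambda_0 \int P \, dx + \lambda_1 \int x P \, dx + \lambda_2 \int x^2 P \, dx\right] = -\ln P(x) - 1 + \lambda_0 + \lambda_1 x + \lambda_2 x^2,\end{aligned}\]

so \(P(x) \propto e^{\lambda_1 x + \lambda_2 x^2}\), and fixing the multipliers to satisfy the constraints recovers the Gaussian \(P(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/2\sigma^2}\). The same argument with an average-energy constraint produces the canonical ensemble's Boltzmann factor.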

Misspecification and KL-divergence

Returning to encodings: to obtain the entropy we considered the total message length of a Huffman code optimized to send that particular message. If we believe that the symbols will be distributed with probabilities \(\{q_r\}\), we should optimize our Huffman code accordingly to have code lengths \(\ell_r = -\log q_r\). However, in many contexts we do not know a priori what distribution of symbols to expect. We may assume a distribution \(\{q_r\}\) that is not borne out in practice. If the symbols we must transmit have true probabilities \(\{p_r\}\), the average code length becomes

\[\begin{aligned} \sum_r p_r (-\log q_r).\end{aligned}\]

Had we used code lengths \(-\log p_r\) attuned to the true distribution, this would give the Shannon lower bound. Since our encoding is misspecified, it will instead require a larger number of bits to transmit. The shortfall between the two, the extra cost we incur, is known as the Kullback-Leibler (KL) divergence between the true distribution \(\{p_r\}\) and our assumption \(\{q_r\}\):

\[\begin{aligned} D_{\text{KL}}(\{p_r\}||\{q_r\}) &= \left(\sum_{r=1}^q p_r (-\log q_r) \right) - \left(\sum_{r=1}^q p_r (-\log p_r) \right) \nonumber \\ &= \sum_{r=1}^q p_r \log \frac{p_r}{q_r} \geq 0.\end{aligned}\]
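A small numerical sketch of this excess cost (the distributions here are invented for illustration):

```python
import math

def avg_code_length(p, q):
    """Average bits per symbol when codeword lengths are tuned to q but symbols follow p."""
    return -sum(pr * math.log2(qr) for pr, qr in zip(p, q))

def kl_divergence(p, q):
    return sum(pr * math.log2(pr / qr) for pr, qr in zip(p, q) if pr > 0)

p = [0.5, 0.25, 0.125, 0.125]    # true symbol probabilities
q = [0.25, 0.25, 0.25, 0.25]     # assumed (misspecified) probabilities

print(avg_code_length(p, p))                          # 1.75 bits: the Shannon lower bound
print(avg_code_length(p, q))                          # 2.00 bits with the misspecified code
print(avg_code_length(p, p) + kl_divergence(p, q))    # 2.00 bits: entropy + KL divergence
```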

This story repeats when modeling data. Given a model with distribution \(Q(\boldsymbol{x})\) we can write the Huffman code length (or description length) of an observation \(\boldsymbol{x}\) as

\[\begin{aligned} H(\boldsymbol{x}) = -\log Q(\boldsymbol{x}),\end{aligned}\]

which we can compare to the relation \(H(\boldsymbol{x}) = - \log P(\boldsymbol{x})\) between system energy and model probability discussed here. If we use this model encoding on a stream of observations whose true distribution is \(P(\boldsymbol{x})\), the average description length decomposes as

\[\begin{aligned} \sum_{\boldsymbol{x}} P(\boldsymbol{x}) H(\boldsymbol{x}) &= -\sum_{\boldsymbol{x}} P(\boldsymbol{x})\log P(\boldsymbol{x}) + \sum_{\boldsymbol{x}} P(\boldsymbol{x})\log \frac{P(\boldsymbol{x})}{Q(\boldsymbol{x})} \nonumber \\ &= \underbrace{S[P]}_{\text{entropy}} + \underbrace{D_{\text{KL}}(P||Q)}_{\text{KL divergence}}\end{aligned}\]

into the inherent entropy of the data and the KL-divergence cost of our model misspecification. As we model the random process \(P(\boldsymbol{x})\), our average description length can never fall below the entropic lower bound, but any description length above this point is evidence of the failure of our model to match reality. While it is relatively straightforward to measure this average description length in practice, deducing what fraction of it is due to the entropy or to the misspecification is a hard problem. When comparing the average description lengths of two models on the same stream of data, however, we can confidently attribute their difference to a difference in the KL divergences and prefer the model with the smaller average description length. This motivates the minimum description length (MDL) principle, which prefers models whose corresponding encodings across realistic data sets are as small as possible.
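As a sketch of this comparison in practice (the generating process and the two candidate models are invented for illustration), the empirical average description length can be measured directly from data and the model with the smaller value preferred:

```python
import math
import random

random.seed(0)

symbols = ["A", "B", "C"]
p_true  = {"A": 0.6, "B": 0.3, "C": 0.1}   # unknown to the modeler in practice
model_1 = {"A": 0.5, "B": 0.3, "C": 0.2}
model_2 = {"A": 0.2, "B": 0.4, "C": 0.4}

data = random.choices(symbols, weights=[p_true[s] for s in symbols], k=10_000)

def avg_description_length(data, model):
    """Empirical average of -log2 Q(x): measurable without knowing the true distribution."""
    return sum(-math.log2(model[x]) for x in data) / len(data)

# The shared (unknown) entropy of the data cancels in the comparison, so the difference
# between the two averages estimates the difference in KL divergences.
print(avg_description_length(data, model_1))   # smaller: model_1 is closer to the truth
print(avg_description_length(data, model_2))
```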

Statistical perspective

As an application, suppose that we would like to select the appropriate value of a parameter \(\boldsymbol{\theta}\) for a model \(P(\boldsymbol{x}|\boldsymbol{\theta})\). For each choice of parameter, the corresponding model description length is simply the negative log-likelihood \(H(\boldsymbol{x}|\boldsymbol{\theta}) = -\log P(\boldsymbol{x}|\boldsymbol{\theta})\). Choosing the model that minimizes the description length therefore amounts to finding the maximum-likelihood estimate of the parameter as

\[\begin{aligned} \text{argmin}_{\boldsymbol{\theta}} H(\boldsymbol{x}|\boldsymbol{\theta}) = \text{argmax}_{\boldsymbol{\theta}} P(\boldsymbol{x}|\boldsymbol{\theta}).\end{aligned}\]

As we discussed in this post, however, this maximum likelihood estimation is prone to overfitting. This approach is also problematic from an information-theoretic perspective. In our optimization of the transmission we have neglected the cost of transmitting the parameter \(\boldsymbol{\theta}\) itself. In the Bayesian context this parameter will be distributed according to a prior \(P(\boldsymbol{\theta})\) that corresponds to its own encoding

\[\begin{aligned} H(\boldsymbol{\theta}) = -\log P(\boldsymbol{\theta}). \end{aligned} \tag{5}\]

If we consider the total information cost of this now two-stage process of first transmitting the parameter \(\boldsymbol{\theta}\) and then the data \(\boldsymbol{x}\) given that parameter, we recover Bayesian maximum a posteriori (MAP) estimation

\[\begin{aligned} \text{argmin}_{\boldsymbol{\theta}} \left[H(\boldsymbol{x}|\boldsymbol{\theta}) + H(\boldsymbol{\theta})\right] &= \text{argmax}_{\boldsymbol{\theta}} P(\boldsymbol{x}|\boldsymbol{\theta})P(\boldsymbol{\theta}) \nonumber \\ &= \text{argmax}_{\boldsymbol{\theta}} P(\boldsymbol{\theta}|\boldsymbol{x}). \end{aligned} \tag{6}\]

The maximum likelihood and MAP estimates of a parameter often differ considerably, particularly when a relatively small amount of data is available.
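A sketch of this two-stage cost for a coin-flip model with a Beta prior (both the data and the prior are illustrative choices, not taken from the note):

```python
import numpy as np

# Hypothetical data: 3 heads out of 10 flips of a coin with unknown bias theta.
n, k = 10, 3
a, b = 5.0, 5.0                    # Beta(5, 5) prior encoding a mild expectation of fairness

thetas = np.linspace(1e-4, 1 - 1e-4, 100_000)

# Description length of the data given the parameter (negative log-likelihood).
H_data = -(k * np.log(thetas) + (n - k) * np.log(1 - thetas))
# Description length of the parameter itself (negative log-prior, up to a constant).
H_param = -((a - 1) * np.log(thetas) + (b - 1) * np.log(1 - thetas))

theta_mle = thetas[np.argmin(H_data)]             # ~0.30 = k / n
theta_map = thetas[np.argmin(H_data + H_param)]   # ~0.39 = (k + a - 1) / (n + a + b - 2)
print(theta_mle, theta_map)
```

With only ten flips the prior's description cost pulls the MAP estimate noticeably toward a fair coin, illustrating the gap between the two estimates.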

After the two-stage transmission process of Eq. 6 we transmit to the receiver both the data of interest \(\boldsymbol{x}\) and the best-fit parameter \(\boldsymbol{\theta}\) used. When assessing model performance, however, knowledge of the parameter is often redundant to the data. We can instead holistically evaluate model performance using the Bayesian evidence, the probability that a model generates a particular data set \(\boldsymbol{x}\) integrated over all possible parameters

\[\begin{aligned} P(\boldsymbol{x}) = \int P(\boldsymbol{x}|\boldsymbol{\theta})P(\boldsymbol{\theta}) \, d\boldsymbol{\theta}.\end{aligned}\]

The description length of the integrated model is then

\[\begin{aligned} H(\boldsymbol{x}) = -\log P(\boldsymbol{x}).\end{aligned}\]

We can therefore motivate model selection that chooses the model with the higher Bayesian evidence, as when computing Bayes factors, as selecting the model that more efficiently compresses the data once its latent parameters are integrated out. These connections highlight the duality between the compression and modeling of a data set.
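A sketch of this integrated description length for the same coin-flip model (grid integration over the prior; the two priors compared are illustrative):

```python
import numpy as np

n, k = 10, 3                                    # a particular sequence with 3 heads in 10 flips
thetas = np.linspace(0.0, 1.0, 100_001)
d_theta = thetas[1] - thetas[0]

def evidence(a, b):
    """P(x) = integral of P(x|theta) P(theta) d(theta), evaluated on a grid."""
    likelihood = thetas**k * (1 - thetas)**(n - k)
    prior = thetas**(a - 1) * (1 - thetas)**(b - 1)
    prior /= prior.sum() * d_theta              # normalize the Beta(a, b) prior numerically
    return (likelihood * prior).sum() * d_theta

for a, b in [(5.0, 5.0), (1.0, 1.0)]:           # two candidate priors, i.e. two models
    print(f"Beta({a}, {b}) prior: H(x) = {-np.log2(evidence(a, b)):.2f} bits")
```

The model with the higher evidence yields the shorter integrated description length and would be preferred by a Bayes factor comparison.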

Kolmogorov complexity

To this point, we have considered the Shannon entropy of probability distributions. However, much of this thesis focuses on a subtly different notion of the complexity of objects. Shannon entropy quantifies the complexity of a probability distribution without regard to the specific objects within that distribution. For instance, suppose that we want to transmit one of two messages: the full text of Dune by Frank Herbert or It by Stephen King. While each book’s content is undoubtedly "complex," our current framework would allow us to "transmit" them with minimal information cost. If we define an encoding where "0" represents the text of Dune and "1" stands for It, we could send a single "0" to transmit the entirety of Dune. This setup might suggest that the inherent information cost of Dune’s content is just one bit, which is clearly an unreasonable conclusion.

The problem is that we have overlooked the information cost required to establish the codebook. If the receiver is unaware of our coding scheme, we must first communicate the full text that each binary digit corresponds to before our transmission. Once this scheme is established, we can indeed repeatedly send our choice of book at the low cost of a single bit. However, the initial information cost of creating the codebook is much higher. Shannon entropy measures the information needed to transmit objects drawn from a probability distribution, not the complexity of the objects themselves.

To get at the information content of an object itself, we should instead turn to the Kolmogorov complexity. Certain objects and outcomes appear to be inherently more complicated than others. For example, a sequence of coin flips "HHHHHHHHHH" is easy to describe as "10 heads in a row." Even if the sequence were 1000 heads, the outcome would not be much more complex to describe. On the other hand, the pattern of coin flips "HHTTHTTHTT" we observed appears to be more complicated to describe. However, even in this case the coin flips we observed are simply the first 10 digits of \(\pi = 11.00100100..._2\) in binary: a concise, if unusual, explanation. Yet if we are presented with a truly "random" string of coin flips, there is little hope for such an efficient description of the outcomes. The Kolmogorov complexity is meant to capture the difference between these settings and fundamentally measure how structured a given data set is.

Roughly speaking, the Kolmogorov complexity \(K(\boldsymbol{x})\) can be understood as the length of the shortest computer program that would output the object (typically a string) in question. In our earlier examples, this program might be "output 10 heads" or "first 10 digits of \(\pi_2\)" in pseudocode. In our earlier example of books, the full texts may be compressed with a technique like the Lempel-Ziv-Welch algorithm used in the .gif file format. The "program" in this case would consist of a description of the LZW algorithm, followed by the compressed file. The total program size, and so the complexity, will still be fairly large, some fraction of the original length of the book, and much greater than the single bit we had used to transmit it in a probabilistic sense.
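As a rough illustration (using zlib's DEFLATE rather than LZW, and treating the compressed size as an upper bound on the complexity up to the constant cost of describing the decompressor):

```python
import random
import zlib

random.seed(0)

structured = "H" * 1000                                          # "1000 heads in a row"
periodic   = "HHTTHTTHTT" * 100                                  # a short repeating pattern
scrambled  = "".join(random.choice("HT") for _ in range(1000))   # no apparent structure

for name, s in [("structured", structured), ("periodic", periodic), ("scrambled", scrambled)]:
    compressed = zlib.compress(s.encode(), 9)
    print(f"{name:>10}: {len(s)} characters -> {len(compressed)} bytes compressed")
```

The highly structured strings compress to a small, nearly length-independent size, while the scrambled string resists compression much beyond the one bit per flip of inherent randomness.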

The "computer program1" in the definition of the Kolmogorov complexity is vague enough to accommodate any possible valid encoding or explanation of a string. For example, we can consider the Huffman encoding of the data \(\boldsymbol{x}\) generated by a model \(M\). We can imagine a computer program which consists of a description of the model \(M\) itself, then provides the Huffman code as a binary string of length \(H_M(\boldsymbol{x})\) that can be decoded with knowledge of the model. When the data set is large, we typically neglect the constant overhead required to describe the model and this framework and roughly say that an encoding of length \(H_M(\boldsymbol{x})\) is possible for the data \(\boldsymbol{x}\)2.

From this perspective, each model can be viewed as a competing encoding of the data, and each provides an upper bound on the inherent complexity. If we have a basket of candidate models \(M \in \mathcal{M}\), we can then loosely approximate the "true" information cost of \(\boldsymbol{x}\) as the minimum

\[\begin{aligned} K(\boldsymbol{x}) \sim \text{min}_{M \in \mathcal{M}} H_{M}(\boldsymbol{x}).\end{aligned}\]

The higher the model evidence, the shorter the description length, yielding a tighter upper bound on the true cost.

Despite this relative improvement, it is not possible to conclusively show that our approximation is particularly close to the truth. There may always be a clever encoding out there that transmits the data far more efficiently than the models we consider. To show that \(K(\boldsymbol{x})\) is above some value \(n\), we would need to check the outputs of all possible programs of length less than or equal to \(n\), an uncomputable task. For example, we may not recognize our initial sequence of coin flips "HHTTHTTHTT" as the first 10 digits of \(\pi = 11.00100100..._2\) in binary, a more concise explanation than our models provide.

Although we cannot find the perfect encoding, nor the perfect model, we can strive for a better understanding of systems and their complexity in this information-theoretic framework.

Footnotes

  1. More formally, a universal Turing machine.↩︎

  2. In this pursuit we cannot consider models too finely attuned to a particular data set, or else we can no longer neglect the cost to specify the model itself.↩︎