Statistical physics
In this note we review the fundamentals of equilibrium statistical mechanics and their application to statistical modeling. We demonstrate how concepts of probability and inference can be usefully cast in the language of physics and energy. By leveraging this correspondence, we can import analytic and computational tools from physics to tackle statistical questions.
We ground our exploration in the simple statistical setting of independent flips of a coin with probability \(p\) of landing heads. As discussed in this note, the model likelihood of observing a sequence \(\boldsymbol{s}\) with \(n_H\) heads out of \(n\) flips is

\[P(\boldsymbol{s}|p) = p^{n_H} (1-p)^{n - n_H}.\]
We define a physical model equivalent to this statistical one and explore how this physical perspective motivates alternative types of models and inference techniques. The presentation here is adapted from my PhD thesis, alongside reviews of statistical inference and information theory.
Energy and its conservation
Physics often describes systems in terms of a configuration space of possible states. In the coin flip example, the state of the system (or "configuration") is the sequence of \(n\) observed outcomes. We represent this configuration as a vector \(\boldsymbol{s}\), where each entry \(s_i = 1\) indicates that flip \(i = 1,...,n\) landed heads, and \(s_i = 0\) represents tails. The configuration space of coin flips is thus composed of all possible sequences of \(n\) flips, forming all binary vectors of length \(n\).
This notation evokes the Ising model, a model fundamental to statistical mechanics. The configurations of this model consist of atomic "spins" that reside at one of \(n\) "sites," each in either its excited state \(s_i = 1\) or ground state \(s_i = 0\)1. We assign the excited state an energy of 1 and the ground state an energy of 0, as shown in Figure 1. The total energy of the system, referred to as the Hamiltonian, is then the number of excited states (or heads)

\[H(\boldsymbol{s}) = \sum_{i=1}^{n} s_i = n_H.\]
A more general Ising model would include coupling energies between neighboring spins, but we omit them here to maintain independent coin flips.
In an isolated system the energy Eq. 2 is always conserved, a fundamental law of physics. Although changes in individual spin states can occur, they must be counterbalanced by changes in other spins to maintain the "global" energy of the system. Let \(E\) be this constant energy of the system, so that only configurations \(\boldsymbol{s}\) where \(H(\boldsymbol{s}) = E\) are permitted, namely those with \(n_H = E\) excited states. The number of possible system configurations (also known as microstates) that satisfy this condition is then the number of ways to arrange the \(n_H = E\) excited states among the \(n\) sites,

\[\Omega(E) = \binom{n}{E}.\]
Statistical mechanics fundamentally postulates that when a system is in equilibrium, all these microstates of the same total energy are equally likely to appear. This assumption defines equilibrium statistical mechanics, the framework we adopt in this note. Under this postulate the probability of observing any specific configuration \(\boldsymbol{s}\) is therefore

\[P(\boldsymbol{s}) = \frac{1}{\Omega(E)} = \binom{n}{E}^{-1}.\]
This uniform distribution over all configurations of a fixed energy is known as the microcanonical ensemble. As a distribution over possible sequences of coin flips, it can also be interpreted as a microcanonical model where the continuous local probability of heads \(p\) has been replaced by the discrete global number of heads \(n_H = E\).
Many statistical models are framed in this microcanonical form, where the "parameters" are globally observed quantities. While this formulation may not be well-suited for the coin flip example, where we have little reason to expect a "conservation of heads," it arises in other statistical settings where certain properties are conserved across random realizations. For instance, in a network of sports matches the total number of games played remains constant each season, even though the pattern of connections among teams changes.
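For a small system this microcanonical picture can be made concrete in code. The sketch below (Python; the function names and sizes are illustrative choices, not from the text) counts the \(\binom{n}{E}\) microstates and draws one uniformly at random:

```python
import math
import random

def num_microstates(n, E):
    """Omega(E): number of length-n binary sequences with exactly E excited states (heads)."""
    return math.comb(n, E)

def sample_microcanonical(n, E, rng=random):
    """Draw a uniformly random configuration s with exactly E excited states."""
    excited = set(rng.sample(range(n), E))   # choose which sites are excited
    return [1 if i in excited else 0 for i in range(n)]

# With n = 10 sites and E = 3 excited states there are C(10, 3) = 120 microstates,
# and each sampled configuration appears with probability 1/120.
```

Since every arrangement of the \(E\) excited states is chosen with equal probability, this directly realizes the equal-likelihood postulate of the microcanonical ensemble.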
Entropy
To return to the independent coin flip model we must broaden the physical picture. Realistically most physical systems are not truly isolated but rather exchange energy with their environment in a setting known as the canonical ensemble. In physical language the original system is now a subsystem in equilibrium with a large thermal bath. Although the combined system of both the subsystem and the thermal bath must still conserve overall energy, our subsystem of interest can gain and lose energy to the thermal bath.
Figure 2 illustrates this setup for the coin flip example. Let \(\tilde{\boldsymbol{s}}\) represent the configuration of the thermal bath, containing \(\tilde{n}\) spins and energy \(\tilde{E} = H(\tilde{\boldsymbol{s}})\) which counts the number of excited states in the bath. Including our subsystem \(\boldsymbol{s}\), the full system \((\boldsymbol{s},\tilde{\boldsymbol{s}})\) then has \(n + \tilde{n}\) total spins and total energy \(E_T = H(\boldsymbol{s},\tilde{\boldsymbol{s}}) = E + \tilde{E}\). Since this total energy is conserved, the full system \((\boldsymbol{s},\tilde{\boldsymbol{s}})\) is uniformly distributed according to the microcanonical ensemble Eq. 4 of \(E_T\) excited states on \(n + \tilde{n}\) sites,

\[P(\boldsymbol{s}, \tilde{\boldsymbol{s}}) = \binom{n + \tilde{n}}{E_T}^{-1}.\]
Although each pair \((\boldsymbol{s},\tilde{\boldsymbol{s}})\) with total energy \(E_T\) is equally likely, certain subsystem energies \(E\) are more likely than others. Figure 2 demonstrates this effect with two example configurations. In panel (a) the excited states ("H") are all contained within the thermal bath \(\tilde{\boldsymbol{s}}\), meaning that the subsystem \(\boldsymbol{s}\) has its lowest possible energy \(E = 0\) and the thermal bath has the full energy \(E_T\). In panel (b) the excited states are evenly distributed between the subsystem and the bath, and the subsystem has absorbed some energy from the bath. Across all possible configurations of the joint subsystem-bath system, imbalanced distributions like (a) are less common than balanced ones (b). Random configurations of the overall ensemble are therefore likely to be balanced. As a result, if our subsystem begins in its ground state (a) it will tend to thermalize into a configuration like (b) with evenly distributed excited states.
To quantify this tendency, we can count the number of overall subsystem-bath configurations \((\boldsymbol{s},\tilde{\boldsymbol{s}})\) with subsystem energy \(E\). As in Eq. 3, there are \(\Omega(E)\) microstates of the subsystem \(\boldsymbol{s}\) that all share the same energy \(E\). We refer to this collection of microstates as a macrostate of the subsystem with energy \(E\). Figure 3 illustrates this grouping of subsystem microstates into macrostates by energy. Macrostates that contain more microstates, those with higher \(\Omega(E)\), are naturally observed more often. We can likewise construct macrostates of the thermal bath \(\tilde{\boldsymbol{s}}\) at energies \(\tilde{E}\). Considering the number of ways to arrange the \(\tilde{E}\) excited states among the \(\tilde{n}\) sites of the bath, there are

\[\tilde{\Omega}(\tilde{E}) = \binom{\tilde{n}}{\tilde{E}}\]
microstates in the bath macrostate of energy \(\tilde{E}\). In many settings the full microstate is not observable, for instance the precise position and velocity of each molecule of a gas. In these settings the macrostate summarizes the pieces of physically relevant information that can be observed, such as the energy or pressure.
Since the subsystem macrostate has \(\Omega(E)\) microstates and the bath macrostate has \(\tilde{\Omega}(\tilde{E})\), there are \(\Omega(E)\tilde{\Omega}(\tilde{E})\) unique pairs \((\boldsymbol{s},\tilde{\boldsymbol{s}})\) with subsystem energy \(E\) and bath energy \(\tilde{E} = E_T - E\). Since each pair is equally likely to appear in the overall microcanonical ensemble, we thus observe subsystem energy \(E\) with probability

\[P(E) = \frac{\Omega(E)\, \tilde{\Omega}(E_T - E)}{\binom{n + \tilde{n}}{E_T}}.\]
These macrostate multiplicities can also be written using macrostate entropy, defined as the logarithm2

\[S(E) = \log \Omega(E).\]
The energy distribution then depends on the overall entropy \(S(E) + \tilde{S}(\tilde{E})\) as

\[P(E) \propto e^{S(E) + \tilde{S}(\tilde{E})}.\]
Configurations with higher total entropy are therefore more likely to appear, as calculated in Figure 2. This tendency to observe higher entropy states is the content of the 2nd law of thermodynamics: the entropy of the universe cannot decrease. On macroscopic scales these differences in entropy are enormous, and the probability Eq. 7 approaches certainty, making the 2nd law an effective law of physics.
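We can watch this preference for balanced, high-entropy macrostates numerically by enumerating the multiplicities \(\Omega(E)\tilde{\Omega}(E_T - E)\) for a small subsystem and bath. This is an illustrative Python sketch; the sizes are arbitrary choices, not values from the text:

```python
import math

def energy_distribution(n, n_bath, E_T):
    """P(E) proportional to Omega(E) * Omega_bath(E_T - E): the chance the
    subsystem of n spins holds energy E out of the conserved total E_T."""
    weights = {}
    for E in range(0, min(n, E_T) + 1):
        E_bath = E_T - E
        if 0 <= E_bath <= n_bath:
            weights[E] = math.comb(n, E) * math.comb(n_bath, E_bath)
    total = sum(weights.values())     # equals C(n + n_bath, E_T) by Vandermonde's identity
    return {E: w / total for E, w in weights.items()}

# Subsystem of 10 spins, bath of 90 spins, half of all 100 spins excited.
P = energy_distribution(10, 90, 50)
most_likely = max(P, key=P.get)       # balanced: about half the subsystem excited
```

The most probable subsystem energy sits near the balanced value \(E \approx E_T\, n/(n+\tilde{n}) = 5\), while the imbalanced extremes \(E = 0\) and \(E = 10\) are comparatively rare.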
The canonical ensemble
We can similarly compute the distribution of subsystem configurations \(\boldsymbol{s}\), not just of its energy \(E\). If we fix the subsystem microstate \(\boldsymbol{s}\), valid pairs \((\boldsymbol{s},\tilde{\boldsymbol{s}})\) correspond to the \(\tilde{\Omega}(\tilde{E})\) bath microstates \(\tilde{\boldsymbol{s}}\) of the remaining energy \(\tilde{E} = E_T - E\). As each pair is equally probable, the subsystem is distributed as

\[P(\boldsymbol{s}) = \frac{\tilde{\Omega}(E_T - H(\boldsymbol{s}))}{\binom{n + \tilde{n}}{E_T}} \propto e^{\tilde{S}(E_T - H(\boldsymbol{s}))}.\]
Unlike the uniform, microcanonical ensemble of the subsystem, certain configurations are now more or less likely based on the entropy of the surrounding bath.
In this picture the canonical ensemble is defined by the limit \(\tilde{n} \rightarrow \infty\) of a large thermal bath that can effectively absorb subsystem fluctuations. We further fix the average energy per site \(p = \tilde{E}/\tilde{n}\) of the thermal bath in this limit, which corresponds to a fixed density of excited states. If we apply Stirling's approximation to the binomial coefficient Eq. 6, we find the bath’s entropy \(\tilde{S}(\tilde{E})\) is proportional to its energy (up to a constant \(C\)) as

\[\tilde{S}(\tilde{E}) \approx \beta\, \tilde{E} + C, \qquad \beta = \log\frac{1-p}{p}.\]
This constant of proportionality \(\beta\) is known as the inverse temperature and relates changes in the bath energy \(\tilde{E}\) to changes of entropy \(\tilde{S}\) as

\[\beta = \frac{\partial \tilde{S}}{\partial \tilde{E}} = \frac{1}{T},\]
aligning with the usual thermodynamic definition3 of the temperature \(T\). We will typically assume this temperature is positive and so added energy increases the entropy, although negative temperatures are possible in certain cases such as population inversion in laser physics or the coin flips when \(p > 0.5\).
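This identification of the inverse temperature can be sanity-checked numerically. The Python sketch below (the bath size and excitation density are arbitrary illustrative values) compares a discrete derivative of the bath entropy \(\log\binom{\tilde{n}}{\tilde{E}}\) against the value \(\log\frac{1-p}{p}\) obtained from Stirling's approximation:

```python
import math

def bath_entropy(n_bath, E_bath):
    """S = log Omega for a bath of n_bath spins with E_bath excited (natural log, k_B = 1)."""
    return math.log(math.comb(n_bath, E_bath))

n_bath, p = 10_000, 0.3
E_bath = int(p * n_bath)

# Discrete derivative dS/dE at excitation density p approximates beta.
beta_numeric = bath_entropy(n_bath, E_bath + 1) - bath_entropy(n_bath, E_bath)
beta_exact = math.log((1 - p) / p)
```

For this bath size the two values agree to a few parts in ten thousand, and the agreement improves as \(\tilde{n}\) grows.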
Comparing to Eq. 8, we then see that the probability of observing any given subsystem \(\boldsymbol{s}\) in this canonical ensemble is

\[P(\boldsymbol{s}) \propto e^{\tilde{S}(E_T - H(\boldsymbol{s}))} \propto e^{\beta (E_T - H(\boldsymbol{s}))} \propto e^{-\beta H(\boldsymbol{s})},\]
where we have used that the total energy \(E_T = E + \tilde{E}\) is conserved. In fact, by similar arguments any system \(\boldsymbol{s}\) in thermal equilibrium with a large bath at inverse temperature \(\beta\) follows this same Boltzmann distribution (or "Gibbs distribution") \(P(\boldsymbol{s}) \propto e^{-\beta H(\boldsymbol{s})}\), which may itself be taken as the definition of the canonical ensemble of \(\boldsymbol{s}\). When \(T = 0\) and \(\beta \rightarrow \infty\), the subsystem is stuck in its ground state with the smallest possible energy \(H(\boldsymbol{s})\). This aligns with the usual tendency of a physical system towards smaller energy, as a ball rolls down a hill. As the temperature increases and \(T \rightarrow \infty\), \(\beta \rightarrow 0\), however, the Boltzmann distribution becomes uniform and every subsystem microstate is equally likely independent of its energy.
At finite \(\beta > 0\) between these extremes, the typical thermal configuration may not be the ground state, even though that is the single most likely microstate. Rather, the subsystem is likely to be found in some other macrostate with higher entropy \(S(E)\) as seen in Figure 2. Although each individual microstate of this macrostate has smaller probability than the ground state, their greater number makes their macrostate collectively more likely to be observed. This effect can be quantified by revisiting Eq. 7 for the distribution of energy \(E\). Since the large bath’s entropy is proportional to its energy we have

\[P(E) \propto e^{S(E) + \tilde{S}(\tilde{E})} \propto e^{S(E) - \beta E} = e^{-\beta F(E)},\]
where we have defined the free energy

\[F(E) = E - T\, S(E).\]
The most likely energy \(E\) to be observed is therefore the minimum of this free energy. Depending on the temperature \(T\), this may no longer be the ground state energy due to the influence of the entropy \(S(E)\).
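For the coin flip system, whose entropy is \(S(E) = \log\binom{n}{E}\), we can locate this free energy minimum directly. A Python sketch with illustrative values (\(n = 100\) spins, inverse temperature matching a bias \(p = 0.3\)):

```python
import math

def free_energy(E, n, T):
    """F(E) = E - T*S(E) with S(E) = log C(n, E), for n independent two-level spins."""
    return E - T * math.log(math.comb(n, E))

n = 100
beta = math.log((1 - 0.3) / 0.3)   # inverse temperature equivalent to bias p = 0.3
T = 1 / beta

# Scan all macrostate energies for the free energy minimum.
E_star = min(range(n + 1), key=lambda E: free_energy(E, n, T))
```

The minimum lands at \(E^\ast = np = 30\) rather than at the ground state \(E = 0\): the entropy term \(-T S(E)\) outweighs the energetic cost of exciting spins.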
To complete the description of the canonical ensemble, we normalize the Boltzmann distribution as

\[P(\boldsymbol{s}) = \frac{e^{-\beta H(\boldsymbol{s})}}{Z(\beta)}, \qquad Z(\beta) = \sum_{\boldsymbol{s}} e^{-\beta H(\boldsymbol{s})},\]
with the partition function \(Z(\beta)\), a quantity which tells us much about the system in its own right. For example, its logarithmic derivative

\[-\frac{\partial \log Z(\beta)}{\partial \beta} = \frac{1}{Z(\beta)} \sum_{\boldsymbol{s}} H(\boldsymbol{s})\, e^{-\beta H(\boldsymbol{s})}\]
yields the average energy \(\langle E \rangle\) under the Boltzmann distribution.
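For \(n\) independent two-level spins the partition function factorizes as \(Z(\beta) = (1 + e^{-\beta})^n\), so this logarithmic derivative can be checked numerically. A Python sketch with arbitrary illustrative values:

```python
import math

def log_Z(beta, n):
    """log partition function for n independent two-level spins: Z = (1 + e^{-beta})^n."""
    return n * math.log(1 + math.exp(-beta))

n, beta, h = 50, 0.8, 1e-6

# Central finite difference of -d log Z / d beta approximates the average energy.
E_avg = -(log_Z(beta + h, n) - log_Z(beta - h, n)) / (2 * h)

# Direct Boltzmann average over energy macrostates, weighting each E by C(n, E) e^{-beta E}.
weights = [math.comb(n, E) * math.exp(-beta * E) for E in range(n + 1)]
E_direct = sum(E * w for E, w in zip(range(n + 1), weights)) / sum(weights)
```

Both routes give \(\langle E \rangle = n/(1 + e^{\beta})\), illustrating how averages fall out of derivatives of \(\log Z\).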
Physics and models
If we normalize the Boltzmann distribution of our coin flip system, with inverse temperature \(e^{-\beta} = \frac{p}{1-p}\) and partition function \(Z(\beta) = (1-p)^{-n}\), we obtain

\[P(\boldsymbol{s}) = (1-p)^{n}\, e^{-\beta n_H} = (1-p)^{n} \left(\frac{p}{1-p}\right)^{n_H} = p^{n_H} (1-p)^{n - n_H},\]
which we recognize as the model likelihood Eq. 1 for biased coin flips with probability \(p\). In this correspondence, we can check that the average energy relation Eq. 12 of the physical system indeed recovers the expected number of heads

\[\langle E \rangle = -\frac{\partial \log Z}{\partial \beta} = \frac{n\, e^{-\beta}}{1 + e^{-\beta}} = n p.\]
Unlike the microcanonical ensemble, the number of observed heads \(n_H = E\) can now thermally fluctuate about this expectation via exchange with its surroundings in a manner that exactly reproduces independent coin flips.
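This correspondence can be verified in a few lines. The Python sketch below (the bias and number of flips are illustrative values) checks that Boltzmann weights at \(e^{-\beta} = p/(1-p)\) reproduce the binomial distribution of heads, along with the mean \(np\):

```python
import math

p, n = 0.3, 20
beta = math.log((1 - p) / p)          # inverse temperature equivalent to bias p

def P_boltzmann(n_H):
    """Probability of observing n_H heads under the Boltzmann distribution,
    summed over the C(n, n_H) sequences of that macrostate, with Z = (1-p)^{-n}."""
    return math.comb(n, n_H) * math.exp(-beta * n_H) * (1 - p)**n

def P_binomial(n_H):
    """Probability of n_H heads from n independent flips of a p-biased coin."""
    return math.comb(n, n_H) * p**n_H * (1 - p)**(n - n_H)

# Thermal average of the energy n_H, which should equal n*p.
E_avg = sum(n_H * P_boltzmann(n_H) for n_H in range(n + 1))
```

The two distributions agree term by term, and the thermal average lands on the binomial mean \(np = 6\).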
This example motivates us to draw a broader analogy between statistics and physics. In Eq. 13 we described a physical system that reproduces the biased coin flip likelihood, the same distribution of flip outcomes given a fixed parameter \(p\). Often, however, we are interested in the reverse inference problem of understanding the space of likely parameter values \(p\) given a particular observation. Suppose we have a generic model \(P(\boldsymbol{x}|\boldsymbol{\theta})\) of data \(\boldsymbol{x}\) with parameters \(\boldsymbol{\theta}\). As discussed here, the posterior distribution of parameters inferred from data \(\boldsymbol{x}\) is by Bayes’ law

\[P(\boldsymbol{\theta}|\boldsymbol{x}) = \frac{P(\boldsymbol{x}|\boldsymbol{\theta})\, P(\boldsymbol{\theta})}{P(\boldsymbol{x})}.\]
For a fixed observation \(\boldsymbol{x}\), the corresponding physical system is defined by the Hamiltonian

\[H(\boldsymbol{\theta}) = -\log P(\boldsymbol{x}|\boldsymbol{\theta})\]
over the configuration space of possible parameters \(\boldsymbol{\theta}\). If we assume a uniform prior \(P(\boldsymbol{\theta}) = 1\), the posterior distribution over parameters is proportional to the Boltzmann distribution of this system at unit temperature \(\beta = 1\)

\[P(\boldsymbol{\theta}|\boldsymbol{x}) \propto e^{-H(\boldsymbol{\theta})}.\]
In fact, if we normalize the Boltzmann distribution as

\[P(\boldsymbol{\theta}|\boldsymbol{x}) = \frac{e^{-H(\boldsymbol{\theta})}}{Z(1)},\]
the partition function \(Z(1)\) is equal to the Bayesian evidence

\[Z(1) = \int e^{-H(\boldsymbol{\theta})}\, d\boldsymbol{\theta} = \int P(\boldsymbol{x}|\boldsymbol{\theta})\, d\boldsymbol{\theta} = P(\boldsymbol{x}).\]
Through this correspondence we can explore the posterior distribution of any Bayesian model by simulating the behavior of the analogous physical system.
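As a small worked example of this correspondence, the sketch below (Python; the observed counts and grid resolution are hypothetical choices) treats the coin bias \(\theta\) as the parameter, builds the Hamiltonian from the negative log likelihood, and recovers the Bayesian evidence as the partition function at \(\beta = 1\), assuming a uniform prior on \((0, 1)\):

```python
import math

# Hypothetical observation: n_H heads out of n flips.
n, n_H = 10, 7

def H(theta):
    """Hamiltonian over parameter space: H(theta) = -log P(x | theta)."""
    return -(n_H * math.log(theta) + (n - n_H) * math.log(1 - theta))

# Boltzmann distribution at unit temperature on a midpoint grid over (0, 1).
d_theta = 1 / 1000
thetas = [(k + 0.5) * d_theta for k in range(1000)]
weights = [math.exp(-H(t)) for t in thetas]

Z = sum(weights) * d_theta                     # partition function = Bayesian evidence
post_mean = sum(t * w for t, w in zip(thetas, weights)) / sum(weights)
```

For this Beta-binomial setup the evidence has the closed form \(B(n_H + 1, n - n_H + 1) = 1/1320\) and the posterior mean is \(8/12 = 2/3\), which the grid approximation reproduces closely.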
This equivalence highlights a subtle philosophical difference between the physical perspective and the common statistical view of these problems. In statistics, particularly in frequentist treatments, we often consider and report the single best-fit parameter of a model that maximizes \(P(\boldsymbol{\theta}|\boldsymbol{x})\). Physically, this is akin to identifying the ground state configuration with the lowest energy \(H(\boldsymbol{\theta})\).
Yet in physics, we are usually more interested in the typical behavior of the system across the entire thermal ensemble rather than the nature of the single most probable microstate. As shown in Figure 2, there may be many other less likely yet more entropic configurations that dominate the overall distribution when taken together. In common glass, for example, although the ground state would be an ordered crystalline structure, thermal fluctuations make the typical configuration amorphous, giving the material its signature uniform transparency. Keeping with this perspective, in this work we will consider the full posterior distribution when possible to give a comprehensive picture of system behavior.
Markov Chain Monte Carlo
For this purpose we employ Markov Chain Monte Carlo (MCMC) methods, a common technique to simulate generic physical systems and so to sample from generic probability distributions. This strategy performs a weighted random walk over configuration space, analogous to thermalization and dispersion in a real system. Returning to our coin flip example, we can consider how in Figure 2 the subsystem dynamically evolves. If our subsystem \(\boldsymbol{s}\) begins in the low-entropy ground state (a), through the jostling of the coins the system will naturally tend towards a typical, high-entropy configuration (b). Yet physically this transition is not instantaneous. For instance, gas molecules kinetically bump into each other and gradually disperse throughout their enclosure. We use Markov chain methods to simulate coin flip thermalization as a process that occurs one coin flip at a time.
A Markov chain walks through configuration space in discrete increments of time. At each step the Markov chain transitions from one state at time \(t\) to another state at slightly later time \(t + \Delta t\). This is a random walk, where a move from state \(\boldsymbol{s}\) to \(\boldsymbol{s}'\) is made with probability \(P(\boldsymbol{s} \rightarrow \boldsymbol{s}')\). Under mild conditions on the transition probabilities, this Markov chain will converge to some equilibrium distribution \(P(\boldsymbol{s})\). If the states of the Markov chain are distributed according to the equilibrium \(P(\boldsymbol{s})\), the final \(\boldsymbol{s}'\) after the chain move \(\boldsymbol{s} \rightarrow \boldsymbol{s}'\) will by definition share the equilibrium distribution

\[P(\boldsymbol{s}') = \sum_{\boldsymbol{s}} P(\boldsymbol{s})\, P(\boldsymbol{s} \rightarrow \boldsymbol{s}').\]
We are then left with the problem of constructing a Markov chain which has the particular equilibrium distribution \(P(\boldsymbol{s})\) we are interested in.
A simple and ubiquitous way to establish such a chain is known as the Metropolis-Hastings algorithm. Each step in this Markov chain consists of two parts. First, a move to a new state is proposed given the current state. Second, that move is either accepted and the chain moves to the new state, or it is rejected and the chain remains at its current state. If we propose moves with transition probabilities \(P_{\text{prop}}(\boldsymbol{s} \rightarrow \boldsymbol{s}')\), we accept each proposal with probability

\[P_{\text{acc}}(\boldsymbol{s} \rightarrow \boldsymbol{s}') = \min\left(1,\ \frac{P(\boldsymbol{s}')\, P_{\text{prop}}(\boldsymbol{s}' \rightarrow \boldsymbol{s})}{P(\boldsymbol{s})\, P_{\text{prop}}(\boldsymbol{s} \rightarrow \boldsymbol{s}')}\right).\]
We will often use symmetric proposals where \(P_{\text{prop}}(\boldsymbol{s} \rightarrow \boldsymbol{s}') = P_{\text{prop}}(\boldsymbol{s}' \rightarrow \boldsymbol{s})\), for which this acceptance probability simplifies to

\[P_{\text{acc}}(\boldsymbol{s} \rightarrow \boldsymbol{s}') = \min\left(1,\ \frac{P(\boldsymbol{s}')}{P(\boldsymbol{s})}\right) = \min\left(1,\ e^{-\beta\left(H(\boldsymbol{s}') - H(\boldsymbol{s})\right)}\right).\]
In terms of the system energy \(H(\boldsymbol{s})\), we see that the algorithm will always accept changes that decrease the energy, and occasionally moves that increase it, an emulation of how the real system evolves.
From this two-step process the overall Markov chain transition probabilities are then distinguished by whether they increase or decrease the probability as

\[P(\boldsymbol{s} \rightarrow \boldsymbol{s}') = \begin{cases} P_{\text{prop}}(\boldsymbol{s} \rightarrow \boldsymbol{s}') & P(\boldsymbol{s}') \geq P(\boldsymbol{s}) \\ P_{\text{prop}}(\boldsymbol{s} \rightarrow \boldsymbol{s}')\, \dfrac{P(\boldsymbol{s}')}{P(\boldsymbol{s})} & P(\boldsymbol{s}') < P(\boldsymbol{s}) \\ P_{\text{prop}}(\boldsymbol{s} \rightarrow \boldsymbol{s}) + \displaystyle\sum_{\boldsymbol{s}'' :\, P(\boldsymbol{s}'') < P(\boldsymbol{s})} P_{\text{prop}}(\boldsymbol{s} \rightarrow \boldsymbol{s}'') \left(1 - \frac{P(\boldsymbol{s}'')}{P(\boldsymbol{s})}\right) & \boldsymbol{s}' = \boldsymbol{s} \end{cases}\]
where the sum in the case \(\boldsymbol{s}' = \boldsymbol{s}\) accounts for all the rejected proposals to states with probability smaller than \(P(\boldsymbol{s})\). Summing over these cases, now over the states \(\boldsymbol{s}\) that can produce a given state \(\boldsymbol{s}'\), we can check the stability condition Eq. 15 using the detailed balance of these transitions, \(P(\boldsymbol{s})\, P(\boldsymbol{s} \rightarrow \boldsymbol{s}') = P(\boldsymbol{s}')\, P(\boldsymbol{s}' \rightarrow \boldsymbol{s})\), as

\[\sum_{\boldsymbol{s}} P(\boldsymbol{s})\, P(\boldsymbol{s} \rightarrow \boldsymbol{s}') = P(\boldsymbol{s}') \sum_{\boldsymbol{s}} P(\boldsymbol{s}' \rightarrow \boldsymbol{s}) = P(\boldsymbol{s}').\]
Therefore the desired distribution is a fixed point of the Metropolis-Hastings algorithm, regardless of the choice of proposals \(P_{\text{prop}}(\boldsymbol{s} \rightarrow \boldsymbol{s}')\). However, being a Markov chain, there is still a clear correlation between subsequent steps in the random walk. Only after a typical number of Markov chain steps known as the mixing time are the samples meaningfully independent. Therefore to efficiently obtain independent samples from the distribution, for example to compute expectations, the mixing time should be as small as possible. The form these proposals take can considerably impact the mixing time, and clever choices are often needed to make Monte Carlo methods tractable.
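For a tiny system this fixed point can be verified exhaustively by constructing the full transition matrix. In the Python sketch below (three flips and an arbitrary bias, with single-site flip proposals as an illustrative choice), we check that the target distribution satisfies the stationarity condition under the Metropolis-Hastings transitions:

```python
import math
import itertools

# Target: coin flip distribution over all 2^n configurations of n = 3 flips.
n, p = 3, 0.3
states = list(itertools.product([0, 1], repeat=n))
P = {s: p**sum(s) * (1 - p)**(n - sum(s)) for s in states}

def transition(s, s_next):
    """Metropolis-Hastings transition probability with uniform single-site flip proposals."""
    diff = [i for i in range(n) if s[i] != s_next[i]]
    if len(diff) == 1:                  # propose site with prob 1/n, then accept
        return (1 / n) * min(1, P[s_next] / P[s])
    if s_next == s:                     # staying put collects all rejected proposals
        return 1 - sum(transition(s, t) for t in states if t != s)
    return 0.0                          # unreachable in a single step

# Check stationarity: P(s') = sum_s P(s) P(s -> s') for every state s'.
for s_next in states:
    total = sum(P[s] * transition(s, s_next) for s in states)
    assert math.isclose(total, P[s_next])
```

The assertions pass for every state, confirming numerically that the chain leaves the target distribution invariant.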
A common choice of proposal for discrete settings like this is the single site flip, where we choose a random site \(i\) and flip it from heads to tails, \(s_i = 1 \mapsto 0\), or vice versa. An advantage of this local change is that the resulting change in the probability (or energy) is small, and the move is therefore likely to be accepted. If an entirely new global configuration is drawn uniformly at random, it likely has a much lower probability (or higher energy) than the current sample, and so will likely be rejected and waste an iteration of the algorithm.
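Putting the pieces together, a minimal Metropolis-Hastings sampler for the coin flip system with single site flip proposals might look like the following Python sketch (the function name and parameter values are illustrative choices, not from the text):

```python
import math
import random

def metropolis_coin_flips(n, p, steps, seed=0):
    """Sample coin configurations via Metropolis-Hastings with single-site flip proposals,
    targeting the Boltzmann distribution equivalent to bias p."""
    rng = random.Random(seed)
    beta = math.log((1 - p) / p)       # inverse temperature matching the bias p
    s = [0] * n                        # start in the all-tails ground state
    history = []
    for _ in range(steps):
        i = rng.randrange(n)           # propose flipping one uniformly chosen site
        dH = 1 - 2 * s[i]              # energy change: +1 (tails -> heads) or -1
        if rng.random() < min(1.0, math.exp(-beta * dH)):
            s[i] = 1 - s[i]            # accept the move
        history.append(sum(s))         # record the energy n_H after each step
    return history

# After thermalizing, the chain fluctuates around <n_H> = n*p = 30.
trace = metropolis_coin_flips(n=100, p=0.3, steps=20000)
```

Starting from the ground state, the recorded energies \(n_H\) climb toward \(np\) and then fluctuate about it, mirroring the thermalization shown in Figure 4.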
If we apply this to our simple coin flipping example, Figure 4 shows how the number of heads changes over the course of the Metropolis-Hastings algorithm. We can observe that although the system starts in a configuration not particularly representative of the equilibrium distribution, after many iterations the Markov chain thermalizes the state. Although this is a fairly simple distribution, one we could sample directly, the flexibility of MCMC allows us to apply it to far more complex distributions to draw inferences. In doing this, however, we must be careful to ensure that the Markov chain has adequately converged. If samples are taken too early in the algorithm, results will be skewed by the initial state, as in the initial samples of Figure 4.
These Monte Carlo methods are very powerful, and are the workhorse behind much of Bayesian inference. In order to apply these methods to their fullest, we can further augment the Metropolis-Hastings algorithm by performing parallel tempering to reduce the mixing time of the Markov chains and thermodynamic integration to compute the Bayesian evidence in a manner analogous to how such methods are used to compute the physical free energy.
Footnotes
The two values of an Ising model spin are typically denoted as \(s_i = 1\) for the "up" state and \(s_i = -1\) for the "down" state, in line with the magnetic dipole moments they physically represent, although the choice of basis does not alter the underlying physics.↩︎
More generally defined as \(S = k_B \log \Omega\) where \(k_B \approx 1.38 \times 10^{-23} \text{ J}/\text{K}\) is the Boltzmann constant. The units of this constant are due to the thermodynamic relation Eq. 10.↩︎
This definition ensures that if two thermal baths at temperatures \(T_1 > T_2 > 0\) are brought into contact energy will flow from the higher temperature bath to the lower temperature bath to increase the overall entropy, in keeping with the 2nd law of thermodynamics.↩︎