An Introduction to Markov Chain Monte Carlo Sampling

Writing and diagnosing a Metropolis sampler in R

It is usually not too difficult to define priors and specify a likelihood function, which means we can calculate the unnormalized posterior for any combination of relevant parameter values. However, that is still insufficient to give us marginal posterior distributions for the parameters of interest. The grid method that was used in the previous post is not feasible for situations with a large number of parameters, and conjugate models with analytical solutions are mainly relevant for a subset of suitable problems. So what can we do? We sample from the posterior distribution!

This post illustrates Markov chain Monte Carlo sampling by writing a simple Metropolis sampler in R. There are some very efficient MCMC implementations already out there, so the goal of doing this manually is purely educational. (If you are unfamiliar with the basics of Bayesian inference, my previous post may be a better start.)

Defining the prior, likelihood, and posterior

To keep things simple, and to enable comparison, we will use the same data, model, and priors as in the previous post. That means we have three noisy observations of a single individual, which we use to estimate this individual’s latent trait, \(\mu\), as well as the standard deviation of the measurement errors, \(\sigma\).

To get started, we will recreate the data:

set.seed(4)                 # Set seed for reproducibility
n <- 3                      # Set number of observations
y <- rnorm(n,mean=1,sd=2)   # Generate data, taking random draws from a normal distribution

We exploit the fact that our individual is drawn from a population, and let our prior for \(\mu\) reflect the known population distribution: \(p(\mu) \sim N(-1,1.5)\). For \(\sigma\), we use a simple, flat prior (even though other choices may be better): \(p(\sigma) \sim U(0,10)\). The joint prior is the product of these: \[p(\mu,\sigma) = p(\mu)p(\sigma)\] We thus define the joint prior distribution as the following function:

prior <- function(mu,sigma) {dnorm(mu,mean=-1,sd=1.5) * dunif(sigma,0,10)}

The likelihood function is the product of the individual likelihoods, which in turn are defined by the normal probability density function and a given pair of values for \(\mu\) and \(\sigma\):

\[p(y|\mu,\sigma) = \prod^n_{i=1} N(y_i|\mu,\sigma^2)\] We thus define the likelihood function as follows:

likelihood <- function(y,mu,sigma) {prod(dnorm(y,mean=mu,sd=sigma))}

Recall that the posterior is proportional to the product of the likelihood and the prior: \[p(\mu,\sigma|y) \propto p(y|\mu,\sigma)p(\mu,\sigma)\] We thus define the unnormalized posterior as the following function:

unnorm.posterior <- function(y,mu,sigma) {likelihood(y,mu,sigma) * prior(mu,sigma)}

Setting up a Metropolis sampler

We are now able to calculate the unnormalized posterior probability for any given combination of relevant parameter values. But now what? For a high-dimensional problem (which we admittedly do not have here), Markov chain Monte Carlo (MCMC) sampling may be the best option for approximating the posterior distribution. MCMC can simulate draws from any distribution as long as we can provide a function whose output is proportional to the density of the relevant target distribution. That is typically the situation in Bayesian inference, as the unnormalized posterior is proportional to the actual posterior distribution. In this setting, MCMC methods will produce draws that converge to the target distribution (i.e. the posterior) as the number of draws increases.

The Metropolis algorithm is a straightforward example of MCMC and thus useful for understanding the basics. The algorithm runs through a number of iterations to create a sequence of parameter values, where each set of values depend on the previous set. More specifically, it takes a set of starting values (one per parameter), and proposes a set of new values by adding a random innovation drawn from a specified function (we will use a normal distribution). The proposal set can then either be accepted – and replace the starting values – or be rejected – so that the starting values are retained. Then the next iteration follows, based on whichever set of values was retained in the previous iteration.

The probability that a proposal set is accepted depends on a comparison of the unnormalized posterior for the set of current values and for the set of proposed values. If the proposal set has a higher unnormalized posterior, it will always be accepted. If it does not, it may still be accepted, but the probability of acceptance declines linearly with the ratio of the value for the proposal set over the value for the current set: If, for instance, the unnormalized posterior probability for the proposal set is half of that for the current set, then the proposal set has an acceptance probability of .5. The result is that the sampler (once the chain has reached its equilibrium) has acceptance probabilities equal to the target distribution, producing draws from the posterior distribution.

To start setting up the algorithm, let us first define our proposal function, which takes a pair of parameter values as input, and returns a new pair after adding random innovations. The innovations are drawn from a normal distribution with a mean of zero and a specified standard deviation (proposal.sd):

proposals <- function(mu,sigma,proposal.sd) {c(mu,sigma) + rnorm(2,0,proposal.sd)}

Now, let us set up the algorithm as a function running for a specified set of iterations (n.it), using a given proposal standard deviation, a set of starting values, and a set of data (y):

metropolis <- function(y,n.it,proposal.sd,starting.values) {
  sample <- vector()        
  current.draw <- starting.values
  for (it in 1:n.it) {      
    proposal <- proposals(current.draw[1],current.draw[2],proposal.sd)
    if (proposal[2] < 0) {acceptance.ratio = 0} # Reject if sigma < 0 (as prior = 0)
    else {acceptance.ratio <- (unnorm.posterior(y,proposal[1],proposal[2]) / 
                                 unnorm.posterior(y,current.draw[1],current.draw[2]))}
    random.threshold <- runif(1,0,1) # Threshold for accepting when ratio < 1
    if (acceptance.ratio >= random.threshold) {current.draw <- proposal}
    sample <- rbind(sample,current.draw) 
  } 
  return(sample)
}

Obtaining simulation draws

Let us see if the algorithm succeeds in sampling from the posterior density. First, we define a starting value for each of the parameters, and in this case we will draw these from our prior distributions:

starting.values <- c(rnorm(1,-1,1.5),runif(1,0,10)) # The first is mu, the second sigma

We will run the sampler for 2500 iterations, and give the innovations a standard deviation of 1:

n.it = 2500
draws <- metropolis(y,n.it,proposal.sd=1,starting.values=starting.values)

This gives us 2500 draws for each of the parameters, and it is useful to plot these draws over their draw-number as a sequence:

# For basic versions of these plot, run: plot(1:n.it,draws[,1],type="l"); plot(1:n.it,draws[,2],type="l") 

The plots above are encouraging, as the sequences both look stationary and highly variable. This suggests that the proposal distribution is sufficiently wide to move the sampler quickly across the parameter space, while the proposals still have a reasonable chance of being accepted.

Marginal distributions

Ultimately, we are interested in drawing conclusions regarding the parameter values, so we want to assess the marginal posterior distributions. As noted in the previous post, the marginal posterior for \(\mu\) is defined as: \[ p(\mu|y) = \int p(\mu,\sigma |y)d(\sigma)\] The MCMC approach makes it easy to approximate this distribution, as we can simply by ignore the other parameter(s) when summarizing the draws for the parameter we are interested in. In other words, to approximate the marginal distribution for \(\mu\), we can just estimate the density of its draws. Now, as we are using the same data and model as in the previous post, we can also compare the densities of our draws to the previously calculated densities (here represented by dark red lines):

That looks pretty good! And if we wanted greater accuracy, we could just increase the number of draws.

# For basic versions of these plot, run this (after also running the codes for the previous post): plot(density(draws[,1]));lines(mu.h,mu.marginal.posterior/res); plot(density(draws[,2]));lines(sigma.h,sigma.marginal.posterior/res)

What could go wrong?

Let us consider some potential issues that could arise. First, it should be noted that it is always a good idea to run the sampler more than once, as it helps diagnosing potential problems. Here, we will run the sampler twice, which gives us two sequences of draws for each parameter. Normally one would want to run one chain on each processor core, but we will keep it simple and one run after the other.

Too large step size

For the sake of illustration, let us run the sampler with a much larger step size, increasing the standard deviation of the proposal distribution to 20:

n.it = 2500
chain1 <- metropolis(y,n.it,proposal.sd=20,starting.values=c(rnorm(1,-1,1.5),runif(1,0,10)))
chain2 <- metropolis(y,n.it,proposal.sd=20,starting.values=c(rnorm(1,-1,1.5),runif(1,0,10)))
draws.2 <- data.frame(1:n.it,chain1,chain2)

The results are as follows:

What is going on here? The wide proposal distribution is giving us a large number of proposals that have very low or zero posterior probability, which results in a very high rejection rate (about 99%). This is visible as flat lines in the plot. In practice, we will need a greater number of draws to get an acceptable approximation of the posterior distribution.

# For basic versions of the plots above, run: plot(1:n.it,draws.2[,2],type="l",ylim=c(min(draws.2[,c(2,4)]),max(draws.2[,c(2,4)]))); lines(1:n.it,draws.2[,4]); plot(1:n.it,draws.2[,3],type="l",ylim=c(min(draws.2[,c(3,5)]),max(draws.2[,c(3,5)]))); lines(1:n.it,draws.2[,5])

Too small step size

When the situation above occurs, a natural response is to make the proposal distribution narrower. Let us see what happens if we go a bit far and reduce the standard deviation of the innovations to 1/100:

n.it = 2500
chain1 <- metropolis(y,n.it,proposal.sd=.01,starting.values=c(rnorm(1,-1,1.5),runif(1,0,10)))
chain2 <- metropolis(y,n.it,proposal.sd=.01,starting.values=c(rnorm(1,-1,1.5),runif(1,0,10)))
draws.3 <- data.frame(1:n.it,chain1,chain2)

Now we get the following results:

What do we see here? The small step size means the chain is moving very slowly around in the parameter space. It also gives a very high acceptance rate (about 97%), because few proposals are notably worse than the draws they compete with. With this set-up, we would need an extremely high number of iterations for the chains to converge and approximate the posterior distribution.

# For basic versions of the plots above, run: plot(1:n.it,draws.3[,2],type="l",ylim=c(min(draws.3[,c(2,4)]),max(draws.3[,c(2,4)]))); lines(1:n.it,draws.3[,4]); plot(1:n.it,draws.3[,3],type="l",ylim=c(min(draws.3[,c(3,5)]),max(draws.3[,c(3,5)]))); lines(1:n.it,draws.3[,5]) 

Final notes

In this example, it easy to set a step size that makes the Metropolis algorithm work well. In more complicated settings, with a large number of potentially correlated parameters, the algorithm may be less well suited, and more advanced samplers (e.g. Hamiltonian MC/NUTS) may be preferable. It should also be noted that there are also several topics I have left aside here. The present example is so limited that we could keep things simple and not work with log probabilities. In my next post, I expand the sampler to analyze a hierarchical dataset, which requires using logs to avoid computational underflows. Other issues I have left aside include the calculation of convergence statistics and effective simulation draws, as well as the fact that these algorithms typically require a warm-up period – with appropriate step size, there is hardly need for one in this example. If you find any mistakes in this post, please let me know!

Further reading

Gelman, Andrew, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari and Donald B Rubin. 2014. Bayesian Data Analysis. 3rd ed. London: Chapman & Hall/CRC Press.

McElreath, Richard. 2016. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. London: Chapman & Hall/CRC Press.


comments powered by Disqus