R on Jørgen Bølstad

How Efficient is Stan Compared to JAGS?

j.b [AT] post.harvard.edu (Jørgen Bølstad) — Wed, 02 Jan 2019 00:00:00 +0000

For a good while JAGS was your best bet if you wanted to do MCMC on a distribution of your own choosing. Then Stan came along, potentially replacing JAGS as the black-box sampler of choice for many Bayesians. But how do they compare in terms of performance? The obvious answer is: It depends. In fact, the question is nearly impossible to answer properly, as any comparison will be conditional on the data, model specifications, test criteria, and more. Nevertheless, this post offers a small simulation study.

The approach taken here is to compare the models in terms of seconds required to produce an effective simulation draw for the parameter with the least effective draws. Simulation time is taken to exclude compilation time, but include adaptation and sufficient warm-up for the chains to get into the typical set. The samplers are tested in a hierarchical setting, using six different models. The models differ in terms of pooling, conjugacy, and centered parameter specifications. The results are roughly as expected: JAGS exhibits both blazing fast and incredibly slow sampling, while Stan delivers somewhat more stable performance, being relatively efficient also in harder scenarios.

Data

This mini-study generates hierarchical datasets, containing \(m\) noisy observations for each of \(n\) individuals. For a given individual \(i \in \{1, \ldots, n\}\) and trial number \(j \in \{1, \ldots, m\}\), we have the observed outcome \(y_{ij}\). In addition to \(y\), the data contain the covariate \(x\), which also varies by \(i\) and \(j\). The outcome \(y\) will have normally distributed, homoskedastic errors, and depend linearly on \(x\), according to the following process: \(y_{ij} \sim N(\alpha_i + \beta_i x_{ij},\sigma^2)\). Each individual will have its own true intercept, \(\alpha_i\), as well as its own coefficient on \(x\), \(\beta_i\). These individual-specific parameters are in turn drawn from their own distributions: \(\alpha \sim N(2,1)\), and \(\beta \sim N(-2,1)\). The code to define the data-generating process (and load necessary packages) is:

library(rstan)
library(rjags)
library(MASS)   # To sample from multivariate normal
library(dclone) # To run JAGS in parallel
set.seed(1)
create.data <- function(n = 500, m = 5, covar = 0, xmean = 0) {
  Sigma <- matrix(c(1, covar, covar, 1), 2, 2)
  coefs <- mvrnorm(n, mu = c(2, -2), Sigma)
  x <- matrix(rnorm(n * m, xmean, 1), ncol = m)
  y <- matrix(NA, n, m)
  for (i in 1:n) { for (j in 1:m) { y[i, j] <- rnorm(1, coefs[i, 1] + coefs[i, 2] * x[i, j], 2) } }
  dat <- list(n = n, m = m, x = x, y = y)
  return(dat)
}

Posterior parameter correlations

A key issue for MCMC samplers is how strong posterior correlations there are between the parameters, which in turn depends on both the data and the model specification. Strong correlations mean the parameters are hard to separate, posing a more difficult task for the samplers. JAGS, which relies on either Gibbs sampling or Metropolis-Hastings, is likely to suffer under such circumstances, while Stan might be expected perform better. With the present data, we can minimize posterior correlations by centering \(x\). Conversely, we can create correlations by moving its mean away from zero. The larger the mean relative to the standard deviation, the harder it will be to separate the constants from the coefficients. The tests below entail two scenarios that differ in this respect. One scenario with “weak correlations”: \(x_{ij} \sim N(0,1)\), and one with “strong correlations”: \(x_{ij} \sim N(2,1)\).

Models

We will use a likelihood function that accurately reflects the data-generating process. In other words, we have:

\[p(y|x,\alpha,\beta,\sigma) = \prod^n_{i=1}\prod^m_{j=1} N(y_{ij}|\alpha_i + \beta_i x_{ij}, \sigma^2)\]

We will test five different models to see how the results differ across key specification choices. The main choices are: whether to do some pooling of individual coefficients, whether to use conjugate priors, and whether to use non-centered parameter specifications.

Hierarchical, weakly informative, fully conjugate (centered and non-centered)

We will start with a model doing partial pooling, assuming the \(\alpha\)’s and \(\beta\)’s are drawn from common population distributions: \[\alpha_i \sim N(\mu_\alpha,\sigma_\alpha^2), ~~~ \beta_i \sim N(\mu_\beta,\sigma_\beta^2)\] This model is fully conjugate, with normal priors on the means and inverse Gamma priors on the variances. (Note that the last parameter in the inverse gamma here is a scale and not a rate parameter.) The hyperpriors and the prior on \(\sigma\) are set to be weakly informative: \[\mu_\alpha,\mu_\beta \sim N(0,3^2), ~~~ \sigma_\alpha^2,\sigma_\beta^2,\sigma^2 \sim \text{Inv-Gamma}(1.5,2)\]

In some situations (with little data and/or correlated parameters), one can achieve more efficient and accurate sampling by using a non-centered parameterization (see Papaspiliopoulos et al. 2007, Betancourt and Girolami 2013). For the \(\alpha\)-parameters, this can be done by specifying \(\alpha_i = \mu_\alpha + \sigma_\alpha \alpha^*_i\), where \(\alpha^*_i \sim N(0,1)\). The tests below entail both a centered and non-centered version of the conjugate model. (It is tricky to do non-centering without breaking JAGS’ ability to exploit the conjugacy, so the non-centered JAGS version only separates out the common mean).

Hierarchical, weakly informative, partly conjugate (centered and non-centered)

The next model is very similar, but only partly conjugate, as the inverse Gamma priors on the variances are replaced with gamma priors on the standard deviations. The new priors are set to be approximately as informative as the inverse Gamma’s (although they are necessarily different). The model is then:

\[\alpha_i \sim N(\mu_\alpha,\sigma_\alpha^2), ~~~ \beta_i \sim N(\mu_\beta,\sigma_\beta^2)\]

\[\mu_\alpha,\mu_\beta \sim N(0,3^2), ~~~ \sigma_\alpha,\sigma_\beta,\sigma \sim \text{Gamma}(2,.5)\]

(Note that the last parameter in the gamma is a rate parameter, i.e. inverse scale).

Hierarchical, weakly informative, non-conjugate, centered

The next model is completely non-conjugate, replacing the normal distributions with student-t distributions with 100 degrees of freedom. This is practically the same as the normal, but should prevent JAGS from using Gibbs sampling.

\[\alpha_i \sim \text{Student-t}(100,\mu_\alpha,\sigma_\alpha^2), ~~~ \beta_i \sim \text{Student-t}(100,\mu_\beta,\sigma_\beta^2)\]

\[\mu_\alpha,\mu_\beta \sim \text{Student-t}(100,0,3^2), ~~~ \sigma_\alpha,\sigma_\beta,\sigma \sim \text{Gamma}(2,.5)\]

Unpooled with uniform priors

The final model is an unpooled model with uniform priors on all parameters. This model does not make too much sense here (as we know the individual parameters are drawn form a common distribution), but it may serve to illustrate performance in more difficult circumstances. The model is potentially challenging in that there is no conjugacy, no pooling of information across units, and no other help from the priors, which results in larger posterior variances (and potentially covariances): \[\alpha_i,\beta_i \sim U(-100,100), ~~~ \sigma \sim U(0,100)\]

This model only takes a few lines of Stan code, as we get uniform priors by not specifying anything (except relevant limits).

Approach

A key question is how to compare the samplers in way that is both fair and relevant for actual applications. As MCMC draws tend to be auto-correlated, the nominal number of simulation draws is less interesting than the number of effective draws (adjusting for auto-correlation). Furthermore, if we decide on a lower limit for how many effective draws we require, we will most likely want to apply this requirement to all parameters. In other words, the key question is how efficiently we are sampling the parameter with the fewest effective draws (after the chains have found the typical set). Finally, the most relevant measure of efficiency is probably time per effective draw.

However, it is not clear what is the most relevant specification of time-consumption either. Stan needs to compile models before running, while JAGS does not. Fortunately, Stan models only need to be compiled once, and it usually takes a minute or two. If you are using an already-compiled model, there is no loss of time. Similarly, if your model is sufficiently complicated to require a long run, compilation time is essentially irrelevant. There is really only one situation in which compilation-time counts, and that is when you are developing a new model, testing it on a small dataset to make things faster, but still having to re-compile for every little change you to make. Given the points above, however, I think it is most reasonable to compare run-times without including compilation-time.

Another issue is whether to include adaptation and warm-up time. I think it is reasonable to include this, as these phases are typically required for any analysis. If a sampler takes very long to converge to the target distribution, this is a relevant cost of using it. Including these phases may, however, introduce some inaccuracy: As we take more post-warm-up draws, the share of time spent on warm-up decreases. This makes the results depend on the length of the post-warm-up run, which is clearly not ideal, but I am still sticking to this approach for now. (I suppose the best approach might be to set the run-length for each model aiming for a specific minimum number of effective draws for all parameters, but this might take extremely long in the slowest instances.)

The models have been fit to 200 datasets for each of the two posterior-correlation scenarios. The datasets have been kept fairly small: \(n = 250\) and \(m=5\). The patterns reported below seem to remain when I scale up the data, but further testing may be called for (as the speed of JAGS may suffer more as the scale increases). For each software, I use three out of four available processor cores, and run one parallel chain on each core. To calculate split Rhats and the number of effective draws, I am using the specifications from Gelman et al. 2014 (BDA3). Setting the length of the warm-up phase is a bit tricky, as a longer warm-up is required for the more challenging combinations of data and models. By trial and error, I have set specific warm-up lengths for each model in each scenario so that the models typically converge. In the most challenging situations, the models will sometimes still not fully converge in time, and in these instances the results are not reported. (This may slightly favor models that are performing poorly, but it should not change the general patterns.)

Results

The results are shown on a logged scale in the boxplot below. The scenario with weak posterior correlations offers ideal conditions, and when JAGS is given a fully conjugate hierarchical model, it is sampling very fast here – showing the fastest performance in this test. Stan is also fast in this situation, but does take about 3 times as long JAGS. The non-centered parameterization appears to slow down both samplers slightly, suggesting it adds to the computational load without much gain. When we move to the partly conjugate models, JAGS gets notably slower, and about as fast as Stan, whether we use the centered or non-centered specification.

Moving from a partly conjugate to completely non-conjugate model (replacing normals with t-distributions), makes JAGS about three times slower still, while Stan retains its performance, being several times faster. Turning to the unpooled models with uniform priors, Stan gets slightly faster, while JAGS gets a bit slower – now sampling nearly five times slower than Stan.

As the posterior correlations increase, the patterns are similar, but with some notable changes: JAGS gets comparatively slower for the fully conjugate models, but still delivers some of the fastest performance. It is now on par with Stan’s performance for the non-centered specification. If we look at the partly conjugate models, non-centering speeds up Stan seven-fold in this scenario, while JAGS sees no such benefit. Finally, for the unpooled model with uniform priors, Stan is pretty fast, while JAGS is very slow – in fact, JAGS is 65 times slower than Stan here.

In sum, JAGS’s performance varies a lot: It is about 765 times faster using a fully conjugate, centered model in the easy scenario than using an unpooled model in the harder scenario. For Stan, the equivalent factor is 4.

Another question is how well the samplers are exploring the target distribution. The paper on NUTS by Hoffman and Gelman (2014) shows that NUTS can perform much better than Gibbs sampling and Random Walk Metropolis in exploring difficult posteriors with correlated parameters. At the most extreme, the authors use a highly correlated 250-dimensional distribution, whereas the data used here mainly entail pairwise parameter correlations (between pairs of \(\alpha\)’s and \(\beta\)’s, and between their hyperparameters in the partially pooled models). The plots below show 1,000 randomly selected draws of \(\alpha_1\) and \(\beta_1\) from a single trial using unpooled models with uniform priors. Visually, the samples produced by JAGS and Stan are very similar here. It is just that JAGS takes much longer to converge and produce effective draws.

Final notes

A key lesson to draw from these results is that model specification matters. For instance, a non-centered specification can be of great help to Stan, and when this is the case, there is no reason not to go this route. For JAGS, it is obviously a good idea to use conjugate priors wherever possible, and thus avoid turning JAGS into Just Another Metropolis Sampler. Another point to note is that centering the predictor \(x\) would have turned the scenario with high posterior correlations into one with low correlations – we could actually have avoided the posterior correlations altogether.

In addition to efficiency, there are other considerations that may be relevant for choosing between JAGS and Stan. Stan has a large development team and a very active user group, which makes it easy to solicit expert advice. Stan is also not reliant on a specific type of priors (conjugacy) and can handle a wide range of models. Furthermore, Stan is likely explore complicated posterior distributions better, and thus give more accurate results (see references below). While JAGS can be very fast for certain well-suited problems, Stan may offer the best all-round performance, especially if you are analyzing large datasets using complicated models.

How to cite

This material can be cited as follows:

Bølstad, Jørgen. 2019. “How Efficient is Stan Compared to JAGS? Conjugacy, Pooling, Centering, and Posterior Correlations”. Playing with Numbers: Notes on Bayesian Statistics. www.boelstad.net/post/stan_vs_jags_speed/.

Here is a BibTex-formatted reference, which should work when using the natbib and hyperref packages together with a suitable Latex style:

@Misc{boelstad:2019a,
  Author = {B{\o}lstad, J{\o}rgen},
  Title = {How Efficient is Stan Compared to JAGS? Conjugacy, Pooling, Centering, and Posterior Correlations},
  Howpublished = {Playing with Numbers: Notes on Bayesian Statistics},
  Url = {\url{http://www.boelstad.net/post/stan_vs_jags_speed/}},
  Year = {2019},
  Month = {January 2}}

Bayesian Hierarchical Modeling

j.b [AT] post.harvard.edu (Jørgen Bølstad) — Wed, 08 Aug 2018 00:00:00 +0000

I used to think so-called multilevel models were a little boring. I was interested in causal inference, and the people using these models did not seem to have better causal identification strategies than those running plain old regressions. I have gradually come to change my mind on these models, although it is not because I think they solve challenges of causal identification. It is rather because I think a large share of our data can be thought of as hierarchical, and that proper modeling help us make the most of such data.

This post illustrates the benefits of Bayesian hierarchical modeling, by expanding the Metropolis sampler from my previous post to deal with more parameters. I will refer to hierarchical rather than multilevel models, as this highlights the use of hierarchical priors. The key advantage of the hierarchical approach is that it uses information across groups of observations to reduce our lower-level parameters’ sensitivity to noise. To demonstrate how this improves our estimates, this post compares a hierarchical model to a so-called unpooled model. (If you are unfamiliar with the basics of Bayesian inference, my earlier posts may be a better start.)

Hierarchical data

Hierarchical modeling is relevant when we have observations that are somehow grouped. It could be that we observe several individuals within different states, or that we have multiple observations per individual. In standard econometric terms, this will often give rise to autocorrelation and heteroskedasticity, calling for estimation of group-specific means and variances. In a regression setting, one might additionally estimate how coefficients vary between groups.

Here, we will create a small hierarchical dataset, containing 5 noisy observations for each of 75 individuals (i.e. 375 observations in total). For a given individual \(i \in \{1, \ldots, n\}\) and trial number \(j \in \{1, \ldots, m\}\) we will denote the observed outcome \(y_{ij}\). In addition to \(y\), our data will contain the covariate \(x\), which also varies by \(i\) and \(j\). The outcome \(y\) will have normally distributed (and homoskedastic) errors, and depend linearly on \(x\), according to the following process: \(y_{ij} \sim N(\alpha_i + \beta_i x_{ij},\sigma^2)\). A key point here is that each individual will have its own true intercept, \(\alpha_i\), as well as its own coefficient on \(x\), \(\beta_i\). These individual-specific parameters are in turn drawn from their own distributions: \(\alpha \sim N(2,1)\), and \(\beta \sim N(-2,1)\).

set.seed(1)
n = 75 # Number of individuals
m = 5  # Number of observations for per individual
alpha <- rnorm(n, 2, 1) # True individual-specific intercepts
beta <- rnorm(n, -2, 1) # True individual-specific coefficients
x <- matrix(rnorm(n * m, 0, 1), ncol = m)
y <- matrix(NA, n, m)
for (i in 1:n) { 
  for (j in 1:m) { 
    y[i, j] <- rnorm(1, alpha[i] + beta[i] * x[i, j], 2) } }

The likelihood

We will use a likelihood function that accurately reflects the data-generating process. In other words, we have:

\[p(y|x,\alpha,\beta,\sigma) = \prod^n_{i=1}\prod^m_{j=1} N(y_{ij}|\alpha_i + \beta_i x_{ij}, \sigma^2)\]

However, as we increase the number of observations, some of these products will get exceedingly small, so we will need to work with log-densities to prevent computational underflows. Taking the log of the densities means we will sum the point-wise log-probabilities: \[\text{ln}(p(y|x,\alpha,\beta,\sigma)) = \sum^n_{i=1}\sum^m_{j=1} \text{ln}(N(y_{ij}|\alpha_i + \beta_i x_{ij}, \sigma^2))\]

In order to keep track of the parameters when we pass their values between functions, we will pass them as a named list containing both single parameters and vectors of parameters. We will also use apply for somewhat efficient looping over observations:

log.likelihood <- function(y, x, pars) { 
  # Loop over each row in data and its according alpha- and beta-parameter:
  sum(log(apply(cbind(y, x, pars$alpha, pars$beta), 1, function(mat_i) { 
    # Loop over each observation in this row (i.e. individual):
    apply(cbind(mat_i[1:m], mat_i[(m + 1):(2 * m)]), 1, function(dat_ij) { 
      # Compute the probability density for observation ij:
      dnorm(dat_ij[1], mat_i[(2 * m) + 1] + mat_i[(2 * m) + 2] * dat_ij[2], pars$sigma) 
    })
  }))) 
}

Priors for hierarchical data

The key question is how we model the \(\alpha\)- and \(\beta\)-parameters. A simple option is to do complete pooling, assuming the \(\alpha\)’s and \(\beta\)’s are equal for all individuals, and estimating a single \(\alpha\) and a single \(\beta\). This is similar to running a single regression on the whole dataset, which is not particularly interesting, so we will leave this option aside here.

No pooling

At the other extreme, we can give each individual their own \(\alpha\)- and \(\beta\)-parameter and place wide, flat priors on each: \(\alpha_i \sim U(-100,100)\), and \(\beta_i \sim U(-100,100)\). Models of this kind are often referred to as no pooling models, and in our case, this is similar to running a separate regression for each individual. If also add a flat prior on the variance, we get the following prior distribution:

\[\begin{aligned}p(\alpha,\beta,\sigma) = &\prod^n_{i=1}\big[U(\alpha_i|-100,100) \times U(\beta_i|-100,100)\big] \times U(\sigma|0,10)\end{aligned}\]

However, we will be using the log-prior:

\[\begin{aligned}\text{ln}(p(\alpha,\beta,\sigma)) = &\sum^n_{i=1}\big[\text{ln}(U(\alpha_i|-100,100)) + \text{ln}(U(\beta_i|-100,100))\big] + \text{ln}(U(\sigma|0,10))\end{aligned}\] In R, we have:

log.prior.1 <- function(pars) {
  sum(log(dunif(pars$alpha, -100, 100))) + 
  sum(log(dunif(pars$beta, -100, 100))) + 
  log(dunif(pars$sigma, 0, 10)) }

Partial pooling

Between the extremes of complete pooling and no pooling, we have the option of partial pooling, which is typically what is meant by hierarachical modeling. Here, we start from the assumption that the parameters come from a common population distribution – which is typically self-evident. While we could specify a certain prior representing our beliefs regarding population distribution, there is a better option. If we choose a parametric density function – assuming, for instance, that the population follows a normal distribution – then we can let the parameters of that distribution be estimated from the data. Instead of fixing those hyperparameters, we give them their own, higher-level priors – hyperpriors. An interesting point about the hyperparameters is that they do not change the likelihood function – they only influence the posterior through the prior on the lower-level parameters.

In our case, we will assume the \(\alpha\)’s and \(\beta\)’s come from two distinct normal distributions: \(\alpha_i \sim N(\mu_\alpha,\sigma_\alpha^2)\), and: \(\beta_i \sim N(\mu_\beta,\sigma_\beta^2)\). We further add hyperpriors on the population parameters: \(\mu_\alpha, \mu_\beta \sim N(0,3^2)\), and: \(\sigma_\alpha, \sigma_\beta \sim U(0,10)\). For the variance, we use the same flat prior as above: \(\sigma \sim U(0,10)\). Our joint prior distribution is then:

\[\begin{aligned}p(\alpha,\beta,\mu_\alpha,\mu_\beta,\sigma_\alpha,\sigma_\beta,\sigma) = &\prod^n_{i=1}\big[N(\alpha_i|\mu_\alpha,\sigma_\alpha^2) \times N(\beta_i|\mu_\beta,\sigma_\beta^2)\big]~\times \cr[3ex] &N(\mu_\alpha|0,3^2) \times N(\mu_\beta|0,3^2)~\times \cr[3ex] &U(\sigma_\alpha|0,10) \times U(\sigma_\beta|0,10) \times U(\sigma|0,10)\end{aligned}\]

And the log-prior is:

\[\begin{aligned}\text{ln}(p(\alpha,\beta,\mu_\alpha,\mu_\beta,\sigma_\alpha,\sigma_\beta,\sigma)) = &\sum^n_{i=1}\big[\text{ln}(N(\alpha_i|\mu_\alpha,\sigma_\alpha^2))+\text{ln}(N(\beta_i|\mu_\beta,\sigma_\beta^2))\big]~+\cr[3ex] &\text{ln}(N(\mu_\alpha|0,3^2))+\text{ln}(N(\mu_\beta|0,3^2))~+\cr[3ex] &\text{ln}(U(\sigma_\alpha|0,10))+\text{ln}(U(\sigma_\beta|0,10))+\text{ln}(U(\sigma|0,10))\end{aligned}\]

Writing this up as a function in R, we have:

log.prior.2 <- function(pars) {
  sum(log(dnorm(pars$alpha, pars$mu.alpha, pars$sigma.alpha))) + 
  sum(log(dnorm(pars$beta, pars$mu.beta, pars$sigma.beta))) +
  log(dnorm(pars$mu.alpha, 0, 3)) + log(dnorm(pars$mu.beta, 0, 3)) + 
  log(dunif(pars$sigma.alpha, 0, 10)) + log(dunif(pars$sigma.beta, 0, 10)) + 
  log(dunif(pars$sigma, 0, 10)) }

Posterior distributions

We will be working with the unnormalized log-posterior density, which is the sum of the log-likelihood and the log-prior. For the unpooled model, we have:

\[\text{ln}(p(\alpha,\beta,\sigma|y,x) \propto \text{ln}(p(y|x,\alpha,\beta,\sigma))+ \text{ln}(p(\alpha,\beta,\sigma)) \]

log.unnorm.posterior.1 <- function(y, x, pars) { 
  log.likelihood(y, x, pars) + log.prior.1(pars) }

For the hierarchical model, we have:

\[\text{ln}(p(\alpha,\beta,\mu_\alpha,\mu_\beta,\sigma_\alpha,\sigma_\beta,\sigma|y,x) \propto \text{ln}(p(y|x,\alpha,\beta,\sigma))+ \text{ln}(p(\alpha,\beta,\mu_\alpha,\mu_\beta,\sigma_\alpha,\sigma_\beta,\sigma)) \]

log.unnorm.posterior.2 <- function(y, x, pars) { 
  log.likelihood(y, x, pars) + log.prior.2(pars) }

Setting up the sampler

To estimate our parameters we will use a slightly modified version of the Metropolis sampler that was introduced in the previous post. Most importantly, we will make the code a bit more general, so that it can be applied to an arbitrary number of parameters. Our proposal function now takes a list of parameters – or vectors of parameters – and returns a list with the same dimensions:

proposals <- function(pars, proposal.sd) {
  sapply(pars, function(x) { sapply(x, function(x) { x + rnorm(1, 0, proposal.sd) }) }) }

To use log-probabilities as long as possible, we now calculate the acceptance ratio as a difference in log-probabilities before exponentiating. We also add a counter of accepted proposals to calculate the sampler’s average acceptance rate. The sampler code now looks as follows:

metropolis <- function(n.it,proposal.sd, y, x, log.unnorm.posterior, starting.values) {
  draws <- vector()
  n.accept <- 0
  current.draw <- starting.values
  for (t in 1:n.it) {
    proposal <- proposals(current.draw, proposal.sd)
    if (sum(c(proposal$sigma.alpha, proposal$sigma.beta, proposal$sigma) <= 0) > 0) {
      acceptance.ratio = 0 }
    else {acceptance.ratio <- exp(log.unnorm.posterior(y, x, proposal) - 
      log.unnorm.posterior(y, x, current.draw)) }
    random.threshold <- runif(1, 0, 1)
    if (random.threshold <= acceptance.ratio) {
      current.draw <- proposal
      n.accept <- n.accept + 1 }
    draws <- rbind(draws, unlist(current.draw)) }
  return(list(draws = draws, accept.rate = n.accept / n.it)) }

Obtaining estimates

Now, let us obtain 10,000 draws for each of our two specifications (which may take a while):

n.it <- 10000
# Unpooled model:
starting.values.1 <- list(alpha=rnorm(n, 2, .75), beta = rnorm(n, -2, .75), 
                          sigma = runif(1, 1.75, 2.25))
results.1 <- metropolis(n.it, proposal.sd = .125, y, x, 
                        log.unnorm.posterior.1, starting.values.1)
# Hierarchical model:
starting.values.2 <- list(alpha = rnorm(n, 2, .75), beta = rnorm(n, -2, .75),
                          mu.alpha = rnorm(1, 2, .25), mu.beta = rnorm(1, -2, .25),
                          sigma.alpha = runif(1, .75, 1.25), 
                          sigma.beta = runif(1, .75, 1.25),
                          sigma = runif(1, 1.75, 2.25))
results.2 <- metropolis(n.it, proposal.sd = .1, y, x, 
                        log.unnorm.posterior.2, starting.values.2)

We will drop the first 2000 draws to allow the sampler to find its equilibrium distribution. (I am otherwise leaving the issue of convergence aside here, so you will just have to trust me.)

warmup <- n.it / 5
draws.subset.1 <- results.1$draws[(warmup + 1):n.it, ]
draws.subset.2 <- results.2$draws[(warmup + 1):n.it, ]

Finally, we obtain point estimates from the models by calculating posterior means:

unpooled.est <- colMeans(draws.subset.1)
hierarchical.est <- colMeans(draws.subset.2)

Comparing the models

Asymptotically, the unpooled and the hierarchical model will yield the same result: With an infinite number of observations per individual, the likelihood will prevail anyhow (as long as the prior does not rule out the true parameter values). However, the typical situation for hierarchical analysis is one where we have few observations per cluster (i.e. individual in our case), and then these choices matter. To illustrate the differences, the figure below plots the posterior means for the \(\alpha\)’s and \(\beta\)’s for each model. The estimates are shown in dark blue, while the true values are in lighter transparent blue:

One might think the hierarchical model has more informative priors on the lower-level parameters, but this would not be entirely correct. The uniform priors of the unpooled models are not “uninformative” – in fact, they attribute significant prior probability to areas of the parameter space that we do not truly consider plausible. In other words, these priors can be seen as both informative and implausible. The consequence of the flat priors of the unpooled model is to pull the estimates away from each other. In frequentist terms, these estimates are unbiased, but have high sampling variance – they are very sensitive to noise, as a few large errors in the same direction are allowed to pull the estimates into implausible regions of the parameter space.

In contrast, the hierarchical model uses information from all individuals to estimate the population distribution from which the lower-level parameters are drawn. Compared to the unpooled model, this prior structure pulls the estimates towards the population distribution – an effect often referred to as shrinkage. This yields estimates that are considerably less sensitive to noise, as we require more observations and a narrower likelihood to pull the posterior probabilities into unusual regions of the parameter space. Accordingly, the plot shows hierarchical estimates that are a lot more concentrated than the unpooled ones.

Is one model better? If our sole criterion is for the lower-level parameters to be unbiased, then we would prefer the unpooled model. But this would typically not be a good choice, as it ignores the sampling variance and thus sensitivity to noise. A straightforward way to assess bias and variance together is to consider mean squared errors (MSE) or their root (RMSE). Here, it is useful to do so across the sets of \(\alpha\)- and \(\beta\)-estimates. For a given set of \(\alpha\)-estimates, this means: \[\text{RMSE}(\hat\alpha) = \sqrt{\frac{1}{n}\sum_{i=1}^n (\hat\alpha_i - \alpha_i)^2}\] The RMSEs for each model are shown in the upper-right corner of the plot above. The values are considerably lower for the hierarchical model, which means this model on average yields estimates that are closer to the true parameter values.

# The RMSEs can be calculated as follows:
# alpha.unpooled <- unpooled.est[1:n]
# sqrt(mean((alpha.unpooled-alpha)^2)) # RMSE, alpha, unpooled model
# alpha.hierarchical <- hierarchical.est[1:n]
# sqrt(mean((alpha.hierarchical - alpha)^2)) # RMSE, alpha, hierarchical model
# beta.unpooled <- unpooled.est[(n + 1):(2 * n)]
# sqrt(mean((beta.unpooled - beta)^2)) # RMSE, beta, unpooled model
# beta.hierarchical <- hierarchical.est[(n + 1):(2 * n)]
# sqrt(mean((beta.hierarchical - beta)^2)) # RMSE, beta, hierarchical model

# For a basic version of the plot above run: par(mfrow=c(1,2)); plot(alpha.unpooled, beta.unpooled, type = "n", xlim = c(-5, 7), ylim = c(-7, 5)); for (i in 1:n) { arrows(alpha.unpooled, beta.unpooled, alpha[i], beta[i], col = "lightgray", length = 0) }; points(alpha, beta, col = "lightgray"); points(alpha.unpooled, beta.unpooled); plot(alpha.hierarchical, beta.hierarchical, type = "n", xlim=c(-5, 7), ylim = c(-7, 5)); for (i in 1:n) { arrows(alpha.hierarchical, beta.hierarchical, alpha[i], beta[i], col = "lightgray", length = 0) }; points(alpha, beta, col = "lightgray"); points(alpha.hierarchical, beta.hierarchical)

Final notes

In this post, I have left aside the issues of MCMC convergence and how to sample efficiently. The Metropolis sampler was used because it is easy to implement manually – not because it is well-suited. For an actual analysis of this kind, it would make more sense to use NUTS (see the next post). It could also be noted that while this post has focused on hierarchical modeling within a Bayesian framework – which is intuitive and straightforward – it is possible to achieve something similar within a frequentist framework. As Gelman notes, “anything Bayesian can be done non-Bayesianly” – it is just not clear why you would want to.

How to cite

This material can be cited as follows:

Bølstad, Jørgen. 2019. “The Benefits of Bayesian Hierarchical Modeling: Comparing Partially Pooled and Unpooled Models in R”. Playing with Numbers: Notes on Bayesian Statistics. www.boelstad.net/post/bayesian_hierarchical_modeling/.

Here is a BibTex-formatted reference, which should work when using the natbib and hyperref packages together with a suitable Latex style:

@Misc{boelstad:2018c,
  Author = {B{\o}lstad, J{\o}rgen},
  Title = {The Benefits of Bayesian Hierarchical Modeling: Comparing Partially Pooled and Unpooled Models in R},
  Howpublished = {Playing with Numbers: Notes on Bayesian Statistics},
  Url = {\url{http://www.boelstad.net/post/bayesian_hierarchical_modeling/}},
  Year = {2018},
  Month = {August 8}}

An Introduction to Markov Chain Monte Carlo Sampling

j.b [AT] post.harvard.edu (Jørgen Bølstad) — Mon, 23 Jul 2018 00:00:00 +0000

It is usually not too difficult to define priors and specify a likelihood function, which means we can calculate the unnormalized posterior for any combination of relevant parameter values. However, that is still insufficient to give us marginal posterior distributions for the parameters of interest. The grid method that was used in the previous post is not feasible for situations with a large number of parameters, and conjugate models with analytical solutions are mainly relevant for a subset of suitable problems. So what can we do? We sample from the posterior distribution!

This post illustrates Markov Chain Monte Carlo sampling by writing a simple Metropolis sampler in R. There are some very efficient MCMC implementations already out there, so the goal of doing this manually is purely educational. (If you are unfamiliar with the basics of Bayesian inference, my previous post may be a better start.)

Defining the prior, likelihood, and posterior

To keep things simple, and to enable comparison, we will use the same data, model, and priors as in the previous post. That means we have three noisy observations of a single individual, which we use to estimate this individual’s latent trait, \(\mu\), as well as the standard deviation of the measurement errors, \(\sigma\).

To get started, we will recreate the data:

set.seed(4)                 # Set seed for reproducibility
n <- 3                      # Set number of observations
y <- rnorm(n, mean = 1, sd = 2) # Generate data

We exploit the fact that our individual is drawn from a population, and let our prior for \(\mu\) reflect the known population distribution: \(p(\mu) \sim N(-1,1.5)\). For \(\sigma\), we use a simple, flat prior (even though other choices may be better): \(p(\sigma) \sim U(0,10)\). The joint prior is the product of these: \[p(\mu,\sigma) = p(\mu)p(\sigma)\] We thus define the joint prior distribution as the following function:

prior <- function(mu, sigma) { dnorm(mu, mean = -1, sd = 1.5) * dunif(sigma, 0, 10) }

The likelihood function is the product of the individual likelihoods, which in turn are defined by the normal probability density function and a given pair of values for \(\mu\) and \(\sigma\):

\[p(y|\mu,\sigma) = \prod^n_{i=1} N(y_i|\mu,\sigma^2)\] We thus define the likelihood function as follows:

likelihood <- function(y, mu, sigma) { prod(dnorm(y, mean = mu, sd = sigma)) }

Recall that the posterior is proportional to the product of the likelihood and the prior: \[p(\mu,\sigma|y) \propto p(y|\mu,\sigma)p(\mu,\sigma)\] We thus define the unnormalized posterior as the following function:

unnorm.posterior <- function(y, mu, sigma) { likelihood(y, mu, sigma) * prior(mu, sigma) }

Setting up a Metropolis sampler

We are now able to calculate the unnormalized posterior probability for any given combination of relevant parameter values. But now what? For a high-dimensional problem (which we admittedly do not have here), Markov Chain Monte Carlo (MCMC) sampling may be the best option for approximating the posterior distribution. MCMC can simulate draws from any distribution as long as we can provide a function whose output is proportional to the density of the relevant target distribution. That is typically the situation in Bayesian inference, as the unnormalized posterior is proportional to the actual posterior distribution. In this setting, MCMC methods will produce draws that converge to the target distribution (i.e. the posterior) as the number of draws increases.

The Metropolis algorithm is a straightforward example of MCMC and thus useful for understanding the basics. The algorithm runs through a number of iterations to create a sequence of parameter values, where each set of values depend on the previous set. More specifically, it takes a set of starting values (one per parameter), and proposes a set of new values by adding a random innovation drawn from a specified function (we will use a normal distribution). The proposal set can then either be accepted – and replace the starting values – or be rejected – so that the starting values are retained. Then the next iteration follows, based on whichever set of values was retained in the previous iteration.

The probability that a proposal set is accepted depends on a comparison of the unnormalized posterior for the set of current values and for the set of proposed values. If the proposal set has a higher unnormalized posterior, it will always be accepted. If it does not, it may still be accepted, but the probability of acceptance declines linearly with the ratio of the value for the proposal set over the value for the current set: If, for instance, the unnormalized posterior probability for the proposal set is half of that for the current set, then the proposal set has an acceptance probability of .5. The result is that the sampler (once the chain has reached its equilibrium) has acceptance probabilities equal to the target distribution, producing draws from the posterior distribution.

To start setting up the algorithm, let us first define our proposal function, which takes a pair of parameter values as input, and returns a new pair after adding random innovations. The innovations are drawn from a normal distribution with a mean of zero and a specified standard deviation (proposal.sd):

proposals <- function(mu, sigma, proposal.sd) { c(mu, sigma) + rnorm(2, 0, proposal.sd) }

Now, let us set up the algorithm as a function running for a specified set of iterations (n.it), using a given proposal standard deviation, a set of starting values, and a set of data (y):

metropolis <- function(y, n.it, proposal.sd, starting.values) {
  sample <- vector()        
  current.draw <- starting.values
  for (it in 1:n.it) {      
    proposal <- proposals(current.draw[1], current.draw[2], proposal.sd)
    if (proposal[2] < 0) { acceptance.ratio = 0 } # Reject if sigma < 0 (as prior = 0)
    else { acceptance.ratio <- (unnorm.posterior(y, proposal[1], proposal[2]) / 
                                 unnorm.posterior(y,  current.draw[1],current.draw[2])) }
    random.threshold <- runif(1, 0, 1) # Threshold for accepting when ratio < 1
    if (acceptance.ratio >= random.threshold) { current.draw <- proposal }
    sample <- rbind(sample, current.draw) 
  } 
  return(sample)
}

Obtaining simulation draws

Let us see if the algorithm succeeds in sampling from the posterior density. First, we define a starting value for each of the parameters, and in this case we will draw these from our prior distributions:

starting.values <- c(rnorm(1, -1, 1.5), runif(1, 0, 10)) # The first is mu, the second sigma

We will run the sampler for 2500 iterations, and give the innovations a standard deviation of 1:

n.it = 2500
draws <- metropolis(y, n.it, proposal.sd = 1, starting.values = starting.values)

This gives us 2500 draws for each of the parameters, and it is useful to plot these draws over their draw-number as a sequence:

# For basic versions of these plots, run: plot(1:n.it, draws[, 1], type = "l"); plot(1:n.it, draws[, 2] , type = "l")

The plots above are encouraging, as the sequences both look stationary and highly variable. This suggests that the proposal distribution is sufficiently wide to move the sampler quickly across the parameter space, while the proposals still have a reasonable chance of being accepted.

Marginal distributions

Ultimately, we are interested in drawing conclusions regarding the parameter values, so we want to assess the marginal posterior distributions. As noted in the previous post, the marginal posterior for \(\mu\) is defined as: \[ p(\mu|y) = \int p(\mu,\sigma |y)d(\sigma)\] The MCMC approach makes it easy to approximate this distribution, as we can simply ignore the other parameter(s) when summarizing the draws for the parameter we are interested in. In other words, to approximate the marginal distribution for \(\mu\), we just estimate the density of its draws. Now, as we are using the same data and model as in the previous post, we can also compare the densities of our draws to the previously calculated densities (here represented by dark blue lines):

That looks pretty good! And if we wanted greater accuracy, we could just increase the number of draws.

# For basic versions of these plot, run this (after also running the codes for the previous post): plot(density(draws[, 1])); lines(mu.h, mu.marginal.posterior / res); plot(density(draws[, 2])); lines(sigma.h, sigma.marginal.posterior / res)

What could go wrong?

Let us consider some potential issues that could arise. First, it should be noted that it is always a good idea to run the sampler more than once, as it helps diagnosing potential problems. Here, we will run the sampler twice, which gives us two sequences of draws for each parameter. Normally one would want to run one chain on each processor core, but we will keep it simple and one run after the other.

Too large step size

For the sake of illustration, let us run the sampler with a much larger step size, increasing the standard deviation of the proposal distribution to 20:

n.it = 2500
chain1 <- metropolis(y, n.it, proposal.sd = 20, 
                     starting.values = c(rnorm(1, -1, 1.5), runif(1, 0, 10)))
chain2 <- metropolis(y, n.it, proposal.sd = 20, 
                     starting.values = c(rnorm(1, -1, 1.5), runif(1, 0, 10)))
draws.2 <- data.frame(1:n.it, chain1, chain2)

The results are as follows:

What is going on here? The wide proposal distribution is giving us a large number of proposals that have very low or zero posterior probability, which results in a very high rejection rate (about 99%). This is visible as flat lines in the plot. In practice, we will need a greater number of draws to get an acceptable approximation of the posterior distribution.

# For basic versions of the plots above, run: plot(1:n.it, draws.2[, 2], type = "l", ylim = c(min(draws.2[, c(2, 4)]), max(draws.2[, c(2, 4)]))); lines(1:n.it, draws.2[, 4]); plot(1:n.it, draws.2[, 3], type = "l", ylim = c(min(draws.2[, c(3, 5)]), max(draws.2[, c(3, 5)]))); lines(1:n.it, draws.2[, 5])

Too small step size

When the situation above occurs, a natural response is to make the proposal distribution narrower. Let us see what happens if we go a bit far and reduce the standard deviation of the innovations to 1/100:

n.it = 2500
chain1 <- metropolis(y, n.it, proposal.sd = .01, 
                     starting.values = c(rnorm(1, -1, 1.5), runif(1, 0, 10)))
chain2 <- metropolis(y, n.it, proposal.sd = .01, 
                     starting.values = c(rnorm(1, -1, 1.5), runif(1, 0, 10)))
draws.3 <- data.frame(1:n.it, chain1, chain2)

Now we get the following results:

What do we see here? The small step size means the chain is moving very slowly around in the parameter space. It also gives a very high acceptance rate (about 97%), because few proposals are notably worse than the draws they compete with. With this set-up, we would need an extremely high number of iterations for the chains to converge and approximate the posterior distribution.

# For basic versions of the plots above, run: plot(1:n.it, draws.3[, 2], type = "l", ylim = c(min(draws.3[, c(2, 4)]), max(draws.3[, c(2, 4)]))); lines(1:n.it, draws.3[, 4]); plot(1:n.it, draws.3[, 3], type = "l", ylim = c(min(draws.3[, c(3, 5)]), max(draws.3[, c(3, 5)]))); lines(1:n.it, draws.3[, 5])

Final notes

In this example, it easy to set a step size that makes the Metropolis algorithm work well. In more complicated settings, with a large number of potentially correlated parameters, the algorithm may be less well suited, and more advanced samplers (e.g. Hamiltonian MC/NUTS) may be preferable. It should also be noted that there are also several topics I have left aside here. The present example is so limited that we could keep things simple and not work with log probabilities. In my next post, I expand the sampler to analyze a hierarchical dataset, which requires using logs to avoid computational underflows. Other issues I have left aside include the calculation of convergence statistics and effective simulation draws, as well as the fact that these algorithms typically require a warm-up period – with appropriate step size, there is hardly need for one in this example.

How to cite

This material can be cited as follows:

Bølstad, Jørgen. 2018. “An Introduction to Markov Chain Monte Carlo Sampling: Writing and Diagnosing a Metropolis Sampler in R”. Playing with Numbers: Notes on Bayesian Statistics. www.boelstad.net/post/mcmc_sampling_introduction/.

Here is a BibTex-formatted reference, which should work when using the natbib and hyperref packages together with a suitable Latex style:

@Misc{boelstad:2018b,
  Author = {B{\o}lstad, J{\o}rgen},
  Title = {An Introduction to Markov Chain Monte Carlo Sampling: Writing and Diagnosing a Metropolis Sampler in R},
  Howpublished = {Playing with Numbers: Notes on Bayesian Statistics},
  Url = {\url{http://www.boelstad.net/post/mcmc_sampling_introduction/}},
  Year = {2018},
  Month = {July 23}}

R on Jørgen Bølstad

How Efficient is Stan Compared to JAGS?

Data

Posterior parameter correlations

Models

Hierarchical, weakly informative, fully conjugate (centered and non-centered)

Hierarchical, weakly informative, partly conjugate (centered and non-centered)

Hierarchical, weakly informative, non-conjugate, centered

Unpooled with uniform priors

Approach

Results

Final notes

How to cite

Further reading

Bayesian Hierarchical Modeling

Hierarchical data

The likelihood

Priors for hierarchical data

No pooling

Partial pooling

Posterior distributions

Setting up the sampler

Obtaining estimates

Comparing the models

Final notes

How to cite

Further reading

An Introduction to Markov Chain Monte Carlo Sampling

Defining the prior, likelihood, and posterior

Setting up a Metropolis sampler

Obtaining simulation draws

Marginal distributions

What could go wrong?

Too large step size

Too small step size

Final notes

How to cite

Further reading