Bayesian Hierarchical Modeling

I used to think so-called multilevel models were a little boring. I was interested in causal inference, and the people using these models did not seem to have better causal identification strategies than those running plain old regressions. I have gradually come to change my mind on these models, although it is not because I think they solve challenges of causal identification. It is rather because I think a large share of our data can be thought of as hierarchical, and that proper modeling help us make the most of such data.

This post illustrates the benefits of Bayesian hierarchical modeling, by expanding the Metropolis sampler from my previous post to deal with more parameters. I will refer to hierarchical rather than multilevel models, as this highlights the use of hierarchical priors. The key advantage of the hierarchical approach is that it uses information across groups of observations to reduce our lower-level parameters’ sensitivity to noise. To demonstrate how this improves our estimates, this post compares a hierarchical model to a so-called unpooled model. (If you are unfamiliar with the basics of Bayesian inference, my earlier posts may be a better start.)

Hierarchical data

Hierarchical modeling is relevant when we have observations that are somehow grouped. It could be that we observe several individuals within different states, or that we have multiple observations per individual. In standard econometric terms, this will often give rise to autocorrelation and heteroskedasticity, calling for estimation of group-specific means and variances. In a regression setting, one might additionally estimate how coefficients vary between groups.

Here, we will create a small hierarchical dataset, containing 5 noisy observations for each of 75 individuals (i.e. 375 observations in total). For a given individual \(i \in \{1, \ldots, n\}\) and trial number \(j \in \{1, \ldots, m\}\) we will denote the observed outcome \(y_{ij}\). In addition to \(y\), our data will contain the covariate \(x\), which also varies by \(i\) and \(j\). The outcome \(y\) will have normally distributed (and homoskedastic) errors, and depend linearly on \(x\), according to the following process: \(y_{ij} \sim N(\alpha_i + \beta_i x_{ij},\sigma^2)\). A key point here is that each individual will have its own true intercept, \(\alpha_i\), as well as its own coefficient on \(x\), \(\beta_i\). These individual-specific parameters are in turn drawn from their own distributions: \(\alpha \sim N(2,1)\), and \(\beta \sim N(-2,1)\).

set.seed(1)
n = 75 # Number of individuals
m = 5  # Number of observations for per individual
alpha <- rnorm(n, 2, 1) # True individual-specific intercepts
beta <- rnorm(n, -2, 1) # True individual-specific coefficients
x <- matrix(rnorm(n * m, 0, 1), ncol = m)
y <- matrix(NA, n, m)
for (i in 1:n) { 
  for (j in 1:m) { 
    y[i, j] <- rnorm(1, alpha[i] + beta[i] * x[i, j], 2) } }

The likelihood

We will use a likelihood function that accurately reflects the data-generating process. In other words, we have:

\[p(y|x,\alpha,\beta,\sigma) = \prod^n_{i=1}\prod^m_{j=1} N(y_{ij}|\alpha_i + \beta_i x_{ij}, \sigma^2)\]

However, as we increase the number of observations, some of these products will get exceedingly small, so we will need to work with log-densities to prevent computational underflows. Taking the log of the densities means we will sum the point-wise log-probabilities: \[\text{ln}(p(y|x,\alpha,\beta,\sigma)) = \sum^n_{i=1}\sum^m_{j=1} \text{ln}(N(y_{ij}|\alpha_i + \beta_i x_{ij}, \sigma^2))\]

In order to keep track of the parameters when we pass their values between functions, we will pass them as a named list containing both single parameters and vectors of parameters. We will also use apply for somewhat efficient looping over observations:

log.likelihood <- function(y, x, pars) { 
  # Loop over each row in data and its according alpha- and beta-parameter:
  sum(log(apply(cbind(y, x, pars$alpha, pars$beta), 1, function(mat_i) { 
    # Loop over each observation in this row (i.e. individual):
    apply(cbind(mat_i[1:m], mat_i[(m + 1):(2 * m)]), 1, function(dat_ij) { 
      # Compute the probability density for observation ij:
      dnorm(dat_ij[1], mat_i[(2 * m) + 1] + mat_i[(2 * m) + 2] * dat_ij[2], pars$sigma) 
    })
  }))) 
}

Priors for hierarchical data

The key question is how we model the \(\alpha\)- and \(\beta\)-parameters. A simple option is to do complete pooling, assuming the \(\alpha\)’s and \(\beta\)’s are equal for all individuals, and estimating a single \(\alpha\) and a single \(\beta\). This is similar to running a single regression on the whole dataset, which is not particularly interesting, so we will leave this option aside here.

No pooling

At the other extreme, we can give each individual their own \(\alpha\)- and \(\beta\)-parameter and place wide, flat priors on each: \(\alpha_i \sim U(-100,100)\), and \(\beta_i \sim U(-100,100)\). Models of this kind are often referred to as no pooling models, and in our case, this is similar to running a separate regression for each individual. If also add a flat prior on the variance, we get the following prior distribution:

\[\begin{aligned}p(\alpha,\beta,\sigma) = &\prod^n_{i=1}\big[U(\alpha_i|-100,100) \times U(\beta_i|-100,100)\big] \times U(\sigma|0,10)\end{aligned}\]

However, we will be using the log-prior:

\[\begin{aligned}\text{ln}(p(\alpha,\beta,\sigma)) = &\sum^n_{i=1}\big[\text{ln}(U(\alpha_i|-100,100)) + \text{ln}(U(\beta_i|-100,100))\big] + \text{ln}(U(\sigma|0,10))\end{aligned}\] In R, we have:

log.prior.1 <- function(pars) {
  sum(log(dunif(pars$alpha, -100, 100))) + 
  sum(log(dunif(pars$beta, -100, 100))) + 
  log(dunif(pars$sigma, 0, 10)) }

Partial pooling

Between the extremes of complete pooling and no pooling, we have the option of partial pooling, which is typically what is meant by hierarachical modeling. Here, we start from the assumption that the parameters come from a common population distribution – which is typically self-evident. While we could specify a certain prior representing our beliefs regarding population distribution, there is a better option. If we choose a parametric density function – assuming, for instance, that the population follows a normal distribution – then we can let the parameters of that distribution be estimated from the data. Instead of fixing those hyperparameters, we give them their own, higher-level priors – hyperpriors. An interesting point about the hyperparameters is that they do not change the likelihood function – they only influence the posterior through the prior on the lower-level parameters.

In our case, we will assume the \(\alpha\)’s and \(\beta\)’s come from two distinct normal distributions: \(\alpha_i \sim N(\mu_\alpha,\sigma_\alpha^2)\), and: \(\beta_i \sim N(\mu_\beta,\sigma_\beta^2)\). We further add hyperpriors on the population parameters: \(\mu_\alpha, \mu_\beta \sim N(0,3^2)\), and: \(\sigma_\alpha, \sigma_\beta \sim U(0,10)\). For the variance, we use the same flat prior as above: \(\sigma \sim U(0,10)\). Our joint prior distribution is then:

\[\begin{aligned}p(\alpha,\beta,\mu_\alpha,\mu_\beta,\sigma_\alpha,\sigma_\beta,\sigma) = &\prod^n_{i=1}\big[N(\alpha_i|\mu_\alpha,\sigma_\alpha^2) \times N(\beta_i|\mu_\beta,\sigma_\beta^2)\big]~\times \cr[3ex] &N(\mu_\alpha|0,3^2) \times N(\mu_\beta|0,3^2)~\times \cr[3ex] &U(\sigma_\alpha|0,10) \times U(\sigma_\beta|0,10) \times U(\sigma|0,10)\end{aligned}\]

And the log-prior is:

\[\begin{aligned}\text{ln}(p(\alpha,\beta,\mu_\alpha,\mu_\beta,\sigma_\alpha,\sigma_\beta,\sigma)) = &\sum^n_{i=1}\big[\text{ln}(N(\alpha_i|\mu_\alpha,\sigma_\alpha^2))+\text{ln}(N(\beta_i|\mu_\beta,\sigma_\beta^2))\big]~+\cr[3ex] &\text{ln}(N(\mu_\alpha|0,3^2))+\text{ln}(N(\mu_\beta|0,3^2))~+\cr[3ex] &\text{ln}(U(\sigma_\alpha|0,10))+\text{ln}(U(\sigma_\beta|0,10))+\text{ln}(U(\sigma|0,10))\end{aligned}\]

Writing this up as a function in R, we have:

log.prior.2 <- function(pars) {
  sum(log(dnorm(pars$alpha, pars$mu.alpha, pars$sigma.alpha))) + 
  sum(log(dnorm(pars$beta, pars$mu.beta, pars$sigma.beta))) +
  log(dnorm(pars$mu.alpha, 0, 3)) + log(dnorm(pars$mu.beta, 0, 3)) + 
  log(dunif(pars$sigma.alpha, 0, 10)) + log(dunif(pars$sigma.beta, 0, 10)) + 
  log(dunif(pars$sigma, 0, 10)) }

Posterior distributions

We will be working with the unnormalized log-posterior density, which is the sum of the log-likelihood and the log-prior. For the unpooled model, we have:

\[\text{ln}(p(\alpha,\beta,\sigma|y,x) \propto \text{ln}(p(y|x,\alpha,\beta,\sigma))+ \text{ln}(p(\alpha,\beta,\sigma)) \]

log.unnorm.posterior.1 <- function(y, x, pars) { 
  log.likelihood(y, x, pars) + log.prior.1(pars) }

For the hierarchical model, we have:

\[\text{ln}(p(\alpha,\beta,\mu_\alpha,\mu_\beta,\sigma_\alpha,\sigma_\beta,\sigma|y,x) \propto \text{ln}(p(y|x,\alpha,\beta,\sigma))+ \text{ln}(p(\alpha,\beta,\mu_\alpha,\mu_\beta,\sigma_\alpha,\sigma_\beta,\sigma)) \]

log.unnorm.posterior.2 <- function(y, x, pars) { 
  log.likelihood(y, x, pars) + log.prior.2(pars) }

Setting up the sampler

To estimate our parameters we will use a slightly modified version of the Metropolis sampler that was introduced in the previous post. Most importantly, we will make the code a bit more general, so that it can be applied to an arbitrary number of parameters. Our proposal function now takes a list of parameters – or vectors of parameters – and returns a list with the same dimensions:

proposals <- function(pars, proposal.sd) {
  sapply(pars, function(x) { sapply(x, function(x) { x + rnorm(1, 0, proposal.sd) }) }) }

To use log-probabilities as long as possible, we now calculate the acceptance ratio as a difference in log-probabilities before exponentiating. We also add a counter of accepted proposals to calculate the sampler’s average acceptance rate. The sampler code now looks as follows:

metropolis <- function(n.it,proposal.sd, y, x, log.unnorm.posterior, starting.values) {
  draws <- vector()
  n.accept <- 0
  current.draw <- starting.values
  for (t in 1:n.it) {
    proposal <- proposals(current.draw, proposal.sd)
    if (sum(c(proposal$sigma.alpha, proposal$sigma.beta, proposal$sigma) <= 0) > 0) {
      acceptance.ratio = 0 }
    else {acceptance.ratio <- exp(log.unnorm.posterior(y, x, proposal) - 
      log.unnorm.posterior(y, x, current.draw)) }
    random.threshold <- runif(1, 0, 1)
    if (random.threshold <= acceptance.ratio) {
      current.draw <- proposal
      n.accept <- n.accept + 1 }
    draws <- rbind(draws, unlist(current.draw)) }
  return(list(draws = draws, accept.rate = n.accept / n.it)) }

Obtaining estimates

Now, let us obtain 10,000 draws for each of our two specifications (which may take a while):

n.it <- 10000
# Unpooled model:
starting.values.1 <- list(alpha=rnorm(n, 2, .75), beta = rnorm(n, -2, .75), 
                          sigma = runif(1, 1.75, 2.25))
results.1 <- metropolis(n.it, proposal.sd = .125, y, x, 
                        log.unnorm.posterior.1, starting.values.1)
# Hierarchical model:
starting.values.2 <- list(alpha = rnorm(n, 2, .75), beta = rnorm(n, -2, .75),
                          mu.alpha = rnorm(1, 2, .25), mu.beta = rnorm(1, -2, .25),
                          sigma.alpha = runif(1, .75, 1.25), 
                          sigma.beta = runif(1, .75, 1.25),
                          sigma = runif(1, 1.75, 2.25))
results.2 <- metropolis(n.it, proposal.sd = .1, y, x, 
                        log.unnorm.posterior.2, starting.values.2)

We will drop the first 2000 draws to allow the sampler to find its equilibrium distribution. (I am otherwise leaving the issue of convergence aside here, so you will just have to trust me.)

warmup <- n.it / 5
draws.subset.1 <- results.1$draws[(warmup + 1):n.it, ]
draws.subset.2 <- results.2$draws[(warmup + 1):n.it, ]

Finally, we obtain point estimates from the models by calculating posterior means:

unpooled.est <- colMeans(draws.subset.1)
hierarchical.est <- colMeans(draws.subset.2)

Comparing the models

Asymptotically, the unpooled and the hierarchical model will yield the same result: With an infinite number of observations per individual, the likelihood will prevail anyhow (as long as the prior does not rule out the true parameter values). However, the typical situation for hierarchical analysis is one where we have few observations per cluster (i.e. individual in our case), and then these choices matter. To illustrate the differences, the figure below plots the posterior means for the \(\alpha\)’s and \(\beta\)’s for each model. The estimates are shown in dark blue, while the true values are in lighter transparent blue:

One might think the hierarchical model has more informative priors on the lower-level parameters, but this would not be entirely correct. The uniform priors of the unpooled models are not “uninformative” – in fact, they attribute significant prior probability to areas of the parameter space that we do not truly consider plausible. In other words, these priors can be seen as both informative and implausible. The consequence of the flat priors of the unpooled model is to pull the estimates away from each other. In frequentist terms, these estimates are unbiased, but have high sampling variance – they are very sensitive to noise, as a few large errors in the same direction are allowed to pull the estimates into implausible regions of the parameter space.

In contrast, the hierarchical model uses information from all individuals to estimate the population distribution from which the lower-level parameters are drawn. Compared to the unpooled model, this prior structure pulls the estimates towards the population distribution – an effect often referred to as shrinkage. This yields estimates that are considerably less sensitive to noise, as we require more observations and a narrower likelihood to pull the posterior probabilities into unusual regions of the parameter space. Accordingly, the plot shows hierarchical estimates that are a lot more concentrated than the unpooled ones.

Is one model better? If our sole criterion is for the lower-level parameters to be unbiased, then we would prefer the unpooled model. But this would typically not be a good choice, as it ignores the sampling variance and thus sensitivity to noise. A straightforward way to assess bias and variance together is to consider mean squared errors (MSE) or their root (RMSE). Here, it is useful to do so across the sets of \(\alpha\)- and \(\beta\)-estimates. For a given set of \(\alpha\)-estimates, this means: \[\text{RMSE}(\hat\alpha) = \sqrt{\frac{1}{n}\sum_{i=1}^n (\hat\alpha_i - \alpha_i)^2}\] The RMSEs for each model are shown in the upper-right corner of the plot above. The values are considerably lower for the hierarchical model, which means this model on average yields estimates that are closer to the true parameter values.

# The RMSEs can be calculated as follows:
# alpha.unpooled <- unpooled.est[1:n]
# sqrt(mean((alpha.unpooled-alpha)^2)) # RMSE, alpha, unpooled model
# alpha.hierarchical <- hierarchical.est[1:n]
# sqrt(mean((alpha.hierarchical - alpha)^2)) # RMSE, alpha, hierarchical model
# beta.unpooled <- unpooled.est[(n + 1):(2 * n)]
# sqrt(mean((beta.unpooled - beta)^2)) # RMSE, beta, unpooled model
# beta.hierarchical <- hierarchical.est[(n + 1):(2 * n)]
# sqrt(mean((beta.hierarchical - beta)^2)) # RMSE, beta, hierarchical model

# For a basic version of the plot above run: par(mfrow=c(1,2)); plot(alpha.unpooled, beta.unpooled, type = "n", xlim = c(-5, 7), ylim = c(-7, 5)); for (i in 1:n) { arrows(alpha.unpooled, beta.unpooled, alpha[i], beta[i], col = "lightgray", length = 0) }; points(alpha, beta, col = "lightgray"); points(alpha.unpooled, beta.unpooled); plot(alpha.hierarchical, beta.hierarchical, type = "n", xlim=c(-5, 7), ylim = c(-7, 5)); for (i in 1:n) { arrows(alpha.hierarchical, beta.hierarchical, alpha[i], beta[i], col = "lightgray", length = 0) }; points(alpha, beta, col = "lightgray"); points(alpha.hierarchical, beta.hierarchical)

Final notes

In this post, I have left aside the issues of MCMC convergence and how to sample efficiently. The Metropolis sampler was used because it is easy to implement manually – not because it is well-suited. For an actual analysis of this kind, it would make more sense to use NUTS (see the next post). It could also be noted that while this post has focused on hierarchical modeling within a Bayesian framework – which is intuitive and straightforward – it is possible to achieve something similar within a frequentist framework. As Gelman notes, “anything Bayesian can be done non-Bayesianly” – it is just not clear why you would want to.

How to cite

This material can be cited as follows:

Bølstad, Jørgen. 2019. “The Benefits of Bayesian Hierarchical Modeling: Comparing Partially Pooled and Unpooled Models in R”. Playing with Numbers: Notes on Bayesian Statistics. www.boelstad.net/post/bayesian_hierarchical_modeling/.

Here is a BibTex-formatted reference, which should work when using the natbib and hyperref packages together with a suitable Latex style:

@Misc{boelstad:2018c,
  Author = {B{\o}lstad, J{\o}rgen},
  Title = {The Benefits of Bayesian Hierarchical Modeling: Comparing Partially Pooled and Unpooled Models in R},
  Howpublished = {Playing with Numbers: Notes on Bayesian Statistics},
  Url = {\url{http://www.boelstad.net/post/bayesian_hierarchical_modeling/}},
  Year = {2018},
  Month = {August 8}}