Posts on Jørgen Bølstad
http://www.boelstad.net/post/

The Advantages of Bayesian Hierarchical Modeling
http://www.boelstad.net/post/bayesian_hierarchical_modeling/
Wed, 08 Aug 2018
<p>I used to think so-called multilevel models were a little boring. I was interested in causal inference, and the people using these models did not seem to have better causal identification strategies than those running plain old regressions. I have gradually come to change my mind about these models, although not because I think they solve the challenges of causal identification. It is rather because a large share of our data can be thought of as hierarchical, and because proper modeling helps us make the most of such data.</p>
<p>This post illustrates the advantages of Bayesian hierarchical modeling, by expanding the Metropolis sampler from my previous post to deal with more parameters. I will refer to hierarchical rather than multilevel models, as this highlights the use of hierarchical priors. The key advantage of the hierarchical approach is that it uses information across groups of observations to reduce our lower-level parameters’ sensitivity to noise. To demonstrate how this improves our estimates, this post compares a hierarchical model to a so-called unpooled model. (If you are unfamiliar with the basics of Bayesian inference, my <a href="http://www.boelstad.net/post/bayesian_statistics_introduction/">earlier posts</a> may be a better start.)</p>
<div id="hierarchical-data" class="section level3">
<h3>Hierarchical data</h3>
<p>Hierarchical modeling is relevant when we have observations that are somehow grouped. It could be that we observe several individuals within different states, or that we have multiple observations per individual. In standard econometric terms, this will often give rise to autocorrelation and heteroskedasticity, calling for estimation of group-specific means and variances. In a regression setting, one might additionally estimate how coefficients vary between groups.</p>
<p>Here, we will create a small hierarchical dataset, containing 5 noisy observations for each of 75 individuals (i.e. 375 observations in total). For a given individual <span class="math inline">\(i \in \{1, \ldots, n\}\)</span> and trial number <span class="math inline">\(j \in \{1, \ldots, m\}\)</span> we will denote the observed outcome <span class="math inline">\(y_{ij}\)</span>. In addition to <span class="math inline">\(y\)</span>, our data will contain the covariate <span class="math inline">\(x\)</span>, which also varies by <span class="math inline">\(i\)</span> and <span class="math inline">\(j\)</span>. The outcome <span class="math inline">\(y\)</span> will have normally distributed (and homoskedastic) errors, and depend linearly on <span class="math inline">\(x\)</span>, according to the following process: <span class="math inline">\(y_{ij} \sim N(\alpha_i + \beta_i x_{ij},\sigma^2)\)</span>. A key point here is that each individual will have its own true intercept, <span class="math inline">\(\alpha_i\)</span>, as well as its own coefficient on <span class="math inline">\(x\)</span>, <span class="math inline">\(\beta_i\)</span>. These individual-specific parameters are in turn drawn from their own distributions: <span class="math inline">\(\alpha \sim N(2,1)\)</span>, and <span class="math inline">\(\beta \sim N(-2,1)\)</span>.</p>
<pre class="r"><code>set.seed(1)
n <- 75 # Number of individuals
m <- 5  # Number of observations per individual
alpha <- rnorm(n,2,1)  # True individual-specific intercepts
beta <- rnorm(n,-2,1)  # True individual-specific coefficients
x <- matrix(rnorm(n*m,0,1),ncol=m) # Covariate values
y <- matrix(NA,n,m)    # Outcomes, generated below with error sd 2
for (i in 1:n) { for (j in 1:m) { y[i,j] <- rnorm(1,alpha[i]+beta[i]*x[i,j],2) }}</code></pre>
</div>
<div id="the-likelihood" class="section level3">
<h3>The likelihood</h3>
<p>We will use a likelihood function that accurately reflects the data-generating process. In other words, we have:</p>
<p><span class="math display">\[p(y|x,\alpha,\beta,\sigma) = \prod^n_{i=1}\prod^m_{j=1} N(y_{ij}|\alpha_i + \beta_i x_{ij}, \sigma^2)\]</span></p>
<p>However, as we increase the number of observations, some of these products will get exceedingly small, so we will need to work with log-densities to prevent computational underflows. Taking the log of the densities means we will sum the point-wise log-probabilities: <span class="math display">\[\text{ln}(p(y|x,\alpha,\beta,\sigma)) = \sum^n_{i=1}\sum^m_{j=1} \text{ln}(N(y_{ij}|\alpha_i + \beta_i x_{ij}, \sigma^2))\]</span></p>
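<p>To see the underflow problem in practice, consider a small illustration (not part of the model above): the product of a few hundred normal densities evaluates to exactly zero in floating-point arithmetic, while the sum of log-densities remains finite:</p>
<pre class="r"><code># Product of many small densities underflows to exactly 0:
prod(dnorm(rep(3,500),0,1))
# Summing log-densities avoids the problem (about -2709.47):
sum(dnorm(rep(3,500),0,1,log=TRUE))</code></pre>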
<p>In order to keep track of the parameters when we pass their values between functions, we will pass them as a named list containing both single parameters and vectors of parameters. We will also use <code>apply</code> for somewhat efficient looping over observations:</p>
<pre class="r"><code>log.likelihood <- function(y,x,pars) {
# Loop over each row in the data with its corresponding alpha- and beta-parameter:
sum(apply(cbind(y,x,pars$alpha,pars$beta),1, function(mat_i) {
# Compute the log-densities for the m observations of this individual
# (dnorm is vectorized over observations, and log=TRUE avoids underflow):
sum(dnorm(mat_i[1:m],mat_i[(2*m)+1]+mat_i[(2*m)+2]*mat_i[(m+1):(2*m)],
pars$sigma,log=TRUE)) })) }</code></pre>
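<p>Because <code>dnorm</code> is vectorized, the same log-likelihood can also be computed without explicit looping, which is useful as a sanity check on the function above (a sketch assuming the <code>y</code>, <code>x</code>, <code>alpha</code>, and <code>beta</code> objects created earlier; the recycling of <span class="math inline">\(\alpha_i\)</span> and <span class="math inline">\(\beta_i\)</span> works because R matrices are stored column-wise):</p>
<pre class="r"><code>log.likelihood.vec <- function(y,x,pars) {
# alpha and beta (length n) recycle down the columns of the n-by-m matrices y and x:
sum(dnorm(y,pars$alpha+pars$beta*x,pars$sigma,log=TRUE)) }
# Should match log.likelihood() for the same parameter list, e.g.:
# log.likelihood.vec(y,x,list(alpha=alpha,beta=beta,sigma=2))</code></pre>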
</div>
<div id="priors-for-hierarchical-data" class="section level3">
<h3>Priors for hierarchical data</h3>
<p>The key question is how we model the <span class="math inline">\(\alpha\)</span>- and <span class="math inline">\(\beta\)</span>-parameters. A simple option is to do <em>complete pooling</em>, assuming the <span class="math inline">\(\alpha\)</span>’s and <span class="math inline">\(\beta\)</span>’s are equal for all individuals, and estimating a single <span class="math inline">\(\alpha\)</span> and a single <span class="math inline">\(\beta\)</span>. This is similar to running a single regression on the whole dataset, which is not particularly interesting, so we will leave this option aside here.</p>
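<p>For reference, the complete-pooling estimates are simply what an ordinary regression on the stacked data would give. A quick sketch, assuming the <code>y</code> and <code>x</code> matrices created above:</p>
<pre class="r"><code># One common intercept and slope for all 375 observations:
pooled.fit <- lm(as.vector(y) ~ as.vector(x))
coef(pooled.fit) # estimates near the population means of alpha and beta</code></pre>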
<div id="no-pooling" class="section level4">
<h4>No pooling</h4>
<p>At the other extreme, we can give each individual their own <span class="math inline">\(\alpha\)</span>- and <span class="math inline">\(\beta\)</span>-parameter and place wide, flat priors on each: <span class="math inline">\(\alpha_i \sim U(-100,100)\)</span>, and <span class="math inline">\(\beta_i \sim U(-100,100)\)</span>. Models of this kind are often referred to as <em>no pooling</em> models, and in our case, this is similar to running a separate regression for each individual. If we also add a flat prior on the error standard deviation, we get the following prior distribution:</p>
<p><span class="math display">\[\begin{aligned}p(\alpha,\beta,\sigma) = &\prod^n_{i=1}\big[U(\alpha_i|-100,100) \times U(\beta_i|-100,100)\big] \times U(\sigma|0,10)\end{aligned}\]</span></p>
<p>However, we will be using the log-prior:</p>
<p><span class="math display">\[\begin{aligned}\text{ln}(p(\alpha,\beta,\sigma)) = &\sum^n_{i=1}\big[\text{ln}(U(\alpha_i|-100,100)) + \text{ln}(U(\beta_i|-100,100))\big] + \text{ln}(U(\sigma|0,10))\end{aligned}\]</span> In R, we have:</p>
<pre class="r"><code>log.prior.1 <- function(pars) {
sum(dunif(pars$alpha,-100,100,log=TRUE)) +
sum(dunif(pars$beta,-100,100,log=TRUE)) +
dunif(pars$sigma,0,10,log=TRUE) }</code></pre>
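<p>Since the no-pooling model is essentially a separate regression per individual, its estimates can be previewed with <code>lm</code> (a hypothetical check, assuming the <code>y</code> and <code>x</code> matrices from above):</p>
<pre class="r"><code># One ordinary regression per individual, using only its own m observations:
unpooled.lm <- t(sapply(1:n, function(i) coef(lm(y[i,] ~ x[i,]))))
colnames(unpooled.lm) <- c("alpha","beta")
head(unpooled.lm) # noisy, with only m = 5 observations per individual</code></pre>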
</div>
<div id="partial-pooling" class="section level4">
<h4>Partial pooling</h4>
<p>Between the extremes of complete pooling and no pooling, we have the option of <em>partial pooling</em>, which is typically what is meant by hierarchical modeling. Here, we start from the assumption that the parameters come from a common population distribution – an assumption that is usually natural for grouped data. While we could specify a fixed prior representing our beliefs regarding the population distribution, there is a better option: If we choose a parametric density function – assuming, for instance, that the population follows a normal distribution – we can let the parameters of that distribution be estimated from the data. Instead of fixing these <em>hyperparameters</em>, we give them their own, higher-level priors – <em>hyperpriors</em>. An interesting point about the hyperparameters is that they do not enter the likelihood function – they only influence the posterior through the prior on the lower-level parameters.</p>
<p>In our case, we will assume the <span class="math inline">\(\alpha\)</span>’s and <span class="math inline">\(\beta\)</span>’s come from two distinct normal distributions: <span class="math inline">\(\alpha_i \sim N(\mu_\alpha,\sigma_\alpha^2)\)</span>, and: <span class="math inline">\(\beta_i \sim N(\mu_\beta,\sigma_\beta^2)\)</span>. We further add hyperpriors on the population parameters: <span class="math inline">\(\mu_\alpha, \mu_\beta \sim N(0,3^2)\)</span>, and: <span class="math inline">\(\sigma_\alpha, \sigma_\beta \sim U(0,10)\)</span>. For the variance, we use the same flat prior as above: <span class="math inline">\(\sigma \sim U(0,10)\)</span>. Our joint prior distribution is then:</p>
<p><span class="math display">\[\begin{aligned}p(\alpha,\beta,\mu_\alpha,\mu_\beta,\sigma_\alpha,\sigma_\beta,\sigma) = &\prod^n_{i=1}\big[N(\alpha_i|\mu_\alpha,\sigma_\alpha^2) \times N(\beta_i|\mu_\beta,\sigma_\beta^2)\big]~\times \cr[3ex]
&N(\mu_\alpha|0,3^2) \times N(\mu_\beta|0,3^2)~\times \cr[3ex]
&U(\sigma_\alpha|0,10) \times U(\sigma_\beta|0,10) \times U(\sigma|0,10)\end{aligned}\]</span></p>
<p>And the log-prior is:</p>
<p><span class="math display">\[\begin{aligned}\text{ln}(p(\alpha,\beta,\mu_\alpha,\mu_\beta,\sigma_\alpha,\sigma_\beta,\sigma)) = &\sum^n_{i=1}\big[\text{ln}(N(\alpha_i|\mu_\alpha,\sigma_\alpha^2))+\text{ln}(N(\beta_i|\mu_\beta,\sigma_\beta^2))\big]~+\cr[3ex]
&\text{ln}(N(\mu_\alpha|0,3^2))+\text{ln}(N(\mu_\beta|0,3^2))~+\cr[3ex]
&\text{ln}(U(\sigma_\alpha|0,10))+\text{ln}(U(\sigma_\beta|0,10))+\text{ln}(U(\sigma|0,10))\end{aligned}\]</span></p>
<p>Writing this up as a function in R, we have:</p>
<pre class="r"><code>log.prior.2 <- function(pars) {
sum(dnorm(pars$alpha,pars$mu.alpha,pars$sigma.alpha,log=TRUE)) +
sum(dnorm(pars$beta,pars$mu.beta,pars$sigma.beta,log=TRUE)) +
dnorm(pars$mu.alpha,0,3,log=TRUE) + dnorm(pars$mu.beta,0,3,log=TRUE) +
dunif(pars$sigma.alpha,0,10,log=TRUE) + dunif(pars$sigma.beta,0,10,log=TRUE) +
dunif(pars$sigma,0,10,log=TRUE) }</code></pre>
</div>
</div>
<div id="posterior-distributions" class="section level3">
<h3>Posterior distributions</h3>
<p>We will be working with the unnormalized log-posterior density, which is the sum of the log-likelihood and the log-prior. For the unpooled model, we have:</p>
<p><span class="math display">\[\text{ln}(p(\alpha,\beta,\sigma|y,x)) \propto \text{ln}(p(y|x,\alpha,\beta,\sigma)) + \text{ln}(p(\alpha,\beta,\sigma)) \]</span></p>
<pre class="r"><code>log.unnorm.posterior.1 <- function(y,x,pars) { log.likelihood(y,x,pars) + log.prior.1(pars) }</code></pre>
<p>For the hierarchical model, we have:</p>
<p><span class="math display">\[\text{ln}(p(\alpha,\beta,\mu_\alpha,\mu_\beta,\sigma_\alpha,\sigma_\beta,\sigma|y,x)) \propto \text{ln}(p(y|x,\alpha,\beta,\sigma)) + \text{ln}(p(\alpha,\beta,\mu_\alpha,\mu_\beta,\sigma_\alpha,\sigma_\beta,\sigma)) \]</span></p>
<pre class="r"><code>log.unnorm.posterior.2 <- function(y,x,pars) { log.likelihood(y,x,pars) + log.prior.2(pars) }</code></pre>
</div>
<div id="setting-up-the-sampler" class="section level3">
<h3>Setting up the sampler</h3>
<p>To estimate our parameters we will use a slightly modified version of the Metropolis sampler that was introduced in the previous post. Most importantly, we will make the code a bit more general, so that it can be applied to an arbitrary number of parameters. Our proposal function now takes a list of parameters – or vectors of parameters – and returns a list with the same dimensions:</p>
<pre class="r"><code>proposals <- function(pars,proposal.sd) {
# rnorm is vectorized, so each parameter vector gets its own innovations:
sapply(pars, function(par) { par + rnorm(length(par),0,proposal.sd) }) }</code></pre>
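<p>A quick check (hypothetical, not part of the sampler itself) confirms that the proposal function preserves the structure of the parameter list:</p>
<pre class="r"><code>test.pars <- list(alpha=rep(0,3),beta=rep(0,3),sigma=1)
str(proposals(test.pars,proposal.sd=.1)) # same names and lengths, jittered values</code></pre>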
<p>To use log-probabilities as long as possible, we now calculate the acceptance ratio as a difference in log-probabilities before exponentiating. We also add a counter of accepted proposals to calculate the sampler’s average acceptance rate. The sampler code now looks as follows:</p>
<pre class="r"><code>metropolis <- function(n.it,proposal.sd,y,x,log.unnorm.posterior,starting.values) {
draws <- vector()
n.accept <- 0
current.draw <- starting.values
for (t in 1:n.it) {
proposal <- proposals(current.draw,proposal.sd)
# Reject proposals with non-positive standard deviations (their prior density is zero):
if (sum(c(proposal$sigma.alpha,proposal$sigma.beta,proposal$sigma) <= 0) > 0) {
acceptance.ratio <- 0}
else {acceptance.ratio <- exp(log.unnorm.posterior(y,x,proposal) -
log.unnorm.posterior(y,x,current.draw)) }
random.threshold <- runif(1,0,1)
if (random.threshold <= acceptance.ratio) {
current.draw <- proposal
n.accept <- n.accept + 1}
draws <- rbind(draws,unlist(current.draw)) }
return(list(draws=draws,accept.rate=n.accept/n.it)) }</code></pre>
</div>
<div id="obtaining-estimates" class="section level3">
<h3>Obtaining estimates</h3>
<p>Now, let us obtain 10,000 draws for each of our two specifications (which may take a while):</p>
<pre class="r"><code>n.it <- 10000
# Unpooled model:
starting.values.1 <- list(alpha=rnorm(n,2,.75),beta=rnorm(n,-2,.75),sigma=runif(1,1.75,2.25))
results.1 <- metropolis(n.it,proposal.sd=.125,y,x,log.unnorm.posterior.1,starting.values.1)
# Hierarchical model:
starting.values.2 <- list(alpha=rnorm(n,2,.75),beta=rnorm(n,-2,.75),
mu.alpha=rnorm(1,2,.25),mu.beta=rnorm(1,-2,.25),
sigma.alpha=runif(1,.75,1.25),sigma.beta=runif(1,.75,1.25),
sigma=runif(1,1.75,2.25))
results.2 <- metropolis(n.it,proposal.sd=.1,y,x,log.unnorm.posterior.2,starting.values.2)</code></pre>
<pre class="r"><code>results.1$accept.rate # Acceptance rate for the unpooled model</code></pre>
<pre><code>## [1] 0.3057</code></pre>
<pre class="r"><code>results.2$accept.rate # Acceptance rate for the hierarchical model</code></pre>
<pre><code>## [1] 0.0997</code></pre>
<p>We will drop the first 2000 draws to allow the sampler to find its equilibrium distribution. (I am otherwise leaving the issue of convergence aside here, so you will just have to trust me.)</p>
<pre class="r"><code>warmup <- n.it/5
draws.subset.1 <- results.1$draws[(warmup+1):n.it,]
draws.subset.2 <- results.2$draws[(warmup+1):n.it,]</code></pre>
<p>Finally, we obtain point estimates from the models by calculating posterior means:</p>
<pre class="r"><code>unpooled.est <- colMeans(draws.subset.1)
hierarchical.est <- colMeans(draws.subset.2)</code></pre>
</div>
<div id="comparing-the-models" class="section level3">
<h3>Comparing the models</h3>
<p>Asymptotically, the unpooled and the hierarchical model will yield the same result: With an infinite number of observations per individual, the likelihood will dominate (as long as the prior does not rule out the true parameter values). However, the typical situation for hierarchical analysis is one where we have few observations per cluster (i.e. per individual in our case), and then these modeling choices matter. To illustrate the differences, the figure below plots the posterior means for the <span class="math inline">\(\alpha\)</span>’s and <span class="math inline">\(\beta\)</span>’s for each model. The estimates are shown in dark gray, while the true values are light gray:</p>
<p><img src="http://www.boelstad.net/post/Bayesian_hierarchical_modeling_files/figure-html/plotestimates-1.svg" width="700.8" style="display: block; margin: auto auto auto 0;" /></p>
<p>One might think the hierarchical model has more informative priors on the lower-level parameters, but this would not be entirely correct. The uniform priors of the unpooled model are not “uninformative” – in fact, they assign substantial prior probability to areas of the parameter space that we do not truly consider plausible. In other words, these priors can be seen as both informative and implausible. The consequence of the flat priors is that noise is allowed to pull the estimates away from each other: In frequentist terms, the estimates are unbiased but have high sampling variance, as a few large errors in the same direction can pull them into implausible regions of the parameter space.</p>
<p>In contrast, the hierarchical model uses information from all individuals to estimate the population distribution from which the lower-level parameters are drawn. Compared to the unpooled model, this prior structure pulls the estimates towards the population distribution – an effect often referred to as <em>shrinkage</em>. This yields estimates that are considerably less sensitive to noise, as we require more observations and a narrower likelihood to pull the posterior probabilities into unusual regions of the parameter space. Accordingly, the plot shows hierarchical estimates that are a lot more concentrated than the unpooled ones.</p>
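<p>The mechanism behind shrinkage can be made concrete with the standard normal-normal result (a general illustration, not code from the models above): the partial-pooling estimate of a group mean is a precision-weighted average of the group’s own data and the population mean.</p>
<pre class="r"><code># Precision-weighted shrinkage of a group mean toward the population mean:
shrink <- function(ybar.i,m,sigma,mu.pop,tau.pop) {
w <- (m/sigma^2) / (m/sigma^2 + 1/tau.pop^2) # weight on the group's own data
w*ybar.i + (1-w)*mu.pop }
# With m = 5 noisy observations, an extreme group mean of 5 is pulled toward 2:
shrink(ybar.i=5,m=5,sigma=2,mu.pop=2,tau.pop=1) # = 11/3, about 3.67</code></pre>
<p>With more observations per group (larger <code>m</code>) or less noise (smaller <code>sigma</code>), the weight on the group’s own data approaches one and the shrinkage disappears, matching the asymptotic argument above.</p>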
<p>Is one model better? If our sole criterion is for the lower-level parameters to be unbiased, then we would prefer the unpooled model. But this would typically not be a good choice, as it ignores the sampling variance and thus sensitivity to noise. A straightforward way to assess bias and variance together is to consider mean squared errors (MSE) or their root (RMSE). Here, it is useful to do so across the sets of <span class="math inline">\(\alpha\)</span>- and <span class="math inline">\(\beta\)</span>-estimates. For a given set of <span class="math inline">\(\alpha\)</span>-estimates, this means: <span class="math display">\[\text{RMSE}(\hat\alpha) = \sqrt{\frac{1}{n}\sum_{i=1}^n (\hat\alpha_i - \alpha_i)^2}\]</span> The RMSEs for each model are shown in the upper-right corner of the plot above. The values are considerably lower for the hierarchical model, which means this model on average yields estimates that are closer to the true parameter values.</p>
<pre class="r"><code># The RMSEs can be calculated as follows:
alpha.unpooled <- unpooled.est[1:n]
sqrt(mean((alpha.unpooled-alpha)^2)) # RMSE, alpha, unpooled model
alpha.hierarchical <- hierarchical.est[1:n]
sqrt(mean((alpha.hierarchical-alpha)^2)) # RMSE, alpha, hierarchical model
beta.unpooled <- unpooled.est[(n+1):(2*n)]
sqrt(mean((beta.unpooled-beta)^2)) # RMSE, beta, unpooled model
beta.hierarchical <- hierarchical.est[(n+1):(2*n)]
sqrt(mean((beta.hierarchical-beta)^2)) # RMSE, beta, hierarchical model</code></pre>
<pre class="r"><code># For a basic version of the plot above, run:
# par(mfrow=c(1,2))
# plot(alpha.unpooled,beta.unpooled,type="n",xlim=c(-5,7),ylim=c(-7,5))
# arrows(alpha.unpooled,beta.unpooled,alpha,beta,col="lightgray",length=0) # vectorized
# points(alpha,beta,col="lightgray")
# points(alpha.unpooled,beta.unpooled)
# plot(alpha.hierarchical,beta.hierarchical,type="n",xlim=c(-5,7),ylim=c(-7,5))
# arrows(alpha.hierarchical,beta.hierarchical,alpha,beta,col="lightgray",length=0)
# points(alpha,beta,col="lightgray")
# points(alpha.hierarchical,beta.hierarchical)</code></pre>
</div>
<div id="final-notes" class="section level3">
<h3>Final notes</h3>
<p>In this post, I have left aside the issues of MCMC convergence and how to sample efficiently. The Metropolis sampler was used because it is easy to implement manually – not because it is well-suited. For an actual analysis of this kind, it would make more sense to use <a href="http://www.mc-stan.org" target="_blank">NUTS</a>. It could also be noted that while this post has focused on hierarchical modeling within a Bayesian framework – which is intuitive and straightforward – it is possible to achieve something similar within a frequentist framework – it is just not clear why you would want to.</p>
<p>If you find any mistakes in this post, please let me know!</p>
</div>
<div id="further-reading" class="section level3">
<h3>Further reading</h3>
<div class="Bib">
<p>
Gelman, Andrew, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari and Donald B Rubin. 2014. <i>Bayesian Data Analysis.</i> 3rd ed. London: Chapman & Hall/CRC Press.
</p>
<p>
McElreath, Richard. 2016. <i>Statistical Rethinking: A Bayesian Course with Examples in R and Stan.</i> London: Chapman & Hall/CRC Press.
</p>
</div>
</div>
An Introduction to Markov Chain Monte Carlo Sampling
http://www.boelstad.net/post/mcmc_sampling_introduction/
Mon, 23 Jul 2018
<p>It is usually not too difficult to define priors and specify a likelihood function, which means we can calculate the unnormalized posterior for any combination of relevant parameter values. However, that is still insufficient to give us marginal posterior distributions for the parameters of interest. The grid method that was used in the previous post is not feasible for situations with a large number of parameters, and conjugate models with analytical solutions are only available for a limited set of problems. So what can we do? We sample from the posterior distribution!</p>
<p>This post illustrates Markov chain Monte Carlo sampling by writing a simple Metropolis sampler in R. There are some very efficient MCMC implementations already out there, so the goal of doing this manually is purely educational. (If you are unfamiliar with the basics of Bayesian inference, my <a href="http://www.boelstad.net/post/bayesian_statistics_introduction/">previous post</a> may be a better start.)</p>
<div id="defining-the-prior-likelihood-and-posterior" class="section level3">
<h3>Defining the prior, likelihood, and posterior</h3>
<p>To keep things simple, and to enable comparison, we will use the same data, model, and priors as in the previous post. That means we have three noisy observations of a single individual, which we use to estimate this individual’s latent trait, <span class="math inline">\(\mu\)</span>, as well as the standard deviation of the measurement errors, <span class="math inline">\(\sigma\)</span>.</p>
<p>To get started, we will recreate the data:</p>
<pre class="r"><code>set.seed(4) # Set seed for reproducibility
n <- 3 # Set number of observations
y <- rnorm(n,mean=1,sd=2) # Generate data, taking random draws from a normal distribution</code></pre>
<p>We exploit the fact that our individual is drawn from a population, and let our prior for <span class="math inline">\(\mu\)</span> reflect the known population distribution: <span class="math inline">\(\mu \sim N(-1,1.5^2)\)</span>. For <span class="math inline">\(\sigma\)</span>, we use a simple, flat prior (even though other choices may be better): <span class="math inline">\(\sigma \sim U(0,10)\)</span>. The joint prior is the product of these: <span class="math display">\[p(\mu,\sigma) = p(\mu)p(\sigma)\]</span> We thus define the joint prior distribution as the following function:</p>
<pre class="r"><code>prior <- function(mu,sigma) {dnorm(mu,mean=-1,sd=1.5) * dunif(sigma,0,10)}</code></pre>
<p>The likelihood function is the product of the individual likelihoods, which in turn are defined by the normal probability density function and a given pair of values for <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\sigma\)</span>:</p>
<p><span class="math display">\[p(y|\mu,\sigma) = \prod^n_{i=1} N(y_i|\mu,\sigma^2)\]</span> We thus define the likelihood function as follows:</p>
<pre class="r"><code>likelihood <- function(y,mu,sigma) {prod(dnorm(y,mean=mu,sd=sigma))}</code></pre>
<p>Recall that the posterior is proportional to the product of the likelihood and the prior: <span class="math display">\[p(\mu,\sigma|y) \propto p(y|\mu,\sigma)p(\mu,\sigma)\]</span> We thus define the unnormalized posterior as the following function:</p>
<pre class="r"><code>unnorm.posterior <- function(y,mu,sigma) {likelihood(y,mu,sigma) * prior(mu,sigma)}</code></pre>
</div>
<div id="setting-up-a-metropolis-sampler" class="section level3">
<h3>Setting up a Metropolis sampler</h3>
<p>We are now able to calculate the unnormalized posterior probability for any given combination of relevant parameter values. But now what? For a high-dimensional problem (which we admittedly do <em>not</em> have here), Markov chain Monte Carlo (MCMC) sampling may be the best option for approximating the posterior distribution. MCMC can simulate draws from any distribution as long as we can provide a function whose output is proportional to the density of the relevant target distribution. That is typically the situation in Bayesian inference, as the unnormalized posterior is proportional to the actual posterior distribution. In this setting, MCMC methods will produce draws that converge to the target distribution (i.e. the posterior) as the number of draws increases.</p>
<p>The Metropolis algorithm is a straightforward example of MCMC and thus useful for understanding the basics. The algorithm runs through a number of iterations to create a sequence of parameter values, where each set of values depends on the previous set. More specifically, it takes a set of starting values (one per parameter), and proposes a set of new values by adding a random innovation drawn from a specified distribution (we will use a normal distribution). The proposal set can then either be accepted – and replace the current values – or be rejected – so that the current values are retained. Then the next iteration follows, based on whichever set of values was retained in the previous iteration.</p>
<p>The probability that a proposal set is accepted depends on a comparison of the unnormalized posterior density for the current values and for the proposed values. If the proposal set has a higher unnormalized posterior, it is always accepted. If it does not, it may still be accepted, with probability equal to the ratio of the proposal set’s density to the current set’s density: If, for instance, the unnormalized posterior density for the proposal set is half of that for the current set, then the proposal set has an acceptance probability of .5. The result is that the chain (once it has reached its equilibrium) visits each region of the parameter space in proportion to its posterior density, producing draws from the posterior distribution.</p>
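<p>The acceptance rule itself is only a couple of lines. As a standalone sketch (using a hypothetical helper, not the sampler’s actual code):</p>
<pre class="r"><code># Accept with probability min(ratio,1), where ratio = p(proposal)/p(current):
accept <- function(ratio) { runif(1) <= ratio }
# A proposal with half the current density is accepted about half the time:
mean(replicate(10000,accept(0.5))) # roughly 0.5</code></pre>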
<p>To start setting up the algorithm, let us first define our proposal function, which takes a pair of parameter values as input, and returns a new pair after adding random innovations. The innovations are drawn from a normal distribution with a mean of zero and a specified standard deviation (proposal.sd):</p>
<pre class="r"><code>proposals <- function(mu,sigma,proposal.sd) {c(mu,sigma) + rnorm(2,0,proposal.sd)}</code></pre>
<p>Now, let us set up the algorithm as a function running for a specified set of iterations (n.it), using a given proposal standard deviation, a set of starting values, and a set of data (y):</p>
<pre class="r"><code>metropolis <- function(y,n.it,proposal.sd,starting.values) {
sample <- vector()
current.draw <- starting.values
for (it in 1:n.it) {
proposal <- proposals(current.draw[1],current.draw[2],proposal.sd)
if (proposal[2] < 0) {acceptance.ratio = 0} # Reject if sigma < 0 (as prior = 0)
else {acceptance.ratio <- (unnorm.posterior(y,proposal[1],proposal[2]) /
unnorm.posterior(y,current.draw[1],current.draw[2]))}
random.threshold <- runif(1,0,1) # Threshold for accepting when ratio < 1
if (acceptance.ratio >= random.threshold) {current.draw <- proposal}
sample <- rbind(sample,current.draw)
}
return(sample)
}</code></pre>
</div>
<div id="obtaining-simulation-draws" class="section level3">
<h3>Obtaining simulation draws</h3>
<p>Let us see if the algorithm succeeds in sampling from the posterior density. First, we define a starting value for each of the parameters, and in this case we will draw these from our prior distributions:</p>
<pre class="r"><code>starting.values <- c(rnorm(1,-1,1.5),runif(1,0,10)) # The first is mu, the second sigma</code></pre>
<p>We will run the sampler for 2500 iterations, and give the innovations a standard deviation of 1:</p>
<pre class="r"><code>n.it = 2500
draws <- metropolis(y,n.it,proposal.sd=1,starting.values=starting.values)</code></pre>
<p>This gives us 2500 draws for each of the parameters, and it is useful to plot these draws over their draw-number as a sequence:</p>
<p><img src="http://www.boelstad.net/post/MCMC_sampling_introduction_files/figure-html/unnamed-chunk-9-1.svg" width="700.8" style="display: block; margin: auto auto auto 0;" /><img src="http://www.boelstad.net/post/MCMC_sampling_introduction_files/figure-html/unnamed-chunk-9-2.svg" width="700.8" style="display: block; margin: auto auto auto 0;" /></p>
<pre class="r"><code># For basic versions of these plots, run:
# plot(1:n.it,draws[,1],type="l")
# plot(1:n.it,draws[,2],type="l")</code></pre>
<p>The plots above are encouraging, as the sequences both look stationary and highly variable. This suggests that the proposal distribution is sufficiently wide to move the sampler quickly across the parameter space, while the proposals still have a reasonable chance of being accepted.</p>
</div>
<div id="marginal-distributions" class="section level3">
<h3>Marginal distributions</h3>
<p>Ultimately, we are interested in drawing conclusions regarding the parameter values, so we want to assess the marginal posterior distributions. As noted in the previous post, the marginal posterior for <span class="math inline">\(\mu\)</span> is defined as: <span class="math display">\[ p(\mu|y) = \int p(\mu,\sigma |y)\,d\sigma\]</span> The MCMC approach makes it easy to approximate this distribution, as we can simply ignore the other parameter(s) when summarizing the draws for the parameter we are interested in. In other words, to approximate the marginal distribution for <span class="math inline">\(\mu\)</span>, we can just estimate the density of its draws. Now, as we are using the same data and model as in the previous post, we can also compare the densities of our draws to the previously calculated densities (here represented by dark red lines):</p>
<p><img src="http://www.boelstad.net/post/MCMC_sampling_introduction_files/figure-html/unnamed-chunk-11-1.svg" width="700.8" style="display: block; margin: auto auto auto 0;" /><img src="http://www.boelstad.net/post/MCMC_sampling_introduction_files/figure-html/unnamed-chunk-11-2.svg" width="700.8" style="display: block; margin: auto auto auto 0;" /></p>
<p>That looks pretty good! And if we wanted greater accuracy, we could just increase the number of draws.</p>
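<p>Given the draws, point estimates and uncertainty intervals are just summaries of the sampled values (a sketch assuming the <code>draws</code> matrix produced above):</p>
<pre class="r"><code>mean(draws[,1]) # Posterior mean of mu
quantile(draws[,1],c(.025,.975)) # 95% credible interval for mu
mean(draws[,2]) # Posterior mean of sigma</code></pre>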
<pre class="r"><code># For basic versions of these plots, run this (after also running the code from the previous post):
# plot(density(draws[,1])); lines(mu.h,mu.marginal.posterior/res)
# plot(density(draws[,2])); lines(sigma.h,sigma.marginal.posterior/res)</code></pre>
</div>
<div id="what-could-go-wrong" class="section level3">
<h3>What could go wrong?</h3>
<p>Let us consider some potential issues that could arise. First, it should be noted that it is always a good idea to run the sampler more than once, as this helps diagnose potential problems. Here, we will run the sampler twice, which gives us two sequences of draws for each parameter. Normally, one would want to run one chain on each processor core, but we will keep it simple and run one chain after the other.</p>
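<p>As a sketch of what one chain per core could look like, using base R's <code>parallel</code> package; the <code>metropolis()</code> function defined earlier is replaced by a trivial stand-in so the snippet runs on its own:</p>

```r
# Sketch: running chains in parallel with base R's `parallel` package.
# `metropolis()` is the sampler defined earlier in the post; here we use
# a trivial stand-in so the snippet is self-contained.
library(parallel)
metropolis.stub <- function(n.it) matrix(rnorm(n.it * 2), ncol = 2)

n.it <- 2500
chains <- mclapply(1:2, function(i) metropolis.stub(n.it),
                   mc.cores = 2)  # one chain per core (on Windows, use parLapply instead)
```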
<div id="too-large-step-size" class="section level4">
<h4>Too large step size</h4>
<p>For the sake of illustration, let us run the sampler with a much larger step size, increasing the standard deviation of the proposal distribution to 20:</p>
<pre class="r"><code>n.it = 2500
chain1 <- metropolis(y,n.it,proposal.sd=20,starting.values=c(rnorm(1,-1,1.5),runif(1,0,10)))
chain2 <- metropolis(y,n.it,proposal.sd=20,starting.values=c(rnorm(1,-1,1.5),runif(1,0,10)))
draws.2 <- data.frame(1:n.it,chain1,chain2)</code></pre>
<p>The results are as follows:</p>
<p><img src="http://www.boelstad.net/post/MCMC_sampling_introduction_files/figure-html/unnamed-chunk-14-1.svg" width="700.8" style="display: block; margin: auto auto auto 0;" /><img src="http://www.boelstad.net/post/MCMC_sampling_introduction_files/figure-html/unnamed-chunk-14-2.svg" width="700.8" style="display: block; margin: auto auto auto 0;" /></p>
<p>What is going on here? The wide proposal distribution generates a large number of proposals with very low or zero posterior probability, which results in a very high rejection rate (about 99%). This is visible as flat lines in the plots. In practice, this means we would need a much greater number of draws to get an acceptable approximation of the posterior distribution.</p>
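<p>Because a rejected proposal simply repeats the previous draw, the acceptance rate can be estimated directly from a stored chain. A minimal sketch, with a simulated chain standing in for the sampler's output:</p>

```r
# Sketch: estimating the acceptance rate from a stored chain. A rejected
# proposal repeats the previous draw, so the share of iterations where
# the chain actually moves approximates the acceptance rate.
set.seed(2)
moved <- runif(2500) < 0.3                      # simulate ~30% accepted proposals
chain <- cumsum(ifelse(moved, rnorm(2500), 0))  # toy chain standing in for sampler output
acceptance.rate <- mean(diff(chain) != 0)
```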
<pre class="r"><code># For basic versions of the plots above, run: plot(1:n.it,draws.2[,2],type="l",ylim=c(min(draws.2[,c(2,4)]),max(draws.2[,c(2,4)]))); lines(1:n.it,draws.2[,4]); plot(1:n.it,draws.2[,3],type="l",ylim=c(min(draws.2[,c(3,5)]),max(draws.2[,c(3,5)]))); lines(1:n.it,draws.2[,5])</code></pre>
</div>
<div id="too-small-step-size" class="section level4">
<h4>Too small step size</h4>
<p>When the situation above occurs, a natural response is to make the proposal distribution narrower. Let us see what happens if we go a bit too far and reduce the standard deviation of the innovations to 1/100:</p>
<pre class="r"><code>n.it = 2500
chain1 <- metropolis(y,n.it,proposal.sd=.01,starting.values=c(rnorm(1,-1,1.5),runif(1,0,10)))
chain2 <- metropolis(y,n.it,proposal.sd=.01,starting.values=c(rnorm(1,-1,1.5),runif(1,0,10)))
draws.3 <- data.frame(1:n.it,chain1,chain2)</code></pre>
<p>Now we get the following results:</p>
<p><img src="http://www.boelstad.net/post/MCMC_sampling_introduction_files/figure-html/unnamed-chunk-17-1.svg" width="700.8" style="display: block; margin: auto auto auto 0;" /><img src="http://www.boelstad.net/post/MCMC_sampling_introduction_files/figure-html/unnamed-chunk-17-2.svg" width="700.8" style="display: block; margin: auto auto auto 0;" /></p>
<p>What do we see here? The small step size means the chain moves very slowly through the parameter space. It also gives a very high acceptance rate (about 97%), because few proposals are notably worse than the draws they compete with. With this set-up, we would need an extremely high number of iterations for the chains to converge and approximate the posterior distribution.</p>
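<p>Slow mixing of this kind also shows up as high autocorrelation in the draws. A sketch, using a random walk as a stand-in for a chain with a tiny step size:</p>

```r
# Sketch: autocorrelation as a mixing diagnostic. A chain taking tiny
# steps behaves like a random walk, with lag-1 autocorrelation near 1,
# while well-mixed (here: independent) draws show autocorrelation near 0.
set.seed(3)
slow.chain <- cumsum(rnorm(2500, sd = .01))  # random walk: barely moves per step
fast.chain <- rnorm(2500)                    # independent draws
lag1.slow <- acf(slow.chain, plot = FALSE)$acf[2]  # lag-1 autocorrelation
lag1.fast <- acf(fast.chain, plot = FALSE)$acf[2]
```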
<pre class="r"><code># For basic versions of the plots above, run: plot(1:n.it,draws.3[,2],type="l",ylim=c(min(draws.3[,c(2,4)]),max(draws.3[,c(2,4)]))); lines(1:n.it,draws.3[,4]); plot(1:n.it,draws.3[,3],type="l",ylim=c(min(draws.3[,c(3,5)]),max(draws.3[,c(3,5)]))); lines(1:n.it,draws.3[,5]) </code></pre>
</div>
</div>
<div id="final-notes" class="section level3">
<h3>Final notes</h3>
<p>In this example, it is easy to set a step size that makes the Metropolis algorithm work well. In more complicated settings, with a large number of potentially correlated parameters, the algorithm may be less well suited, and more advanced samplers (e.g. Hamiltonian MC/NUTS) may be preferable. It should also be noted that there are several topics I have left aside here. The present example is so limited that we could keep things simple and avoid working with log probabilities. In my <a href="http://www.boelstad.net/post/bayesian_hierarchical_modeling/">next post</a>, I expand the sampler to analyze a hierarchical dataset, which requires using logs to avoid computational underflow. Other issues I have left aside include the calculation of convergence statistics and the effective number of simulation draws, as well as the fact that these algorithms typically require a warm-up period – with an appropriate step size, there is hardly any need for one in this example. If you find any mistakes in this post, please let me know!</p>
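<p>To see why logs become necessary with more data, note that a product of many densities quickly underflows to zero in double precision, while the sum of log densities stays finite:</p>

```r
# Sketch: why log probabilities are needed for larger datasets. The
# product of many densities underflows to zero in double precision,
# while the sum of log densities remains finite.
set.seed(5)
y.big <- rnorm(1000)
prod(dnorm(y.big, 0, 1))             # underflows to exactly 0
sum(dnorm(y.big, 0, 1, log = TRUE))  # finite log-likelihood
```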
</div>
<div id="further-reading" class="section level3">
<h3>Further reading</h3>
<div class="Bib">
<p>
Gelman, Andrew, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari and Donald B Rubin. 2014. <i>Bayesian Data Analysis.</i> 3rd ed. London: Chapman & Hall/CRC Press.
</p>
<p>
McElreath, Richard. 2016. <i>Statistical Rethinking: A Bayesian Course with Examples in R and Stan.</i> London: Chapman & Hall/CRC Press.
</p>
</div>
</div>
The Basics of Bayesian Inference
http://www.boelstad.net/post/bayesian_statistics_introduction/
Sun, 22 Jul 2018 00:00:00 +0000jorgen.bolstad@arena.uio.no (Jørgen Bølstad)http://www.boelstad.net/post/bayesian_statistics_introduction/<p>The goal of data analysis is typically to learn (more) about some unknown features of the world, and Bayesian inference offers a consistent framework for doing so. This framework is particularly useful when we have noisy, limited, or hierarchical data – or very complicated models. You may be aware of Bayes’ theorem, which states that <em>the posterior is proportional to the likelihood times the prior</em>. But what does that mean? This post offers a very basic introduction to key concepts in Bayesian statistics, with illustrations in R.</p>
<p>This will be a hands-on discussion, so we will start by setting up a relevant example. Say we are interested in estimating a latent trait of a single individual, and denote this trait with the Greek letter mu, <span class="math inline">\(\mu\)</span>. We have three noisy measurements of this trait, stored as the variable <span class="math inline">\(y\)</span>. We do not have data for other individuals at hand, but we know that the <span class="math inline">\(\mu\)</span>’s in the whole population are normally distributed around a mean of <span class="math inline">\(-1\)</span> with a standard deviation of <span class="math inline">\(1.5\)</span>. A key feature of the Bayesian framework is that it lets us use this information to get a less noisy estimate of <span class="math inline">\(\mu\)</span>.</p>
<p>To get started, let us create some data that fit the description above. We will let the measurement errors be normally distributed with a standard deviation of 2, and give <span class="math inline">\(\mu\)</span> the value 1, which is reasonably probable in light of the population distribution. To generate normally distributed data, we use the R function <code>rnorm</code>:</p>
<pre class="r"><code>set.seed(4) # Set seed for reproducibility
n <- 3 # Set number of observations
y <- rnorm(n,mean=1,sd=2) # Generate data, taking random draws from a normal distribution
y # Take a look at the data</code></pre>
<pre><code>## [1] 1.43350973 -0.08498514 2.78228929</code></pre>
<div id="the-prior" class="section level3">
<h3>The prior</h3>
<p>To perform a Bayesian analysis, we need to specify our prior, <span class="math inline">\(p(\mu)\)</span>: our beliefs about the parameter before having examined the new data. As we are dealing with a continuous variable in this case, our prior will be a continuous probability distribution. More specifically, without any additional knowledge of the individual in question, we will let our prior reflect our knowledge about the population distribution: We believe <span class="math inline">\(\mu\)</span> is drawn from a normal distribution with mean <span class="math inline">\(-1\)</span> and standard deviation <span class="math inline">\(1.5\)</span>: <span class="math inline">\(\mu \sim N(-1,1.5^2)\)</span>. (We square the standard deviation because the normal distribution is specified in terms of its variance.)</p>
<p>Our prior thus looks like this:</p>
<p><img src="http://www.boelstad.net/post/Bayesian_statistics_introduction_files/figure-html/priorillustration-1.svg" width="700.8" style="display: block; margin: auto auto auto 0;" /></p>
<p>To keep things simple and intuitive, we will assess our distributions (i.e. prior, likelihood, posterior) at equally spaced points (essentially treating them as categorical distributions). We thus define a sequence of potential values that <span class="math inline">\(\mu\)</span> could have, and for each specific value approximate the probability that <span class="math inline">\(\mu\)</span> actually lies within a narrow interval around this point. Specifically, we will divide the scale into intervals of length <span class="math inline">\(.01\)</span>.</p>
<p>To define our prior in R, we use the probability density function (PDF) of the normal distribution, <code>dnorm</code>:</p>
<pre class="r"><code>res <- 1/100 # Set resolution for the sequence of potential mu-values
mu.h <- seq(-10,10,res) # Define values at which to assess the probability density
density <- dnorm(mu.h,mean=-1,sd=1.5) # Get the density for each value in mu.h
prior <- density * res # Multiply with resolution to get probability per interval</code></pre>
<pre class="r"><code>sum(prior) # Our prior probabilities now sum to 1</code></pre>
<pre><code>## [1] 1</code></pre>
<pre class="r"><code>mu.h[1:5] # Inspect the first 5 values of mu.h</code></pre>
<pre><code>## [1] -10.00 -9.99 -9.98 -9.97 -9.96</code></pre>
<pre class="r"><code>prior[1:5] # Inspect the corresponding values of the prior</code></pre>
<pre><code>## [1] 4.050589e-11 4.215803e-11 4.387560e-11 4.566113e-11 4.751720e-11</code></pre>
<pre class="r"><code># For a basic plot of the prior density, run: plot(mu.h,prior/res,type="l",ylab="",lty=3)</code></pre>
</div>
<div id="the-likelihood" class="section level3">
<h3>The likelihood</h3>
<p>Next, we need to define the likelihood function, which gives the probability of observing our data for a given parameter value: <span class="math inline">\(p(y|\mu)\)</span>. To keep things simple, let us say we know that the procedure generating our measurements entails normally distributed errors with a standard deviation of 2. For a single observation, <span class="math inline">\(y_i\)</span>, the likelihood is then: <span class="math inline">\(p(y_i|\mu) = N(y_i|\mu,2^2)\)</span>. The likelihood for the whole dataset is the product of all the individual likelihoods: <span class="math display">\[p(y|\mu) = \prod^n_{i=1} N(y_i|\mu,2^2)\]</span></p>
<p>Having specified the likelihood function, we can calculate the likelihood across the set of potential values for <span class="math inline">\(\mu\)</span>. An intuitive way to do this is to use loops:</p>
<pre class="r"><code>likelihood <- vector() # Create an empty vector
for (s in mu.h) { # Loop over each value in mu.h
temp.lik <- 1 # Create a temporary likelihood scalar
for (i in 1:n) { # Loop over each individual data point
temp.lik <- temp.lik * dnorm(y[i],mean=s,sd=2) # Multiply the individual likelihoods
}
likelihood <- c(likelihood,temp.lik) # Store the likelihood for each value in mu.h
}</code></pre>
<p>However, it is more efficient to use vectorized operations and achieve the same result:</p>
<pre class="r"><code>likelihood <- sapply(mu.h,function(s) prod(dnorm(y,s,2)) )</code></pre>
<p>It is worth noting that we assess the likelihood across different parameter values while keeping the data fixed. This means that the likelihood is not a probability density function that necessarily integrates (or sums) to 1 (in contrast to the prior specified above). However, to plot the likelihood on the same scale as the prior and the posterior distributions, it is also useful to calculate a normalized version that does sum to 1:</p>
<pre class="r"><code>normalized.likelihood <- likelihood / sum(likelihood)</code></pre>
</div>
<div id="the-posterior" class="section level3">
<h3>The posterior</h3>
<p>Now, we can finally turn to the posterior distribution, which represents our updated beliefs about the parameter after having observed the data: <span class="math inline">\(p(\mu|y)\)</span>. As we know from Bayes’ theorem, we can obtain a distribution that is proportional (“<span class="math inline">\(\propto\)</span>”) to the posterior by multiplying the likelihood and the prior: <span class="math display">\[p(\mu|y) \propto p(y|\mu)p(\mu)\]</span> In R, this means:</p>
<pre class="r"><code>unnormalized.posterior <- likelihood * prior</code></pre>
<p>We can normalize this distribution to sum to 1 by multiplying it with a normalizing constant (although this constant will typically be unknown):</p>
<pre class="r"><code>normalizing.constant <- 1 / sum(likelihood * prior)
posterior <- unnormalized.posterior * normalizing.constant</code></pre>
<p>For each value in our set of potential <span class="math inline">\(\mu\)</span>-values, we now have a posterior probability that <span class="math inline">\(\mu\)</span> indeed lies in a .01-interval around this point. To plot the distributions as approximate probability densities, we can divide them by .01 before plotting. This gives the following result:</p>
<p><img src="http://www.boelstad.net/post/Bayesian_statistics_introduction_files/figure-html/unnamed-chunk-6-1.svg" width="700.8" style="display: block; margin: auto auto auto 0;" /></p>
<p>What do we see here? Because we only have three observations for the individual in question, we get a fairly wide likelihood, which allows the posterior to be drawn towards the prior. Put differently, we do not have enough information to conclude that <span class="math inline">\(\mu\)</span> is as far from the population mean as the likelihood would suggest. If we wanted greater certainty, we could go collect more data and get a narrower likelihood distribution.</p>
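<p>The grid posterior can also be summarized numerically. A sketch, re-creating the grid built above and reading off the posterior mean and an approximate 95% interval from the cumulative probabilities:</p>

```r
# Sketch: numerical summaries from the grid posterior, re-creating the
# grid built in the code above.
set.seed(4)
y <- rnorm(3, mean = 1, sd = 2)
res <- 1/100
mu.h <- seq(-10, 10, res)
prior <- dnorm(mu.h, -1, 1.5) * res
likelihood <- sapply(mu.h, function(s) prod(dnorm(y, s, 2)))
posterior <- likelihood * prior / sum(likelihood * prior)

posterior.mean <- sum(mu.h * posterior)  # probability-weighted average over the grid
ci <- mu.h[c(min(which(cumsum(posterior) > .025)),
             max(which(cumsum(posterior) < .975)))]  # approximate 95% interval
```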
<pre class="r"><code># For a basic version of the plot above, run: plot(mu.h,posterior/res,type="l",ylab=""); lines(mu.h,normalized.likelihood/res,lty=2); lines(mu.h,prior/res,lty=3)</code></pre>
</div>
<div id="estimating-two-parameters" class="section level2">
<h2>Estimating two parameters</h2>
<p>Above, we simplified by assuming that we knew the standard deviation of the errors in our data. That is not a very common situation, so let us try to estimate both the mean, <span class="math inline">\(\mu\)</span>, and the standard deviation, <span class="math inline">\(\sigma\)</span>. The standard deviation must be positive, so we will consider its probability distribution over a set of positive values. To keep things simple, let us give <span class="math inline">\(\sigma\)</span> a flat prior over the range from 0 to 10, using a uniform distribution: <span class="math inline">\(\sigma \sim U(0,10)\)</span>. To do this in R, we use the density function <code>dunif</code>:</p>
<pre class="r"><code>sigma.h <- seq(0,10,res) # Define values at which to assess the probabilities for sigma
sigma.density <- dunif(sigma.h,min(sigma.h),max(sigma.h)) # Density-values for a flat prior
sigma.prior <- sigma.density * res # Multiply with res to get probability per interval
mu.prior <- prior # Use the same prior for mu as in the previous example</code></pre>
<div id="the-joint-prior" class="section level3">
<h3>The joint prior</h3>
<p>Now that we have specified priors for <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\sigma\)</span>, we can also specify their joint prior distribution, <span class="math inline">\(p(\mu,\sigma)\)</span>. For a given pair of potential values – one for <span class="math inline">\(\mu\)</span> and one for <span class="math inline">\(\sigma\)</span> – this gives us the prior probability that <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\sigma\)</span> are each (approximately) equal to their respective value in this pair. Because we have specified the priors for <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\sigma\)</span> independently (i.e. no covariation), the joint prior distribution is in this case simply: <span class="math display">\[p(\mu,\sigma) = p(\mu)p(\sigma)\]</span> To get the joint prior in R, we can thus write:</p>
<pre class="r"><code>joint.prior <- sapply(mu.prior,function(mu.s) { # Using sapply's instead of loops
sapply(sigma.prior,function(sigma.s) { mu.s*sigma.s })})</code></pre>
<p>This gives us a two-dimensional grid of probabilities, representing all possible combinations of <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\sigma\)</span> values that we are considering. Note, however, that the columns in this grid are constant, due to the flat prior on <span class="math inline">\(\sigma\)</span>:</p>
<pre class="r"><code>joint.prior[1:5,1:5] # Inspect the first 5x5 entries</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5]
## [1,] 4.050589e-14 4.215803e-14 4.38756e-14 4.566113e-14 4.75172e-14
## [2,] 4.050589e-14 4.215803e-14 4.38756e-14 4.566113e-14 4.75172e-14
## [3,] 4.050589e-14 4.215803e-14 4.38756e-14 4.566113e-14 4.75172e-14
## [4,] 4.050589e-14 4.215803e-14 4.38756e-14 4.566113e-14 4.75172e-14
## [5,] 4.050589e-14 4.215803e-14 4.38756e-14 4.566113e-14 4.75172e-14</code></pre>
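<p>As an aside, the same grid of prior probabilities can be built without nested <code>sapply</code>'s: R's <code>outer()</code> function computes all pairwise products in one call. A sketch, re-creating the priors from above:</p>

```r
# Sketch: building the joint prior grid with outer() instead of nested
# sapply's; outer() computes all pairwise products directly.
res <- 1/100
mu.prior    <- dnorm(seq(-10, 10, res), -1, 1.5) * res
sigma.prior <- dunif(seq(0, 10, res), 0, 10) * res
joint.prior <- outer(sigma.prior, mu.prior)  # rows index sigma, columns index mu
dim(joint.prior)  # 1001 rows (sigma values) by 2001 columns (mu values)
```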
</div>
<div id="the-joint-likelihood" class="section level3">
<h3>The joint likelihood</h3>
<p>Our likelihood function still looks much like before, but it now gives the probability of the data given two parameters instead of one. For a single observation, <span class="math inline">\(y_i\)</span>, the likelihood is now: <span class="math inline">\(p(y_i|\mu,\sigma) = N(y_i|\mu,\sigma^2)\)</span>. As before, the likelihood for the whole dataset is the product of all the individual likelihoods:</p>
<p><span class="math display">\[p(y|\mu,\sigma) = \prod^n_{i=1} N(y_i|\mu,\sigma^2)\]</span></p>
<p>If we calculate the likelihood across the possible combinations of values for <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\sigma\)</span>, we now get a two-dimensional grid. To do so in R, we run:</p>
<pre class="r"><code>joint.likelihood <- sapply(mu.h,function(mu.s) { # Sapply's instead of loops
sapply(sigma.h,function(sigma.s) { prod(dnorm(y,mu.s,sigma.s)) })})
normalized.joint.likelihood <- joint.likelihood / sum(joint.likelihood) # Normalize </code></pre>
</div>
<div id="the-joint-posterior" class="section level3">
<h3>The joint posterior</h3>
<p>The posterior distribution is now a joint distribution for our two parameters. Adapting Bayes’ theorem to this case, we get:</p>
<p><span class="math display">\[p(\mu,\sigma|y) \propto p(y|\mu,\sigma)p(\mu,\sigma)\]</span></p>
<p>This is easy to calculate in R:</p>
<pre class="r"><code>unnormalized.joint.posterior <- joint.likelihood * joint.prior
normalizing.constant <- 1 / sum(joint.likelihood * joint.prior)
joint.posterior <- unnormalized.joint.posterior * normalizing.constant</code></pre>
<p>Instead of looking directly at the resulting two-dimensional set of probabilities, let us use a contour plot:</p>
<p><img src="http://www.boelstad.net/post/Bayesian_statistics_introduction_files/figure-html/unnamed-chunk-11-1.svg" width="700.8" style="display: block; margin: auto auto auto 0;" /></p>
<p>What do we see here? There is some covariation between the parameters in the joint posterior distribution: When <span class="math inline">\(\sigma\)</span> is large, the likelihood gets wider, so the posterior probabilities for <span class="math inline">\(\mu\)</span> get drawn more strongly towards the prior.</p>
<pre class="r"><code># For a basic version of the plot above, run: contour(mu.h,sigma.h,t(joint.posterior))</code></pre>
</div>
<div id="marginal-distributions" class="section level3">
<h3>Marginal distributions</h3>
<p>While the joint posterior may be useful, we often want to draw conclusions about the values of a given parameter without reference to the other parameter(s) in the model. The way to obtain a posterior distribution for a single parameter – a marginal posterior distribution – is to integrate the joint distribution over the parameter(s) we want to leave aside. In other words, to get the marginal posterior for <span class="math inline">\(\mu\)</span>, <span class="math inline">\(p(\mu|y)\)</span>, we integrate the joint posterior over <span class="math inline">\(\sigma\)</span>: <span class="math display">\[ p(\mu|y) = \int p(\mu,\sigma |y)\,d\sigma\]</span> In our case, we have already simplified this task by effectively turning our continuous distributions into categorical ones. This makes it easy to approximate the integral by summing the posterior probabilities in our grid for each potential value of <span class="math inline">\(\sigma\)</span>. Because our grid has a row for each potential value of <span class="math inline">\(\sigma\)</span>, we sum the probabilities in each column (across all rows) to get the marginal probabilities for <span class="math inline">\(\mu\)</span>:</p>
<pre class="r"><code>mu.marginal.posterior <- colSums(joint.posterior)</code></pre>
<p>We can also do the same for the normalized likelihood:</p>
<pre class="r"><code>mu.marginal.norm.likelihood <- colSums(normalized.joint.likelihood)</code></pre>
<p>The resulting distributions look like this:</p>
<p><img src="http://www.boelstad.net/post/Bayesian_statistics_introduction_files/figure-html/unnamed-chunk-15-1.svg" width="700.8" style="display: block; margin: auto auto auto 0;" /></p>
<p>We see that estimating <span class="math inline">\(\sigma\)</span> and giving it a flat prior has resulted in a likelihood with much fatter tails than earlier, which in turn has changed the shape of the posterior distribution for <span class="math inline">\(\mu\)</span>.</p>
<pre class="r"><code># For a basic version of this plot, run: plot(mu.h,mu.marginal.norm.likelihood/res,type="l",ylab="",lty=2); lines(mu.h,mu.marginal.posterior/res,lty=1); lines(mu.h,mu.prior/res,lty=3)</code></pre>
<p>For <span class="math inline">\(\sigma\)</span>, the marginal distributions are as follows:</p>
<pre class="r"><code>sigma.marginal.posterior <- rowSums(joint.posterior)
sigma.marginal.norm.likelihood <- rowSums(normalized.joint.likelihood)</code></pre>
<p><img src="http://www.boelstad.net/post/Bayesian_statistics_introduction_files/figure-html/plots-1.svg" width="700.8" style="display: block; margin: auto auto auto 0;" /></p>
<pre class="r"><code># For a basic version of this plot, run: plot(sigma.h,sigma.marginal.norm.likelihood/res,type="l",ylab="",lty=2); lines(sigma.h,sigma.marginal.posterior/res,lty=1); lines(sigma.h,sigma.prior/res,lty=3)</code></pre>
</div>
<div id="final-notes" class="section level3">
<h3>Final notes</h3>
<p>The approach used here (calculating joint probability distributions over a grid) was chosen because it permits illustrating key Bayesian concepts without using too much math. However, the grid-approach is not one we would typically use in actual analyses. One reason is that simple models like the one we have here can often be solved analytically by using conjugate priors (which can be very neat). Another reason is that the grid-approach is ill-suited for more complicated problems, as the number of elements in the grid increases exponentially with the number of parameters: With two parameters, and considering <span class="math inline">\(1,000\)</span> potential values for each, we get a grid with <span class="math inline">\(1,000^2 = 1,000,000\)</span> elements. With <span class="math inline">\(50\)</span> parameters, we get: <span class="math inline">\(1,000^{50}\)</span>. For such problems, we would normally sample from the posterior distribution using a form of Markov Chain Monte Carlo (MCMC) – which is the topic of my <a href="http://www.boelstad.net/post/mcmc_sampling_introduction/">next post</a>. For actual MCMC analyses, I recommend having a look at Stan: <a href="http://www.mc-stan.org" target="_blank">http://www.mc-stan.org</a>.</p>
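<p>To illustrate the conjugate route, the known-<span class="math inline">\(\sigma\)</span> model from the start of this post is exactly the case a conjugate prior handles: a normal prior combined with a normal likelihood yields a normal posterior with a precision-weighted mean. A sketch:</p>

```r
# Sketch: the conjugate solution to the known-sigma model from the start
# of this post. A normal prior times a normal likelihood gives a normal
# posterior with a precision-weighted mean; no grid or sampling needed.
set.seed(4)
y <- rnorm(3, mean = 1, sd = 2)
prior.mean <- -1; prior.sd <- 1.5; sigma <- 2; n <- length(y)

post.precision <- 1/prior.sd^2 + n/sigma^2
post.mean <- (prior.mean/prior.sd^2 + sum(y)/sigma^2) / post.precision
post.sd <- sqrt(1/post.precision)  # narrower than the prior: the data add information
```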
<p>If you find any mistakes in this post, please let me know!</p>
</div>
<div id="further-reading" class="section level3">
<h3>Further reading</h3>
<div class="Bib">
<p>
Gelman, Andrew, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari and Donald B Rubin. 2014. <i>Bayesian Data Analysis.</i> 3rd ed. London: Chapman & Hall/CRC Press.
</p>
<p>
McElreath, Richard. 2016. <i>Statistical Rethinking: A Bayesian Course with Examples in R and Stan.</i> London: Chapman & Hall/CRC Press.
</p>
</div>
</div>
</div>