The Basics of Bayesian Inference

Evaluating continuous distributions over a grid in R

The goal of data analysis is typically to learn (more) about some unknown features of the world, and Bayesian inference offers a consistent framework for doing so. This framework is particularly useful when we have noisy, limited, or hierarchical data – or very complicated models. You may be aware of Bayes’ theorem, which states that the posterior is proportional to the likelihood times the prior. But what does that mean? This post offers a very basic introduction to key concepts in Bayesian statistics, with illustrations in R.

This will be a hands-on discussion, so we will start by setting up a relevant example. Say we are interested in estimating a latent trait of a single individual, and denote this trait with the Greek letter mu, \(\mu\). We have three noisy measurements of this trait, stored as the variable \(y\). We do not have data for other individuals at hand, but we know that the \(\mu\)’s in the whole population are normally distributed around a mean of \(-1\) with a standard deviation of \(1.5\). A key feature of the Bayesian framework is that it lets us use this information to get a less noisy estimate of \(\mu\).

To get started, let us create some data that fit the description above. We will let the measurement errors be normally distributed with a standard deviation of 2, and give \(\mu\) the value 1, which is reasonably probable in light of the population distribution. To generate normally distributed data, we use the R function rnorm:

set.seed(4)                 # Set seed for reproducibility
n <- 3                      # Set number of observations
y <- rnorm(n,mean=1,sd=2)   # Generate data, taking random draws from a normal distribution
y                           # Take a look at the data
## [1]  1.43350973 -0.08498514  2.78228929

The prior

To perform a Bayesian analysis, we need to specify our prior: \(p(\mu)\). That is, our beliefs about the parameter before having examined the new data. As we are dealing with a continuous variable in this case, our prior will be a continuous probability distribution. More specifically, without any additional knowledge of the individual in question, we will let our prior reflect our knowledge about the population distribution: We believe \(\mu\) is drawn from a normal distribution with mean \(-1\) and standard deviation \(1.5\): \(p(\mu) \sim N(-1,1.5^2)\). (We square the standard deviation to specify the distribution in terms of its variance).

Our prior thus looks like this:

To keep things simple and intuitive, we will assess our distributions (i.e. prior, likelihood, posterior) at equally spaced points (essentially treating them as categorical distributions). We thus define a sequence of potential values that \(\mu\) could have, and for each specific value approximate the probability that \(\mu\) actually lies within a narrow interval around this point. Specifically, we will divide the scale into intervals of length \(.01\).

To define our prior in R, we use the probability density function (PDF) of the normal distribution, dnorm:

res <- 1/100                # Set resolution for the sequence of potential mu-values
mu.h <- seq(-10,10,res)     # Define values at which to assess the probability density
density <- dnorm(mu.h,mean=-1,sd=1.5) # Get the density for each value in mu.h
prior <- density * res      # Multiply with resolution to get probability per interval
sum(prior)                  # Our prior probabilities now sum to 1
## [1] 1
mu.h[1:5]                   # Inspect the first 5 values of mu.h
## [1] -10.00  -9.99  -9.98  -9.97  -9.96
prior[1:5]                  # Inspect the corresponding values of the prior
## [1] 4.050589e-11 4.215803e-11 4.387560e-11 4.566113e-11 4.751720e-11
# For a basic plot of the prior density, run: plot(mu.h,prior/res,type="l",ylab="",lty=3)

The likelihood

Next, we need to define the likelihood function, which gives the probability of observing our data for given a parameter value: \(p(y|\mu)\). To keep things simple, let us say we know that the procedure generating our measurements entails normally distributed errors with a standard deviation of 2. For a single observation, \(y_i\), the likelihood is then: \(p(y_i|\mu) = N(y_i|\mu,2^2)\). The likelihood for the whole dataset is the product of all the individual likelihoods: \[p(y|\mu) = \prod^n_{i=1} N(y_i|\mu,2^2)\]

Having specified the likelihood function, we can calculate the likelihood across the set of potential values for \(\mu\). An intuitive way to do this is to use loops:

likelihood <- vector()      # Create an empty vector
for (s in mu.h) {           # Loop over each value in mu.h
  temp.lik <- 1             # Create a temporary likelihood scalar
  for (i in 1:n) {          # Loop over each individual data point
    temp.lik <- temp.lik * dnorm(y[i],mean=s,sd=2) # Multiply the individual likelihoods
  }
  likelihood <- c(likelihood,temp.lik) # Store the likelihood for each value in mu.h
}

However, it is more efficient to use vectorized operations and achieve the same result:

likelihood <- sapply(mu.h,function(s) prod(dnorm(y,s,2)) )

It is worth noting we assess the likelihood across different parameter values while keeping the data fixed. This means that the likelihood is not a probability density function that necessarily integrates (or sums) to 1 (in contrast to the prior specified above). However, to plot the likelihood on the same scale as the prior and the posterior distributions, it is useful also to calculate a normalized version that does sum to 1:

normalized.likelihood <- likelihood / sum(likelihood)

The posterior

Now, we can finally turn to the posterior distribution, which represents our updated beliefs about the parameter after having observed the data: \(p(\mu|y)\). As we know from Bayes’ theorem, we can obtain a distribution that is proportional (“\(\propto\)”) to the posterior by multiplying the likelihood and the prior: \[p(\mu|y) \propto p(y|\mu)p(\mu)\] In R, this means:

unnormalized.posterior <- likelihood * prior

We can normalize this distribution to sum to 1 by multiplying it with a normalizing constant (although this constant will typically be unknown):

normalizing.constant <- 1 / sum(likelihood * prior)
posterior <- unnormalized.posterior * normalizing.constant

For each value in our set of potential \(\mu\)-values, we now have a posterior probability that \(\mu\) indeed lies in a .01-interval around this point. To plot the distributions as approximate probability densities, we can divide them by .01 before plotting. This gives the following result:

What do we see here? Because we only have three observations for the individual in question, we get a fairly wide likelihood, which allows the posterior to be drawn towards the prior. Put differently, we do not have enough information to conclude that \(\mu\) is as far from the population mean as the likelihood would suggest. If we wanted greater certainty, we could go collect more data and get a narrower likelihood distribution.

# For a basic version of the plot above, run: plot(mu.h,posterior/res,type="l",ylab=""); lines(mu.h,normalized.likelihood/res,lty=2); lines(mu.h,prior/res,lty=3)

Estimating two parameters

Above, we simplified by assuming that we knew the standard deviation of the errors in our data. That is not a very common situation, so let us try to estimate both the mean, \(\mu\), and the standard deviation, \(\sigma\). The standard deviation must be positive, so we will consider its probability distribution over a set of positive values. To keep things simple, let us give \(\sigma\) a flat prior over the range from 0 to 10, using a uniform distribution: \(p(\sigma) \sim U(0,10)\). To do this in R, we use the density function dunif:

sigma.h <- seq(0,10,res)    # Define values at which to assess the probabilities for sigma
sigma.density <- dunif(sigma.h,min(sigma.h),max(sigma.h)) # Density-values for a flat prior
sigma.prior <- sigma.density * res # Multiply with res to get probability per interval
mu.prior <- prior           # Use the same prior for mu as in the previous example

The joint prior

Now that we have specified priors for \(\mu\) and \(\sigma\), we can also specify their joint prior distribution, \(p(\mu,\sigma)\). For a given pair of potential values – one for \(\mu\) and one for \(\sigma\) – this gives us the prior probability that \(\mu\) and \(\sigma\) are each (approximately) equal to their respective value in this pair. Because we have specified the priors for \(\mu\) and \(\sigma\) independently (i.e. no covariation), the joint prior distribution is in this case simply: \[p(\mu,\sigma) = p(\mu)p(\sigma)\] To get the joint prior in R, we can thus write:

joint.prior <- sapply(mu.prior,function(mu.s) { # Using sapply's instead of loops
                sapply(sigma.prior,function(sigma.s) { mu.s*sigma.s })})

This gives us a two-dimensional grid of probabilities, representing all possible combinations of \(\mu\) and \(\sigma\) values that we are considering. Note, however, that the columns in this grid are constant, due to the flat prior on \(\sigma\):

joint.prior[1:5,1:5]        # Inspect the first 5x5 entries
##              [,1]         [,2]        [,3]         [,4]        [,5]
## [1,] 4.050589e-14 4.215803e-14 4.38756e-14 4.566113e-14 4.75172e-14
## [2,] 4.050589e-14 4.215803e-14 4.38756e-14 4.566113e-14 4.75172e-14
## [3,] 4.050589e-14 4.215803e-14 4.38756e-14 4.566113e-14 4.75172e-14
## [4,] 4.050589e-14 4.215803e-14 4.38756e-14 4.566113e-14 4.75172e-14
## [5,] 4.050589e-14 4.215803e-14 4.38756e-14 4.566113e-14 4.75172e-14

The joint likelihood

Our likelihood function still looks much like before, but it now gives the probability of the data given two parameters instead of one. For a single observation, \(y_i\), the likelihood is now: \(p(y_i|\mu,\sigma) = N(y_i|\mu,\sigma^2)\). As before, the likelihood for the whole dataset is the product of all the individual likelihoods:

\[p(y|\mu,\sigma) = \prod^n_{i=1} N(y_i|\mu,\sigma^2)\]

If we calculate the likelihood across the possible combinations of values for \(\mu\) and \(\sigma\), we now get a two-dimensional grid. To do so in R, we run:

joint.likelihood <- sapply(mu.h,function(mu.s) { # Sapply's instead of loops
                      sapply(sigma.h,function(sigma.s) { prod(dnorm(y,mu.s,sigma.s)) })})
normalized.joint.likelihood <- joint.likelihood / sum(joint.likelihood) # Normalize 

The joint posterior

The posterior distribution is now a joint distribution for our two parameters. Adapting Bayes’ theorem to this case, we get:

\[p(\mu,\sigma|y) \propto p(y|\mu,\sigma)p(\mu,\sigma)\]

This is easy to calculate in R:

unnormalized.joint.posterior <- joint.likelihood * joint.prior
normalizing.constant <- 1 / sum(joint.likelihood * joint.prior)
joint.posterior <- unnormalized.joint.posterior * normalizing.constant

Instead of looking directly at the resulting two-dimensional set of probabilities, let us use a contour plot:

What do we see here? There is some covariation between the parameters in the joint posterior distribution: When \(\sigma\) is large, the likelihood gets wider, so the posterior probabilities for \(\mu\) get drawn more strongly towards the prior.

# For a basic version of the plot above, run: contour(mu.h,sigma.h,t(joint.posterior))

Marginal distributions

While the joint posterior may be useful, we often want to draw conclusions about the values of a given parameter without reference to the other parameter(s) in the model. The way to obtain a posterior distribution for a single parameter – a marginal posterior distribution – is to average the joint distribution over the parameter(s) we want to leave aside. In other words, to get the marginal posterior for \(\mu\), \(p(\mu|y)\), we integrate the joint posterior over \(\sigma\): \[ p(\mu|y) = \int p(\mu,\sigma |y)d(\sigma)\] In our case, we have already simplified this task by effectively turning our continuous distributions into categorical ones. This makes it easy to approximate the integral by summing the posterior probabilities in our grid for each potential value of \(\sigma\). Because our grid has a row for each potential value of \(\sigma\), we sum the probabilities in each column (across all rows) to get the marginal probabilities for \(\mu\):

mu.marginal.posterior <- colSums(joint.posterior)

We can also do the same for the normalized likelihood:

mu.marginal.norm.likelihood <- colSums(normalized.joint.likelihood)

The resulting distributions look like this:

We see that estimating \(\sigma\) and giving it a flat prior has resulted in a likelihood with much fatter tails than earlier, which in turn has changed the shape of the posterior distribution for \(\mu\).

# For a basic version of this plot, run: plot(mu.h,mu.marginal.norm.likelihood/res,type="l",ylab="",lty=2); lines(mu.h,mu.marginal.posterior/res,lty=1); lines(mu.h,mu.prior/res,lty=3)

For \(\sigma\), the marginal distributions are as follows:

sigma.marginal.posterior <- rowSums(joint.posterior)
sigma.marginal.norm.likelihood <- rowSums(normalized.joint.likelihood)

# For a basic version of this plot, run: plot(sigma.h,sigma.marginal.norm.likelihood/res,type="l",ylab="",lty=2); lines(sigma.h,sigma.marginal.posterior/res,lty=1); lines(sigma.h,sigma.prior/res,lty=3)

Final notes

The approach used here (calculating joint probability distributions over a grid) was chosen because it permits illustrating key Bayesian concepts without using too much math. However, the grid-approach is not one we would typically use in actual analyses. One reason is that simple models like the one we have here often can be solved analytically by using conjugate priors (which can be very neat). Another reason is that the grid-approach is ill-suited for more complicated problems, as the number of elements in the grid increases exponentially with the number of parameters: With two parameters, and considering \(1,000\) potential values for each, we get a grid with \(1,000^2 = 1,000,000\) elements. With \(50\) parameters, we get: \(1,000^{50}\). For such problems, we would normally sample from the posterior distribution using a form of Markov Chain Monte Carlo (MCMC) – which is the topic of my next post. For actual MCMC analyses, I recommend having a look at Stan: http://www.mc-stan.org.

If you find any mistakes in this post, please let me know!

Further reading

Gelman, Andrew, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari and Donald B Rubin. 2014. Bayesian Data Analysis. 3rd ed. London: Chapman & Hall/CRC Press.

McElreath, Richard. 2016. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. London: Chapman & Hall/CRC Press.


comments powered by Disqus