Posts on Jørgen Bølstad
https://www.boelstad.net/post/
Recent content in Posts on Jørgen BølstadHugo -- gohugo.ioj.b [AT] post.harvard.edu (Jørgen Bølstad)j.b [AT] post.harvard.edu (Jørgen Bølstad)Wed, 20 Jan 2021 00:00:00 +0000A New Website Is Born: Estimite.com
https://www.boelstad.net/post/2021_norwegian_election_forecast/
Wed, 20 Jan 2021 00:00:00 +0000j.b [AT] post.harvard.edu (Jørgen Bølstad)https://www.boelstad.net/post/2021_norwegian_election_forecast/<p>Bayesian modeling has proven its usefulness for poll aggregation and election forecasting – for instance through <a href="https://fivethirtyeight.com">FiveThirtyEight</a>. However, in Norway, media coverage of public opinion trends still tends to focus on a single poll at a time, or a simple average of polls at best. I was convinced it would be possible to do better, and after about three times as many long days and nights as I thought it would take, the result is finally live – both in <a href="https://www.estimite.com/no/post/stortingsvalget-2021/">Norwegian</a> and <a href="https://www.estimite.com/post/norwegian-election-2021/">English</a> – at <a href="https://www.estimite.com/">Estimite.com</a>.</p>
<div id="notes-on-the-approach" class="section level3">
<h3>Notes on the approach</h3>
<p>I have landed on a fully automated production process, where a script checks for new data, refits the model if necessary, and then rebuilds and publishes the website. This is all done using R, Stan, and a few shell scripts. Here is a brief description of the model:</p>
<ul>
<li><p>This is a state space model where center logratio transformed latent trends are given multivariate <em>t</em>-distributed innovations. The multivariate <em>t</em> has been given a non-centered specification through Cholesky factorization, which makes a notable difference.</p></li>
<li><p>The model estimates and corrects for average polling bias relative to previous elections. This contrasts with Nate Silver’s models at FiveThirtyEight, which assume that house effects have a mean of zero – I never fully bought that idea. Looking at Norwegian data, there seems to be a consistent pattern across elections, and ignoring it might be risky. At the same time, correcting for these biases carries its own risk, as it assumes polling companies do not fix their methodology from one election to the next.</p></li>
<li><p>The model estimates latent trends in party support at the national level, and predicts latent support for each electoral district based on previous election results. The predictions are restricted to sum up to the national trend. Votes at the district-level are predicted from the district-level trends using a Dirichlet-Multinomial distribution – the same kind of distribution as the one assumed to generate the polling data.</p></li>
<li><p>Predictions for future time points are based on: (1) Vote shares at the last election, (2) the degree of change in each party’s vote share from one election to the next, (3) current party support, (4), the degree of change in each party’s support from one week to the next, and (5) the degree of covariation in how support for each possible pair of parties evolves on a weekly basis.</p></li>
</ul>
</div>
<div id="conclusions" class="section level3">
<h3>Conclusions</h3>
<ol style="list-style-type: decimal">
<li><p>Stan is great for this, because it lets us define generated quantities, and get all kinds of relevant outputs from a single, joint simulation.</p></li>
<li><p>You should not underestimate how many technical challenges such projects entail and how much time it takes to fix them – especially if you care about details.</p></li>
</ol>
</div>
The Problem with Google’s Journal Rankings
https://www.boelstad.net/post/problem_google_journal_ranking/
Tue, 25 Feb 2020 00:00:00 +0000j.b [AT] post.harvard.edu (Jørgen Bølstad)https://www.boelstad.net/post/problem_google_journal_ranking/<p>I keep seeing Google’s journal rankings being used to measure research quality, and I want to explain why I think this is a bad idea. One could, of course, argue that citations do not reflect research quality anyway and therefore are of little relevance, but that is not the point I want to make here. I will simply run a small set of simulations to illustrate a key problem with how Google’s h-index is being used.</p>
<div id="the-h-index" class="section level3">
<h3>The h-index</h3>
<p>Google calculates the h-index for scholars and journals alike. For a given set of articles, the h-index is “the largest number h such that h articles have at least h citations each”. This means we can calculate it as follows:</p>
<pre class="r"><code>h.index = function(citations) {
citations = citations[order(citations, decreasing = TRUE)]
tail(which(citations >= seq_along(citations)), 1)
}</code></pre>
<p>To rank journals, Google uses the h5-index, which is an h-index based on the articles published by a journal in the last 5 complete years. If we were solely interested in a <em>journal’s</em> impact over these 5 years, this could possibly be an acceptable measure. However, Google’s ranking of journals is often assumed to reflect the general quality or impact of the <em>articles</em> that are published in these journals – which is a very different thing.</p>
</div>
<div id="simulation-setup" class="section level3">
<h3>Simulation setup</h3>
<p>To see the distinction, let’s run a small set of simulations comparing three journals that differ in the number of articles they publish:</p>
<ul>
<li>Journal 1 publishes 12 issues per year and 15 articles (or other citable items) per issue, which over 5 years gives 900 articles.</li>
<li>Journal 2 publishes 10 issues per year and 8 articles (or other citable items) per issue, which over 5 years gives 400 articles.</li>
<li>Journal 3 publishes 4 issues per year and 8 articles (or other citable items) per issue, which over 5 years gives 160 articles.</li>
</ul>
<p>In addition, the journals differ in the average impact of the articles they publish:</p>
<ul>
<li>Journal 1 and 2 publish articles with the same tendency to get cited.</li>
<li>Journal 3 publishes articles with a higher tendency to get cited.</li>
</ul>
<p>Here’s a function to generate data that fit this description:</p>
<pre class="r"><code>gen.dat <- function() {
list(journal1 = rnbinom(5 * 12 * 15, 1, prob = .05),
journal2 = rnbinom(5 * 10 * 8, 1, prob = .05),
journal3 = rnbinom(5 * 4 * 8, 1, prob = .04))
}</code></pre>
<p>Now, let’s generate 1000 datasets and calculate the average number of citations per article as well as the h-index for each journal:</p>
<pre class="r"><code>results <- vector()
for (i in 1:1000) {
x <- gen.dat()
results <- rbind(results, cbind(sapply(x, h.index), sapply(x, mean)))
}</code></pre>
</div>
<div id="results" class="section level3">
<h3>Results</h3>
<p>In line with the data-generating process above, the plot below shows that Journal 3 publishes articles with a higher tendency to get cited. However, if we turn to the h5-index, Journal 1 scores the highest, followed by Journal 2. As noted above, Journal 2 publishes articles with the same average impact as Journal 1, but it publishes fewer of them, and this results in a lower h5-index. Journal 3, which publishes articles with the highest average impact, gets the lowest h5-index due to its lower publication volume.</p>
<p><img src="https://www.boelstad.net/post/Problem_Google_journal_ranking_files/figure-html/unnamed-chunk-2-1.svg" width="705.6" style="display: block; margin: auto auto auto 0;" /></p>
</div>
<div id="final-notes" class="section level3">
<h3>Final notes</h3>
<p>If we want to assess the impact or quality of the articles published by a journal, it makes little sense to use a measure that gets inflated (or constrained) by the number of articles the journal publishes. In fact, publishing fewer articles should (all else equal) make a journal more selective and thus increase the quality of what it publishes.</p>
<p>To see how absurd this could possibly get, imagine we created a journal that at the end of each decade published one article presenting the single most important discovery of that decade. These articles might rack up many thousands of citations each, and the journal could be the most prestigious in the world, but its h5-index would never exceed 1. It would be beaten by pretty much any other journal. At the other extreme, a good strategy for achieving a high h5-index is to publish as much as possible and hope something sticks – not exactly the definition of a high-quality journal.</p>
<p>If we are looking for a measure of the average impact of the articles in a given journal, good old-fashioned impact factors are more valid. But even better is Scimago’s approach, as they weight the citations based on where they originate. Then we are getting closer to something meaningful, although I do believe we should be using data covering longer time periods than 2-3 years: If we are trying to capture something we believe is fairly stable, and our indicator shifts around every year, this suggests we are capturing a lot of noise, doesn’t it?</p>
</div>
How Efficient is Stan Compared to JAGS?
https://www.boelstad.net/post/stan_vs_jags_speed/
Wed, 02 Jan 2019 00:00:00 +0000j.b [AT] post.harvard.edu (Jørgen Bølstad)https://www.boelstad.net/post/stan_vs_jags_speed/<p>For a good while JAGS was your best bet if you wanted to do MCMC on a distribution of your own choosing. Then Stan came along, potentially replacing JAGS as the black-box sampler of choice for many Bayesians. But how do they compare in terms of performance? The obvious answer is: It depends. In fact, the question is nearly impossible to answer properly, as any comparison will be conditional on the data, model specifications, test criteria, and more. Nevertheless, this post offers a small simulation study.</p>
<p>The approach taken here is to compare the models in terms of seconds required to produce an effective simulation draw for the parameter with the least effective draws. Simulation time is taken to exclude compilation time, but include adaptation and sufficient warm-up for the chains to get into the typical set. The samplers are tested in a hierarchical setting, using six different models. The models differ in terms of pooling, conjugacy, and centered parameter specifications. The results are roughly as expected: JAGS exhibits both blazing fast and incredibly slow sampling, while Stan delivers somewhat more stable performance, being relatively efficient also in harder scenarios.</p>
<div id="data" class="section level3">
<h3>Data</h3>
<p>This mini-study generates hierarchical datasets, containing <span class="math inline">\(m\)</span> noisy observations for each of <span class="math inline">\(n\)</span> individuals. For a given individual <span class="math inline">\(i \in \{1, \ldots, n\}\)</span> and trial number <span class="math inline">\(j \in \{1, \ldots, m\}\)</span>, we have the observed outcome <span class="math inline">\(y_{ij}\)</span>. In addition to <span class="math inline">\(y\)</span>, the data contain the covariate <span class="math inline">\(x\)</span>, which also varies by <span class="math inline">\(i\)</span> and <span class="math inline">\(j\)</span>. The outcome <span class="math inline">\(y\)</span> will have normally distributed, homoskedastic errors, and depend linearly on <span class="math inline">\(x\)</span>, according to the following process: <span class="math inline">\(y_{ij} \sim N(\alpha_i + \beta_i x_{ij},\sigma^2)\)</span>. Each individual will have its own true intercept, <span class="math inline">\(\alpha_i\)</span>, as well as its own coefficient on <span class="math inline">\(x\)</span>, <span class="math inline">\(\beta_i\)</span>. These individual-specific parameters are in turn drawn from their own distributions: <span class="math inline">\(\alpha \sim N(2,1)\)</span>, and <span class="math inline">\(\beta \sim N(-2,1)\)</span>. The code to define the data-generating process (and load necessary packages) is:</p>
<pre class="r"><code>library(rstan)
library(rjags)
library(MASS) # To sample from multivariate normal
library(dclone) # To run JAGS in parallel
set.seed(1)
create.data <- function(n = 500, m = 5, covar = 0, xmean = 0) {
Sigma <- matrix(c(1, covar, covar, 1), 2, 2)
coefs <- mvrnorm(n, mu = c(2, -2), Sigma)
x <- matrix(rnorm(n * m, xmean, 1), ncol = m)
y <- matrix(NA, n, m)
for (i in 1:n) { for (j in 1:m) { y[i, j] <- rnorm(1, coefs[i, 1] + coefs[i, 2] * x[i, j], 2) } }
dat <- list(n = n, m = m, x = x, y = y)
return(dat)
}</code></pre>
<div id="posterior-parameter-correlations" class="section level4">
<h4>Posterior parameter correlations</h4>
<p>A key issue for MCMC samplers is how strong posterior correlations there are between the parameters, which in turn depends on both the data and the model specification. Strong correlations mean the parameters are hard to separate, posing a more difficult task for the samplers. JAGS, which relies on either Gibbs sampling or Metropolis-Hastings, is likely to suffer under such circumstances, while Stan might be expected perform better. With the present data, we can minimize posterior correlations by centering <span class="math inline">\(x\)</span>. Conversely, we can create correlations by moving its mean away from zero. The larger the mean relative to the standard deviation, the harder it will be to separate the constants from the coefficients. The tests below entail two scenarios that differ in this respect. One scenario with <strong>“weak correlations”</strong>: <span class="math inline">\(x_{ij} \sim N(0,1)\)</span>, and one with <strong>“strong correlations”</strong>: <span class="math inline">\(x_{ij} \sim N(2,1)\)</span>.</p>
</div>
</div>
<div id="models" class="section level3">
<h3>Models</h3>
<p>We will use a likelihood function that accurately reflects the data-generating process. In other words, we have:</p>
<p><span class="math display">\[p(y|x,\alpha,\beta,\sigma) = \prod^n_{i=1}\prod^m_{j=1} N(y_{ij}|\alpha_i + \beta_i x_{ij}, \sigma^2)\]</span></p>
<p>We will test five different models to see how the results differ across key specification choices. The main choices are: whether to do some pooling of individual coefficients, whether to use conjugate priors, and whether to use non-centered parameter specifications.</p>
<div id="hierarchical-weakly-informative-fully-conjugate-centered-and-non-centered" class="section level4">
<h4>Hierarchical, weakly informative, fully conjugate (centered and non-centered)</h4>
<p>We will start with a model doing partial pooling, assuming the <span class="math inline">\(\alpha\)</span>’s and <span class="math inline">\(\beta\)</span>’s are drawn from common population distributions: <span class="math display">\[\alpha_i \sim N(\mu_\alpha,\sigma_\alpha^2), ~~~ \beta_i \sim N(\mu_\beta,\sigma_\beta^2)\]</span> This model is fully conjugate, with normal priors on the means and inverse Gamma priors on the variances. (Note that the last parameter in the inverse gamma here is a scale and not a rate parameter.) The hyperpriors and the prior on <span class="math inline">\(\sigma\)</span> are set to be weakly informative: <span class="math display">\[\mu_\alpha,\mu_\beta \sim N(0,3^2), ~~~ \sigma_\alpha^2,\sigma_\beta^2,\sigma^2 \sim \text{Inv-Gamma}(1.5,2)\]</span></p>
<p>In some situations (with little data and/or correlated parameters), one can achieve more efficient and accurate sampling by using a non-centered parameterization (see Papaspiliopoulos et al. 2007, Betancourt and Girolami 2013). For the <span class="math inline">\(\alpha\)</span>-parameters, this can be done by specifying <span class="math inline">\(\alpha_i = \mu_\alpha + \sigma_\alpha \alpha^*_i\)</span>, where <span class="math inline">\(\alpha^*_i \sim N(0,1)\)</span>. The tests below entail both a centered and non-centered version of the conjugate model. (It is tricky to do non-centering without breaking JAGS’ ability to exploit the conjugacy, so the non-centered JAGS version only separates out the common mean).</p>
<!--#### Hierarchical, weakly informative, partly conjugate, non-centered model
Same as partly conjugate, but non-centered alpha and beta.-->
</div>
<div id="hierarchical-weakly-informative-partly-conjugate-centered-and-non-centered" class="section level4">
<h4>Hierarchical, weakly informative, partly conjugate (centered and non-centered)</h4>
<!--The next model is very similar, but only partly conjugate, as the t-distributions are replaced with normal distributions.-->
<p>The next model is very similar, but only partly conjugate, as the inverse Gamma priors on the variances are replaced with gamma priors on the standard deviations. The new priors are set to be approximately as informative as the inverse Gamma’s (although they are necessarily different). The model is then:</p>
<p><span class="math display">\[\alpha_i \sim N(\mu_\alpha,\sigma_\alpha^2), ~~~ \beta_i \sim N(\mu_\beta,\sigma_\beta^2)\]</span></p>
<p><span class="math display">\[\mu_\alpha,\mu_\beta \sim N(0,3^2), ~~~ \sigma_\alpha,\sigma_\beta,\sigma \sim \text{Gamma}(2,.5)\]</span></p>
<p>(Note that the last parameter in the gamma is a rate parameter, i.e. inverse scale).</p>
</div>
<div id="hierarchical-weakly-informative-non-conjugate-centered" class="section level4">
<h4>Hierarchical, weakly informative, non-conjugate, centered</h4>
<!--The next is a non-conjugate partially pooled model, with weakly informative priors. To avoid conjugacy, we use student-t distributions with 100 degrees of freedom for the means, and Gamma-distributions for the standard deviations. -->
<p>The next model is completely non-conjugate, replacing the normal distributions with student-t distributions with 100 degrees of freedom. This is practically the same as the normal, but should prevent JAGS from using Gibbs sampling.</p>
<p><span class="math display">\[\alpha_i \sim \text{Student-t}(100,\mu_\alpha,\sigma_\alpha^2), ~~~ \beta_i \sim \text{Student-t}(100,\mu_\beta,\sigma_\beta^2)\]</span></p>
<p><span class="math display">\[\mu_\alpha,\mu_\beta \sim \text{Student-t}(100,0,3^2), ~~~ \sigma_\alpha,\sigma_\beta,\sigma \sim \text{Gamma}(2,.5)\]</span></p>
</div>
<div id="unpooled-with-uniform-priors" class="section level4">
<h4>Unpooled with uniform priors</h4>
<!--The simplest model entails no pooling and has uniform priors on all parameters (means and SD's): -->
<p>The final model is an unpooled model with uniform priors on all parameters. This model does not make too much sense here (as we know the individual parameters are drawn form a common distribution), but it may serve to illustrate performance in more difficult circumstances. The model is potentially challenging in that there is no conjugacy, no pooling of information across units, and no other help from the priors, which results in larger posterior variances (and potentially covariances): <span class="math display">\[\alpha_i,\beta_i \sim U(-100,100), ~~~ \sigma \sim U(0,100)\]</span></p>
<p>This model only takes a few lines of Stan code, as we get uniform priors by not specifying anything (except relevant limits).</p>
</div>
</div>
<div id="approach" class="section level3">
<h3>Approach</h3>
<p>A key question is how to compare the samplers in way that is both fair and relevant for actual applications. As MCMC draws tend to be auto-correlated, the nominal number of simulation draws is less interesting than the number of effective draws (adjusting for auto-correlation). Furthermore, if we decide on a lower limit for how many effective draws we require, we will most likely want to apply this requirement to all parameters. In other words, the key question is how efficiently we are sampling the parameter with the fewest effective draws (after the chains have found the typical set). Finally, the most relevant measure of efficiency is probably time per effective draw.</p>
<!--mention Ac - > n_eff. after convergence.-->
<p>However, it is not clear what is the most relevant specification of time-consumption either. Stan needs to compile models before running, while JAGS does not. Fortunately, Stan models only need to be compiled once, and it usually takes a minute or two. If you are using an already-compiled model, there is no loss of time. Similarly, if your model is sufficiently complicated to require a long run, compilation time is essentially irrelevant. There is really only one situation in which compilation-time counts, and that is when you are developing a new model, testing it on a small dataset to make things faster, but still having to re-compile for every little change you to make. Given the points above, however, I think it is most reasonable to compare run-times without including compilation-time.</p>
<p>Another issue is whether to include adaptation and warm-up time. I think it is reasonable to include this, as these phases are typically required for any analysis. If a sampler takes very long to converge to the target distribution, this is a relevant cost of using it. Including these phases may, however, introduce some inaccuracy: As we take more post-warm-up draws, the share of time spent on warm-up decreases. This makes the results depend on the length of the post-warm-up run, which is clearly not ideal, but I am still sticking to this approach for now. (I suppose the best approach might be to set the run-length for each model aiming for a specific minimum number of effective draws for all parameters, but this might take extremely long in the slowest instances.)</p>
<p>The models have been fit to 200 datasets for each of the two posterior-correlation scenarios. The datasets have been kept fairly small: <span class="math inline">\(n = 250\)</span> and <span class="math inline">\(m=5\)</span>. The patterns reported below seem to remain when I scale up the data, but further testing may be called for (as the speed of JAGS may suffer more as the scale increases). For each software, I use three out of four available processor cores, and run one parallel chain on each core. To calculate split Rhats and the number of effective draws, I am using the specifications from Gelman et al. 2014 (BDA3). Setting the length of the warm-up phase is a bit tricky, as a longer warm-up is required for the more challenging combinations of data and models. By trial and error, I have set specific warm-up lengths for each model in each scenario so that the models typically converge. In the most challenging situations, the models will sometimes still not fully converge in time, and in these instances the results are not reported. (This may slightly favor models that are performing poorly, but it should not change the general patterns.)</p>
</div>
<div id="results" class="section level3">
<h3>Results</h3>
<p>The results are shown on a logged scale in the boxplot below. The scenario with weak posterior correlations offers ideal conditions, and when JAGS is given a fully conjugate hierarchical model, it is sampling very fast here – showing the fastest performance in this test. Stan is also fast in this situation, but does take about 3 times as long JAGS. The non-centered parameterization appears to slow down both samplers slightly, suggesting it adds to the computational load without much gain. When we move to the partly conjugate models, JAGS gets notably slower, and about as fast as Stan, whether we use the centered or non-centered specification.</p>
<p>Moving from a partly conjugate to completely non-conjugate model (replacing normals with t-distributions), makes JAGS about three times slower still, while Stan retains its performance, being several times faster. Turning to the unpooled models with uniform priors, Stan gets slightly faster, while JAGS gets a bit slower – now sampling nearly five times slower than Stan.</p>
<p>As the posterior correlations increase, the patterns are similar, but with some notable changes: JAGS gets comparatively slower for the fully conjugate models, but still delivers some of the fastest performance. It is now on par with Stan’s performance for the non-centered specification. If we look at the partly conjugate models, non-centering speeds up Stan seven-fold in this scenario, while JAGS sees no such benefit. Finally, for the unpooled model with uniform priors, Stan is pretty fast, while JAGS is very slow – in fact, JAGS is 65 times slower than Stan here.</p>
<p>In sum, JAGS’s performance varies a lot: It is about 765 times faster using a fully conjugate, centered model in the easy scenario than using an unpooled model in the harder scenario. For Stan, the equivalent factor is 4.</p>
<p><img src="https://www.boelstad.net/post/Stan_vs_JAGS_speed_files/figure-html/unnamed-chunk-1-1.svg" width="705.6" style="display: block; margin: auto auto auto 0;" /></p>
<p>Another question is how well the samplers are exploring the target distribution. The paper on NUTS by Hoffman and Gelman (2014) shows that NUTS can perform much better than Gibbs sampling and Random Walk Metropolis in exploring difficult posteriors with correlated parameters. At the most extreme, the authors use a highly correlated 250-dimensional distribution, whereas the data used here mainly entail pairwise parameter correlations (between pairs of <span class="math inline">\(\alpha\)</span>’s and <span class="math inline">\(\beta\)</span>’s, and between their hyperparameters in the partially pooled models). The plots below show 1,000 randomly selected draws of <span class="math inline">\(\alpha_1\)</span> and <span class="math inline">\(\beta_1\)</span> from a single trial using unpooled models with uniform priors. Visually, the samples produced by JAGS and Stan are very similar here. It is just that JAGS takes much longer to converge and produce effective draws.</p>
<p><img src="https://www.boelstad.net/post/Stan_vs_JAGS_speed_files/figure-html/unnamed-chunk-2-1.svg" width="700.8" style="display: block; margin: auto auto auto 0;" /></p>
</div>
<div id="final-notes" class="section level3">
<h3>Final notes</h3>
<p>A key lesson to draw from these results is that model specification matters. For instance, a non-centered specification can be of great help to Stan, and when this is the case, there is no reason not to go this route. For JAGS, it is obviously a good idea to use conjugate priors wherever possible, and thus avoid turning JAGS into Just Another Metropolis Sampler. Another point to note is that centering the predictor <span class="math inline">\(x\)</span> would have turned the scenario with high posterior correlations into one with low correlations – we could actually have avoided the posterior correlations altogether.</p>
<p>In addition to efficiency, there are other considerations that may be relevant for choosing between JAGS and Stan. Stan has a large development team and a very active user group, which makes it easy to solicit expert advice. Stan is also not reliant on a specific type of priors (conjugacy) and can handle a wide range of models. Furthermore, Stan is likely explore complicated posterior distributions better, and thus give more accurate results (see references below). While JAGS can be very fast for certain well-suited problems, Stan may offer the best all-round performance, especially if you are analyzing large datasets using complicated models.</p>
<!--If you find any mistakes, or have suggestions for improving this post, please use the comment section at the bottom.-->
</div>
<div id="how-to-cite" class="section level3">
<h3>How to cite</h3>
<p>
This material can be cited as follows:
</p>
<div class="Bib">
<p>
Bølstad, Jørgen. 2019. “How Efficient is Stan Compared to JAGS? Conjugacy, Pooling, Centering, and Posterior Correlations”. <i>Playing with Numbers: Notes on Bayesian Statistics</i>. www.boelstad.net/post/stan_vs_jags_speed/.
</p>
</div>
<p>Here is a BibTex-formatted reference, which should work when using the natbib and hyperref packages together with a suitable Latex style:</p>
<pre class="text"><code>@Misc{boelstad:2019a,
Author = {B{\o}lstad, J{\o}rgen},
Title = {How Efficient is Stan Compared to JAGS? Conjugacy, Pooling, Centering, and Posterior Correlations},
Howpublished = {Playing with Numbers: Notes on Bayesian Statistics},
Url = {\url{http://www.boelstad.net/post/stan_vs_jags_speed/}},
Year = {2019},
Month = {January 2}}</code></pre>
</div>
<div id="further-reading" class="section level3">
<h3>Further reading</h3>
<div class="Bib">
<p>
Betancourt, Michael, and Mark Girolami. 2013. “<a href="https://arxiv.org/abs/1312.0906">Hamiltonian Monte Carlo for hierarchical models</a>”. arXiv:1312.0906.
</p>
<p>
Betancourt, Michael. 2017. “<a href="https://arxiv.org/abs/1701.02434">A conceptual introduction to Hamiltonian Monte Carlo</a>”. arXiv:1701.02434v2.
</p>
<p>
Gelman, Andrew, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari and Donald B Rubin. 2014. <i>Bayesian Data Analysis.</i> 3rd ed. London: Chapman & Hall/CRC Press.
</p>
<p>
Hoffman, Matthew D., and Andrew Gelman. 2014. “<a href="http://www.stat.columbia.edu/~gelman/research/published/nuts.pdf">The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo</a>”. <i>Journal of Machine Learning Research</i>, 15(1), pp.1593-1623.
</p>
<p>
Neal, Radford M. 2011. “<a href="http://www.mcmchandbook.net/HandbookChapter5.pdf">MCMC Using Hamiltonian Dynamics</a>”. In Steve Brooks, Andrew Gelman, Galin L. Jones, Xiao-Li Meng. <i>Handbook of Markov Chain Monte Carlo</i>. London: Chapman & Hall/CRC Press.
</p>
<p>
Papaspiliopoulos, Omiros, Gareth O. Roberts, and Martin Sköld. 2007. “<a href="https://arxiv.org/pdf/0708.3797.pdf">A general framework for the parametrization of hierarchical models</a>”. <i>Statistical Science</i>, 59-73.
</p>
</div>
</div>
<div id="appendix-model-codes" class="section level3">
<h3>Appendix: Model codes</h3>
<pre class="r"><code>stan.unpooled.code <- '
data {
int<lower=0> n;
int<lower=0> m;
vector[m] x[n];
vector[m] y[n];
}
parameters {
vector[n] alpha;
vector[n] beta;
real<lower=0> sigma;
}
model {
for (i in 1:n)
y[i] ~ normal(alpha[i] + beta[i] * x[i], sigma);
}
'
jags.unpooled.code <- function() {
for (i in 1:n) {
for (j in 1:m) {
y[i, j] ~ dnorm(alpha[i] + beta[i] * x[i, j], tau)
}
alpha[i] ~ dunif(-100, 100)
beta[i] ~ dunif(-100, 100)
}
tau <- pow(sigma, -2)
sigma ~ dunif(0, 100)
}</code></pre>
<pre class="r"><code>stan.nonconjugate.code <- '
data {
int<lower=0> n;
int<lower=0> m;
vector[m] x[n];
vector[m] y[n];
}
parameters {
vector[n] alpha;
vector[n] beta;
real mu_alpha;
real mu_beta;
real<lower=0> sigma_alpha;
real<lower=0> sigma_beta;
real<lower=0> sigma;
}
model {
for (i in 1:n)
y[i] ~ normal(alpha[i] + beta[i] * x[i], sigma);
alpha ~ student_t(100, mu_alpha, sigma_alpha);
beta ~ student_t(100, mu_beta, sigma_beta);
mu_alpha ~ student_t(100, 0, 3);
mu_beta ~ student_t(100, 0, 3);
sigma_alpha ~ gamma(2, .5);
sigma_beta ~ gamma(2, .5);
sigma ~ gamma(2, .5);
}
'
jags.nonconjugate.code <- function() {
for (i in 1:n) {
for (j in 1:m) {
y[i, j] ~ dnorm(alpha[i] + beta[i] * x[i, j], tau)
}
alpha[i] ~ dt(mu_alpha, tau_alpha, 100)
beta[i] ~ dt(mu_beta, tau_beta, 100)
}
mu_alpha ~ dt(0, pow(3, -2),100)
mu_beta ~ dt(0, pow(3, -2),100)
tau_alpha <- pow(sigma_alpha, -2)
sigma_alpha ~ dgamma(2, .5)
tau_beta <- pow(sigma_beta, -2)
sigma_beta ~ dgamma(2, .5)
tau <- pow(sigma, -2)
sigma ~ dgamma(2, .5)
}</code></pre>
<pre class="r"><code>stan.partconjugate.code <- '
data {
int<lower=0> n;
int<lower=0> m;
vector[m] x[n];
vector[m] y[n];
}
parameters {
vector[n] alpha;
vector[n] beta;
real mu_alpha;
real mu_beta;
real<lower=0> sigma_alpha;
real<lower=0> sigma_beta;
real<lower=0> sigma;
}
model {
for (i in 1:n)
y[i] ~ normal(alpha[i] + beta[i] * x[i], sigma);
alpha ~ normal(mu_alpha, sigma_alpha);
beta ~ normal(mu_beta, sigma_beta);
mu_alpha ~ normal(0, 3);
mu_beta ~ normal(0, 3);
sigma_alpha ~ gamma(2, .5);
sigma_beta ~ gamma(2, .5);
sigma ~ gamma(2, .5);
}
'
jags.partconjugate.code <- function() {
for (i in 1:n) {
for (j in 1:m) {
y[i, j] ~ dnorm(alpha[i] + beta[i] * x[i, j], tau)
}
alpha[i] ~ dnorm(mu_alpha, tau_alpha)
beta[i] ~ dnorm(mu_beta, tau_beta)
}
mu_alpha ~ dnorm(0, pow(3, -2))
mu_beta ~ dnorm(0, pow(3, -2))
tau_alpha <- pow(sigma_alpha, -2)
sigma_alpha ~ dgamma(2, .5)
tau_beta <- pow(sigma_beta, -2)
sigma_beta ~ dgamma(2, .5)
tau <- pow(sigma, -2)
sigma ~ dgamma(2, .5)
}</code></pre>
<pre class="r"><code>stan.noncentered.code <- '
data {
int<lower=0> n;
int<lower=0> m;
vector[m] x[n];
vector[m] y[n];
}
parameters {
vector[n] alpha_raw;
vector[n] beta_raw;
real mu_alpha;
real mu_beta;
real<lower=0> sigma_alpha;
real<lower=0> sigma_beta;
real<lower=0> sigma;
}
transformed parameters {
vector[n] alpha = mu_alpha + sigma_alpha * alpha_raw;
vector[n] beta = mu_beta + sigma_beta * beta_raw;
}
model {
for (i in 1:n)
y[i] ~ normal(alpha[i] + beta[i] * x[i], sigma);
alpha_raw ~ normal(0, 1);
beta_raw ~ normal(0, 1);
mu_alpha ~ normal(0, 3);
mu_beta ~ normal(0, 3);
sigma_alpha ~ gamma(2, .5);
sigma_beta ~ gamma(2, .5);
sigma ~ gamma(2, .5);
}
'
jags.noncentered.code <- function() {
for (i in 1:n) {
for (j in 1:m) {
y[i, j] ~ dnorm(alpha[i] + beta[i] * x[i, j], tau)
}
alpha[i] = mu_alpha + sigma_alpha * alpha_raw[i]
beta[i] = mu_beta + sigma_beta * beta_raw[i]
alpha_raw[i] ~ dnorm(0, 1)
beta_raw[i] ~ dnorm(0, 1)
}
mu_alpha ~ dnorm(0, pow(3, -2))
mu_beta ~ dnorm(0, pow(3, -2))
sigma_alpha ~ dgamma(2, .5)
sigma_beta ~ dgamma(2, .5)
tau <- pow(sigma, -2)
sigma ~ dgamma(2, .5)
}</code></pre>
<pre class="r"><code>stan.conjugate.code <- '
data {
int<lower=0> n;
int<lower=0> m;
vector[m] x[n];
vector[m] y[n];
}
parameters {
vector[n] alpha;
vector[n] beta;
real mu_alpha;
real mu_beta;
real<lower=0> sigma_alpha_sq;
real<lower=0> sigma_beta_sq;
real<lower=0> sigma_sq;
}
transformed parameters {
real sigma_alpha = sqrt(sigma_alpha_sq);
real sigma_beta = sqrt(sigma_beta_sq);
real sigma = sqrt(sigma_sq);
}
model {
for (i in 1:n)
y[i] ~ normal(alpha[i] + beta[i] * x[i], sigma);
alpha ~ normal(mu_alpha, sigma_alpha);
beta ~ normal(mu_beta, sigma_beta);
mu_alpha ~ normal(0, 3);
mu_beta ~ normal(0, 3);
sigma_alpha_sq ~ inv_gamma(1.5, 2);
sigma_beta_sq ~ inv_gamma(1.5, 2);
sigma_sq ~ inv_gamma(1.5, 2);
}
'
jags.conjugate.code <- function() {
for (i in 1:n) {
for (j in 1:m) {
y[i, j] ~ dnorm(alpha[i] + beta[i] * x[i, j], tau)
}
alpha[i] ~ dnorm(mu_alpha, tau_alpha)
beta[i] ~ dnorm(mu_beta, tau_beta)
}
mu_alpha ~ dnorm(0, pow(3, -2))
mu_beta ~ dnorm(0, pow(3, -2))
tau_alpha ~ dgamma(1.5, .5)
tau_beta ~ dgamma(1.5, .5)
tau ~ dgamma(1.5, .5)
}</code></pre>
<pre class="r"><code>stan.conj.noncentered.code <- '
data {
int<lower=0> n;
int<lower=0> m;
vector[m] x[n];
vector[m] y[n];
}
parameters {
vector[n] alpha_raw;
vector[n] beta_raw;
real mu_alpha;
real mu_beta;
real<lower=0> sigma_alpha_sq;
real<lower=0> sigma_beta_sq;
real<lower=0> sigma_sq;
}
transformed parameters {
real sigma_alpha = sqrt(sigma_alpha_sq);
real sigma_beta = sqrt(sigma_beta_sq);
real sigma = sqrt(sigma_sq);
vector[n] alpha = mu_alpha + sigma_alpha * alpha_raw;
vector[n] beta = mu_beta + sigma_beta * beta_raw;
}
model {
for (i in 1:n)
y[i] ~ normal(alpha[i] + beta[i] * x[i], sigma);
alpha_raw ~ normal(0, 1);
beta_raw ~ normal(0, 1);
mu_alpha ~ normal(0, 3);
mu_beta ~ normal(0, 3);
sigma_alpha_sq ~ inv_gamma(1.5, 2);
sigma_beta_sq ~ inv_gamma(1.5, 2);
sigma_sq ~ inv_gamma(1.5, 2);
}
'
jags.conj.noncentered.code <- function() {
for (i in 1:n) {
for (j in 1:m) {
y[i, j] ~ dnorm(mu_alpha + alpha[i] + (mu_beta + beta[i]) * x[i, j], tau)
}
alpha[i] ~ dnorm(0, tau_alpha)
beta[i] ~ dnorm(0, tau_beta)
}
mu_alpha ~ dnorm(0, pow(3, -2))
mu_beta ~ dnorm(0, pow(3, -2))
tau_alpha ~ dgamma(1.5, .5)
tau_beta ~ dgamma(1.5, .5)
tau ~ dgamma(1.5, .5)
}</code></pre>
</div>
<div id="appendix-code-to-fit-the-models" class="section level3">
<h3>Appendix: Code to fit the models</h3>
<pre class="r"><code>stan.unpooled <- stan_model(model_code = stan.unpooled.code)
stan.nonconjugate <- stan_model(model_code = stan.nonconjugate.code)
stan.partconjugate <- stan_model(model_code = stan.partconjugate.code)
stan.noncentered <- stan_model(model_code = stan.noncentered.code)
stan.conjugate <- stan_model(model_code = stan.conjugate.code)
stan.conj.noncentered <- stan_model(model_code = stan.conj.noncentered.code)</code></pre>
<pre class="r"><code>source("https://avehtari.github.io/rhat_reff/monitornew.R") ## File with updated (BDA3) code for split Rhat and N_eff.
get.stats <- function(sims, time) {
eff.draws <- apply(sims, 3, ess_rfun)
min.eff.draws <- min(eff.draws)
mean.eff.draws <- mean(eff.draws)
rhat <- max(apply(sims, 3, function(x) rhat_rfun(split_chains(x))))
res <- c(rhat, time, mean.eff.draws, min.eff.draws, time / min.eff.draws)
return(res)
}
stan.mods <- list(stan.unpooled, stan.nonconjugate, stan.partconjugate,
stan.noncentered, stan.conjugate, stan.conj.noncentered)
jags.mods <- list(jags.unpooled.code, jags.nonconjugate.code, jags.partconjugate.code,
jags.noncentered.code, jags.conjugate.code, jags.conj.noncentered.code)
jags.pars <- list(c('alpha', 'beta', 'sigma'),
c('alpha', 'beta', 'sigma', 'mu_alpha', 'mu_beta', 'sigma_alpha', 'sigma_beta'),
c('alpha', 'beta', 'sigma', 'mu_alpha', 'mu_beta', 'sigma_alpha', 'sigma_beta'),
c('alpha', 'beta', 'sigma', 'mu_alpha', 'mu_beta', 'sigma_alpha', 'sigma_beta'),
c('alpha', 'beta', 'tau', 'mu_alpha', 'mu_beta', 'tau_alpha', 'tau_beta'),
c('alpha', 'beta', 'tau', 'mu_alpha', 'mu_beta', 'tau_alpha', 'tau_beta'))
jags.warmup <- rbind(c(rep(3000, 6)), c(20000, 7000, 3000, 3000, 3000, 3000))
stan.warmup <- rbind(c(rep(1000, 6)), c(1000, 2000, 1000, 1000, 1000, 1000))
ns = c(250)
ms = c(5)
xmeans <- c(0, 2)
n.trials <- 100
n.cores <- 3
results <- vector()
for (t in 1:n.trials) {
for(xm in 1:length(xmeans)) {
for (n in ns) {
for (m in ms) {
dat <- create.data(n, m, covar=0, xmean=xmeans[xm])
for (md in 1:length(jags.mods)) {
ptm <- proc.time()
cl <- makePSOCKcluster(n.cores)
tmp <- clusterEvalQ(cl, library(dclone))
fit1 <- jags.parfit(cl = cl, data = dat, params = jags.pars[[md]],
model = jags.mods[[md]], n.chains = n.cores,
n.adapt = 100, n.update = jags.warmup[xm,md], n.iter = 4000, thin = 2)
stopCluster(cl)
time <- proc.time() - ptm
sims <- aperm(as.array(fit1), c(1, 3, 2))
res <- cbind(n, m, xmeans[xm], t, 1, md, rbind(get.stats(sims, time[3])))
results <- rbind(results, res)
}
for (md in 1:length(stan.mods)) {
ptm <- proc.time()
fit2 <- sampling(stan.mods[[md]], data = dat, chains = n.cores, cores = n.cores,
warmup = stan.warmup[xm,md], iter = stan.warmup[xm, md] + 2000)
time <- proc.time() - ptm
res <- cbind(n, m, xmeans[xm], t, 2, md, rbind(get.stats(extract(fit2, permuted = F), time[3])))
results <- rbind(results, res)
}
}
}
}
print(paste("Finished trial ", t, " of ", n.trials, ", at ", date(), sep=""))
}
results <- data.frame(results)
names(results)[5] <- 'sampler'
names(results)[3] <- 'xm'
names(results)[7:11] <- c('rhat', 'time', 'mean.size', 'min.size', 'time.per.draw')
results[, 5] <- factor(2 - results[, 5], labels = c('Stan', 'JAGS'))
results[, 6] <- factor(results[, 6], labels = c('Unpooled, uniform priors', 'Hierarchical, weakly informative, non-conjugate, centered', 'Hierarchical, weakly informative, partly conjugate, centered', 'Hierarchical, weakly inf., partly conjugate, non-centered', 'Hierarchical, weakly informative, fully conjugate, centered', 'Hierarchical, weakly informative, fully conjugate, non-centered'))
results$n <- factor(results$n)
levels(results$n) <- paste("n = ", levels(results$n), sep="")
results$xm <- factor(results$xm)
levels(results$xm) <- c("Low posterior\n parameter correlations", "High posterior\n parameter correlations")</code></pre>
<!--```{r}
#pander(sessionInfo())
#sessionInfo()
# I a using Harvard citation style + full names, and quotation marks around titles...
```-->
</div>