Bayesian modeling has proven its usefulness for poll aggregation and election forecasting – for instance through FiveThirtyEight. However, in Norway, media coverage of public opinion trends still tends to focus on a single poll at a time, or a simple average of polls at best. I was convinced it would be possible to do better, and after about three times as many long days and nights as I thought it would take, the result is finally live – both in Norwegian and English – at Estimite.com.
Notes on the approach
I have landed on a fully automated production process, where a script checks for new data, refits the model if necessary, and then rebuilds and publishes the website. This is all done using R, Stan, and a few shell scripts. Here is a brief description of the model:
This is a state space model in which the latent trends are centered log-ratio (CLR) transformed and given multivariate t-distributed innovations. The multivariate t has a non-centered specification through Cholesky factorization, which makes a notable difference to how well the model samples.
The model estimates and corrects for average polling bias relative to previous elections. This contrasts with Nate Silver’s models at FiveThirtyEight, which assume that house effects have a mean of zero – I never fully bought that idea. Looking at Norwegian data, there seems to be a consistent pattern across elections, and ignoring it might be risky. At the same time, correcting for these biases carries its own risk, as it assumes polling companies do not fix their methodology from one election to the next.
The model estimates latent trends in party support at the national level, and predicts latent support for each electoral district based on previous election results. The predictions are restricted to sum to the national trend. Votes at the district level are predicted from the district-level trends using a Dirichlet-Multinomial distribution – the same kind of distribution as the one assumed to generate the polling data.
Predictions for future time points are based on: (1) vote shares at the last election, (2) the degree of change in each party’s vote share from one election to the next, (3) current party support, (4) the degree of change in each party’s support from one week to the next, and (5) the degree of covariation in how support for each possible pair of parties evolves on a weekly basis.
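To make the logratio transform and the non-centered multivariate t innovations from the model description above a bit more concrete, here is a minimal sketch in plain Python. It is not the Stan code behind the site: the function names, the three-party shares, the covariance matrix and the degrees of freedom are all made up for illustration. The idea is that a multivariate t draw can be built from independent standard normals, a Cholesky factor and a single gamma-distributed scale, which is the kind of reparameterization that tends to be easier for a sampler to work with.

```python
import math
import random


def clr(shares):
    """Center logratio transform: log shares minus their mean log."""
    logs = [math.log(s) for s in shares]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def clr_inverse(values):
    """Map logratio coordinates back to shares that sum to one (softmax)."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cholesky(matrix):
    """Lower-triangular L with L L^T = matrix (symmetric positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - acc)
            else:
                lower[i][j] = (matrix[i][j] - acc) / lower[j][j]
    return lower


def t_innovation(lower, nu, rng):
    """Non-centered multivariate t draw: (L z) / sqrt(w), z standard normal."""
    z = [rng.gauss(0.0, 1.0) for _ in lower]
    # w = chi-square(nu) / nu, i.e. Gamma(shape=nu/2, scale=2/nu)
    w = rng.gammavariate(nu / 2.0, 2.0 / nu)
    scale = 1.0 / math.sqrt(w)
    return [scale * sum(lower[i][k] * z[k] for k in range(i + 1))
            for i in range(len(lower))]


rng = random.Random(1)
shares = [0.5, 0.3, 0.2]  # made-up national shares for three parties
cov = [[0.010, 0.002, 0.000],
       [0.002, 0.010, 0.001],
       [0.000, 0.001, 0.010]]  # made-up weekly innovation covariance
lower = cholesky(cov)

# One weekly step: add a t-distributed innovation on the logratio scale,
# then map back to shares.
step = t_innovation(lower, nu=4.0, rng=rng)
next_shares = clr_inverse([c + e for c, e in zip(clr(shares), step)])
print(next_shares)
```

Because the inverse transform normalizes, the resulting shares are positive and sum to one no matter what the innovation looks like. How the actual model constrains the covariance on the logratio scale is not shown here.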
Stan is great for this, because it lets us define generated quantities, and get all kinds of relevant outputs from a single, joint simulation.
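As a rough illustration of what a joint simulation buys you, the sketch below draws district-level votes from a Dirichlet-Multinomial distribution and then reads two different summaries off the same set of simulated draws. It is plain Python rather than a Stan generated quantities block, and everything in it is an assumption for the sake of the example: the function name, the district shares, the concentration parameter and the number of votes. In a real setup each simulation would also use a different posterior draw of the shares, which is skipped here.

```python
import random
from collections import Counter


def dirichlet_multinomial(shares, concentration, n_votes, rng):
    """Draw vote counts: Dirichlet probabilities, then multinomial counts."""
    alphas = [concentration * s for s in shares]
    # Normalized independent gamma draws give a Dirichlet draw.
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    probs = [g / total for g in gammas]
    picks = Counter(rng.choices(range(len(shares)), weights=probs, k=n_votes))
    return [picks.get(i, 0) for i in range(len(shares))]


rng = random.Random(7)
district_shares = [0.45, 0.35, 0.20]  # made-up latent district support
n_votes = 1000                        # made-up number of votes cast
n_sims = 500

sims = [dirichlet_multinomial(district_shares, 500.0, n_votes, rng)
        for _ in range(n_sims)]

# Two outputs from the same simulated draws:
mean_share_first = sum(v[0] for v in sims) / (n_sims * n_votes)
prob_first_largest = sum(v[0] == max(v) for v in sims) / n_sims
print(mean_share_first, prob_first_largest)
```

Any other quantity of interest, such as seat counts or the probability of clearing a threshold, could be computed from the same list of simulated votes without running anything again.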
You should not underestimate how many technical challenges such projects entail and how much time it takes to fix them – especially if you care about details.