--- title: "8 Schools Stan Improved" author: "Russell Almond" date: "1/31/2019" output: html_notebook editor_options: chunk_output_type: inline --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # The 8 Schools Problem This is the classic eight schools example from Rubin(1981)[^1]. (It is also found in Chapter 5 of Gelman et al., 2014 [^2].) The story is that 8 different schools experimented with an SAT coaching experiment. The performance gains of the coached students were compared to students on a weight list control. Separate estimates were obtained for each school, but because the size of the schools differed, the standard errors differed as well. Here are the data: ```{r data} Schools <- data.frame(row.names=c("A","B","C","D","E","F","G","H"), effect = c(28.39,7.94,-2.75,6.82,-.64,.63,18.01,12.16), see = c(14.9, 10.2, 16.3, 11.0, 9.4, 11.4, 10.4, 17.6)) Schools ``` Lets start by calculating a weighted average effect. I'll weight each case by the precision (one over the square of the see). ```{r grand mean} Schools$w <- 1/Schools$see^2 Schools.mean <- sum(Schools$w*Schools$effect)/sum(Schools$w) Schools.mean ``` Here is a plot of the data. ```{r data Plot} ord <- order(Schools$effect) plot(Schools$effect[ord[c(1,8)]]+c(-2,2)*Schools$see[ord[c(1,8)]], c(nrow(Schools),1),main = "8 Schools data.",type="n",yaxt="n", xlab="Effect Size",ylab="School") points(Schools$effect[ord],nrow(Schools):1,pch=rownames(Schools)[ord]) segments(Schools$effect[ord]-2*Schools$see[ord],nrow(Schools):1, Schools$effect[ord]+2*Schools$see[ord],nrow(Schools):1) abline(v=Schools.mean,col="blue") ``` ## Model Setup in Stan. First we need to load the packages. The `rstan` package runs stan from R. The `shinystan` package gives us a browser for the results. 
```{r}
library(rstan)
library(shinystan)
library(parallel)  # For using multiple chains
options(mc.cores=5)
```

First we set up the Stan model and put it into a variable called `school8.stan` (note the `output.var="school8.stan"` in the stan chunk header).

```{stan output.var="school8.stan"}
data {
  int<lower=0> J;          // number of schools
  real y[J];               // estimated treatment effects
  real<lower=0> sigma[J];  // s.e. of effects
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] eta;
}
transformed parameters {
  vector[J] theta;
  theta = mu + eta*tau;
}
model {
  eta ~ normal(0,1);        // target += normal_lpdf(eta | 0, 1);
  y ~ normal(theta,sigma);  // target += normal_lpdf(y | theta, sigma);
}
```

### Parts of the Stan code

Almost all models will have `data`, `parameters`, and `model` blocks. They can have other blocks as well (common ones are `transformed data`, for pre-calculations, and `transformed parameters`, to recast the model).

#### Data

In a Stan model, _data_ refers to values that don't change over the course of the MCMC loop. This tends to be one of three things:

* The observed data values.
* Fixed hyperparameters for prior distributions.
* Structural hyperparameters (e.g., sample size, number of groups).

In Stan (as in C++) all variables must be declared. They are generally either `int` or `real` (here `vector` is shorthand for a vector of real values). The modifiers `<lower=...>` and `<upper=...>` can be used to constrain the inputs. In the data block this is just used to do type checking.

#### Parameters

The parameters are the values that Stan will try to estimate. These include both latent variables and ordinary parameters and hyperparameters. In Stan, parameters should all be real: non-continuous parameters don't work with the Hamiltonian Monte Carlo sampler. (There are tricks for dealing with common cases like mixture models in the Stan examples.) Stan works with the log of the posterior density, which is called `lp__` in the output.

Note carefully the use of the lower and upper bounds.
It is important for Stan to know when a parameter is restricted to, say, positive values (i.e., a scale parameter), as it needs to constrain the space for the sampler.

#### Model

This section gives the distribution for all of the parameters. First I give the BUGS-like way of doing this, which is to use the `~` operator to give the distribution. The names are slightly different in Stan and R, so the rstan function `lookup` can help you find the Stan function corresponding to an R function.

```{r lookup, echo=TRUE}
lookup(dnorm)
lookup(dt)
```

Note that what Stan is actually doing in the model block is calculating the log p.d.f. Thus, the commented-out expressions are an alternative to the `~` notation.

Finally, note that this is an incomplete Bayesian model, as there are no distributions for `mu` or `tau`. In Stan this means we are implicitly putting flat (improper) uniform priors on `mu` and `tau`; internally Stan samples `tau` on the log scale so that it will always be positive. These are improper priors, but Stan will be fine as long as the posterior is proper.

## Data preparation

Build a list which contains the data as elements, using the names in the `data` block of the model.

```{r School 8 Data}
school8.dat <- list(
  J = nrow(Schools),
  y = Schools$effect,
  sigma = Schools$see)
```

## Running Stan

There are two functions to start the sampling in Stan. The first one is `stan(file=XXX, data=YYY, ...)`, which assumes that the Stan model is in a file. However, with R Markdown we have already saved the compiled model in an object, so we can use `sampling(model, data=YYY, ...)`. By the way, this same function can be used to draw additional samples after we have sampled for a while. The `mc.cores` option should work well on Unix (Linux and Mac OS); I'm not so sure about Windows.
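Before running the sampler, it may help to see why the model declares `eta` and builds `theta = mu + eta*tau` in the `transformed parameters` block instead of sampling `theta` directly. The two forms describe the same distribution; a quick base-R simulation (with arbitrary illustrative values of `mu` and `tau`) checks the equivalence:

```r
set.seed(1234)
mu <- 8; tau <- 6; n <- 100000

theta.centered    <- rnorm(n, mu, tau)         # theta ~ normal(mu, tau)
theta.noncentered <- mu + tau*rnorm(n, 0, 1)   # theta = mu + eta*tau, eta ~ normal(0,1)

# Both draws should have (approximately) the same mean and sd.
round(c(mean(theta.centered), mean(theta.noncentered)), 2)
round(c(sd(theta.centered), sd(theta.noncentered)), 2)
```

In Stan this trick is called the non-centered parameterization; for hierarchical models like this one it also tends to sample more efficiently when `tau` is near zero, which is why the model is written this way.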
```{r Run Stan-sampling}
#options(mc.cores = parallel::detectCores()-1)
school8.fit1 <- sampling(
  school8.stan,        # The model
  data = school8.dat,  # The data
  chains = 5,
  warmup = 1000,
  iter = 2000,
  refresh = 1000       # Show progress
)
summary(school8.fit1)
```

Alternate style using an external file:

```{r Run Stan-stan}
options(mc.cores = parallel::detectCores()-1)
school8.fit1 <- stan(
  file = "Bayesian/school8.stan",  # The model
  data = school8.dat,              # The data
  chains = 5,
  warmup = 1000,
  iter = 2000,
  refresh = 100                    # Show progress
)
summary(school8.fit1)
```

## Summaries using the base Stan functions

Printing gives summaries of the posterior for the specified parameters. Use the `pars` argument to select what to print.

```{r print stanfit}
print(school8.fit1, pars=c("theta","mu","tau","lp__"),
      probs=c(.1,.5,.9))
```

The `plot` function gives 50% and 95% intervals.

```{r stan plot}
plot(school8.fit1, pars=c("theta","mu","tau"))
```

The traceplot shows convergence (note that stan and coda have slightly different `traceplot` functions).

```{r stan traceplot}
rstan::traceplot(school8.fit1, pars=c("mu","tau","lp__"),
                 inc_warmup=TRUE, nrow=3)
```

What we are looking for here is (a) white-noise-like behavior, and (b) all chains plotting on top of each other. Typically variance and scale parameters will be positively skewed, and the log posterior is often not as well behaved as the other parameters.

Pairs plots sometimes uncover problems with models.

```{r stan pairs}
pairs(school8.fit1, pars=c("mu","eta[1]","tau"))
```

## Shinystan

Shinystan opens a new shiny workspace which allows interactive browsing of the results.

```{r shinystan}
school8.fit1s <- launch_shinystan(school8.fit1)
```

### References

[^1]: Rubin, D. B. (1981). Estimation in parallel randomized experiments. _Journal of Educational Statistics_, __6__, 377-401.

[^2]: Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. and Rubin, D. B. (2014). _Bayesian Data Analysis: Third Edition_. CRC Press. (ISBN: 978-1-4398-4095-5)

[^3]: Gelman, A. and Hill, J.
(2007). _Data Analysis Using Regression and Multilevel/Hierarchical Models_. Cambridge University Press.

[^4]: Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). _Journal of the Royal Statistical Society, Series B_, __39__, 1-38.