The 8 Schools Problem

This is the classic eight schools example from Rubin (1981) [1]. (It is also found in Chapter 5 of Gelman et al., 2014 [2].) The story is that eight different schools ran an SAT coaching experiment. The performance gains of the coached students were compared to those of students in a wait-list control group. Separate estimates were obtained for each school, but because the sizes of the schools differed, the standard errors differed as well.

Here are the data:

Schools <- data.frame(row.names=c("A","B","C","D","E","F","G","H"),
                      effect = c(28.39,7.94,-2.75,6.82,-.64,.63,18.01,12.16),
                      see = c(14.9, 10.2, 16.3, 11.0, 9.4, 11.4, 10.4, 17.6))
Schools

Let's start by calculating a weighted average effect. I'll weight each case by its precision (one over the square of its standard error, the see column).
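
In symbols, the precision-weighted mean computed below is

$$\bar{y} = \frac{\sum_j w_j y_j}{\sum_j w_j}, \qquad w_j = \frac{1}{\mathrm{se}_j^2}.$$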

Schools$w <- 1/Schools$see^2
Schools.mean <- sum(Schools$w*Schools$effect)/sum(Schools$w)
Schools.mean
[1] 7.870546

Here is a plot of the data.

ord <- order(Schools$effect)
# Set up a plotting region wide enough for all +/- 2 s.e. intervals.
plot(Schools$effect[ord[c(1,8)]]+c(-2,2)*Schools$see[ord[c(1,8)]],
     c(nrow(Schools),1),main = "8 Schools data.",type="n",yaxt="n",
     xlab="Effect Size",ylab="School")
# Plot each school's effect (as its letter) with a +/- 2 s.e. bar.
points(Schools$effect[ord],nrow(Schools):1,pch=rownames(Schools)[ord])
segments(Schools$effect[ord]-2*Schools$see[ord],nrow(Schools):1,
         Schools$effect[ord]+2*Schools$see[ord],nrow(Schools):1)
# Mark the precision-weighted grand mean.
abline(v=Schools.mean,col="blue")

Model Setup in Stan.

First we need to load the packages. The rstan package runs Stan from R. The shinystan package gives us a browser for the results.

library(rstan)
library(shinystan)
library(parallel) # For using multiple chains

First we set up the Stan model and put it into a variable called school8.stan (note the output.var="school8.stan" in the stan chunk header). The model is the usual hierarchical one: each observed effect y[j] is normal with mean theta[j] and known standard error sigma[j], and the school effects theta[j] are in turn normal with mean mu and standard deviation tau.
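
data {
  int<lower=0> J;          // number of schools
  real y[J];               // estimated treatment effects
  real<lower=0> sigma[J];  // standard errors of the effects
}
parameters {
  real mu;                 // grand mean effect
  real<lower=0> tau;       // between-school standard deviation
  vector[J] theta;         // school-specific effects
}
model {
  theta ~ normal(mu,tau);
  //target += normal_lpdf(theta|mu,tau);
  y ~ normal(theta,sigma);
  //target += normal_lpdf(y|theta,sigma);
}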

Parts of the Stan code.

Almost all models will have a data, parameters and model block. They could have others as well (common ones are transformed data to do pre-calculations and transformed parameters to recast the model).

Data

In a Stan model, data refers to values that don't change over the course of the MCMC loop. This tends to be one of three things:

* The observed data values.
* Fixed hyperparameters for prior distributions.
* Structural hyperparameters (e.g., sample size, number of groups).

In Stan (as in C++) all variables must be declared. They are generally either int or real (here vector is shorthand for a vector of real values). The modifiers <lower=...> and <upper=...> can be used to constrain the inputs. In the data block these bounds are just used to check the input values.
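
For illustration, here are a few hypothetical declarations (not all taken from the model above) combining types with bounds:

int<lower=1> J;            // a positive integer, e.g. a sample size
real<lower=0, upper=1> p;  // a real constrained to [0,1], e.g. a probability
vector<lower=0>[J] sigma;  // a vector of J non-negative reals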

Parameters

The parameters are the values that Stan will try to estimate. These include latent variables as well as ordinary parameters and hyperparameters.

In Stan, parameters should all be continuous (real). Non-continuous parameters don't work with Hamiltonian Monte Carlo. (There are tricks for dealing with common cases like mixture models in the Stan examples.) Stan also tracks the log of the posterior density, which is called lp__ in the output.

Note carefully the use of the lower and upper bounds. It is important for Stan to know when a parameter is restricted to, say, positive values (i.e., a scale parameter), as it needs to constrain the space for the sampler.

Model

This section gives the distributions for all of the parameters. First I give the BUGS-like way of doing this, which is to use the ~ operator to specify a distribution. The names are slightly different in Stan and R, so the rstan function lookup can help you find the Stan function corresponding to an R function.

lookup(dnorm)
lookup(dt)

Note that what Stan is actually doing in the model block is accumulating the log p.d.f. Thus, the commented-out expressions are an alternative to the ~ notation.
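
For example, the two statements below yield the same posterior; the ~ form simply drops additive constants that do not affect sampling:

theta ~ normal(mu, tau);                 // sampling-statement notation
target += normal_lpdf(theta | mu, tau);  // explicit increment of the log density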

Finally, note that this is an incomplete Bayesian model, as there are no distributions for mu or tau. In Stan this means we are implicitly putting flat (uniform) priors on them; because tau is declared with a lower bound of zero, its flat prior covers only the positive reals. These are improper priors, but Stan will be fine as long as the posterior is proper.
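
If we wanted proper priors, we could add them to the model block explicitly. Here is a minimal sketch; the particular choices (a diffuse normal and a half-Cauchy) are illustrative, not part of the original model:

model {
  mu ~ normal(0, 20);   // weakly informative prior on the grand mean
  tau ~ cauchy(0, 5);   // a half-Cauchy, since tau is declared <lower=0>
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
}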

Data preparation

Build a list which contains the data as elements using the names in the data section of the model.

school8.dat <- list(
  J = nrow(Schools),
  y = Schools$effect,
  sigma=Schools$see)

Running Stan

There are two functions to start the sampling in Stan. The first is stan(file=XXX,data=YYY,...), which assumes that the Stan model is in a file. However, with R Markdown we have already saved the compiled model in an object, so we can use sampling(model,data=YYY,...). By the way, this same function can be used to draw additional samples after we have sampled for a while.

Setting mc.cores should work well on Unix (Linux and Mac OS); I'm not so sure about Windows.

#options(mc.cores = parallel::detectCores()-1)
school8.fit1 <- sampling(
  school8.stan,       # The model
  data = school8.dat, # The data
  chains = 5,
  warmup = 1000,
  iter = 2000,
  refresh=100        # Show progress
)
starting worker pid=925 on localhost:11964 at 15:53:01.250
starting worker pid=934 on localhost:11964 at 15:53:01.367
starting worker pid=943 on localhost:11964 at 15:53:01.485
starting worker pid=952 on localhost:11964 at 15:53:01.599
starting worker pid=963 on localhost:11964 at 15:53:01.713

SAMPLING FOR MODEL 'school8' NOW (CHAIN 1).

Gradient evaluation took 1e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.1 seconds.
Adjust your expectations accordingly!


Iteration:    1 / 2000 [  0%]  (Warmup)
Iteration:  100 / 2000 [  5%]  (Warmup)
Iteration:  200 / 2000 [ 10%]  (Warmup)
Iteration:  300 / 2000 [ 15%]  (Warmup)
Iteration:  400 / 2000 [ 20%]  (Warmup)
Iteration:  500 / 2000 [ 25%]  (Warmup)
Iteration:  600 / 2000 [ 30%]  (Warmup)
Iteration:  700 / 2000 [ 35%]  (Warmup)
Iteration:  800 / 2000 [ 40%]  (Warmup)
Iteration:  900 / 2000 [ 45%]  (Warmup)
Iteration: 1000 / 2000 [ 50%]  (Warmup)
Iteration: 1001 / 2000 [ 50%]  (Sampling)
Iteration: 1100 / 2000 [ 55%]  (Sampling)
Iteration: 1200 / 2000 [ 60%]  (Sampling)
Iteration: 1300 / 2000 [ 65%]  (Sampling)
Iteration: 1400 / 2000 [ 70%]  (Sampling)
Iteration: 1500 / 2000 [ 75%]  (Sampling)
Iteration: 1600 / 2000 [ 80%]  (Sampling)
Iteration: 1700 / 2000 [ 85%]  (Sampling)
Iteration: 1800 / 2000 [ 90%]  (Sampling)
Iteration: 1900 / 2000 [ 95%]  (Sampling)
Iteration: 2000 / 2000 [100%]  (Sampling)

 Elapsed Time: 0.036901 seconds (Warm-up)
               0.018909 seconds (Sampling)
               0.05581 seconds (Total)


(Nearly identical progress output for chains 2 through 5 omitted; each chain finished in well under a tenth of a second.)
There were 410 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Examine the pairs() plot to diagnose sampling problems
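
Following the advice in the warning, a reasonable next step would be to re-run the sampler with a tighter adapt_delta; a sketch (the name school8.fit1a is ours, not from the original):

school8.fit1a <- sampling(
  school8.stan,
  data = school8.dat,
  chains = 5,
  warmup = 1000,
  iter = 2000,
  control = list(adapt_delta = 0.99)  # default is 0.8
)

(For this particular model the more robust cure is a non-centered parameterization of theta, but raising adapt_delta is the quick first thing to try.)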
summary(school8.fit1)
$summary
parameter        mean   se_mean       sd        2.5%         25%         50%        75%     97.5%     n_eff     Rhat
mu           7.812131 0.6055113 5.608953  -1.9650045   4.0061144    8.114573  11.627998 18.650955  85.80631 1.041245
tau          6.526100 0.5475959 5.594885   0.4908227   2.4808113    5.104776   8.894914 20.571063 104.39071 1.031969
theta[1]    11.406455 0.7937759 8.844209  -2.3989105   5.4760790   10.741247  16.130516 32.410127 124.14298 1.025667
theta[2]     7.661773 0.7112901 6.728279  -4.6237519   2.9270998    7.986128  12.038098 21.293496  89.47764 1.038188
theta[3]     6.187038 0.5263840 7.914510 -11.0406690   0.7400600    6.565595  11.655198 20.626196 226.06989 1.018209
theta[4]     7.449181 0.5391214 6.876766  -5.3785224   2.4641135    7.674158  11.928284 21.379169 162.70289 1.022423
theta[5]     5.250541 0.5238661 6.588764  -8.5583917   0.8033183    5.657908  10.159945 17.273066 158.18573 1.028340
theta[6]     6.084175 0.5881689 6.928273  -8.5682197   0.9106422    6.451819  11.236821 18.478337 138.75410 1.024984
theta[7]    10.227007 0.7628530 7.193739  -1.3595897   5.5085064   10.308138  14.461569 26.174680  88.92577 1.033819
theta[8]     8.351604 0.6037402 7.949255  -5.9049507   2.5862723    8.372006  12.716449 24.744555 173.36152 1.018429
lp__       -16.765925 1.0801876 6.665330 -27.6712192 -21.4322308  -17.660582 -12.808046  1.238644  38.07544 1.114614

$c_summary
, , chains = chain:1

parameter        mean       sd        2.5%          25%         50%       75%     97.5%
mu           5.317119 5.722964  -2.0692392   0.06696485   4.9957764  9.791407 16.879393
tau          5.415942 5.154340   0.9128414   0.96777399   3.8712827  8.088730 18.484609
theta[1]     8.284340 9.138626  -2.2795964   1.10345272   6.6593258 13.839511 29.886852
theta[2]     4.766320 6.748719  -3.9538152  -1.06680084   3.5701161  9.798721 18.569349
theta[3]     4.010707 7.251676  -8.3303515  -0.61107846   1.3700207  8.796577 19.199158
theta[4]     5.134581 6.811278  -4.6738503   0.07632760   2.8740655  9.964245 20.511356
theta[5]     3.162939 5.801276  -8.4102941   0.28988835   0.8033183  6.837366 15.564108
theta[6]     3.827647 6.313627  -8.6773500  -0.19629819   1.7221348  8.188059 17.485785
theta[7]     7.233195 8.048787  -2.2706246  -0.32696602   6.7532038 12.872211 25.558868
theta[8]     5.883783 7.859923  -5.9658763   0.46633543   3.7967882 11.168415 22.526985
lp__       -15.065836 6.611098 -26.5509196 -20.60641638 -15.9032018 -8.032832 -6.324546

, , chains = chain:2

parameter        mean       sd        2.5%        25%        50%       75%     97.5%
mu           9.113725 4.997920  -1.8337718   6.030965  10.711412 11.745782 17.381211
tau          4.921228 5.352254   0.4908227   0.886425   3.398615  6.984397 18.023103
theta[1]    11.703441 7.351434  -1.4841371   8.146024  11.461121 14.513713 29.586328
theta[2]     9.406019 5.752201  -3.9383633   6.243934  10.819651 12.038098 20.468949
theta[3]     8.043952 7.035265  -9.6757426   4.132688  10.162328 11.907102 19.272080
theta[4]     8.623902 6.620876  -5.9050293   4.680339  10.436656 11.928284 20.924958
theta[5]     7.252884 6.004023  -5.6360762   3.218087   8.785181 11.944174 16.442430
theta[6]     7.877358 6.530397  -6.4022144   4.395750   9.932461 11.727059 17.999776
theta[7]    10.973099 5.804341  -1.3521462   7.914310  11.554961 13.512744 23.968096
theta[8]     9.334019 6.723365  -3.8439566   5.369012  11.080091 12.185820 22.561822
lp__       -13.034108 8.771491 -26.5091632 -19.938951 -14.945254 -5.352956  1.238644

, , chains = chain:3

parameter        mean       sd       2.5%        25%        50%        75%     97.5%
mu           7.959066 5.451501  -1.965004   4.526018   7.992465  11.267931 18.937056
tau          7.209491 5.871847   1.697291   3.192851   5.571287   9.309433 21.770242
theta[1]    11.962623 8.890615  -3.084323   6.334717  11.062798  16.595132 32.910887
theta[2]     7.595246 6.542198  -4.183357   3.049151   7.475613  11.880208 21.734591
theta[3]     6.014075 7.574704  -9.272498   1.995942   6.408747  11.004031 19.910615
theta[4]     7.408524 6.594604  -5.689904   3.422867   7.486955  11.358652 20.337824
theta[5]     5.228719 6.322350  -8.806101   1.032569   6.061246   9.319472 16.556155
theta[6]     6.081984 7.046691  -9.054196   1.726537   6.348071  10.842026 19.252657
theta[7]    10.505488 6.840579  -1.539866   5.876482  10.242733  14.586693 25.519162
theta[8]     8.277614 8.169887  -7.341958   3.274583   8.153980  12.921977 25.607118
lp__       -18.149799 5.000652 -27.916636 -21.964785 -18.237036 -14.075445 -9.278952

, , chains = chain:4

parameter        mean       sd        2.5%        25%        50%        75%     97.5%
mu           8.568027 5.534737  -1.3212825   5.293882   8.347975  11.490139 19.976295
tau          7.168199 5.435400   1.3719551   3.325571   5.878870   9.367476 20.576796
theta[1]    12.733641 9.028515  -1.2472495   6.694012  10.993628  18.013020 34.252377
theta[2]     8.412744 6.541800  -3.8508585   4.466115   8.465054  12.197814 22.195894
theta[3]     6.565178 8.283020 -12.3482461   2.193619   7.300503  11.048765 20.732651
theta[4]     8.027052 6.263444  -4.3446945   4.176093   8.152605  11.714965 20.210171
theta[5]     5.453597 6.643689  -9.3785866   1.829915   6.137964   9.988435 18.077228
theta[6]     6.317908 6.678812  -7.9954878   2.466654   6.759475  10.570555 18.604222
theta[7]    11.030815 6.814620  -0.7609649   6.606466  10.131324  14.984971 26.637261
theta[8]     8.995761 8.064238  -5.4575975   4.456238   8.297449  13.153525 26.506941
lp__       -18.215608 5.201645 -28.8531399 -21.583613 -18.490537 -14.761629 -6.891882

, , chains = chain:5

parameter        mean       sd       2.5%         25%        50%        75%     97.5%
mu           8.102717 5.544260  -2.064587   4.1700491   8.090350  11.386533  18.81987
tau          7.915642 5.547808   1.794061   4.0570490   6.515542  10.255615  22.01051
theta[1]    12.348232 8.971728  -2.690773   6.0130344  11.360511  17.552293  33.58971
theta[2]     8.128535 7.076505  -5.910027   3.7126105   7.965408  12.758135  22.15199
theta[3]     6.301277 8.775130 -13.015217   1.2974379   6.573409  12.018820  23.12638
theta[4]     8.051848 7.499751  -6.509460   3.1743881   7.860309  12.710849  23.55926
theta[5]     5.154567 7.415675 -10.947974   0.6910546   5.481983   9.993061  19.31765
theta[6]     6.315977 7.416364 -10.173149   1.7264875   6.554202  11.168807  19.42346
theta[7]    11.392437 7.459067  -1.176554   5.9927923  10.726751  16.397259  27.26951
theta[8]     9.266841 8.314063  -5.822930   3.6335596   8.739287  14.749213  25.85211
lp__       -19.364271 4.689265 -28.390295 -22.5641666 -19.476391 -16.229731 -10.04014

Alternate style using an external file.

options(mc.cores = parallel::detectCores()-1)
school8.fit1 <- stan(
  file="Bayesian/school8.stan",       # The model
  data = school8.dat, # The data
  chains = 5,
  warmup = 1000,
  iter = 2000,
  refresh=100        # Show progress
)
summary(school8.fit1)

Summaries using the base Stan functions

Printing gives summaries of the posterior for the specified parameters. Use the pars argument to select what to print.

print(school8.fit1,pars=c("theta","mu","tau","lp__"), probs=c(.1,.5,.9))
Inference for Stan model: school8.
5 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=5000.

           mean se_mean   sd    10%    50%   90% n_eff Rhat
theta[1]  11.41    0.79 8.84   1.10  10.74 22.86   124 1.03
theta[2]   7.66    0.71 6.73  -1.07   7.99 15.71    89 1.04
theta[3]   6.19    0.53 7.91  -3.19   6.57 15.20   226 1.02
theta[4]   7.45    0.54 6.88  -0.73   7.67 15.80   163 1.02
theta[5]   5.25    0.52 6.59  -2.87   5.66 12.79   158 1.03
theta[6]   6.08    0.59 6.93  -2.05   6.45 14.28   139 1.02
theta[7]  10.23    0.76 7.19   0.16  10.31 19.33    89 1.03
theta[8]   8.35    0.60 7.95  -0.71   8.37 17.82   173 1.02
mu         7.81    0.61 5.61   0.07   8.11 14.39    86 1.04
tau        6.53    0.55 5.59   0.97   5.10 13.65   104 1.03
lp__     -16.77    1.08 6.67 -24.32 -17.66 -7.30    38 1.11

Samples were drawn using NUTS(diag_e) at Wed Mar  6 15:53:05 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

The plot function gives 80% (inner) and 95% (outer) intervals by default.

plot(school8.fit1)
ci_level: 0.8 (80% intervals)
outer_level: 0.95 (95% intervals)
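
The inner interval can be changed with the ci_level argument (passed through to rstan's stan_plot); for example, ci_level = 0.5 would give 50% inner intervals:

plot(school8.fit1, ci_level = 0.5)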

The traceplot shows convergence (note that rstan and coda have slightly different traceplot functions).

rstan::traceplot(school8.fit1,pars=c("mu","tau","lp__"), inc_warmup=TRUE, nrow=3)

What we are looking for here is (a) white-noise-like behavior, and (b) all chains plotting on top of each other.

Typically variance and scale parameters will be positively skewed, and the log posterior (lp__) is often not as well behaved as the other quantities.

Pairs plots sometimes uncover problems with models.

pairs(school8.fit1,pars=c("mu","theta[1]","tau"))

Shinystan

ShinyStan opens a new Shiny workspace which allows interactive browsing of the results.

school8.fit1s <- launch_shinystan(school8.fit1)

Launching ShinyStan interface... for large models this  may take some time.

Listening on http://127.0.0.1:3422

References


  1. Rubin, D. B. (1981). Estimation in parallel randomized experiments. Journal of Educational Statistics, 6, 377-401.

  2. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. and Rubin, D. B. (2014). Bayesian Data Analysis: Third Edition. CRC Press. (ISBN: 978-1-4398-4095-5)
