\documentclass{article}
\usepackage{proceed2e}
\usepackage{amsmath,graphicx}
%\usepackage{rgadefs}
%\usepackage{rgafigs}
\usepackage{indentfirst}
\usepackage{url}
\usepackage{apacite}
\usepackage{alltt}
\usepackage{algorithmic}
\def\logit{\mathop{\rm logit}\nolimits}
\def\mean{\mathop{\rm mean}\nolimits}
\def\Var{\mathop{\rm Var}\nolimits}
%
\def\cat{\mathop{\rm cat}\nolimits}
\def\Dirichlet{\mathop{\rm Dirichlet}\nolimits}
\def\lognormal{\mathop{\rm lognormal}\nolimits}
%% BF greek
\def\bfalpha{\boldsymbol\alpha}
\def\bfbeta{\boldsymbol\beta}
\def\bfgamma{\boldsymbol\gamma}
\def\bfdelta{\boldsymbol\delta}
\def\bfepsilon{\boldsymbol\epsilon}
\def\bfzeta{\boldsymbol\zeta}
\def\bfeta{\boldsymbol\eta}
\def\bftheta{\boldsymbol\theta}
\def\bfiota{\boldsymbol\iota}
\def\bfkappa{\boldsymbol\kappa}
\def\bflambda{\boldsymbol\lambda}
\def\bfmu{\boldsymbol\mu}
\def\bfnu{\boldsymbol\nu}
\def\bfxi{\boldsymbol\xi}
\def\bfpi{\boldsymbol\pi}
\def\bfrho{\boldsymbol\rho}
\def\bfsigma{\boldsymbol\sigma}
\def\bftau{\boldsymbol\tau}
\def\bfupsilon{\boldsymbol\upsilon}
\def\bfphi{\boldsymbol\phi}
\def\bfchi{\boldsymbol\chi}
\def\bfpsi{\boldsymbol\psi}
\def\bfomega{\boldsymbol\omega}
\def\bfOmega{\boldsymbol\Omega}
%************************* Title & Authors ******************************
\title{\large{\bf Causal Identification of the Effects of POMDP
Actions with Non-Random Treatment Assignment}}
\author{{\bf Ima Pseudonym}\\pseudonym@dev.null}
%\author{{\bf Russell G. Almond}\\ Florida State University\\ralmond@fsu.edu }
%******************************* Abstract *****************************
\begin{document}
\maketitle
\begin{abstract}
Response-to-intervention (RTI) is an educational framework for placing
students into an appropriate level of support. The ability of the
students is measured at several time points and the lowest performing
students are placed into supplemental Tier~2 instruction. This
framework is naturally modeled with a partially observed Markov
decision process (POMDP), but using the POMDP model for planning
requires an estimate of the effect of the Tier~2 instruction. This
can be estimated from historical data, but unless the mechanism by
which the treatments were assigned in the historical data is causally
independent of the latent student ability, the estimate will be
biased. In particular, if the
treatment (action) assignment is made purely on the basis of observed
data, then the causal effect can be identified, but if teacher
judgment, or some other unrecorded variable, is used to determine the
treatment, the data will not meet the backdoor criterion for causal
identifiability. This paper explores the implications of the lack of
causal identifiability through a simple simulation study.
\end{abstract}
\vspace{1cm} \noindent{\bf Key words}: POMDP, Causal Identification,
Application (Education)
\section{Introduction}
The literature on partially observed Markov decision processes (POMDPs)
\cite{Boutilier1999} provides a variety of algorithms for
finding an optimal sequence of decisions; however, these algorithms
all rely on having an estimate of the effects of various actions that
can be taken at each time point. If the effects are unknown, they
must be estimated from an existing database of measurements. However,
in many databases, the mechanism by which actions are assigned is not
randomized; thus, the treatment (action) assignment may depend on observed
or even unobserved variables. In this case, the effects of the action
may not be causally identified.

As an example, consider the educational policy of
response-to-intervention (RTI) \cite{Fuchs2012}. In RTI, students are
split into two (or three) tiers based on their scores on a pretest:
Tier~1, regular classroom instruction, and Tier~2, regular instruction
supplemented by small group instruction. Clearly Tier~2 is more
expensive, and the goal is to maximize student performance subject to
a budget constraint on how many Tier~2 seats are available. The
weakly coupled POMDP algorithms of \citeA{BoutilierLiu2016} seem ideal
for this, but the algorithm requires an estimate of the treatment
effect for the Tier~2 instruction. Instructors are supposed to
monitor student performance for Tier~2 students, adapting the
instruction if the student is not properly responding to the
intervention. Here the ability of POMDP models to produce forecasts
under different policies seems like a tool that would be helpful.

RTI has been implemented in classrooms for quite some time, and there
exist suites of tools, such as easyCBM \cite{easyCBM}, which support
teachers implementing an RTI program. Consequently, large databases
of student measurements exist, but as these are gathered over a
variety of schools and districts different policies are used for
assigning students to Tier~2. While many schools use a strictly
mechanical rule based on the screening test scores, others allow
teacher discretion \cite{Mellard2009,Jenkins2013}. Worse, in the
real data there is often a mixture of policies, and establishing what
the policy is for each school and district is prohibitively expensive.

This paper looks at the issue of identification of the effects of
actions in a POMDP when there exists a potential unmeasured
relationship between the treatment assignment and the latent
variable. It attempts to find bounds on the causal effects using a
sensitivity analysis.
\section{A Simplified Response-to-Intervention Model}
RTI is a method for delivering educational interventions which has been
shown to be effective in closing achievement gaps \cite{Fuchs2012}.
Although the details vary, in a typical RTI situation, students are
given a screening test three times in an academic year. On the basis
of the screening test, students are assigned to one of three tiers of
instruction. Tier~1 is continued whole class instruction, Tier~2 is
small group instruction in addition to the whole class instruction,
and Tier~3 is individual instruction. In some implementations, Tier~3
is an assignment to a special education classroom. For simplicity,
this paper only considers Tiers~1 and~2.

There is considerable variability in how the assignment to the tiers
is done \cite{Mellard2009,Jenkins2013}. Often it is implemented as a
simple cut score on the screening test, but in some cases the teacher
could use expert judgment to override the cut score. In many cases,
there is a limit to how many students can be assigned to Tier~2 based
on constraints such as the amount of time the teacher, aide, or
specialist can spend on small group work. These limits could be set
at the classroom, school or district level (for example, a reading
specialist could be shared across several schools).

The name \textit{response-to-intervention} comes about because of what
happens within Tier~2. In Tier~2 students are given more frequent
(often weekly) progress monitoring tests. If students are not
making ``adequate progress,'' the intervention should be changed,
possibly by changing the intensity (meeting more frequently, for longer
periods, or with smaller group sizes) or by changing the curriculum or
approach. In extreme cases, the student might be moved to Tier~3 or, if
the student did unexpectedly well, returned to Tier~1. The definition
of adequate progress is vague, and it is clear that a planning system
could help educators forecast the effects of changing the educational
plan for a student.

For the purposes of this paper, the RTI process will be simplified by
considering only the Tier~1 and Tier~2 assignments without looking
at the progress monitoring. Also, for simplicity only one intensity
of Tier~2 treatment will be considered. Finally, additional screening
tests will be included so that there are more decision points. The
goal is to estimate the effect of the Tier~2 assignment so that it can
be used in planning.
\subsection{Common Data Layout}
Let $I$ be the number of students, and $T_i$ be the number of
measurements made on Student~$i$. Let $T_{\max} = \max_{1\le i\le I} T_i$.
Let $obs_{t,i}$ be the observation for Student~$i$ on the $t$th
measurement occasion. Let $Time_{t,i}$ be the elapsed time between
measurement occasion $t$ and $t+1$ for Student~$i$ and let
$Dose_{t,i}$ be the dosage of treatment received by Student~$i$
between times $t$ and $t+1$. In general the dose will be the treatment
intensity multiplied by the elapsed time.
Let $\theta_{t,i}$ be the proficiency of Student~$i$ at measurement
occasion~$t$. For simplicity, both $\theta_{t,i}$ and $obs_{t,i}$
will be taken as unidimensional even though the multidimensional case
is more interesting (e.g., if the overall proficiency is reading, the
student's ability to decode words and to comprehend sentences could be
separate measures addressed with different interventions).

Note that the indexes are backward from the usual description, so these
can be described as a one-dimensional array of vectors in Stan
\cite{stan}.
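
This layout can be sketched as a set of arrays. The sketch below is in
Python with made-up dimensions and values purely for illustration; the
paper's actual code is in R and Stan.

```python
import numpy as np

# Hypothetical data layout for I students measured at up to T_max occasions.
# Names (obs, Time, Dose) follow the paper's notation; values are invented.
I, T_max = 4, 3
rng = np.random.default_rng(42)

T = np.array([3, 3, 2, 3])          # T_i: number of measurements per student
obs = np.full((T_max, I), np.nan)   # obs[t, i]: score of Student i at occasion t
Time = np.zeros((T_max - 1, I))     # Time[t, i]: elapsed time between t and t+1
Dose = np.zeros((T_max - 1, I))     # Dose[t, i]: supplemental-instruction dose

for i in range(I):
    obs[:T[i], i] = rng.normal(size=T[i])
    Time[:T[i] - 1, i] = 1.0        # e.g., one term between occasions
    # dose = intensity * elapsed time; here students 0 and 2 are in Tier 2
    intensity = 0.5 if i in (0, 2) else 0.0
    Dose[:T[i] - 1, i] = intensity * Time[:T[i] - 1, i]

print(obs.shape, Time.shape, Dose.shape)
```

Students with fewer than $T_{\max}$ measurements are padded with missing
values, mirroring the ragged structure the Stan model must handle.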
\section{Common Evidence Model}
Another problem that arises in the educational context is that the
measurement instruments are different at each time point. In
contrast, in a model trying to find the position of a robot, the same
instruments (with the same measurement properties) are used to measure
the robot's position at each time point. In an educational setting
the instrument is a test, but the same test cannot be used repeatedly.
For example, if the same reading passage were used over and over,
increases in comprehension or reading fluency could be due to
familiarity with the specific passage. Therefore, the measurement
models consist of a collection of instruments for each time point,
each with potentially different relationships to the target latent
variable, $\theta$.

\citeA{UAIReading} illustrate another possible identification issue
which arises if the average growth rate, the difficulty (negative
intercept in a regression model), and the discrimination (slope in a
regression model) of the instrument must all be estimated from the
same data. If on average the students score higher at
Time~2 than at Time~1, it is impossible to tell if an observed
difference in scores is due to student improvement, or a difference
between the forms of the test administered at Time~1 and
Time~2, or some combination. \citeA{VerticalStationary} identify two
approaches to this problem: (1)~perform some kind of data
collection designed to put all of the instruments on a common scale,
and (2)~assume that the average growth is the same as the average
change in difficulty and examine deviations from stationarity.

The easyCBM product \cite{easyCBM} uses the first approach. A
separate calibration study was done where a number of different forms
of the progress monitoring instruments were given to students at about
the same time so that they could be placed on a common scale.
Furthermore, this initial calibration study establishes the parameters
that link the observations to the latent variable. This is the
approach taken in this simulation.

The relationship between the latent variable and the observation is
assumed to be a simple latent regression:
\begin{equation}
obs_{t,i} \sim N(obs_{int} + obs_{slope}\theta_{t,i}, res_{std})
\label{eq:em}
\end{equation}
The three parameters which control equation~\ref{eq:em} are further
defined in terms of other parameters. In psychometrics, the
\textit{reliability} of an instrument is defined as the correlation
between two different readings from an instrument taken under
identical conditions. Let $obs_{rel}$ be the
reliability of the instrument, $obs_{std,1}$ be the standard
deviation of the scores at the first measurement occasion, and
$obs_{mean,1}$ be the mean of those scores. To identify the latent
scale, $\theta_{1,i}$ is assumed to have a standard normal
distribution. Therefore,
\begin{align*}
obs_{int} &= obs_{mean,1} \\
obs_{slope} &= obs_{std,1} \sqrt{obs_{rel}} \\
res_{std} &= obs_{std,1}\sqrt{1-obs_{rel}}
\end{align*}
This should ensure that the scale at the initial time point is
properly identified.
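
A minimal sketch of these identification constraints follows (Python; the
reliability and first-occasion moments are assumed calibration values, not
figures from the paper):

```python
import numpy as np

# Sketch of the evidence model: obs ~ N(obs_int + obs_slope*theta, res_std).
# obs_rel, obs_mean1, and obs_std1 are assumed calibration values.
obs_rel = 0.8      # instrument reliability
obs_mean1 = 100.0  # mean score at the first measurement occasion
obs_std1 = 15.0    # standard deviation of scores at the first occasion

# Derived parameters, following the identification constraints in the text.
obs_int = obs_mean1
obs_slope = obs_std1 * np.sqrt(obs_rel)
res_std = obs_std1 * np.sqrt(1 - obs_rel)

rng = np.random.default_rng(7)
theta = rng.standard_normal(1000)                    # theta_{1,i} ~ N(0, 1)
obs = rng.normal(obs_int + obs_slope * theta, res_std)

# With theta ~ N(0,1), Var(obs) = obs_slope^2 + res_std^2 = obs_std1^2,
# so the simulated scores should have a standard deviation close to 15.
print(round(obs.std(), 1))
```

The check at the end confirms the decomposition: the slope carries the
reliable variance and the residual carries the rest, so their squares sum
back to the observed score variance.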
\subsection{Variable Slopes Model}
The model used in this study assumes that students' abilities grow
according to a Wiener process with drift. That is, between each time
point there is an independent increment to each student's ability, and
those increments accumulate over time. The process is assumed to have
drift as the students are actively receiving instruction, and
the average trend will depend on the instruction received.

The average growth (or drift) has two components: a natural growth
component and a treatment effect. Students in Tier~1 receive the
normal instruction and only exhibit natural growth. Students in
Tier~2 receive both normal instruction and some kind of supplemental
instruction; thus, their growth will have both natural and treatment
effects. The variable $Dose_{t,i}$ indicates how much supplemental
instruction each student receives between measurement points $t$ and
$t+1$. It is zero for students in Tier~1 and positive for students in
Tier~2.

Using this decomposition for the average learning gain, the change in
the latent proficiency can be decomposed as:
\begin{equation}
\theta_{t+1,i} = \theta_{t,i} + slope_i \, Time_{t,i} +
treat_{eff} \, Dose_{t,i} + \epsilon_{t,i} \label{eq:varSlope2}
\end{equation}
In this equation, the natural growth rate, $slope_{i}$,
varies by person, but the treatment effect does not. Also, it is
assumed that the treatment effect and natural growth rate are
additive. Finally, to make this a Wiener process, the variance of the
innovation term, $\epsilon_{t,i}$, depends on the elapsed time,
$Time_{t,i}$; in particular, $\epsilon_{t,i} \sim
N(0,\sqrt{var_{innov}\,Time_{t,i}})$.

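
One step of this transition can be sketched as follows (Python; the
treatment effect of 0.25 echoes the value simulated later in the paper,
while $var_{innov}$, the slope, and the crude dosing rule are illustrative):

```python
import numpy as np

# One step of the latent growth model:
# theta_{t+1} = theta_t + slope*Time + treat_eff*Dose + eps,
# with eps ~ N(0, sqrt(var_innov * Time)).
def step(theta_t, slope, dose, time, treat_eff=0.25, var_innov=0.1, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    eps = rng.normal(0.0, np.sqrt(var_innov * time), size=np.shape(theta_t))
    return theta_t + slope * time + treat_eff * dose + eps

rng = np.random.default_rng(0)
theta = rng.standard_normal(500)          # theta_{1,i} ~ N(0, 1)
slope = np.full(500, 0.5)                 # common natural growth rate
dose = np.where(theta < -1.0, 1.0, 0.0)   # crude Tier-2 dosing for illustration
theta_next = step(theta, slope, dose, time=1.0, rng=rng)

# Average gain is roughly slope + treat_eff * mean(dose).
print(theta_next.mean() - theta.mean())
```

Because the innovation variance scales with elapsed time, increments over
non-overlapping intervals accumulate exactly as a Wiener process requires.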
\citeA{Willett1988} notes that there is often a correlation between
the slope and the initial value in growth curves.\footnote{%Beth Phillips
XXX, personal communication, has indicated that she has found
this correlation to be both positive and negative across many
studies involving pre-school children.} This is because the first
measurement occasion is often not the true time zero. Consider a
growth curve for reading in Kindergarten students. Most students will
have received some kind of pre-reading instruction either through home
or pre-school. So even if the first measurement occasion is the first
day of class, they still will have received prior instruction.
Students who naturally grow at a faster rate are likely to be
at a higher level when first measured. Students entering Kindergarten
vary considerably in the amount of pre-school they may have attended
and the number of reading related activities that they do in their
home life, so the effective time zero may vary from student to
student.

To capture this idea, the slope distribution is characterized by
three parameters, $slope_{mu}$, $slope_{std}$ and $slope_{r2}$. The
last parameter is the correlation between $slope_{i}$ and
$\theta_{1,i}$. To capture this relationship, the slopes are made
dependent on the initial proficiencies as follows:
\begin{equation}
slope_{i} = slope_{mu} + slope_{std}\left(\sqrt{1-slope_{r2}^2}\,\phi_{i} +
slope_{r2}\,\theta_{1,i}\right),
\end{equation}
where both $\theta_{1,i}$ and $\phi_{i}$ have unit normal
distributions.
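
This construction can be verified numerically (Python sketch; the values of
$slope_{mu}$, $slope_{std}$, and $slope_{r2}$ are illustrative):

```python
import numpy as np

# slope_i = slope_mu + slope_std * (sqrt(1 - r^2)*phi_i + r*theta_{1,i})
# gives cor(slope_i, theta_{1,i}) = r when phi and theta_1 are independent
# standard normals, and leaves the marginal SD of slope_i equal to slope_std.
rng = np.random.default_rng(1)
n = 100_000
slope_mu, slope_std, r = 0.5, 0.2, 0.3   # illustrative values

theta1 = rng.standard_normal(n)
phi = rng.standard_normal(n)
slope = slope_mu + slope_std * (np.sqrt(1 - r**2) * phi + r * theta1)

# Empirical correlation should be close to r = 0.3.
print(round(np.corrcoef(slope, theta1)[0, 1], 2))
```

The $\sqrt{1-r^2}$ weight is what keeps the marginal variance of the slopes
fixed at $slope_{std}^2$ while inducing the desired correlation.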
\subsection{Tier assignment policies}
If the goal is to identify the treatment effect, $treat_{eff}$, then
ideally the treatment would be randomly assigned. This is often done
in trials for specific interventions. However, collecting data under
controlled conditions is fairly expensive, especially when considering
that often tests with pre-schoolers require human administration and
strict fidelity checks are needed to ensure uniformity of the
treatment. Even a study with a million-dollar budget can usually only
afford to measure several hundred students at three time points in a
year.\footnote{XXX, personal communication.}

The alternative is to use databases of student measurements that are
gathered through normal educational applications of an RTI system.
There are two problems. First, as no fidelity checks are done on the
treatment, there is likely considerable variability in the efficacy of the
implementation. Second, different districts, schools and classrooms
may use different policies for assignment into the tier groups.

Consider two different policies. The first is based on a simple
cut score; the second allows the teacher to override the
cut score with expert judgment.


\noindent\textbf{Cut Score Policy}. This is the easiest policy to implement:
if $obs_{t,i} < cut_{t}$ then Student~$i$ is assigned to Tier~2,
otherwise to Tier~1. \citeA{Mellard2009} and \citeA{Jenkins2013}
surveyed a number of schools and found many of them using
variations on this policy. Often the cut score is set to allow a
certain number of students into Tier~2. In the case of the simulation
study described here, the cut score for each time point is set to
catch students who are one standard deviation below the expected
observation score at that time point.


\noindent\textbf{Cut Score with Override Policy}. This policy is meant to
emulate the situation where the cut score rule is in place, but the
teacher may use expert judgment to override the scoring rule. In this
scoring rule, the teacher uses personal observation of the student to
assess the student's value of $\theta_{t,i}$. If the teacher chooses
to override, then the student is assigned to Tier~2 if $\theta_{t,i} <
cut_{t}$. It is assumed that the teacher overrides with a certain
probability $override_p$, and that the override decision is made
independently for each student (and independently of $\theta$).

The assumptions in the cut score with override policy are unrealistic,
but it is more or less designed to be a worst-case scenario for
causal identification. Also, the cut score policy is a special case
of the cut score with override policy with the override probability
set to zero, which is convenient for implementation.
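
The two policies can be sketched together (Python; the cut values, noise
level, and override probability below are illustrative, not the paper's
simulation settings):

```python
import numpy as np

# Sketch of the two tier-assignment policies. With override_p = 0 this
# reduces to the pure cut-score policy.
def assign_tier2(obs_t, theta_t, obs_cut, theta_cut, override_p, rng):
    """Return a boolean array: True means Student i is assigned to Tier 2."""
    override = rng.random(len(obs_t)) < override_p  # teacher overrides at random
    by_score = obs_t < obs_cut                      # mechanical cut-score rule
    by_judgment = theta_t < theta_cut               # teacher judges latent theta
    return np.where(override, by_judgment, by_score)

rng = np.random.default_rng(3)
theta = rng.standard_normal(10_000)
obs = theta + rng.normal(0.0, 0.5, size=theta.shape)  # noisy observation of theta

pure = assign_tier2(obs, theta, -1.0, -1.0, 0.0, rng)
mixed = assign_tier2(obs, theta, -1.0, -1.0, 0.5, rng)

# Under the override policy, the assignment carries information about theta
# beyond what obs contains, which is what opens the backdoor path.
print(pure.mean(), mixed.mean())
```

Keeping the cut-score policy as the $override_p = 0$ special case mirrors
the implementation convenience noted in the text.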
\section{A simple simulation study}
To assess whether or not the treatment effect could be recovered under
ideal conditions, a simulation study was performed. Data were
simulated for 400 students at 10 time points under each of the two
policies (simulation code is in the accompanying file
\texttt{varSlopesSim2a.R}). Then the model (\texttt{varSlopes2.stan})
was fit using Stan \cite{stan}. Five chains were run for 2000
iterations each (with half used for warm-up). The usual convergence
diagnostics indicated that for both simulations the chains had reached the
stationary distribution. (The accompanying file
\texttt{varSlopesRun2.R} shows the model fitting and checking code.)

In both simulations, the treatment effect was set to .25,
corresponding to growth of 1/4 of a standard deviation over an
academic year (a fairly typical effect size for an educational
intervention). In the simulation using the simple cut score policy,
the posterior mean of the treatment effect was .13 with a standard deviation
of .09, a median of .11, and a 95\% credible interval of .01 to
.34, which contains the true simulation value. For the cut score with
override policy, the override probability was set to .5. In this
simulation, the posterior mean was .04, the standard deviation, .03,
the median, .03, and the 95\% interval, .00 to .12, clearly an
underestimate.
\section{Causal Identification}
So why does the policy without override produce an apparently unbiased
estimate, while the policy that allows the teacher to override produces
a biased estimate? The answer can be found by checking whether the
effect of the treatment is causally identified by the data
\cite{Pearl2009}. Examine Figure~\ref{fig:HMM}(a), which corresponds
to the cut score without override policy. In this case, as $obs_1$ is
observed, there is no unblocked backdoor path to $\theta_2$ or $obs_2$ from
$Dose_1$, so its effects are causally identified.
\begin{figure*}
\begin{centering}
\noindent
\includegraphics[width=.95\linewidth]{CutScore.pdf}
\caption{Two time points of the model under the cut score policy with
(b) and without (a) override.}
\label{fig:HMM}
\end{centering}
\end{figure*}
For the cut score with override policy, Figure~\ref{fig:HMM}(b), there
is an extra dashed edge from $\theta_1$ to $Dose_1$. This introduces
a backdoor path that cannot be blocked ($\theta_1$ is latent), destroying
the causal identification. Thus, the
estimates from this model are biased.
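
This difference can be checked mechanically. The sketch below enumerates
backdoor paths in the two graphs of Figure~\ref{fig:HMM}. It is a
simplification, not a general d-separation implementation: it treats any
observed intermediate node as blocking, which is safe here because no
colliders occur on these paths.

```python
# Minimal backdoor-path check for the two graphs in Figure 1.
def backdoor_unblocked(edges, treatment, outcome, observed):
    """Return backdoor paths (treatment <- ... outcome) not blocked by `observed`."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, []).append((b, 'out'))
        nbrs.setdefault(b, []).append((a, 'in'))

    def paths(node, visited):
        if node == outcome:
            yield visited
            return
        for nxt, _ in nbrs.get(node, []):
            if nxt not in visited:
                yield from paths(nxt, visited + [nxt])

    unblocked = []
    # Only consider paths whose first step is an edge *into* the treatment.
    for nxt, direction in nbrs.get(treatment, []):
        if direction == 'in' and nxt != outcome:
            for p in paths(nxt, [treatment, nxt]):
                if not any(n in observed for n in p[1:-1]):
                    unblocked.append(p)
    return unblocked

# Cut-score policy: Dose1 is assigned from the observed score obs1.
edges = [('theta1', 'obs1'), ('theta1', 'theta2'), ('obs1', 'Dose1'),
         ('Dose1', 'theta2'), ('theta2', 'obs2')]
print(backdoor_unblocked(edges, 'Dose1', 'theta2', observed={'obs1', 'obs2'}))

# Override policy: add the dashed edge theta1 -> Dose1.
edges_override = edges + [('theta1', 'Dose1')]
print(backdoor_unblocked(edges_override, 'Dose1', 'theta2',
                         observed={'obs1', 'obs2'}))
```

The first call returns no unblocked paths (the path through $obs_1$ is
blocked by conditioning on it), while the second reports the path through
the latent $\theta_1$, matching the argument in the text.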

Note that this could also be cast as a model misspecification problem
rather than a causal identification problem. In particular, if the
mechanism corresponding to the dashed arrow were known, and added to
the MCMC model, the act of dosing becomes another observation. The
policy parameters corresponding to the dashed line (e.g., the override
probability) are not estimable from data, but a sensitivity analysis
could be performed by trying a range of parameters for the override
mechanism. This would at least produce bounds for the size of the
treatment effect. This is obviously the next step for this research.
\section{Discussion}
Intuitively, finding an optimal policy for a POMDP requires first
finding good estimates of the probable effects of the various
actions. However, unbiased estimates of those effects depend on the
mechanism by which actions are assigned in the training data. In
particular, a problem might exist if not all of the variables used to
assign the action are observed. To create unbiased
estimates of the action effects, one of two conditions must hold:
(1)~actions are assigned only on the basis of observed variables, or
(2)~the mechanism by which the action assignment is related to the
latent variables is explicitly modeled. In the latter case, a
suitable parameterization of the model can allow bounds for the causal
effect of the action to be calculated using a sensitivity analysis.

The good news is that in a typical POMDP policy, the action selection
is made on the basis of a function of the sequence of observations.
Even though this is more complex than the simple example provided
here, it is still sufficient to satisfy the backdoor
criterion, and so the causal identification holds. The problem comes
when the database used for estimation is based on historical records
where the mechanism for action assignment was not recorded. Here the
potential use of expert judgment could open a backdoor that would
cause a problem with the causal identification.

There are several other issues with these data that have not yet been
addressed. The first is structural missingness. Students who are
assigned to Tier~2 are typically measured more often than students who
are assigned to Tier~1. For students in Tier~2 the model is
effectively estimating $slope_i + treat_{eff}$, while for Tier~1 it is
only estimating $slope_i$, but with fewer time points. It is unclear
if this will cause problems (an increase in the posterior variance is
likely).

A second issue is that it was assumed that the Tier~2 effect was
uniform and did not vary from person to person or depend on the state
of $\theta_{t,i}$. \citeA{musicTICL} suggested that the effect of an
educational intervention was likely to be highest for students at or
near the proficiency level for which it was defined. Using a more
sophisticated model for the treatment effect is probably appropriate.

A third issue is that the Tier~2 treatment is applied to small groups
(and sampling is usually done at the classroom or higher level so that
all of the students in a group are included in the sample), which calls
into question the stable unit treatment value assumption. Again, a
more complicated model is needed to model this dependency. Still, the
procedure explored here provides a way to start to approach the
problem of applying AI planning techniques to classroom decision
making.
\bibliographystyle{apacite}
\bibliography{HMMRefs}
\section*{Acknowledgments}
%% I would like to thank Joe Nese (University of Oregon) and Beth
%% Phillips (Florida State University) for many helpful discussions about
%% how RTI is typically implemented. I would also like to thank Qian
%% Zhang (FSU) for helpfully listening while I tried to figure out why my
%% simulation was not working.
%% This work was supported in part by the Society for the Study of School
%% Psychology (SSSP) through the Early Career Research Awards Program to
%% Joseph F. T. Nese at the University of Oregon. The opinions expressed
%% here represent only the views of the author and not of any of the
%% sponsoring institutions.
\end{document}