\name{EbuildT}
\alias{EbuildT}
\title{Builds expected values of sufficient statistics in multivariate normal regression}
\description{

  Performs the E-step of an EM algorithm for multivariate regression.
  It is assumed that \code{X} is fully observed and that \code{Y} may
  have missing data. The function returns an augmented matrix containing
  the expected values of the means and the covariance matrix under the
  current set of parameters, \code{B}, \code{b0}, and \code{Syy.x}.

}
\usage{
EbuildT(X, Y, B, b0, Syy.x, w = 1)
}
\arguments{
  \item{X}{A data matrix with rows representing observations and
    columns representing variables. These are the independent variables
    in the regression and must be fully observed.
  }
  \item{Y}{A data matrix with rows representing observations and
    columns representing variables. These are the dependent variables
    in the regression and may contain missing values. \code{X} and
    \code{Y} should have the same number of rows.
  }
  \item{B}{A \eqn{J \times K} matrix containing the non-constant
    coefficients of a multivariate regression of \code{Y} on \code{X}.
    Here \eqn{J} is the number of columns of \code{Y} and \eqn{K} is
    the number of columns of \code{X}.
  }
  \item{b0}{A vector of the constant terms from the multivariate
    regression of \code{Y} on \code{X}. This should have length
    \eqn{J}.
  }
  \item{Syy.x}{A \eqn{J \times J} symmetric matrix giving the residual
    covariance of the multivariate regression of \code{Y} on \code{X}.
  }
  \item{w}{If supplied, this should be a vector of weights whose length
    equals the number of rows of \code{X}. The total sample size is
    taken to be the sum of the weights.
  }
}
\details{

  Following Little and Rubin (2002), the sufficient statistics for a
  multivariate normal distribution can be represented by a matrix:
  \deqn{ \bold{T} = \left [ \begin{array}{ccc}
      -1 & \mu_X & \mu_Y \\
      \mu_X & \Sigma_{XX} & \Sigma_{XY} \\
      \mu_Y & \Sigma_{YX} & \Sigma_{YY}
    \end{array} \right ] }

  Sweeping (see \code{\link{matSweep}}) the rows and columns
  corresponding to the \eqn{\bold{X}} variables produces the
  multivariate regression of \eqn{\bold{Y}} on \eqn{\bold{X}}:
  \deqn{ SWP[X]\bold{T} = \left [ \begin{array}{ccc}
      * & * & b_{0.X} \\
      * & -\Sigma_{XX}^{-1} & B^T_{XY.X} \\
      b_{0.X} & B_{XY.X} & \Sigma_{YY.X}
    \end{array} \right ] }
  Here \eqn{B_{XY.X}} is the matrix of regression coefficients,
  \eqn{b_{0.X}} is the vector of constant terms, and
  \eqn{\Sigma_{YY.X}} is the residual covariance matrix.

  If \code{Y} were fully observed, then \eqn{\bold{T}} could be
  calculated by first forming the matrix \code{XY1=cbind(1,X,Y)} and
  then calculating \eqn{SWP[1](XY1)^T W (XY1)}, where \eqn{W} is a
  diagonal matrix with the weights, \code{w}, on the diagonal, and
  SWP[1] is the sweep operator applied to the constant row/column
  (assumed to be the first). Note that to get the parameters of the
  normal distribution, the part corresponding to the covariance matrix
  must be scaled by the sum of the weights.

  If there are missing values, then calculating the expected value of
  \eqn{\bold{T}} requires two steps. First, the missing values of
  \code{Y} must be imputed. Second, an adjustment must be made for any
  term in the sum that involves the product of two imputed values (or
  the square of a single imputed value).

  Both the imputation and the adjustment require a parameter estimate,
  \eqn{\bold{T}^{(i)}}. In this function, the estimate is supplied
  through the second parameterization, \eqn{SWP[X]\bold{T}}. As
  \code{X} is fully observed, the upper left \eqn{2 \times 2} block
  (the rows and columns for the constant and \code{X}) can be
  calculated directly from the data. The remaining parts are passed in
  as arguments to the function.

  The principal output is \eqn{\bold{T}^{(i+1)}}, that is, the result
  of one E-step of an EM algorithm for the multivariate normal
  distribution. The imputed data matrix is also returned.

  Finally, as this is based on the EM algorithm, it will only produce
  unbiased estimates when the data are missing at random.

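  The two steps above can be summarized in a single formula. This is a
  sketch of the standard E-step for normal data as described by Little
  and Rubin (2002), not a transcription of the code in this function.
  The symbols \eqn{\hat{y}_{nj}} and \eqn{c_{njk}} are introduced here
  for illustration only, and applying the weights \eqn{w_n} term by
  term is an assumption based on the weighted cross-product described
  above. Let \eqn{\hat{y}_{nj}} be the observed value of \eqn{y_{nj}}
  when it is present, and its conditional expectation under the current
  parameter estimate otherwise. Each cross-product of two \code{Y}
  columns is then replaced by its conditional expectation:
  \deqn{ E\left[ \sum_n w_n y_{nj} y_{nk} \mid \mathrm{obs} \right] =
    \sum_n w_n \left( \hat{y}_{nj} \hat{y}_{nk} + c_{njk} \right) }
  Here \eqn{c_{njk}} is the conditional covariance of \eqn{y_{nj}} and
  \eqn{y_{nk}} given the observed values for observation \eqn{n} when
  both are missing, and zero otherwise. Terms that pair a \code{Y}
  column with the constant or with a column of \code{X} need only the
  imputation, not the adjustment, because those values are fully
  observed.
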
}
\value{
  A list with three components:
  \item{T}{A \eqn{(1+K+J) \times (1+K+J)} augmented covariance matrix
    estimate for \code{X} and \code{Y}.}
  \item{XY1}{The augmented data matrix with missing values of \code{Y}
    replaced by regression imputations (using the old parameters).}
  \item{M}{A logical matrix showing where the missing values in
    \code{Y} are.}
}
\references{

  Little, R. J. A. and Rubin, D. B. (2002). \emph{Statistical Analysis
  with Missing Data, Second Edition.} Wiley.

}
\author{Russell Almond}
\note{
  This is built to facilitate \code{\link{BQreg}}.
}
\seealso{
  \code{\link{BQreg}}, \code{\link{matSweep}}.
}
\examples{
data(mvMux)
mvX <- mvMux[,1:2] # Complete Data
mvY <- mvMux[,3:5] # Data with missing values.

## Start with numbers based on complete data only.
mv.cccov <- cov.wt(na.omit(mvMux),method="ML")
mv.ccm <- apply(mvMux,2,mean,na.rm=TRUE)
mv.ccT <- rbind(c(-1/nrow(mvMux),mv.ccm),
                cbind(mv.ccm,mv.cccov$cov))

## Use sweep to do the multivariate regression
mv.ccTswp12 <- matSweep(mv.ccT,2:3)
mv.BB <- mv.ccTswp12[4:6,2:3]
mv.B0 <- mv.ccTswp12[4:6,1]
mv.B <- cbind(mv.B0,mv.BB)
mv.Syy.x <- mv.ccTswp12[4:6,4:6]

Tstep <- EbuildT(mvX, mvY, mv.BB, mv.B0, mv.Syy.x, w=1)
}
\keyword{ multivariate }
\keyword{ regression }