---
title: 'Lab Part 3: Regression'
author: "Russell Almond"
date: '2022-06-06'
output:
  pdf_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Abstract

This lab returns to the ADHD data and explores a relationship between ADHD symptoms (`hyper` and `inatt`) and measures of anxiety (`GADD`, `genaxa`) and panic (`PAG` and `paa`) noticed by Prevatt et al. (2015). At the end of this lab you should be able to perform a simple linear regression, including diagnostics and prediction. The scientific question addressed by these data is "Can the impulsivity symptoms be used to predict general anxiety in ADHD patients?"

## 1. ADHD and College Performance

Prevatt, Dehili, Taylor and Marshall (2015) were interested in how the symptoms of attention deficit hyperactivity disorder (ADHD) affect the performance of college students. To study this, they looked at the relationship of the two key symptoms of ADHD, inattention and impulsivity, to measures of both academic and general anxiety and panic. In other words, they were looking for a correlation between the symptom variables and the measures of academic anxiety.

In this lab, we are going to examine the relationship between ADHD symptoms and the anxiety measures. The primary questions in this lab are:

* Is it useful to use linear regression to predict the anxiety (or panic) score from the ADHD symptom scores? What is the prediction line? How much of the variance in the anxiety and panic scores can be predicted by the ADHD symptoms?

To do this, we will perform a series of regressions. The $X$ variables will be chosen from the ADHD scores: `hyper`, `inatt` and the sum of `hyper` and `inatt` (which you made in Part 2). For the $Y$ axis, you can choose any of the four variables:

* `GADD` -- General Anxiety
* `genaxa` -- General Academic Anxiety
* `PAG` -- General Panic Attacks
* `paa` -- Panic Attacks Academic

(Note that although I'm using the SPSS variable names so that you know which variables I'm referencing, your audience will not know these names, so you will need to explain them. It is normally better to spell them out, unless the abbreviation is used a lot.) Note also that you only need to pick one of the four choices for the $Y$ variable; the choice is yours.

* Is there any reason to believe that the regression is different for students from different years? In particular, are the grad students different from the undergrads?

## 2. The Data

1. The data file with the data for this lab is called Alec-5400Subset.csv. This is the same data set used in the first lab. Refer to the first lab handout for instructions on reading the data set in. (If you saved your data as an SPSS .sav file after the first lab, you can use that instead of reading it in again.) Don't forget that you need to (a) add human readable labels to the variables, (b) add string values for the nominal and ordinal variables, and (c) make sure the value `-9` is coded as missing.

*Nota Bene! The default variable names are all programmers' codes, and not human readable. You will need to fix this for full style points.*

This data set has lots of missing data. Some of the data are missing at random; others are structurally missing. In particular, the control students did not have all of the same measures taken about their performance that the ADHD students did.

We can use some descriptive statistical analysis to see who is in our data set. In particular, use the _[A]{.underline}nalyze \> Compare [M]{.underline}eans \> [M]{.underline}eans ... [ALT+A M M]_ command to compare the sample sizes of the general anxiety score (GADD) and the hyperactive impulse symptoms (hyper) for the control and ADHD groups. What do you see? What does that say about who is included in the sample?
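
For readers who want to cross-check the SPSS counts, the sample-size comparison can be sketched in R. This is only an illustration on a tiny made-up data frame: the column name `group` and all of the scores are assumptions, not values from Alec-5400Subset.csv.

```r
# Hypothetical toy data standing in for the real file; the column name
# `group` and every score here are invented for illustration.
adhd <- data.frame(
  group = c("ADHD", "ADHD", "ADHD", "Control", "Control", "Control"),
  GADD  = c(12, -9, 20, 8, 15, 11),
  hyper = c(14, 22, -9, -9, -9, -9)
)

# Recode the missing-value code -9 as NA in the score columns.
score_cols <- c("GADD", "hyper")
adhd[score_cols] <- lapply(adhd[score_cols],
                           function(x) replace(x, x == -9, NA))

# Count the valid (non-missing) scores in each group for each measure.
n_valid <- sapply(score_cols, function(v) {
  tapply(!is.na(adhd[[v]]), adhd$group, sum)
})
print(n_valid)

# Complete cases: rows with both an anxiety and an impulsivity score.
n_complete <- sum(complete.cases(adhd[score_cols]))
print(n_complete)
```

In this toy example the control rows have no `hyper` scores at all, which is the kind of structural missingness to look for in the real output.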

Note that the SPSS scatterplot, correlation and regression commands will only use complete cases (students who have both anxiety and impulsivity scores) in the analysis. So make sure you explain who these students are.

## 3. Scatterplots

The first step in a regression analysis is usually to look at potential $X$ and $Y$ variables one at a time. This is what Part 2 of the lab was about. The next step is to look at pairs of variables. The basic tool for exploring the relationship between two continuous variables is the scatterplot. The command for building a scatterplot is _[G]{.underline}raphs \> [L]{.underline}egacy Dialogs \> [S]{.underline}catter/Dot\...[ALT+ G L S]_.

In some ways, it doesn't matter which variable is $X$ and which is $Y$ (reversing $X$ and $Y$ produces a plot that is mirrored along the diagonal). However, the convention is that $Y$ is predicted from $X$. Often there is an implicit causal model in this choice. As the implicit causal model here is that ADHD causes anxiety and panic, the ADHD symptoms are better placed on the $X$-axis, and the anxiety and panic checklists on the $Y$-axis.

### 3.1 Adding a Regression Line

Statistical modeling always contains assumptions, which should be checked if possible. The key assumption of linear regression is that the relationship between $X$ and $Y$ is linear, or at least not markedly non-linear. If there is a definite curve, then linear regression is not the best choice.

To check linearity, add a regression line to the scatterplot. Double click on the graph to open the graph editor, and then select the line tool (it looks like a line going through points). This will add a regression line. (Note that by default, SPSS adds the equation of the line, but this often covers over data points. If it can't be dragged out of the way, there is a control in the line dialog box that gets rid of it.)
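
The same check can be sketched outside SPSS. The following R fragment is a minimal illustration on simulated scores (the data are made up; only the names `hyper` and `GADD` are borrowed from the lab), with the symptom score on the $X$-axis and the anxiety score on the $Y$-axis.

```r
# Simulated scores; these are NOT the ADHD data, only stand-ins that
# borrow the lab's variable names.
set.seed(1)
hyper <- round(runif(60, min = 5, max = 35))
GADD  <- 4 + 0.5 * hyper + rnorm(60, sd = 4)

# Predictor (ADHD symptom) on the x-axis, outcome (anxiety) on the y-axis.
plot(hyper, GADD,
     xlab = "Hyperactivity/impulsivity score",
     ylab = "General anxiety score")

# Fit the least-squares line and draw it through the point cloud.
fit <- lm(GADD ~ hyper)
abline(fit, lwd = 2)
```

If the points scatter evenly on both sides of the line along its whole length, a straight line is a reasonable summary.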

If linear regression is an appropriate model for the data, then the point cloud should look like an ellipse around the regression line. Note that the regression line could be nearly horizontal, with the ellipse looking more like a circle. That is fine, too. It just means that the relationship between $X$ and $Y$ is weak, and the correlation and slope of the line will be close to zero.

The problem is when the data have a distinct curve. Adding a lowess[^1] curve to the plot helps spot curves. To add the lowess curve, with the plot in the graph editor, click on the line icon a second time. This will add a second curve; the default is the lowess curve. As it is a local regression, it generally follows the wiggles of the data. It can sometimes be difficult to determine whether a given wiggle in the lowess curve is a real change in the data (a curve or a leveling off).

What can be done if there is a curve?

* Fit a curve (e.g., a polynomial) instead of a linear regression. (This is covered in EDF 5401.)
* Replace $X$ or $Y$ (or both) with a transformed version (often `log(X)` or `sqrt(X)`). (Again, this is covered in EDF 5401.)
* If the curve isn't too bad, fit the linear regression anyway, but note the curve in the limitations section.

In this last case, the linear regression will pick up just the linear part of the relationship between $X$ and $Y$. This model tells part, but not all, of the story. Remember, all models are imperfect descriptions of the real world; the important part is whether they capture an interesting and important relationship.

### 3.2 Checking for Outliers

As there are two different variables in a regression, there are two ways a data value can be an outlier. A value which is an outlier in $X$ (either very large or very small) is a high leverage point. If the regression line is a lever arm, with the fulcrum at the point $(\overline X, \overline Y)$, then moving the outlier in $X$ may shift the regression line by a large amount.
So this is called an _influential point_ or a _leverage point_. This can be a problem if a couple of individuals who are not typical of the population are driving the calculation of the slope.

The second kind of outlier is one in the $Y$ direction. Here the question is not just whether it is high or low, but how far away from the regression line it is. Points that are outliers in this sense are ones that just don't fit the regression model very well. Following up on these points can help identify data entry errors. They can also identify individuals who are interesting for other reasons.

A simple example might help here. Generally, by the time students reach upper elementary or middle school age, there is a fairly high correlation between their skills at decoding (phonics) and comprehension. Dyslexics are an exception to that rule. They are interesting educationally, because they respond to different types of reading instruction.

To identify outliers in SPSS, click the cross-hairs icon in the SPSS Plot Editor to turn your cursor into an identification tool. You can click on any data point to make its case number[^2] appear and disappear. Use this procedure to look for outliers.

Once the outlier is identified, the problem becomes what to do about it. There are several solutions:

* If the outlier can be traced to a data entry error, it can either be fixed (if there is still access to the raw data) or dropped.
* The outlier might clearly belong to a different population. For example, the outlier might be a person with limited English proficiency, so they might not have understood the question properly. They can be excluded, but only by _redefining the population_; for example, while before the population of interest was college students, it now might be college students who are proficient in English.
* Sometimes there is no reason to eliminate the outlier, but it could still be influential.
Then the analyst can perform a _sensitivity analysis_ by removing the outlier, running the analysis again and then comparing a key summary statistic (e.g., the correlation or slope of the regression line). If they are similar, then the outlier can be safely ignored. If they are different, then the outlier becomes a potential limitation of the study.

### 3.3 Multigroup Scatterplots

Another way the regression model could fail is that there could be more than one group in the population, and the relationship between the $X$ and $Y$ variables could be different in the different groups.

An easy way to check for these kinds of problems is to use a different color (or a different plotting marker) for each group. If the groups are visibly separate, then the group structure is an important feature that the model leaves out.

To use separate markers for each group, start by adding the Year variable to the "Markers" box. This will produce a plot where each year has a different color. This works fine on a color screen, but presents a problem on a black and white printer (it could also present a problem for a person with limited color perception). Assigning a different plotting symbol to each group (in addition to the color) makes sure that the group differences can be seen. To do this in SPSS, select the plotting symbol (colored circle) in the legend in the graph editor. Double click, and a properties window should open up. Select a new plotting symbol (and, if you like, a new color) from that window.

Once the groups have different markers, check the scatterplot for patterns involving the groups. Do all of the groups follow the same general pattern, or are the groups visibly separate? In the latter case, something might need to be done.

SPSS will also add different regression lines for each group. The button to do this is in the graph editor, next to the button that produced the single regression line.
It is usual for all of the lines to be slightly different, but if one is very different, this is a cause for concern.

## 4. Correlations

To calculate the correlations, use the command _[A]{.underline}nalyze \> [C]{.underline}orrelate \> [B]{.underline}ivariate...[ALT+A C B]._ Drag the variables of interest into the box. The Pearson correlation (the default) is the best choice.

SPSS will then produce a correlation matrix, like the one below.

```{r corTable, echo=FALSE}
# echo=FALSE hides this code but keeps the table that the text refers to.
# Build a 3 x 3 schematic matrix whose cells read "cor( Vi, Vj)".
vars <- paste0("V", 1:3)
cormat <- matrix(paste0("cor( ", rep(vars, 3), ", ", rep(vars, each = 3), ")"),
                 3, 3, dimnames = list(vars, vars))
# Bold the row labels; the backslash must be doubled inside an R string.
if (knitr::is_latex_output()) {
  rownames(cormat) <- paste("{\\bf ", vars, "}")
} else {
  rownames(cormat) <- paste0("**", vars, "**")
}
knitr::kable(cormat, align = "c", padding = 5, row.names = TRUE)
```

Each row and column corresponds to one of the variables ($V1$, $V2$ and $V3$ in the example), and each cell corresponds to the correlation between the variables in the corresponding row and column. In SPSS output (see below) there are three numbers in each cell: the correlation, a $p$-value, and the number of data points used in the calculation.

Normally, APA style would have us report the *p*-values and the sample sizes along with the correlations. Don't bother with the *p*-values for Part 3, as we haven't covered them in class yet (Unit 18), but do report the sample sizes. The sample size is the number of subjects who had values for both of the variables. Are the sample sizes different? If so, why? Is that difference likely to affect the interpretation?

![Sample SPSS Correlation Table.](CorrelationTable.png)

The correlation matrix is always symmetric along the major diagonal. First, $\text{Cor}(Vi, Vi) = 1$ for any $i$, so the diagonal is 1. (The $n$ for the diagonals does show how many valid observations there are for each variable, so that at least is non-trivial.) Second, note that $\text{Cor}(Vi, Vj) = \text{Cor}(Vj, Vi)$.
Therefore, the upper triangle of the matrix is always a mirror image of the lower triangle. Because of this, analysts often only publish the lower triangle of the matrix. (The latest version of SPSS gives an option to do this.)

It is usually not necessary to reproduce the entire correlation table in your document. If there is a single correlation, just put it in the text. The usual APA formatting is $r(n=\underline{n}) = \underline{r}$ (or $r(n=\underline{n}) = \underline{r}, p = \underline{p}$ or $p <.001$ if including the $p$-value), where the underlined values must be filled in based on the SPSS output. These are equations, so they should be put in italics. Also, APA rules say to leave out the leading 0 in correlation coefficients. Correlations are usually reported to two significant digits, although more digits may be needed if the correlation is less than .1 or more than .9.

For this lab, you want to look at the correlations between the measures of ADHD symptoms (`inatt`, `hyper` and their sum) and the measures of anxiety (`GADD`, `genaxa`) and panic (`PAG`, `paa`). Now there are many correlations, so it might be useful to put them in a table. But be careful: the SPSS tables usually have more information than is needed, as well as more digits than are typically needed.

## 5. Simple Linear Regression

Regression analysis produces an equation for predicting a _dependent_ (or $Y$) variable from one or more _independent_ (or $X$) variables. (This is a different meaning from statistical independence; the idea is that the $X$ variables can vary independently, and the $Y$ variable value will depend on $X$; that is, $Y$ is modeled as a function of $X$.) In linear regression, that function will be a line, with a slope and an intercept (which SPSS calls a constant).
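
The idea of a line with a slope and an intercept can be checked with a few lines of R. The following is a minimal sketch on simulated data (the numbers and the names `x` and `y` are illustrative assumptions, not the ADHD data). It fits the line with `lm()` and confirms two standard facts about simple regression: the slope equals the correlation times the ratio of the standard deviations, and the line passes through the point of means.

```r
# Simulated data; x and y are placeholders, not variables from the lab file.
set.seed(2)
x <- rnorm(50, mean = 20, sd = 6)
y <- 3 + 0.4 * x + rnorm(50, sd = 3)

# Least-squares fit: intercept (SPSS "Constant") and slope.
fit <- lm(y ~ x)
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["x"]]

# The same slope from summary statistics: r times sd(y) / sd(x).
r <- cor(x, y)
b1_by_hand <- r * sd(y) / sd(x)

# The intercept is whatever makes the line pass through (mean(x), mean(y)).
b0_by_hand <- mean(y) - b1_by_hand * mean(x)

# A prediction is just the line evaluated at a chosen x (here x = 12).
pred_12 <- predict(fit, newdata = data.frame(x = 12))

print(c(intercept = b0, slope = b1))
```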

Linear regression in SPSS is done through the menu item _[A]{.underline}nalyze \> [R]{.underline}egression \> [L]{.underline}inear\...[Alt+A R L]._ In this dialog you select the dependent variable and one or more predictor (independent) variables. You can also add case labels, if the data set has them, and these will be used in the diagnostic plots.

The _Statistics\..._ button provides a pop-up dialog in which you can select various statistics about the regression. In particular, you will want confidence intervals for the coefficients and model fit statistics.

The _Plots\..._ button (called "Diagnostics" in some older versions of SPSS) provides a pop-up dialog in which you can select plots. Checking "Histogram" will get you a histogram of the residuals[^3]. You can add diagnostic plots using this dialog box. Select a variable for each of the *x*- and *y*-axes and then click Next to get the opportunity to select another plot. The plot that I like best is the residuals versus predicted values. Select `*ZPRED` (the standardized predicted values) and move this to the *x*-axis, and select `*ZRESID` (the standardized residual values) and move this to the *y*-axis. These steps are optional; there isn't time to cover what to do if these tests reveal problems, but EDF 5401 covers this topic.

The _Save\..._ button provides a pop-up dialog that allows you to save predicted values and residuals. The Unstandardized predicted values are the values you would get if you computed the predicted value using the estimated slope and intercept from the full sample. (Russell also likes the Adjusted predicted values, which are from the regression line that leaves out the point being predicted -- you can try these out, to see which are helpful to you.) You can also request prediction intervals for individual predictions. Finally, saving some kind of residual will allow you to make additional plots. The standardized residuals are the most useful.
Again, these steps are optional, but could be useful for more advanced applications.

The _Options\..._ dialog has options relevant to multiple linear regression and missing values. We don't need to worry about it.

Look at the output and check the correlation. Is the relationship strong or weak? Is it plausible to believe that there is a linear relationship between the two scores? Are you concerned about any points?

### 5.1 Model Summary Table

The SPSS regression command produces lots of output, but not all of it is interesting. Don't just dump the tables from SPSS into the report; pick and choose what you need. Often the table needs to be cleaned up to conform to APA style, and don't be afraid to cut rows and columns that are not part of the story the paper is telling.

The first table to examine is the "Model Summary Table." There are a number of useful statistics here. First is the correlation, "R". Note that capital $R$ refers to the multiple correlation coefficient, as opposed to $r$, which is used for the bivariate correlation between two variables. In a simple regression (with a single predictor), $R=r$, so there is no distinction. The "R Square" ($R^2$) is just the square of the correlation, but it has an important interpretation. It describes the proportion of the variance in the $Y$ variable that can be explained or predicted if $X$ is known. This statistic is an important measure for interpreting the size of the effect.

![SPSS Sample Model Table.](ModelSummary.png)

The other two statistics aren't really useful in the context of a single regression. The "Adjusted R Square" has an additional penalty for the number of variables. It is useful for comparing two regressions with different numbers of predictors, but it isn't needed now. The "Standard Error of the Estimate" is the standard deviation of the residuals (the differences between the predictions and the actual values).
This value is statistically important, because it is used to calculate many of the standard errors. However, SPSS will do those calculations, so there is no need to write this down.

### 5.2 Coefficients Table

The Coefficients Table is probably the most important part of the output. In particular, the column marked "B" gives the coefficients in the regression equation. The intercept, $b_0$, is marked "(Constant)", and the slope, $b_1$, will be in the row corresponding to the name of the $X$ variable. The equation will be $Y = b_1 X + b_0$. Two style points about writing equations: (1) they should be set in italics (technically math italics), and (2) if you use variables like $X$ and $Y$, make sure that the reader knows what $X$ and $Y$ are.

![SPSS Sample Coefficients Table.](CoefficientsTable.png)

The next column is the standard error. This provides information about how much the estimated parameter might change if the data were different. The general rule (from the normal distribution) is that 95% of the time, the estimate will be within plus or minus two standard errors of the true value. In particular, this can be used to see if the model with zero slope (i.e., the variables are unrelated) is plausible. Simply divide the slope by its standard error, and compare the value to 2.

Actually, SPSS has already done the division for us; that is the value in the "t" column. Looking ahead to Part 4, this can be used to test if the slope is non-zero. The "Sig." column gives the chances of seeing a $t$-value as large as the one observed if the slope really was 0; that is, the $p$-value. Thus, the way to summarize this is "The slope _was_ (or _was not_) significantly different from zero, $b_1 =$ [slope]{.underline}, $t($[df]{.underline}$) =$ [t]{.underline}, $p =$ [Sig]{.underline}" (or $p<.001$ if the Sig value is .000). The underlined quantities come from the table. For example, for the sample table, $b_1 = .21, t(49) = 10.6, p <.001$. The degrees of freedom (d.f.)
value comes from the ANOVA table (the Residual row).

Note that a model with a zero constant is usually not particularly meaningful, so the corresponding test for the constant is seldom reported.

Finally, the "Beta" column is for comparing slopes in a multiple regression. Remember that the slope includes information about the standard deviation of $X$, the standard deviation of $Y$, as well as their relationship. Thus, it is difficult to compare slopes when the $X$ values have different scales (units). The "Beta" is a standardized regression coefficient. While interesting in a multiple regression, in a single regression it is always just the correlation coefficient.

### 5.3 ANOVA Table

The ANOVA table is another way to test the regression. (This is not needed for Part 3 of the Lab, but Part 4 will pick it up, so this section can be skipped for now.) While the slope test (in Section\ 5.2) tests one predictor at a time, the ANOVA tests all of the predictors together.

![SPSS ANOVA Table.](ANOVATable.png)

The idea is closely related to $R^2$. Let $Y_i$ be the $i$th value of the dependent variable, and let $\hat Y_i$ be the value predicted for the $i$th case by the regression equation. Finally, let $\bar Y$ be the mean of the $Y$'s. The sum of squares regression, $\sum_i (\hat Y_i - \bar Y)^2$, is the amount of variability "explained" or predicted by the model. The sum of squares residual, $\sum_i (Y_i - \hat Y_i)^2$, is the amount of variability that is unexplained. (The total is just the sum of the other two.)

The degrees of freedom indicate how many data points are used for each. Estimating the grand mean (or equivalently the constant) uses up 1 data point, so the Total d.f. is always $N-1$. The Regression d.f. is the number of slopes that were estimated (so always 1 for a simple regression). The Residual d.f. is the difference between the two. (Don't worry too much about this, as SPSS does all the needed calculations.)

The Mean Squares are the Sums of Squares divided by their degrees of freedom.
The Residual Mean Square is the variance of the residuals (the square of the standard error of the estimate). If the predictors didn't have any predictive power, the regression mean square would not be zero; instead it would be more or less the same as the residual mean square. To test this, take their ratio: this is the $F$ value in the table. If nothing is happening (i.e., the null hypothesis holds), the $F$-value should be around 1.0. The "Sig." column gives the probability of seeing an $F$-value that large if all of the slopes really were zero. If that is small, then the model where the $X$ variables have no predictive power is unlikely.

The $F$-test requires two degrees of freedom, one for the numerator (Regression) and one for the denominator (Residual). The APA style for writing an $F$-test result is $F($[df1]{.underline}, [df2]{.underline}$) =$ [F]{.underline}, $R^2 =$ [R Square]{.underline}, $p =$ [p]{.underline} (or $p <.001$). (The $R^2$, from the model summary table, gives an indication of the size of the effect.) For the sample table, the result would be $F(1,49) = 112.6, R^2 = .70, p < .001.$ The full ANOVA table is seldom placed in papers; usually the results are reported in the text.

Note that for a single regression, the slope test and the ANOVA test are identical. In fact, $F=t^2$, and the $p$-value ("Sig.") will always be identical. So only one needs to be reported.

## 6. Diagnostics (Optional)

[This is leftover from a previous version of the lab. This has been dropped from the current SPSS syllabus, mainly for time reasons. Model checking is still important, but the only part of the model that needs checking is linearity, and that is done with the scatterplot. Feel free to read this section or skip.]

One of the first assumptions of least squares regression is that the residuals are approximately normally distributed. This can be tested with a histogram of the residuals.
To do this in SPSS you need to save the residuals (either the raw residuals or the standardized residuals for this test) when doing the regression.

A problem with the plots generated by SPSS is that if you use case labels, all of the points are labeled. This makes the plot busy and difficult to read. Turn off the labels as described above, then select the cross-hair data labeling mode button. You can now pick out points that look unusual to label.

Also, don't forget to add labels for your variables. This will give you human readable labels on the *x*- and *y*-axes. If you did forget, you can always double click on the axis label in the graph editor to produce a better label.

The second assumption is that all of the residuals have approximately the same variance. We can test this with a fitted value versus residual plot. For this we want the standardized residuals and the predicted values. We can either do this from the saved values or we can request it through the regression dialog.

The residual versus predicted plot contains a lot of information. First, if any of the residuals is particularly large (or small), we suspect an outlier. Second, if we can detect a curved pattern, then that is an indication that the linear regression is not explaining all that is happening. There may be some higher order polynomial effect. Third, we can identify heteroscedasticity (that is, check the homogeneity of the residual variances). This usually results in a triangle-shaped pattern for the residuals: residuals on the left are larger in magnitude than the ones on the right (or the other way around). If you go on to take EDF 5401, you will learn more about heteroscedasticity and what to do about it.

If you detect outliers, you may wish to re-run the analysis without the outliers. If the conclusions are substantially different, you should report both conclusions.

## 7. Predicting Future Observations (Optional)

[This is leftover from a previous version of the lab.
This has been dropped from the current SPSS syllabus, mainly for time reasons. Feel free to read it or skip.]

The last part of the lab ~~requires~~ [no longer required] assessing how well the model predicts the general anxiety scores from the hyperactive impulse scores. To do this, you will need to save predicted values in the regression dialog. There are several different varieties of predicted values, but the best one for our purposes is the "Adjusted" predictions. These refit the model without each value in turn and then use that model to predict the data point that was left out. For example, the prediction for *617* is made using all of the data points except *617* in fitting the regression line.

One of the fundamental rules of statistics is that we should always be honest about how much we know and how much we don't know. Thus, along with our prediction, we should say something about the accuracy of our prediction. Statisticians usually do this by producing an interval estimate. They pick a probability (usually 95%) and produce an interval that should contain the actual value with that probability.[^4]

SPSS will calculate both the predicted value and a prediction interval; however, it offers a choice of two different prediction intervals. This is because there are two sources of prediction errors. Suppose we were interested in the mean anxiety score for students who scored exactly 12 on the hyperactivity impulse scale. Our prediction would be the point on the regression line corresponding to *X* = 12. However, there is some sampling error in the slope and the intercept, so we have a confidence interval around where the point should be. This is the "mean" type prediction interval produced by SPSS.

If we are interested in a particular student with that score, then we also need to consider the fact that most data points don't lie exactly on the line. The residual variance gives us the amount of additional error we need to add to our intervals.
The "individual" style intervals in SPSS add this extra variance. These are the ones that we want.

If you set up the "Save..." dialog in SPSS on our regression as shown in Figure 1, you should get the following four new variables in your data view:

- PRE\_1 -- this is the exact value (on the transformed scale if you transformed the outcome) predicted by the line.
- RES\_1 -- this is the residual (difference from the predicted value) for each student.
- LCI\_1 -- lower prediction (confidence) interval for each student. This is the lower bound on our uncertainty about the prediction.
- UCI\_1 -- upper prediction (confidence) interval for each student. This is the upper bound on our uncertainty about the prediction.

Each time you run the regression using the "Save..." option, you will get a new set of residuals and predicted values. SPSS will increment the number, so "\_1" is from the first regression model, "\_2" is from the second and so on. You probably will want to name the saved variables IMMEDIATELY after you run each model, or soon you will forget what all of the saved items are!

Once you have done that, getting the prediction for a particular student is simple. Just scroll down in the data until you get to that student\'s row and look across for the PRE\_k (or ADJ\_k) column (point prediction) and the LCI\_k and UCI\_k columns (lower and upper bounds for the confidence interval).

## 8. The Assignment

The assignment is to analyze the data in Alec-5400Subset.csv to find out if there is a linear relationship between the ADHD symptom scores and anxiety or panic. Choose one of the two anxiety variables (`GADD` or `genaxa`) or one of the two panic variables (`PAG` or `paa`) as your outcome. (All four are potentially interesting, but more work than is needed for the class.)

You will need to do 3 regressions:

1. Hyperactivity (`hyper`) vs your chosen $Y$
2. Inattentiveness (`inatt`) vs your chosen $Y$
3. ADHD Symptoms (`hyper + inatt`) vs your chosen $Y$.
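
The three regressions listed above can be sketched in R as a loop over the predictors. This is only an illustration on simulated scores: the data are made up, `GADD` is used as an example outcome, and the column name `symptoms` for the sum is a hypothetical choice (use whatever name you gave the sum in Part 2).

```r
# Simulated stand-in data; none of these values come from the real file.
set.seed(3)
n <- 40
adhd <- data.frame(hyper = round(runif(n, 0, 27)),
                   inatt = round(runif(n, 0, 27)))
adhd$symptoms <- adhd$hyper + adhd$inatt   # hypothetical name for the sum
adhd$GADD <- 5 + 0.3 * adhd$symptoms + rnorm(n, sd = 4)

# One simple regression per predictor, collecting n, r and the line.
predictors <- c("hyper", "inatt", "symptoms")
results <- do.call(rbind, lapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "GADD"), data = adhd)
  data.frame(predictor = p,
             n         = nobs(fit),
             r         = cor(adhd[[p]], adhd$GADD, use = "complete.obs"),
             intercept = coef(fit)[[1]],
             slope     = coef(fit)[[2]])
}))
print(results, digits = 2)
```

Each row of `results` holds the sample size, correlation, and prediction line for one of the three regressions.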
For each regression you need to:

a) Verify that the relationship is mostly linear (i.e., no curve)
b) Check for Year differences using colors & plotting symbols
c) Calculate the correlation coefficient
d) Write the equation of the prediction line.

For the write-up, extend the existing write-ups from Parts 1 and 2. You may need to tweak some sections, and you will add to the results and conclusions.

- *Introduction* -- Tweak this to emphasize the importance of your chosen $Y$ variable.
- *Background (Minimal for this lab)* -- This doesn't need much change.
- *Problem statement/Hypothesis* -- Here you want to state what the goal of the regression is, so adjust this to include your chosen $Y$ variable.
- *Data description/Measures* -- Make sure all of the measures you are using are included. *Nota Bene:* There are substantially fewer data points for `hyper` and `inatt` than for the possible $Y$ measures. As the effective sample for the regression is only people with both measures, who is the population for this part of the study?
- *Results* -- Add the results of the regression. Make sure that figures are numbered and referenced in the text.
- *Conclusions* -- Recap the most important results and relate them back to the real world. What was the answer to your research question? Are there any limitations of the way the data were collected or the analysis that would affect the ability to generalize beyond your sample? In particular, to which population does it apply (all students or ADHD students only)?

As before, you may place figures or tables either interspersed in the text or at the end of the document. **Remember each figure and table should have a number, a caption (a clear description of what is in there) and should be referenced somewhere in the text.** If you don't have anything to say about it, why include it? *Failure to follow these guidelines will result in lost style points*.

7. FAQs and Hints
=================

1) *Use variable labels.* If you add text labels to your variables as you create them (you can do this in the transformation dialog), the plots and tables will come out with more human readable labels.

2) *Do I need both histograms and boxplots?* The best way to answer this question is to think of your lab report as telling a story. Do the histograms and boxplots tell different stories? If yes, include them both (and explain in the text the interesting observations in both). If no, pick the one that tells the story the best and include only that one.

3) *Is XXX an outlier I should worry about?* Not every point that shows up on the extreme ends of the scatterplot is an outlier. The boxplot has a built-in test for outliers, so that is a good tool for double checking whether something you noted in the scatterplot is an outlier or not.

    If you suspect outliers, another test you can make is to rerun the regression excluding the potential outliers. To do this, use the command [D]{.underline}ata \> [S]{.underline}elect Cases... [ALT+D S], select the "If..." option and write an expression that will exclude the outliers, e.g., "hyper \< 35". Then run the regression or correlation command again. The slope and correlation should change a little bit, but not a lot. If they change only a little, you could give the outlier a passing mention (e.g., "XXX was thought to be an outlier, but rerunning the regression with XXX excluded produced only a small change in the correlation and slope."), but not more. If the results change markedly, the outlier is worth discussing: report both sets of numbers (unless you have a substantial reason for thinking the outlier doesn't belong in the population). It is fairly common for students to go outlier crazy at this point; don't fall into that trap.

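
The logic of the outlier check in hint 3 can be sketched outside SPSS as follows. This is a rough illustration only: the helper `fit_line` and the (hyper, $Y$) pairs are invented for the example, and the filter mirrors the "hyper \< 35" expression used above.

```python
from math import sqrt


def fit_line(pairs):
    """Return (slope, r) of the least-squares line through (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    ss_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    ss_y = sum((y - mean_y) ** 2 for _, y in pairs)
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return sp_xy / ss_x, sp_xy / sqrt(ss_x * ss_y)


# Hypothetical (hyper, Y) pairs; the last case is the suspected outlier.
cases = [(10, 5), (14, 7), (18, 8), (22, 11), (26, 12), (38, 4)]

# Same idea as Select Cases > If... with the expression "hyper < 35".
kept = [(h, y) for h, y in cases if h < 35]

slope_all, r_all = fit_line(cases)
slope_kept, r_kept = fit_line(kept)

print(f"All cases:        slope = {slope_all:.3f}, r = {r_all:.3f}")
print(f"Outlier excluded: slope = {slope_kept:.3f}, r = {r_kept:.3f}")
```

In this made-up example the slope and correlation change markedly when the one case is excluded, which is the situation where both sets of numbers belong in the report. With real data the change will often be small.
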
*Here are some web sites that cover SPSS and regression that you may find helpful:*

[[http://www.ats.ucla.edu/stat/spss/seminars/SPSSGraphics/spssgraph.htm]{.underline}](http://www.ats.ucla.edu/stat/spss/seminars/SPSSGraphics/spssgraph.htm)

[[http://core.ecu.edu/psyc/wuenschk/spss/corrregr-spss.doc]{.underline}](http://core.ecu.edu/psyc/wuenschk/spss/corrregr-spss.doc)

References
==========

Coladarci, T., Cobb, C. D., Minium, E. W., & Clarke, R. C. (2014). *Fundamentals of Statistical Reasoning in Education* (4^th^ ed.). Hoboken, NJ: John Wiley & Sons.

Prevatt, F., Dehili, V., Taylor, N., & Marshall, D. (2015). Anxiety in College Students with ADHD: Relationship to Cognitive Functioning. *Journal of Attention Disorders*, **19**, 222-230. doi:10.1177/1087054712457037

[^1]: These are called loess curves in SPSS.

[^2]: If you have short names, like the state postal codes, and you add them in the labels field when building the plot, you will get labels instead of case numbers. In this data set we have nothing more useful than the case numbers.

[^3]: See the handout on residuals.

[^4]: This is covered briefly starting on page 148 of Coladarci et al. (2014). However, the formula given in the book is incomplete: it only includes one source of uncertainty, the uncertainty due to the fact that the data points are not exactly on the regression line. This uncertainty is measured with the standard error of the estimate. There is also an additional source of uncertainty, as we have estimated the slope and intercept with a sample. The formulae given in the lectures take this into account, as do the calculations in SPSS. Basically, SPSS does the right thing, so we don't need to worry too much about the simplification in your book.
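
    For reference, the standard textbook form of the individual prediction interval at a new value $X_0$ is sketched below. The notation is an assumption and may differ from the lecture notes: $s_{Y \cdot X}$ is the standard error of the estimate, $n$ is the number of complete cases, and $t_{\alpha/2,\,n-2}$ is the critical value of the $t$ distribution with $n-2$ degrees of freedom.
    $$\hat{Y}_0 \pm t_{\alpha/2,\,n-2}\; s_{Y \cdot X} \sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$$
    The $1$ under the square root is the scatter of individual points around the line (the only source the simplified formula keeps); the other two terms reflect the uncertainty in the estimated intercept and slope, and they grow as $X_0$ moves away from $\bar{X}$.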