---
title: "Lab 2 Exploratory Data Analysis (in R)"
author: "Russell Almond"
date: "5/22/2022"
output:
  html_document: default
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

The source for this file is available at:
[Part2EDAinR.Rmd](https://pluto.coe.fsu.edu/svn/common/rgroup-shiny/ADHDLab/Part2EDAinR.Rmd)


# A Recap of Part 1

## Loading Packages

We showed how to install R, RStudio, and packages, and talked about the Tidyverse packages.

We are going to use one more package called `xtable` which will help make tables.  We will need to install this one.  We will also need the `haven` package to read SPSS data.

```{r installFlexplot, eval=FALSE}
## These only need to be run once, so I've set Eval to False.
install.packages('xtable')
install.packages('haven')
```

Now load the needed packages.

```{r setupLibraries}
library(tidyverse)
library(psych)
```

## Reload the data

Part 1 talked about how to load the data into R and clean  it.  

I have a slightly larger version of the ACED data set already loaded and cleaned here:
[ACED-Subset-New.sav](https://pluto.coe.fsu.edu/svn/common/rgroup-shiny/ADHDLab/ACED-Subset-New.sav)

This is an SPSS data set, so we can use the `haven::read_spss()` function to load it.

```{r load data}
ACEDnew <- haven::read_spss("https://pluto.coe.fsu.edu/svn/common/rgroup-shiny/ADHDLab/ACED-Subset-New.sav")
head(ACEDnew)
```


# Exploratory Data Analysis

The goal of exploratory data analysis is to get a feeling for the data.  This helps the analyst (1) check that the data was properly entered and cleaned, (2) understand what to expect from the data, and (3) look at potential problems and solutions.

Although the goal is to work model free, a big question at this stage is "Are the data close enough to normal to use statistics based on the normal distribution?"  To answer that, the analyst wants to look at the skewness and kurtosis.  In particular, either strong positive or negative skewness or high kurtosis means the analyst should consider either transformations of the variables or robust alternatives to the standard normal theory tests.  Usually graphical displays, particularly histograms and boxplots, are best for judging skewness and kurtosis.

The other thing we want to mention here is the center and spread. This means either the mean and standard deviation or one of the robust alternatives.  

It is also important to look at the number of valid (non-missing) observations.  If this is substantially different from the total sample size, it can tell the analyst that only a subset of the subjects were measured on some particular variable.  In particular, it can add a selection bias to the statistic, making the results harder to generalize.

## Measures of Center

The two most used measures of center are the `mean` and the `median`.

| Efficient (Normal) | Robust | Poor   |
| ------------------ | ------ | ---------------------- |
| mean               | median | minimum, maximum, mode |
|                    | trimmed|                        |

The corresponding R functions are `mean`, `median`, `min`, `max`.  Note that the function `mode` returns the storage mode of the data, and not the statistic.  To get a trimmed mean, use `mean(x, trim=.1)` to trim 10 percent of the values from each end.

```{r trimmed Mean}
mean(c(-100,2:10))
mean(c(-100,2:10),trim=.1)
```
Note that R by default includes NAs, and NAs are contagious.  Use the `na.rm=TRUE` argument to get rid of them.

```{r RemoveNAs}
mean(c(NA,1:10))
mean(c(NA,1:10),na.rm=TRUE)
```

The `mean` is the efficient choice: if the data are approximately normal, the mean will have the lowest standard error for the given sample size.  However, the mean is sensitive to outliers, which are commonly found in distributions with high kurtosis, or strong positive or negative skewness.  The trimmed mean explicitly allows a tradeoff between robustness and efficiency (the higher the trim, the more outliers it will remove, but also the more good data points).

The `median` is more robust to departures from normality and resistant to outliers, but if the data are approximately normally distributed, the standard error of the median is generally bigger than the standard error of the mean.

The `minimum` and `maximum`, although very useful for data cleaning, are not really useful for describing the location of the data.  If there are one or more outliers, chances are they will be the minimum, the maximum, or both.

The `mode` is also not a particularly good measure of center.  Technically, if all of the values are unique, then they are all modes, so we haven't reduced the data at all.  If there are multiple modes, SPSS just reports the lowest (with a footnote), so real analysts seldom bother with it.
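
Base R has no built-in function for the statistical mode.  Here is a minimal sketch of one way to find it with `table()`; the helper name `stat_mode` and the data are made up for illustration.

```{r statModeSketch}
## Hypothetical helper: return every value tied for the highest frequency.
stat_mode <- function(x) {
  counts <- table(x)                      # frequency of each distinct value
  names(counts)[counts == max(counts)]    # all values that reach the maximum
}

x <- c(1, 2, 2, 3, 3, 4)   # made-up data with two modes
mode(x)                    # storage mode of the vector, not the statistic
stat_mode(x)               # both modes, returned as character strings
```

Note that the helper returns all tied values rather than picking one, which makes the "multiple modes" problem visible.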

## Measures of Scale

The spread is related to how closely the data points are grouped around the center.  The _standard deviation_ (`SD`) is the usual choice, and the _interquartile range_ (`IQR`) offers a robust alternative.

| Efficient (Normal) | Robust | Poor   |
| ------------------ | ------ | --------|
| sd                 | IQR    | range  |
|                    | mad    |    |
                  
The R commands are `sd`, `IQR`, and `mad`.  The function `var` also exists, but it works just a little bit differently (see Part 3).

The formal statistical definition of the range is `max - min`.  It is fairly common to think of the pair `(min, max)` as the range, and that pair is what the R function `range` returns; use `diff(range(x))` to get the single number that SPSS computes.  While the actual min and max have some value in checking that the data are properly cleaned, the range itself is neither robust nor efficient.

The _median absolute deviation_ (`mad`) is another robust measure of spread.  Unfortunately, SPSS does not allow us to select it when making tables of summary statistics.
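
To see how these measures react to an outlier, here is a small sketch on made-up data.  Note that `mad` multiplies the raw median absolute deviation by 1.4826 by default, so that it is comparable to the SD for normal data.

```{r scaleMeasuresSketch}
x <- c(2, 4, 4, 5, 6, 7, 9, 40)   # made-up data with one large outlier

sd(x)            # standard deviation: pulled up by the outlier
IQR(x)           # interquartile range: uses only the middle half of the data
mad(x)           # median absolute deviation, rescaled by 1.4826 by default
range(x)         # returns the pair c(min, max)
diff(range(x))   # max - min, the range as a single number
```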

## Skewness and Kurtosis

Usually, the best way to judge skewness or kurtosis is graphically (using a histogram or a boxplot).  However, SPSS offers a skewness and kurtosis statistic as well.

A symmetric distribution (like the normal distribution) has a skewness of zero; negative values of the skewness statistic indicate negative skewness; positive values, positive skewness.  In practice, a given set of data is rarely perfectly symmetric; a sample from a symmetric distribution will have a non-zero skewness statistic.  This is not an issue, unless the skewness is very strong (positive or negative).

How strong is a cause for concern?  The heuristic I use is to divide the skewness statistic by its standard error; only if the result is bigger than about two in absolute value do I start to worry.
I use a similar heuristic for the kurtosis.  The kurtosis statistic is scaled so that the normal distribution has a kurtosis of zero.  Only if the kurtosis statistic is bigger than two standard errors do I start to worry.  Here, I only worry about high kurtosis (heavy tails) as that means lots of outliers.  (High kurtosis slows down the central limit theorem, but low kurtosis is not a problem.)

The `psych` package provides `skew` and `kurtosi` functions, which calculate the skewness and kurtosis statistics.  Unfortunately, they do not give the standard errors.
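
One way to apply the two-standard-error heuristic in R is to supply the standard errors by hand.  The sketch below uses the rough large-sample approximations `sqrt(6/n)` for skewness and `sqrt(24/n)` for kurtosis; these are my assumption for illustration, and they may not match the standard errors SPSS reports, especially in small samples.  The data are made up.

```{r skewHeuristicSketch}
library(psych)

set.seed(42)
x <- rexp(200)                    # made-up, right-skewed data
n <- sum(!is.na(x))               # number of valid observations

se_skew <- sqrt(6 / n)            # rough large-sample standard error (assumption)
se_kurt <- sqrt(24 / n)           # rough large-sample standard error (assumption)

skew(x) / se_skew                 # beyond about +/- 2 suggests notable skewness
kurtosi(x) / se_kurt              # above about 2 suggests heavy tails
```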

# Describing Data

This section will look at generating tables of statistics.

## Describe

`describe`

`summarize`

Data Transformation Cheat Sheet
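
As a minimal sketch of the two approaches, `psych::describe` gives a standard set of statistics in one call, while `dplyr::summarize` lets you pick exactly the statistics you want.  The data frame `toy` and its columns are made up; substitute the lab data and its variable names.

```{r describeSketch}
library(psych)
library(dplyr)

## Made-up data standing in for the lab data set.
toy <- data.frame(
  group = c("A", "A", "A", "B", "B", "B"),
  score = c(10, 12, NA, 20, 23, 27)
)

describe(toy$score)                  # n, mean, sd, median, skew, kurtosis, se, ...

toy_summary <- toy %>%
  group_by(group) %>%
  summarize(n_valid = sum(!is.na(score)),        # count of non-missing values
            mean    = mean(score, na.rm = TRUE),
            sd      = sd(score, na.rm = TRUE))
toy_summary
```

Reporting `n_valid` next to the mean and SD makes it easy to spot variables with a lot of missing data.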


## Frequencies

`table`
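
A short sketch of frequency tables on a made-up categorical variable: `table` gives counts and `prop.table` turns them into proportions.

```{r freqSketch}
## Made-up data with one missing value.
year <- c("Freshman", "Sophomore", "Freshman", "Junior", NA, "Freshman")

counts <- table(year, useNA = "ifany")   # counts, with missing values shown
counts
props <- prop.table(table(year))         # proportions of the valid responses
round(props, 2)
```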

## Table Styles

`xtable`

`kable`


Be careful because default tables often contain variables you don't need, or way too many digits.  Generally speaking, the number of digits reported should match the number of digits in the raw data.  For questionnaires that are mostly counts, this means no decimals.  For money, are you rounding to the nearest cent? dollar?  hundred dollars?

- Means and medians can take one more digit (so if the raw data is an integer, round means to the nearest tenth).
- SDs can take two additional digits (so to the nearest hundredth).

Many functions, in particular `print`, but also `xtable` and `kable`, take a `digits` argument that controls how many digits are printed.
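
For example, here is a sketch using `knitr::kable` with one `digits` value per column, following the rule of thumb above (one extra digit for means, two for SDs).  The table contents are made up.

```{r kableDigitsSketch}
library(knitr)

## Made-up summary table; the values are only for illustration.
stats <- data.frame(
  Measure = c("Age", "SAT Total"),
  Mean    = c(19.4167, 1153.25),
  SD      = c(1.23456, 141.98765)
)

## One digits value per column; only numeric columns are rounded.
tab <- kable(stats, digits = c(0, 1, 2),
             caption = "Means and standard deviations (made-up values)")
tab
```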

Tables are generally numbered `Table`\ _n_, where the numbers are in order of appearance.  (Books use numbers of the form _chapter_`.`_table_; but papers just use sequential numbers.)  The space between `Table` and the number should be a non-breaking space.  Give the caption a human readable description of what is in the table.  The caption for a table goes above the table.

Caption the table in Word (or whatever word processor you are using) not SPSS, as you may need to renumber it later.  Also, make sure that the table doesn't split across page boundaries, or get separated from its caption.  (There are check boxes for these things in Word.)

# One-dimensional Graphical Summaries

`ggplot2`

`aes`

`geom_xxx`


## Histograms

* The button with a grid of dots is used to open a dialog which can be used to adjust the number of bins in the histogram.

* The button with the normal curve can be used to add a normal or other distributional curve.

`geom_histogram`

`geom_density`
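
In `ggplot2`, the number of bins is set with the `bins` (or `binwidth`) argument of `geom_histogram`, and curves are added as extra layers.  The sketch below, on made-up data, overlays a density estimate and a normal curve; using `stat_function` with `dnorm` for the normal curve is my choice here, not the only way to do it.

```{r histogramSketch}
library(ggplot2)

set.seed(1)
toy <- data.frame(score = rnorm(200, mean = 50, sd = 10))   # made-up data

p <- ggplot(toy, aes(x = score)) +
  geom_histogram(aes(y = after_stat(density)),   # density scale, to match the curves
                 bins = 20, fill = "grey80", colour = "black") +
  geom_density() +                               # smoothed estimate of the shape
  stat_function(fun = dnorm,                     # normal curve for comparison
                args = list(mean = mean(toy$score), sd = sd(toy$score)),
                linetype = "dashed") +
  labs(x = "Score", y = "Density")
p
```

Try a few different values of `bins`; the apparent shape of a histogram can change quite a bit with the bin width.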


## Boxplots

## Barcharts

## Pie Charts

* Angles are harder to judge than lengths, so pie charts are harder to read than bar charts.
  - 3D pie charts add a visual rotation to the angle judgement, making them even worse than 2D pie charts.

* Pie charts give a better sense of part-of-whole than do bar charts.

# Creating New Variables

SPSS allows you to perform mathematical operations on scale variables (and some logical operations on nominal and ordinal ones) to create new variables.

SPSS allows you to recode nominal and ordinal variables in several ways.

## Compute Variables
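
In R, one common way to compute a new variable is `dplyr::mutate`.  This is a minimal sketch on a made-up data frame; the column names `verbal`, `quant`, and `total` are hypothetical, so check `names()` on the real data for the actual variable names.

```{r computeSketch}
library(dplyr)

## Made-up scores; column names are hypothetical.
toy <- data.frame(verbal = c(520, 610, NA),
                  quant  = c(580, 640, 700))

toy <- toy %>%
  mutate(total = verbal + quant)   # NA if either part is missing
toy
```

Note that the sum is missing whenever either component is missing, which is one reason computed variables can have a smaller valid sample size.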


## Recoding Variables

### String to Numeric


### Collapsing Categories
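
One option in the Tidyverse is `forcats::fct_collapse`, which merges several factor levels into one.  A sketch on a made-up factor (the new level names `Lower` and `Upper` are hypothetical):

```{r collapseSketch}
library(forcats)

## Made-up class-year data.
year <- factor(c("Freshman", "Sophomore", "Junior", "Senior", "Junior"))

division <- fct_collapse(year,
                         Lower = c("Freshman", "Sophomore"),
                         Upper = c("Junior", "Senior"))
table(year, division)   # check that each old category landed where intended
```

Cross-tabulating the old and new variables is a quick check that the recoding did what you meant.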

# Breakdown by Category


## Cross-tabs
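
A cross-tab counts the cases in each combination of two categorical variables.  Here is a base R sketch using `xtabs` on made-up data; the column names are for illustration only.

```{r crosstabSketch}
## Made-up data: two categorical variables.
toy <- data.frame(
  year  = c("Freshman", "Freshman", "Sophomore", "Sophomore", "Sophomore"),
  group = c("Control", "Treatment", "Control", "Control", "Treatment")
)

tab <- xtabs(~ year + group, data = toy)   # counts for each combination
tab
addmargins(tab)                            # add row and column totals
round(prop.table(tab, margin = 1), 2)      # proportions within each year
```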

## Pivoting Tables


## Boxplots and Barplots

Bar charts are similar.  When you have two categorical variables, sometimes swapping which one is the main variable and which defines the clusters will make a better plot.  Try it both ways to see.
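
Swapping the roles of the two variables is a small change in `ggplot2`: exchange the `x` and `fill` aesthetics.  A sketch on made-up data:

```{r clusterBarSketch}
library(ggplot2)

## Made-up categories.
toy <- data.frame(
  year  = c("Freshman", "Freshman", "Sophomore", "Sophomore", "Sophomore", "Junior"),
  group = c("Control", "Treatment", "Control", "Treatment", "Treatment", "Control")
)

## Year on the axis, group as the clusters ...
p1 <- ggplot(toy, aes(x = year, fill = group)) + geom_bar(position = "dodge")
## ... and the same counts with the roles swapped.
p2 <- ggplot(toy, aes(x = group, fill = year)) + geom_bar(position = "dodge")
p1
p2
```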


## Paneling by Rows and Columns

For histograms paneling by rows is much better than paneling by columns because it lines up the X-axis (it doesn't matter as much with other graph types).  The graph by race is paneled by rows.  Note how this makes it easy to judge the difference in center or spread.
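
In `ggplot2`, paneling by rows can be done with `facet_grid(rows = vars(...))`.  The sketch below uses made-up data with a hypothetical grouping variable `group`.

```{r facetRowsSketch}
library(ggplot2)

set.seed(2)
toy <- data.frame(                         # made-up scores for two groups
  group = rep(c("A", "B"), each = 100),
  score = c(rnorm(100, mean = 50, sd = 10), rnorm(100, mean = 60, sd = 10))
)

p_rows <- ggplot(toy, aes(x = score)) +
  geom_histogram(bins = 20) +
  facet_grid(rows = vars(group))           # one panel per group, stacked vertically
p_rows
```

Because the panels share the X-axis, the shift in center between the two groups is visible at a glance.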


# Looking at Subsets of Data

* (Temporarily) Remove outliers
* Temporarily Remove Group

_Sensitivity Analysis_:  Remove potentially influential observations and see how much statistics change.
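
A minimal sketch of a sensitivity analysis on made-up data: compute the statistics with and without a suspected outlier and compare.  The cutoff of 90 is hypothetical and chosen by eye.

```{r sensitivitySketch}
x <- c(12, 14, 15, 15, 16, 18, 95)     # made-up data; 95 is a suspected outlier

keep <- x < 90                         # hypothetical cutoff, chosen by eye
full    <- c(mean = mean(x),       sd = sd(x),       median = median(x))
reduced <- c(mean = mean(x[keep]), sd = sd(x[keep]), median = median(x[keep]))

round(rbind(full, reduced), 2)         # compare the statistics side by side
```

Here the mean and SD change a lot while the median does not, which is what robustness looks like in practice.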

# Choosing Graphics

## Make them part of the story

## Graphs at end, versus graphs in text

## Figures and Captions

## Meaningful Labels

## Digits and Unnecessary Statistics

# Your Mission

The variables in the lab can be grouped into several categories:

* _Study Condition_:  Group

* _Background Variables_: Gender, Year, Age, Ethnicity

* _SAT_:  SAT, SATVERBAL, SATQUANT, SATWRIT
  - Also, create SATTOTAL by adding Verbal and Quant
  
* _Anxiety Measures_: GADD, genaxa

* _Panic Measures_:  PAG, paa

* _ADHD Measures_: inatt, hyper
  - The complete ADHD measure is the sum of inatt and hyper.
  - Note carefully that the sample size for the ADHD measure is much smaller than for the background variables.  Why this discrepancy and who is missing?
  
1) Describe how the sample is broken down by "Year".  Also, look at the relationship between "Year" and "group".  You will want to use a graph or table to support your results.

2) Describe the distribution of Age in the sample.  You should include measures of center and scale and a description of skewness and kurtosis.  Include a figure which shows the distribution.

3) Describe the distribution of the SATTOTAL (note, you will need to compute this variable, it is not in the original data set).  Again, include center, scale, skewness, kurtosis and a figure to support the latter.

4) Describe the distribution of the anxiety and panic measures.  Again, include center, scale, skewness, kurtosis and a figure to support the latter.
  
5) Describe the distribution of the inatt, hyper, and total ADHD symptoms measures (you will need to compute this).  Again, include center, scale, skewness, kurtosis and a figure to support the latter.

6) Pick one of the anxiety or panic measures and describe how it differs across years.

You can append this information to the Part 1 draft:  this is the first subsection in the "Results" section.


# Rubric for the Lab

Here are the score points for the technical steps in Part 2:

* Create Combined variables (SATTOTAL and ADHDSymptoms) [10 pts]

* Histograms, boxplots, or barplots for all variables. [10 pts]

* Text summaries which include information about center, scale and pointers to the figure.  [10 pts]

* Anxiety (or Panic) by Year Comparison (text) [10 pts]

* Anxiety (or Panic) by Year Comparison (graphs and tables) [10 pts]

* Year by Group description. [10 pts]

Style accounts for the remaining 40 points.

* Text style [20 pts]  
  - Pay close attention to sequencing the analyses in a way that makes them easy for the reader to follow.  
  - Make sure that figures and tables have numbers & captions, are referenced in the text and don't split across pages or get separated from their captions.

* Graph and Table Style [20 pts]
  - Labels should be human readable, not SPSS variable names.
  - Reasonable number of digits.
  - Don't include unnecessary statistics.
  - Make sure that the figure or table is part of the story of your paper.