---
title: "Exploratory Data Analysis with ggplot2"
author: "Russell Almond"
date: "August 27, 2020"
output:
  pdf_document: default
  html_document: default
  word_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

```

## Tidyverse Software

For this example, we are going to use ggplot2, which is part of the tidyverse.  The tidyverse is an extra layer on top of R that makes it easy to manipulate data as a kind of workflow.  Note that tidyverse is actually a meta-package:  it installs a number of generally useful packages, including ggplot2.  (The "gg" stands for _Grammar of Graphics_, a book about how to build up complex plots from smaller pieces.)

The command `install.packages()` installs packages; that is, it downloads them from CRAN to your local computer.  The command `library()` tells R that you want to use that package in this session.  You need to run `library()` in every new session, but you only need to run `install.packages()` once.  

```{r library, echo=TRUE}
if (!("tidyverse" %in% row.names(installed.packages()))) {
  install.packages("tidyverse",repos="https://cloud.r-project.org",dependencies=TRUE)
}
library(tidyverse)

```


# Dplyr tools

Tools for manipulating data.

## Tibbles

For this exercise we will use the data set `state.x77` which comes with R.  You can find more information about this data set by doing:
```{r}
help(state.x77)
```

A `tibble` is a data structure with rows corresponding to cases and columns to variables.  It is a _tidy_ version of a data frame.


```{r}
as_tibble(state.x77) %>% add_column(region=state.region,name=state.name,code=state.abb,center_x=state.center$x,center_y=state.center$y) -> state77
View(state77)
state77
```

The `View()` command opens the data frame/matrix/tibble in another window.  

* Try `state77` in the console.  The tibble is slightly different from the data frame in the way it prints.

* Tibbles and data frames are pretty much interchangeable.  (Where they aren't, use `as.data.frame()` or `as_tibble()` to convert.)

![Dangerous Bend](dangerousBend.png)
Note that the types of the variables are shown in the display of the tibble.  The name and postal code are left as strings, but region is a factor (with four levels).  In versions of R before 4.0.0, `data.frame()` and `read.csv()` automatically converted string variables to factors, which is not always what you want.

* Use `read_csv()` instead of `read.csv()` to load a CSV file as a tibble instead of a data frame.
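
As a small illustration of the difference, the sketch below writes a tiny made-up CSV file to a temporary location and reads it both ways.  The file contents and the names `tmp`, `df`, and `tbl` are invented for this example.

```{r}
library(readr)

# Write a small made-up CSV file to a temporary location.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,group", "1,a", "2,b", "3,a"), tmp)

df  <- read.csv(tmp)    # base R: returns a data frame
tbl <- read_csv(tmp)    # readr: returns a tibble

class(df)
class(tbl)
```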


## The Pipe

The special operator `%>%` can be used to chain operations together.

The expression above gives an example.  The output of `as_tibble()` is passed to `add_column()`, and the output of that is passed to the assignment operator `->`.

Note the right-pointing arrow `->`.  This is like the usual assignment operator `<-` except that the name of the variable is on the right instead of the left.


A typical chain looks like:

_data_ `%>%` `select`(_variables_) `%>%` `filter`(_cases_) `%>%` _analysis_() `->` _result_

Or maybe the _analysis_ is replaced with a call to _ggplot_ to make a plot.
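
To see what the pipe buys us, here is a minimal sketch that writes the same computation twice:  once as nested function calls and once as a chain.  It uses the built-in `mtcars` data set rather than `state77` so that it can be run on its own; the names `nested` and `piped` are made up for this example.

```{r}
library(dplyr)

# Nested calls: read from the inside out.
nested <- summarize(filter(select(mtcars, mpg, cyl), cyl == 4),
                    mean_mpg = mean(mpg))

# The same steps as a chain: read from left to right.
mtcars %>%
  select(mpg, cyl) %>%
  filter(cyl == 4) %>%
  summarize(mean_mpg = mean(mpg)) -> piped

all.equal(nested, piped)
```

Both versions do the same work; the chain simply lists the steps in the order they happen.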


## Selecting Variables

The `select()` command can be used to select a subset of variables.

```{r}
state77 %>% select(code,Population,Income)
state77 %>% select(code,region:code)
state77 %>% select(-name)
state77 %>% select(code,starts_with("center"))
```

Usually having more columns than you need is harmless.  

For example, using `lm()` to fit a regression or `ggplot()` to make a plot will just use the variables referenced in the model or plot description.

However, sometimes it is easier to work with a smaller subset of the data with just the stuff you need.

## Making New Variables

We already saw the `add_column()` function for adding columns.

The `mutate()` function adds new columns as a function of the old ones:

```{r}
state77 %>% mutate(Pop_Density=Population/Area) -> state77a
state77a
```


## Recoding Variables

Recoding is important because sometimes the way the variable is stored in the data file is not the same as the way we want to analyze it. 

* Factor variables can represent categories with integer values or string labels.

  + Often there is a _code book_ which maps integer category labels to string values.  For example:
  
1. Female
2. Male

The `factor()` function creates factor variables.  

```{r}
factor(c(1,1,1,2,2,2),levels=1:2,labels=c("Female","Male"))
factor(c("Male","Male","Male","Female","Female","Female"),levels=c("Male","Female"))
ordered(c("H","H","M","M","L","L"), levels=c("L","M","H"))

```
* The `levels` argument tells R how the data are coded (in the case of integer coding).
* The `labels` argument gives the names for the levels (if omitted it is the same as `levels`).

![Dangerous Bend](dangerousBend.png)
The `ordered()` function produces an ordered variable as opposed to `factor()` which produces a nominal one.  This only makes a difference in a few places.  Probably the most important one is how they are used in an Analysis of Variance (ANOVA).  That is covered in EDF 5402.
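
One place where the difference is easy to see is in comparisons.  The sketch below uses a made-up vector `sizes` to build both kinds of variable from the same values.

```{r}
sizes <- c("H", "L", "M")

nominal <- factor(sizes, levels = c("L", "M", "H"))    # categories only
ranked  <- ordered(sizes, levels = c("L", "M", "H"))   # L < M < H

is.ordered(nominal)
is.ordered(ranked)

# Comparisons and max() are only meaningful for the ordered version.
ranked < "H"
max(ranked)
```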

_Nota Bene!_  The `read_csv()` function, which is part of the tidyverse, will read factor variables as either character or numeric variables, depending on how they are coded.  So you will need to use `mutate(x=factor(x))` to convert `x` into a factor.

The function `parse_factor()` is almost the same, but gives a warning if some of the levels aren't recognized.

```{r}

factor(c("Male","Female","Non-binary"),levels=c("Male","Female"))
parse_factor(c("Male","Female","Non-binary"),levels=c("Male","Female"))
```
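
Returning to the conversion with `mutate()` mentioned above, here is a minimal sketch of what that step might look like.  The tibble `survey` and its columns are made up; the coding follows the code book example (1 is Female, 2 is Male).

```{r}
library(dplyr)

# Made-up data, coded the way it might arrive from read_csv().
survey <- tibble(id = 1:4, gender = c(1, 2, 2, 1))

survey %>%
  mutate(gender = factor(gender, levels = 1:2,
                         labels = c("Female", "Male"))) -> survey2

survey2
levels(survey2$gender)
```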

Another way to do the coding is to use:

* `recode()` (makes a character or numeric value)
* `recode_factor()` (makes a factor variable)

The first argument is the vector to be recoded; the remaining arguments specify the replacements, usually as _old_ `=` _new_ pairs.  

```{r}
recode_factor(c(1,1,1,2,2,2),`1`="Male",`2`="Female")
recode_factor(c(1,1,1,2,2,2),"Male","Female")
recode_factor(c("M","M","F","F"),M="Male",F="Female")
recode_factor(c("White","Black","Latinx","Other"),White="White",.default="Non-White")


```

Note how we used the last version to collapse several categories into one.  This is often useful, particularly when the number of subjects in one category is small.


## Recoding NAs

A special case of recoding comes about with missing values.  

In R, these are called `NA` (for Not Available).  

* `NA`s are contagious:  `NA` + anything is still `NA`.
```{r}
NA+5
mean(c(1,2,NA))
mean(c(1,2,NA),na.rm=TRUE)
```


* `NaN` (not a number) is similar, but it comes from nonsense arithmetic (for example, taking the log of a negative number).

* `NA`s can be coded in many different ways in a data set:  
  + Leave the value blank.
  + Special character, e.g., `.` or `*`
  + Special String, e.g., `NA`
  + Nonsense numeric value, e.g., `-9`

When using nonsense numeric values, it is important to pick a value that is not plausible, e.g., a large negative value.  That way, if you accidentally forget to convert, you will know that something is wrong.
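
If the data come from a CSV file, one option is to declare the missing value codes when the file is read.  The sketch below assumes the file is read with `read_csv()` and uses its `na` argument, which is not otherwise covered in these notes; the file contents are made up.

```{r}
library(readr)

# A made-up file that uses a blank, "." and -9 as missing value codes.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,score,grade",
             "1,10,A",
             "2,.,B",
             "3,-9,C",
             "4,,D"), tmp)

# Every string listed in `na` is read as NA.
scores <- read_csv(tmp, na = c("", ".", "-9"))
scores
```

Note that `na` applies to every column, so this only works when the code (here `-9`) is not a legitimate value anywhere in the file.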

The function `na_if()` can be used to replace a value with `NA`.

```{r}
na_if(c(1:5,-9),-9)
starwars %>% select(name,eye_color) %>%
  mutate(eye_color=na_if(eye_color,"unknown"))
```
The function `replace_na()` goes in the opposite direction.

For example, we might want to treat missing values as a score of 0 on a test.

```{r}
replace_na(c(1,1,0,0,NA),0)
```


## Logical Tests

The function `if_else()` is also useful for splitting data sets up into groups.  

We can see the form in:
```{r}
args(if_else)
```

Note that `condition` is a logical expression which should yield a true or false value for every row of the tibble.  The argument `true` is the value to use if true, `false` the value to use if false, and `missing` the value to use if missing.

```{r}
int5 <- -5:5
if_else(int5<0,"-","+")
if_else(int5<0,-int5,int5) #Absolute value
na_if(int5,0)
if_else(na_if(int5,0)<0 ,"-","+","0")

```
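
To connect this back to splitting a data set into groups, here is a minimal sketch that uses `if_else()` inside `mutate()`.  It rebuilds a small tibble from the built-in state data so that it can be run on its own; the names `states`, `states2`, `size`, and `size_counts` are made up for this example.

```{r}
library(dplyr)

states <- tibble(name = state.name,
                 Area = unname(state.x77[, "Area"]))

# Label each state by whether its area is above the median.
states %>%
  mutate(size = if_else(Area > median(Area), "large", "small")) -> states2

# Count the states in each group.
states2 %>% count(size) -> size_counts
size_counts
```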

Here are the common logical tests:

* `==`  -- equals (don't confuse this with `=` assignment.)
* `!=` -- not equals
* `<`, `<=`, `>=`, `>` -- less than, &c.
* `!` -- Not (true if the rest of the expression is false)
* `is.na()` -- True if the value is NA, false otherwise. (Also, `!is.na()`)

* `&` -- logical and (true when LHS and RHS are true)
* `|` -- logical or (true if either LHS or RHS is true)
* `%in%` -- True if value is in list.


```{r}
drupes <- c("Almond","Cashew","Walnut")
c("Peanut","Almond","Hazelnut","Macadamia","Cashew") %in% drupes
```
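
The other operators in the list can be tried out the same way.  The vector `x` below is made up; note how the `NA` propagates through most of the tests.

```{r}
x <- c(3, NA, 8, 12)

x >= 5               # comparison; NA stays NA
x > 2 & x < 10       # and: both sides must be true
x < 5 | x > 10       # or: either side may be true
!is.na(x)            # TRUE where the value is present
is.na(x) | x > 10    # TRUE | NA is TRUE, so there is no NA in this result
```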

## Selecting Cases

Very often instead of setting the value to NA, we just want to exclude that row from the data set.

The command `filter()` does this.

```{r}
state77 %>% filter(!(code %in% c("AK","HI")))
```

Sometimes we want to temporarily remove the biggest values or the smallest values so we can see the details in a plot.

```{r}
state77 %>% select(name,Area) %>% filter(Area <200000)
```

Sometimes we want to create subsets of the data that just have fewer cases.

The functions `sample_frac()` and `sample_n()` specify the size of the sample in fraction of the original data or absolute size.

The function `slice()` will select a contiguous range of cases, which is useful when looping through the data.
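
Here is a minimal sketch of all three, using a small tibble built from the built-in state data so that it can be run on its own.  The seed and the names `states`, `s_n`, `s_frac`, and `first5` are made up for this example.

```{r}
library(dplyr)

states <- tibble(name = state.name, region = state.region)

set.seed(2020)                           # make the random samples repeatable
states %>% sample_n(5) -> s_n            # 5 randomly chosen rows
states %>% sample_frac(0.1) -> s_frac    # 10% of the rows (5 of 50)
states %>% slice(1:5) -> first5          # rows 1 through 5, in order

s_n
s_frac
first5
```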


## Calculating Summary Statistics

Pipe the output of the `select()` and `filter()` commands into `summarize()`:

```{r}
state77 %>% summarize(N=n(),Income=mean(Income),Population=mean(Population))
```
Here are some useful functions to use with `summarize()`:

* `n()`, `n_distinct()`, `sum(!is.na())` -- Count, count of unique values, count of non-missing values.
* `mean()`, `median()` -- Measures of center
* `min()`, `max()`, `quantile()` -- Position other than the center.
```{r}
state77 %>% select(Population) %>% summarize(Min=min(Population),Q1=quantile(Population,.25),Q2=median(Population),Q3=quantile(Population,.75),Max=max(Population))
```

* `sd()`, `IQR()`, `mad()` -- measures of scale.
* `sum()`, `prod()` -- Arithmetic
* `sum()`, `any()`, `all()` -- Summarize logical expressions (count number true, true if any is true, true if all are true).

Most of these functions have an optional argument `na.rm` (`n()` is an exception).  If there are NAs, you usually want to include `na.rm=TRUE`, as otherwise the value will be NA.
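
The state data have no missing values, so here is a minimal sketch with a made-up tibble `scores` that shows the effect of `na.rm`.

```{r}
library(dplyr)

scores <- tibble(score = c(10, 20, NA, 40))   # made-up data with one NA

scores %>%
  summarize(N        = n(),                      # counts every row
            N_valid  = sum(!is.na(score)),       # counts non-missing values
            Mean_raw = mean(score),              # NA, because of the missing value
            Mean     = mean(score, na.rm = TRUE)) -> score_summary
score_summary
```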

## Summarizing Multiple Columns

Often, you want to do the same summary on several columns.  

The function `summarize_all()` does that.

```{r}
state77 %>% select(Area,Population) %>% summarize_all(mean,na.rm=TRUE)
```
You can use multiple statistics by putting them in a list.
```{r}
state77 %>% select(Area,Population) %>% summarize_all(list(mean=mean,sd=sd))
```

The function `summarize_at()` combines `select()` and `summarize()`.

The function `summarize_if()` allows the selection of columns based on logical criteria.
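
Here is a minimal sketch of both.  It rebuilds a tibble from `state.x77` so that it can be run on its own; the names `states`, `at_means`, and `if_means` are made up for this example.

```{r}
library(dplyr)

states <- as_tibble(state.x77) %>% mutate(region = state.region)

# summarize_at(): name the columns with vars(), then give the function.
states %>% summarize_at(vars(Area, Population), mean) -> at_means
at_means

# summarize_if(): keep only the columns for which the test is true.
# Here region is a factor, so it is dropped.
states %>% summarize_if(is.numeric, mean) -> if_means
if_means
```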

## Calculating Statistics by Group

Very often we want to be able to compare groups.  We can use the function `group_by()` to split the data set by a factor variable.

```{r}
state77 %>% group_by(region) %>% select(Area,Population) %>% summarize_all(list(mean=mean,sd=sd))

```

```{r}
state77 %>% group_by(region) %>%
  select(Area,Population) %>%
  summarise_all(list(Min=min,Q1=function(x){quantile(x,.25)},Q2=median,Q3=function(x){quantile(x,.75)},Max=max))
```
![Dangerous Bend](dangerousBend.png)
The `function(x){}` construction makes an anonymous function.  This gets around the problem that `quantile()` needs two arguments, but `summarize_all()` expects a function of just one.
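
An anonymous function is just a function that has not been given a name.  As a minimal sketch, the two calls below do the same thing; the name `q1` is made up for this example.

```{r}
# A named one-argument wrapper around quantile().
q1 <- function(x) { quantile(x, .25) }
q1(1:9)

# The same function, defined and called on the spot without a name.
(function(x) { quantile(x, .25) })(1:9)
```

Either form could be put in the list passed to `summarize_all()`.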

## The Cheat Sheet

You can find a handy list of dplyr and other tidyverse commands for manipulating data by selecting "Help > Cheat Sheets > Data Manipulation with dplyr" from the RStudio menu.

# Graphics

## Making Histograms


```{r}
ggplot(state77,aes(Population)) + geom_histogram()
```
```{r}
ggplot(state77,aes(Population)) + geom_histogram(binwidth=500)
```
```{r}
ggplot(state77,aes(Population)) + geom_histogram(bins=10)
```


```{r}
ggplot(state77,aes(Population)) + geom_dotplot()
```
```{r}
ggplot(state77,aes(Population)) +geom_dotplot(binwidth=1000) +geom_density(aes(y=..scaled..))
```
```{r}
ggplot(state77,aes(Population)) +geom_histogram(binwidth=1000) +geom_density(aes(y=1000*..count..))
```

```{r}
ggplot(state77,aes(Population)) +geom_histogram(binwidth=1000) +stat_function(fun= function(x) dnorm(x,mean=mean(state77$Population), sd=sd(state77$Population))*nrow(state77)*1000)
```
```{r}
bw <- 1000
ggplot(state77,aes(Population)) + geom_histogram(aes(y=..density..),binwidth=bw) + 
  stat_function(fun=dnorm, args=list(mean=mean(state77$Population), sd=sd(state77$Population))) +
  scale_y_continuous("Density",sec.axis=sec_axis(trans = ~ . * bw * nrow(state77), name = "Counts"))
```


## Panel Histograms by a Group

```{r}
ggplot(state77,aes(Population)) + facet_grid(rows=vars(region)) + geom_dotplot()
```

```{r}
ggplot(state77,aes(Population)) + facet_grid(rows=vars(region)) + geom_dotplot(binwidth=750)+geom_density(aes(y=750*..count..))
```

## Making Boxplots

```{r}
ggplot(state77,aes(x=region,y=Population)) + geom_boxplot()
```
```{r}
ggplot(state77,aes(x=region,y=Population)) + geom_violin()
```
```{r}
ggplot(state77,aes(region,Population)) + geom_dotplot(binaxis="y",stackdir="center")
```


# Saving Your Work

## Saving Your Plots

```{r}
ggsave("foo.png")
```

![Just saved file.](foo.png)


## Saving Your Tables

```{r}
library(xtable)
# digits is an argument of xtable(), not of print()
print(xtable(state77 %>% group_by(region) %>% select(Population,Area) %>% summarize_all(list(mean=mean,sd=sd)),digits=3),type="html",file="foo.html")

```
[result](foo.html)

## Working in R Markdown