--- title: "Exploratory Data Analysis with GGplot" author: "Russell Almond" date: "August 27, 2020" output: pdf_document: default html_document: default word_document: default --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Tidyverse Software For this example, we are going to use GGplot, which is part of the tidyverse. Tidyverse is an extra layer on top of R which makes it easy to manipulate data as a kind of a workflow. Note that tidyverse is actually a meta-package: it downloads a number of generally useful packages, including GGplot (GG stands for _Grammar of Graphics_, a book about how to build up complex plots from smaller pieces.) The command `install.packages()` installs packages, that is, it downloads them from the CRAN library to your local computer. The command `library()` tells R that you want to use that package in this session. You need to run `library()` every time, but you only need to run `install.packages()` once. ```{r library, echo=TRUE} if (!("tidyverse" %in% row.names(installed.packages()))) { install.packages("tidyverse",repos="https://cloud.r-project.org",dependencies=TRUE) } library(tidyverse) ``` # Dplyr tools Tools for manipulating data. ## Tibbles For this exercise we will use the data set `state.x77` which comes with R. You can find more information about this data set by doing: ```{r} help(state.x77) ``` A `tibble` is a data structure with rows corresponding to cases and columns to variables. It is a _tidy_ version of a data frame. ```{r} as_tibble(state.x77) %>% add_column(region=state.region,name=state.name,code=state.abb,center_x=state.center$x,center_y=state.center$y) -> state77 View(state77) state77 ``` The `View()` command opens the data frame/matrix/tibble in another window. * Try `state77` in the console. The tibble is slightly different from the data frame in the way it prints. * Tibble and data frames are pretty much interchangeable. (Where they aren't use `as.data.frame()` or `as_tibble()` to convert. ![Dangerous Bend](dangerousBend.png) Note the type of the variables are shown in the display of the tibble. The name and postal code are left as strings, but region is a factor (with four levels). In a data frame, the string variables are automatically converted to factors, which is not always what you want. * Use `read_csv()` instead of `read.csv()` to load a CSV file as a tibble instead of a data frame. ## The Pipe The special operator `%>%` can be used to chain operations together. The expression above gives an example. The output of `as_tibble()` is passed to the `add_column()` which is then passed to the assignment operator `->`. Note the backward arrow `->`. This is like the usual assignment operator `<-` except now the name of the variable is on the right instead of the left. A typical chain looks like: _data_ `%>%` `select`(_variables_) `%>%` `filter`(_cases_) `%>%` _analysis_() `->` _result_ Or maybe the _analysis_ is replaced with a call to _ggplot_ to make a plot. ## Selecting Variables The `select()` command can be used to select a subset of variables. ```{r} state77 %>% select(code,Population,Income) state77 %>% select(code,region:code) state77 %>% select(-name) state77 %>% select(code,starts_with("center")) ``` Usually having more columns than you need is harmless. For example, using `lm()` to fit a regression of `ggplot()` to make a plot will just use the variables referenced in the model or plot description. However, sometimes is it easier to work with a smaller subset of the data with just the stuff you need. ## Making New Variables We already saw the `add_column()` function for adding columns. The `mutate()` function adds new columns as a function of the old ones: ```{r} state77 %>% mutate(Pop_Density=Population/Area) -> state77a state77a ``` ## Recoding Variables Recoding is important because sometimes the way the variable is stored in the data file is not the same as the way we want to analyze it. * Factor variables can represent categories with integer values or string labels. + Often there is a _code book_ which maps integer category labels to string values. For example: 1. Female 2. Male The `factor()` function creates factor variables. ```{r} factor(c(1,1,1,2,2,2),levels=1:2,labels=c("Female","Male")) factor(c("Male","Male","Male","Female","Female","Female"),levels=c("Male","Female")) ordered(c("H","H","M","M","L","L"), levels=c("L","M","H")) ``` * The `levels` argument tells R how the data are coded (in the case of integer coding). * The `labels` argument gives the names for the levels (if omitted it is the same as `levels`). ![Dangerous Bend](dangerousBend.png) The `ordered()` function produces an ordered variable as opposed to `factor()` which produces a nominal one. This only makes a difference in a few places. Probably the most important one is how they are used in an Analysis of Variance (ANOVA). That is covered in EDF 5402. _Note Bene!_ The `read_csv()` function which is part of the tidyverse will read factor variables as either character or integer variables, depending on how they are coded. So you will need to use `mutate(x=factor(x))` to convert `x` into a factor. The function `parse_factor()` is almost the same, but gives a warning if some of the levels aren't recognized. ```{r} factor(c("Male","Female","Non-binary"),levels=c("Male","Female")) parse_factor(c("Male","Female","Non-binary"),levels=c("Male","Female")) ``` Another way to do the coding is to use * `recode()` (makes a character or numeric value) * `recode_factor()` (makes a factor variable) The first argument is the vector to be recorded, the remaining arguments are the values to be replaced. ```{r} recode_factor(c(1,1,1,2,2,2),`1`="Male",`2`="Female") recode_factor(c(1,1,1,2,2,2),"Male","Female") recode_factor(c("M","M","F","F"),M="Male",F="Female") recode_factor(c("White","Black","Latinx","Other"),White="White",.default="Non-White") ``` Note how we used the last version to collapse several categories into one. This is often useful, particularly when the number of subjects in one category is small. ## Recoding NAs A special case of recoding comes about with missing values. In R, these are called `NA` (for Not Applicable). * `NA`s are contagious: `NA` + anything is still `NA`. ```{r} NA+5 mean(c(1,2,NA)) mean(c(1,2,NA),na.rm=TRUE) ``` * `NaN` (not a number) is similar but it comes from nonsense arthimatic (taking log of negative number). * `NA`s can be coded in many different ways in a data set: + Leave the value blank. + Special character, e.g., `.` or `*` + Special String, e.g., `NA` + Nonsense numeric value, e.g., `-9` When using nonsense numeric values, it is important to pick a value that is not plausible, e.g., a large negative value. That way, if you accidently forget to convert, you can know that something is wrong. The function `na_if()` can be used to replace a value with NAs. ```{r} na_if(c(1:5,-9),-9) starwars %>% select(name,eye_color) %>% mutate(eye_color=na_if(eye_color,"unknown")) ``` The function `replace_na()` goes in the opposite direction. For example, we might want to treat missing values as score of 0 on a test. ```{r} replace_na(c(1,1,0,0,NA),0) ``` ## Logical Tests The function `if_else()` is also useful for splitting data sets up into groups. We can see the form in: ```{r} args(if_else) ``` Note that condition is a logical expression which should yeild a true or false value for every row of the tibble. The variable `true` is the value to use if true, `false` the value to use if false, and `missing` the value to use if missing. ```{r} int5 <- -5:5 if_else(int5<0,"-","+") if_else(int5<0,-int5,int5) #Absolute value na_if(int5,0) if_else(na_if(int5,0)<0 ,"-","+","0") ``` Here are the common logical tests: * `==` -- equals (don't confuse this with `=` assignment.) * `!=` -- not equals * `<`, `<=`, `=>`, `>` -- less than, &c. * `!` -- Not (true if the rest of the expression is false) * `is.na()` -- True if the value is NA, false otherwise. (Also, `!is.na()`) * `&` -- logical and (true when LHS and RHS are true) * `|` -- logical or (true if either LHS or RHS is true) * `%in%` -- True if value is in list. ```{r} drupes <- c("Almond","Cashew","Walnut") c("Peanut","Almond","Hazelnut","Macademia","Cashew") %in% drupes ``` ## Selecting Cases Very often instead of setting the value to NA, we just want to exclude that row from the data set. The command `filter()` does this. ```{r} state77 %>% filter(!(code %in% c("AK","HI"))) ``` Sometimes we want to temporarily remove the biggest values or the smallest values so we can see the details in a plot. ```{r} state77 %>% select(name,Area) %>% filter(Area <200000) ``` Sometimes we want to create subsets of the data that just have fewer cases. The functions `sample_frac()` and `sample_n()` specify the size of the sample in fraction of the original data or absolute size. The function `slice()` will select a contiguous range of cases, which is useful when looping through the data. ## Calculating Summary Statistics Pipe the output of the select and filter command into `summarize()`: ```{r} state77 %>% summarize(N=n(),Income=mean(Income),Population=mean(Population)) ``` Here are some useful functions to use with `summarize()`: * `n()`, `n_distinct()`, `sum(!is.na())` -- Count, count of unique values, count of non-missing values. * `mean()`, `median()` -- Measures of center * `min()`, `max()`, `quantile()` -- Position other than the center. ```{r} state77 %>% select(Population) %>% summarize(Min=min(Population),Q1=quantile(Population,.25),Q2=median(Population),Q3=quantile(Population,.75),Max=max(Population)) ``` * `sd()`, `IQR()`, `mad()` -- measures of scale. * `sum()`, `prod()` -- Arithmetic * `sum()`, `any()`, `all()` -- Summarize logical expressions (count number true, true if all are true, true if any is true). All of these functions have an optional argument `na.rm`. If there are NAs, you usually want to include `na.rm=TRUE`, as otherwise the value will be NA. ## Summarizing Multiple columns. Often, you want to do the same summary on several columns. The function `summarize_all()` does that. ```{r} state77 %>% select(Area,Population) %>% summarize_all(mean,na.rm=TRUE) ``` You can use multiple statsitics by putting them in a list. ```{r} state77 %>% select(Area,Population) %>% summarize_all(list(mean=mean,sd=sd)) ``` The function `summarize_at()` combines the `select()` and `sumarize()`. The function `summarize_if()` allows the selection of columns based on logical criteria. ## Calculating Statistics by Group Very often we want to be to compare groups. We can use the function `group_by()` to split the data set by a factor variable. ```{r} state77 %>% group_by(region) %>% select(Area,Population) %>% summarize_all(list(mean=mean,sd=sd)) ``` ```{r} state77 %>% group_by(region) %>% select(Area,Population) %>% summarise_all(list(Min=min,Q1=function(x){quantile(x,.25)},Q2=median,Q3=function(x){quantile(x,.75)},Max=max)) ``` ![Dangerous Bend](dangerousBend.png) The `function(){}` makes an anonymous function. This gets around the problem that `quantile()` needs two arguments, but `summarize_all()` expects a function of just one. ## The cheat sheet. You can find a handy list of dplyr and other tidyverse commands for manipulating data by selected "Help > Cheat Sheets > Data Mainpulation with dplyr" from the RStudio menu. # Graphics ## Making Histograms ```{r} ggplot(state77,aes(Population)) + geom_histogram() ``` ```{r} ggplot(state77,aes(Population)) + geom_histogram(binwidth=500) ``` ```{r} ggplot(state77,aes(Population)) + geom_histogram(bins=10) ``` ```{r} ggplot(state77,aes(Population)) + geom_dotplot() ``` ```{r} ggplot(state77,aes(Population)) +geom_dotplot(binwidth=1000) +geom_density(aes(y=..scaled..)) ``` ```{r} ggplot(state77,aes(Population)) +geom_histogram(binwidth=1000) +geom_density(aes(y=1000*..count..)) ``` ```{r} ggplot(state77,aes(Population)) +geom_histogram(binwidth=1000) +stat_function(fun= function(x) dnorm(x,mean=mean(state77$Population), sd=sd(state77$Population))*nrow(state77)*1000) ``` ```{r} bw <- 1000 ggplot(state77,aes(Population)) + geom_histogram(aes(y=..density..),binwidth=bw) + stat_function(fun=dnorm, args=c(mean=mean(state77$Population), sd=sd(state77$Population))) + scale_y_continuous("Density",sec.axis=sec_axis(trans = ~ . * bw * nrow(state77), name = "Counts")) ``` ## Panel Histograms by a Group ```{r} ggplot(state77,aes(Population)) + facet_grid(rows=vars(region)) + geom_dotplot() ``` ```{r} ggplot(state77,aes(Population)) + facet_grid(rows=vars(region)) + geom_dotplot(binwidth=750)+geom_density(aes(y=750*..count..)) ``` ## Making Boxplots ```{r} ggplot(state77,aes(x=region,y=Population)) + geom_boxplot() ``` ```{r} ggplot(state77,aes(x=region,y=Population)) + geom_violin() ``` ```{r} ggplot(state77,aes(region,Population)) + geom_dotplot(binaxis="y",stackdir="center") ``` # Saving Your Work ## Saving Your Plots ```{r} ggsave("foo.png") ``` ![Just saved file.](foo.png) ## Saving Your Tables ```{r} library(xtable) print(xtable(state77 %>% group_by(region)%>% select(Population,Area) %>% summarize_all(list(mean=mean,sd=sd))),digits=3,type="html",file="foo.html") ``` [result](foo.html) ## Working in R Markdown