--- title: "Working With R Data" author: "Russell Almond" date: "9/4/2020" output: pdf_document: default html_document: default --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) library(tidyverse) ``` # Objectives At the end of this lesson you should be able to * Make vectors in R * Access parts of the vector using the `[]` operator. + Numeric Indexes + Negative Indexes + Logical Indexes + Character Indexes * Check types of object using `is` and `mode` functions. * Convert types of object using `as` functions. * Access names elements of lists using `$`. * Access elements, row and columns of matrixes using `[,]` * Convert between data frames and matrixes * Read and write data frames using `read.csv` and `write.csv`. This lesson covers the traditional R way of doing things. The next lesson will show tidyverse alternatives. # R Objects containing data ## Basic R Container objects * Vector -- ordered collection of objects of the same storage `mode` (`[` extract) + Named Vector -- adds a `names` attribute (Can use names in subscripts) + Matrix, Array -- adds a `dim` and `dimnames` attribute * List -- ordered collection of objects of any type or mode (`[[` extract) + Named List -- add `names` attribute (Can use `$` to extract elements) + S3 Class -- adds a `class` attribute + data.frame -- a list of columns in a spreadsheet. Uses (`[` or `$` to extract). + tibble -- The tidyverse extension of a data frame. * S4 Class -- formal class mechanism. Uses `@` instead of `$`. ## Storage modes. The mode function in R refers to storage modes, not the mode of a distribution. ```{r} mode(123) mode(123L) mode(TRUE) mode("True") mode(3.14) mode(t) ?mode ``` * The `is.XXX` functions can be used to check the type (mode or class) of an object. * The `as.XXX` functions can be used to convert between different types. ```{r} is.integer(3) is.integer(3L) as.integer(3) is.integer(as.integer(3)) as.integer("three") as.character(3) as.logical(3) ``` The most commonly seen modes are: * Numeric + Real or double (the default) + Integer (Putting an `L` after a number tells R that this should be an integer.) + Logical (`TRUE`/`T` or `FALSE`/`F`) * Character -- Each element of a character vector is a string. * Any -- A vector of anything is a list; thus, almost all R objects are in fact vectors. ## Factors * The `factor` and `ordered` classes also behave a lot like storage modes. * Atucally, they are R classes where the data values are integers and there is a special property which gives the names of the levels. * The built-in data value `state.region` is a factor. *( The function `head()` lists the first 6 data points instead of all of them.) ```{r factors} head(state.region) levels(state.region) head(as.integer(state.region)) head(as.character(state.region)) unclass(state.region) ``` * The values of a factor variable are just labels, + Numeric labels - `as.integer()` + String labels - `as.character()` * The function `as.factor()` will force a numeric or character vector into a factor. + R will just pick and arbitrary order (usually alphabetical) for labels. + Alphabetical ordering doesn't always work with `as.ordered().` - `High`,`Low`,`Medium` + Use the function `ordered()` with more control over the levels. ```{r ordered} help(ordered) ofact <- ordered(c("H","M","H","L","M","H"),levels=c("L","M","H")) ofact ``` # Vectors All R objects are vectors: scalars in R are vectors of length 1. ```{r} cat("The output will start with '[1]' to show that this is a vector.\n") 3.14159 ``` ## Making vectors The `:` operator produces sequences (of integers) between first and second argument. (The function `seq()` allows step sizes of other than one.) ```{r} 1:3 3:1 -1:1 -3:-1 ``` The `c()` function can be used to glue vectors together. (`c` stands for combine) ```{r} c(1:3, 10:12) c("Hansel", "Gretel", "Tedd","Alice") ``` ### Implicit Looping R implicitly loops over all the elements of a vector. Such implicit loops are faster than explicit `for` loops. ```{r} 1:11 (1:11)/2 mean(1:11) z <- ((1:11) - mean(1:11))/sd(1:11) z mean (z) sd(z) ``` ### Random vectors * R has a number of built in random number generators to generate random numbers. * The most commonly used are `runif`, `rnorm` and `sample`. + Sample has a `replace` option to do sampling with or without replacement. * There are also many others, with names that look like `rXXX` (try substituting chisq, t, beta, gamma, &c for XXX). ```{r} runif(5) rnorm(10) sample.int(5,5,replace=TRUE) ``` ### Exercises 1. Generate 100 random numbers with mean 50 and standard deviation 25. ```{r} ``` 1a. Use the result of the previous question to generate a random sample of size 101 with one outlier of 200. ```{r} ``` 2. Generate random integers between 0 and 100 ```{r} ``` 3. The variable `state.area` contains the areas of the 50 US states (in alphabetical area). Create a random sample of size 10 of the state areas. ```{r} ``` ## Three ways of subscripting a vector * The `[]` operator is used to subscript vectors. * There are three different things you can put inside of the brackets: + numbers, - negative numbers (exclude values) + logical expressions + names (character values). ### Numeric Indexes * Numbers are the most straightforward way to do indexing. * R starts the indexes at 1 and it goes up to the length of the vector. + The function `length()` is useful in writing indexes. * Giving multiple indexes with return a sub-vector (remember, there are no scalars in R, just vectors of length 1). ```{r} int10 <- 1:10 int10[3] int10[c(5:7,9)] state.area[c(1,length(state.area))] ``` Another useful trick is to use negative indexes. These leave the numbered variables out. ```{r} int10[-2] int10[-(3:8)] ``` Indexing expressions can also be used on the LHS of assignment operators, to allow to assignment to just certain values. ```{r} int10[3] <- -3 int10 ``` ### Logical Indexes The second option for indexing is to use a logical vector the same length as the vector you are indexing. ```{r} int10<0 int10[int10<0] int10[int10<0] <- abs (int10[int10<0]) int10 ``` Be careful with NAs. ```{r} int55 <- -5:5 sqrt(int55) < 1.2 int55[sqrt(int55) < 1.2] ``` The real power of logical indexes comes when we have two vectors of the same length. For example, `state.abb` gives the two letter postal codes of the states. Suppose we wanted to see all of the states that are bigger than average: ```{r} state.abb[state.area>median(state.area)] ``` ### Aside: `ifelse` and `if` The built in language primitive `if` is **not** vectorized. It is expecting a single value. The code below will not do what you think it will. ```{r} if (int55 < 0) { cat("Negative.\n") } else { cat("Non-negative.\n") } ``` The functions `any()`, `all()` and `isTRUE()` are often useful here. ```{r} if (all(int55 >0)) { cat("Positive.\n") } else { cat("Not all positive.\n") } ``` The function `ifelse()` can be used to loop over if-else expressions. * There are two differences from `if`. + First the condition is a logical vector. + Second, both the if-true and if-false argument are always evaluated, so they better not generate an error! ```{r} ifelse(int55<0, "-","+") ``` ```{r NAs} intA <- 1:10 intA[3] <- -3 A<- sqrt(intA) A mean(A) mean(A,na.rm=TRUE) ``` ### Names and character indexes It would be really convenient if we could access the state data by name. Florida is the `r which(state.abb=="FL")` state alphabetically, but I can't remember that. What we can do is add names to a vector. Then we can select by name. ```{r} names(state.area) <- state.abb head(state.area) head(names(state.area)) state.area["FL"] state.area[c("NY","CA")] ``` Sometimes we need to make up names. The `paste()` command is handy for that. It is vectorized, so you can put a bunch of numbers in. ```{r} paste("Student",1:5,sep="_") ``` ### Exercises 4. Write an expression that removes the outlier from the data you generated for 1b. ```{r} ``` 5. Suppose the data you generated for problem 1 was suppose to have a minimum score of 0 and a maximum score of 100. Fix, the data set so that all of the values are between 0 and 100. ```{r} ``` 6. Fix my positive/negative test, so that it has a 0 as well ```{r} ``` 7. Find all of the states that are bigger than Florida. ```{r} ``` 8. Generate a bunch of random integers between -10 and 10. Then turn all negative integers into NA. ```{r} ``` # Matrixes, Lists and Data Frames ## Matrixes and Arrays * A matrix is an object with rows and columns. * An array can have any number of dimensions. * But they all the entries need to be the same type (mode). * There is a `dim()` attribute which shows the dimensions of the matrix. ```{r} dim(state.x77) head(state.x77) ``` ### Getting and setting dims * The `dim()` function is used to access the number of rows and columns. + `dim()[1]` gets the number of rows + `dim()[2]` gets the number of columns. + For matrixes, the functions `nrow()` and `ncol()` are easier to remember. Setting `dim()` will reshape a vector into a matrix or array. ```{r} nrow(state.x77) ncol(state.x77) int12 <- 1:12 dim(int12) <- c(3,4) int12 ``` ### `matrix()` and `array()` functions * Setting the `dim()` attribute directly is not recommended (makes for hard to read code). * Instead use `matrix()` or `array()` * R stores matrixes in row major order (like FORTRAN, not like c). + Use `byrow=TRUE` to reverse this in `matrix` or `array` ```{r} matrix(1:12,3,4) matrix(1:12,3,4,byrow=TRUE) array(1:24,c(2,3,4)) ``` ### Numeric and logical indexes For matrixes and arrays, the `[]` operator does something a little bit different. In particular, `x[i,j]` picks out row $i$ and column $j$. Either the row or column selector could be * A number or vector of numbers (pick those rows or columns) * A negative number of vector of negative numbers (excluded those rows or columns) * A logical vector of size `nrow(x)` or `ncol(x)` (select the rows/columns corresponding to true). * A character vector (select rows or columns by name, see below). * Left blank, in which case all rows/columns are selected. If a single row or column is selected, then it turns into a vector. ```{r} state.x77[1:5,1:5] state.x77[1:5,] state.x77[9,] dim(state.x77[9,]) head(state.x77[,3]) state.x77[9,,drop=FALSE] dim(state.x77[9,,drop=FALSE]) ``` ### `dimnames` and character indexes To use character indexes with matrixes, we need to set the `rownames()` and `colnames()` of the matrix. We can also use the `dimnames()` (although this will produce a list). ```{r} rownames(state.x77) colnames(state.x77) dimnames(state.x77) rownames(state.x77) <- state.abb head(state.x77) ``` ### Row and column sums and averages Remember that a matrix is just a vector with a `dim` attribute. Consequently, `mean` and other summary functions don't do what we want: ```{r} mean(state.x77) sd(state.x77) var(state.x77) cor(state.x77) ``` Taking row and column sums are such a frequent operation, that there is a shortcut for them: `rowSums()`, `colSums()`, `rowMeans()`, `colMeans()` ```{r} colMeans(state.x77) ``` The `summary` function in the `tidyverse` package is another way to do this. ### Exercises: 9. Find the population for all states whose area is bigger than Florida's. ```{r} ``` 10. Calculate the population density (population per area) for each state ```{r} ``` 11. Turn the state.x77 data into z-scores by subtracting the column means and dividing by the column standard deviations. ```{r} ``` 12. Scale the state.x77 data from 0 (minimum in the column) to 1 (maximum in the column). ```{r} ``` ## Lists * A list in R is a special vector whose elements can be anything, even other lists. * It is possible to build up quite complex objects from lists (Old S3 class system.) Use the `list() constructor to make lists ```{r list} list(1,2:3,"four",quote(2+3)) ``` Notice that the second element is a vector and the last element is an R expression (this is what `quote` does). R lists are quite flexible. ![Dangerous Bend](dangerousBend.png) Notice that the list is show with a double square bracket `[[` instead of a single one `[`. This is because with lists the extraction operators behave a little bit differently. The single bracket refers to a sublist, and the double bracket to the element. Fortunately, this doesn't come up a lot at the beginning because, most people use the `$` extractors instead. ## Named Lists and `$` extraction Named lists have a special role in R. They are similar to environments in that they allow the analyst to associate names and values. If `x` is a list then `x[[name]]` or `x$name` will retrieve (or set if used with `<-`) that element. ```{r listExtractor} alist <- list(one=1, two=2:3, three="three", four=quote(2+2)) alist alist$two alist$two <- 2 alist ``` ## Lists and Classes This ability to associate names and values is very hand. The older S3 (informal) class system just uses lists with appropriate values as classes. To get components, just use the `$` operator. For example, the function `lm()` does a regression and returns an object of class `lm`. The `$` operator can be used to access its components. ```{r classExtractor} fit1 <- lm(dist~speed,data=cars) fit1$coefficients ``` ## Data frame * A data frame is a list that behaves like a matrix. + A data frame is a list of columns with a class of `data.frame`. * Different columns can have different classes or storage modes. + Matrixes and arrays all must be the same kind of value. * Using the single square bracket `[i,j]` can reference row i and column j, like a matrix. * Using the `$` operator can reference columns. ```{r dataframes} ?mtcars names(mtcars) # Get the variable names rownames(mtcars) # Get the car names mtcars[1:5,] # First five rows mtcars["Honda Civic", ] # Just one car mtcars[,"mpg"] # Just MPG variable mtcars$disp # Just DISP variable ``` ## data.frame(), as.matrix and as.data.frame The function `data.frame()` will put a data frame together column by column. (If one of the arguments is a matrix each column in the matrix will become a column in the data frame.) ```{r data.frame} stateX77 <- data.frame(state.x77,region=state.region,row.names=state.abb) stateX77 stateX77$Income ``` The functions `as.data.frame()` and `as.matrix()` can be used to go back and forth between the two different representations. * All matrixes can be converted to data frames, but data frames can only be converted to matrixes if all of the variables are the same type. * There are certain mathematical operators (like taking the inverse) which only work on matrixes. For most of what I do in R, the data frame is the most convenient representation for data. The `tidyverse` package uses the `tibble` instead of the `data.frame`. A `tibble` is a new class for data frames which has slightly more intelligence printing and more consistent subseting behavior. ## read.table and read.csv Most common format for storing data is tab separated value (`.dat`) and comma separated value (`.csv`). * Cases are rows * Variables are separated by tab or comma * Often a header row giving variable names * Sometimes there are row names. * Sometimes quotes are used for strings The functions `read.table()` and `read.csv()` read these data files and produce data frames. * Really the same function with different options. * Many options, look at the help!! ```{r helpReadTable} help(read.table) ``` These functions automatically convert strings to factors. The `as.is` optional argument suppresses that. Often factors, dates and missing values need to be cleaned up after reading in the data. (More about this in the next lesson). **Windows Only**. Usually both `.dat` and `.csv` files are mapped to open in Excel when you double click on them. If the file is open in Excel, then Windows will lock the file and not let another program read it. You may need to close the file in Excel before you can read it into R. The functions `write.table()` and `write.csv()` go in the opposite directions. The `tidyverse` alternative is `read_csv()`. It might be somewhat easier to use, but it produces tibbles instead of data frames. More about this in the next session. ## Foreign interfaces R can read data from an other packages, but you need to load the `foreign` package first. * `library(foreign)` (Part of the base R distribution) + `read.spss` (SPSS) + `read.dta` (Stata) + `read.ssd` (SAS) + `read.systat` (Systat) Excel workbooks are another common format. The easiest way to work with Excel data is to save it in `.csv` format from Excel. You could also try the `xlsx` package (need to install it first). * `library(xlsx)` (Need to install from CRAN) + `read.xlsx` (Excel) The book [_R for Data Science_](https://r4ds.had.co.nz/index.html) (Grolemund and Wickham, 2017) recommends the `haven` and `readxl` packages. Also, the `DBI` package allows importing data directly from databases (an advanced R trick). ### Exercises Use the function `write.csv()` to write out the `stateX77` data we made. Read it into Excel (or another spreadsheet) make some changes. Now read the modified version back into R. ```{r} ``` # Next Lesson You are now read for [Exploratory Data Analysis with Tidyverse and GGplot](EDAwithGGPlot.Rmd).