---
title: "Arrays, Matrixes and Data Frames"
output: html_notebook
author: Russell Almond
---
# R Objects containing data
* Vector -- ordered collection of objects of the same storage `mode` (`[` extract)
+ Named Vector -- adds a `names` attribute (Can use names in subscripts)
+ Matrix, Array -- adds a `dim` and `dimnames` attribute
* List -- ordered collection of objects of any type or mode (`[[` extract)
+ Named List -- add `names` attribute (Can use `$` to extract elements)
+ S3 Class -- adds a `class` attribute
+ data.frame -- a list of columns in a spreadsheet. Uses (`[` extract).
* S4 Class -- formal class mechanism. Uses `@` instead of `$`.
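As a rough illustration of the extraction operators listed above, the chunk below builds a small made-up vector and list (the names `v` and `lst` are chosen only for this sketch) and shows how `[`, `[[` and `$` differ.

```{r}
v <- c(a = 1, b = 2, c = 3)         # named numeric vector
v["b"]                              # `[` returns a (named) vector of length 1
lst <- list(id = 7L, label = "x")   # named list holding mixed types
lst["id"]                           # `[` on a list returns a sub-list
lst[["id"]]                         # `[[` returns the element itself
lst$label                           # `$` extracts an element by name
```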
### Storage modes.
The `mode()` function in R refers to storage modes, not the mode of a distribution.
```{r}
mode(123)
mode(123L)
mode(TRUE)
mode("True")
mode(3.14)
mode(t)
?mode
```
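A related point that may be worth seeing side by side: `mode()` gives a coarser answer than `typeof()` and `storage.mode()`. The short sketch below compares them on the same kinds of values used above.

```{r}
typeof(123)          # "double"
typeof(123L)         # "integer"
mode(123L)           # "numeric": mode() lumps integer and double together
storage.mode(123L)   # "integer"
typeof(t)            # "closure", while mode(t) is "function"
```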
The `is.XXX` functions test whether an object has a given type, and the `as.XXX` functions convert between types.
```{r}
is.integer(3)
is.integer(3L)
as.integer(3)
is.integer(as.integer(3))
as.integer("three")
as.character(3)
as.logical(3)
```
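A few more conversions, as a sketch of the edge cases that tend to surprise people. The inputs are arbitrary values picked for illustration.

```{r}
as.integer(3.9)      # 3: truncates toward zero rather than rounding
as.integer(-3.9)     # -3
as.numeric("3.5")    # 3.5
as.logical(0)        # FALSE: zero is FALSE, any other number is TRUE
as.logical("T")      # TRUE: spellings such as "T", "TRUE" and "true" are recognized
# A failed conversion gives NA (with a warning, suppressed here).
is.na(suppressWarnings(as.integer("three")))
```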
## Implicit Looping in Vectors
R has no separate scalar type: scalars in R are vectors of length 1.
```{r}
cat("The output will start with '[1]' to show that this is a vector.\n")
3.14159
```
R implicitly loops over all the elements of a vector. Such implicit loops are faster than explicit `for` loops.
```{r}
1:11
mean(1:11)
y <- (1:11 - mean(1:11))/sd(1:11)
mean (y)
sd(y)
```
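To make the implicit loop concrete, here is a sketch that computes the same z-scores twice, once with vectorized arithmetic and once with an explicit `for` loop. The variable names `z_vec` and `z_loop` are made up for this comparison.

```{r}
x <- 1:11
# Vectorized version: the arithmetic is applied to every element at once.
z_vec <- (x - mean(x)) / sd(x)
# Explicit loop version of the same calculation.
z_loop <- numeric(length(x))
for (i in seq_along(x)) {
  z_loop[i] <- (x[i] - mean(x)) / sd(x)
}
all.equal(z_vec, z_loop)   # TRUE
```

The loop recomputes `mean(x)` and `sd(x)` on every pass, which is one reason the vectorized form is usually both shorter and faster.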
### Making vectors
The `:` operator produces a sequence (of integers) running from its first argument to its second.
```{r}
1:3
3:1
-1:1
-3:-1
```
The `c()` function can be used to glue vectors together. (`c` stands for combine)
```{r}
c(1:3, 10:12)
c("Haenzel", "Greatel", "Tedd","Alice")
```
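Because every element of a vector must share one storage mode, `c()` converts mixed inputs to a common type. A minimal sketch with made-up values:

```{r}
c(1, TRUE, FALSE)          # logicals become numbers: 1 1 0
c(1, "two", TRUE)          # everything becomes character: "1" "two" "TRUE"
c(first = 1, second = 2)   # c() can also attach names
```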
### Random vectors
R has a number of built-in random number generators. The most commonly used are `runif`, `rnorm` and `sample`. There are also many others, with names that look like `rXXX` (try substituting chisq, t, beta, gamma, etc. for XXX).
```{r}
runif(5)
rnorm(10)
sample.int(5,5,replace=TRUE)
```
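Random draws change on every run. If reproducible results are wanted, one common approach is to call `set.seed()` first. The seed value 1234 below is arbitrary.

```{r}
set.seed(1234)             # arbitrary seed chosen for illustration
first <- runif(3)
set.seed(1234)             # resetting the seed restarts the same stream
second <- runif(3)
identical(first, second)   # TRUE
```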
### Exercises
1. Generate 100 random numbers with mean 50 and standard deviation 25.
```{r}
```
1a. Use the result of the previous question to generate a random sample of size 101 with one outlier of 200.
```{r}
```
2. Generate random integers between 0 and 100.
```{r}
```
3. The variable `state.area` contains the areas of the 50 US states (in alphabetical order). Create a random sample of size 10 of the state areas.
```{r}
```
## Three ways of subscripting an array
The `[]` operator is used to subscript vectors. There are three different things you can put inside of the brackets: numbers, logical expressions and names (character values).
### Numeric Indexes
Numbers are the most straightforward way to do indexing. R starts the
indexes at 1 and they go up to the length of the vector. The function
`length()` is useful in writing indexes. Giving multiple indexes
will return a sub-vector (remember, there are no scalars in R, just
vectors of length 1).
```{r}
int10 <- 1:10
int10[3]
int10[c(5:7,9)]
state.area[c(1,length(state.area))]
```
Another useful trick is to use negative indexes. These leave the numbered elements out.
```{r}
int10[-2]
int10[-(3:8)]
```
Indexing expressions can also be used on the left-hand side of an assignment, so that only the selected elements are changed.
```{r}
int10[3] <- -3
int10
```
### Logical Indexes
The second option for indexing is to use a logical vector the same length as the vector you are indexing.
```{r}
int10<0
int10[int10<0]
int10[int10<0] <- abs (int10[int10<0])
int10
```
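Be careful with NAs. The square root of a negative number is `NaN`, so the comparison yields `NA`, and each `NA` in a logical index produces an `NA` in the result.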
```{r}
int55 <- -5:5
sqrt(int55) < 1.2
int55[sqrt(int55) < 1.2]
```
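One possible way around the NA problem is to wrap the logical vector in `which()`, which keeps only the positions that are `TRUE`. The sketch below uses a shorter made-up vector, `vals`, and suppresses the warning that `sqrt()` gives for negative input.

```{r}
vals <- -2:2
test <- suppressWarnings(sqrt(vals) < 1.2)   # NA for the negative values
vals[test]                  # NA NA 0 1: each NA in the index yields an NA
vals[which(test)]           # 0 1: which() keeps only the TRUE positions
vals[!is.na(test) & test]   # 0 1: the same idea, spelled out
```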
The real power of logical indexes comes when we have two vectors of the same length. For example, `state.abb` gives the two-letter postal codes of the states. Suppose we wanted to see all of the states whose area is above the median:
```{r}
state.abb[state.area>median(state.area)]
```
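Logical tests can also be combined with `&` (and) and `|` (or) before indexing. The vectors `people` and `ages` below are invented purely for this sketch.

```{r}
people <- c("Ann", "Bob", "Cy", "Di", "Ed")
ages   <- c(23, 35, 41, 19, 52)
people[ages > 20 & ages < 50]   # "Ann" "Bob" "Cy"
people[ages < 20 | ages > 50]   # "Di" "Ed"
sum(ages > 30)                  # 3: TRUE counts as 1 when summed
```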
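### Aside: `ifelse` and `if`
The built-in language primitive `if` is **not** vectorized. It expects a single logical value. The code below will not do what you might expect: given a longer vector, R 4.2.0 and later signal an error, while earlier versions used only the first element and issued a warning.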
```{r}
if (int55 < 0) {
cat("Negative.\n")
} else {
cat("Non-negative.\n")
}
```
The functions `any()`, `all()` and `isTRUE()` are often useful here.
```{r}
if (all(int55 >0)) {
cat("Positive.\n")
} else {
cat("Not all positive.\n")
}
```
The function `ifelse()` applies an if-else test to each element of a vector. There are two differences from `if`. First, the condition is a logical vector. Second, the if-true and if-false arguments are each evaluated for the whole vector, not just for the elements where they are used, so they had better not generate an error!
```{r}
ifelse(int55<0, "-","+")
```
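The whole-vector evaluation can be seen with `sqrt()`. In this sketch (the vector `x` is made up), the naive version computes `sqrt(x)` for every element, including the negative one, and so warns about NaNs. One possible workaround is to make the argument safe before it is evaluated, here with `pmax()`.

```{r}
x <- c(-4, 0, 9)
# sqrt(x) is computed for all of x, so this warns even though
# the NaN for -4 is never used in the result.
naive <- suppressWarnings(ifelse(x >= 0, sqrt(x), NA))
# pmax(x, 0) replaces negative values by 0 first, so no warning arises.
safe <- ifelse(x >= 0, sqrt(pmax(x, 0)), NA)
safe                     # NA 0 3
identical(naive, safe)   # TRUE
```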
### Names and character indexes
It would be really convenient if we could access the state data by name. Florida is state number `r which(state.abb=="FL")` alphabetically, but I can't remember that.
What we can do is add names to a vector. Then we can select by name.
```{r}
names(state.area) <- state.abb
head(state.area)
head(names(state.area))
state.area["FL"]
state.area[c("NY","CA")]
```
Sometimes we need to make up names. The `paste()` command is handy for that. It is vectorized, so you can put a bunch of numbers in.
```{r}
paste("Student",1:5,sep="_")
```
### Exercises
4. Write an expression that removes the outlier from the data you generated for 1a.
```{r}
```
5. Suppose the data you generated for problem 1 were supposed to have a minimum score of 0 and a maximum score of 100. Fix the data set so that all of the values are between 0 and 100.
```{r}
```
6. Fix my positive/negative test so that it handles 0 as well.
```{r}
```
7. Find all of the states that are bigger than Florida.
```{r}
```
8. Generate a bunch of random integers between -10 and 10. Then turn all negative integers into NA.
```{r}
```
# Matrixes and Arrays
## Matrixes and Arrays are vectors with a `dim` attribute
* A matrix is an object with rows and columns.
* An array can have any number of dimensions.
* But all the entries need to be the same type (mode).
* There is a `dim` attribute, read with `dim()`, which gives the dimensions of the matrix.
```{r}
dim(state.x77)
head(state.x77)
```
### getting and setting dims
The `dim()` function is used to access the number of rows and columns. `dim(x)[1]` gets the number of rows and `dim(x)[2]` the number of columns.
For matrixes, the functions `nrow()` and `ncol()` are easier to use.
Setting `dim()` will reset a vector into a matrix or array.
```{r}
nrow(state.x77)
ncol(state.x77)
int12 <- 1:12
dim(int12) <- c(3,4)
int12
```
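The sketch below checks that setting `dim` changes nothing about the underlying vector: the only new attribute is `dim`, and each element can still be reached by its position in the vector. The name `m` is just for illustration.

```{r}
m <- 1:12
dim(m) <- c(3, 4)
attributes(m)   # only a `dim` attribute has been added
length(m)       # 12: still a vector of 12 elements
m[2, 3]         # 8
m[8]            # 8: the same element via its position in the vector
```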
### `matrix()` and `array()` functions
* Setting the `dim()` attribute directly is not recommended (makes for hard to read code).
* Instead use `matrix()` or `array()`
* R stores matrixes in column major order (like FORTRAN, not like C).
+ Use `byrow=TRUE` in `matrix()` to fill by rows instead (`array()` has no `byrow` argument).
```{r}
matrix(1:12,3,4)
matrix(1:12,3,4,byrow=TRUE)
array(1:24,c(2,3,4))
```
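A small sketch of the fill order, using made-up dimensions. Values run down the columns by default, and in an array the first index varies fastest.

```{r}
matrix(1:6, 2, 3)                 # filled down the columns: first row is 1 3 5
matrix(1:6, 2, 3, byrow = TRUE)   # filled across the rows: first row is 1 2 3
# Filling by row gives the transpose of filling the swapped shape by column.
identical(t(matrix(1:6, 3, 2)), matrix(1:6, 2, 3, byrow = TRUE))   # TRUE
arr <- array(1:24, c(2, 3, 4))
c(arr[2, 1, 1], arr[1, 2, 1], arr[1, 1, 2])   # 2 3 7
```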
## numeric and logical indexes
For matrixes and arrays, the `[]` operator does something a little bit different. In particular, `x[i,j]` picks out row $i$ and column $j$. Either the row or the column selector can be:
* A number or vector of numbers (pick those rows or columns)
* A negative number or vector of negative numbers (exclude those rows or columns)
* A logical vector of size `nrow(x)` or `ncol(x)` (select the rows/columns corresponding to true).
* A character vector (select rows or columns by name, see below).
* Left blank, in which case all rows/columns are selected.
If a single row or column is selected, the result is turned into a plain vector unless `drop=FALSE` is given.
```{r}
state.x77[1:5,1:5]
state.x77[1:5,]
state.x77[9,]
dim(state.x77[9,])
head(state.x77[,3])
state.x77[9,,drop=FALSE]
dim(state.x77[9,,drop=FALSE])
```
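The same selectors on a small made-up matrix, where the results are easy to check by eye:

```{r}
m <- matrix(1:12, 3, 4)
m[m[, 1] > 1, ]             # logical row index: keeps rows 2 and 3
m[, -1]                     # negative index: all columns except the first
m[2, ]                      # a single row drops to a plain vector
dim(m[2, ])                 # NULL
dim(m[2, , drop = FALSE])   # 1 4
```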
## `dimnames` and character indexes
To use character indexes with matrixes, we need to set the `rownames()` and `colnames()` of the matrix.
We can also use `dimnames()`, which returns a list holding the row names and the column names.
```{r}
rownames(state.x77)
colnames(state.x77)
dimnames(state.x77)
rownames(state.x77) <- state.abb
head(state.x77)
```
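Names can also be supplied when the matrix is created, through the `dimnames` argument of `matrix()`. A minimal sketch with invented row and column names:

```{r}
m <- matrix(1:6, 2, 3,
            dimnames = list(c("a", "b"), c("x", "y", "z")))
m["b", "y"]           # 4
m["a", c("x", "z")]   # named vector with x = 1 and z = 5
dimnames(m)[[2]]      # the column names, same as colnames(m)
```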
## Row and column sums and averages
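Remember that a matrix is just a vector with a `dim` attribute. Consequently, `mean` and `sd` treat the whole matrix as one long vector, which is usually not what we want. (`var` and `cor` do work column by column and return covariance and correlation matrixes.)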
```{r}
mean(state.x77)
sd(state.x77)
var(state.x77)
cor(state.x77)
```
Taking row and column sums is such a frequent operation that there are shortcuts for it: `rowSums()`, `colSums()`, `rowMeans()`, `colMeans()`
```{r}
colMeans(state.x77)
```
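These shortcut functions accept `na.rm=TRUE` to skip missing values. The tiny matrix below is made up to show the difference.

```{r}
m <- matrix(c(1, NA, 3, 4), 2, 2)
colMeans(m)                 # NA 3.5: one NA spoils the first column
colMeans(m, na.rm = TRUE)   # 1 3.5
rowSums(m, na.rm = TRUE)    # 4 4
```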
## Apply and Sweep
The `apply()` function can turn any summary function into a row or column function.
```{r}
help(apply)
```
The `MARGIN` argument to `apply` should be 1 for rows, 2 for columns, and so forth for general arrays.
```{r}
int12
apply(int12,1,max)
apply(int12,2,max)
```
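`apply()` is not limited to built-in summaries. As a sketch, the chunk below passes an anonymous function and a function that returns more than one value, and checks `apply()` against `colMeans()`. The matrix `m` is a fresh copy of the 3 by 4 example.

```{r}
m <- matrix(1:12, 3, 4)
apply(m, 1, function(r) sum(r > 5))          # 2 2 3: entries above 5 in each row
apply(m, 2, range)                           # 2 x 4 matrix: min and max of each column
all.equal(apply(m, 2, mean), colMeans(m))    # TRUE
```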
The `sweep` function "subtracts" a vector from all of the rows or columns of the matrix.
"Subtracts" is in quotes because actually any operator can be used here. Subtraction (`"-"`) and division (`"/"`) are the most common.
```{r}
help(sweep)
row.min <- apply(int12,1,min)
sweep(int12,1,row.min,"/")
col.min <- apply(int12,2,min)
sweep(int12,2,col.min,"-")
```
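One more use of `sweep()`, sketched on a made-up table of counts: dividing each row by its row total turns the counts into row proportions.

```{r}
counts <- matrix(c(2, 6, 3, 1, 5, 3), 2, 3)   # rows are 2 3 5 and 6 1 3
props <- sweep(counts, 1, rowSums(counts), "/")
props             # first row is 0.2 0.3 0.5
rowSums(props)    # 1 1: each row now sums to one
```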
## Exercises:
9. Find the population for all states whose area is bigger than Florida's.
```{r}
```
10. Calculate the population density (population per area) for each state.
```{r}
```
11. Turn the state.x77 data into z-scores by subtracting the column means and dividing by the column standard deviations.
```{r}
```
12. Scale the state.x77 data from 0 (minimum in the column) to 1 (maximum in the column).
```{r}
```
# Lists
## Single `[` and double `[[` extraction
## Named Lists and `$` extraction
## `lapply` and `sapply` for looping through lists
## Classes as list with special behavior
### Generic functions and methods
### 'factor' and 'ordered' classes
#### S4 classes vs S3 classes
# Data frame
A data frame is a list that behaves like a matrix.
Different columns can have different classes.
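Before R 4.0.0, `data.frame()` coerced character columns to factors by default; since R 4.0.0 the default is `stringsAsFactors = FALSE`, so character columns stay character.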
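A minimal sketch of the dual nature described above, using an invented two-row data frame. `stringsAsFactors` is set explicitly here so the result is the same in old and new versions of R.

```{r}
df <- data.frame(name = c("Ann", "Bob"),
                 score = c(91.5, 78),
                 stringsAsFactors = FALSE)   # explicit, for consistency across R versions
is.list(df)         # TRUE: a data frame is a list of columns
dim(df)             # 2 2: but it also has rows and columns
sapply(df, class)   # name is "character", score is "numeric"
df[2, "score"]      # 78: matrix-style indexing
df$score            # list-style extraction of a whole column
```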
## data.frame() and read.table()
## Matrix-like behavior -- Using `[` subscripts
### apply, rownames, colnames and colSums and rowSums
## List-like behavior -- Using `[[` and `$` subscripts
## names, lapply and sapply
## as.matrix and as.data.frame