---
title: "Arrays, Matrixes and Data Frames"
output: html_notebook
author:  Russell Almond
---
# R Objects containing data

* Vector -- ordered collection of objects of the same storage `mode` (`[` extract)
  + Named Vector -- adds a `names` attribute (Can use names in subscripts)
  + Matrix, Array -- adds a `dim` and `dimnames` attribute
* List -- ordered collection of objects of any type or mode (`[[` extract)
  + Named List -- adds a `names` attribute (Can use `$` to extract elements)
  + S3 Class -- adds a `class` attribute
  + data.frame -- a list of columns in a spreadsheet.  Uses (`[` extract).
* S4 Class -- formal class mechanism.  Uses `@` instead of `$`.
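
As a quick illustration of the outline above, the sketch below builds a vector, a named vector and a named list from made-up values and shows which extraction operator applies to each.  The names `scores` and `student` are hypothetical, chosen only for this example.

```{r}
# A plain vector: every element has the same storage mode.
scores <- c(90, 85, 77)
scores[2]                      # `[` extracts a sub-vector

# A named vector: the same data plus a `names` attribute.
names(scores) <- c("Ann", "Bob", "Cy")
scores["Bob"]
attributes(scores)

# A list: elements may differ in type and length.
student <- list(name = "Ann", scores = scores, enrolled = TRUE)
student[["scores"]]            # `[[` extracts a single element
student$name                   # `$` works because the list is named
class(student["name"])         # `[` on a list returns a (shorter) list
```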

### Storage modes.

The `mode()` function in R refers to storage modes, not the mode of a distribution.


```{r}

mode(123)
mode(123L)
mode(TRUE)
mode("True")
mode(3.14)
mode(t)

?mode

```
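
Note that `mode(123)` and `mode(123L)` both report `"numeric"`.  When the distinction between integers and doubles matters, `typeof()` and `storage.mode()` give a finer-grained answer.  A short comparison:

```{r}
# mode() lumps integers and doubles together as "numeric";
# typeof() reports the internal storage type.
mode(123)
typeof(123)
mode(123L)
typeof(123L)
storage.mode(123L)
typeof(1:3 / 2)    # dividing integers produces doubles
```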

The `is.XXX` functions test whether an object has a given type, and the `as.XXX` functions convert between types.

```{r}
is.integer(3)
is.integer(3L)
as.integer(3)
is.integer(as.integer(3))
as.integer("three")
as.character(3)
as.logical(3)
```


## Implicit Looping in Vectors

R has no separate scalar type:  scalars in R are vectors of length 1.
```{r}
cat("The output will start with '[1]' to show that this is a vector.\n")
3.14159  
```

R implicitly loops over all the elements of a vector.  Such implicit loops are faster than explicit `for` loops.
```{r}
1:11
mean(1:11)
y <- (1:11 - mean(1:11))/sd(1:11)
mean (y)
sd(y)

```
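
To make the implicit loop concrete, here is a sketch that computes the same z-scores twice, once with an explicit `for` loop and once with the vectorized expression.  The variable names `x`, `z_loop` and `z_vec` are introduced only for this comparison.

```{r}
x <- 1:11
# Explicit loop: fill the result one element at a time.
z_loop <- numeric(length(x))
for (i in seq_along(x)) {
  z_loop[i] <- (x[i] - mean(x)) / sd(x)
}
# Implicit loop: one expression operates on the whole vector.
z_vec <- (x - mean(x)) / sd(x)
all.equal(z_loop, z_vec)
```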

### Making vectors

The `:` operator produces sequences (of integers) between first and second argument.

```{r}
1:3
3:1
-1:1
-3:-1
```
The `c()` function can be used to glue vectors together.  (`c` stands for combine)
```{r}
c(1:3, 10:12)
c("Haenzel", "Greatel", "Tedd","Alice")
```
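
Because a vector can hold only one storage mode, `c()` silently converts its arguments to the most general mode among them.  A small sketch of that coercion:

```{r}
c(1L, 2.5)          # integer and double -> double
c(1, TRUE, FALSE)   # logical -> numeric (TRUE becomes 1)
c(1, "two", TRUE)   # anything mixed with character -> character
mode(c(1, "two", TRUE))
```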

### Random vectors

R has a number of built-in random number generators.  The most commonly used are `runif`, `rnorm` and `sample`.  There are also many others, with names that look like `rXXX` (try substituting chisq, t, beta, gamma, etc. for XXX).

```{r}

runif(5)
rnorm(10)
sample.int(5,5,replace=TRUE)

```
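
Each call to these functions gives different values.  When a result needs to be reproducible, `set.seed()` fixes the starting point of the generator.  The seed value 42 below is arbitrary, and the coin-flip vector is only an illustration of `sample()` applied to a non-numeric vector.

```{r}
set.seed(42)        # arbitrary seed; any integer works
first <- runif(3)
set.seed(42)        # resetting the seed replays the same stream
second <- runif(3)
identical(first, second)

# sample() draws from any vector, not just integers.
sample(c("heads", "tails"), 6, replace = TRUE)
```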

### Exercises

1. Generate 100 random numbers with mean 50 and standard deviation 25.

```{r}

```

1a.  Use the result of the previous question to generate a random sample of size 101 with one outlier of 200.

```{r}

```

2. Generate random integers between 0 and 100

```{r}

```
3. The variable `state.area` contains the areas of the 50 US states (in alphabetical order).  Create a random sample of size 10 of the state areas.

```{r}

```


## Three ways of subscripting an array

The `[]` operator is used to subscript vectors.  There are three different things you can put inside of the brackets:  numbers, logical expressions and names (character values).

### Numeric Indexes

Numbers are the most straightforward way to do indexing.  R starts the
indexes at 1 and they go up to the length of the vector.  The function
`length()` is useful in writing indexes.  Giving multiple indexes
will return a sub-vector (remember, there are no scalars in R, just
vectors of length 1). 

```{r}

int10 <- 1:10
int10[3]
int10[c(5:7,9)]
state.area[c(1,length(state.area))]
```
Another useful trick is to use negative indexes.  These leave the numbered elements out.

```{r}
int10[-2]
int10[-(3:8)]
```
Indexing expressions can also be used on the left-hand side of assignment operators, to assign to just certain elements.
```{r}
int10[3] <- -3
int10
```

### Logical Indexes

The second option for indexing is to use a logical vector the same length as the vector you are indexing.  

```{r}
int10<0
int10[int10<0]
int10[int10<0] <- abs (int10[int10<0])
int10
```
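Be careful with NAs.  A comparison involving `NA` or `NaN` yields `NA`, and each `NA` in a logical index produces an `NA` element in the result.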
```{r}
int55 <- -5:5
sqrt(int55) < 1.2
int55[sqrt(int55) < 1.2]
```
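
In the example above, the square roots of the negative numbers are `NaN`, so the comparison is `NA` for those positions.  Two common ways around this, sketched here on a fresh vector (the names `v` and `small_root` are just for illustration), are `which()`, which keeps only the positions that are `TRUE`, and an explicit `!is.na()` test.

```{r}
v <- -5:5
small_root <- suppressWarnings(sqrt(v)) < 1.2   # NA for the negative entries
v[small_root]                           # NAs carried into the result
v[which(small_root)]                    # which() drops NA and FALSE
v[!is.na(small_root) & small_root]      # same result, spelled out
```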
The real power of logical indexes comes when we have two vectors of the same length.  For example, `state.abb` gives the two letter postal codes of the states.  Suppose we wanted to see all of the states that are bigger than the median:

```{r}
state.abb[state.area>median(state.area)]
```

### Aside:  `ifelse` and `if`

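The built-in language primitive `if` is **not** vectorized.  It expects a single logical value.  The code below will not do what you might expect:  since R 4.2.0 a condition of length greater than one is an error, and earlier versions used only the first element and issued a warning.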
```{r}
if (int55 < 0) {
  cat("Negative.\n")
} else { 
  cat("Non-negative.\n")
}
```
The functions `any()`, `all()` and `isTRUE()` are often useful here.
```{r}
if (all(int55 >0)) {
  cat("Positive.\n")
} else { 
  cat("Not all positive.\n")
}
```
The function `ifelse()` can be used to loop over if-else expressions.  There are two differences.  First, the condition is a logical vector.  Second, the if-true argument is evaluated on the whole vector whenever any element of the condition is true (and likewise the if-false argument whenever any element is false), so both had better not generate an error!

```{r}
ifelse(int55<0, "-","+")
```
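
Because `ifelse()` computes its second and third arguments on whole vectors before picking elements, a branch can still emit warnings for elements it is never used for.  A sketch, using a hypothetical vector `w`:

```{r}
w <- c(-4, 0, 9)
# sqrt(w) is computed for every element, including -4, so R warns
# about NaNs even though the NaN itself is not selected.
ifelse(w >= 0, sqrt(w), NA)
# Making the branch safe for all inputs avoids the warning.
safe_root <- ifelse(w >= 0, sqrt(pmax(w, 0)), NA)
safe_root
```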

### Names and character indexes

It would be really convenient if we could access the state data by name.  Florida is state number `r which(state.abb=="FL")` alphabetically, but I can't remember that.

What we can do is add names to a vector.  Then we can select by name.

```{r}
names(state.area) <- state.abb
head(state.area)
head(names(state.area))
state.area["FL"]
state.area[c("NY","CA")]
```
     
Sometimes we need to make up names.  The `paste()` command is handy for that.  It is vectorized, so you can put a bunch of numbers in.
```{r}
paste("Student",1:5,sep="_")
```

### Exercises

4.  Write an expression that removes the outlier from the data you generated for 1a.

```{r}

```

5.  Suppose the data you generated for problem 1 was supposed to have a minimum score of 0 and a maximum score of 100.  Fix the data set so that all of the values are between 0 and 100.

```{r}

```

6. Fix my positive/negative test so that it handles 0 as well.
```{r}

```
7. Find all of the states that are bigger than Florida.
```{r}

```
8. Generate a bunch of random integers between -10 and 10.  Then turn all negative integers into NA.
```{r}

```

# Matrixes and Arrays

## Matrixes and Arrays are vectors with a `dim` attribute

* A matrix is an object with rows and columns.  

* An array can have any number of dimensions.

* But all the entries need to be the same type (mode).  

* The `dim` attribute, read with `dim()`, gives the dimensions of the matrix.

```{r}
dim(state.x77)
head(state.x77)
```


### getting and setting dims

The `dim()` function is used to access the number of rows and columns.  For a matrix `x`, `dim(x)[1]` gets the number of rows and `dim(x)[2]` the number of columns.

For matrixes, the functions `nrow()` and `ncol()` are easier to use.

Setting `dim()` will reset a vector into a matrix or array.
```{r}
nrow(state.x77)
ncol(state.x77)
int12 <- 1:12
dim(int12) <- c(3,4)
int12
```
### `matrix()` and `array()` functions

* Setting the `dim()` attribute directly is not recommended (makes for hard to read code).

* Instead use `matrix()` or `array()`

* R stores matrixes in column major order (like FORTRAN, not like C).  
    + Use `byrow=TRUE` in `matrix()` to fill by rows instead (`array()` has no `byrow` argument).
    
```{r}
matrix(1:12,3,4)
matrix(1:12,3,4,byrow=TRUE)
array(1:24,c(2,3,4))
```
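
The storage order can be checked directly: `as.vector()` strips the `dim` attribute and returns the elements in the order R keeps them.  A small sketch (the names `m_col` and `m_row` are made up for this example):

```{r}
m_col <- matrix(1:6, nrow = 2, ncol = 3)                 # filled down columns
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)   # filled across rows
m_col
m_row
# as.vector() exposes the underlying storage order (column by column).
as.vector(m_col)
as.vector(m_row)
# Filling by row is the same as filling a 3 x 2 matrix by column and transposing.
identical(m_row, t(matrix(1:6, nrow = 3, ncol = 2)))
```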


## numeric and logical indexes

For matrixes and arrays, the `[]` operator does something a little bit different.  In particular, `x[i,j]` picks out row $i$ and column $j$.  Either the row or column selector can be:

* A number or vector of numbers (pick those rows or columns)
* A negative number or vector of negative numbers (exclude those rows or columns)
* A logical vector of size `nrow(x)` or `ncol(x)` (select the rows/columns corresponding to true).
* A character vector (select rows or columns by name, see below).
* Left blank, in which case all rows/columns are selected.

If a single row or column is selected, then the result turns into a vector, unless `drop=FALSE` is given.

```{r}
state.x77[1:5,1:5]
state.x77[1:5,]
state.x77[9,]
dim(state.x77[9,])
head(state.x77[,3])
state.x77[9,,drop=FALSE]
dim(state.x77[9,,drop=FALSE])
```
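
The negative and logical selectors from the list above work the same way on a matrix.  A sketch on a small made-up matrix `m` (a hypothetical name, used so the output is easy to check by eye):

```{r}
m <- matrix(1:12, nrow = 3, ncol = 4)
m[-1, ]                            # drop the first row
m[, c(TRUE, FALSE, TRUE, FALSE)]   # keep columns 1 and 3
m[m[, 1] > 1, ]                    # rows whose first-column entry exceeds 1
m[2, ]                             # a single row collapses to a vector
m[2, , drop = FALSE]               # drop = FALSE keeps it a 1 x 4 matrix
```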

## `dimnames` and character indexes

To use character indexes with matrixes, the matrix needs row and column names, which can be read or set with `rownames()` and `colnames()`.  
`dimnames()` returns both at once (as a list).

```{r}
rownames(state.x77)
colnames(state.x77)
dimnames(state.x77)
rownames(state.x77) <- state.abb
head(state.x77)
```
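
With row and column names in place, either subscript can be a character vector.  The sketch below works on a copy (the name `sx` is hypothetical) taken from the `datasets` package, so it does not depend on whether the row names of `state.x77` have already been changed.

```{r}
sx <- datasets::state.x77        # pristine copy from the datasets package
rownames(sx) <- state.abb        # two-letter codes as row names
sx["FL", "Population"]
sx[c("NY", "CA"), c("Population", "Area")]
sx["FL", , drop = FALSE]         # one row, kept as a matrix
```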
## Row and column sums and averages

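Remember that a matrix is just a vector with a `dim` attribute.  Consequently, `mean` and `sd` treat the whole matrix as one long vector, which is not what we want.  (In contrast, `var` and `cor` treat the columns as variables and return covariance and correlation matrixes.)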

```{r}
mean(state.x77)
sd(state.x77)
var(state.x77)
cor(state.x77)
```

Taking row and column sums is such a frequent operation that there are shortcuts for it:  `rowSums()`, `colSums()`, `rowMeans()`, `colMeans()`

```{r}
colMeans(state.x77)
```

## Apply and Sweep

The `apply()` operator can turn any summary function into a row or column function.  
```{r}
help(apply)
```
The MARGIN argument to apply should be 1 for rows, 2 for columns and so forth for general arrays.  

```{r}
int12
apply(int12,1,max)
apply(int12,2,max)
```
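
As a sanity check on what `MARGIN` means, `apply()` with `sum` or `mean` should reproduce the dedicated shortcut functions.  A sketch using a made-up matrix `a`:

```{r}
a <- matrix(1:12, nrow = 3, ncol = 4)
# apply() with sum/mean reproduces the dedicated shortcuts.
all.equal(apply(a, 1, sum), rowSums(a))
all.equal(apply(a, 2, mean), colMeans(a))
# When the function returns several values, apply() binds the
# results as columns: one column per row or column of the input.
apply(a, 2, range)
```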

The `sweep` operator "subtracts" a vector from all of the rows or columns of the matrix.

"Subtracts" is in quotes because actually any operator can be used here.  Subtracts "-" and divides "/" are the most common.
```{r} 
help(sweep)
row.min <- apply(int12,1,min)
sweep(int12,1,row.min,"/")
col.min <- apply(int12,2,min)
sweep(int12,2,col.min,"-")
```
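
Another common use of `sweep()` is turning counts into row proportions by dividing each row by its own total.  The matrix `counts` below is made-up data for illustration.

```{r}
counts <- matrix(c(2, 3, 5,
                   1, 1, 2), nrow = 2, byrow = TRUE)
# Divide every row by its own total (MARGIN = 1 matches the
# statistic vector to rows).
props <- sweep(counts, 1, rowSums(counts), "/")
props
rowSums(props)    # each row now sums to 1
```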

## Exercises:

9. Find the population for all states whose area is bigger than Florida's.

```{r}

```

10. Calculate the population density (population per area) for each state

```{r}

```

11. Turn the state.x77 data into z-scores by subtracting the column means and dividing by the column standard deviations.

```{r}

```
12. Scale the state.x77 data from 0 (minimum in the column) to 1 (maximum in the column).
```{r}

```

# Lists

## Single `[` and double `[[` extraction

## Named Lists and `$` extraction

## `lapply` and `sapply` for looping through lists

## Classes as lists with special behavior

### Generic functions and methods

### 'factor' and 'ordered' classes

#### S4 classes vs S3 classes

# Data frame

A data frame is a list that behaves like a matrix.

Different columns can have different classes.

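Before R 4.0.0, `data.frame()` coerced character columns to factors by default; since R 4.0.0 they stay character unless `stringsAsFactors = TRUE` is given.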
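
A minimal sketch of this dual behavior, using a made-up data frame called `roster` (the data and names are hypothetical):

```{r}
roster <- data.frame(name  = c("Ann", "Bob", "Cy"),
                     score = c(90, 85, 77),
                     pass  = c(TRUE, TRUE, FALSE))
sapply(roster, class)        # each column keeps its own class
roster[2, "score"]           # matrix-like: row and column subscripts
roster$score                 # list-like: extract a column by name
roster[["pass"]]
dim(roster)

# Character columns become factors when stringsAsFactors = TRUE.
roster_f <- data.frame(name = c("Ann", "Bob"), stringsAsFactors = TRUE)
class(roster_f$name)
```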

## data.frame() and read.table()

## Matrix-like behavior -- Using `[` subscripts


### apply, rownames, colnames and colSums and rowSums

## List-like behavior -- Using `[[` and `$` subscripts

## names, lapply and sapply

## as.matrix and as.data.frame