When you finish this lesson, you will be able to: 1) start and stop R and R Studio; 2) download, install, and run the tidyverse package; 3) get help on R functions.
R is a programming language for statistics. Generally, you will work with R by writing scripts (small programs) that do the analysis you want to do. You will also need a development environment that lets you edit and run the scripts. I recommend RStudio, which is pretty easy to learn.
In general, you will need three things for an analysis job:
R itself. If you use a package manager on your computer, R is likely available there. The most common package managers are homebrew on Mac OS, apt-get on Debian Linux, yum on Red Hat Linux, or chocolatey on Windows. You may need to search for 'cran' to find the name of the right package. For Debian Linux, it is called r-base.
The R Studio development environment. R Studio can be downloaded from https://rstudio.com/products/rstudio/download/. The free version is fine for what we are doing.
There are other choices for development environments. I use Emacs and ESS, Emacs Speaks Statistics, but that is mostly because I've been using Emacs for 20 years.
A number of R packages for specific analyses. These can be downloaded from the Comprehensive R Archive Network, or CRAN. Go to https://cloud.r-project.org and click on the 'Packages' tab. We will cover package installation later.
You may want to bookmark the R-project.org web site, as it has lots of useful information, including links to documentation and of course the CRAN library of packages.
Go ahead and download and install R and R Studio using the instructions on those pages.
When you open R Studio, the screen is split into four regions. (You can adjust the size of these regions if you like.)
Although Region 1 is where we will do most of our work, I'm going to start with Region 2. This is the R console. R is an interactive programming language. It shows that it is waiting for a command by printing the prompt >. You can type a command at that prompt and hit return. R will then print the result of the expression. You can try this. Type 2+2 and then hit return. R should respond [1] 4.
Why the [1]? This is because R always works with vectors. It indicates that the answer is a vector and that its first element is 4.
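A small sketch of the same idea, which you can paste into the console. The variable names x and y are only for illustration; they do not come from the lesson.

```r
x <- 2 + 2
x            # prints [1] 4
length(x)    # a single number is a vector of length 1
x[1]         # the first (and only) element

## With a longer vector, the bracketed number at the start of each
## printed line is the index of the first element on that line.
y <- 101:130
y
```

How many values fit on a line depends on the width of your console, so the bracketed indices you see for y may differ from someone else's.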
R is a separate program from R Studio. The console window communicates between the two programs. In fact, if you open a terminal window (or command window on Windows) in your operating system, and type R, you will get a similar command prompt and can interact directly with R without R Studio.
Region 1 contains an editor for R scripts. When I'm doing data analysis, I want to keep a record of all of the steps I took in doing the analysis. That is the script. An R script file is just a series of R commands, one R command per line. These are put in a text file (which can be edited by many different programs) with an extension of .R (note the capital; important for case-sensitive file systems, like Linux).
To create a new script file in R Studio, go to the "File" menu and select "New File ... > R Script". This will open a new window in Region 1. I generally save it right away, so that I can give it a name that reflects my purpose.
Generally, I work in R by building up a script for my analysis. In R Studio, I can put my cursor on the line I want to run and press the Run button at the top of the script window. This will copy the line to the console and run it. If it didn't work quite right (which often happens), I edit the line and try again. This way I don't keep the mistakes around in my script, just the stuff that worked.
Sometimes I type things directly in the console. These are usually things I just want to try at the moment to see how they work, or maybe to get more information. For example, I might type names(cars) to get information about the variables in the data set cars, or maybe help(var) to remind myself of how the command var works.
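Here is a short sketch of those console commands. The cars data set and the var function both come with R; the ?var shortcut and the variable name car_vars are additions that are not mentioned in the text above. The help calls are wrapped in interactive() only so that the snippet also runs cleanly as a script.

```r
## List the variables in the built-in cars data set.
car_vars <- names(cars)
car_vars

## Open the help page for var; ?var is a shorthand for help(var).
if (interactive()) {
  help(var)
  ?var
}

## Try the function on one of the variables.
var(cars$speed)
```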
I said that there was one command per line, but there are a couple of exceptions. First, you can put two commands on one line if you separate them with a semicolon (;). Second, if R doesn't think the command is complete on one line, it will look for the rest on the next line. I seldom use the semicolon to put two commands on one line, but I often need to split long lines when writing complex code.
The key to successfully splitting a line is letting R know that there is more to come. Consider the following example.
1 +
2
## [1] 3
Putting the plus sign at the end of the line tells R that there is more to come. So R interprets this as one expression, 1+2. If I put the plus sign on the second line instead, R would interpret this as two expressions: 1 and +2.
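The sketch below puts both exceptions side by side. All of the variable names are made up for illustration. The last case, wrapping the expression in parentheses, is not mentioned above; it is another way to tell R that more is coming.

```r
a <- 1; b <- 2    # two commands on one line, separated by a semicolon

total <- 1 +
  2               # operator at the end of the line: one expression, total is 3

other <- 1
+2                # operator at the start of the line: a separate expression,
                  # so other is still 1

wrapped <- (1
            + 2)  # an open parenthesis keeps the expression going: 3
```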
If R thinks there is still more to come, it will prompt with a + instead of a >. Try this. Type (1+2 and then return at the R command prompt. R will prompt you with + because the expression is not complete. Type ) to finish the expression. This is a fairly common mistake to make; if R is unexpectedly prompting you with its continuation prompt, it usually means you forgot a closing parenthesis or quote.
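You cannot see the continuation prompt from inside a script, but you can ask R's parser whether a piece of text forms a complete expression. This is only a rough sketch: parses_ok is a hypothetical helper written for this illustration, and it returns FALSE for any syntax error, not just for unfinished input.

```r
## Hypothetical helper: TRUE if the text parses as complete R code.
parses_ok <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

parses_ok("(1+2")    # FALSE: the closing parenthesis is missing
parses_ok("(1+2)")   # TRUE
parses_ok("'abc")    # FALSE: the closing quote is missing
```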
You can add comments to your R code by using the pound sign (or hash tag), #. When it sees the pound sign, R ignores everything up to the end of the line (unless the pound sign is in a string).
I use the following convention, which comes from Lisp programming. I use a single pound sign for a comment which comes after the code. This is usually one tab away from the end of the code. I use two pound signs for comments that are in the code. They are aligned with the start of the code line. I use three pound signs for big comments that describe a whole block of code. These are aligned flush left. Finally, I use a whole line of pound signs to separate different parts of a long script file.
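As an illustration of this convention, here is a small made-up script. The function mean_speed and the other names are hypothetical; only the cars data set comes from R itself.

```r
############################################################
### Average speed in the cars data set.
### Three pound signs: a big comment about the whole block,
### aligned flush left.

mean_speed <- function(data) {
  ## Two pound signs: a comment inside the code, aligned
  ## with the start of the code line.
  result <- mean(data$speed)    # one pound sign: after the code
  result
}

avg <- mean_speed(cars)

############################################################
### A pound sign inside a string is not a comment.

hash_text <- "# still part of the string"
```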
R Studio introduces a new kind of script file that I find much more useful than the plain R script. An R Markdown (.Rmd) document can be created by selecting "New File ... > R Notebook" from the "File" menu in R Studio.
An R Markdown document has three parts. The first part, enclosed between a pair of --- lines, is the YAML header (YAML originally stood for Yet Another Markup Language; officially it now stands for YAML Ain't Markup Language). This contains meta-data about the document, like title, author and date. It also contains instructions to Markdown about how you want to compile the document.
The rest of the document alternates between text chunks in the markdown language and code chunks in R. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For the most part, it looks like plain text, but there are some characters that have special meanings. For example, a line that starts with two pound signs starts a new section. If you select "Help > Markdown Quick Reference" you will get a summary of all of the commands. One of the things I like about Markdown is that if you don't know the markdown syntax, it pretty much looks like plain text, so just about everybody can read it.
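To make the three parts concrete, the sketch below uses plain R to write a tiny R Markdown file into a temporary folder. The file name, title, and text are invented for this example, and the backquotes are built with strrep only so that they can be shown here. The commented-out last line assumes the rmarkdown package is installed.

```r
fence <- strrep("`", 3)    # three backquotes mark a code chunk

rmd_lines <- c(
  "---",                           # part 1: the YAML header
  'title: "My First Analysis"',
  "output: html_document",
  "---",
  "",
  "## Summary of the data",        # part 2: text in Markdown
  "",
  "The cars data set has two variables.",
  "",
  paste0(fence, "{r}"),            # part 3: a code chunk in R
  "summary(cars)",
  fence
)

rmd_file <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_file)

## rmarkdown::render(rmd_file)    # compiles the document (needs rmarkdown)
```

If you open the resulting file in R Studio, you should see the header, the text, and the chunk laid out as described above.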
You can embed an R code chunk like this:
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00  
The code chunk starts and ends with three backquotes. When you edit this in R Studio, R Studio puts a little green triangle in the top right corner of the chunk. Pressing that will run the chunk: