R mini-course: week 1 notes

This is a version of the class material that you can look at alongside its output, chunk by chunk. It overlaps largely with the slides, but has additional commentary, demos, and exercises interleaved throughout. We’ll have one of these to go with each week’s class. I’d recommend reading through the whole document, even if parts of it feel redundant. The more you’re reminded of concepts, the easier it’ll be to learn/remember them!

1. Hello, R!

Welcome to the R mini-course! In keeping with tradition…

print("...an obligatory 'hello, world!'")

## [1] "...an obligatory 'hello, world!'"

Notice that, as in the slides, the console output appears below each block of code. If you’re following along by typing this code into the console bit by bit, you can check your work at each step by seeing if your output matches the output here.

# this line is a comment, so R will always ignore it.
# this is a comment too, since it also starts with "#".

# but the next one is a line of real R code, which does some arithmetic:
5 * 3

# we can do all kinds of familiar math operations:
5 * 3 + 1

# 'member "PEMDAS"?? applies here too -- compare the last line to this one:
5 * (3 + 1)

# spaces usually don't matter, but should be used for legibility
5 * 3+1
5*(3+     1)

## [1] 15
## [1] 16
## [1] 20
## [1] 16
## [1] 20

We’ll mostly be using three “basic” kinds of values in R: numbers (like we just saw), characters, and logicals (this terminology isn’t completely precise – just meant to be evocative for now). Here are examples of the other two. We’ll see why these are useful very soon.

"a character string!"
TRUE
FALSE

## [1] "a character string!"
## [1] TRUE
## [1] FALSE

2. Variables and Assignments

Usually when we do some math, we want to save the result for future use. We can do this by assigning the result of a computation to a variable:

firstvar <- 5 * (3 + 1)

Now ‘firstvar’ is an object that we can manipulate and inspect. We can see its value by printing it. Sending firstvar to the interpreter on its own is equivalent to calling print(firstvar):

firstvar

## [1] 20

note the word “variable” is unfortunately overloaded with meanings. here a variable is something like a container, which we can put stuff in. this analogy does not apply in general across programming languages. the way R deals with variables is importantly different from the more common “pointer” semantics for variables found in Python and most other languages. the difference between “container” semantics and “pointer” semantics is illustrated in the appendix below.

We can put basically anything into a variable, and we can call a variable pretty much whatever we want (but do avoid special characters besides “_”).

myvar <- "boosh!"
myvar

myVar <- 5.5
myVar

# including other variables or computations involving them:
my_var <- myvar
my_var

myvar0 <- myVar / (myVar * 1.5)
myvar0

## [1] "boosh!"
## [1] 5.5
## [1] "boosh!"
## [1] 0.6666667

When you introduce variables, they’ll appear in the environment tab of the top-right pane in RStudio. You can remove variables you’re no longer using with rm(). (This isn’t necessary, but it saves space in both your brain and your computer’s.)

rm(myvar)
rm(my_var)
rm(myVar)
rm(myvar0)
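If you want to double-check that rm() did its job, here’s a minimal sketch using the base function exists(), which takes a variable name as a character string. The name tmp_var is made up just for this demo.

```r
# a throwaway variable (hypothetical name, only for this demo)
tmp_var <- 42

# exists() asks whether a variable with this name can be found
before_rm <- exists("tmp_var")
before_rm

# remove it, then ask again
rm(tmp_var)
after_rm <- exists("tmp_var")
after_rm
```

The first check should come back TRUE and the second FALSE. You can also call ls() with no arguments to list everything currently in your environment.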

3. Vectors

R was designed with statistical applications in mind, so naturally there’s lots of ways to represent collections or sequences of values (e.g. numbers).

in R, a “vector” is the simplest list-like data structure. you can create a vector with the c() function (for “concatenate”)

myvec <- c(1, 2, 3, 4, 5)
myvec

anothervec <- c(4.5, 4.12, 1.0, 7.99)
anothervec

## [1] 1 2 3 4 5
## [1] 4.50 4.12 1.00 7.99

Surprise: in R, pretty much everything is a vector actually! for example the value 10 is a vector of length one – but because it only has one element, it doesn’t look like or always act like it has internal structure like length >1 vectors do. We can see this in action by using the function identical():

# test whether something is a vector with `is.vector()`
is.vector(c(10, 12))
is.vector(c(10))
is.vector(10)

# test if two things are the same
identical(10, c(10))
identical("boosh", c("boosh"))

identical(10, 10.0)
identical(as.numeric(10), 10.0)

identical(as.integer(10), 10.0)
identical(as.integer(10), 10)

## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] FALSE
## [1] FALSE

exercise: why are the final two identical() statements false? why are the others true?

Vectors can hold elements of any type, but they must all be of the same type. to keep things straight in your head, it can help to include data types in variable names:

myvec_char <- c("a","b","c","d","e")
myvec_char

# if we try the following, R will coerce the numbers into characters:
myvec2 <- c("a","b","c",1,2,3)
myvec2
rm(myvec2)

# you can put vectors or values (length 1 vectors) together with `c()`
longvec <- c(0, myvec, 9, 80, anothervec, 0, NA)

## [1] "a" "b" "c" "d" "e"
## [1] "a" "b" "c" "1" "2" "3"
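Coercion isn’t limited to numbers turning into characters. Here’s a small sketch of what c() does with other mixtures (the variable names are made up for illustration): roughly, logicals give way to numbers, and everything gives way to characters.

```r
# logical + number -> numbers (TRUE becomes 1, FALSE becomes 0)
mix_num <- c(TRUE, FALSE, 2)
mix_num

# anything + character -> characters
mix_chr <- c(TRUE, 2, "a")
mix_chr

# typeof() tells you which type the vector ended up with
typeof(mix_num)
typeof(mix_chr)
```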

Suppose the only reason we created myvec and anothervec was to put them together with some other stuff, and assign (“save”) that to longvec. In this case, we can just remove myvec and anothervec, and work with longvec.

rm(myvec)
rm(anothervec)

longvec

##  [1]  0.00  1.00  2.00  3.00  4.00  5.00  9.00 80.00  4.50  4.12  1.00
## [12]  7.99  0.00    NA

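Note also that the whole numbers now print with decimals. Nothing was converted here: numbers typed like 1 or 80 are already stored as “floating-point” numbers, which are a computer’s finite-precision approximation of the real numbers (called doubles in R – you’ll learn some cool stuff if you google: “why are floats called ‘doubles’ in R?”, including what a “float” is). When R prints a numeric vector, it shows every element with the same number of decimal places, so once 4.12 and 7.99 are in the vector, 1 shows up as 1.00.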
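A quick way to convince yourself that the extra decimals are only a display thing is to ask R for the underlying type with typeof(). This is just a sketch with made-up variable names:

```r
# plain number literals are stored as doubles, even when they look whole
whole <- c(1, 2, 3)
typeof(whole)

# adding a value with decimals changes how the vector prints...
mixed <- c(whole, 4.12)
mixed

# ...but the values themselves are unchanged
mixed[1] == 1
```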

# to see how many elements a vector has, get its `length()`
length(longvec)

# to see what the unique values are, use `unique()` (you'll get a vector back)
unique(longvec)

# a very common operation is to see how many unique values there are:
length(unique(longvec))

# to see a frequency table over a vector, use `table()`
table(longvec)

# note that this works for all kinds of vectors
table(c("a","b","c","b","b","b","a"))
table(c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE))

## [1] 14
##  [1]  0.00  1.00  2.00  3.00  4.00  5.00  9.00 80.00  4.50  4.12  7.99
## [12]    NA
## [1] 12
## longvec
##    0    1    2    3    4 4.12  4.5    5 7.99    9   80 
##    2    2    1    1    1    1    1    1    1    1    1 
## 
## a b c 
## 2 4 1 
## 
## FALSE  TRUE 
##     4     2

An important but not obvious thing: R has a special value called NA, which represents missing data. By default, table() won’t tell you about NA’s (annoying, ik!). So get in the habit of specifying the useNA argument of table().

table(c(1,2,3,2,2,NA,3,NA,NA,1,1))
table(c(1,2,3,2,2,NA,3,NA,NA,1,1), useNA="ifany") # or "always" or "no"

## 
## 1 2 3 
## 3 3 2 
## 
##    1    2    3 <NA> 
##    3    3    2    3
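Another common way to check for missing values is the base function is.na(), which returns TRUE for each element that is NA. Here’s a sketch using the same numbers as above (vec_with_na is just a name I made up); summing a logical vector counts the TRUEs.

```r
vec_with_na <- c(1, 2, 3, 2, 2, NA, 3, NA, NA, 1, 1)

# one TRUE/FALSE per element: is this element missing?
is.na(vec_with_na)

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of NA's
n_missing <- sum(is.na(vec_with_na))
n_missing

# everything else is non-missing
n_present <- length(vec_with_na) - n_missing
n_present
```

The count of missing values here should match the <NA> column in the table() output above.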

Notice that the structure of this command is:

table(VECTOR, useNA=CHARACTERSTRING)

some terminology:

- table() is a function
- table() has argument positions for a vector and for a string
- in the second command above, we provided table() with two arguments:
  - a vector (namely c(1,2,3,2,2,NA,3,NA,NA,1,1)); and
  - a character string (namely "ifany")
- the second argument position was named useNA
- we used the argument binding syntax useNA="ifany"

Argument-binding is kind of like variable assignment, but useNA doesn’t become directly available for use after we give it a value. This might feel kinda abstract, but i promise the intuition will become clearer the further along we get. some arguments – like useNA here – can be thought of as “options” of the function they belong to.

# here's an example that might clarify the concept of argument binding:
round(3.141592653, digits=4)

## [1] 3.1416

round() is a commonly used function. It’s additionally relevant here because it illustrates an important concept called vectorization. Many, many functions in R are vectorized by default, which means that they can take an individual value (like the round() call above), or they can take a vector of many values. In the latter case, the function will apply pointwise to each element of the vector, and then return the resulting values as a vector, with the same length and order as the input:

round(longvec, digits=1)

##  [1]  0.0  1.0  2.0  3.0  4.0  5.0  9.0 80.0  4.5  4.1  1.0  8.0  0.0   NA

note: if f() is a vectorized function and v is a vector given by

v <- c(v_1, v_2, ... , v_n) ,

then the return value f(v) that we get from applying f() to v is:

c(f(v_1), f(v_2), ... , f(v_n))
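To make the schema above concrete, here’s a small sketch with a few other vectorized base functions. The vector v is made up for illustration.

```r
v <- c(1, 4, 9, 16)

# sqrt() is applied to each element in turn
roots <- sqrt(v)
roots

# arithmetic operators are vectorized too
tens <- v * 10
tens

# and so are many functions on character vectors
n_chars <- nchar(c("a", "bb", "ccc"))
n_chars
```

In each case the result has the same length and order as the input.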

Okay, back to argument positions. Especially when just starting out, it’s important to supply the names to all function arguments, just to drill them into your head. But in some cases, you can still supply an argument without specifying the argument’s name or using the = syntax, e.g.

round(longvec, 1)

##  [1]  0.0  1.0  2.0  3.0  4.0  5.0  9.0 80.0  4.5  4.1  1.0  8.0  0.0   NA

As we saw with table(), some functions have optional argument positions. This is true of round() too – if we don’t tell it how many digits to round to, it uses the default of 0. some but not all argument positions have default values. We’ll see how this works later.

round(longvec)

##  [1]  0  1  2  3  4  5  9 80  4  4  1  8  0 NA
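You might have noticed that 4.5 got rounded to 4, not 5. That’s not a bug: as described in the help page for round(), R follows the “round half to even” convention for values exactly halfway between two whole numbers. Here’s a sketch showing both the default digits value and the halfway behavior (the variable names are made up):

```r
# leaving out digits is the same as asking for digits = 0
same_result <- identical(round(2.567), round(2.567, digits = 0))
same_result

# exact halves go to the nearest *even* whole number
halves <- c(0.5, 1.5, 2.5, 3.5, 4.5)
rounded_halves <- round(halves)
rounded_halves
```

Values that aren’t exactly halfway (like 7.99 or 4.12) round the way you’d expect.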

4. Subsetting and Indexing

We will very often want to access individual elements or subsets of a vector (e.g. if we’ve sorted a vector and want to look at its first element).

There are several ways to do this. here are some examples to give you an idea (note that 1:5 is the vector c(1,2,3,4,5), and == is actual “equals”).

# a vector of several words
vec_words <- c("first","second","third","fourth","fifth")
vec_words[1]
vec_words[2:3]
vec_words[c(1,4)]
vec_words[vec_words=="first"]

## [1] "first"
## [1] "second" "third" 
## [1] "first"  "fourth"
## [1] "first"
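The last example above works because the comparison itself produces a logical vector, and square brackets keep the elements where that vector is TRUE. Here’s a sketch that breaks this into steps, plus one more trick (negative indices drop elements). The name is_first is made up for illustration.

```r
vec_words <- c("first", "second", "third", "fourth", "fifth")

# the comparison is vectorized: one TRUE/FALSE per element
is_first <- vec_words == "first"
is_first

# indexing with a logical vector keeps the TRUE positions
vec_words[is_first]

# you can also write the logical vector out by hand
every_other <- vec_words[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
every_other

# a negative index means "everything except this position"
all_but_first <- vec_words[-1]
all_but_first
```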

It’s quite annoying to have to type every element of a vector. Fortunately, there are many functions designed to make this unnecessary. For example rep() is short for “replicate”; seq() is short for “sequence”; letters is a built-in constant for the vector c("a","b",...,"z"); and we just saw the range operator :.

# you can also combine 'times' and 'each' inside of rep()
(vec_num <- rep(1:5, times=2))
(vec_abc <- rep(letters[1:5], each=2))
(vec_odd <- seq(from=1, to=19, by=2))

##  [1] 1 2 3 4 5 1 2 3 4 5
##  [1] "a" "a" "b" "b" "c" "c" "d" "d" "e" "e"
##  [1]  1  3  5  7  9 11 13 15 17 19

note: putting parentheses around an assignment statement as above causes it to print the value that gets assigned to the variable. saves keystrokes.
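A few more ways to build sequences, as a sketch (the variable names are made up). The length.out argument of seq() asks for a given number of evenly spaced values instead of a step size, and both seq() and : can count downwards.

```r
# five evenly spaced values from 0 to 1
(vec_quarters <- seq(from=0, to=1, length.out=5))

# a negative step counts down
(vec_down <- seq(from=10, to=2, by=-2))

# the range operator can count down too
(vec_rev <- 5:1)
```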

exercise: print the vector 1 1 1 2 2 2 1 1 1 2 2 2 by applying rep() to the vector 1:2 (hint: not including spaces, the answer I have in mind takes 23 keystrokes)

note: using the rep() and seq() functions is a simple illustration of the important and fairly general DRY principle in programming (for “Don’t Repeat Yourself”). if you’re finding yourself typing the same sequence of characters over and over (e.g. c(1, 1, 1, ...)), then chances are you can use a function to automate the work for you. adhering to the DRY principle also reduces the chance that you’ll make a typo, or have to go back and edit multiple similar things. we’ll come across this again.

Very often we’ll want to e.g. get the average value or the sum of a vector. we’ll get way more into this in future sessions, but here’s a preview:

# get the mean with mean(), or calculate it ourselves!
(vec_num_mean <- mean(vec_num))
(vec_num_mean <- sum(vec_num) / length(vec_num))

# get the (sample) variance with var(), or calculate it ourselves! 
(vec_num_var <- var(vec_num))
(vec_num_var <- sum((vec_num - mean(vec_num))^2)/(length(vec_num) - 1))

# get the correlation between vec_num and vec_odd (pearson's *r*)
cor(vec_num, vec_odd, method="pearson")

## [1] 3
## [1] 3
## [1] 2.222222
## [1] 2.222222
## [1] 0.492366
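One thing to watch out for: if a vector contains NA, functions like mean() and sum() return NA by default. Here’s a sketch with a made-up vector showing the na.rm argument, which tells these functions to drop missing values first.

```r
vec_gappy <- c(2, 4, NA, 6)

# with a missing value in the mix, the answer is "missing" too
mean_default <- mean(vec_gappy)
mean_default

# na.rm=TRUE removes the NA's before computing
mean_no_na <- mean(vec_gappy, na.rm=TRUE)
mean_no_na

sum_no_na <- sum(vec_gappy, na.rm=TRUE)
sum_no_na
```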

exercise: compute pearson’s r on vec_num and vec_odd using only arithmetic.

So why should we care about vectors?! Aside from the fact that almost everything is a vector in R (and in actual data analysis!), here’s an analogy to keep in mind: vectors are like columns of an abstract spreadsheet (not like rows).

- all their elements have to have the same type
- they have a length and you can perform operations on them
- they can contain missing values (NA)

In fact, this is a bit more than an analogy in R!

R’s implementation of a “spreadsheet” – the data frame – is quite literally a list of vectors (with certain properties). the data frame is a beautiful data structure, and is used to represent (flat) datasets e.g. the contents of an excel sheet.

fun fact: the main spreadsheet-like object in Python’s most popular data analysis library, “pandas”, borrows heavily from the design of R’s data frame, and is also called a DataFrame (though the implementation is somewhat different).

5. Data Frames!

A data frame is a list of vectors all of which have the same length. (more on lists in week 2)

first_df <- data.frame(1:5, letters[1:5], c(TRUE, TRUE, FALSE, NA, FALSE))
first_df

##   X1.5 letters.1.5. c.TRUE..TRUE..FALSE..NA..FALSE.
## 1    1            a                            TRUE
## 2    2            b                            TRUE
## 3    3            c                           FALSE
## 4    4            d                              NA
## 5    5            e                           FALSE

exercise: why are the column names of first_df weird looking? Rewrite this command so that the column names are more reasonable (hint: look at the next block of code)

Here’s a slightly more interesting data frame, with names for columns. Imagine the id column is a student identifier, group indicates whether the student in that row goes to one of two schools (say, NYU or Columbia law), and score gives the student’s score on (their first attempt at!) the New York bar exam.

cool_df <- data.frame(
  id    = paste0("id_", 1:6),           # unique identifier for each person
  group = rep(c("a","b"), each=3),      # "a" = NYU law school, "b" = Columbia
  score = runif(n=6, min=50, max=100)   # score on the NY bar exam
)
cool_df

##     id group    score
## 1 id_1     a 88.00637
## 2 id_2     a 68.93861
## 3 id_3     a 83.88812
## 4 id_4     b 95.64835
## 5 id_5     b 57.17870
## 6 id_6     b 94.36891

note the use of line-breaks and whitespace for legibility. This is a matter of coding style, but can make things way easier to read. For a complete style guide for writing R code, see Google’s official R style guide and the section on coding style of Hadley Wickham’s (amazing) book Advanced R. An important quote from the book: “You don’t have to use my style, but you really should use a consistent style.”

note: the score column was created with the function runif(). the “r” is for “random”, and the “unif” is for “uniform”, as in “uniform distribution.”

exercise: confirm what the three arguments are doing in the runif() call. then check out dunif(), punif(), and qunif() by googling around a bit. corresponding functions exist for many probability distributions, e.g. r/d/p/qnorm()
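Because runif() draws random numbers, the scores you get when you run the code yourself won’t match the ones printed above. If you want the same “random” numbers every time, you can set the random seed first with the base function set.seed(). Here’s a sketch; the seed value 2024 is arbitrary, and the variable names are made up.

```r
# fix the seed, then draw three numbers
set.seed(2024)
first_draw <- runif(n=3, min=50, max=100)

# resetting to the same seed replays the same draws
set.seed(2024)
second_draw <- runif(n=3, min=50, max=100)

identical(first_draw, second_draw)
```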

We can access rows or columns of data frames using square-bracket syntax [ , ]. The $ operator for individual columns is nice too – that gives us back a vector (since remember, the columns are vectors!). Lots of ways to slice + dice a df – below are some examples.

# getting rows 
cool_df[1:3, ]
cool_df[cool_df$group=="a", ]

# getting columns (there's also [[]], which we'll get to later)
cool_df$score
cool_df[, 1]
cool_df[, "score"]
cool_df[, c("id", "group")]

# getting rows *and* columns
cool_df[1:2, 2:3]

##     id group    score
## 1 id_1     a 88.00637
## 2 id_2     a 68.93861
## 3 id_3     a 83.88812
##     id group    score
## 1 id_1     a 88.00637
## 2 id_2     a 68.93861
## 3 id_3     a 83.88812
## [1] 88.00637 68.93861 83.88812 95.64835 57.17870 94.36891
## [1] id_1 id_2 id_3 id_4 id_5 id_6
## Levels: id_1 id_2 id_3 id_4 id_5 id_6
## [1] 88.00637 68.93861 83.88812 95.64835 57.17870 94.36891
##     id group
## 1 id_1     a
## 2 id_2     a
## 3 id_3     a
## 4 id_4     b
## 5 id_5     b
## 6 id_6     b
##   group    score
## 1     a 88.00637
## 2     a 68.93861

tip: when you’re slicing and dicing a data frame using the [ , ] syntax, remember to use the comma!
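Here’s a sketch of why the comma matters, using a small made-up data frame (demo_df is not part of the class data). With the comma, the numbers before it pick out rows; without the comma, R treats the data frame as a list of columns and picks out columns instead.

```r
demo_df <- data.frame(
  x = 1:3,
  y = c(10, 20, 30),
  z = c(TRUE, FALSE, TRUE)
)

# with the comma: rows 1 and 2, all three columns
with_comma <- demo_df[1:2, ]
dim(with_comma)

# without the comma: columns 1 and 2, all three rows
no_comma <- demo_df[1:2]
dim(no_comma)
```

Neither version gives an error, which is exactly why a forgotten comma can be hard to spot.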

exercise: given the examples above, explain how the [ , ] syntax works in a couple of sentences. then google something like “R double bracket subsetting syntax data frame”, and explain how [ , ] differs from [[ ]].

look at the Base R cheatsheet posted on the course page for an excellent overview!

Okay, now let’s see who passed the exam:

cool_df$id[cool_df$score >= 60]

## [1] id_1 id_2 id_3 id_4 id_6
## Levels: id_1 id_2 id_3 id_4 id_5 id_6

note: the output isn’t quoted, and it says Levels: .... this is because R versions before 4.0.0 code characters in data frames as factors by default (from R 4.0.0 on, the default is to keep them as characters, so your own output may show quoted strings and no Levels: line). A factor is kind of like the representation of a column in statistical packages like Stata, in the sense that a factor vector is technically of integer type (I think this corresponds to “values” in Stata), but each unique integer value is associated with a character string called a level (which I think corresponds to a “value label”). So with factors, there’s a fixed set of possible values, some of which might not even be present in the data (months are a good example of something that’s natural to think of as a factor). This is actually kind of annoying in the big picture, but we’ll live with it for now.

note: to code characters as characters (instead of factors) when creating a data frame, set the stringsAsFactors argument of data.frame() to FALSE.

charVec <- c("booshA","booshB")
facVec  <- as.factor(c("booshA","booshB"))

typeof(charVec)
class(charVec)

typeof(facVec)
class(facVec)

## [1] "character"
## [1] "character"
## [1] "integer"
## [1] "factor"
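To see the “integers plus labels” idea directly, here’s a sketch that pulls a factor apart with levels(), as.integer(), and as.character(). I’ve added a repeated value to the vector so the integer codes are easier to read.

```r
facVec <- as.factor(c("booshA", "booshB", "booshA"))

# the set of possible values (sorted alphabetically by default)
fac_levels <- levels(facVec)
fac_levels

# the underlying integer codes, one per element
fac_codes <- as.integer(facVec)
fac_codes

# converting back to an ordinary character vector
fac_chars <- as.character(facVec)
fac_chars
```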

note: some people like to use “camel case” to separate words in variable names (like in charVec or facVec or theNameOfThisVariable). Others prefer using underscores, as in char_vec or fac_vec or the_name_of_this_variable. Each approach has its advantages, and some people get very opinionated about this (kinda like the hilarious “tabs versus spaces” subplot in Silicon Valley!). Ultimately, though, it really doesn’t matter and boils down to a matter of taste. This is mentioned in the style guides linked to above.

We can add columns to a data frame by combining assignment <- with the dollar-sign $ column-grabbing syntax. Check that these new columns evaluate to what you’re expecting them to.

cool_df$passed <- ifelse(cool_df$score >= 60, TRUE, FALSE)
cool_df$aced   <- ifelse(cool_df$score >= 90, TRUE, FALSE)
cool_df$failed <- !cool_df$passed
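One way to check a new column is to compute it a second way and compare. Here’s a sketch with made-up scores (not the class data): a comparison like score >= 60 already returns a logical vector, so wrapping it in ifelse(..., TRUE, FALSE) gives the same thing. It’s a matter of taste which you find more readable.

```r
# made-up scores, just for this demo
score <- c(88.0, 68.9, 57.2, 95.6)

passed_a <- ifelse(score >= 60, TRUE, FALSE)
passed_b <- score >= 60

# the two versions agree element by element
identical(passed_a, passed_b)

# a frequency table is a quick sanity check on any new column
table(passed_b, useNA="ifany")
```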

We’ll go over these extensively next week:

exercise: compute the percentage of law students who aced the exam.

exercise: compute the mean score for each group. (hint: google aggregate())

exercise: how does the failed column get computed?!

Finally, a few useful functions that you can call on data frames to check them out a bit. (note that you can combine names() with assignment to change the column names)

head(cool_df, n=2)
dim(cool_df)        # a vector of length 2: number of rows, number of cols
nrow(cool_df)       # get the number of rows
ncol(cool_df)       # get the number of columns
names(cool_df)      # get the names of the columns

##     id group    score passed  aced failed
## 1 id_1     a 88.00637   TRUE FALSE  FALSE
## 2 id_2     a 68.93861   TRUE FALSE  FALSE
## [1] 6 6
## [1] 6
## [1] 6
## [1] "id"     "group"  "score"  "passed" "aced"   "failed"

exercise: change the name of the group column to “school”

exercise: recode school’s values as "nyu" and "columbia"

Here are two more useful functions that you can apply to all kinds of objects. When you get a new dataset, inspecting the structure and getting a summary of each column are always good things to do.

str(cool_df)      # look at the structure of the data frame 
summary(cool_df)  # get useful info about each column

## 'data.frame':    6 obs. of  6 variables:
##  $ id    : Factor w/ 6 levels "id_1","id_2",..: 1 2 3 4 5 6
##  $ group : Factor w/ 2 levels "a","b": 1 1 1 2 2 2
##  $ score : num  88 68.9 83.9 95.6 57.2 ...
##  $ passed: logi  TRUE TRUE TRUE TRUE FALSE TRUE
##  $ aced  : logi  FALSE FALSE FALSE TRUE FALSE TRUE
##  $ failed: logi  FALSE FALSE FALSE FALSE TRUE FALSE
##     id    group     score         passed           aced        
##  id_1:1   a:3   Min.   :57.18   Mode :logical   Mode :logical  
##  id_2:1   b:3   1st Qu.:72.68   FALSE:1         FALSE:4        
##  id_3:1         Median :85.95   TRUE :5         TRUE :2        
##  id_4:1         Mean   :81.34                                  
##  id_5:1         3rd Qu.:92.78                                  
##  id_6:1         Max.   :95.65                                  
##    failed       
##  Mode :logical  
##  FALSE:5        
##  TRUE :1        
##                 
##                 
##

That’s it for now – we’ll start getting our hands dirty with some real (and hopefully interesting!) datasets next week.

6. Next Week

- more on data frames
- reading in external datasets as data frames
- installing and loading packages
- manipulating and cleaning up data frames
- summarizing columns and rows of data frames
- group-wise summaries involving multiple columns

appendix: container- versus pointer-based semantics for variables

Here’s a quick illustration of how the semantics of variables can be different depending on the language. As discussed in class, you can think of R variables as “containers” that you can “dump” an arbitrary object into. But in Python, objects kinda just chill on a shelf in memory. Python variables are more like “pointers” to an object that exists independently of how we refer to it. (caution: this analogy is approximate!)

Variable behavior in R:

# assign the vector c(3, 4, 5) to x
x <- c(3, 4, 5)
print(paste("the value of x is:", list(x), sep="     "))

# "dump"/"copy" the contents of x into y
y <- x
print(paste("the value of y is:", list(y), sep="     "))

# add an element to x and look at the result
x <- c(x, 6)
print(paste("now the value of x is:", list(x), sep=" "))

# what's y gonna be now?!?!
print(paste("now the value of y is:", list(y), sep=" "))

## [1] "the value of x is:     c(3, 4, 5)"
## [1] "the value of y is:     c(3, 4, 5)"
## [1] "now the value of x is: c(3, 4, 5, 6)"
## [1] "now the value of y is: c(3, 4, 5)"

Variable behavior in Python:

# x points to the list [3, 4, 5]
x = [3, 4, 5]
print("the value of x is:", x, sep="     ")

# y points to the same list that x points to
y = x
print("the value of y is:", y, sep="     ")

# add an element to x and look at the result
x.append(6)
print("now the value of x is:", x, sep=" ")

# what's y gonna be now?!?!
print("now the value of y is:", y, sep=" ")

## the value of x is:     [3, 4, 5]
## the value of y is:     [3, 4, 5]
## now the value of x is: [3, 4, 5, 6]
## now the value of y is: [3, 4, 5, 6]

note: this demo also illustrates a badass property of R Markdown, for which we have Yihui Xie’s knitr:: package to thank: you can use lots of different languages inside of code chunks (with appropriate chunk options), provided that you have an interpreter or compiler for that language installed on your machine. To learn more, first read up about knitr:: in general, then read the page of its documentation on language engines.

note: (btw, the <WORD>:: syntax means that <WORD> is an R package. the syntax <WORD1>::<WORD2>() means that the function <WORD2>() comes from the package called <WORD1>. we’ll see why this syntax is nice later on)
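Here’s a sketch of the :: syntax in action, using two packages that ship with R (stats and utils), so nothing needs to be installed. The vector vals is made up for illustration.

```r
vals <- c(2, 4, 4, 6)

# var() lives in the stats package, so these two calls are the same
var_explicit <- stats::var(vals)
var_plain    <- var(vals)
identical(var_explicit, var_plain)

# head() lives in the utils package
first_letters <- utils::head(letters, n=3)
first_letters
```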