In these notes we’ll be looking at some sample data from the General Social Survey (GSS), a survey measuring American opinion on a variety of issues. The survey has been conducted every other year by NORC for many years. Many thanks to Ben Schapiro for prepping this stuff! It’s very cool data, and quite good for illustrating many common data wrangling tasks. After some discussion about planning in Sections 1-2, we start writing actual code to manipulate the data in Section 3. Feel free to skip directly to Section 3 if you just want to practice coding.

note on the datasets: we have sample data from three years: 2012, 2014, and 2016. You can find the corresponding datasets here (2012), here (2014), and here (2016). In these notes we’ll look just at the 2016 data, but there’s so much more we could look at if we merged all three files. There are a couple of exercises below that will involve doing exactly that. But in case you just want to practice with the combined data, here’s a link to a file with all three years together. Or just get everything as a .zip archive, including a partial codebook and a script.

### 0. for next week

Keep your dataset in mind when working through these notes. Then try to make a list of things you want to do to the data. For example: recode or plot some of the variables, maybe reshape it with a melt() command, make a frequency table for a couple variables of interest, etc.

Try to make the list manageable – identify maybe five operations total. Then next week, we’ll implement those operations and build a nice lil output document from an R-Markdown file (which is just a fancy type of R script, basically).

### 1. getting the data and defining the problem

We want to see if the wording of a question has an effect on how people (identified by an id) respond to it. The (categorical) grouping variable that determines what kind of wording they got is form, which takes the values 1 or 2 (which aren’t categorical – we’ll recode them so that they are).

We have three questions with alternative wordings, which we’d like to investigate (manipulated between-subjects, across the levels of form). The questions of interest are:

We are faced with many problems in this country, none of which can be solved easily or inexpensively. I’m going to name some of these problems, and for each one I’d like you to tell me whether you think we’re spending too much money on it, too little money, or about the right amount…

[A]re we spending too much, too little, or about the right amount on (ITEM)?

where (ITEM) is one of the following:

• natcity: [A]re we spending too much, too little, or about the right amount on…
  • …SOLVING PROBLEMS OF BIG CITIES (if form==1, response is in natcity)
  • …ASSISTANCE TO BIG CITIES (if form==2, response is in natcityy)
• natrace: [A]re we spending too much, too little, or about the right amount on…
  • …IMPROVING THE CONDITIONS OF BLACKS (if form==1, response is in natrace)
  • …ASSISTANCE TO BLACKS (if form==2, response is in natracey)
• natfare: [A]re we spending too much, too little, or about the right amount on…
  • …WELFARE (if form==1, response is in natfare)
  • …ASSISTANCE TO THE POOR (if form==2, response is in natfarey)

An important question: should we be interested in variation across form for each individual questionnaire item, or can we consider each question an “item” in the sense that a “trial” instantiating it can appear in multiple “conditions” (here form==1 or form==2)? This really depends on our purpose, which here is to demonstrate how to manipulate data. In other words, the question is important if our goals are scientific, but not so much presently…

Since the form==2 questions for nat* all involve the same wording – namely “assistance to …” – maybe it makes sense to consider form==1 an indication of a “variable” that we could call something like non-uniform wording. Or maybe a slightly better option would be to view form as encoding an abstract variable describable as (roughly!) “contains the word assistance in the questionnaire text.” Viewed this way, instead of having $$n=2867$$ observations of three different outcome variables, we would now have $$n=2867\times3=8601$$ observations of a single outcome variable.

So what we want to do is to investigate relationships between item responses and other characteristics of the data that responses are “nested” inside of: which of the three items it is; which wording the respondent got (form 1 or form 2); demographic characteristics of the respondents; etc.

a note on the importance of domain-specific knowledge: One could certainly argue that the notion of a single, underlying “difference in wording” between the nat* and nat*y variables (in our subset of the GSS) isn’t very well motivated. It’s not like the nat* versions all have more words than their nat*y counterparts, for example… These are the types of issues where domain-specific knowledge/expertise is really important. Since I’m not super well-versed in the GSS, my strategy might not be the best one for the particular problem we’re looking at (particularly if we’re fitting statistical models to the data and making questionable assumptions about what the nature of form is). But nevertheless, doing things the way we’re going to do them will illustrate a pretty decent sample of the types of data-munging and visualization tasks you’ll likely have to deal with. This is all to say the following: hacking skills are very useful, but are absolutely not a substitute for nuanced understanding of the subject matter a dataset is “about.” Ability to manipulate, transform, and model data is a complementary skill to the theorizing that makes up the foundation of all scientific fields. (FWIW, I’d say it’s questionable at best whether there is truly something both interesting and uniform being manipulated across the levels of form, at least if we look at just these three items. I don’t think that the survey creator’s intent was to manipulate the items in this way, but my background in psycholinguistic experimentation kind of forces me to think of everything as an experiment with some kind of factorial design…)

Okay, back to something concrete now!

### 2. developing a gameplan

First we’ll need to read in the dataset and do some basic cleaning operations to make it easier to work with (e.g. recoding vars that need it). After that, we need to map out a gameplan for getting the data in the format we want it in.

How do we determine what the format is that we want it in? I think a good strategy is to think visually: what kind of graph could you imagine making from the data? Once you have an idea of that, work backwards and try to figure out what shape (e.g. columns) the data would need to have to create that graph. This definitely requires practice! And in fact, the more you practice plotting, the better intuitions you’ll have about how to get started cleaning/transforming a dataset.

Here’s a gameplan that came to my mind (and again, heed above disclaimer on domain knowledge): to investigate the impact of wording, item, etc. on responses, we’ll need to have our outcome variable response as a single column of the dataset. This column will have the possible values "too little", "about right", and "too much". But there’s a problem: the outcome is currently spread across six columns… What’d be nice would be to have it instead as two columns: one indicating the question (i.e. natcity or natrace or natfare), and one indicating the response (i.e. 0, 1, 2, 3, 8, or 9 – we’ll recode 1, 2, and 3 as "too little", "about right", and "too much", and treat the non-answer codes 0, 8, and 9 as NA). Since we already checked that form encodes wording type, we can safely just remove the y suffixes from the columns of interest and use form to represent “the wording manipulation.”

So how do we arrive at our target format?

There are many ways. Here’s a natural one to implement, where dat16 is the dataset in the file datasets/gss_2016_test.csv:

• Step 1. expand each row into six separate rows of ncol(dat16) - 6 + 2 columns apiece, where the final six columns are deleted and replaced by two new columns item and response such that for each set of six rows i:(i+5) (all of whose values in columns 1:(ncol(dat16)-6) are identical to the original value in the unexpanded row i):
  • dat16$item[i:(i+5)] is the length-six character vector consisting of the names of the columns we’re chopping off (i.e. c("natcity","natcityy","natrace","natracey","natfare","natfarey")); and
  • dat16$response[i:(i+5)] contains the value of the chopped-off column that dat16$item specifies for that row.
• Step 2. delete all rows $$i$$ for which dat16$response[i] %in% c(0,8,9) (which will all be coded as NA), since:
  • half of all the rows we generated in Step 1 will have response 0 (not applicable), since the respondent didn’t provide an answer to the corresponding dat16$item – in other words, these aren’t real data points in the first place; and
  • we can’t use the other non-answer categories anyway; for present purposes, they are dead to us.

After Step 2, we’ll have erased all the rows with empty responses, so what we’re left with is a dataset where dat16$item has one set of values for one level of form, and another set of distinct but corresponding values for the other level of form. But conceptually, the situation is that there’s only three values of dat16$item – not six – and that dat16$form tells us what kind of wording a particular subject got for all three items. This justifies identifying "nat*y" values in dat16$item with their corresponding "nat*" values. In other words,

• Step 3. we delete all the trailing "y"’s from the dat16$item column.

note: if we’re being pedantic, we shouldn’t just delete all "y"’s, since "natcityy" has one that isn’t part of the original coding scheme (and hence we’re not targeting it for off-chopping!). In practice we’re going to, but see exercise on regular expressions below.
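To make the pedantic route concrete, here’s a sketch on a toy vector of item names (not the real data) using a regex anchor, so only a trailing "y" on the known form==2 names gets chopped:

```r
# toy item names; only the form==2 variants carry the suffix "y"
items <- c("natcity", "natcityy", "natrace", "natracey", "natfare", "natfarey")

# the sloppy way: remove *every* "y" (both variants collapse to e.g. "natcit",
# which still identifies the item, just with an ugly name)
gsub("y", "", items)

# the careful way: strip a trailing "y" (regex anchor "$") only from the
# names we know are suffixed, leaving "natcity" itself untouched
suffixed <- c("natcityy", "natracey", "natfarey")
ifelse(items %in% suffixed, sub("y$", "", items), items)
```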

Now our data is (are?!) really really beautiful and we’re ready to start exploring it (although in the colloquial sense I’d say we’ve already explored it quite a bit!).

note: the steps we mapped out above were motivated not just by intuition, but also by the concept of “tidy data,” championed by Hadley Wickham et al. If you’re interested, I’d highly recommend reading this excellent paper (.pdf here), which is kind of a manifesto for a certain (efficient) kind of “tidiness” through all phases of a data-related endeavor.

We also want to see if demographic variables modulate the effect of wording on response. To do this, we’ll need to first clean up several columns, e.g. age, sex, and INCOME16. These are pretty self-explanatory demographic concepts, but other than age we don’t really know the correspondence between the numeric codes in the data and the real-world values they represent. To make the correspondence visible, we’ll recode each of them to values we can look at and immediately understand.

### 3. reading and cleaning and transforming the data

#### 3.1 read in data and inspect

# read in the data and do some initial inspection
dat <- read.csv("datasets/gss_2016_test.csv", stringsAsFactors=FALSE)

# check for na's
# (none, yay!...but are there *really* none?! to be continued...)
colSums(is.na(dat))
##       id     year     form      sex      age racecen1 racecen2 INCOME16
##        0        0        0        0        0        0        0        0
##  natcity natcityy  natrace natracey  natfare natfarey
##        0        0        0        0        0        0
str(dat)
## 'data.frame':    2867 obs. of  14 variables:
##  $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ year    : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ form    : int  1 1 1 1 2 1 1 1 1 2 ...
##  $ sex     : int  1 1 1 2 2 2 1 2 1 1 ...
##  $ age     : int  47 61 72 43 55 53 50 23 45 71 ...
##  $ racecen1: int  1 1 1 1 1 1 1 16 2 1 ...
##  $ racecen2: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ INCOME16: int  26 19 21 26 26 20 26 16 20 20 ...
##  $ natcity : int  2 8 8 2 0 1 2 1 2 0 ...
##  $ natcityy: int  0 0 0 0 1 0 0 0 0 2 ...
##  $ natrace : int  1 1 2 2 0 1 1 2 8 0 ...
##  $ natracey: int  0 0 0 0 8 0 0 0 0 3 ...
##  $ natfare : int  2 8 8 3 0 1 3 3 3 0 ...
##  $ natfarey: int  0 0 0 0 1 0 0 0 0 2 ...
head(dat, n=5)
##   id year form sex age racecen1 racecen2 INCOME16 natcity natcityy natrace
## 1  1 2016    1   1  47        1        0       26       2        0       1
## 2  2 2016    1   1  61        1        0       19       8        0       1
## 3  3 2016    1   1  72        1        0       21       8        0       2
## 4  4 2016    1   2  43        1        0       26       2        0       2
## 5  5 2016    2   2  55        1        0       26       0        1       0
##   natracey natfare natfarey
## 1        0       2        0
## 2        0       8        0
## 3        0       8        0
## 4        0       3        0
## 5        8       0        1

exercise: use the commented code in the next chunk to make formatted tables, and compile an R-markdown document to print them nicely. You’ll need to do some googling – start with “formatted tables R-Markdown knitr kable html”. Next week how to do this will be made completely explicit, so feel free to skip for now :)

# to make nicer-looking tables:
# knitr::kable(t(as.table(colSums(is.na(dat)))))
# knitr::kable(table(dat$form), col.names=c("form","count"))
# knitr::kable(table(dat$sex), col.names=c("sex", "count"))

#### 3.2 basic cleanup and recoding

# now check + clean up + transform columns as necessary

# id is an identifier, not a number
dat$id <- paste0("resp", dat$id)

# form has just two levels
table(dat$form)
##
##    1    2
## 1437 1430

# so we can recode it with ifelse()
dat$form <- ifelse(dat$form==1, "standard", "assistance")

# sex has just two levels
table(dat$sex)
##
##    1    2
## 1276 1591
# so we can recode it with ifelse()
dat$sex <- ifelse(dat$sex==1, "male", "female")

# income has lots of different values
# ALSO BTW YAY FIRST PLOT!
barplot(table(dat$INCOME16))

# so we'll have to recode it in a more mechanical fashion...

# one option:
# recode income to hi or mid or lo, according to external values
lo <- 1:12; mid <- 13:21; hi <- 22:26
dat$income_cat <- ifelse(
  dat$INCOME16 %in% lo, "lo", ifelse(
    dat$INCOME16 %in% mid, "mid", ifelse(
      dat$INCOME16 %in% hi, "hi", NA
    )
  )
); rm(lo); rm(mid); rm(hi)

# another option:
# read in external vals as a table, and just sub in values for numeric codes
income <- read.csv(
  "datasets/lookup_income.csv", header=FALSE,
  col.names=c("num","val"), stringsAsFactors=FALSE
)

# **note**:
# the following three lines won't feel intuitive right away,
# but try to break down what's happening step by step.
# this recoding method is very general and very useful.

# make a vector with values equal to income$val, the desired values
income_lkup        <- as.vector(income$val)
# set the names of that vector to the current values, i.e. income$num
names(income_lkup) <- income$num
# now recode income by matching the current values w the lookup table names
dat$income_range <- income_lkup[match(dat$INCOME16, names(income_lkup))]
# remove the original income col
dat$INCOME16 <- NULL
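If the match()-based recode feels opaque, here it is again on a tiny made-up example (the codes and labels are hypothetical, not from the GSS):

```r
# a named vector: the names are the old codes, the values the new labels
lkup <- c("1"="red", "2"="green", "3"="blue")

# some raw codes to recode (7 deliberately has no entry in the table)
codes <- c(3, 1, 1, 2, 7)

# match() finds each code's position among the names; indexing by those
# positions pulls out the corresponding labels
lkup[match(codes, names(lkup))]
# any unmatched code (like 7) comes back as NA -- usually what you want
```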

note: semicolons just “end” the current statement – they’re used to put a small set of similar commands on a single line to save space and emphasize that they’re all involved in a single operation.

note: notice that when recoding income, we assign the new values to the new column income_range, instead of the existing one, INCOME16. This was deliberate: if we do it this way, then we’ll (temporarily) have both columns to look at, so we can check our work by comparing the old column to the new one.
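One concrete way to do that comparison (a sketch with made-up codes, not the real columns): cross-tabulate old against new and confirm each old code maps to exactly one new label:

```r
# hypothetical old numeric codes and the new labels we recoded them to
old <- c(3, 12, 15, 23, 23, 8)
new <- c("lo", "lo", "mid", "hi", "hi", "lo")

# each row (one old code) should have exactly one nonzero cell
table(old, new)
```

With the real data, the analogous check before dropping the old column would be something like table(dat$income_cat, dat$income_range, useNA="ifany").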

exercise: make a new column, derived from income_range, that contains the same information as income_cat (which was computed directly from INCOME16)

#### 3.3 a slightly more complicated case

# let's check out the race cols again
table(dat$racecen1, useNA="always")
##
##    1    2    3    4    5    6    7    8    9   10   11   12   14   15   16
## 2088  487   58   25   19   22    7    3    4   10    1    1    2   13  103
##   98   99 <NA>
##    9   15    0
table(dat$racecen2, useNA="always")
##
##    0    1    2    3    4    5    6    7    8    9   10   13   14   15   16
## 2617   35   24  140    1    2    3    3    3    1    4    3    3    4    8
##   98   99 <NA>
##    1   15    0
# kinda messy...let's just extract the census's notion of "ethnicity"
# (i.e. hispanic or non-hispanic)

# first read in our associative array of keys and values
race <- read.csv(
  "datasets/lookup_race.csv",   # (filename assumed -- mirrors lookup_income.csv)
  header=FALSE, col.names=c("num","val"),
  stringsAsFactors=FALSE
)

race_lkup        <- as.vector(race$val)
names(race_lkup) <- race$num

dat$racecen1 <- ifelse(dat$racecen1 %in% c(0,98,99), NA, dat$racecen1)
dat$racecen2 <- ifelse(dat$racecen2 %in% c(0,98,99), NA, dat$racecen2)

dat$racecen1 <- race_lkup[match(dat$racecen1, names(race_lkup))]
dat$racecen2 <- race_lkup[match(dat$racecen2, names(race_lkup))]

dat$hisp <- ifelse(dat$racecen1=="Hispanic", "hispanic", "nonhispanic")
dat$hisp[dat$racecen2=="Hispanic"] <- "hispanic"

# now we can safely eliminate the racecen1/2 columns
dat$racecen1 <- NULL; dat$racecen2 <- NULL

exercise: if we wanted to investigate the effect of “race” on responses, how would/should/could we represent “race”? In other words, how many columns, what is in each column, etc. etc. What would be some benefits and drawbacks of two competing approaches?

#### 3.4 reorganize columns as desired, toss unused ones

# now we have everything except the outcome vars
# reorganize the df the way we want it

# vector of the demographic columns we want to keep around
dem_cols  <- c("id", "year", "form", "sex", "age", "income_cat", "hisp")

# vector of the questionnaire items
item_cols <- names(dat)[startsWith(names(dat), "nat")]

# subsetting the data by columns, keeping just dem_cols and item_cols
dat <- dat[, c(dem_cols, item_cols)]

#### 3.5 creating the desired outcome variable with reshape2::melt()

Now we’ll implement the strategy from Section 2 above for representing the responses. This involves a process called “melting” the dataset. The result is moving from the current “wide format” to a “long format” dataset. Google around and you’ll find some nice explainers quickly.

The terminology should become more clear as you see more instances of this very common transformation.
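To see the move in isolation before we do it for real, here’s melt() on a tiny made-up data frame (names are hypothetical):

```r
library("reshape2")

# wide format: one row per respondent, one column per question
toy <- data.frame(id=c("a","b"), q1=c(1, 2), q2=c(3, 4))

# long format: hold id constant; q1/q2 collapse into a "variable" column
# plus a "value" column (which we'd then rename, e.g. to item/response)
melt(toy, id.vars="id")
#   id variable value
# 1  a       q1     1
# 2  b       q1     2
# 3  a       q2     3
# 4  b       q2     4
```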

# NATFARE/NATFAREY/NATCITY/NATCITYY/NATRACE/NATRACEY
#    9        "No answer"
#    8        "Don't know"
#    3        "Too much"
#    2        "About right"
#    1        "Too little"
#    0        "Not applicable"

# melt() comes from the package reshape2::
library("reshape2")

# hold dem_cols (see above) constant, collapse the other columns
dat <- melt(dat, id.vars=dem_cols)

# recode the new "column-column" as character (instead of factor)
dat$variable <- as.character(dat$variable)

# rename the column-column ("variable") and the response-column ("value")
names(dat)[names(dat)=="variable"] <- "item"
names(dat)[names(dat)=="value"]    <- "response"

# check out what we have now:
knitr::kable(head(dat, n=10))
id year form sex age income_cat hisp item response
resp1 2016 standard male 47 hi nonhispanic natcity 2
resp2 2016 standard male 61 mid nonhispanic natcity 8
resp3 2016 standard male 72 mid nonhispanic natcity 8
resp4 2016 standard female 43 hi nonhispanic natcity 2
resp5 2016 assistance female 55 hi nonhispanic natcity 0
resp6 2016 standard female 53 mid nonhispanic natcity 1
resp7 2016 standard male 50 hi nonhispanic natcity 2
resp8 2016 standard female 23 mid hispanic natcity 1
resp9 2016 standard male 45 mid nonhispanic natcity 2
resp10 2016 assistance male 71 mid nonhispanic natcity 0

exercise: how does the line of code renaming variable to item work? Try the command "item" -> names(dat)[names(dat)=="variable"]. What is the symbol -> and how is it related to <-?

# toss all the rows w/o real answers
# (don't forget the comma before the closing square bracket -- dat is a df!)
dat <- dat[!dat$response %in% c(9, 8, 0), ]

# make a lookup table for the responses
val_lkup <- c("1"="too little", "2"="about right", "3"="too much")

# recode responses as character
dat$response <- as.character(val_lkup[match(dat$response, names(val_lkup))])

# look at a table
knitr::kable(table(dat$response, useNA="ifany"), col.names=c("resp", "freq"))
resp freq
about right 2781
too little 3516
too much 1662

note: the “backtick” syntax (`` ` ``) allows you to use numbers, spaces, and other stuff in a name that you wouldn’t normally be able to say directly without a syntax error. For example if you had a column of dat called favorite color, you couldn’t say dat$favorite color because there’s a space in the column name. But you could say dat$`favorite color` to refer to that column.
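A quick sketch on a throwaway data frame (check.names=FALSE stops R from mangling the space in the name):

```r
# a column whose name contains a space -- only reachable via backticks
df <- data.frame(
  "favorite color" = c("red", "blue"),
  check.names = FALSE, stringsAsFactors = FALSE
)

df$`favorite color`
# -> "red" "blue"
```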

Recall Step 3 in our gameplan above – we want just three unique vals in item. But currently we have six:

knitr::kable(table(dat$item, useNA="ifany"), col.names=c("item", "freq"))

item freq
natcity 1315
natcityy 1277
natfare 1385
natfarey 1403
natrace 1294
natracey 1285

A simple – albeit kind of sloppy – solution would be to just remove all occurrences of the letter “y” from the item column. Let’s just do that (see exercise though):

# for now we'll just toss all y's to make life easier
dat$item <- gsub("y", "", dat$item)

# confirm that we have just three vals now
knitr::kable(table(dat$item, useNA="ifany"), col.names=c("item", "freq"))
item freq
natcit 2592
natfare 2788
natrace 2579

exercise (important): Remove all trailing "y"’s from the dat$item column. (hint: use the regular expressions cheatsheet on the course page. another hint-slash-exercise: google for examples of gsub(), grep(), grepl(), and gregexpr() in action. Also peep their cousins sub() and regexpr() too, though they’re not quite as useful imo. If this sounds too painful, peruse the intro vignette for the stringr:: package instead.)

exercise: what if anything could or would go wrong if we just chopped off all the "y"’s in the current situation? What conditions would definitely cause this to be a bad idea?

### 4. summarizing and plotting the data

#### 4.1 summarizing data with a variety of methods

Here’s a variety of ways we can summarize the data, using a few different strategies.

# many ways to summarize data by groups!

# frequency tables
table(dat$sex, dat$response)
##
##          about right too little too much
##   female        1488       1986      886
##   male          1293       1530      776
table(dat$income_cat, dat$response)
##
##       about right too little too much
##   hi          678        718      373
##   lo          472        658      252
##   mid        1387       1821      874

# tables can have arbitrary dimensions, but even three is tough to parse...
# table(dat$sex, dat$income_cat, dat$response)

# cross-tabulate
xtabs( ~ sex + income_cat, data=dat)
##         income_cat
## sex        hi   lo  mid
##   female  816  779 2297
##   male    953  603 1785
xtabs( ~ response + sex + income_cat, data=dat)
## , , income_cat = hi
##
##              sex
## response      female male
##   about right    294  384
##   too little     368  350
##   too much       154  219
##
## , , income_cat = lo
##
##              sex
## response      female male
##   about right    269  203
##   too little     386  272
##   too much       124  128
##
## , , income_cat = mid
##
##              sex
## response      female male
##   about right    769  618
##   too little    1027  794
##   too much       501  373
# marginal proportions
# **exercise** try setting margin=2
prop.table(xtabs( ~ response + income_cat, data=dat), margin=1)
##              income_cat
## response             hi        lo       mid
##   about right 0.2672448 0.1860465 0.5467087
##   too little  0.2245855 0.2058180 0.5695965
##   too much    0.2488326 0.1681121 0.5830554
prop.table(table(dat$response, dat$income_cat), margin=1)
##
##                      hi        lo       mid
##   about right 0.2672448 0.1860465 0.5467087
##   too little  0.2245855 0.2058180 0.5695965
##   too much    0.2488326 0.1681121 0.5830554
# can use formula interface with aggregate()
aggregate(age ~ sex + income_cat, FUN="mean", data=dat)
##      sex income_cat      age
## 1 female         hi 49.33946
## 2   male         hi 49.13641
## 3 female         lo 49.44673
## 4   male         lo 48.65008
## 5 female        mid 48.68655
## 6   male        mid 48.14230
aggregate(age ~ sex + income_cat, FUN="sd", data=dat)
##      sex income_cat      age
## 1 female         hi 14.45683
## 2   male         hi 16.25358
## 3 female         lo 19.11131
## 4   male         lo 18.48916
## 5 female        mid 18.02443
## 6   male        mid 17.39359
aggregate(age ~ sex + income_cat, FUN=function(x) round(mean(x)), data=dat)
##      sex income_cat age
## 1 female         hi  49
## 2   male         hi  49
## 3 female         lo  49
## 4   male         lo  49
## 5 female        mid  49
## 6   male        mid  48
# write a custom function for aggregating
custom_summary <- function(x){
c(mean=round(mean(x)), sd=round(sd(x)), min=min(x), max=max(x))
}
aggregate(age ~ sex + income_cat, FUN=custom_summary, data=dat)
##      sex income_cat age.mean age.sd age.min age.max
## 1 female         hi       49     14      18      99
## 2   male         hi       49     16      18      99
## 3 female         lo       49     19      19      89
## 4   male         lo       49     18      18      89
## 5 female        mid       49     18      18      99
## 6   male        mid       48     17      18      99

# and finally, my drug of choice is dplyr::summarize()/group_by()
library("dplyr")
dat %>% group_by(sex, income_cat) %>% summarize(
  num_obs = length(response),
  num_toolittle = sum(response=="too little", na.rm=TRUE),
  prop_toolittle = num_toolittle / num_obs,
  num_aboutright = sum(response=="about right", na.rm=TRUE),
  prop_aboutright = num_aboutright / num_obs,
  num_toomuch = sum(response=="too much", na.rm=TRUE),
  prop_toomuch = num_toomuch / num_obs
  # the mutate_if() is just to make it display nicer
) %>% mutate_if(is.numeric, round, digits=3) %>% data.frame()
##      sex income_cat num_obs num_toolittle prop_toolittle num_aboutright
## 1 female         hi     816           368          0.451            294
## 2 female         lo     779           386          0.496            269
## 3 female        mid    2297          1027          0.447            769
## 4 female       <NA>     468           205          0.438            156
## 5   male         hi     953           350          0.367            384
## 6   male         lo     603           272          0.451            203
## 7   male        mid    1785           794          0.445            618
## 8   male       <NA>     258           114          0.442             88
##   prop_aboutright num_toomuch prop_toomuch
## 1           0.360         154        0.189
## 2           0.345         124        0.159
## 3           0.335         501        0.218
## 4           0.333         107        0.229
## 5           0.403         219        0.230
## 6           0.337         128        0.212
## 7           0.346         373        0.209
## 8           0.341          56        0.217
dat %>% group_by(item, form) %>% summarize(
  num_obs = length(response),
  num_toolittle = sum(response=="too little", na.rm=TRUE),
  prop_toolittle = num_toolittle / num_obs,
  num_aboutright = sum(response=="about right", na.rm=TRUE),
  prop_aboutright = num_aboutright / num_obs,
  num_toomuch = sum(response=="too much", na.rm=TRUE),
  prop_toomuch = num_toomuch / num_obs
) %>% mutate_if(is.numeric, round, digits=3) %>% data.frame()
##      item       form num_obs num_toolittle prop_toolittle num_aboutright
## 1  natcit assistance    1277           309          0.242            529
## 2  natcit   standard    1315           683          0.519            455
## 3 natfare assistance    1403          1002          0.714            301
## 4 natfare   standard    1385           316          0.228            477
## 5 natrace assistance    1285           534          0.416            535
## 6 natrace   standard    1294           672          0.519            484
##   prop_aboutright num_toomuch prop_toomuch
## 1           0.414         439        0.344
## 2           0.346         177        0.135
## 3           0.215         100        0.071
## 4           0.344         592        0.427
## 5           0.416         216        0.168
## 6           0.374         138        0.107

exercise: what does the following command yield, and why does it yield that? Does this make sense as a summary for some combination of our variables? command: xtabs(age ~ sex + income_cat, data=dat)

note: you’ll see the formula interface used in xtabs() in a few different areas in R – most notably in regression modeling. The form is usually y ~ x1 + ... + xn, with y being the outcome variable and the xi’s being predictors.
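For a taste of the same formula mini-language in a modeling context, here’s a sketch using R’s built-in iris data rather than ours:

```r
# outcome on the left of "~", predictors on the right -- same shape as
# the formulas we passed to xtabs() and aggregate() above
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data=iris)
coef(fit)
```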

Now let’s get to plotting, folks!!!

#### 4.2 plotting with base functions

There are several base functions for making plots. They’re great for exploratory analysis because they’re so quick and easy. They can be customized endlessly, but if you want fine-grained control over appearance, you’re usually better off using ggplot().

# barplots made from a table
barplot(table(dat$sex))
barplot(table(dat$income_cat))

# basic scatterplot
plot(iris$Sepal.Width, iris$Sepal.Length)

# with color coding
plot(iris$Sepal.Width, iris$Sepal.Length, col=iris$Species)

xvar <- 1:10
yvar <- rnorm(xvar)

# line plot (nice for time-series)
plot(xvar, yvar, type="l")
# scatter
plot(xvar, yvar, type="p")
# both
plot(xvar, yvar, type="b")
# other options are "c", "o", "S"/"s", "h", "n"

# histograms for numeric univariate distributions
# this is probably the base plotting function i use the most
hist(dat$age)

exercise: merge the 2012, 2014, and 2016 data using a *_join() operation from dplyr:: (see the top-right of the second page of the data wrangling cheatsheet). Then make a line-plot where the x-axis is year, and the y-axis is proportion “too much” responses across all three items.

exercise: after doing the previous exercise, read in the file called "datasets/gss_complete_test.csv" and make the corresponding plot. Does it look identical to yours? (hint: it should!)

#### 4.3 plotting with ggplot()

Alright, bringing out the big guns now. ggplot2:: has its own syntax because you interact with its plotting capabilities via a small domain-specific language (DSL) – a mini-language that’s deliberately limited in scope because it’s only meant to do one thing. For example, the formula interface we saw with aggregate() can be considered a DSL. The same formula-language (with some extra capabilities) is also used for most of R’s regression modeling functions, e.g. lm(), glm(), lme4::lmer(), and lme4::glmer().

Before plotting, we’ll make a couple of summary datasets, which will make some aspects easier moving forward. There are many ways to do this.

exercise: In the chunk below, I start to make a summary dataset using base functions like aggregate(). This can be tedious but is very important to know how to do. I’m therefore leaving most of it as an exercise. Essentially, here’s the goal: inspect the code that creates pdat (see week 2 appendix for background on pipe-chains with %>% if you want to understand what’s going on here), figure out exactly what it does, and then finish writing the code at the beginning of the chunk, which is meant to create a summary data frame with the same information as pdat has. You can check your work by confirming that your summary data frame is identical to pdat (rounding aside!).

# for each group defined by the grouping vars, how many observations?
num_obs  <- aggregate(response ~ sex + income_cat + item + form, data=dat, FUN="length")
# and what's the mean age?
mean_age <- aggregate(age ~ sex + income_cat + item + form, data=dat, FUN="mean")
# how many of the 'response' vals are "too little"? in each group
prop_toolittle <- aggregate(response=="too little" ~ sex + income_cat + item + form, data=dat, FUN="sum")
# change the name of the column we calculated
names(prop_toolittle)[5] <- "too_little"
# now use it and num_obs to get the proportion of too_little responses
prop_toolittle$too_little <- prop_toolittle$too_little / num_obs$response
# ...
# ...
# ...

hint to exercise: you’ll probably need to create some “intermediate” data frames like mean_age and prop_toolittle, then figure out how to put them together to get something that looks like pdat below:

# a very nice, compact, and efficient way to build the summary data frame
pdat <- dat %>% group_by(sex, income_cat, item, form) %>% summarize(
  num_obs = length(response),
  mean_age = mean(age, na.rm=TRUE),
  prop_toolittle = sum(response=="too little", na.rm=TRUE) / num_obs,
  prop_aboutright = sum(response=="about right", na.rm=TRUE) / num_obs,
  prop_toomuch = sum(response=="too much", na.rm=TRUE) / num_obs
) %>% mutate_if(is.numeric, round, digits=3) %>%
  melt(measure.vars=c("prop_toolittle","prop_aboutright","prop_toomuch")) %>%
  data.frame()

# make sure income is ordered how we want it (the others don't matter)
pdat$income_cat <- factor(pdat$income_cat, levels=c("lo","mid","hi"))

exercise: rewrite the summary code by combining aggregate() with a function that you created (use the example with custom_summary() above as a starting point).

The best way to learn ggplot is by example, so here we go. Note that some of the plots below are more useful than others, but they’re all valid examples of stuff you might have/want to do eventually.

library("ggplot2")

# scatterplots

# regular
ggplot(pdat, aes(x=mean_age, y=value)) +
geom_point()

# color-coded
ggplot(pdat, aes(x=mean_age, y=value, color=variable)) +
geom_point()

ggplot(pdat, aes(x=mean_age, y=value, color=income_cat)) +
geom_point()

# size-coded
ggplot(pdat, aes(x=mean_age, y=value, size=num_obs)) +
geom_point()

# size- and color-coded
ggplot(pdat, aes(x=mean_age, y=value, size=num_obs, color=income_cat)) +
geom_point()

exercise: how could we get rid of the NA’s in the plot with color coding for income_cat?

Now let’s look at how the responses are distributed for individual items:

# first group the data by item and get proportions for each response
boosh <- dat %>% group_by(item) %>% summarize(
prop_toolittle  = sum(response=="too little", na.rm=TRUE) / n(),
prop_toomuch = sum(response=="too much", na.rm=TRUE) / n()
) %>% melt(id.vars="item")

# now make some plots
ggplot(boosh, aes(x=item, y=value, fill=variable)) + geom_bar(stat="identity")
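One small aside: geom_col() is shorthand for geom_bar(stat="identity"), and position= controls stacked vs. side-by-side bars. Here’s a sketch on a toy stand-in for boosh (made-up numbers, just to show the shape):

```r
library("ggplot2")

# hypothetical stand-in for boosh: one row per item x response-type
toy <- data.frame(
  item     = rep(c("natcit", "natfare", "natrace"), each=2),
  variable = rep(c("prop_toolittle", "prop_toomuch"), times=3),
  value    = c(.50, .20, .40, .30, .45, .15)
)

# geom_col() == geom_bar(stat="identity"); "dodge" puts bars side by side
p <- ggplot(toy, aes(x=item, y=value, fill=variable)) +
  geom_col(position="dodge")
p
```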