In these notes we’ll be looking at some sample data from the General Social Survey (GSS), a survey measuring American opinion on a variety of issues. The survey has been conducted biennially by NORC for many years. Many thanks to Ben Schapiro for prepping this stuff! It’s very cool data, and quite good for illustrating many common data wrangling tasks. After some discussion about planning in Sections 1-2, we start writing actual code to manipulate the data in Section 3. Feel free to skip directly to Section 3 if you just want to practice coding.
note on the datasets: we have sample data from three years: 2012, 2014, and 2016. You can find the corresponding datasets here (2012), here (2014), and here (2016). In these notes we’ll look just at the 2016 data, but there’s so much more we could look at if we merged all three files. There are a couple of exercises below that will involve doing exactly that. But in case you just want to practice with the combined data, here’s a link to a file with all three years together. Or just get everything as a .zip archive, including a partial codebook and a script.
Keep your dataset in mind when working through these notes. Then try to make a list of things you want to do to the data. For example: recode or plot some of the variables, maybe reshape it with a `melt()` command, make a frequency table for a couple variables of interest, etc.
Try to make the list manageable – identify maybe five operations total. Then next week, we’ll implement those operations and build a nice lil output document from an R-Markdown file (which is just a fancy type of R script, basically).
We want to see if the wording of a question has an effect on how people (identified by an `id`) respond to it. The (categorical) grouping variable that determines what kind of wording they got is `form`, which takes the values 1 or 2 (which aren’t categorical – we’ll recode them so that they are). We have three questions with alternative wordings, which we’d like to investigate (manipulated between-subjects, across the levels of `form`). The questions of interest are:
We are faced with many problems in this country, none of which can be solved easily or inexpensively. I’m going to name some of these problems, and for each one I’d like you to tell me whether you think we’re spending too much money on it, too little money, or about the right amount…
[A]re we spending too much, too little, or about the right amount on (ITEM)?
where (ITEM) is one of the following:
- `natcity`: [A]re we spending too much, too little, or about the right amount on… (if `form==1`, the response is in `natcity`; if `form==2`, the response is in `natcityy`)
- `natrace`: [A]re we spending too much, too little, or about the right amount on… (if `form==1`, the response is in `natrace`; if `form==2`, the response is in `natracey`)
- `natfare`: [A]re we spending too much, too little, or about the right amount on… (if `form==1`, the response is in `natfare`; if `form==2`, the response is in `natfarey`)

An important question: should we be interested in variation across `form` for each individual questionnaire item, or can we consider each question an “item” in the sense that a “trial” instantiating it can appear in multiple “conditions” (here `form==1` or `form==2`)? This really depends on our purpose, which here is to demonstrate how to manipulate data. In other words, the question is important if our goals are scientific, but not so much presently…
Since the `form==2` questions for `nat*` all involve the same wording – namely “assistance to …” – maybe it makes sense to consider `form==1` an indication of a “variable” that we could call something like non-uniform wording. Or maybe a slightly better option would be to view `form` as encoding an abstract variable describable as (roughly!) “contains the word assistance in the questionnaire text.” Viewed this way, instead of having \(n=2867\) observations of three different outcome variables, we would now have \(n=2867\times3=8601\) observations of a single outcome variable.
So what we want to do is to investigate relationships between item responses and other characteristics of the data that responses are “nested” inside of: which of the three items it is; which wording the respondent got (form 1 or form 2); demographic characteristics of the respondents; etc.
a note on the importance of domain-specific knowledge: One could certainly argue that the notion of a single, underlying “difference in wording” between the `nat*` and `nat*y` variables (in our subset of the GSS) isn’t very well motivated. It’s not like the `nat*` versions all have more words than their `nat*y` counterparts, for example… These are the types of issues where domain-specific knowledge/expertise is really important. Since I’m not super well-versed in the GSS, my strategy might not be the best one for the particular problem we’re looking at (particularly if we’re fitting statistical models to the data and making questionable assumptions about the nature of `form`). But nevertheless, doing things the way we’re going to do them will illustrate a pretty decent sample of the types of data-munging and visualization tasks you’ll likely have to deal with. This is all to say the following: hacking skills are very useful, but they are absolutely not a substitute for a nuanced understanding of the subject matter a dataset is “about.” The ability to manipulate, transform, and model data is a skill complementary to the theorizing that makes up the foundation of all scientific fields. (FWIW, I’d say it’s questionable at best whether there is truly something both interesting and uniform being manipulated across the levels of `form`, at least if we look at just these three items. I don’t think the survey creators intended to manipulate the items in this way, but my background in psycholinguistic experimentation kind of forces me to think of everything as an experiment with some kind of factorial design…)
Okay, back to something concrete now!
First we’ll need to read in the dataset and do some basic cleaning operations to make it easier to work with (e.g. recoding vars that need recoding). After that, we need to map out a gameplan for getting the data into the format we want.
How do we determine what the format is that we want it in? I think a good strategy is to think visually: what kind of graph could you imagine making from the data? Once you have an idea of that, work backwards and try to figure out what shape (e.g. columns) the data would need to have to create that graph. This definitely requires practice! And in fact, the more you practice plotting, the better intuitions you’ll have about how to get started cleaning/transforming a dataset.
Here’s a gameplan that came to my mind (and again, heed the above disclaimer on domain knowledge): to investigate the impact of wording, item, etc. on responses, we’ll need to have our outcome variable `response` as a single column of the dataset. This column will have the possible values `"too little"`, `"about right"`, and `"too much"`. But there’s a problem: the outcome is currently spread across six columns… What’d be nice would be to have it instead as two columns: one indicating the question (i.e. `natcity` or `natrace` or `natfare`), and one indicating the response (i.e. `0`, `1`, `2`, `3`, `8`, or `9`, which we’ll recode as `"too little"` (1), `"about right"` (2), `"too much"` (3), and `NA` (0, 8, 9)). Since we already checked that `form` encodes wording type, we can safely just remove the `y` suffixes from the columns of interest and use `form` to represent “the wording manipulation.”
So how do we arrive at our target format?
There are many ways. Here’s a natural one to implement, where `dat16` is the dataset in the file `datasets/gss_2016_test.csv`:

1. “Melt” the data so that each original row becomes six rows of `ncol(dat16) - 6 + 2` columns apiece, where the final six columns are deleted and replaced by two new columns `item` and `response`, such that for each set of six rows `i:(i+5)` (all of whose values in columns `1:(ncol(dat16)-6)` are identical to the original value in the unexpanded row `i`):
   - `dat16$item[i:(i+5)]` is the length-six character vector consisting of the names of the columns we’re chopping off (i.e. `c("natcity","natcityy","natrace","natracey","natfare","natfarey")`); and
   - `dat16$response[i:(i+5)]` contains the value of the chopped-off column with the name that `dat16$item` specifies for that row.
2. Remove all rows where `dat16$response[i] %in% c(0,8,9)` (which will all be coded as `NA`), since:
   - a response of `0` means “not applicable” – the respondent didn’t provide an answer to the corresponding `dat16$item`, so these aren’t real data points in the first place; and
   - responses of `8` (“don’t know”) and `9` (“no answer”) aren’t substantive answers either.

   After Step 2, we’ll have erased all the rows with empty responses, so what we’re left with is a dataset where `dat16$item` has one set of values for one level of `form`, and another set of distinct but corresponding values for the other level of `form`. But conceptually, the situation is that there are only three values of `dat16$item` – not six – and `dat16$form` tells us what kind of wording a particular subject got for all three items. This justifies identifying `"nat*y"` values in `dat16$item` with their corresponding `"nat*"` values. In other words,
3. Chop the trailing `"y"`’s from the `dat16$item` column.

note: if we’re being pedantic, we shouldn’t just delete all `"y"`’s, since `"natcityy"` contains a y (the one in “city”) that isn’t part of the suffix coding scheme – and hence isn’t one we actually want to chop! In practice we’re going to delete it anyway, but see the exercise on regular expressions below.
Now our data is (are?!) really really beautiful and we’re ready to start exploring it (although in the colloquial sense I’d say we’ve already explored it quite a bit!).
note: the steps we mapped out above were motivated not just by intuition, but also by the concept of “tidy data,” championed by Hadley Wickham et al. If you’re interested, I’d highly recommend reading this excellent paper (.pdf here), which is kind of a manifesto for a certain (efficient) kind of “tidiness” through all phases of a data-related endeavor.
We also want to see if demographic variables modulate the effect of wording on response. To do this, we’ll need to first clean up several columns, e.g. `age`, `sex`, and `INCOME16`. These are pretty self-explanatory demographic concepts, but other than `age` we don’t really know the correspondence between the numeric codes in the data and the real-world values they represent. To make the correspondence visible, we’ll recode each of them to values we can look at and immediately understand.
# read in the data and do some initial inspection
dat <- read.csv("datasets/gss_2016_test.csv")
# check for na's
# (none, yay!...but are there *really* none?! to be continued...)
colSums(is.na(dat))
## id year form sex age racecen1 racecen2 INCOME16
## 0 0 0 0 0 0 0 0
## natcity natcityy natrace natracey natfare natfarey
## 0 0 0 0 0 0
str(dat)
## 'data.frame': 2867 obs. of 14 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ year : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ form : int 1 1 1 1 2 1 1 1 1 2 ...
## $ sex : int 1 1 1 2 2 2 1 2 1 1 ...
## $ age : int 47 61 72 43 55 53 50 23 45 71 ...
## $ racecen1: int 1 1 1 1 1 1 1 16 2 1 ...
## $ racecen2: int 0 0 0 0 0 0 0 0 0 0 ...
## $ INCOME16: int 26 19 21 26 26 20 26 16 20 20 ...
## $ natcity : int 2 8 8 2 0 1 2 1 2 0 ...
## $ natcityy: int 0 0 0 0 1 0 0 0 0 2 ...
## $ natrace : int 1 1 2 2 0 1 1 2 8 0 ...
## $ natracey: int 0 0 0 0 8 0 0 0 0 3 ...
## $ natfare : int 2 8 8 3 0 1 3 3 3 0 ...
## $ natfarey: int 0 0 0 0 1 0 0 0 0 2 ...
head(dat, n=5)
## id year form sex age racecen1 racecen2 INCOME16 natcity natcityy natrace
## 1 1 2016 1 1 47 1 0 26 2 0 1
## 2 2 2016 1 1 61 1 0 19 8 0 1
## 3 3 2016 1 1 72 1 0 21 8 0 2
## 4 4 2016 1 2 43 1 0 26 2 0 2
## 5 5 2016 2 2 55 1 0 26 0 1 0
## natracey natfare natfarey
## 1 0 2 0
## 2 0 8 0
## 3 0 8 0
## 4 0 3 0
## 5 8 0 1
exercise: use the commented code in the next chunk to make formatted tables, and compile an R-Markdown document to print them nicely. You’ll need to do some googling – start with “formatted tables R-Markdown knitr kable html”. Next week we’ll make all of this completely explicit, so feel free to skip it for now :)
# to make nicer-looking tables:
# knitr::kable(t(as.table(colSums(is.na(dat)))))
# knitr::kable(table(dat$form), col.names=c("form","count"))
# knitr::kable(table(dat$sex), col.names=c("sex", "count"))
# now check + clean up + transform columns as necessary
# `id` is an identifier, not a number
dat$id <- paste0("resp", dat$id)
# `form` has just two levels
table(dat$form)
##
## 1 2
## 1437 1430
# so we can recode it with `ifelse()`
dat$form <- ifelse(dat$form==1, "standard", "assistance")
# `sex` has just two levels
table(dat$sex)
##
## 1 2
## 1276 1591
# so we can recode it with `ifelse()`
dat$sex <- ifelse(dat$sex==1, "male", "female")
# income has lots of different values
# ALSO BTW YAY FIRST PLOT!
barplot(table(dat$INCOME16))
# so we'll have to recode it in a more mechanical fashion...
# one option:
# recode income to `hi` or `mid` or `lo`, according to external values
lo <- 1:12; mid <- 13:21; hi <- 22:26
dat$income_cat <- ifelse(
dat$INCOME16 %in% lo, "lo", ifelse(
dat$INCOME16 %in% mid, "mid", ifelse(
dat$INCOME16 %in% hi, "hi", NA
)
)
); rm(lo); rm(mid); rm(hi)
# another option:
# read in external vals as a table, and just sub in values for numeric codes
income <- read.csv(
"datasets/lookup_income.csv", header=FALSE, col.names=c("num","val"),
stringsAsFactors=FALSE
)
# **note**:
# the following three lines won't feel intuitive right away,
# but try to break down what's happening step by step.
# this recoding method is very general and very useful.
# make a vector with values equal to income$val, the desired values
income_lkup <- as.vector(income$val)
# set the names of that vector to the current values, i.e. income$num
names(income_lkup) <- income$num
# now recode income by matching the current values w the lookup table names
dat$income_range <- income_lkup[match(dat$INCOME16, names(income_lkup))]
# remove the original income col
dat$INCOME16 <- NULL
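If the lookup-vector trick still feels opaque, here’s the same idea on a tiny made-up example (`toy_lkup` and `toy_codes` are hypothetical, purely for illustration):

# a lookup vector: the *values* are what we want, the *names* are what we have
toy_lkup <- c(`1`="apple", `2`="banana")
toy_codes <- c(2, 1, 1, 2)
# `match()` finds the position of each code among the lookup vector's names...
match(toy_codes, names(toy_lkup))
## [1] 2 1 1 2
# ...and indexing the lookup vector by those positions performs the recode
toy_lkup[match(toy_codes, names(toy_lkup))]
##        2        1        1        2 
## "banana"  "apple"  "apple" "banana"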
note: semicolons just “end” the current command (not the whole line) – they’re used to put a small set of related commands on a single line, to save space and to emphasize that they’re all involved in a single operation.
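For example, the `rm()` cleanup line above could equivalently have been written out vertically:

# one line, three commands:
rm(lo); rm(mid); rm(hi)
# ...means exactly the same as writing:
rm(lo)
rm(mid)
rm(hi)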
note: notice that when recoding income, we assign the new values to a new column, `income_range`, instead of overwriting the existing one, `INCOME16`. This was deliberate: if we do it this way, then we’ll (temporarily) have both columns to look at, so we can check our work by comparing the old column to the new one.
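One quick way to do that check (commented out here because we already deleted `INCOME16` above – these would need to run before the `dat$INCOME16 <- NULL` line):

# every old numeric code should map to exactly one new label:
# table(dat$INCOME16, dat$income_range)
# or just eyeball a few rows side by side:
# head(dat[, c("INCOME16", "income_range")], n=10)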
exercise: make a new column, derived from `income_range`, that contains the same information as `income_cat` (which was computed directly from `INCOME16`).
# let's check out the race cols again
table(dat$racecen1, useNA="always")
##
## 1 2 3 4 5 6 7 8 9 10 11 12 14 15 16
## 2088 487 58 25 19 22 7 3 4 10 1 1 2 13 103
## 98 99 <NA>
## 9 15 0
table(dat$racecen2, useNA="always")
##
## 0 1 2 3 4 5 6 7 8 9 10 13 14 15 16
## 2617 35 24 140 1 2 3 3 3 1 4 3 3 4 8
## 98 99 <NA>
## 1 15 0
# kinda messy...let's just extract the census's notion of "ethnicity"
# (i.e. hispanic or non-hispanic)
# first read in our associative array of keys and values
race <- read.csv(
"datasets/lookup_race.csv", header=FALSE, col.names=c("num","val"),
stringsAsFactors=FALSE
)
race_lkup <- as.vector(race$val)
names(race_lkup) <- race$num
dat$racecen1 <- ifelse(dat$racecen1 %in% c(0,98,99), NA, dat$racecen1)
dat$racecen2 <- ifelse(dat$racecen2 %in% c(0,98,99), NA, dat$racecen2)
dat$racecen1 <- race_lkup[match(dat$racecen1, names(race_lkup))]
dat$racecen2 <- race_lkup[match(dat$racecen2, names(race_lkup))]
dat$hisp <- ifelse(
dat$racecen1=="Hispanic", "hispanic", "nonhispanic"
)
dat$hisp[dat$racecen2=="Hispanic"] <- "hispanic"
# now we can safely eliminate the `racecen1/2` columns
dat$racecen1 <- NULL; dat$racecen2 <- NULL
exercise: if we wanted to investigate the effect of “race” on responses, how would/should/could we represent “race”? In other words, how many columns, what is in each column, etc. etc. What would be some benefits and drawbacks of two competing approaches?
# now we have everything except the outcome vars
# reorganize the df the way we want it
# vector of the demographic columns we want to keep around
dem_cols <- c("id", "year", "form", "sex", "age", "income_cat", "hisp")
# vector of the questionnaire items
item_cols <- names(dat)[startsWith(names(dat), "nat")]
# subsetting the data by columns, keeping just `dem_cols` and `item_cols`
dat <- dat[, c(dem_cols, item_cols)]
reshape2::melt()
Now we’ll implement the strategy from Section 2 above for representing the responses. This involves a process called “melting” the dataset. The result is moving from the current “wide format” to a “long format” dataset. Google around and you’ll find some nice explainers quickly.
The terminology should become clearer as you see more instances of this very common transformation.
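To make the wide-to-long move concrete before we apply it to the real data, here’s a tiny made-up example (`toy_wide` is hypothetical, not from the GSS):

# a two-respondent, two-item toy dataset in wide format
toy_wide <- data.frame(id=c("resp1","resp2"), q1=c(1,2), q2=c(3,8))
# hold `id` constant, stack the remaining columns into variable/value pairs
reshape2::melt(toy_wide, id.vars="id")
##      id variable value
## 1 resp1       q1     1
## 2 resp2       q1     2
## 3 resp1       q2     3
## 4 resp2       q2     8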
# NATFARE/NATFAREY/NATCITY/NATCITYY/NATRACE/NATRACEY
# 9 "No answer"
# 8 "Don't know"
# 3 "Too much"
# 2 "About right"
# 1 "Too little"
# 0 "Not applicable"
# `melt()` comes from the package `reshape2::`
library("reshape2")
# hold `dem_cols` (see above) constant, collapse the other columns
dat <- melt(dat, id.vars=dem_cols)
# recode the new "column-column" as character (instead of factor)
dat$variable <- as.character(dat$variable)
# rename the column-column ("variable") and the response-column ("value")
names(dat)[names(dat)=="variable"] <- "item"
names(dat)[names(dat)=="value"] <- "response"
# check out what we have now:
knitr::kable(head(dat, n=10))
id | year | form | sex | age | income_cat | hisp | item | response |
---|---|---|---|---|---|---|---|---|
resp1 | 2016 | standard | male | 47 | hi | nonhispanic | natcity | 2 |
resp2 | 2016 | standard | male | 61 | mid | nonhispanic | natcity | 8 |
resp3 | 2016 | standard | male | 72 | mid | nonhispanic | natcity | 8 |
resp4 | 2016 | standard | female | 43 | hi | nonhispanic | natcity | 2 |
resp5 | 2016 | assistance | female | 55 | hi | nonhispanic | natcity | 0 |
resp6 | 2016 | standard | female | 53 | mid | nonhispanic | natcity | 1 |
resp7 | 2016 | standard | male | 50 | hi | nonhispanic | natcity | 2 |
resp8 | 2016 | standard | female | 23 | mid | hispanic | natcity | 1 |
resp9 | 2016 | standard | male | 45 | mid | nonhispanic | natcity | 2 |
resp10 | 2016 | assistance | male | 71 | mid | nonhispanic | natcity | 0 |
exercise: how does the line of code renaming `variable` to `item` work? Try the command `"item" -> names(dat)[names(dat)=="variable"]`. What is the symbol `->` and how is it related to `<-`?
# toss all the rows w/o real answers
# (don't forget the comma before the closing square bracket -- `dat` is a df!)
dat <- dat[!dat$response %in% c(9, 8, 0), ]
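# (aside: the comma matters because `dat[i, ]` selects *rows* i of a data
#  frame, whereas `dat[i]` -- with no comma -- selects *columns* i)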
# make a lookup table for the responses
val_lkup <- c(`1`="too little", `2`="about right", `3`="too much")
# recode responses as character
dat$response <- as.character(val_lkup[match(dat$response, names(val_lkup))])
# look at a table
knitr::kable(table(dat$response, useNA="ifany"), col.names=c("resp", "freq"), pad=0)
resp | freq |
---|---|
about right | 2781 |
too little | 3516 |
too much | 1662 |
note: the “backtick” syntax (`` ` `` and `` ` ``) allows you to use numbers, spaces, and other stuff that you wouldn’t normally be able to say directly without a syntax error. For example, if you had a column of `dat` called favorite color, you couldn’t say `dat$favorite color` because there’s a space in the column name. But you could say ``dat$`favorite color` `` to refer to that column.
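Here’s a quick demonstration with a hypothetical toy data frame (the favorite color column is made up):

# `check.names=FALSE` stops R from "fixing" the space in the column name
toy <- data.frame(`favorite color`=c("red", "blue"),
                  check.names=FALSE, stringsAsFactors=FALSE)
# toy$favorite color    # syntax error!
toy$`favorite color`    # works
## [1] "red"  "blue"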
Recall point 3 in our gameplan above – we want just three unique vals in `item`. But currently we have six:
knitr::kable(table(dat$item, useNA="ifany"), col.names=c("item", "freq"), pad=0)
item | freq |
---|---|
natcity | 1315 |
natcityy | 1277 |
natfare | 1385 |
natfarey | 1403 |
natrace | 1294 |
natracey | 1285 |
A simple – albeit kind of sloppy – solution would be to just remove all occurrences of the letter “y” from the `item` column. Let’s just do that (though see the exercise below):
# for now we'll just toss all y's to make life easier
dat$item <- gsub("y", "", dat$item)
# confirm that we have just three vals now
knitr::kable(table(dat$item, useNA="ifany"), col.names=c("item", "freq"), pad=0)
item | freq |
---|---|
natcit | 2592 |
natfare | 2788 |
natrace | 2579 |
exercise (important): Remove all trailing `"y"`’s from the `dat$item` column. (hint: use the regular expressions cheatsheet on the course page. another hint-slash-exercise: google for examples of `gsub()`, `grep()`, `grepl()`, and `gregexpr()` in action. Also peep their cousins `sub()` and `regexpr()`, though they’re not quite as useful imo. If this sounds too painful, peruse the intro vignette for the `stringr::` package instead.)
exercise: what, if anything, could go wrong if we just chopped off all the `"y"`’s in the current situation? What conditions would definitely make this a bad idea?
Here’s a variety of ways we can summarize the data, using a few different strategies.
# many ways to summarize data by groups!
# frequency tables
table(dat$sex, dat$response)
##
## about right too little too much
## female 1488 1986 886
## male 1293 1530 776
table(dat$income_cat, dat$response)
##
## about right too little too much
## hi 678 718 373
## lo 472 658 252
## mid 1387 1821 874
# tables can have arbitrary dimensions, but even three is tough to parse...
# table(dat$sex, dat$income_cat, dat$response)
# cross-tabulate
xtabs( ~ sex + income_cat, data=dat)
## income_cat
## sex hi lo mid
## female 816 779 2297
## male 953 603 1785
xtabs( ~ response + sex + income_cat, data=dat)
## , , income_cat = hi
##
## sex
## response female male
## about right 294 384
## too little 368 350
## too much 154 219
##
## , , income_cat = lo
##
## sex
## response female male
## about right 269 203
## too little 386 272
## too much 124 128
##
## , , income_cat = mid
##
## sex
## response female male
## about right 769 618
## too little 1027 794
## too much 501 373
# marginal proportions
# **exercise** try setting margin=2
prop.table(xtabs( ~ response + income_cat, data=dat), margin=1)
## income_cat
## response hi lo mid
## about right 0.2672448 0.1860465 0.5467087
## too little 0.2245855 0.2058180 0.5695965
## too much 0.2488326 0.1681121 0.5830554
prop.table(table(dat$response, dat$income_cat), margin=1)
##
## hi lo mid
## about right 0.2672448 0.1860465 0.5467087
## too little 0.2245855 0.2058180 0.5695965
## too much 0.2488326 0.1681121 0.5830554
# can use formula interface with `aggregate()`
aggregate(age ~ sex + income_cat, FUN="mean", data=dat)
## sex income_cat age
## 1 female hi 49.33946
## 2 male hi 49.13641
## 3 female lo 49.44673
## 4 male lo 48.65008
## 5 female mid 48.68655
## 6 male mid 48.14230
aggregate(age ~ sex + income_cat, FUN="sd", data=dat)
## sex income_cat age
## 1 female hi 14.45683
## 2 male hi 16.25358
## 3 female lo 19.11131
## 4 male lo 18.48916
## 5 female mid 18.02443
## 6 male mid 17.39359
aggregate(age ~ sex + income_cat, FUN=function(x) round(mean(x)), data=dat)
## sex income_cat age
## 1 female hi 49
## 2 male hi 49
## 3 female lo 49
## 4 male lo 49
## 5 female mid 49
## 6 male mid 48
# write a custom function for aggregating
custom_summary <- function(x){
c(mean=round(mean(x)), sd=round(sd(x)), min=min(x), max=max(x))
}
aggregate(age ~ sex + income_cat, FUN=custom_summary, data=dat)
## sex income_cat age.mean age.sd age.min age.max
## 1 female hi 49 14 18 99
## 2 male hi 49 16 18 99
## 3 female lo 49 19 19 89
## 4 male lo 49 18 18 89
## 5 female mid 49 18 18 99
## 6 male mid 48 17 18 99
#
# and finally, my drug of choice is dplyr::summarize()/group_by()
library("dplyr")
dat %>% group_by(sex, income_cat) %>% summarize(
num_obs = length(response),
num_toolittle = sum(response=="too little", na.rm=TRUE),
prop_toolittle = num_toolittle / num_obs,
num_aboutright = sum(response=="about right", na.rm=TRUE),
prop_aboutright = num_aboutright / num_obs,
num_toomuch = sum(response=="too much", na.rm=TRUE),
prop_toomuch = num_toomuch / num_obs
# the `mutate_if()` is just to make it display nicer
) %>% mutate_if(is.numeric, round, digits=3) %>% data.frame()
## sex income_cat num_obs num_toolittle prop_toolittle num_aboutright
## 1 female hi 816 368 0.451 294
## 2 female lo 779 386 0.496 269
## 3 female mid 2297 1027 0.447 769
## 4 female <NA> 468 205 0.438 156
## 5 male hi 953 350 0.367 384
## 6 male lo 603 272 0.451 203
## 7 male mid 1785 794 0.445 618
## 8 male <NA> 258 114 0.442 88
## prop_aboutright num_toomuch prop_toomuch
## 1 0.360 154 0.189
## 2 0.345 124 0.159
## 3 0.335 501 0.218
## 4 0.333 107 0.229
## 5 0.403 219 0.230
## 6 0.337 128 0.212
## 7 0.346 373 0.209
## 8 0.341 56 0.217
dat %>% group_by(item, form) %>% summarize(
num_obs = length(response),
num_toolittle = sum(response=="too little", na.rm=TRUE),
prop_toolittle = num_toolittle / num_obs,
num_aboutright = sum(response=="about right", na.rm=TRUE),
prop_aboutright = num_aboutright / num_obs,
num_toomuch = sum(response=="too much", na.rm=TRUE),
prop_toomuch = num_toomuch / num_obs
) %>% mutate_if(is.numeric, round, digits=3) %>% data.frame()
## item form num_obs num_toolittle prop_toolittle num_aboutright
## 1 natcit assistance 1277 309 0.242 529
## 2 natcit standard 1315 683 0.519 455
## 3 natfare assistance 1403 1002 0.714 301
## 4 natfare standard 1385 316 0.228 477
## 5 natrace assistance 1285 534 0.416 535
## 6 natrace standard 1294 672 0.519 484
## prop_aboutright num_toomuch prop_toomuch
## 1 0.414 439 0.344
## 2 0.346 177 0.135
## 3 0.215 100 0.071
## 4 0.344 592 0.427
## 5 0.416 216 0.168
## 6 0.374 138 0.107
exercise: what does the command `xtabs(age ~ sex + income_cat, data=dat)` yield, and why does it yield that? Does it make sense as a summary for some combination of our variables?
note: you’ll see the formula interface used in `xtabs()` in a few different areas of R – most notably in regression modeling. The form is usually `y ~ x1 + ... + xn`, with `y` being the outcome variable and the `xi`’s being predictors.
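For instance, here’s the same formula language specifying a regression model (a purely illustrative model, just to show the syntax – not a claim that this is a sensible analysis of these data):

# "predict" age from sex and income category (illustration only!)
mod <- lm(age ~ sex + income_cat, data=dat)
summary(mod)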
Now let’s get to plotting, folks!!!
There are several base functions for making plots. They’re great for exploratory analysis because they’re so quick and easy. They can be customized endlessly, but if you want to invest much effort in a plot’s appearance, you’re usually better off using `ggplot()`.
# barplots made from a table
barplot(table(dat$sex))
barplot(table(dat$income_cat))
# basic scatterplot
plot(iris$Sepal.Width, iris$Sepal.Length)
# with color coding
plot(iris$Sepal.Width, iris$Sepal.Length, col=iris$Species)
xvar <- 1:10
yvar <- rnorm(xvar)
# line plot (nice for time-series)
plot(xvar, yvar, type="l")
# scatter
plot(xvar, yvar, type="p")
# both
plot(xvar, yvar, type="b")
# other options are "c", "o", "S"/"s", "h", "n"
# histograms for numeric univariate distributions
# this is probably the base plotting function i use the most
hist(dat$age)
exercise: merge the 2012, 2014, and 2016 data using a `*_join()` operation from `dplyr::` (see the top-right of the second page of the data wrangling cheatsheet). Then make a line-plot where the x-axis is `year`, and the y-axis is the proportion of “too much” responses across all three items.
exercise: after doing the previous exercise, read in the file called `"datasets/gss_complete_test.csv"` and make the corresponding plot. Does it look identical to yours? (hint: it should!)
ggplot()
Alright, bringing out the big guns now. `ggplot2::` has its own syntax, because you interact with its plotting capabilities via a (very small/limited) domain-specific language (DSL), which you can think of as a kind of mini-language that’s very limited in scope because it’s only meant to do one thing. For example, the formula interface we saw with `aggregate()` can be considered a DSL. The same formula language (with some extra capabilities) is also used for most of R’s regression modeling functions, e.g. `lm()`, `glm()`, `lme4::lmer()`, and `lme4::glmer()`.
Before plotting, we’ll make a couple of summary datasets, which will make some aspects easier moving forward. There are many ways to do this.
exercise: In the chunk below, I start to make a summary dataset using base functions like `aggregate()`. This can be tedious but is very important to know how to do, so I’m leaving most of it as an exercise. Essentially, here’s the goal: inspect the code that creates `pdat` (see the week 2 appendix for background on pipe-chains with `%>%` if you want to understand what’s going on there), figure out exactly what it does, and then finish writing the code in the chunk below, which is meant to create a summary data frame with the same information as `pdat` has. You can check your work by confirming that your summary data frame is identical to `pdat` (rounding aside!).
# for each group defined by the grouping vars, how many observations?
num_obs <- aggregate(response ~ sex + income_cat + item + form, data=dat, FUN="length")
# and what's the mean age?
mean_age <- aggregate(age ~ sex + income_cat + item + form, data=dat, FUN="mean")
# how many of the `response` vals are "too little" in each group?
prop_toolittle <- aggregate(response=="too little" ~ sex + income_cat + item + form, data=dat, FUN="sum")
# change the name of the column we calculated
names(prop_toolittle)[5] <- "too_little"
# now divide by the group sizes in num_obs to get the proportion of too_little responses
prop_toolittle$too_little <- prop_toolittle$too_little / num_obs$response
# ...
# ...
# ...
hint to exercise: you’ll probably need to create some “intermediate” data frames like `mean_age` and `prop_toolittle`, then figure out how to put them together to get something that looks like `pdat` below:
# a very nice, compact, and efficient way to build the summary data frame
pdat <- dat %>% group_by(sex, income_cat, item, form) %>% summarize(
num_obs = length(response),
mean_age = mean(age, na.rm=TRUE),
prop_toolittle = sum(response=="too little", na.rm=TRUE) / num_obs,
prop_aboutright = sum(response=="about right", na.rm=TRUE) / num_obs,
prop_toomuch = sum(response=="too much", na.rm=TRUE) / num_obs
) %>% mutate_if(is.numeric, round, digits=3) %>%
melt(measure.vars=c("prop_toolittle","prop_aboutright","prop_toomuch")) %>%
data.frame()
# make sure income is ordered how we want it (the others dont matter)
pdat$income_cat <- factor(pdat$income_cat, levels=c("lo","mid","hi"))
exercise: rewrite the summary code by combining `aggregate()` with a function that you created (use the example with `custom_summary()` above as a starting point).
The best way to learn ggplot is by example, so here we go. Note that some of the plots below are more useful than others, but they’re all valid examples of stuff you might have/want to do eventually.
library("ggplot2")
# scatterplots
# regular
ggplot(pdat, aes(x=mean_age, y=value)) +
geom_point()
# color-coded
ggplot(pdat, aes(x=mean_age, y=value, color=variable)) +
geom_point()
ggplot(pdat, aes(x=mean_age, y=value, color=income_cat)) +
geom_point()
# size-coded
ggplot(pdat, aes(x=mean_age, y=value, size=num_obs)) +
geom_point()
# size- and color-coded
ggplot(pdat, aes(x=mean_age, y=value, size=num_obs, color=income_cat)) +
geom_point()
exercise: how could we get rid of the `NA`’s in the plot with color coding for `income_cat`?
Now let’s look at how the responses are distributed for individual items:
# first group the data by item and get proportions for each response
boosh <- dat %>% group_by(item) %>% summarize(
prop_toolittle = sum(response=="too little", na.rm=TRUE) / n(),
prop_aboutright = sum(response=="about right", na.rm=TRUE) / n(),
prop_toomuch = sum(response=="too much", na.rm=TRUE) / n()
) %>% melt(id.vars="item")
# now make some plots
ggplot(boosh, aes(x=item, y=value, fill=variable)) + geom_bar(stat="identity")
ggplot(boosh, aes(x=item, y=value, fill=variable)) +
geom_bar(stat="identity", position="dodge")
ggplot(boosh, aes(x="", y=value, fill=variable)) +
geom_bar(stat="identity", position="dodge") +
facet_wrap(~item)
Now let’s see if `form` looks like it has an effect on responses!
# first group the data by item and get proportions for each response
boosh <- dat %>% group_by(item, form) %>% summarize(
prop_toolittle = sum(response=="too little", na.rm=TRUE) / n(),
prop_aboutright = sum(response=="about right", na.rm=TRUE) / n(),
prop_toomuch = sum(response=="too much", na.rm=TRUE) / n()
) %>% melt(id.vars=c("item","form")) %>% data.frame()
knitr::kable(boosh)
item | form | variable | value |
---|---|---|---|
natcit | assistance | prop_toolittle | 0.2419734 |
natcit | standard | prop_toolittle | 0.5193916 |
natfare | assistance | prop_toolittle | 0.7141839 |
natfare | standard | prop_toolittle | 0.2281588 |
natrace | assistance | prop_toolittle | 0.4155642 |
natrace | standard | prop_toolittle | 0.5193199 |
natcit | assistance | prop_aboutright | 0.4142522 |
natcit | standard | prop_aboutright | 0.3460076 |
natfare | assistance | prop_aboutright | 0.2145403 |
natfare | standard | prop_aboutright | 0.3444043 |
natrace | assistance | prop_aboutright | 0.4163424 |
natrace | standard | prop_aboutright | 0.3740340 |
natcit | assistance | prop_toomuch | 0.3437745 |
natcit | standard | prop_toomuch | 0.1346008 |
natfare | assistance | prop_toomuch | 0.0712758 |
natfare | standard | prop_toomuch | 0.4274368 |
natrace | assistance | prop_toomuch | 0.1680934 |
natrace | standard | prop_toomuch | 0.1066461 |
We can plot this in a number of ways – each providing a slightly different perspective on the data. (generally speaking, not all perspectives are equally useful/sensical!)
ggplot(boosh, aes(x=variable, y=value, fill=form)) +
geom_bar(stat="identity", position="dodge") +
facet_wrap(~item)
ggplot(boosh, aes(x=item, y=value, fill=form)) +
geom_bar(stat="identity", position="dodge") +
facet_wrap(~variable)
ggplot(boosh, aes(x=variable, y=value)) +
geom_bar(stat="identity", position="dodge") +
facet_grid(item ~ form)
ggplot(boosh, aes(x=variable, y=value, fill=variable)) +
geom_bar(stat="identity", position="dodge") +
facet_grid(item~form)
exercise: what should we expect `sum(boosh$value)` to evaluate to? Why?
For notes on styling `ggplot2` plots, check out the week3 script. Here are a few of the things we went over in class, all of which you’ll find in that script.

Two basic ways to save a `ggplot2::` plot object are:

- the “export” interface in the RStudio plot pane; and
- `ggsave()` (see week3 script).

You can specify the dimensions and filetype etc. with the “export” interface. You can do the same with `ggsave()`, but the latter has a few advantages (see the week3 script).
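As a minimal sketch of the programmatic route (the file name and dimensions here are just placeholders, and a `plots/` directory is assumed to exist):

# build a plot object, then save it to disk with explicit dimensions
p <- ggplot(pdat, aes(x=mean_age, y=value)) + geom_point()
ggsave(filename="plots/example_scatter.png", plot=p, width=7, height=5, units="in")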
if time, very quick/incomplete intro to web-scraping in R
if time, whatever else y’all want to talk about/do!
for-loops to make animated gif plots!

Title is hopefully self-explanatory! Check out the super fun + cool `magick::` package (esp. this nice vignette), which allows you to call ImageMagick functions directly from R.
exercise: mark up the code below with comments explaining what’s happening in each step or line of code, and why. (Places where I’d suggest putting comments are marked with `# [comment here]` – writing lots of comments is a good way to reinforce stuff in your memory.)
library("ggplot2")
library("magick")
library("dplyr")    # for arrange() and the %>% pipe used below
# [comment here]
df <- data.frame(
time_worked = rep(seq(0, 1.4, .2), times=2),
amount_done = c(seq(0, 4, .8)^2, 50, 100, seq(0, 9, 2)^2, 85, 90, 95),
strategy = rep(c("writing code", "point and click"), each=8)
)
# [comment here]
df <- arrange(df, time_worked)
# [comment here]
for (x in seq(from=2, to=nrow(df), by=2)){
# [comment here]
myplot <-
ggplot(df[1:x, ], aes(x=time_worked, y=amount_done, color=strategy)) +
# [comment here]
geom_point() +
# [comment here]
geom_line() +
# [comment here]
scale_x_continuous(limits=c(0, 1.5)) +
scale_y_continuous(limits=c(0, 100)) +
# [comment here]
theme(legend.position="top",
axis.text=element_blank())
# [comment here]
fname_digit <- ifelse(x < 10, paste0("0", x), as.character(x))
# [comment here]
ggsave(
filename=paste0("plots/myplot", fname_digit, ".png"), plot=myplot,
width=7, height=5, units="in"
)
}
# [comment here]
gifcap <- c("yayyy, animated gif plot! \nwoop woop ~~ <(o_@)^")
# [comment here]
dir("plots/")[endsWith(dir("plots/"), ".png")] %>%
(function(x) paste0("plots/", x)) %>%
# [comment here]
image_read() %>%
image_scale("x600") %>%
# [comment here]
image_join() %>%
image_animate(fps=2) %>%
# [comment here]
image_annotate(gifcap, location="+100+100", color="green", size=30) %>%
# [comment here]
image_write(path="output/animate_woop_woop.gif", quality=10)
Aaaand here’s what we end up with: