In these notes we’ll be looking at some sample data from the General Social Survey (GSS), a survey measuring American opinion on a variety of issues. The survey has been conducted every other year by NORC for many years. Many thanks to Ben Schapiro for prepping this stuff! It’s very cool data, and quite good for illustrating many common data wrangling tasks. After some discussion about planning in Sections 1-2, we start writing actual code to manipulate the data in Section 3. Feel free to skip directly to Section 3 if you just want to practice coding.

note on the datasets: we have sample data from three years: 2012, 2014, and 2016. You can find the corresponding datasets here (2012), here (2014), and here (2016). In these notes we’ll look just at the 2016 data, but there’s so much more we could look at if we merged all three files. There are a couple of exercises below that will involve doing exactly that. But in case you just want to practice with the combined data, here’s a link to a file with all three years together. Or just get everything as a .zip archive, including a partial codebook and a script.

### 0. for next week

Keep your dataset in mind when working through these notes. Then try to make a list of things you want to do to the data. For example: recode or plot some of the variables, maybe reshape it with a melt() command, make a frequency table for a couple variables of interest, etc.

Try to make the list manageable – identify maybe five operations total. Then next week, we’ll implement those operations and build a nice lil output document from an R-Markdown file (which is just a fancy type of R script, basically).

### 1. getting the data and defining the problem

We want to see if the wording of a question has an effect on how people (identified by an id) respond to it. The (categorical) grouping variable that determines what kind of wording they got is form, which takes the values 1 or 2 (which aren’t categorical – we’ll recode them so that they are).

We have three questions with alternative wordings, which we’d like to investigate (manipulated between-subjects, across the levels of form). The questions of interest are:

We are faced with many problems in this country, none of which can be solved easily or inexpensively. I’m going to name some of these problems, and for each one I’d like you to tell me whether you think we’re spending too much money on it, too little money, or about the right amount…

[A]re we spending too much, too little, or about the right amount on (ITEM)?

where (ITEM) is one of the following:

• natcity: [A]re we spending too much, too little, or about the right amount on…
  • …SOLVING PROBLEMS OF BIG CITIES (if form==1, response is in natcity)
  • …ASSISTANCE TO BIG CITIES (if form==2, response is in natcityy)
• natrace: [A]re we spending too much, too little, or about the right amount on…
  • …IMPROVING THE CONDITIONS OF BLACKS (if form==1, response is in natrace)
  • …ASSISTANCE TO BLACKS (if form==2, response is in natracey)
• natfare: [A]re we spending too much, too little, or about the right amount on…
  • …WELFARE (if form==1, response is in natfare)
  • …ASSISTANCE TO THE POOR (if form==2, response is in natfarey)

An important question: should we be interested in variation across form for each individual questionnaire item, or can we consider each question an “item” in the sense that a “trial” instantiating it can appear in multiple “conditions” (here form==1 or form==2)? This really depends on our purpose, which here is to demonstrate how to manipulate data. In other words, the question is important if our goals are scientific, but not so much presently…

Since the form==2 questions for nat* all involve the same wording – namely “assistance to …” – maybe it makes sense to consider form==1 an indication of a “variable” that we could call something like non-uniform wording. Or maybe a slightly better option would be to view form as encoding an abstract variable describable as (roughly!) “contains the word assistance in the questionnaire text.” Viewed this way, instead of having $$n=2867$$ observations of three different outcome variables, we would now have $$n=2867\times3=8601$$ observations of a single outcome variable.

So what we want to do is to investigate relationships between item responses and other characteristics of the data that responses are “nested” inside of: which of the three items it is; which wording the respondent got (form 1 or form 2); demographic characteristics of the respondents; etc.

a note on the importance of domain-specific knowledge: One could certainly argue that the notion of a single, underlying “difference in wording” between the nat* and nat*y variables (in our subset of the GSS) isn’t very well motivated. It’s not like the nat* versions all have more words than their nat*y counterparts, for example… These are the types of issues where domain-specific knowledge/expertise is really important. Since I’m not super well-versed in the GSS, my strategy might not be the best one for the particular problem we’re looking at (particularly if we’re fitting statistical models to the data and making questionable assumptions about what the nature of form is). But nevertheless, doing things the way we’re going to do them will illustrate a pretty decent sample of the types of data-munging and visualization tasks you’ll likely have to deal with. This is all to say the following: hacking skills are very useful, but are absolutely not a substitute for nuanced understanding of the subject matter a dataset is “about.” Ability to manipulate, transform, and model data is a complementary skill to the theorizing that makes up the foundation of all scientific fields. (FWIW, I’d say it’s questionable at best whether there is truly something both interesting and uniform being manipulated across the levels of form, at least if we look at just these three items. I don’t think that the survey creator’s intent was to manipulate the items in this way, but my background in psycholinguistic experimentation kind of forces me to think of everything as an experiment with some kind of factorial design…)

Okay, back to something concrete now!

### 2. developing a gameplan

First we’ll need to read in the dataset and do some basic cleaning operations to make it easier to work with (e.g. recoding vars that need it). After that, we need to map out a gameplan for getting the data in the format we want it in.

How do we determine what the format is that we want it in? I think a good strategy is to think visually: what kind of graph could you imagine making from the data? Once you have an idea of that, work backwards and try to figure out what shape (e.g. columns) the data would need to have to create that graph. This definitely requires practice! And in fact, the more you practice plotting, the better intuitions you’ll have about how to get started cleaning/transforming a dataset.

Here’s a gameplan that came to my mind (and again, heed above disclaimer on domain knowledge): to investigate the impact of wording, item, etc. on responses, we’ll need to have our outcome variable response as a single column of the dataset. This column will have the possible values "too little", "about right", and "too much". But there’s a problem: the outcome is currently spread across six columns… What’d be nice would be to have it instead as two columns: one indicating the question (i.e. natcity or natrace or natfare), and one indicating the response (i.e. 0, 1, 2, 3, 8, or 9 – we’ll recode 1, 2, and 3 as "too little", "about right", and "too much", and treat the non-answer codes 0, 8, and 9 as NA). Since we already checked that form encodes wording type, we can safely just remove the y suffixes from the columns of interest and use form to represent “the wording manipulation.”

So how do we arrive at our target format?

There are many ways. Here’s a natural one to implement, where dat16 is the dataset in the file datasets/gss_2016_test.csv:

• Step 1. expand each row into six separate rows of ncol(dat16) - 6 + 2 columns apiece, where the final six columns are deleted and replaced by two new columns item and response such that for each set of six rows i:(i+5) (all of whose values in columns 1:(ncol(dat16)-6) are identical to the original value in the unexpanded row i):
  • dat16$item[i:(i+5)] is the length-six character vector consisting of the names of the columns we’re chopping off (i.e. c("natcity","natcityy","natrace","natracey","natfare","natfarey")); and
  • dat16$response[i:(i+5)] contains the value of the chopped-off column that dat16$item specifies for that row.
• Step 2. delete all rows $$i$$ for which dat16$response[i] %in% c(0,8,9) (which will all be coded as NA), since:
  • half of all the rows we generated in Step 1 will have response 0 (not applicable), since the respondent didn’t provide an answer to the corresponding dat16$item – in other words, these aren’t real data points in the first place; and
  • we can’t use the other non-answer categories anyway; for present purposes, they are dead to us.

After Step 2, we’ll have erased all the rows with empty responses, so what we’re left with is a dataset where dat16$item has one set of values for one level of form, and another set of distinct but corresponding values for the other level of form. But conceptually, the situation is that there’s only three values of dat16$item – not six – and that dat16$form tells us what kind of wording a particular subject got for all three items. This justifies identifying "nat*y" values in dat16$item with their corresponding "nat*" values. In other words,

• Step 3. we delete all the trailing "y"’s from the dat16$item column.

note: if we’re being pedantic, we shouldn’t just delete all "y"’s, since "natcityy" has one that isn’t part of the original coding scheme (and hence we’re not targeting it for off-chopping!). In practice we’re going to, but see exercise on regular expressions below.
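To make the pedantic route concrete, here’s a sketch on a toy vector of item names (not the real data) using a regex anchor, so only a trailing "y" on the known form==2 names gets chopped:

```r
# toy item names; only the form==2 variants carry the suffix "y"
items <- c("natcity", "natcityy", "natrace", "natracey", "natfare", "natfarey")

# the sloppy way: remove *every* "y" (both variants collapse to e.g. "natcit",
# which still identifies the item, just with an ugly name)
gsub("y", "", items)

# the careful way: strip a trailing "y" (regex anchor "$") only from the
# names we know are suffixed, leaving "natcity" itself untouched
suffixed <- c("natcityy", "natracey", "natfarey")
ifelse(items %in% suffixed, sub("y$", "", items), items)
```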

Now our data is (are?!) really really beautiful and we’re ready to start exploring it (although in the colloquial sense I’d say we’ve already explored it quite a bit!).

note: the steps we mapped out above were motivated not just by intuition, but also by the concept of “tidy data,” championed by Hadley Wickham et al. If you’re interested, I’d highly recommend reading this excellent paper (.pdf here), which is kind of a manifesto for a certain (efficient) kind of “tidiness” through all phases of a data-related endeavor.

We also want to see if demographic variables modulate the effect of wording on response. To do this, we’ll need to first clean up several columns, e.g. age, sex, and INCOME16. These are pretty self-explanatory demographic concepts, but other than age we don’t really know the correspondence between the numeric codes in the data and the real-world values they represent. To make the correspondence visible, we’ll recode each of them to values we can look at and immediately understand.

### 3. reading and cleaning and transforming the data

#### 3.1 read in data and inspect

# read in the data and do some initial inspection
dat <- read.csv("datasets/gss_2016_test.csv", stringsAsFactors=FALSE)

# check for na's
# (none, yay!...but are there *really* none?! to be continued...)
colSums(is.na(dat))
##       id     year     form      sex      age racecen1 racecen2 INCOME16
##        0        0        0        0        0        0        0        0
##  natcity natcityy  natrace natracey  natfare natfarey
##        0        0        0        0        0        0
str(dat)
## 'data.frame':    2867 obs. of  14 variables:
##  $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ year    : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ form    : int  1 1 1 1 2 1 1 1 1 2 ...
##  $ sex     : int  1 1 1 2 2 2 1 2 1 1 ...
##  $ age     : int  47 61 72 43 55 53 50 23 45 71 ...
##  $ racecen1: int  1 1 1 1 1 1 1 16 2 1 ...
##  $ racecen2: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ INCOME16: int  26 19 21 26 26 20 26 16 20 20 ...
##  $ natcity : int  2 8 8 2 0 1 2 1 2 0 ...
##  $ natcityy: int  0 0 0 0 1 0 0 0 0 2 ...
##  $ natrace : int  1 1 2 2 0 1 1 2 8 0 ...
##  $ natracey: int  0 0 0 0 8 0 0 0 0 3 ...
##  $ natfare : int  2 8 8 3 0 1 3 3 3 0 ...
##  $ natfarey: int  0 0 0 0 1 0 0 0 0 2 ...
head(dat, n=5)
##   id year form sex age racecen1 racecen2 INCOME16 natcity natcityy natrace
## 1  1 2016    1   1  47        1        0       26       2        0       1
## 2  2 2016    1   1  61        1        0       19       8        0       1
## 3  3 2016    1   1  72        1        0       21       8        0       2
## 4  4 2016    1   2  43        1        0       26       2        0       2
## 5  5 2016    2   2  55        1        0       26       0        1       0
##   natracey natfare natfarey
## 1        0       2        0
## 2        0       8        0
## 3        0       8        0
## 4        0       3        0
## 5        8       0        1

exercise: use the commented code in the next chunk to make formatted tables, and compile an R-markdown document to print them nicely. You’ll need to do some googling – start with “formatted tables R-Markdown knitr kable html”. Next week how to do this will be made completely explicit, so feel free to skip for now :)

# to make nicer-looking tables:
# knitr::kable(t(as.table(colSums(is.na(dat)))))
# knitr::kable(table(dat$form), col.names=c("form","count"))
# knitr::kable(table(dat$sex), col.names=c("sex", "count"))

#### 3.2 basic cleanup and recoding

# now check + clean up + transform columns as necessary

# id is an identifier, not a number
dat$id <- paste0("resp", dat$id)

# form has just two levels
table(dat$form)
##
##    1    2
## 1437 1430

# so we can recode it with ifelse()
dat$form <- ifelse(dat$form==1, "standard", "assistance")

# sex has just two levels
table(dat$sex)
##
##    1    2
## 1276 1591
# so we can recode it with ifelse()
dat$sex <- ifelse(dat$sex==1, "male", "female")

# income has lots of different values
# ALSO BTW YAY FIRST PLOT!
barplot(table(dat$INCOME16))

# so we'll have to recode it in a more mechanical fashion...

# one option:
# recode income to hi or mid or lo, according to external values
lo <- 1:12; mid <- 13:21; hi <- 22:26
dat$income_cat <- ifelse(
  dat$INCOME16 %in% lo, "lo", ifelse(
    dat$INCOME16 %in% mid, "mid", ifelse(
      dat$INCOME16 %in% hi, "hi", NA
    )
  )
); rm(lo); rm(mid); rm(hi)

# another option:
# read in external vals as a table, and just sub in values for numeric codes
income <- read.csv(
  "datasets/lookup_income.csv", header=FALSE,
  col.names=c("num","val"), stringsAsFactors=FALSE
)

# **note**:
# the following three lines won't feel intuitive right away,
# but try to break down what's happening step by step.
# this recoding method is very general and very useful.

# make a vector with values equal to income$val, the desired values
income_lkup        <- as.vector(income$val)
# set the names of that vector to the current values, i.e. income$num
names(income_lkup) <- income$num
# now recode income by matching the current values w the lookup table names
dat$income_range <- income_lkup[match(dat$INCOME16, names(income_lkup))]
# remove the original income col
dat$INCOME16 <- NULL
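If the match()-based recode feels opaque, here it is again on a tiny made-up example (the codes and labels are hypothetical, not from the GSS):

```r
# a named vector: the names are the old codes, the values the new labels
lkup <- c("1"="red", "2"="green", "3"="blue")

# some raw codes to recode (7 deliberately has no entry in the table)
codes <- c(3, 1, 1, 2, 7)

# match() finds each code's position among the names; indexing by those
# positions pulls out the corresponding labels
lkup[match(codes, names(lkup))]
# any unmatched code (like 7) comes back as NA -- usually what you want
```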

note: semicolons just “end” the current statement – they’re used to put a small set of similar commands on a single line to save space and emphasize that they’re all involved in a single operation.

note: notice that when recoding income, we assign the new values to the new column income_range, instead of the existing one, INCOME16. This was deliberate: if we do it this way, then we’ll (temporarily) have both columns to look at, so we can check our work by comparing the old column to the new one.
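One concrete way to do that comparison (a sketch with made-up codes, not the real columns): cross-tabulate old against new and confirm each old code maps to exactly one new label:

```r
# hypothetical old numeric codes and the new labels we recoded them to
old <- c(3, 12, 15, 23, 23, 8)
new <- c("lo", "lo", "mid", "hi", "hi", "lo")

# each row (one old code) should have exactly one nonzero cell
table(old, new)
```

With the real data, the analogous check before dropping the old column would be something like table(dat$income_cat, dat$income_range, useNA="ifany").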

exercise: make a new column, derived from income_range, that contains the same information as income_cat (which was computed directly from INCOME16)

#### 3.3 a slightly more complicated case

# let's check out the race cols again
table(dat$racecen1, useNA="always")
##
##    1    2    3    4    5    6    7    8    9   10   11   12   14   15   16
## 2088  487   58   25   19   22    7    3    4   10    1    1    2   13  103
##   98   99 <NA>
##    9   15    0
table(dat$racecen2, useNA="always")
##
##    0    1    2    3    4    5    6    7    8    9   10   13   14   15   16
## 2617   35   24  140    1    2    3    3    3    1    4    3    3    4    8
##   98   99 <NA>
##    1   15    0
# kinda messy...let's just extract the census's notion of "ethnicity"
# (i.e. hispanic or non-hispanic)

# first read in our associative array of keys and values
race <- read.csv(
  "datasets/lookup_race.csv",   # (filename assumed -- mirrors lookup_income.csv)
  header=FALSE, col.names=c("num","val"),
  stringsAsFactors=FALSE
)

race_lkup        <- as.vector(race$val)
names(race_lkup) <- race$num

dat$racecen1 <- ifelse(dat$racecen1 %in% c(0,98,99), NA, dat$racecen1)
dat$racecen2 <- ifelse(dat$racecen2 %in% c(0,98,99), NA, dat$racecen2)

dat$racecen1 <- race_lkup[match(dat$racecen1, names(race_lkup))]
dat$racecen2 <- race_lkup[match(dat$racecen2, names(race_lkup))]

dat$hisp <- ifelse(dat$racecen1=="Hispanic", "hispanic", "nonhispanic")
dat$hisp[dat$racecen2=="Hispanic"] <- "hispanic"

# now we can safely eliminate the racecen1/2 columns
dat$racecen1 <- NULL; dat$racecen2 <- NULL

exercise: if we wanted to investigate the effect of “race” on responses, how would/should/could we represent “race”? In other words, how many columns, what is in each column, etc. etc. What would be some benefits and drawbacks of two competing approaches?

#### 3.4 reorganize columns as desired, toss unused ones

# now we have everything except the outcome vars
# reorganize the df the way we want it

# vector of the demographic columns we want to keep around
dem_cols  <- c("id", "year", "form", "sex", "age", "income_cat", "hisp")

# vector of the questionnaire items
item_cols <- names(dat)[startsWith(names(dat), "nat")]

# subsetting the data by columns, keeping just dem_cols and item_cols
dat <- dat[, c(dem_cols, item_cols)]

#### 3.5 creating the desired outcome variable with reshape2::melt()

Now we’ll implement the strategy from Section 2 above for representing the responses. This involves a process called “melting” the dataset. The result is moving from the current “wide format” to a “long format” dataset. Google around and you’ll find some nice explainers quickly.

The terminology should become more clear as you see more instances of this very common transformation.
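To see the move in isolation before we do it for real, here’s melt() on a tiny made-up data frame (names are hypothetical):

```r
library("reshape2")

# wide format: one row per respondent, one column per question
toy <- data.frame(id=c("a","b"), q1=c(1, 2), q2=c(3, 4))

# long format: hold id constant; q1/q2 collapse into a "variable" column
# plus a "value" column (which we'd then rename, e.g. to item/response)
melt(toy, id.vars="id")
#   id variable value
# 1  a       q1     1
# 2  b       q1     2
# 3  a       q2     3
# 4  b       q2     4
```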

# NATFARE/NATFAREY/NATCITY/NATCITYY/NATRACE/NATRACEY
#    9        "No answer"
#    8        "Don't know"
#    3        "Too much"
#    2        "About right"
#    1        "Too little"
#    0        "Not applicable"

# melt() comes from the package reshape2::
library("reshape2")

# hold dem_cols (see above) constant, collapse the other columns
dat <- melt(dat, id.vars=dem_cols)

# recode the new "column-column" as character (instead of factor)
dat$variable <- as.character(dat$variable)

# rename the column-column ("variable") and the response-column ("value")
names(dat)[names(dat)=="variable"] <- "item"
names(dat)[names(dat)=="value"]    <- "response"

# check out what we have now:
knitr::kable(head(dat, n=10))
id year form sex age income_cat hisp item response
resp1 2016 standard male 47 hi nonhispanic natcity 2
resp2 2016 standard male 61 mid nonhispanic natcity 8
resp3 2016 standard male 72 mid nonhispanic natcity 8
resp4 2016 standard female 43 hi nonhispanic natcity 2
resp5 2016 assistance female 55 hi nonhispanic natcity 0
resp6 2016 standard female 53 mid nonhispanic natcity 1
resp7 2016 standard male 50 hi nonhispanic natcity 2
resp8 2016 standard female 23 mid hispanic natcity 1
resp9 2016 standard male 45 mid nonhispanic natcity 2
resp10 2016 assistance male 71 mid nonhispanic natcity 0

exercise: how does the line of code renaming variable to item work? Try the command "item" -> names(dat)[names(dat)=="variable"]. What is the symbol -> and how is it related to <-?

# toss all the rows w/o real answers
# (don't forget the comma before the closing square bracket -- dat is a df!)
dat <- dat[!dat$response %in% c(9, 8, 0), ]

# make a lookup table for the responses
val_lkup <- c("1"="too little", "2"="about right", "3"="too much")

# recode responses as character
dat$response <- as.character(val_lkup[match(dat$response, names(val_lkup))])

# look at a table
knitr::kable(table(dat$response, useNA="ifany"), col.names=c("resp", "freq"))
resp freq
about right 2781
too little 3516
too much 1662

note: the “backtick” syntax (`` ` ``) allows you to use numbers, spaces, and other stuff in a name that you wouldn’t normally be able to say directly without a syntax error. For example if you had a column of dat called favorite color, you couldn’t say dat$favorite color because there’s a space in the column name. But you could say dat$`favorite color` to refer to that column.
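A quick sketch on a throwaway data frame (check.names=FALSE stops R from mangling the space in the name):

```r
# a column whose name contains a space -- only reachable via backticks
df <- data.frame(
  "favorite color" = c("red", "blue"),
  check.names = FALSE, stringsAsFactors = FALSE
)

df$`favorite color`
# -> "red" "blue"
```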

Recall Step 3 in our gameplan above – we want just three unique vals in item. But currently we have six:

knitr::kable(table(dat$item, useNA="ifany"), col.names=c("item", "freq"))

item freq
natcity 1315
natcityy 1277
natfare 1385
natfarey 1403
natrace 1294
natracey 1285

A simple – albeit kind of sloppy – solution would be to just remove all occurrences of the letter “y” from the item column. Let’s just do that (see exercise though):

# for now we'll just toss all y's to make life easier
dat$item <- gsub("y", "", dat$item)

# confirm that we have just three vals now
knitr::kable(table(dat$item, useNA="ifany"), col.names=c("item", "freq"))
item freq
natcit 2592
natfare 2788
natrace 2579

exercise (important): Remove all trailing "y"’s from the dat$item column. (hint: use the regular expressions cheatsheet on the course page. another hint-slash-exercise: google for examples of gsub(), grep(), grepl(), and gregexpr() in action. Also peep their cousins sub() and regexpr() too, though they’re not quite as useful imo. If this sounds too painful, peruse the intro vignette for the stringr:: package instead.)

exercise: what if anything could or would go wrong if we just chopped off all the "y"’s in the current situation? What conditions would definitely cause this to be a bad idea?

### 4. summarizing and plotting the data

#### 4.1 summarizing data with a variety of methods

Here’s a variety of ways we can summarize the data, using a few different strategies.

# many ways to summarize data by groups!

# frequency tables
table(dat$sex, dat$response)
##
##          about right too little too much
##   female        1488       1986      886
##   male          1293       1530      776
table(dat$income_cat, dat$response)
##
##       about right too little too much
##   hi          678        718      373
##   lo          472        658      252
##   mid        1387       1821      874

# tables can have arbitrary dimensions, but even three is tough to parse...
# table(dat$sex, dat$income_cat, dat$response)

# cross-tabulate
xtabs( ~ sex + income_cat, data=dat)
##         income_cat
## sex        hi   lo  mid
##   female  816  779 2297
##   male    953  603 1785
xtabs( ~ response + sex + income_cat, data=dat)
## , , income_cat = hi
##
##              sex
## response      female male
##   about right    294  384
##   too little     368  350
##   too much       154  219
##
## , , income_cat = lo
##
##              sex
## response      female male
##   about right    269  203
##   too little     386  272
##   too much       124  128
##
## , , income_cat = mid
##
##              sex
## response      female male
##   about right    769  618
##   too little    1027  794
##   too much       501  373
# marginal proportions
# **exercise** try setting margin=2
prop.table(xtabs( ~ response + income_cat, data=dat), margin=1)
##              income_cat
## response             hi        lo       mid
##   about right 0.2672448 0.1860465 0.5467087
##   too little  0.2245855 0.2058180 0.5695965
##   too much    0.2488326 0.1681121 0.5830554
prop.table(table(dat$response, dat$income_cat), margin=1)
##
##                      hi        lo       mid
##   about right 0.2672448 0.1860465 0.5467087
##   too little  0.2245855 0.2058180 0.5695965
##   too much    0.2488326 0.1681121 0.5830554
# can use formula interface with aggregate()
aggregate(age ~ sex + income_cat, FUN="mean", data=dat)
##      sex income_cat      age
## 1 female         hi 49.33946
## 2   male         hi 49.13641
## 3 female         lo 49.44673
## 4   male         lo 48.65008
## 5 female        mid 48.68655
## 6   male        mid 48.14230
aggregate(age ~ sex + income_cat, FUN="sd", data=dat)
##      sex income_cat      age
## 1 female         hi 14.45683
## 2   male         hi 16.25358
## 3 female         lo 19.11131
## 4   male         lo 18.48916
## 5 female        mid 18.02443
## 6   male        mid 17.39359
aggregate(age ~ sex + income_cat, FUN=function(x) round(mean(x)), data=dat)
##      sex income_cat age
## 1 female         hi  49
## 2   male         hi  49
## 3 female         lo  49
## 4   male         lo  49
## 5 female        mid  49
## 6   male        mid  48
# write a custom function for aggregating
custom_summary <- function(x){
c(mean=round(mean(x)), sd=round(sd(x)), min=min(x), max=max(x))
}
aggregate(age ~ sex + income_cat, FUN=custom_summary, data=dat)
##      sex income_cat age.mean age.sd age.min age.max
## 1 female         hi       49     14      18      99
## 2   male         hi       49     16      18      99
## 3 female         lo       49     19      19      89
## 4   male         lo       49     18      18      89
## 5 female        mid       49     18      18      99
## 6   male        mid       48     17      18      99

# and finally, my drug of choice is dplyr::summarize()/group_by()
library("dplyr")
dat %>% group_by(sex, income_cat) %>% summarize(
  num_obs = length(response),
  num_toolittle = sum(response=="too little", na.rm=TRUE),
  prop_toolittle = num_toolittle / num_obs,
  num_aboutright = sum(response=="about right", na.rm=TRUE),
  prop_aboutright = num_aboutright / num_obs,
  num_toomuch = sum(response=="too much", na.rm=TRUE),
  prop_toomuch = num_toomuch / num_obs
  # the mutate_if() is just to make it display nicer
) %>% mutate_if(is.numeric, round, digits=3) %>% data.frame()
##      sex income_cat num_obs num_toolittle prop_toolittle num_aboutright
## 1 female         hi     816           368          0.451            294
## 2 female         lo     779           386          0.496            269
## 3 female        mid    2297          1027          0.447            769
## 4 female       <NA>     468           205          0.438            156
## 5   male         hi     953           350          0.367            384
## 6   male         lo     603           272          0.451            203
## 7   male        mid    1785           794          0.445            618
## 8   male       <NA>     258           114          0.442             88
##   prop_aboutright num_toomuch prop_toomuch
## 1           0.360         154        0.189
## 2           0.345         124        0.159
## 3           0.335         501        0.218
## 4           0.333         107        0.229
## 5           0.403         219        0.230
## 6           0.337         128        0.212
## 7           0.346         373        0.209
## 8           0.341          56        0.217
dat %>% group_by(item, form) %>% summarize(
  num_obs = length(response),
  num_toolittle = sum(response=="too little", na.rm=TRUE),
  prop_toolittle = num_toolittle / num_obs,
  num_aboutright = sum(response=="about right", na.rm=TRUE),
  prop_aboutright = num_aboutright / num_obs,
  num_toomuch = sum(response=="too much", na.rm=TRUE),
  prop_toomuch = num_toomuch / num_obs
) %>% mutate_if(is.numeric, round, digits=3) %>% data.frame()
##      item       form num_obs num_toolittle prop_toolittle num_aboutright
## 1  natcit assistance    1277           309          0.242            529
## 2  natcit   standard    1315           683          0.519            455
## 3 natfare assistance    1403          1002          0.714            301
## 4 natfare   standard    1385           316          0.228            477
## 5 natrace assistance    1285           534          0.416            535
## 6 natrace   standard    1294           672          0.519            484
##   prop_aboutright num_toomuch prop_toomuch
## 1           0.414         439        0.344
## 2           0.346         177        0.135
## 3           0.215         100        0.071
## 4           0.344         592        0.427
## 5           0.416         216        0.168
## 6           0.374         138        0.107

exercise: what does the following command yield, and why does it yield that? Does this make sense as a summary for some combination of our variables? command: xtabs(age ~ sex + income_cat, data=dat)

note: you’ll see the formula interface used in xtabs() in a few different areas in R – most notably in regression modeling. The form is usually y ~ x1 + ... + xn, with y being the outcome variable and the xi’s being predictors.
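For a taste of the same formula mini-language in a modeling context, here’s a sketch using R’s built-in iris data rather than ours:

```r
# outcome on the left of "~", predictors on the right -- same shape as
# the formulas we passed to xtabs() and aggregate() above
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data=iris)
coef(fit)
```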

Now let’s get to plotting, folks!!!

#### 4.2 plotting with base functions

There are several base functions for making plots. They’re great for exploratory analysis because they’re so quick and easy. They can be customized endlessly, but if you want fine-grained control over appearance, you’re usually better off using ggplot().

# barplots made from a table
barplot(table(dat$sex))
barplot(table(dat$income_cat))

# basic scatterplot
plot(iris$Sepal.Width, iris$Sepal.Length)

# with color coding
plot(iris$Sepal.Width, iris$Sepal.Length, col=iris$Species)

xvar <- 1:10
yvar <- rnorm(xvar)

# line plot (nice for time-series)
plot(xvar, yvar, type="l")
# scatter
plot(xvar, yvar, type="p")
# both
plot(xvar, yvar, type="b")
# other options are "c", "o", "S"/"s", "h", "n"

# histograms for numeric univariate distributions
# this is probably the base plotting function i use the most
hist(dat$age)

exercise: merge the 2012, 2014, and 2016 data using a *_join() operation from dplyr:: (see the top-right of the second page of the data wrangling cheatsheet). Then make a line-plot where the x-axis is year, and the y-axis is proportion “too much” responses across all three items.

exercise: after doing the previous exercise, read in the file called "datasets/gss_complete_test.csv" and make the corresponding plot. Does it look identical to yours? (hint: it should!)

#### 4.3 plotting with ggplot()

Alright, bringing out the big guns now. ggplot2:: has its own syntax because you interact with its plotting capabilities via a small domain-specific language (DSL) – a mini-language that’s deliberately limited in scope because it’s only meant to do one thing. For example, the formula interface we saw with aggregate() can be considered a DSL. The same formula-language (with some extra capabilities) is also used for most of R’s regression modeling functions, e.g. lm(), glm(), lme4::lmer(), and lme4::glmer().

Before plotting, we’ll make a couple of summary datasets, which will make some aspects easier moving forward. There are many ways to do this.

exercise: In the chunk below, I start to make a summary dataset using base functions like aggregate(). This can be tedious but is very important to know how to do. I’m therefore leaving most of it as an exercise. Essentially, here’s the goal: inspect the code that creates pdat (see week 2 appendix for background on pipe-chains with %>% if you want to understand what’s going on here), figure out exactly what it does, and then finish writing the code at the beginning of the chunk, which is meant to create a summary data frame with the same information as pdat has. You can check your work by confirming that your summary data frame is identical to pdat (rounding aside!).

# for each group defined by the grouping vars, how many observations?
num_obs  <- aggregate(response ~ sex + income_cat + item + form, data=dat, FUN="length")
# and what's the mean age?
mean_age <- aggregate(age ~ sex + income_cat + item + form, data=dat, FUN="mean")
# how many of the 'response' vals are "too little"? in each group
prop_toolittle <- aggregate(response=="too little" ~ sex + income_cat + item + form, data=dat, FUN="sum")
# change the name of the column we calculated
names(prop_toolittle)[5] <- "too_little"
# now use it and num_obs to get the proportion of too_little responses
prop_toolittle$too_little <- prop_toolittle$too_little / num_obs$response
# ...
# ...
# ...

hint to exercise: you’ll probably need to create some “intermediate” data frames like mean_age and prop_toolittle, then figure out how to put them together to get something that looks like pdat below:

# a very nice, compact, and efficient way to build the summary data frame
pdat <- dat %>% group_by(sex, income_cat, item, form) %>% summarize(
  num_obs = length(response),
  mean_age = mean(age, na.rm=TRUE),
  prop_toolittle = sum(response=="too little", na.rm=TRUE) / num_obs,
  prop_aboutright = sum(response=="about right", na.rm=TRUE) / num_obs,
  prop_toomuch = sum(response=="too much", na.rm=TRUE) / num_obs
) %>% mutate_if(is.numeric, round, digits=3) %>%
  melt(measure.vars=c("prop_toolittle","prop_aboutright","prop_toomuch")) %>%
  data.frame()

# make sure income is ordered how we want it (the others don't matter)
pdat$income_cat <- factor(pdat$income_cat, levels=c("lo","mid","hi"))

exercise: rewrite the summary code by combining aggregate() with a function that you created (use the example with custom_summary() above as a starting point).

The best way to learn ggplot is by example, so here we go. Note that some of the plots below are more useful than others, but they’re all valid examples of stuff you might have/want to do eventually.

library("ggplot2")

# scatterplots

# regular
ggplot(pdat, aes(x=mean_age, y=value)) +
geom_point()

# color-coded
ggplot(pdat, aes(x=mean_age, y=value, color=variable)) +
geom_point()

ggplot(pdat, aes(x=mean_age, y=value, color=income_cat)) +
geom_point()

# size-coded
ggplot(pdat, aes(x=mean_age, y=value, size=num_obs)) +
geom_point()

# size- and color-coded
ggplot(pdat, aes(x=mean_age, y=value, size=num_obs, color=income_cat)) +
geom_point()

exercise: how could we get rid of the NA’s in the plot with color coding for income_cat?

Now let’s look at how the responses are distributed for individual items:

# first group the data by item and get proportions for each response
boosh <- dat %>% group_by(item) %>% summarize(
prop_toolittle  = sum(response=="too little", na.rm=TRUE) / n(),
prop_toomuch = sum(response=="too much", na.rm=TRUE) / n()
) %>% melt(id.vars="item")

# now make some plots
ggplot(boosh, aes(x=item, y=value, fill=variable)) + geom_bar(stat="identity")
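One small aside: geom_col() is shorthand for geom_bar(stat="identity"), and position= controls stacked vs. side-by-side bars. Here’s a sketch on a toy stand-in for boosh (made-up numbers, just to show the shape):

```r
library("ggplot2")

# hypothetical stand-in for boosh: one row per item x response-type
toy <- data.frame(
  item     = rep(c("natcit", "natfare", "natrace"), each=2),
  variable = rep(c("prop_toolittle", "prop_toomuch"), times=3),
  value    = c(.50, .20, .40, .30, .45, .15)
)

# geom_col() == geom_bar(stat="identity"); "dodge" puts bars side by side
p <- ggplot(toy, aes(x=item, y=value, fill=variable)) +
  geom_col(position="dodge")
p
```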