1. setting things up

1.1 set global chunk options and load packages

We’ll end up using functions from the following packages:

library("dplyr"); library("reshape2"); library("ggplot2");
library("ggthemes"); library("knitr");

# also set the ggplot2 theme to something nicer than the default
theme_set(theme_fivethirtyeight(base_size=20))

1.2 describe and read in dataset, and inspect

Here we read in a dataset from a URL and inspect the first few rows. For 5000 English words, this dataset lists each word’s frequency (number of occurrences) in the Corpus of Contemporary American English (COCA) in dat$Frequency; specifies what kind of word it is (e.g. noun, verb, or article) in dat$PartOfSpeech; and also gives the word’s frequency rank, i.e. the position it would occupy if you ordered all the words in the list by their corpus frequency. These are the three columns we’ll look at here.

# using two lines so the url doesn't overflow 
# (maximum of 80 characters per line of code is a nice cutoff point)
link <- paste0("http://lefft.xyz/r_minicourse/",
               "datasets/top5k-word-frequency-dot-info.csv")

dat <- read.csv(link, stringsAsFactors=FALSE)

head(dat, n=5)
##   Rank Word PartOfSpeech Frequency Dispersion
## 1    1  the            a  22038615       0.98
## 2    2   be            v  12545825       0.97
## 3    3  and            c  10741073       0.99
## 4    4   of            i  10343885       0.97
## 5    5    a            a  10144200       0.98
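If you also want the dimensions and column types at a glance, base R’s str() shows both at once (an optional extra check, not strictly part of the workflow):

# optional: dimensions + column types in one shot
str(dat)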

1.3 clean up dataset as desired

Let’s change the column names just for demo purposes:

names(dat) <- c("rank","word","part_of_speech","frequency","dispersion")
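As an aside, the same renaming could have been done with dplyr (just a sketch of an alternative to the assignment above; don’t run it in addition, since the old column names are already gone at this point):

# equivalent renaming with `dplyr::rename()` (syntax is new_name = old_name)
dat <- rename(dat,
              rank=Rank, word=Word, part_of_speech=PartOfSpeech,
              frequency=Frequency, dispersion=Dispersion)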

And then let’s recode part of speech so that we can look at the values and understand what they mean…

# see what the unique vals of `part_of_speech` are
unique(dat$part_of_speech)
##  [1] "a" "v" "c" "i" "t" "p" "d" "x" "r" "m" "n" "e" "j" "u"

But there’s a problem: we don’t know the correspondence between the part-of-speech tags and what they stand for! We can get a hint by using the following (google sapply(), which is extremely useful; it basically does the same kind of work that a for-loop would do):

# see a few examples of each 
sapply(unique(dat$part_of_speech), function(x){
  paste0(x, " ", dat$word[dat$part_of_speech==x][1:3])
})
##      a       v        c        i      t      p       d        x      
## [1,] "a the" "v be"   "c and"  "i of" "t to" "p it"  "d this" "x not"
## [2,] "a a"   "v have" "c that" "i in" "t NA" "p I"   "d that" "x n't"
## [3,] "a his" "v do"   "c but"  "i to" "t NA" "p you" "d what" "x NA" 
##      r       m         n          e         j         u       
## [1,] "r up"  "m one"   "n time"   "e there" "j other" "u yes" 
## [2,] "r so"  "m two"   "n year"   "e NA"    "j new"   "u oh"  
## [3,] "r out" "m first" "n people" "e NA"    "j good"  "u yeah"
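For comparison, here is roughly the same thing written as an explicit for-loop (a sketch just to show the correspondence; it collects the examples in a list rather than a matrix):

# same idea as the `sapply()` call above, but with an explicit for-loop
pos_examples <- list()
for (pos in unique(dat$part_of_speech)){
  pos_examples[[pos]] <- paste0(pos, " ", dat$word[dat$part_of_speech==pos][1:3])
}
pos_examples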

Now we can recode it with a named vector as a “lookup table”, as we have in similar cases in the past:

# set the "key-value" pairs
pos_lkup <- c(
  a="article", v="verb", c="conjunction/coordinator", i="preposition",
  t="infinitive 'to'", p="pronoun", d="determiner", x="negation", r="adverb",
  m="number (cardinal/ordinal/etc.)", n="noun", e="expletive 'there'", 
  j="adjective", u="particle"
)

# reassign `$part_of_speech` to the clean values, matching by the current ones
dat$part_of_speech <- 
  as.character(pos_lkup[match(dat$part_of_speech, names(pos_lkup))])
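If you prefer dplyr, recode() can do the same job by splicing in the lookup vector (a sketch of an alternative to the match()-based line above, not an additional step to run):

# equivalent recoding with `dplyr::recode()`, splicing `pos_lkup` in as key-value pairs
dat$part_of_speech <- recode(dat$part_of_speech, !!!pos_lkup)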

Ahhh, much better:

kable(head(dat, n=5))
 rank  word  part_of_speech            frequency  dispersion
    1  the   article                    22038615        0.98
    2  be    verb                       12545825        0.97
    3  and   conjunction/coordinator    10741073        0.99
    4  of    preposition                10343885        0.97
    5  a     article                    10144200        0.98

2. summarize and visualize columns of interest

Let’s look at the relationship between dat$frequency and dat$rank, for each part of speech (yayy, FiveThirtyEight theme via ggthemes!). First, here’s how many words fall into each part-of-speech category:

# distribution of pos's
table(dat$part_of_speech, useNA="ifany")
## 
##                      adjective                         adverb 
##                            839                            340 
##                        article        conjunction/coordinator 
##                             11                             38 
##                     determiner              expletive 'there' 
##                             34                              1 
##                infinitive 'to'                       negation 
##                              1                              2 
##                           noun number (cardinal/ordinal/etc.) 
##                           2542                             35 
##                       particle                    preposition 
##                             13                             97 
##                        pronoun                           verb 
##                             46                           1001

To simplify things, let’s just look at nouns, verbs, adjectives, and adverbs.

# **this can be done equivalently a number of ways -- what are some others?**
dat <- subset(dat, part_of_speech %in% c("noun","verb","adjective","adverb"))
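A couple of the other equivalent ways, for the curious (sketches only; each produces the same subset as the subset() call above, so pick one):

# same subset with bracket indexing
dat <- dat[dat$part_of_speech %in% c("noun","verb","adjective","adverb"), ]

# or with `dplyr::filter()`
dat <- filter(dat, part_of_speech %in% c("noun","verb","adjective","adverb"))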

2.1 display some summary tables

Here’s the distribution of part of speech in the simplified dataset:

# add some stuff to the `table` command above to make a formatted version
kable(as.data.frame(table(dat$part_of_speech, useNA="ifany")),
      col.names=c("part of speech", "count (number of rows of `dat`)"))
part of speech    count (number of rows of dat)
adjective                                   839
adverb                                      340
noun                                       2542
verb                                       1001

And a table for frequency and rank by part of speech:

# use `aggregate()` to get group-wise means
freq_tab <- aggregate(frequency ~ part_of_speech, data=dat, FUN="mean")
rank_tab <- aggregate(rank ~ part_of_speech, data=dat, FUN="mean")

# join the two aggregated df's with `dplyr::left_join()`
both_tab <- left_join(freq_tab, rank_tab, by="part_of_speech")

# now display the table
kable(both_tab, 
      col.names=c("part of speech", "mean corpus freq.", "mean freq. rank"),
      align=c("c","l","l"), digits=0)
part of speech    mean corpus freq.   mean freq. rank
adjective                     23862              2720
adverb                        62987              2268
noun                          26610              2614
verb                          64297              2436
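For reference, the same two group means can also be computed in a single dplyr pipeline (a sketch assuming dplyr is loaded, as above; it builds the same table as both_tab):

# group-wise means of frequency and rank in one dplyr pipeline
both_tab <- dat %>% 
  group_by(part_of_speech) %>% 
  summarise(frequency=mean(frequency), rank=mean(rank))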

2.2 display some plots

Here’s plots of the distributions of frequency and rank, broken down by part of speech:

# distribution of frequency by part of speech
ggplot(dat, aes(x=part_of_speech, y=frequency)) +
  geom_boxplot() + 
  scale_y_log10() +
  theme(axis.text.x=element_text(angle=45, hjust=1, vjust=1)) +
  labs(title="log word frequency by part of speech",
       subtitle="for 5000 most frequent words in COCA")

# distribution of rank by part of speech
ggplot(dat, aes(x=part_of_speech, y=rank)) +
  geom_boxplot() + 
  theme(axis.text.x=element_text(angle=45, hjust=1, vjust=1)) +
  labs(title="frequency rank by part of speech",
       subtitle="for 5000 most frequent words in COCA")

note: here are the corresponding commands to create these plots with base graphics (this chunk is not evaluated because of the chunk option eval=FALSE, but it is displayed because we set echo=TRUE in the setup chunk):

# same as above plots, but with base graphics
boxplot(frequency ~ part_of_speech, data=dat)
boxplot(rank ~ part_of_speech, data=dat)

It often makes sense to perform a log transformation on frequency data, so let’s add two new columns for the log of frequency and of rank. (note we secretly did this in the first plot above, with scale_y_log10())

dat$log_rank <- log(dat$rank)
dat$log_freq <- log(dat$frequency)

And here is a visual illustration of Zipf’s Law: corpus frequency is inversely proportional to frequency rank in the same corpus!

# peep another theme
theme_set(theme_minimal(base_size=16))

ggplot(dat, aes(x=log_rank, y=log_freq)) +
  geom_point(shape=1, size=1) +
  facet_wrap(~part_of_speech, nrow=1) +
  labs(x="log frequency rank", y="log corpus frequency",
       title="an illustration of Zipf's Law",
       subtitle="source: Corpus of Contemporary American English (COCA)") +
  theme(plot.subtitle=element_text(size=12, face="italic"))

3. perform some analysis (if desired)

Here we can use statistical models to assess the relationship between different columns of our dataset.

3.1 fit a model to the data

# fit a linear regression model with `lm()` 
# the outcome is before the tilde; and 
# the predictors are after the tilde, separated by "+"
mod <- lm(log_rank ~ log_freq + part_of_speech, data=dat)

3.2 examine the model summary and evaluate relationships

# look at the full model summary
summary(mod)
## 
## Call:
## lm(formula = log_rank ~ log_freq + part_of_speech, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.14963 -0.05703  0.01259  0.06456  0.20723 
## 
## Coefficients:
##                       Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)          16.021254   0.012369 1295.319  < 2e-16 ***
## log_freq             -0.866553   0.001254 -690.953  < 2e-16 ***
## part_of_speechadverb -0.030790   0.005449   -5.651 1.69e-08 ***
## part_of_speechnoun    0.006435   0.003357    1.917 0.055329 .  
## part_of_speechverb   -0.014754   0.003958   -3.727 0.000196 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08428 on 4717 degrees of freedom
## Multiple R-squared:  0.9904, Adjusted R-squared:  0.9904 
## F-statistic: 1.217e+05 on 4 and 4717 DF,  p-value: < 2.2e-16

The coefficient table gives us most of what we need to know.

# print a table of coefficient estimates with se's and t and p-values
kable(summary(mod)$coef, digits=4)
                       Estimate   Std. Error     t value   Pr(>|t|)
(Intercept)             16.0213       0.0124   1295.3188     0.0000
log_freq                -0.8666       0.0013   -690.9526     0.0000
part_of_speechadverb    -0.0308       0.0054     -5.6506     0.0000
part_of_speechnoun       0.0064       0.0034      1.9168     0.0553
part_of_speechverb      -0.0148       0.0040     -3.7274     0.0002

3.3 evaluate the overall quality of the model fit

R-squared is bounded by \([0, 1]\), and ours is about 0.99, so this fit is basically perfect.

# extract the r-squared value from the model summary
summary(mod)$r.squared
## [1] 0.9904044
# or compute it directly (see week1 notes for pearson's r)
cor(dat$log_rank, fitted(mod))^2
## [1] 0.9904044
# **exercise**: compute the root-mean-square-error for this model
# **question**: does it provide any information that r^2 doesn't in this case?
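If you want to check your answer to the exercise afterwards, one way to get the RMSE is straight from the residuals (a sketch; residuals() is the standard extractor function for lm objects):

# root-mean-square error = square root of the mean squared residual
sqrt(mean(residuals(mod)^2))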

3.4 summarize the findings

Looks like corpus frequency is inversely proportional to frequency rank! And part of speech seems to matter for rank, too. But is there an interaction between the effect of corpus frequency on rank and the effect of part of speech on rank?! To find out, try substituting * (times) for + (plus) in the lm() command. This will include the interaction terms for log_freq and part_of_speech as predictors in the model.
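For example, the interaction model would look something like this (a sketch; the object name mod_interact is just for illustration):

# same predictors, but now with the log_freq x part_of_speech interaction included
mod_interact <- lm(log_rank ~ log_freq * part_of_speech, data=dat)
summary(mod_interact)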

Okay done now! Woop woop ~~ :p