week4_skeleton.rmd
We’ll end up using functions from the following packages:
library("dplyr"); library("reshape2"); library("ggplot2");
library("ggthemes"); library("knitr");
# also set the ggplot2 theme to something nicer than the default
theme_set(theme_fivethirtyeight(base_size=20))
Here we read in a dataset from a URL and inspect the first few rows. For 5000 English words, this dataset lists the word's frequency (number of occurrences) in the Corpus of Contemporary American English (COCA) in `dat$Frequency`; specifies what kind of word it is (e.g. noun or verb or article) in `dat$PartOfSpeech`; and also includes the word's frequency rank, i.e. what position the word would be at if you ordered all the words in the list by their corpus frequency. These are the three columns we'll look at here.
# using two lines so the url doesn't overflow
# (maximum of 80 characters per line of code is a nice cutoff point)
link <- paste0("http://lefft.xyz/r_minicourse/",
"datasets/top5k-word-frequency-dot-info.csv")
dat <- read.csv(link, stringsAsFactors=FALSE)
head(dat, n=5)
## Rank Word PartOfSpeech Frequency Dispersion
## 1 1 the a 22038615 0.98
## 2 2 be v 12545825 0.97
## 3 3 and c 10741073 0.99
## 4 4 of i 10343885 0.97
## 5 5 a a 10144200 0.98
Let’s change the column names just for demo purposes:
names(dat) <- c("rank","word","part_of_speech","frequency","dispersion")
And then let’s recode part of speech so that we can look at it and understand what it means…
# see what the unique vals of `part_of_speech` are
unique(dat$part_of_speech)
## [1] "a" "v" "c" "i" "t" "p" "d" "x" "r" "m" "n" "e" "j" "u"
But there's a problem: we don't know the correspondence between the part of speech tags and what they stand for! We can get a hint by using the following (google `sapply()`, which is extremely useful; it basically does the same kind of work that a `for`-loop would do):
# see a few examples of each
sapply(unique(dat$part_of_speech), function(x){
paste0(x, " ", dat$word[dat$part_of_speech==x][1:3])
})
## a v c i t p d x
## [1,] "a the" "v be" "c and" "i of" "t to" "p it" "d this" "x not"
## [2,] "a a" "v have" "c that" "i in" "t NA" "p I" "d that" "x n't"
## [3,] "a his" "v do" "c but" "i to" "t NA" "p you" "d what" "x NA"
## r m n e j u
## [1,] "r up" "m one" "n time" "e there" "j other" "u yes"
## [2,] "r so" "m two" "n year" "e NA" "j new" "u oh"
## [3,] "r out" "m first" "n people" "e NA" "j good" "u yeah"
Now we can recode it with a named vector as a “lookup table”, as we have in similar cases in the past:
# set the "key-value" pairs
pos_lkup <- c(
a="article", v="verb", c="conjunction/coordinator", i="preposition",
t="infinitive 'to'", p="pronoun", d="determiner", x="negation", r="adverb",
m="number (cardinal/ordinal/etc.)", n="noun", e="expletive 'there'",
j="adjective", u="particle"
)
# reassign `$part_of_speech` to the clean values, matching by the current ones
dat$part_of_speech <-
  as.character(pos_lkup[match(dat$part_of_speech, names(pos_lkup))])
Ahhh, much better:
kable(head(dat, n=5))
rank | word | part_of_speech | frequency | dispersion |
---|---|---|---|---|
1 | the | article | 22038615 | 0.98 |
2 | be | verb | 12545825 | 0.97 |
3 | and | conjunction/coordinator | 10741073 | 0.99 |
4 | of | preposition | 10343885 | 0.97 |
5 | a | article | 10144200 | 0.98 |
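(Side note: since `pos_lkup` is a named vector, we could also have done the recode by indexing it directly with the raw tags. Here's a sketch, which is only valid *before* the recode above, while the one-letter tags are still in the column:)
# equivalent one-liner: indexing a named vector by character values
# matches on the names; `unname()` just drops the leftover names
dat$part_of_speech <- unname(pos_lkup[dat$part_of_speech])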
Let's look at the relationship between `dat$frequency` and `dat$rank`, for each part of speech. (yayy, five thirty eight theme via `ggthemes::`!)
# distribution of pos's
table(dat$part_of_speech, useNA="ifany")
##
## adjective adverb
## 839 340
## article conjunction/coordinator
## 11 38
## determiner expletive 'there'
## 34 1
## infinitive 'to' negation
## 1 2
## noun number (cardinal/ordinal/etc.)
## 2542 35
## particle preposition
## 13 97
## pronoun verb
## 46 1001
To simplify things, let’s just look at nouns, verbs, adjectives, and adverbs.
# **this can be done equivalently a number of ways -- what are some others?**
dat <- subset(dat, part_of_speech %in% c("noun","verb","adjective","adverb"))
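Here are a couple of alternatives that should give the same result (sketches only; `keep` is just an illustrative helper vector, and you'd run one of these *instead of* the `subset()` line, not after it):
keep <- c("noun","verb","adjective","adverb")
# 1. logical indexing with square brackets
dat <- dat[dat$part_of_speech %in% keep, ]
# 2. `dplyr::filter()`
dat <- dplyr::filter(dat, part_of_speech %in% keep)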
Here’s the distribution of part of speech in the simplified dataset:
# add some stuff to the `table` command above to make a formatted version
kable(as.data.frame(table(dat$part_of_speech, useNA="ifany")),
col.names=c("part of speech", "count (number of rows of `dat`)"))
part of speech | count (number of rows of `dat`) |
---|---|
adjective | 839 |
adverb | 340 |
noun | 2542 |
verb | 1001 |
And a table for frequency and rank by part of speech:
# use `aggregate()` to get group-wise means
freq_tab <- aggregate(frequency ~ part_of_speech, data=dat, FUN="mean")
rank_tab <- aggregate(rank ~ part_of_speech, data=dat, FUN="mean")
# join the two aggregated df's with `dplyr::left_join()`
both_tab <- left_join(freq_tab, rank_tab, by="part_of_speech")
# now display the table
kable(both_tab,
col.names=c("part of speech", "mean corpus freq.", "mean freq. rank"),
align=c("c","l","l"), digits=0)
part of speech | mean corpus freq. | mean freq. rank |
---|---|---|
adjective | 23862 | 2720 |
adverb | 62987 | 2268 |
noun | 26610 | 2614 |
verb | 64297 | 2436 |
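For comparison, here's a sketch of the same group-wise means computed with `dplyr` verbs instead of `aggregate()` plus a join (`both_tab_dplyr` is just an illustrative name):
# a dplyr version of the same summary table, in one pipeline
both_tab_dplyr <- dat %>%
  group_by(part_of_speech) %>%
  summarise(frequency = mean(frequency), rank = mean(rank))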
Here are plots of the distributions of frequency and rank, broken down by part of speech:
# distribution of frequency by part of speech
ggplot(dat, aes(x=part_of_speech, y=frequency)) +
  geom_boxplot() +
  scale_y_log10() +
  theme(axis.text.x=element_text(angle=45, hjust=1, vjust=1)) +
  labs(title="log word frequency by part of speech",
       subtitle="for 5000 most frequent words in COCA")
# distribution of rank by part of speech
ggplot(dat, aes(x=part_of_speech, y=rank)) +
  geom_boxplot() +
  theme(axis.text.x=element_text(angle=45, hjust=1, vjust=1)) +
  labs(title="frequency rank by part of speech",
       subtitle="for 5000 most frequent words in COCA")
note: here are the corresponding commands to create these plots with base graphics (this chunk is not evaluated because of the chunk option `eval=FALSE`; but it is displayed because we set `echo=TRUE` in the setup chunk):
# same as above plots, but with base graphics
boxplot(frequency ~ part_of_speech, data=dat)
boxplot(rank ~ part_of_speech, data=dat)
It often makes sense to perform a log transformation on frequency data, so let's add two new columns for the log of frequency and of rank. (note we secretly did this in the first plot above, with `scale_y_log10()`; that scale uses base-10 logs while `log()` defaults to natural logs, but the shape of the relationship is the same either way)
dat$log_rank <- log(dat$rank)
dat$log_freq <- log(dat$frequency)
And here is a visual illustration of Zipf's Law: corpus frequency is (roughly) inversely proportional to frequency rank in the same corpus!
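In equation form, Zipf's Law says a word's corpus frequency is roughly proportional to the reciprocal of its rank: \(f(r) \approx C / r^{s}\), with \(s\) close to 1. Taking logs of both sides gives \(\log f(r) \approx \log C - s \log r\), a straight line with slope \(-s\), which is why plotting `log_freq` against `log_rank` should come out looking (nearly) linear: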
# peep another theme
theme_set(theme_minimal(base_size=16))
ggplot(dat, aes(x=log_rank, y=log_freq)) +
  geom_point(shape=1, size=1) +
  facet_wrap(~part_of_speech, nrow=1) +
  labs(x="log frequency rank", y="log corpus frequency",
       title="an illustration of Zipf's Law",
       subtitle="source: Corpus of Contemporary American English (COCA)") +
  theme(plot.subtitle=element_text(size=12, face="italic"))
Here we can use statistical models to assess the relationship between different columns of our dataset.
# fit a linear regression model with `lm()`
# the outcome is before the tilde; and
# the predictors are after the tilde, separated by "+"
mod <- lm(log_rank ~ log_freq + part_of_speech, data=dat)
# look at the full model summary
summary(mod)
##
## Call:
## lm(formula = log_rank ~ log_freq + part_of_speech, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.14963 -0.05703 0.01259 0.06456 0.20723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.021254 0.012369 1295.319 < 2e-16 ***
## log_freq -0.866553 0.001254 -690.953 < 2e-16 ***
## part_of_speechadverb -0.030790 0.005449 -5.651 1.69e-08 ***
## part_of_speechnoun 0.006435 0.003357 1.917 0.055329 .
## part_of_speechverb -0.014754 0.003958 -3.727 0.000196 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08428 on 4717 degrees of freedom
## Multiple R-squared: 0.9904, Adjusted R-squared: 0.9904
## F-statistic: 1.217e+05 on 4 and 4717 DF, p-value: < 2.2e-16
The coefficient table gives us most of what we need to know.
# print a table of coefficient estimates with se's and t and p-values
kable(summary(mod)$coef, digits=4)
 | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | 16.0213 | 0.0124 | 1295.3188 | 0.0000 |
log_freq | -0.8666 | 0.0013 | -690.9526 | 0.0000 |
part_of_speechadverb | -0.0308 | 0.0054 | -5.6506 | 0.0000 |
part_of_speechnoun | 0.0064 | 0.0034 | 1.9168 | 0.0553 |
part_of_speechverb | -0.0148 | 0.0040 | -3.7274 | 0.0002 |
R-squared always lies in \([0, 1]\), so this fit is basically perfect.
# extract the r-squared value from the model summary
summary(mod)$r.squared
## [1] 0.9904044
# or compute it directly (see week1 notes for pearson's r)
cor(dat$log_rank, fitted(mod))^2
## [1] 0.9904044
# **exercise**: compute the root-mean-square-error for this model
# **question**: does it provide any information that r^2 doesn't in this case?
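One possible way into the exercise, as a sketch: root-mean-square error is just the square root of the average squared residual, which we can pull straight from the fitted model.
# RMSE = square root of the mean squared residual (one possible approach)
sqrt(mean(residuals(mod)^2))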
Looks like corpus frequency is inversely proportional to frequency rank! And part of speech seems to matter for rank, too. But is there an interaction between the effect of corpus frequency on rank and the effect of part of speech on rank?! To find out, try substituting `*` (times) for `+` (plus) in the `lm()` command. This will include the interaction terms for `log_freq` and `part_of_speech` as predictors in the model.
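For concreteness, here's a sketch of what that would look like (`mod_int` is just an illustrative name; its output isn't shown here):
# a sketch of the interaction model suggested above
mod_int <- lm(log_rank ~ log_freq * part_of_speech, data=dat)
# peek at the coefficient table
kable(summary(mod_int)$coef, digits=4)
# and compare the two models with an F-test
anova(mod, mod_int)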
Okay done now! Woop woop ~~ :p