The use of Twitter data for estimating public sentiment about issues can still be considered a fledgling field. Here we assess the degree to which a fairly naive sentiment classifier can be used to predict the trajectory of the average bitcoin trading price over time. The catch is that we’ll only use data publicly available via the Twitter API. This is limiting in a few respects, chief among them that targeting specific historical date ranges is basically impossible with only Twitter’s public search API. A scaled-up (and idealized) version of this strategy would use firehose output for bitcoin-related keywords at particular time intervals during which bitcoin trading prices are volatile.
A few links to stuff used or mentioned in this post:
- the monkeylearn:: R package

# load dependencies:
# - for data acquisition and processing
library("twitteR"); library("ROAuth"); library("httr")
library("lubridate"); library("reshape2"); library("dplyr")
# - for displaying results
library("ggplot2"); library("scales"); library("knitr")
# set auth cache to false to avoid prompts in `setup_twitter_oauth()`
options(httr_oauth_cache=FALSE)
# read in api keys and access tokens
keyz <- read.csv("../../../../keyz.csv", stringsAsFactors=FALSE)
# set up twitter auth -- call `setup_twitter_oauth()` w keys from `keyz`
setup_twitter_oauth(
consumer_key = keyz$value[keyz$auth_type=="twtr_api_key"],
consumer_secret = keyz$value[keyz$auth_type=="twtr_api_sec"],
access_token = keyz$value[keyz$auth_type=="twtr_acc_tok"],
access_secret = keyz$value[keyz$auth_type=="twtr_acc_sec"]
)
## [1] "Using direct authentication"
The user @bitcoinprice is a bot that tweets out the average BTC trading price on an hourly basis. We want to build an hourly time-series dataset of BTC prices, starting right now and going back as far as the API will let us go.
# want days to be formatted as default abbrevs, but ordered like this
day_labels <- c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")
# query the API for max allowed num of tweets (3200) from user @bitcoinprice,
# then clean up the data, toss all the info we don't want/need, + call it `dat`
dat <- userTimeline(user="bitcoinprice", n=3200, excludeReplies=TRUE) %>%
# convert json-ish tweets to a df
twListToDF() %>%
# not interested in retweeted content
filter(!isRetweet) %>%
# make time-series easier to deal with
mutate(date = date(created),
hour = hour(created),
date_time = as.POSIXct(created, tz="UTC")) %>%
# get day of the week, give them quick labels as defined above
mutate(day = weekdays(date_time, abbreviate=TRUE)) %>%
mutate(day = factor(day, levels=day_labels)) %>%
# put colnames in my preferred format
rename(data_src = screenName,
num_fav = favoriteCount,
num_RT = retweetCount,
is_RT = isRetweet) %>%
# toss everything that's not relevant, arrange as desired
select(data_src, date_time, date, hour, day,
num_fav, num_RT, is_RT, text)
The tweets – now in dat$text – are completely formulaic, and have the following shape:
@bitcoinprice: The average price of Bitcoin across all exchanges is <dddd.dd> USD
where <dddd.dd> is the string that we want to extract and convert to numeric. We’ll extract it with some straightforward regex-ing, performing a couple of quality control checks along the way.
# a tweet consists of: toss[1] + "dddd.dd" + toss[2]
toss <- c("The average price of Bitcoin across all exchanges is ", " USD")
# check that all tweets are formulaic + as expected
if (!all(grepl(paste0(toss[1], "\\d*\\.?\\d*", toss[2]), dat$text))){
warning("careful -- tweets not as formulaic as they may seem! :/ ")
}
# given formulaic tweets, remove `toss` + extract price from `dat$text`
dat$price <- gsub(paste0(toss, collapse="|"), "", dat$text, perl=TRUE)
# check we don't have any words/letters left before converting to numeric
if (sum(grepl("[a-z]", dat$price)) > 0){
warning("careful -- some tweets were not in expected format! :o ")
}
# now convert to numeric
dat$price <- as.numeric(dat$price)
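As an aside – if the gsub()-and-delete approach ever feels fragile, an equivalent route is to match the number directly instead of deleting everything around it. A minimal sketch using base R only (this assumes, as above, that each tweet contains exactly one price-like number):

# alternative extraction: grab the "dddd.dd"-shaped substring from each tweet
price_match <- regmatches(dat$text, regexpr("\\d+\\.?\\d*", dat$text))
# should agree with the gsub()-based extraction above
all(as.numeric(price_match) == dat$price)  # expected: TRUE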
Quick inspection reveals that there’s something anomalous about one of the tweets: on 2017-03-14 15:00:03, @bitcoinprice says the average trading price is $1.25. A cursory glance at the surrounding data points suggests that it’s probably meant to be $1225.00.1
# fix the incorrect data point (crudely -- just scale it back up by a factor of 1000)
dat$price[dat$price==1.25] <- dat$price[dat$price==1.25] * 1000
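The $1.25 tweet was easy to catch by eye, but a slightly more general sanity check would be to flag any price that jumps by an implausible amount relative to the neighboring observation. A rough sketch (the 10x threshold is arbitrary, and the rows are already roughly hourly, so adjacent rows are comparable):

# flag prices that differ from the neighboring observation by ~an order of magnitude
price_ratio <- dat$price / dplyr::lag(dat$price)
suspect <- which(!is.na(price_ratio) & (price_ratio > 10 | price_ratio < 0.1))
if (length(suspect) > 0){
  warning("possible misreported prices at rows: ", paste(suspect, collapse=", "))
}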
And make a quick table for a final quality check – seems reasonable for the last 4-ish months of BTC pricing.
kable(table(cut(dat$price, breaks=seq(0, max(dat$price), 500))),
col.names=c("price range (USD)","count"))
price range (USD) | count |
---|---|
(0,500] | 0 |
(500,1e+03] | 478 |
(1e+03,1.5e+03] | 1977 |
(1.5e+03,2e+03] | 407 |
(2e+03,2.5e+03] | 301 |
Now we can look at the price of BTC over recent months, using @bitcoinprice’s tweets. We’ll validate the data with an external source in the next step.
# what's our date range? (3200 hourly tweets so should be 3200/24 days ~ 4mo)
range(dat$date)
## [1] "2017-01-20" "2017-06-04"
# make some plots
# price over time for full range of tweets
ggplot(dat, aes(x=date_time, y=price)) +
geom_line() + scale_y_continuous(limits=c(0, 3000))
# aggregate over day-hour pairs to look at typical structure of a week
pdat <- dat %>% select(day, hour, price) %>% group_by(day, hour) %>% summarize(
mean_price = mean(price)
) %>% data.frame()
ggplot(pdat, aes(x=hour, y=mean_price)) +
geom_line() +
facet_wrap(~day, ncol=7)
Just to make sure our prices are accurate, we can validate them on an external source of data. Here we’ll use bitcoinity.org, which offers a nice little data export feature on the price-tracking portion of their website.
# get external price data from bitcoinity.org
bitcoinity <- read.csv(
"data/bitcoinity_data-price-volume.csv",
col.names=c("date","price","volume"),
colClasses=c("Date","numeric","numeric")
)
bitcoinity$data_src <- "bitcoinity.org"
kable(head(bitcoinity, n=4))
date | price | volume | data_src |
---|---|---|---|
2010-07-17 | 0.0495100 | 0.990200 | bitcoinity.org |
2010-07-18 | 0.0779750 | 5.091994 | bitcoinity.org |
2010-07-19 | 0.0867925 | 49.731775 | bitcoinity.org |
2010-07-20 | 0.0779994 | 20.595480 | bitcoinity.org |
Now plot the bitcoinity data on the same interval as the @bitcoinprice data. The result is sparser than the plot above since there are fewer data points per 24-hr period, but the pattern looks identical. Nice.
# the external data, plotted on the same interval as `dat` above:
ggplot(bitcoinity[bitcoinity$date >= min(dat$date), ], aes(x=date, y=price)) +
geom_line() + scale_y_continuous(limits=c(0, 3000))
Just for reference, here’s the trajectory of BTC since the very beginning (fml can’t believe I didn’t get in when I first heard about BTC back in 2011 :/). There’s also a trading volume column in the bitcoinity data, which we’ll plot in orange in the same window.
# want to display the all-time high price and date
fig_cap <- paste0(
"all-time high: $", round(max(bitcoinity$price)), ", on ",
bitcoinity$date[bitcoinity$price==max(bitcoinity$price)]
)
# the external data, plotted on the whole lifetime of btc:
ggplot(bitcoinity, aes(x=date, y=price)) +
geom_line() +
geom_line(aes(x=date, y=volume/1e6), color="#ed9953") +
geom_hline(yintercept=max(bitcoinity$price),
color="#8aa8b5", linetype="dashed") +
annotate(geom="text", x=as.Date("2016-02-01"), y=max(bitcoinity$price)+100,
color="#8aa8b5", label=fig_cap) +
scale_y_continuous(limits=c(0, 3000)) +
scale_x_date(date_breaks="6 months") +
theme(axis.text.x=element_text(angle=45, vjust=1, hjust=1),
plot.caption=element_text(color="#ed9953", hjust=0)) +
labs(y="average trading price (USD)",
caption="(trading volume in orange)")
Everything looks fine, but to be totally sure we can merge the external data with the tweeted price data, and plot them against each other.
# first aggregate @bitcoinprice data to the day-level
tdat_simple <- dat %>%
mutate(date=as.character(date)) %>%
select(data_src, date, price) %>%
group_by(data_src, date) %>% summarize(
price = mean(price, na.rm=TRUE)
) %>% data.frame()
# then cut the external data to the same interval and harmonize cols for merge
bitcoinity_simple <- bitcoinity %>%
filter(date >= min(dat$date)) %>%
select(data_src, date, price) %>%
mutate(date=as.character(date)) %>%
data.frame()
# check that a stack-merge would give us what we want -- yep
kable(rbind(head(tdat_simple, 2), head(bitcoinity_simple, 2)))
data_src | date | price |
---|---|---|
bitcoinprice | 2017-01-20 | 899.5345 |
bitcoinprice | 2017-01-21 | 919.9208 |
bitcoinity.org | 2017-01-20 | 896.2196 |
bitcoinity.org | 2017-01-21 | 918.2547 |
# date info got destroyed along the way, so fix it during merge
merged <-
rbind(tdat_simple, bitcoinity_simple) %>%
mutate(date = as.Date(date))
# now plot the merged data -- diff not even visible
ggplot(merged, aes(x=date, y=price, group=data_src, color=data_src)) +
geom_line() +
scale_y_continuous(limits=c(0, 3000)) +
scale_x_date(date_breaks="2 weeks") +
theme(axis.text.x=element_text(angle=45, vjust=1, hjust=1)) +
theme(legend.position="top")
# zoom in to see how big the differences that do exist are
merged_zoom <- merged %>% filter(date >= "2017-05-21" & date <= "2017-05-24")
# get price differences for each day in `merged_zoom`
# (each vec is already ordered by date so this is safe to do)
price_diffs <-
round(merged_zoom$price[merged_zoom$data_src=="bitcoinprice"] -
merged_zoom$price[merged_zoom$data_src=="bitcoinity.org"])
ggplot(merged_zoom, aes(x=date, y=price, group=data_src, color=data_src)) +
geom_line() +
scale_y_continuous(limits=c(2000, 2500), breaks=c(2000, 2500)) +
scale_x_date(date_breaks="1 day") +
theme(legend.position="top") +
annotate("text", x=unique(merged_zoom$date), y=2500,
label=paste0("$", abs(price_diffs), " diff"))
Interesting… The prices don’t differ by much, but with automated trading, I imagine one could take real advantage of knowing which price-reporting systems are more accurate than others. Another thing to consider is that @bitcoinprice and bitcoinity.org draw their samples at slightly different times. Given the volatility of bitcoin pricing, I’d say that even a difference of $17 on a given date is well within a reasonable margin for measurement error.
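To put a number on that instead of eyeballing a three-day window, we can also summarize the day-level disagreement between the two sources across the whole merged range. A quick sketch using the merged data frame from above (dates present in only one source will just show up as NA):

# spread the two sources into columns and summarize their daily absolute differences
price_diffs_all <- merged %>%
  dcast(date ~ data_src, value.var="price") %>%
  mutate(abs_diff = abs(bitcoinprice - bitcoinity.org))
summary(price_diffs_all$abs_diff)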
Monkeylearn is a nice web-based NLP service that allows you to programmatically send unstructured text to classifiers, taggers, and extractors that are hosted on the platform. The R bindings to the monkeylearn API (via the monkeylearn:: package) take in some text and a specification of which classifier/extractor you want to use, and spit back classifications, sentiment scores, or whatever you requested from the model you specified.
# load monkeylearn api bindings
library("monkeylearn")
# get my private monkeylearn key from the keyz df
ml_key <- keyz$value[keyz$auth_type=="monkey_pkey"]
What seem to be the two most popular sentiment classifiers designed specifically for tweet text are the following:
- an English-only tweet sentiment classifier ("cl_qkjxv9Ly")
- a multilingual tweet sentiment classifier ("cl_RWBfoo2B")

I’d expect the English-only classifier to be a bit better, all else equal. A quick look at the confusion matrices for each classifier suggests that the English-specific one is more trustworthy.
knitr::include_graphics("input/confusion-matrix-cl_qkjxv9Ly.png")
knitr::include_graphics("input/confusion-matrix-cl_RWBfoo2B.png")
Since we’ll just be using a couple of classifiers in a couple of specific ways, it makes sense to wrap the monkeylearn_classify()
function with some defaults we’ll be using.2
ml_classify <- function(text, id){
# send the text and classifier id to ML, get analysis back
monkeylearn_classify(
request=text,
key=ml_key,
classifier_id=id,
verbose=TRUE
)
}
# set variables for the classifiers, for easier arg passing
clf_english <- "cl_qkjxv9Ly"
clf_multilg <- "cl_RWBfoo2B"
Let’s see if the classifier(s) do anything remotely like what we want them to do. To be explicit: what we want it/them to do is to take in a bunch of tweets about bitcoin, and assign each of them a sentiment category (with a measure of confidence so we can filter out what seem to be pure guesses).
One thing we could do is see how the classifiers do on tweets about topics for which we have clear expectations – e.g. tweets about Caitlyn Jenner should be positive (lol), tweets about Donald Trump should be…er…idk unfortunately. That’ll be the eventual strategy, but in the interest of not losing interest, we’ll validate with some more specific and longer texts.
I’d say that one of the most “negative” things ever written in the English language would be The Unabomber Manifesto. So let’s make sure the classifiers consider it “negative.”
# read in local copy of the unabomber manifesto -- "kaczynski (1995)"
unabomber_manif <- readLines("input/kaczynski1995.txt")
# query each of the two classifiers with the unabomber manifesto
unabomber_english_clf <- ml_classify(unabomber_manif, id=clf_english)
unabomber_multilg_clf <- ml_classify(unabomber_manif, id=clf_multilg)
Now make a quick table summarizing the Unabomber Manifesto results for each classifier.
una_tbl_english <- unabomber_english_clf %>% group_by(label) %>% summarize(
num_preds = length(label),
prop_preds = num_preds / nrow(unabomber_english_clf),
avg_prob = mean(probability, na.rm=TRUE)
) %>% mutate_if(is.numeric, round, digits=2) %>% mutate(clf = "clf_english")
una_tbl_multilg <- unabomber_multilg_clf %>% group_by(label) %>% summarize(
num_preds = length(label),
prop_preds = num_preds / nrow(unabomber_multilg_clf),
avg_prob = mean(probability, na.rm=TRUE)
) %>% mutate_if(is.numeric, round, digits=2) %>% mutate(clf = "clf_multilg")
As expected, looks like the English-only one matches our intuitions way better – it classifies most portions of the manifesto as “negative” – 66% – and only 12% as “positive.” The multilingual sentiment analyzer classifies 45% as “Negative”, but even more – 53% – as “Positive”!
kable(rbind(una_tbl_english, una_tbl_multilg))
label | num_preds | prop_preds | avg_prob | clf |
---|---|---|---|---|
negative | 208 | 0.66 | 0.54 | clf_english |
neutral | 68 | 0.22 | 0.47 | clf_english |
positive | 38 | 0.12 | 0.45 | clf_english |
Negative | 141 | 0.45 | 0.64 | clf_multilg |
Neutral | 8 | 0.03 | 0.48 | clf_multilg |
Positive | 165 | 0.53 | 0.66 | clf_multilg |
Let’s also make sure the classifiers consider something clearly positive to be as such. Song of Solomon from the Old Testament is pretty “positive” in a certain sense, I’d say :p
# a positive text
song_of_solomon <- readLines("input/song_of_solomon.txt")
# query each of the two classifiers with the song of solomon
songos_english_clf <- ml_classify(song_of_solomon, id=clf_english)
songos_multilg_clf <- ml_classify(song_of_solomon, id=clf_multilg)
Now make a quick table summarizing the Song of Solomon results for each classifier.
sos_tbl_english <- songos_english_clf %>% group_by(label) %>% summarize(
num_preds = length(label),
prop_preds = num_preds / nrow(songos_english_clf),
avg_prob = mean(probability, na.rm=TRUE)
) %>% mutate_if(is.numeric, round, digits=2) %>% mutate(clf = "clf_english")
sos_tbl_multilg <- songos_multilg_clf %>% group_by(label) %>% summarize(
num_preds = length(label),
prop_preds = num_preds / nrow(songos_multilg_clf),
avg_prob = mean(probability, na.rm=TRUE)
) %>% mutate_if(is.numeric, round, digits=2) %>% mutate(clf = "clf_multilg")
The results here are less clear than in the previous case. But still, the multilingual classifier thinks about a third of SoS is “negative.” In light of that, and in light of its apparent avoidance of the Neutral label, I’d say we can safely conclude that the English-only classifier is more likely to give us what we want.3
kable(rbind(sos_tbl_english, sos_tbl_multilg))
label | num_preds | prop_preds | avg_prob | clf |
---|---|---|---|---|
negative | 43 | 0.16 | 0.46 | clf_english |
neutral | 140 | 0.51 | 0.48 | clf_english |
positive | 91 | 0.33 | 0.58 | clf_english |
Negative | 91 | 0.33 | 0.56 | clf_multilg |
Neutral | 23 | 0.08 | 0.49 | clf_multilg |
Positive | 160 | 0.58 | 0.60 | clf_multilg |
With our tweet sentiment classifier now selected, let’s get a random sample of bitcoin-related tweets from the last few months.
# get random sample of dates from first 100 days of 2017
dates <- as.Date(sample(1:100, size=10), origin="2017-01-01") %>% as.character()
# quick wrapper for searchTwitter() with desired args
get_tweets <- function(date, term="bitcoin", num=100){
tweets <- searchTwitter(term, n=num, resultType="popular", since=date)
return(tweets)
}
# collect tweets for each date, convert the collections to a single df
tweets <- twListToDF(do.call(c, lapply(dates, get_tweets)))
# apply the same transformations we did to the btc price tweets, to clean them
tweets <- tweets %>% filter(!isRetweet) %>%
mutate(date=date(created), hour=hour(created),
date_time=as.POSIXct(created, tz="UTC")) %>%
mutate(day=weekdays(date_time, abbreviate=TRUE)) %>%
mutate(day=factor(day, levels=day_labels)) %>%
rename(data_src=screenName, num_fav=favoriteCount,
num_RT=retweetCount, is_RT=isRetweet) %>%
select(data_src, date_time, date, hour, day, num_fav, num_RT, is_RT, text)
# peep the first couple of rows
head(tweets, n=2)
## data_src date_time date hour day num_fav num_RT
## 1 ErikVoorhees 2017-06-02 20:12:03 2017-06-02 20 Fri 324 177
## 2 KimDotcom 2017-06-03 18:01:05 2017-06-03 18 Sat 637 229
## is_RT
## 1 FALSE
## 2 FALSE
## text
## 1 Best article on Bitcoin I've seen at ZeroHedge: Projecting The Price Of Bitcoin | Zero Hedge https://t.co/f6dbMhwpcr #finance #blockchain
## 2 Bitcoin $2553 —————— \xed\xa0\xbe\xed\xb5\x82\xed\xa0\xbc\xed\xbd\xbe https://t.co/GuJ2gKCuMO
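As that output shows, the raw tweets are full of URLs and raw emoji bytes, which probably don’t help the classifier. If we wanted to, we could strip that noise out before sending the text off – a minimal sketch below (whether this actually improves the classifications is an open question, so it’s left optional here):

# optional: strip URLs, non-ASCII bytes, and extra whitespace from tweet text
clean_tweet_text <- function(text){
  text %>%
    gsub("https?://\\S+", "", .) %>%              # drop links
    iconv(from="UTF-8", to="ASCII", sub="") %>%   # drop emoji/stray bytes (blunt!)
    gsub("\\s+", " ", .) %>%                      # collapse leftover whitespace
    trimws()
}
# e.g.: tweets$text <- clean_tweet_text(tweets$text)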
Now we can use clf_english to classify the tweets:
# classify the tweets
tweets_clf <- ml_classify(text=tweets$text, id=clf_english)
# summarize the classification results
tweets_tbl <- tweets_clf %>% group_by(label) %>% summarize(
num_preds = length(label),
prop_preds = num_preds / nrow(tweets_clf),
avg_prob = mean(probability, na.rm=TRUE)
) %>% mutate_if(is.numeric, round, digits=2)
# look at the table to see results
kable(tweets_tbl)
label | num_preds | prop_preds | avg_prob |
---|---|---|---|
negative | 50 | 0.08 | 0.45 |
neutral | 490 | 0.82 | 0.66 |
positive | 60 | 0.10 | 0.53 |
Interesting – we’re getting a little over 80% “neutral” classifications, with a decent average degree of confidence… What if we filter out all rows with less than, say, 50% confidence?
tweets_confident <- tweets_clf %>% filter(probability > .5)
kable(tweets_confident %>% group_by(label) %>% summarize(
num_preds = length(label),
prop_preds = num_preds / nrow(tweets_confident),
avg_prob = mean(probability, na.rm=TRUE)
) %>% mutate_if(is.numeric, round, digits=2))
label | num_preds | prop_preds | avg_prob |
---|---|---|---|
negative | 10 | 0.02 | 0.64 |
neutral | 440 | 0.90 | 0.68 |
positive | 40 | 0.08 | 0.60 |
Not a whole lot more illuminating, and this ate up a lot of our sample. So, acknowledging that we live in an imperfect world, we’ll just go with the raw classifications. Eventually we’d want to define a sentiment metric given a table of classifications like the ones we get from ml_classify(). But the output is simple enough that we can just eyeball it and quasi-subjectively evaluate whether what we get is reasonable.
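For what it’s worth, here’s one candidate metric we could reach for once we do want a single number per batch of tweets: a confidence-weighted balance of positive versus negative labels, ranging from -1 (everything confidently negative) to +1 (everything confidently positive). Just a sketch of one possibility, not a recommendation:

# one possible sentiment metric: confidence-weighted (positive - negative) share
sentiment_score <- function(clf_output){
  signed <- ifelse(clf_output$label=="positive", 1,
                   ifelse(clf_output$label=="negative", -1, 0))
  sum(signed * clf_output$probability) / nrow(clf_output)
}
sentiment_score(tweets_clf)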
We should also probably search for related terms, including those with hashtags – for example “#bitcoin”, “btc”, “#btc”, “cryptocurrency”, etc. For now that’s left for future work. Shoot me an email if you have some ideas about how this could be done systematically.
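One obvious starting point, though, would be to loop the same search wrapper over a vector of candidate terms and tag each tweet with the term that found it – something like the sketch below (term list and counts are illustrative only, and we’d still need to de-duplicate tweets that match more than one term):

# illustrative only: collect tweets for several related search terms
btc_terms <- c("bitcoin", "#bitcoin", "btc", "#btc", "cryptocurrency")
term_tweets <- lapply(btc_terms, function(term){
  tw <- twListToDF(get_tweets(dates[1], term=term, num=50))
  tw$search_term <- term
  tw
}) %>% bind_rows()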
The natural next step is to run a couple of quick significance tests to help understand the nature of the result/difference…
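For example, a quick chi-squared test on the label counts from the Unabomber comparison above would tell us whether the two classifiers’ disagreement over the same text is bigger than we’d expect from noise. A rough sketch, reusing the summary tables built earlier:

# do the two classifiers distribute labels differently over the same text?
label_counts <- rbind(
  english = una_tbl_english$num_preds,
  multilg = una_tbl_multilg$num_preds
)
colnames(label_counts) <- c("negative", "neutral", "positive")
chisq.test(label_counts)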
Here’s a list of some things that are both important for this idea to work, and are also totally mysterious and hard to figure out!
1. What is the direction of causation between tweet sentiment and price? E.g. do people freak out and that causes prices to go down, or do prices go down for other reasons and then that gets reflected in people’s tweets? Interesting questions, but elusive! Relative timing issues are super important here – see point 2 for discussion, and the rough cross-correlation sketch after this list.
2. Can we get finer-grained temporal resolution on topic-related tweets? That’d be tremendously helpful, since it’s easy to get temporally fine-grained price data. The only way is probably some kind of bridge to the Twitter firehose; we’ll probably need to look more into services like Gnip to solve this problem.
Relevant to this issue is the following quote (which might be outdated at this point – not totally sure):
“A question that I’m often asked is how to retrieve data from the past, generally people are doing a study on some major event that has already happened (e.g. Arab Spring, an election, etc). Using the Twitter API this is impossible as you can only go back a small amount. However, if you have the ability to look ahead, it is easy to enable a prospective study by collecting data and automatically persisting it to a database.” (2015, from the twitteR:: package documentation)
3. How can we get a higher-quality sentiment classifier?! (Without having to write one myself!) In other words, if you’re an NLP guru interested in this stuff, shoot me an email.
4. What even is the nature of the signal?! A lot of the stuff you see in these tweets reads like an ad, so the degree to which these tweets can even be used to measure “public sentiment” – or even “sentiment” – about bitcoin might be kind of questionable in the first place. To be continued…maybe.
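On the timing question in point 1, one crude first pass – once sentiment and price live on the same hourly grid, which they don’t yet – would be a cross-correlation between hourly sentiment scores and hourly price changes, to see whether sentiment tends to lead or lag. A sketch under that assumption (the data frame hourly, with sentiment and price columns aligned by hour, is hypothetical here):

# hypothetical: `hourly` has one row per hour with columns `sentiment` and `price`
price_change <- diff(log(hourly$price))      # hourly log-returns
sentiment <- hourly$sentiment[-1]            # align sentiment with the returns
ccf(sentiment, price_change, lag.max=24,
    main="cross-correlation: tweet sentiment vs hourly BTC log-returns")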
exercise: what does this suggest about how the @bitcoinprice bot works?!↩
note: ml_classify() assumes that the global variable ml_key is set to a valid monkeylearn API key. If it’s not, you won’t be able to send queries :/↩
Whether that should count as “likely” at all in the first place is definitely debatable. Buuuut let’s just assume it gives us a legitimate measure of positivity/negativity – if we find a much better classifier in the future, we can just sub it in and then we’ll have results we can trust better.↩
I think the mean over each hourly tweet should be fine; maybe median could give us a slightly different perspective. Not sure, but worth thinking about some more.↩
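(A quick way to see how much that choice matters, for what it’s worth – compute both daily aggregates from the hourly tweet prices and look at how far apart they get:)

# compare mean- vs median-based daily aggregation of the hourly tweet prices
dat %>% group_by(date) %>%
  summarize(mean_price = mean(price), median_price = median(price)) %>%
  mutate(gap = abs(mean_price - median_price)) %>%
  summarize(max_gap = max(gap), mean_gap = mean(gap))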