timothy leffel // http://lefft.xyz // april05/2018

## logistics

• most of today will be live demos
• clone this repository to follow along: https://github.com/lefft/nlp_intro
• to run the code on your own machine, you'll need:
  • Python 3.3+, with sklearn, pandas, numpy
  • R 3.4+, with dplyr, reshape2, Rtsne, ggplot2, ggrepel, gridExtra

(dependencies also listed in the repo readme)

## outline

1. overview of NLP (10-15min)
   • what is natural language processing? (a bit of navel gazing)
2. modern tooling for NLP (5min)
   • languages and packages
3. live demos (40min)
   • document classification (Python)
   • word embeddings (R)

## what is natural language processing? (a linguist's take)

two branches of computational linguistics:

• symbolic computational linguistics: using computers to construct and evaluate discrete models of language structure and meaning – the goal is to understand and represent human language
• statistical computational linguistics: using computers to learn what actual language use is like by exposing them to statistical regularities that are present in large, naturally-occurring corpora

## what is natural language processing? (a linguist's take)

I think of Natural Language Processing as basically synonymous with "statistical computational linguistics"

NLP is less of a unified field with an established theoretical tradition than it is an amalgamation of approaches to tasks that are derived from theoretical constructs from computer science, linguistics, and information theory.

## what is natural language processing?

high-level distinction between NLU and NLG

• Natural language understanding (NLU)
• Natural language generation (NLG)

Usually when people talk about "natural language processing" they mean NLU, and more specifically NLU on text data (i.e. not audio).

We'll focus on text-based NLU today

## what is natural language processing?

it's easier to characterize NLP in terms of tasks and objectives than with an abstract definition

• document classification* ("is this text about Chicago or Trump or vaping?")
• sentiment analysis ("is the author endorsing or criticizing something?")
• topic modeling ("what 'topics' are discussed in this set of documents?")
• named entity recognition ("who/what/where does this text talk about?")
• dependency parsing ("what is the structure of the sentences in this text?")
• and many many more…

*document classification is the most common NLP task at NORC -- e.g. coding of free-form survey responses, detecting topics or sentiment in social media posts
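
To make the document classification task concrete, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Python. The toy corpus, the labels, and the helper names `fit` and `predict` are invented for illustration; this is not the code from the demos, and a real project would more likely use a library model (e.g. from sklearn).

```python
import math
from collections import Counter, defaultdict

# Toy labeled corpus, invented for illustration
train = [
    ("the bears play in chicago", "chicago"),
    ("chicago deep dish pizza", "chicago"),
    ("vaping devices and e-cigarettes", "vaping"),
    ("teens and vaping flavors", "vaping"),
]

def fit(examples):
    """Count documents per label and words per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab

def predict(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    n_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / n_docs)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = fit(train)
label = predict("pizza in chicago", model)
```

With four training documents this only shows the mechanics: count words per label, then pick the label under which the new document is most probable.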

## what is natural language processing?

dependency parsing:

• tokenization
• lemmatization/stemming
• POS tagging
• build parse trees
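
The first two steps above can be sketched in a few lines of plain Python. The regex tokenizer and the suffix list are simplifying assumptions made for illustration; real tokenizers and stemmers (e.g. the ones shipped with NLTK or spacy) handle far more cases.

```python
import re

def tokenize(text):
    """Lowercase and split into word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Hypothetical suffix list for illustration; real stemmers use many more rules
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The parsers were parsing tagged sentences.")
stems = [crude_stem(t) for t in tokens]
# tokens: ['the', 'parsers', 'were', 'parsing', 'tagged', 'sentences']
# stems:  ['the', 'parser', 'were', 'pars', 'tagg', 'sentenc']
```

Note that stems need not be real words ("tagg", "sentenc"); a lemmatizer would map to dictionary forms instead.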

• usually the unit of information in an NLP task is the "document" (e.g. a tweet or a webpage or an Amazon product review)
• a dataset is a set of documents, aka a corpus
• the name of the game is usually:
  • do some preprocessing on the corpus (e.g. lowercasing, lemmatization);
  • transform corpus into a numerical matrix (e.g. DTM, TCM);
  • do a bunch of linear algebra over that matrix until you can either make the predictions you need to make, or view a visualization of the clustering you need to understand; and
  • evaluate model predictions on the test/holdout set against human judgments/labels (for supervised problems like document classification)
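
A minimal sketch of the "corpus to matrix to linear algebra" steps, using only the standard library. The three documents are invented, and whitespace splitting stands in for real preprocessing; in practice something like sklearn's `CountVectorizer` would build the document-term matrix (DTM) for you.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration
docs = [
    "Chicago is windy",
    "Vaping in Chicago",
    "windy windy day",
]

# 1. preprocessing: lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in docs]

# 2. vocabulary: one column per unique term, in sorted order
vocab = sorted({tok for doc in tokenized for tok in doc})

# 3. DTM: one row per document, one count per vocabulary term
def to_row(tokens):
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

dtm = [to_row(toks) for toks in tokenized]

# 4. a little linear algebra: cosine similarity between two document rows
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

sim = cosine(dtm[0], dtm[2])
```

Here `vocab` is `['chicago', 'day', 'in', 'is', 'vaping', 'windy']`, so the first row of `dtm` is `[1, 0, 0, 1, 0, 1]`. Documents that share terms get a cosine similarity above zero.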

## caveat

• most NLP tasks are very computationally expensive (we'll see examples later)
• high-level languages like Python and R are great for exploring data and deploying small- to medium-scale models
• but reference implementations are often written in lower-level languages -- e.g. both word2vec and GloVe are written in C; Stanford CoreNLP is in Java

Today we'll just look at modern tooling for Python and R

• pro tip: you can usually search github and find nice/convenient Python bindings to important algorithms implemented in other languages

## Python or R?

• as of 2018, the NLP ecosystem is much more mature in Python than in R
• R's memory management system makes it less than ideal for large-scale text analysis endeavors
• BUT some recent R packages for NLP are written largely in C++ or Rcpp::, and are shockingly fast/efficient -- something worth keeping an eye on (e.g. shouts to text2vec::)

bottom line: if you'll be dealing with more than a few MB of text, use Python

## workflow

For simple text classification problems, a nice workflow might be (informally):

1. use R to explore a sample of your corpus (visualization, summary statistics)
2. use Python to build preprocessing pipeline, develop, select, and train model
3. use Python to apply model to documents, record predictions
4. use R to evaluate model performance, visualize, and report results (.Rmd/ggplot2::)

NOTE: Jupyter Notebooks make this approach smoother than you might expect!

NOTE: or if you want to stay in the RStudio ecosystem, check out their new reticulate:: package – makes it easy to integrate Python and R into a single workflow (also smoother than you might expect)
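
One low-tech way to hand predictions from step 3 (Python) to step 4 (R) is a CSV file. The sketch below is an assumption about how that hand-off could look: the column names, the example rows, and the `write_predictions` helper are all hypothetical.

```python
import csv
import io

# Hypothetical predictions: (document id, predicted label, true label)
predictions = [
    ("doc1", "chicago", "chicago"),
    ("doc2", "vaping", "chicago"),
    ("doc3", "vaping", "vaping"),
]

def write_predictions(rows, handle):
    """Write a header plus one row per document."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["doc_id", "predicted", "actual"])
    writer.writerows(rows)

# An in-memory buffer stands in for a real file,
# e.g. open("predictions.csv", "w", newline="")
buffer = io.StringIO()
write_predictions(predictions, buffer)
csv_text = buffer.getvalue()
```

On the R side, a file like this can be loaded with `read.csv()` and passed on to dplyr/ggplot2 for evaluation and plots.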

#### NLTK

• tried and true – contains lots of corpora and datasets too
• implements statistical and symbolic models (unique in this respect)
• feels/smells old, originally designed as a teaching resource

#### gensim

• implements several cutting edge models
• fast and efficient – good for large-scale models

#### spacy

• nice API, easy(ish) to start using, fast + efficient
• out of the box, stuff just works – but not always super flexible

#### sklearn

• there are sklearn models designed for text – all use the sklearn API
• good starting point, and excellent for preprocessing text

#### text2vec (still in beta!)

• my favorite R package for NLP: fast and clean and lightweight (sklearny API)
• main goal is to provide an interface for training term embeddings (implements GloVe and word2vec in Rcpp::) – but also provides nice R6:: classes for key data structures like DTMs and co-occurrence matrices
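
To show what a term co-occurrence matrix holds, here is a small sketch in plain Python (not text2vec's R API). The `cooccurrence` helper, the window size, and the example sentence are assumptions made for illustration; real implementations typically use sparse matrices and often weight pairs by distance.

```python
from collections import Counter

def cooccurrence(tokens, window=2):
    """Count unordered pairs of terms that occur within `window` positions of each other."""
    counts = Counter()
    for i, term in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            counts[tuple(sorted((term, other)))] += 1
    return counts

tokens = "the cat sat on the mat".split()
tcm = cooccurrence(tokens, window=2)
```

Each key is an alphabetically sorted pair of terms, so `tcm[("sat", "the")]` is 2: "sat" falls within two positions of both occurrences of "the". Counts like these are the input that embedding models such as GloVe are trained on.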

#### quanteda

• a modern end-to-end framework for working with text data – has a nice interface and seems to perform decently, but hasn't gained a ton of traction in the R community

#### tidytext

• excellent for exploration + learning, esp. w/ dplyr:: + Robinson & Silge (2016)
• can be slow and scales poorly, so not a good choice for production-grade models

#### tm

• an older framework – I haven't used it much (anyone??)

```
## # A tibble: 5,000 x 5
##     Rank Word  PartOfSpeech Frequency Dispersion
##    <int> <chr> <chr>            <int>      <dbl>
##  1     1 the   a             22038615      0.980
##  2     2 be    v             12545825      0.970
##  3     3 and   c             10741073      0.990
##  4     4 of    i             10343885      0.970
##  5     5 a     a             10144200      0.980
##  6     6 in    i              6996437      0.980
##  7     7 to    t              6332195      0.980
##  8     8 have  v              4303955      0.970
##  9     9 to    i              3856916      0.990
## 10    10 it    p              3872477      0.960
## # ... with 4,990 more rows
```

## NLP project life cycle

there are often at least the following broad phases in an NLP project life cycle (details + steps vary wildly depending on task/context/goals):

1. data acquisition: define and obtain or construct your corpus (+ get labels for supervised tasks); set aside a subset (~20-40%) for evaluation
2. preprocessing: regularize text; remove stop words and other junk; transform corpus into a format you can compute on; calculate some summary statistics to be used downstream (e.g. number of unique words, total word count)
3. model development: define objective; select algorithm that generates predictions you can use to measure accuracy relative to objective; tune + select model using training data
4. model evaluation: using some appropriate metric (e.g. F1-score), measure model performance on data points not seen during training
5. model "deployment": use model to generate predictions on unseen documents in the future (retraining/updating on some schedule).
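
Phases 1 and 4 can be sketched with the standard library alone: hold out a fraction of the corpus, then score predictions on it with F1. The helper names, the 30% holdout, and the toy labels are assumptions for illustration; sklearn provides tested equivalents.

```python
import random

def train_test_split(items, test_fraction=0.3, seed=0):
    """Shuffle a copy of items and hold out the given fraction for evaluation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def f1_score(actual, predicted, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hold out 3 of 10 (hypothetical) document ids
train_ids, test_ids = train_test_split(range(10), test_fraction=0.3)

# Toy gold labels and model predictions for five held-out documents
actual    = ["pos", "pos", "neg", "neg", "pos"]
predicted = ["pos", "neg", "neg", "pos", "pos"]
score = f1_score(actual, predicted, positive="pos")
```

In the toy example there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and so is F1.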