Language Modeling with N-grams
Class 2 — July 14, 2020
What are N-grams?
- N-length sequences of words
- Solve the problem of language modeling/next word prediction
- Given a sequence of n words, what word comes next?
- What are potential applications of this problem?
- Speech recognition
- When stuck deciding what someone said, pick the most probable option
- Downside: infrequent words are very difficult to pick up! For example, try asking Siri about “the church’s nave” (credit to Roger Levy)
- Predictive text on smartphone keyboards, in Gmail, etc.
How do N-grams work?
1. Pick your training corpus
2. Generate a list of all of the N-word-long sequences in your corpus
3. Count all of the times that $w_N$ follows $(w_1, …, w_{N-1})$
4. Turn your counts from step 3 into a probability distribution: $P(w_N \mid w_1, …, w_{N-1})$
5. Given a prompt, randomly sample from your probability distribution to pick the next word
6. Repeat step 5 until you’ve reached your desired length, or until you’ve reached a desired token, e.g. a period (see the sketch below)
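As a rough illustration of these steps (not the code from the exercise notebooks), here is a minimal Python sketch of a bigram model ($N=2$); the function names and the whitespace tokenization are assumptions made for the example.

```python
import random
from collections import defaultdict

def build_bigram_counts(tokens):
    """Steps 2-3: list every bigram and count how often each word follows the previous one."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_distribution(counts, prev):
    """Step 4: turn the counts for `prev` into a probability distribution P(w_N | w_{N-1})."""
    following = counts.get(prev, {})
    total = sum(following.values())
    return {word: c / total for word, c in following.items()}

def generate(counts, prompt, max_len=20, stop_token="."):
    """Steps 5-6: repeatedly sample the next word until a stop token or the length limit."""
    words = prompt.split()
    while len(words) < max_len and words[-1] != stop_token:
        dist = next_word_distribution(counts, words[-1])
        if not dist:  # last word never seen mid-corpus: nothing to sample
            break
        choices, probs = zip(*dist.items())
        words.append(random.choices(choices, weights=probs)[0])
    return " ".join(words)
```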
Simplified example of above
- (Very small) corpus: “I think I can do it.”
- Set $N=2$ (bigram model) → (I think), (think I), (I can), (can do), (do it), (it .)
- Set prompt = “I” → counts of words following “I”: think: 1, can: 1
- $P(\text{think} \mid \text{I}) = 0.5$, $P(\text{can} \mid \text{I}) = 0.5$
- Randomly select “can” → “I can”
- Repeat until end of string → “I can do it.”
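Running the sketch above on this tiny corpus reproduces the numbers; splitting the final period off as its own token is an assumption made so that “.” can serve as the end-of-string token.

```python
# Tiny corpus from the example; the period is treated as its own token.
tokens = "I think I can do it .".split()
counts = build_bigram_counts(tokens)

print(next_word_distribution(counts, "I"))
# {'think': 0.5, 'can': 0.5}

print(generate(counts, "I"))
# e.g. "I can do it ." or "I think I can do it ." (output varies because sampling is random)
```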
Discussion questions
- What’s the probability of a sentence in a bigram model?
- $P(w_1, …, w_N) = P(w_2 \mid w_1)\,P(w_3 \mid w_2) \cdots P(w_N \mid w_{N-1})$, treating the first word as given
- $P(\text{I can do it.}) = P(\text{can} \mid \text{I})\,P(\text{do} \mid \text{can})\,P(\text{it} \mid \text{do})\,P(\text{.} \mid \text{it}) = 0.5 \times 1 \times 1 \times 1 = 0.5$ (see the sketch after these questions)
- What would a unigram model be?
- What do you expect will happen for larger or smaller values of N?
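For the first question, a small helper (again an illustrative sketch, not the notebooks’ code) multiplies the bigram probabilities together, treating the first word as given:

```python
def sentence_probability(counts, sentence):
    """Probability of a sentence under the bigram model, conditioning on the first word."""
    words = sentence.split()
    prob = 1.0
    for prev, nxt in zip(words, words[1:]):
        dist = next_word_distribution(counts, prev)
        prob *= dist.get(nxt, 0.0)  # an unseen bigram makes the whole sentence probability 0
    return prob

print(sentence_probability(counts, "I can do it ."))  # 0.5
```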
Exercise
- Exercise notebook (download .ipynb)
- Solution notebook (download .ipynb)
- Solution notebook with NLTK (download .ipynb)
Pros of N-gram models
- Very easy to build and quick to execute
- Highly explainable
Cons of N-gram models
- Models with small values of N are very forgetful
- The end of a generated sentence often fails to follow from its beginning
- Sentences don’t build on each other
- Models with large values of N are too similar to source text
- Most 7-grams are basically unique (credit to Roger Levy)