Language Modeling with N-grams
Class 2 — July 14, 2020
What are N-grams?
- N-length sequences of words
- Solve the problem of language modeling/next word prediction
- Given a sequence of n words, what word comes next?
- What are potential applications of this problem?
- Speech recognition
- When stuck deciding what someone said, pick the most probable option
- Downside: infrequent words are very difficult to pick up! For example, try asking Siri about “the church’s nave” (credit to Roger Levy)
- Predictive text on smartphone keyboards, in Gmail, etc.
How do N-grams work?
1. Pick your training corpus
2. Generate a list of all of the N-word-long sequences in your corpus
3. Count all of the times that $w_N$ follows $(w_1, …, w_{N-1})$
4. Turn your counts from step 3 into a probability distribution: $P(w_N \mid w_1, …, w_{N-1})$
5. Given a prompt, randomly sample from your probability distribution to pick the next word
6. Repeat step 5 until you’ve reached your desired length, or until you’ve reached a desired token, e.g. a period (see the sketch below)
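As a rough illustration of these steps (not the code from the exercise notebooks), here is a minimal Python sketch of a bigram model ($N=2$); the function names and the whitespace tokenization are assumptions made for the example.

```python
import random
from collections import defaultdict

def build_bigram_counts(tokens):
    """Steps 2-3: list every bigram and count how often each word follows the previous one."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_distribution(counts, prev):
    """Step 4: turn the counts for `prev` into a probability distribution P(w_N | w_{N-1})."""
    following = counts.get(prev, {})
    total = sum(following.values())
    return {word: c / total for word, c in following.items()}

def generate(counts, prompt, max_len=20, stop_token="."):
    """Steps 5-6: repeatedly sample the next word until a stop token or the length limit."""
    words = prompt.split()
    while len(words) < max_len and words[-1] != stop_token:
        dist = next_word_distribution(counts, words[-1])
        if not dist:  # last word never seen mid-corpus: nothing to sample
            break
        choices, probs = zip(*dist.items())
        words.append(random.choices(choices, weights=probs)[0])
    return " ".join(words)
```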
Simplified example of above
- (Very small) corpus: “I think I can do it.”
- Set $N=2$ (bigram model) → (I think), (think I), (I can), (can do), (do it), (it .)
- Set prompt = “I” → counts of words following “I”: think: 1, can: 1
- $P(\text{think} \mid \text{I}) = 0.5$, $P(\text{can} \mid \text{I}) = 0.5$
- Randomly select “can” → “I can”
- Repeat until end of string → “I can do it.”
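Running the sketch above on this tiny corpus reproduces the numbers; splitting the final period off as its own token is an assumption made so that “.” can serve as the end-of-string token.

```python
# Tiny corpus from the example; the period is treated as its own token.
tokens = "I think I can do it .".split()
counts = build_bigram_counts(tokens)

print(next_word_distribution(counts, "I"))
# {'think': 0.5, 'can': 0.5}

print(generate(counts, "I"))
# e.g. "I can do it ." or "I think I can do it ." (output varies because sampling is random)
```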
Discussion questions
- What’s the probability of a sentence in a bigram model?
- $P(w_1, …, w_N) = P(w_2 \mid w_1)\,P(w_3 \mid w_2) \cdots P(w_N \mid w_{N-1})$, treating the first word as given
- $P(\text{I can do it.}) = P(\text{can} \mid \text{I})\,P(\text{do} \mid \text{can})\,P(\text{it} \mid \text{do})\,P(\text{.} \mid \text{it}) = 0.5 \times 1 \times 1 \times 1 = 0.5$ (see the sketch after these questions)
- What would a unigram model be?
- What do you expect will happen for larger or smaller values of N?
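For the first question, a small helper (again an illustrative sketch, not the notebooks’ code) multiplies the bigram probabilities together, treating the first word as given:

```python
def sentence_probability(counts, sentence):
    """Probability of a sentence under the bigram model, conditioning on the first word."""
    words = sentence.split()
    prob = 1.0
    for prev, nxt in zip(words, words[1:]):
        dist = next_word_distribution(counts, prev)
        prob *= dist.get(nxt, 0.0)  # an unseen bigram makes the whole sentence probability 0
    return prob

print(sentence_probability(counts, "I can do it ."))  # 0.5
```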
Exercise
- Exercise notebook (download .ipynb)
- Solution notebook (download .ipynb)
- Solution notebook with NLTK (download .ipynb)
Pros of N-gram models
- Very easy to build and quick to execute
- Highly explainable
Cons of N-gram models
- Models with small values of N are very forgetful
- The end of a generated sentence often fails to follow from its beginning
- Sentences don’t build on each other
- Models with large values of N are too similar to source text
- Most 7-grams are basically unique (credit to Roger Levy)