# Language Modeling with N-grams

Class 2 — July 14, 2020

## What are N-grams?

• N-length sequences of words
• Solve the problem of language modeling/next word prediction
• Given a sequence of n words, what word comes next?
• What are potential applications of this problem?
• Speech recognition
• When stuck deciding what someone said, pick the most probable option
• Downside: infrequent words are very difficult to pick up! For example, try asking Siri about “the church’s nave” (credit to Roger Levy)
• Predictive text on smartphone keyboards, in Gmail, etc.

## How do N-grams work?

1. Choose a value of N and a training corpus
2. Generate a list of all of the N-word-long sequences in your corpus
3. Count all of the times that $w_N$ follows $(w_1, …, w_{N-1})$
4. Turn your counts from step 3 into a probability distribution: $P(w_N \mid w_1, …, w_{N-1})$
5. Given a prompt, randomly sample your probability distribution to pick the next word
6. Repeat step 5 until you’ve reached your desired length, or until you’ve reached a desired token, e.g. a period
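The steps above can be sketched in a few lines of Python (a minimal illustration; the function names `build_ngram_model` and `generate` are my own, not standard library calls):

```python
import random
from collections import Counter, defaultdict

def build_ngram_model(tokens, n=2):
    """Steps 2-4: count each word that follows each (n-1)-word
    context, then normalize the counts into probabilities."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context, nxt = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
        counts[context][nxt] += 1
    return {ctx: {w: c / sum(ctr.values()) for w, c in ctr.items()}
            for ctx, ctr in counts.items()}

def generate(model, prompt, n=2, max_len=20, stop="."):
    """Steps 5-6: repeatedly sample the next word until a stop
    token or the length limit is reached."""
    out = list(prompt)
    while len(out) < max_len and out[-1] != stop:
        dist = model.get(tuple(out[-(n - 1):]))
        if dist is None:  # unseen context: nothing to sample from
            break
        words, probs = zip(*dist.items())
        out.append(random.choices(words, weights=probs)[0])
    return out
```

Note that `generate` simply gives up on an unseen context; real implementations use smoothing or backoff instead.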

## Simplified example of above

1. (Very small) corpus: “I think I can do it.”
2. Set $N=2$ (bigram model) → (I think), (think I), (I can), (can do), (do it), (it .)
3. Set prompt = “I” → think: 1, can: 1
4. $P(\text{think} \mid \text{I}) = 0.5$, $P(\text{can} \mid \text{I}) = 0.5$
5. Randomly select “can” → “I can”
6. Repeat until end of string → “I can do it.”
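The counts in this walkthrough can be verified directly (treating the period as its own token, as in step 2 above):

```python
from collections import Counter

tokens = "I think I can do it .".split()   # the toy corpus
bigrams = list(zip(tokens, tokens[1:]))    # step 2: all adjacent pairs

# Step 3: count what follows "I", then normalize (step 4)
follows_I = Counter(b for a, b in bigrams if a == "I")
probs = {w: c / sum(follows_I.values()) for w, c in follows_I.items()}
# probs == {"think": 0.5, "can": 0.5}
```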

## Discussion questions

1. What’s the probability of a sentence in a bigram model?
• $P(w_1, …, w_N) = P(w_2 \mid w_1) P(w_3 \mid w_2) \cdots P(w_N \mid w_{N-1})$, conditioning on the first word $w_1$
• $P(\text{I can do it.}) = P(\text{can} \mid \text{I})P(\text{do} \mid \text{can})P(\text{it} \mid \text{do})P(\text{.} \mid \text{it}) = 0.5 \cdot 1 \cdot 1 \cdot 1 = 0.5$
2. What would a unigram model be?
3. What do you expect will happen for larger or smaller values of N?
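The answer to question 1 can be checked with a short function (a sketch; `sentence_prob` is a name I made up) that multiplies the bigram probabilities estimated from the toy corpus:

```python
from collections import Counter

def sentence_prob(tokens, sentence):
    """Product of bigram probabilities P(next | prev), with each
    probability estimated by count(prev, next) / count(prev)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Every token except the last serves as a context at least once
    context_counts = Counter(tokens[:-1])
    p = 1.0
    for prev, nxt in zip(sentence, sentence[1:]):
        p *= pair_counts[(prev, nxt)] / context_counts[prev]
    return p

corpus = "I think I can do it .".split()
sentence_prob(corpus, "I can do it .".split())  # 0.5, as computed above
```

As written this crashes on a context absent from the corpus; handling that case is exactly what smoothing techniques address.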

## Pros of N-gram models

• Very easy to build and quick to execute
• Highly explainable

## Cons of N-gram models

• Models with small values of N are very forgetful
• End of sentence often won’t complete the beginning
• Sentences don’t build on each other
• Models with large values of N are too similar to source text
• Most 7-grams are basically unique (credit to Roger Levy)