Language Modeling with N-grams

Class 2 — July 14, 2020

What are N-grams?

How do N-grams work?

  1. Pick your training corpus
  2. Generate a list of all of the N-word-long sequences in your corpus
  3. Count all of the times that $w_N$ follows $(w_1, …, w_{N-1})$
  4. Turn your counts from step 3 into a probability distribution: $P(w_N \mid w_1, …, w_{N-1})$
  5. Given a prompt, randomly sample your probability distribution to pick the next word
  6. Repeat step 5 until you’ve reached your desired length, or until you’ve reached a desired token, e.g. a period
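The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: it assumes whitespace tokenization (with the period as its own token), and the function and variable names here are just illustrative choices.

```python
import random
from collections import Counter, defaultdict

def train_ngram(tokens, n=2):
    """Steps 2-3: count how often each word follows each (n-1)-word context."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def next_word(counts, context):
    """Steps 4-5: sample from P(w_N | context), proportional to the counts."""
    dist = counts[tuple(context)]
    words = list(dist)
    weights = [dist[w] for w in words]
    return random.choices(words, weights=weights)[0]

def generate(counts, prompt, n=2, stop="."):
    """Step 6: keep sampling until the stop token (here, a period) appears."""
    out = list(prompt)
    while out[-1] != stop:
        out.append(next_word(counts, out[-(n - 1):]))
    return out
```

For example, `generate(train_ngram("I think I can do it .".split()), ["I"])` produces a sentence starting with "I" and ending with ".". Note that an unseen context would raise an error here; real implementations handle that with smoothing or backoff.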

Simplified example of the above

  1. (Very small) corpus: “I think I can do it.”
  2. Set $N=2$ (bigram model) → (I think), (think I), (I can), (can do), (do it), (it .)
  3. Set prompt = “I” → think: 1, can: 1
  4. $P(\text{think} \mid \text{I}) = 0.5$, $P(\text{can} \mid \text{I}) = 0.5$
  5. Randomly select “can” → “I can”
  6. Repeat until end of string → “I can do it.”
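The counting-and-normalizing in steps 2-4 of this example can be checked directly; a small sketch, again assuming the period is split off as its own token:

```python
from collections import Counter

tokens = "I think I can do it .".split()   # step 1: the tiny corpus
bigrams = list(zip(tokens, tokens[1:]))    # step 2: (I think), (think I), ...
print(Counter(bigrams))                    # each bigram occurs once

# Steps 3-4: count the words that follow "I", then normalize to probabilities
after_i = Counter(w2 for (w1, w2) in bigrams if w1 == "I")
total = sum(after_i.values())
probs = {w: c / total for w, c in after_i.items()}
print(probs)  # {'think': 0.5, 'can': 0.5}
```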

Discussion questions

  1. What’s the probability of a sentence in a bigram model?
    • By the chain rule, $P(w_1, …, w_n) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})$; conditioning on the prompt drops the $P(w_1)$ factor
    • $P(\text{can do it .} \mid \text{I}) = P(\text{can} \mid \text{I})\,P(\text{do} \mid \text{can})\,P(\text{it} \mid \text{do})\,P(\text{.} \mid \text{it}) = 0.5 \times 1 \times 1 \times 1 = 0.5$
  2. What would a unigram model be?
  3. What do you expect will happen for larger or smaller values of N?
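The sentence probability from question 1 can also be verified numerically. A minimal sketch using the tiny corpus above, where each conditional probability is a bigram count divided by the count of its context word:

```python
from collections import Counter

tokens = "I think I can do it .".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])  # every token except the last is a context

def p(w2, w1):
    """P(w2 | w1) estimated from the corpus counts."""
    return bigram_counts[(w1, w2)] / context_counts[w1]

sentence = "I can do it .".split()
prob = 1.0
for w1, w2 in zip(sentence, sentence[1:]):
    prob *= p(w2, w1)
print(prob)  # 0.5, matching the hand calculation
```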


Pros of N-gram models

Cons of N-gram models

Additional Resources