Text Classification with Naive Bayes

Class 3 — July 16, 2020

What is Naive Bayes?

Solves the problem of text classification
- Given a document, classify it as one of $n$ classes
What are potential applications of text classification?
- Spam detection
- Sentiment analysis
- Email categorization (e.g. Gmail sorting emails into four tabs)
- Language identification on Google Translate
- Topic labeling for news articles

How does Naive Bayes work?

Pick your training corpus
- A list of documents with their labels (e.g. a list of emails and whether or not each email is spam)

Represent each document as a “bag of words”
- Downside: word order isn’t used
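
To make the bag-of-words idea concrete, here is a minimal Python sketch. The helper name `bag_of_words` and the lowercase-and-split-on-whitespace tokenization are assumptions made for illustration; real tokenizers are usually more careful about punctuation.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

print(a["the"])  # 2
print(a == b)    # True: both sentences map to the same bag
```

The two sentences mean different things but produce identical bags, which is exactly the downside noted above.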

Count how many times each word appears in the documents of each class
Work through the math
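
As a reference point, the standard multinomial Naive Bayes rule can be summarized as follows. This is a sketch in common textbook notation, which may differ from the exact form used in class; add-one (Laplace) smoothing is assumed for the word probabilities. Here $d$ is a document made of words $w_1, \dots, w_k$, $c$ is a class, $V$ is the vocabulary, $N_c$ is the number of training documents in class $c$, and $N_{doc}$ is the total number of training documents.

Bayes' rule, dropping the denominator because it is the same for every class:

$$\hat{c} = \operatorname*{argmax}_{c} P(c \mid d) = \operatorname*{argmax}_{c} \frac{P(d \mid c)\, P(c)}{P(d)} = \operatorname*{argmax}_{c} P(d \mid c)\, P(c)$$

The "naive" assumption that words are independent given the class:

$$P(d \mid c) = \prod_{i=1}^{k} P(w_i \mid c)$$

Estimates from the training counts:

$$\hat{P}(c) = \frac{N_c}{N_{doc}}, \qquad \hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}$$

In practice the product is computed in log space to avoid numerical underflow:

$$\hat{c} = \operatorname*{argmax}_{c} \left[ \log \hat{P}(c) + \sum_{i=1}^{k} \log \hat{P}(w_i \mid c) \right]$$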

In class, I went through a couple of derivations and a simplified example. If you’d like to review these, they’re very well illustrated in the reading below.
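
For anyone who prefers code to derivations, the steps above can be sketched end to end in Python. This is not the example from class: the four-document corpus, the `spam`/`ham` labels, and the function names `train` and `classify` are all made up for illustration. The sketch assumes whitespace tokenization, add-one (Laplace) smoothing, log probabilities, and that words never seen in training are skipped.

```python
import math
from collections import Counter, defaultdict

def train(corpus):
    """corpus: list of (text, label) pairs. Returns the counts the classifier needs."""
    doc_counts = Counter()               # number of documents per class
    word_counts = defaultdict(Counter)   # word_counts[label][word] = occurrences
    vocab = set()
    for text, label in corpus:
        words = text.lower().split()
        doc_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return doc_counts, word_counts, vocab

def classify(text, doc_counts, word_counts, vocab):
    """Return the class with the highest log prior plus summed log likelihoods."""
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs from giving log(0).
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy corpus, invented for this sketch.
corpus = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]
model = train(corpus)
print(classify("free money", *model))    # spam
print(classify("noon meeting", *model))  # ham
```

With a corpus this small the result is easy to check by hand: "free" and "money" appear only in the spam documents, so their smoothed probabilities are higher under `spam` than under `ham`.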

Additional Resources

https://web.stanford.edu/~jurafsky/slp3/4.pdf