Word Embeddings
Class 4 — July 21, 2020
What are word embeddings?
- A word embedding is a numerical vector that represents a given word
- These vectors are spatially related to each other, such that more similar words are closer together in the embedding space
    - e.g. The vector for “man” should be more similar to the vector for “boy” than the vector for “sky”
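
To make "more similar" concrete, here is a minimal sketch that scores closeness with cosine similarity. The 3-dimensional vectors are made up for illustration (real embeddings typically have hundreds of dimensions), and `cosine_similarity` and `toy_vectors` are hypothetical names, not part of any library.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-d vectors, invented for this example (not real GloVe values).
toy_vectors = {
    "man": [0.9, 0.1, 0.3],
    "boy": [0.8, 0.2, 0.4],
    "sky": [0.1, 0.9, 0.2],
}

sim_man_boy = cosine_similarity(toy_vectors["man"], toy_vectors["boy"])
sim_man_sky = cosine_similarity(toy_vectors["man"], toy_vectors["sky"])

# With these toy values, "man" scores higher against "boy" than against "sky".
print(f"man~boy: {sim_man_boy:.3f}, man~sky: {sim_man_sky:.3f}")
```

With real pretrained vectors the same function applies unchanged; only the lookup table would differ.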
 
- Why are these useful?
    - Search
        - If I’m searching for “pandemic”, word embeddings could be used to surface documents that contain very similar words, e.g. “epidemic”, “virus”, etc.
        - A simple string search wouldn’t be able to catch similar words on its own
 
    - Simple text classification
        - For sentiment analysis, you could simply calculate whether the words in a document are more similar to “good” than to “bad”, or something similar
 
    - Be careful: these use cases are susceptible to bias (as are virtually all NLP models); see the Bias section below for details
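
Both use cases above can be sketched in a few lines. This is a rough illustration under stated assumptions, not a production approach: the vectors are invented toy values, and `most_similar` and `sentiment_score` are hypothetical helpers written for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-d vectors, invented for this example (not real GloVe values).
toy_vectors = {
    "pandemic": [0.9, -0.2, 0.1],
    "epidemic": [0.85, -0.25, 0.15],
    "virus": [0.8, -0.1, 0.3],
    "picnic": [0.0, 0.5, 0.8],
    "good": [0.0, 0.9, 0.2],
    "bad": [0.1, -0.9, 0.2],
    "great": [0.05, 0.8, 0.3],
    "awful": [0.1, -0.8, 0.3],
    "movie": [0.0, 0.0, 1.0],
}

def most_similar(word, vectors, top_n=2):
    """Return the top_n other words ranked by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (cosine_similarity(query, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:top_n]]

def sentiment_score(tokens, vectors):
    """Average of sim(token, "good") - sim(token, "bad") over known tokens."""
    known = [t for t in tokens if t in vectors]
    if not known:
        return 0.0  # no known words, so no evidence either way
    diffs = [
        cosine_similarity(vectors[t], vectors["good"])
        - cosine_similarity(vectors[t], vectors["bad"])
        for t in known
    ]
    return sum(diffs) / len(diffs)

# Search: expand the query with its nearest neighbours before matching documents.
expanded_query = ["pandemic"] + most_similar("pandemic", toy_vectors)
print(expanded_query)

# Sentiment: a positive score leans towards "good", a negative one towards "bad".
print(sentiment_score(["great", "movie"], toy_vectors))
print(sentiment_score(["awful", "movie"], toy_vectors))
```

Because the score depends entirely on where words sit relative to “good” and “bad”, any bias in the embeddings carries straight through to the classification.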
 
Analogies
- Because word embeddings are spatially related, we can use them to solve simple analogy problems
- Below are GloVe’s top three predictions for the following problems, along with their similarity scores (higher means a closer match)
- France : Paris :: England : ?
    - London (0.646), Manchester (0.510), Birmingham (0.486)
 
- man : woman :: king : ?
    - queen (0.690), monarch (0.558), throne (0.557)
 
- tall : taller :: warm : ?
    - warmer (0.650), warmed (0.569), cooler (0.554)
 
- author : book :: artist : ?
    - artwork (0.642), painting (0.605), art (0.582)
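
One common way to solve `a : b :: c : ?` is vector arithmetic: compute `b - a + c` and rank the remaining words by cosine similarity to that point. The sketch below assumes this approach; the notes do not say exactly how the scores above were produced. The 2-d vectors are invented so the arithmetic is easy to follow, and `solve_analogy` is a hypothetical helper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-d vectors: first axis is roughly "royalty", second "gender".
# These values are invented for illustration and are not real GloVe vectors.
toy_vectors = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "monarch": [1.0, 0.0],
    "apple": [-1.0, 0.1],
}

def solve_analogy(a, b, c, vectors, top_n=3):
    """Rank candidates for `a : b :: c : ?` by similarity to (b - a + c)."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    ]
    scored = [
        (cosine_similarity(target, vec), word)
        for word, vec in vectors.items()
        if word not in (a, b, c)  # the input words are excluded from the answers
    ]
    scored.sort(reverse=True)
    return [(word, round(score, 3)) for score, word in scored[:top_n]]

answers = solve_analogy("man", "woman", "king", toy_vectors)
print(answers)
```

Excluding the three input words matters in practice: without that filter, the nearest vector to `b - a + c` is often `c` itself.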
 
Bias
- Many widely used word embeddings are trained largely on Wikipedia text
- 84% of Wikipedia’s writers are male, so the articles (and the embeddings trained on them) exhibit biases
- In all prominent word embeddings, distances are skewed based on stereotypes
    - “black” is closer to “good” than “white”
    - “female” is closer to “irrational” than “male”
    - etc.
 
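- Below is a table from some older work I did with GloVe embeddings
    - Distances are Euclidean (you should generally use cosine distances)
    - Smaller distances mean closer words. Every row is biased against “islam”: the positive words (attractive, good, nice, trust) are closer to “christianity”, while the negative words (attack, terrible, terrorism, undermine, violence) are closer to “islam”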
 
| word | distance to “christianity” | distance to “islam” | 
|---|---|---|
| attractive | 7.83 | 8.10 | 
| good | 6.64 | 6.98 | 
| nice | 7.88 | 8.13 | 
| trust | 7.43 | 7.69 | 
| attack | 7.68 | 6.86 | 
| terrible | 7.61 | 7.41 | 
| terrorism | 8.46 | 7.42 | 
| undermine | 7.47 | 7.09 | 
| violence | 8.01 | 7.48 |
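
Since the table uses Euclidean distance while the notes recommend cosine distance, here is a small sketch of how the two differ. Euclidean distance depends on vector length as well as direction, whereas cosine distance depends on direction only. The vectors are arbitrary toy values, and both function names are hypothetical.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Arbitrary toy vector and a copy scaled by 3 (same direction, longer length).
v = [1.0, 2.0, 2.0]
scaled = [3.0 * x for x in v]

# Euclidean distance treats the two as far apart; cosine distance does not.
print("euclidean:", euclidean_distance(v, scaled))
print("cosine:   ", cosine_distance(v, scaled))
```

Rerunning a comparison like the table above with cosine distance would show whether the gaps come from direction or are partly an artifact of vector length.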