Class 4 — July 21, 2020
What are word embeddings?
- A word embedding is a numerical vector that represents a given word
- These vectors are spatially related to each other, such that more similar words are closer together in the embedding space
- e.g. The vector for “man” should be more similar to the vector for “boy” than to the vector for “sky”
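To make "more similar" concrete, here is a minimal sketch using cosine similarity. The three-dimensional vectors are invented for illustration (they are not real GloVe values), and `toy_vectors` and `cosine_similarity` are hypothetical names chosen for this example.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Real pretrained embeddings typically have tens to hundreds of dimensions.
toy_vectors = {
    "man": [0.9, 0.8, 0.1],
    "boy": [0.8, 0.9, 0.2],
    "sky": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


man_boy = cosine_similarity(toy_vectors["man"], toy_vectors["boy"])
man_sky = cosine_similarity(toy_vectors["man"], toy_vectors["sky"])
print(f"man vs boy: {man_boy:.3f}")
print(f"man vs sky: {man_sky:.3f}")
```

With these toy values, "man" scores higher against "boy" than against "sky", which is the relationship a good embedding should show.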
- Why are these useful?
- If I’m searching for “pandemic”, word embeddings could be used to surface documents that contain very similar words, e.g. “epidemic”, “virus”, etc.
- A simple string search wouldn’t be able to catch similar words on its own
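A rough sketch of the search idea: expand the query with its nearest neighbours in the embedding space, then match documents against the expanded term set. The vectors, documents, and the helper names `most_similar` and `expanded_search` are all made up for this example; a real system would load pretrained vectors and use proper tokenization.

```python
import math

# Hypothetical toy vectors; real ones would come from a pretrained model.
toy_vectors = {
    "pandemic": [0.9, 0.8, 0.1],
    "epidemic": [0.85, 0.82, 0.15],
    "virus": [0.8, 0.6, 0.2],
    "guitar": [0.1, 0.2, 0.9],
    "sky": [0.2, 0.1, 0.8],
}


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, top_n=2):
    """Return the top_n (word, score) pairs closest to `word`, best first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def expanded_search(query, documents, vectors, top_n=2):
    """Return documents containing the query or one of its nearest neighbours."""
    terms = {query} | {w for w, _ in most_similar(query, vectors, top_n)}
    return [doc for doc in documents if terms & set(doc.lower().split())]


documents = [
    "The epidemic spread quickly",
    "A new virus was identified",
    "She plays guitar under the open sky",
]
hits = expanded_search("pandemic", documents, toy_vectors)
print(hits)
```

None of the documents contains the literal string "pandemic", so a plain string search would return nothing here, while the expanded search still surfaces the first two.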
- Simple text classification
- For sentiment analysis, you could simply check whether the words in a document are more similar to “good” than to “bad”, or use a comparable heuristic
- Be careful: these use cases are susceptible to bias (as are virtually all NLP models); read further for more details
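The "good" vs. "bad" heuristic could look something like the sketch below. The two-dimensional vectors are invented, `sentiment_score` is a hypothetical helper, and averaging the per-word difference is one assumption among several reasonable ways to combine word scores. Any bias in the underlying vectors carries straight through to a score like this.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
toy_vectors = {
    "good": [0.9, 0.1],
    "bad": [0.1, 0.9],
    "great": [0.8, 0.2],
    "lovely": [0.85, 0.3],
    "awful": [0.2, 0.8],
    "boring": [0.3, 0.7],
    "movie": [0.5, 0.5],
}


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sentiment_score(text, vectors):
    """Mean of (similarity to 'good' minus similarity to 'bad') over known words.

    Positive means the text leans towards "good", negative towards "bad".
    Words without a vector are skipped; a text with no known words scores 0.0.
    """
    words = [w for w in text.lower().split() if w in vectors]
    if not words:
        return 0.0
    diffs = [
        cosine_similarity(vectors[w], vectors["good"])
        - cosine_similarity(vectors[w], vectors["bad"])
        for w in words
    ]
    return sum(diffs) / len(diffs)


positive = sentiment_score("great lovely movie", toy_vectors)
negative = sentiment_score("awful boring movie", toy_vectors)
print(f"positive example: {positive:+.3f}")
print(f"negative example: {negative:+.3f}")
```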
- Because word embeddings are spatially related, we can use them to solve simple analogy problems
- Below are GloVe’s predictions for the following problems, along with the associated similarity scores (higher means closer)
- France : Paris :: England : ?
- London (0.646), Manchester (0.510), Birmingham (0.486)
- man : woman :: king : ?
- queen (0.690), monarch (0.558), throne (0.557)
- tall : taller :: warm : ?
- warmer (0.650), warmed (0.569), cooler (0.554)
- author : book :: artist : ?
- artwork (0.642), painting (0.605), art (0.582)
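The notes don't say how the analogy scores above were computed. A common approach, sketched here as an assumption, is the vector-offset method: for a : b :: c : ?, compute b - a + c and rank the remaining words by cosine similarity to that point. The vectors below are invented toy values and `solve_analogy` is a hypothetical helper name.

```python
import math

# Toy vectors with hand-picked dimensions (royalty, female, person).
# Invented for illustration; real embedding dimensions are not interpretable like this.
toy_vectors = {
    "man": [0.0, 0.0, 1.0],
    "woman": [0.0, 1.0, 1.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.9, 1.0, 1.0],
    "monarch": [1.0, 0.5, 1.0],
    "throne": [1.0, 0.0, 0.2],
}


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, vectors, top_n=3):
    """Rank candidates for a : b :: c : ? by similarity to (b - a + c)."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    ]
    candidates = [
        (word, cosine_similarity(target, vec))
        for word, vec in vectors.items()
        if word not in (a, b, c)  # the input words are excluded from the answers
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_n]


for word, score in solve_analogy("man", "woman", "king", toy_vectors):
    print(f"{word} ({score:.3f})")
```

Excluding the three input words matters: without it, the nearest neighbour of b - a + c is often one of the inputs themselves.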
- Many widely used word embeddings are trained largely on Wikipedia text
- 84% of Wikipedia’s writers are male, so the articles exhibit biases
- In all prominent word embeddings, distances are skewed based on stereotypes
- “black” is closer to “good” than “white” is
- “female” is closer to “irrational” than “male” is
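One simple way to check a comparison like these is to measure how much closer one word sits to an attribute word than another word does. The sketch below uses neutral placeholder tokens and invented vectors so that it does not imply any real measurement; `association_gap` is a hypothetical helper, and to test actual claims you would substitute vectors loaded from a pretrained model.

```python
import math

# Placeholder tokens with invented vectors; these are NOT real embedding values.
toy_vectors = {
    "group_a": [0.7, 0.3],
    "group_b": [0.4, 0.6],
    "attribute": [0.9, 0.1],
}


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association_gap(word_a, word_b, attribute, vectors):
    """Similarity of word_a to attribute minus similarity of word_b to attribute.

    A positive value means word_a is closer to the attribute than word_b is;
    a value near zero means the two words are about equally close.
    """
    attr = vectors[attribute]
    return (
        cosine_similarity(vectors[word_a], attr)
        - cosine_similarity(vectors[word_b], attr)
    )


gap = association_gap("group_a", "group_b", "attribute", toy_vectors)
print(f"association gap: {gap:+.3f}")
```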
- Below is a table from some old work I did with GloVe embeddings
- Distances are Euclidean (you should generally use cosine distances)
- All of these comparisons exhibit negative biases
|| ||distance to “christianity”||distance to “islam”||