NLP Part 2 - Word Representations
2.1 The Representation Problem
2.2 Occurrence-Based Methods
All occurrence-based methods share some fundamental drawbacks:
- They depend on the size of the vocabulary |V|, which can lead to very high-dimensional representations (and we don’t want that).
- They treat words as atomic units, so they cannot capture how the meaning of one word relates to another.
- If we build an NLP model, we would like it to work on most real-life problems (not only on our experiments). If a text contains a word that is not present in our training corpus (a so-called out-of-vocabulary word, or OOV word), we cannot represent it.
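To make the first and third drawbacks concrete, here is a minimal bag-of-words sketch in plain Python. The helper names (build_vocabulary, bag_of_words) and the toy sentences are assumptions made for illustration; they do not come from any library.

```python
def build_vocabulary(documents):
    """Collect the sorted set of lowercase words seen in the training documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def bag_of_words(document, vocabulary):
    """Count vocabulary words in `document`; words outside the vocabulary are returned separately."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)  # one slot per vocabulary word
    unknown = []                    # OOV words have no slot at all
    for word in document.lower().split():
        if word in index:
            vector[index[word]] += 1
        else:
            unknown.append(word)
    return vector, unknown


# Toy training corpus (hypothetical).
training_docs = ["I like deep learning", "I like NLP", "I enjoy flying"]
vocabulary = build_vocabulary(training_docs)

vector, unknown = bag_of_words("I like transformers", vocabulary)
print(vocabulary)
print(vector)
print(unknown)
```

The vector has one slot per vocabulary word, so its length grows with |V|, and the unseen word "transformers" simply has nowhere to go.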
So why did I talk about these methods if they have so many problems? First, because they are very simple techniques that are still adequate for very simple problems. Second, they can still be useful for tasks such as finding similar documents. But now, let’s go deeper into this section.
Let’s see how to build a TF-IDF text representation in Python using the popular scikit-learn library.
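A minimal sketch of TF-IDF with scikit-learn's TfidfVectorizer, assuming a toy three-sentence corpus chosen only for illustration. With the default settings the vectorizer lowercases the text and drops one-character tokens, so "I" does not end up in the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (an assumption for illustration).
corpus = [
    "I like Deep Learning",
    "I like NLP",
    "I enjoy flying",
]

# Default settings: lowercase, word tokens of 2+ characters,
# smoothed IDF, and L2-normalized rows.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)  # sparse matrix, one row per document

# Recover the column order from the term -> column index mapping.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
weights = tfidf.toarray()

for doc, row in zip(corpus, weights):
    scored = {t: round(float(w), 3) for t, w in zip(terms, row) if w > 0}
    print(doc, "->", scored)
```

In the second document, "nlp" receives a higher weight than "like", because "like" also appears in another document and is therefore less distinctive.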
2.3 Word Embedding
John Rupert Firth said:
“You shall know a word by the company it keeps.”
This is the main intuition behind word embedding. Let’s take the word tea. The word tea is usually accompanied by the words drank, pot, kettle, hot, steam, etc. So, wouldn’t it make sense for words similar to tea (like coffee, or any other morning buster) to have similar distributions of surrounding words?
With word embedding, we are able to map each word in an input vocabulary to a dense, relatively low-dimensional vector (typically 200–300 dimensions, far fewer than |V|) that encodes both syntactic and semantic word similarities.
Note that the surrounding context of a word is likely to convey its semantic meaning. The hypothesis that the meaning of a word can be derived from the distribution of contexts in which it appears is also called the distributional hypothesis.
How do we represent this kind of encoding? Let’s start with the simplest option: the co-occurrence matrix.
Start with a vocabulary V. Now let’s create a matrix of size |V|x|V| filled with zeros. Next, define a window size (the number of words to the left and right of a word that are taken into account). For each word w, count how many times every other word w’ appears inside the window around w. Then normalize each row by its sum. This is a very simple window-based co-occurrence matrix, and the representation of each word is simply one row of this matrix. Usually, larger windows tend to encode more semantic (and at extremes, topic-like) properties, while shorter windows seem to encode syntactic properties.
Let’s say that we have this corpus: “I like Deep Learning”, “I like NLP”, “I enjoy flying”. With a window size of one, the co-occurrence matrix is:
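The counting rule described above is enough to reproduce the matrix, so here is a minimal sketch in plain Python. The function names are assumptions for illustration; tokens are split on whitespace and kept in their original case.

```python
def cooccurrence_matrix(sentences, window=1):
    """Count, for each word, how often every other word falls inside its window."""
    tokenized = [sentence.split() for sentence in sentences]
    vocab = sorted({word for sent in tokenized for word in sent})
    index = {word: i for i, word in enumerate(vocab)}
    counts = [[0] * len(vocab) for _ in vocab]  # |V| x |V| matrix of zeros
    for sent in tokenized:
        for pos, word in enumerate(sent):
            start = max(0, pos - window)
            end = min(len(sent), pos + window + 1)
            for ctx_pos in range(start, end):
                if ctx_pos != pos:
                    counts[index[word]][index[sent[ctx_pos]]] += 1
    return vocab, counts


def normalize_rows(counts):
    """Divide each row by its sum (rows that sum to zero are left as zeros)."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


corpus = ["I like Deep Learning", "I like NLP", "I enjoy flying"]
vocab, counts = cooccurrence_matrix(corpus, window=1)
normalized = normalize_rows(counts)

print(vocab)
for word, row in zip(vocab, counts):
    print(f"{word:>8}", row)
```

For example, the row for "I" has a count of 2 for "like" and 1 for "enjoy"; after normalization these become 2/3 and 1/3.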
Using explicit word counts in |V|-sized vectors is not the best idea, though, for the reasons explained above (the curse of dimensionality and so on). Another issue is that raw counts over-emphasize very common words such as “the”.
A better technique is Word2Vec. The key idea is to train a neural network either to predict the surrounding words of each word in the dictionary (the skip-gram model) or to predict a target word given its surrounding words (the Continuous Bag of Words model, or CBOW).
Given a large corpus of documents and a dictionary of the words occurring in them, the approach is:
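The difference between the two variants is easiest to see in the training examples each one produces. Here is a minimal sketch, assuming a window size of one and whitespace tokenization; the function names are hypothetical and not part of any Word2Vec library.

```python
def context_window(tokens, position, window):
    """Return the words within `window` positions of `position`, excluding it."""
    start = max(0, position - window)
    end = min(len(tokens), position + window + 1)
    return [tokens[j] for j in range(start, end) if j != position]


def skipgram_pairs(tokens, window=1):
    """Skip-gram: one (input word, word to predict) pair per context word."""
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]


def cbow_examples(tokens, window=1):
    """CBOW: one (context words, word to predict) example per position."""
    return [(context_window(tokens, i, window), target)
            for i, target in enumerate(tokens)]


tokens = "I like deep learning".split()
print(skipgram_pairs(tokens))
print(cbow_examples(tokens))
```

Skip-gram turns every (word, neighbor) combination into its own example, while CBOW groups all the neighbors of a position into a single example.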
- For each word w in the dictionary, compute its pairwise vector similarity with every word w’ appearing in its contexts C.
- Compute the probabilities p(w’|w) for every w’ in C.
- Adjust word vectors in order to maximize p.
How do we compute the probabilities? Easy peasy, with a simple softmax function!
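In the skip-gram formulation the probability is usually written as p(w’|w) = exp(u_w’ · v_w) / Σ_x exp(u_x · v_w), where v_w is the vector of the input word and u_x is the context vector of word x (the symbols u and v are my own labels). The sketch below computes these probabilities for hand-picked toy vectors; the vectors and function names are assumptions for illustration, not trained values.

```python
import math


def dot(a, b):
    """Dot product, used here as the similarity score between two vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to one."""
    highest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical two-dimensional vectors, chosen by hand.
input_vectors = {"tea": [0.9, 0.1]}
context_vectors = {
    "hot": [1.0, 0.0],
    "pot": [0.8, 0.2],
    "flying": [-1.0, 0.5],
}


def context_probabilities(word):
    """Compute p(w'|w) for every candidate context word w'."""
    v = input_vectors[word]
    words = list(context_vectors)
    scores = [dot(context_vectors[w], v) for w in words]
    return dict(zip(words, softmax(scores)))


probs = context_probabilities("tea")
for context_word, p in probs.items():
    print(f"p({context_word} | tea) = {p:.3f}")
```

Training then nudges the vectors so that words actually observed in the context of "tea" get a larger share of the probability mass.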
The structure of a skipgram model is the following:
Now, this full-softmax formulation of word2vec is very inefficient, because the normalization sums over the whole vocabulary for every training example. A more efficient variant is skip-gram with negative sampling, which I won’t discuss here.
I have now talked about two versions of word2vec. But are they the same? Of course not. In general, CBOW is faster to train than skip-gram, but skip-gram better captures semantic word relationships (while CBOW is more syntax-oriented). Both methods can capture complex linguistic patterns beyond similarity, but they fail to make use of global co-occurrence statistics.
Let’s now briefly discuss GloVe (Global Vectors for Word Representation). It is an embedding model (with a cool name) proposed by Stanford researchers. The key idea is to combine prediction-based neural methods with occurrence-based ones. The model produces a word vector space with a meaningful sub-structure. Its authors report that it achieves better results faster than word2vec and also obtains the best results irrespective of speed.
Let’s see some examples of code for Word2Vec.
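A minimal sketch of an analogy query, assuming hand-picked two-dimensional toy vectors instead of a trained model (real Word2Vec vectors are learned from a corpus and have hundreds of dimensions). The analogy helper is hypothetical and written only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(vectors, a, b, c):
    """Solve a : b :: c : ? by finding the word closest to (b - a + c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))


# Toy vectors (assumed): first axis ~ "royalty", second axis ~ "gender".
toy_vectors = {
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "prince": [0.9, 0.8],
    "apple": [-0.5, 0.1],
}

answer = analogy(toy_vectors, "man", "king", "woman")
print(answer)  # queen
```

With a trained model, libraries such as gensim expose a comparable query through most_similar(positive=[...], negative=[...]) on the word vectors.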
What I’ve done in the code above is the computation of an analogy. Word2Vec is able to compute an analogy in the form a : b :: c : ?
The word that replaces the question mark is found by maximizing the cosine similarity between the vector (b - a + c) and the vectors of the other words in the vocabulary (the word with the highest score wins). For example, ‘king - man + woman’ returns ‘queen’.
I think it is very important, when dealing with such models, to talk about biases (gender, race, sexual orientation, etc.) in word vectors. Ideally, an analogy such as man : occupation :: woman : ? should behave the same in both directions. Results show the contrary, though: it often gives different results from the inverse, woman : occupation :: man : ?. Read more in the article https://medium.com/@alycianoelcarey/gender-race-and-disability-bias-in-word-embeddings-49835110fe24. These kinds of biases are inherited from the datasets we use to train the model, and their presence can lead to discrimination, unfairness, inequality, and exclusion.
Hope you enjoyed the article, stay tuned and see you on the other side!