NLP Part 2 - Word Representations


In the previous article (https://medium.com/@umbertofontana/nlp-part-1-introduction-to-nlp-e686611da3da) I introduced the general meaning of NLP and the common pipeline to process text. In this section, I’m going to talk about text representation.

2.1 The Representation Problem


The big problem in NLP: how can we represent text? Of course, computers cannot understand anything other than bits. They understand a totally different (and very difficult!) alphabet. So we need a numerical representation of the text in order for the computer to process and/or generate it robustly. Many techniques have been proposed and, depending on the task, it can be useful to know them all.

2.2 Occurrence-Based Methods

This is a very simple text representation in which each feature is mapped to a specific textual unit (a word, an n-gram, a paragraph, etc.). We can build a vocabulary V from the text corpus and associate each entry in the vocabulary with a unique ID. We then represent each sentence or document in the corpus as a |V|-dimensional vector.
The One-Hot Encoding of words is the most trivial representation of a word in a corpus that we can have. Given a vocabulary of cardinality |V|, each word is represented as a binary vector where all the elements are 0 except for the position corresponding to the represented word in the vocabulary, which is set to 1. An example will surely be more useful. Let’s imagine that we have the vocabulary [“eight”, “dog”, “days”, “week”, “stolen”, “a”]. In this case, the word eight will be represented as [1, 0, 0, 0, 0, 0], while the last word a will be [0, 0, 0, 0, 0, 1]. A sentence like “eight days a week” will be represented as [[1, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 1], [0, 0, 0, 1, 0, 0]].
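To make the mapping concrete, here is a minimal Python sketch (mine, just to reproduce the example above):

```python
# A minimal sketch of one-hot encoding, using the toy vocabulary above.
vocabulary = ["eight", "dog", "days", "week", "stolen", "a"]
word_to_id = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return the |V|-dimensional one-hot vector of a known word."""
    vector = [0] * len(vocabulary)
    vector[word_to_id[word]] = 1
    return vector

print([one_hot(word) for word in "eight days a week".split()])
# [[1, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 1], [0, 0, 0, 1, 0, 0]]
```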
If you’re familiar with machine learning and the notion of the Curse of Dimensionality (in high-dimensional space no one can hear you scream), you can already spot a pitfall. With a very large vocabulary, the number of dimensions quickly becomes too large, creating problems for algorithms that rely on a notion of distance (such as KNN). Also, storing such a sparse matrix is not very efficient.
Bag-of-Words is another occurrence-based method that represents each document as a vector whose i-th component is the number of times the i-th vocabulary word occurs in the document. Let’s suppose we have a document “Rebel rebel, your face is a mess” and a vocabulary [“rebel”, “your”, “face”, “is”, “a”, “mess”, “bowie”]. In this case, the document will be represented as the vector [2, 1, 1, 1, 1, 1, 0]. With this representation, documents containing the same words will have vector representations that are close to each other in Euclidean space, but it can still be problematic with very large vocabularies, and all words are treated as equally important.
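A small sketch (again mine, just to illustrate the idea) that computes the Bag-of-Words vector of the example above:

```python
# Bag-of-Words sketch: count how many times each vocabulary word occurs.
from collections import Counter

vocabulary = ["rebel", "your", "face", "is", "a", "mess", "bowie"]
document = "Rebel rebel, your face is a mess"

# Lowercase, strip punctuation, and count the tokens.
counts = Counter(token.strip(",.").lower() for token in document.split())

print([counts[word] for word in vocabulary])  # [2, 1, 1, 1, 1, 1, 0]
```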
TF-IDF (or term frequency-inverse document frequency) is a more clever technique that aims to quantify the importance of a given word relative to other words in the document and in the corpus. The idea is that if a word appears many times in a document, but does not occur much in the rest of the documents in the corpus, then that word must be of great importance to the document of concern. This is computed by multiplying the Term Frequency with the Inverse Document Frequency.
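In formulas, one common variant is:

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot\log\frac{N}{\mathrm{df}(t)}
$$

where tf(t, d) is the number of times term t occurs in document d, N is the total number of documents in the corpus, and df(t) is the number of documents containing t.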

All the occurrence-based methods have some fundamental drawbacks:

  • They depend on the size of the vocabulary |V|, which can lead to a very high-dimensionality problem (and we don’t want that).
  • They treat words as atomic units and are not able to capture the relationships in meaning between words.
  • If we build an NLP model, we would like it to work on most real-life problems (not only on our experiments). If a word that is not present in our training corpus appears in a text (a so-called Out-of-Vocabulary word, or OOV word), we cannot represent it.

So why did I talk about these methods if they have so many problems? Well, firstly, because they are very simple techniques that are still feasible for simple problems. Secondly, they can still be useful in practice, for example to find document similarities. But now, let’s go deeper into this section.

Let’s see how to perform a TF-IDF text representation in Python using the popular library scikit-learn.
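A minimal sketch with a toy corpus (any list of strings works the same way):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A toy corpus: each string is treated as one document.
corpus = [
    "Rebel rebel, your face is a mess",
    "Eight days a week",
    "I like deep learning",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)  # sparse matrix of shape (n_documents, |V|)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(tfidf_matrix.toarray().round(2))     # one TF-IDF vector per document
```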


2.3 Word Embedding


John Rupert Firth said:

You shall know a word by the company it keeps

This is the main intuition behind word embeddings. Let’s take the word tea. The word tea is usually accompanied by words like drank, pot, kettle, hot, steam, etc. So, wouldn’t it make sense for words similar to tea (like coffee, or any other morning booster) to have similar distributions of surrounding words?

With word embeddings, we are able to map each word in an input vocabulary to a dense vector (typically 200–300 dimensions, much smaller than |V|) that encodes both syntactic and semantic word similarities.


Note that the surrounding context of a word is likely to convey its semantic meaning. The hypothesis that the meaning of a word can be derived from the distribution of contexts in which it appears is also called the distributional hypothesis.


How do we build this kind of encoding? Let’s start with the simplest option: the co-occurrence matrix.

Start with a vocabulary V and create a |V|x|V| matrix filled with zeros. Then define a window size (the number of words to the left/right that will be used in the computation). For each word w, count how many times every other word w’ appears inside its window, and finally normalize each row by its sum. This is a very simple window-based co-occurrence matrix, and the representation of each word is simply one row of it. Usually, larger windows tend to encode more semantic (and, at the extreme, topic-like) properties, while shorter windows tend to encode syntactic properties.

Let’s say that we have this corpus: “I like Deep Learning”, “I like NLP”, “I enjoy flying”. With a window size of one, the co-occurrence matrix is:
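(Raw counts, before row normalization.)

              I  like  Deep  Learning  NLP  enjoy  flying
    I         0    2     0       0      0     1      0
    like      2    0     1       0      1     0      0
    Deep      0    1     0       1      0     0      0
    Learning  0    0     1       0      0     0      0
    NLP       0    1     0       0      0     0      0
    enjoy     1    0     0       0      0     0      1
    flying    0    0     0       0      0     1      0

And a quick Python sketch (mine, just for illustration) of how such a matrix can be computed:

```python
from collections import defaultdict

sentences = [s.split() for s in ["I like Deep Learning", "I like NLP", "I enjoy flying"]]
window = 1

# counts[w][w'] = how many times w' appears within the window around w
counts = defaultdict(lambda: defaultdict(int))
for sentence in sentences:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                counts[word][sentence[j]] += 1

print(counts["I"]["like"])    # 2
print(counts["like"]["NLP"])  # 1
```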


Using explicit word counts in |V|-sized vectors is not the best idea though, for the reasons explained above (curse of dimensionality and so on). Another issue is that simple word counts will over-emphasize the importance of very common words like the.

A better technique is called Word2Vec. The key idea is to train a neural network either to predict the surrounding words of every word in the dictionary (called Skip-Gram word2vec) or to predict a target word given its surrounding words (called Continuous Bag of Words, or CBOW).

Given a large corpus of documents and a dictionary of the words occurring in them, the approach is:

  • For each word w in the dictionary, compute its pairwise vector similarity with every word w’ appearing in its contexts C.
  • Compute the probabilities p(w’|w) for every w’ in C.
  • Adjust the word vectors in order to maximize these probabilities.

How do we compute the probabilities? Easy peasy, with a simple softmax function!
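Writing v_w for the vector of the centre word w and u_{w'} for the (context) vector of a surrounding word w', the standard formulation is:

$$
p(w' \mid w) = \frac{\exp\left(u_{w'}^{\top} v_{w}\right)}{\sum_{k \in V} \exp\left(u_{k}^{\top} v_{w}\right)}
$$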



And the objective? It’s a cross-entropy loss with respect to the true distribution P*(w’|w).
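In the same notation, for an observed pair (w, w') this amounts to minimizing

$$
J(w, w') = -\sum_{v \in V} P^{*}(v \mid w)\,\log p(v \mid w) = -\log p(w' \mid w),
$$

since the true distribution P*(· | w) puts all of its probability mass on the observed context word w'.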



The structure of a Skip-Gram model is the following: the one-hot vector of the centre word is projected through an embedding matrix into a dense vector, which is then passed to a softmax output layer that predicts each of the surrounding context words.

Now, this is a very inefficient implementation of word2vec, since the softmax must be computed over the whole vocabulary at every step. A more efficient implementation is Skip-Gram with negative sampling, which I won’t discuss here.

Now, I talked about two versions of word2vec. But are they the same? Of course not. In general, CBOW is faster to train than Skip-Gram, but Skip-Gram better captures semantic word relationships (while CBOW is more syntax-oriented). Both methods demonstrate the capacity to capture complex linguistic patterns beyond similarity, but fail to make use of global co-occurrence statistics.

Let’s now briefly discuss GloVe (Global Vectors for Word Representation). It is an embedding model (with a cool name) proposed by Stanford researchers. The key idea is to combine prediction-based neural methods with occurrence-based ones. The model produces a word vector space with a meaningful sub-structure, achieves better results faster, and also obtains good results irrespective of training speed.
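For reference, the GloVe objective fits the dot products of word vectors to the logarithm of the global co-occurrence counts X_ij (the number of times word j appears in the context of word i):

$$
J = \sum_{i,j=1}^{|V|} f\left(X_{ij}\right)\left(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^{2}
$$

where w_i and \tilde{w}_j are the word and context vectors, b_i and \tilde{b}_j are biases, and f is a weighting function that limits the influence of very frequent co-occurrences.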

Let’s see some examples of code for Word2Vec.
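A minimal sketch using gensim and the pretrained Google News word2vec vectors (the model name and the analogy below are just one possible choice):

```python
import gensim.downloader as api

# Downloads the pretrained Google News word2vec vectors (~1.6 GB) on first use.
model = api.load("word2vec-google-news-300")

# king - man + woman ≈ ?
result = model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # e.g. [('queen', 0.71...)]
```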


What I’ve done in the code above is the computation of an analogy. Word2Vec is able to compute an analogy in the form a : b :: c : ?

The substitute for the question mark is computed by maximizing the cosine similarity between the vector (b - a + c) and the vectors of the other words in the vocabulary (it returns the word with the highest score). For example, ‘king - man + woman’ will return ‘queen’.

I think it is very important, when dealing with such models, to talk about biases (gender, race, sexual orientation, etc.) in word vectors. Let’s say we compute the analogy man:occupation :: woman:?. Intuitively, it should behave the same in both directions, but results show the contrary: the analogy man:occupation :: woman:? often gives different results with respect to the inverse woman:occupation :: man:?. Read more in this article: https://medium.com/@alycianoelcarey/gender-race-and-disability-bias-in-word-embeddings-49835110fe24. These kinds of biases are intrinsic to the training datasets we use to train the models, and their presence can lead to discrimination, unfairness, inequality, and exclusion.

Hope you enjoyed the article, stay tuned and see you on the other side!



