NLP Part 4 - Contextual Embedding

In the last article (https://medium.com/@umbertofontana/nlp-part-4-toxic-comments-classification-10e7167fa50b) we used word/sentence embeddings combined with Logistic Regression to classify the toxicity of a text among 6 possible labels. It might seem like we could conclude this series here, since these methods are effective and create a global word representation for a machine to understand. Well, wrong… Remember that the road to ChatGPT was long and passed through many little steps. In this article, I will talk about contextual embeddings, with a main focus on two models: Seq2Seq and ELMo.


Context Matters

Yes, our language (no matter which language) is way more complicated than that. If I search for an Italian-to-English translation of the word “campagna”, I obtain the following:

  • Country, countryside, rural area;
  • Land, farmland;
  • Campaign, offensive;
  • Promotion, campaign;

Indeed, even in the translations above the word “campaign” appears twice, but with different meanings. How can we choose the correct one? The Cambridge Dictionary helps us by adding examples:

  • Amo la quiete della campagna [I love the peace and quiet of the country]
  • I frutti della campagna [The fruits of the field]
  • La campagna d’Africa [The African campaign]
  • Campagna pubblicitaria/elettorale [Advertising/election campaign]

So, if you’re not fluent in Italian and you want to understand what your friends meant when they said “Andiamo in vacanza in campagna”, you use DeepL and it outputs, without thinking twice, “Let’s go on vacation in the countryside”. How did it tell the difference between campagna as countryside and campagna as advertising campaign? Well, the context helped. The power of contextual embeddings is that a word gets a different representation depending on the context it is put in! Cool, isn’t it? While in word embedding models a word, say “mouse”, has a unique representation, now, according to the context, it can be seen either as the animal that scares elephants or as the little pointer on our screen that we lose too frequently.

ELMo (Embeddings from Language Models)


If you have read something about deep NLP, you have surely heard at least the name BERT. Well, the trend of giving Muppet names to architectures possibly started with ELMo, published in 2018 in the paper “Deep Contextualized Word Representations” (https://arxiv.org/pdf/1802.05365.pdf).

Note: I often take for granted some ML/DL theory. If you’re not familiar with RNNs and LSTMs, refer to this article: https://medium.com/analytics-vidhya/lstms-explained-a-complete-technically-accurate-conceptual-guide-with-keras-2a650327e8f2

In the ELMo architecture, given a sequence of tokens, each token’s representation is a function of the entire input sentence, computed by a deep (multi-layer) bidirectional language model. Its precursors (like TagLM) relied on word embeddings (or character embeddings) fed as input to a recurrent unit to extract contextual features for a word. ELMo does a similar thing but with some major differences. Its internal structure can be complicated: it makes use of a combination of character embeddings, highway connections, and bi-LSTM layers. The figure below describes the architecture.

[Figure: the ELMo architecture, combining character embeddings, highway connections, and bi-LSTM layers]
The big win of ELMo is how it is used after training. In particular, the final output is a task-specific weighting of all the bi-LSTM layers.

The task-specific output, in particular, takes the form

ELMo_k^task = γ^task · Σ_{j=0..L} s_j^task · h_{k,j}^LM

It is not important to understand every detail of the formula, just that, for the k-th token, the output is a weighted combination of the hidden states h of all the bi-LSTM layers, where the bottom layer h_{k,0}^LM is the (context-independent) word representation x_k produced by the language model. The terms s_j^task are the softmax-normalized weights on the hidden representations from the language model, and γ^task is a task-dependent scaling factor. After training the language model, the task-specific model only has to learn the two parameters γ and s.

In practice, to use ELMo:

  • Train a multi-layer bidirectional language model with character convolutions on raw text. Training consists in predicting the next/previous word given a sequence of words, so there is no need for the explicit labels required in other supervised learning tasks.
  • Freeze the parameters of the language model.
  • For each task, train task-dependent softmax weights to combine the layer-wise representations into a single vector: ELMo represents a token as a linear combination of the corresponding hidden layers (a minimal sketch of this weighting follows below).
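To make the last step concrete, here is a minimal PyTorch sketch of the task-specific layer weighting. The names (ScalarMix, layer_states) are mine, not from the paper, and the layer activations are random stand-ins for the frozen language model’s outputs; only the weights s and the scale γ are trained.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific combination of the frozen language model's layers.

    Only the per-layer weights s and the global scale gamma are trained;
    the language model itself stays frozen."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))  # softmax-normalized in forward()
        self.gamma = nn.Parameter(torch.ones(1))        # task-dependent scaling factor

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, batch, seq_len, dim) activations of the frozen LM
        weights = torch.softmax(self.s, dim=0)                        # s_j^task
        mixed = (weights.view(-1, 1, 1, 1) * layer_states).sum(dim=0)
        return self.gamma * mixed                                     # ELMo_k^task

# Usage with random stand-in activations: 3 layers, batch of 2, 5 tokens, dim 8
mix = ScalarMix(num_layers=3)
fake_layer_states = torch.randn(3, 2, 5, 8)
contextual_embeddings = mix(fake_layer_states)  # shape (2, 5, 8)
```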

Finally, it is important to mention that ELMo is a good choice also because the model is trained on a very large dataset (the 1B Word Benchmark, https://arxiv.org/abs/1312.3005), which allows it to create meaningful context-dependent word embeddings.

Seq2Seq

Sequence to sequence (Seq2Seq) is a paradigm presented in 2014 in the paper “Sequence to Sequence Learning with Neural Networks” (https://arxiv.org/pdf/1409.3215.pdf). Seq2Seq models are often referred to as “encoder-decoder models” due to their structure, which can be divided into two parts:

  • An encoder, which takes the model’s input sequence and encodes it into a fixed-size context vector;
  • A decoder, which uses the context vector above as a “seed” from which to generate an output sequence.

The encoder network reads the embedded representation of an input sequence and generates a fixed-dimensional context vector. To do so, the encoder usually consists of stacked LSTMs.
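As a rough illustration of this encoder, here is a minimal PyTorch sketch (the class name and hyper-parameters are illustrative choices, not taken from the paper): a stacked LSTM reads the embedded input tokens and its final hidden and cell states act as the fixed-size context vector.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Stacked-LSTM encoder: embeds the source tokens and returns the final
    (hidden, cell) states, which act as the fixed-size context vector."""
    def __init__(self, vocab_size: int, emb_dim: int = 64,
                 hidden_dim: int = 128, num_layers: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)

    def forward(self, src_tokens: torch.Tensor):
        # src_tokens: (batch, src_len) integer token ids
        embedded = self.embedding(src_tokens)
        outputs, (hidden, cell) = self.lstm(embedded)
        # `outputs` keeps every time step (handy later for attention);
        # (hidden, cell) is the context that will seed the decoder.
        return outputs, (hidden, cell)

# Usage: a batch of 2 sentences of length 7 from a vocabulary of 1000 tokens
encoder = Encoder(vocab_size=1000)
outputs, context = encoder(torch.randint(0, 1000, (2, 7)))
```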

The decoder is also an LSTM network, but with a more complicated usage. The initial hidden state of its first layer is initialized with the context vector produced by the encoder: this makes the decoder “aware” of the input sentence, while its recurrent state keeps track of the words generated so far. The sequences are also enriched with some important tags: a begin tag (“<START>”, “<BOS>”) appended at the beginning and an end tag (“<END>”, “<EOS>”) appended at the end (where exactly they go depends on the direction of the input sequence; sometimes, in fact, the input is fed in reverse, so that the last thing the encoder sees roughly corresponds to the first thing the model has to output). The first application of the model was English-to-French translation. The decoder-side sequence, in this case, would be “<BOS>J’aime les fêtes<EOS>”. The first token (<BOS>) is fed to the decoder and indicates the start of the output sequence. The decoder uses its internal state (initialized with the final state of the encoder) to produce the first token of the target sequence at its first time step, J’aime (or just J’). The second time step receives the output of the first time step and produces the second token, les. This chain (visually, an LSTM centipede) continues until the <EOS> symbol is produced, which marks the end of the output sequence (the same holds for the reversed case).

Once we have the output sequence, we use the usual learning strategy: define a loss, minimize it with gradient descent, and back-propagate to update the parameters. If you’re wondering how we go from words to numbers and back, note that input words are encoded as one-hot vectors (or indices into an embedding table), and at each step the decoder outputs a probability distribution over the vocabulary from which the next word is picked.
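Pairing with the encoder sketch above (again, class names and hyper-parameters are illustrative, not from the paper), a decoder seeded with the encoder’s final state can generate tokens greedily until it emits <EOS>:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """LSTM decoder: seeded with the encoder's final state, it predicts one
    target token per time step from the previously generated token."""
    def __init__(self, vocab_size: int, emb_dim: int = 64,
                 hidden_dim: int = 128, num_layers: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_token: torch.Tensor, state):
        # prev_token: (batch, 1) id of the previously generated token
        embedded = self.embedding(prev_token)
        output, state = self.lstm(embedded, state)
        logits = self.out(output)  # unnormalized distribution over the vocabulary
        return logits, state

def greedy_decode(encoder, decoder, src_tokens, bos_id, eos_id, max_len=20):
    """Encode the source sentence, then generate greedily until <EOS>."""
    _, state = encoder(src_tokens)                 # context vector seeds the decoder
    token = torch.full((src_tokens.size(0), 1), bos_id, dtype=torch.long)
    generated = []
    for _ in range(max_len):
        logits, state = decoder(token, state)
        token = logits.argmax(dim=-1)              # pick the most likely next word
        generated.append(token)
        if (token == eos_id).all():                # stop once every sequence ends
            break
    return torch.cat(generated, dim=1)
```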

Still Seq2Seq, but With Attention

There’s a flaw in using the final RNN hidden state as the single context vector: if we think of the translation task, usually the first words of the output depend on the first few words of the input, and the last words of the output depend on the last few words of the input. In other words, different output tokens need different parts of the input. Attention mechanisms exploit this observation by giving the decoder a look at the entire input sequence at every decoding step, so that the decoder can decide which input words are important at every time step.

In the decoder, we want to compute each hidden state with a formula that depends on the hidden state at the previous time step, the word generated at the previous step, and a context vector that captures the part of the original sentence relevant to the current decoding step: roughly, h_i = f(h_{i-1}, y_{i-1}, CV_i).

[Figure: the decoder step with attention over the encoder hidden states]

The figure above encapsulates the main steps of the discussion. In particular, in the decoder we generate an attention hidden vector by concatenating the original hidden state vector with the generated context vector.

The context vector CV_i contains the information relevant to the i-th decoding time step. It is computed by first scoring each encoder hidden state (HS in the figure) against the current decoder state; in the figure, the term a refers to any scoring function, for instance a single-layer fully-connected neural network. We then normalize these scores using a softmax layer and compute the context vector CV_i as the weighted average of the hidden vectors of the original sentence. The context vector is finally concatenated with the decoder hidden state to generate the attention hidden vector.
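Here is a minimal sketch of this attention step, again in PyTorch and with illustrative names (dec_state for the current decoder hidden state, enc_states for the encoder hidden states), using a single-layer network as the scoring function a:

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Score each encoder hidden state against the current decoder state,
    softmax the scores, and return the context vector (their weighted average)
    together with the attention hidden vector."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        # The scoring function a: here a single-layer fully-connected network
        self.score = nn.Linear(2 * hidden_dim, 1)

    def forward(self, dec_state: torch.Tensor, enc_states: torch.Tensor):
        # dec_state: (batch, hidden_dim); enc_states: (batch, src_len, hidden_dim)
        src_len = enc_states.size(1)
        expanded = dec_state.unsqueeze(1).expand(-1, src_len, -1)
        scores = self.score(torch.cat([expanded, enc_states], dim=-1)).squeeze(-1)
        alphas = torch.softmax(scores, dim=-1)          # normalized attention weights
        context = torch.bmm(alphas.unsqueeze(1), enc_states).squeeze(1)  # CV_i
        attn_hidden = torch.cat([dec_state, context], dim=-1)
        return context, alphas, attn_hidden

# Usage: batch of 2, source length 5, hidden size 128
attention = Attention(hidden_dim=128)
ctx, alphas, attn_hidden = attention(torch.randn(2, 128), torch.randn(2, 5, 128))
```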

In the context of translation, attention can be thought of as “alignment”: the attention scores α at decoding step i indicate which words in the source sentence align with word i in the target.

Hope this article has been useful, follow for more content, and see you on the other side!
