NLP Part 1 — Introduction to NLP


This is the first article I decided to write about the subfield of Machine Learning known as Natural Language Processing. I will summarize the main topics and challenges of this fascinating field and try to explain them in the most practical way possible.

1.1 What is Natural Language Processing?


No good book about a topic can start without introducing the subject. It is usually the most boring part, but it is the foundation without which we cannot build anything, so I will be very brief. I asked ChatGPT what Natural Language Processing is, and the answer was: “Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language in a way that is similar to how humans do”. The chatbot was very clear, I think, which is not surprising since it is one of the greatest products of Machine Learning/NLP of recent years. But it is not the only tool that relies on Natural Language Processing; many others are present in our daily life, such as virtual assistants (like Google Assistant or Amazon Alexa), chatbots, autocorrect (like my dear friend Grammarly, without which I would write articles with thousands of errors), search engines, etc. The complexity of communication enabled by language is unique to human intelligence among species. Human children acquire language with exceptional sample efficiency (they do not observe that much language) and compute efficiency (brains are very efficient computing machines). With all the advances in NLP in the last decades, we are still nowhere close to developing learning machines with even a fraction of the acquisition ability of children.



1.2 Text Processing


Even though nowadays transformer-based architectures are very good at identifying patterns without much preprocessing, it is good practice to know the main pipeline when approaching an NLP task. But first, let’s take a look at text structure. Usually, when we learn grammar, we first learn how to conjugate verbs and inflect words; for example, the present continuous of the verb to eat is eating. We usually don’t store all these conjugations in our minds; we simply learn the rules and then apply them. Machines, however, can be less clever: a conjugated word can be treated as a completely different term from the form it derives from, which increases the dimensionality of the problem. For this reason, we often use the lemmas or the stems of words. The lemma is the canonical form of a word or a multi-word expression, chosen from a set of candidate forms in a dictionary, whereas the stem is a base form not necessarily found in a dictionary. Stemming is a simpler, more rule-based process than lemmatization, since it doesn’t consider the word’s context and grammatical properties to produce the canonical form. Let’s see how we can extract lemmas in Python:
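Here is a minimal sketch using spaCy; it assumes the small English model has been downloaded beforehand (python -m spacy download en_core_web_sm):

import spacy

# Load spaCy's pre-trained English pipeline
nlp = spacy.load("en_core_web_sm")

sentence = "The answer my friend is blowing in the wind"
doc = nlp(sentence)

# Each token carries its lemma as an attribute
lemmas = [token.lemma_.lower() for token in doc]
print(lemmas)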



The output of this code for our sentence is [‘the’, ‘answer’, ‘my’, ‘friend’, ‘be’, ‘blow’, ‘in’, ‘the’, ‘wind’]. Notice how the words “is” and “blowing” were mapped to their primitive forms “be” and “blow”. The library spaCy (https://spacy.io/) is commonly used in Python for NLP tasks; it contains pre-trained models in multiple languages and allows the use of deep learning models. If we had performed stemming instead, the word “is” would have remained “is”, unchanged.
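To see the difference, here is a minimal stemming sketch using NLTK’s PorterStemmer (the choice of NLTK and the Porter algorithm is my assumption; any rule-based stemmer would illustrate the same point):

from nltk.stem import PorterStemmer

# Porter is a classic rule-based stemmer; it ignores context entirely
stemmer = PorterStemmer()
words = ["the", "answer", "my", "friend", "is", "blowing", "in", "the", "wind"]
print([stemmer.stem(w) for w in words])
# Expected: ['the', 'answer', 'my', 'friend', 'is', 'blow', 'in', 'the', 'wind']

Note that “blowing” is still reduced to “blow” by suffix stripping, but “is” is left untouched because no rule maps it to “be”.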

Now let’s return to text processing and define a common pipeline. The usual text preprocessing steps are:

1 — Text Cleaning: basic filtering steps to remove noise, errors, and redundant content (for example, removing external links or HTML tags).

2 — Tokenization: divide the raw text into units and sub-units (tokens). Note that tokenization is highly language dependent.

3 — Stopword Elimination: some English words, such as the, an, of, a, in, etc., carry little content and can be removed. Some deep NLP models do not require stopword elimination.

4 — Part-Of-Speech Tagging: annotate each word with its role in the sentence, i.e., its part of speech, such as noun, verb, adjective, adverb, pronoun, conjunction, or preposition.

5 — Lemmatization and Stemming: map word inflections and derivations to their canonical form.

Here’s some Python code that can help perform the previous steps:
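The following is a minimal sketch built on spaCy; the cleaning regexes and the model name are illustrative assumptions, not a one-size-fits-all recipe:

import re
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(raw_text):
    # 1 - Text cleaning: strip HTML tags and external links
    text = re.sub(r"<[^>]+>", " ", raw_text)
    text = re.sub(r"https?://\S+", " ", text)

    # 2 - Tokenization: spaCy splits the cleaned text into tokens
    doc = nlp(text)

    processed = []
    for token in doc:
        # 3 - Stopword elimination (punctuation and whitespace are dropped too)
        if token.is_stop or token.is_punct or token.is_space:
            continue
        # 4 - POS tagging and 5 - lemmatization are computed by the pipeline
        processed.append((token.lemma_.lower(), token.pos_))
    return processed

print(preprocess("The answer, my friend, is <b>blowing</b> in the wind."))
# Expected output (model-dependent):
# [('answer', 'NOUN'), ('friend', 'NOUN'), ('blow', 'VERB'), ('wind', 'NOUN')]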


Notice that this pipeline is just a list of common procedures; which steps to apply is very problem dependent!

Another tool that can be useful for these tasks is NLTK (https://www.nltk.org/), used in the stemming example above.

In the next article, we will talk about text representation. Stay tuned and see you on the other side!
