NLP Part 1 — Introduction to NLP
This is the first article in a series on Natural Language Processing (NLP), a subfield of machine learning. I will summarize the main topics and challenges of this fascinating field and try to explain them as practically as possible.
1.1 What is Natural Language Processing?
1.2 Text Processing
For the first sentence, the lemmatized output is [‘the’, ‘answer’, ‘my’, ‘friend’, ‘be’, ‘blow’, ‘in’, ‘the’, ‘wind’]. Notice how the words “is” and “blowing” were mapped to their canonical forms “be” and “blow”. spaCy (https://spacy.io/) is a commonly used Python library for NLP tasks; it ships pre-trained models in multiple languages and supports deep learning models. Had we applied stemming instead, the word “is” would have remained “is”, because stemmers only strip suffixes rather than look up the dictionary form.
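To make the contrast concrete, here is a toy, dependency-free sketch of the two approaches. The suffix rules and the lemma table below are hand-made for illustration; a real project would rely on spaCy or NLTK instead:

```python
# Toy comparison of stemming vs. lemmatization (illustrative only).

LEMMAS = {"is": "be", "blowing": "blow", "was": "be"}  # tiny hand-made table

def naive_stem(word):
    """Chop common suffixes off the word, much cruder than a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def naive_lemmatize(word):
    """Look the word up in a dictionary of canonical forms."""
    return LEMMAS.get(word, word)

words = ["the", "answer", "is", "blowing", "in", "the", "wind"]
print([naive_stem(w) for w in words])       # stemming: "is" stays "is"
print([naive_lemmatize(w) for w in words])  # lemmatization: "is" -> "be"
```

The key difference: the stemmer never changes “is”, while the lemmatizer maps it to its dictionary form “be”, exactly as in the spaCy output above.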
Now let’s return to text processing and define a common pipeline. The usual text preprocessing steps are:
1 — Text Cleaning: basic filtering steps to remove noise, errors, and redundant content (remove, for example, external links, or HTML tags).
2 — Tokenization: divide the raw text into units and sub-units. Notice that the tokenization is very dependent on the language.
3 — Stopword Elimination: some English words, such as the, an, of, a, in, etc., carry little semantic content and can often be removed. Note that some deep NLP models do not require stopword elimination.
4 — Part-Of-Speech tagging: annotate the text words with the corresponding role in the sentence. POS tagging labels each word with its corresponding part of speech, such as noun, verb, adjective, adverb, pronoun, conjunction, preposition, and more.
5 — Lemmatization and Stemming: map inflected and derived word forms to their canonical form.
Here’s a Python code that can help to perform the previous steps:
Notice that the pipeline is just a list of common procedures; which steps to apply is very problem-dependent!
Some other tools that can be useful for the task are:
- TextBlob (https://textblob.readthedocs.io/en/dev/) which provides a simple API for diving into common NLP tasks.
- NLTK (https://www.nltk.org/), one of the most established text-processing packages in Python.
- Polyglot (https://github.com/aboSamoor/polyglot) is multilingual-oriented (tokenization in 165 languages, language detection of 196 languages, POS tagging in 16 languages, etc.).
- Stanza (https://stanfordnlp.github.io/stanza/), a powerful NLP toolkit that supports text analysis and manipulation in as many as 66 different languages.
In the next article, we will talk about text representation. Stay tuned and see you on the other side!