How to preprocess Textual data in machine learning

Preprocessing textual data is an essential step in machine learning tasks involving natural language processing (NLP). It involves transforming raw text data into a format suitable for training machine learning models.

Most of these tasks can easily be done using a Python library called "nltk" (Natural Language Toolkit).
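
If you don't already have it set up, here is a minimal sketch of the installation and the resource downloads used by the examples later in this post (these are the standard NLTK resource names; adjust if your version expects different ones):

```python
# Install once from the command line:
#   pip install nltk
import nltk

# Download the data files used by the examples in this post
nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # stop word lists
nltk.download('wordnet')    # dictionary used by the lemmatizer
```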

Why is it necessary?

Machines cannot understand textual data on their own. Therefore, we need to find a numerical representation of our data while also retaining the meaning of the given text.

Below are the common steps involved in preprocessing textual data.

Lowercasing

Suppose in a sentence we have "He" and "he". Both are the same word, but they will be treated differently if we don't convert them to lowercase. This increases our vocabulary size and memory consumption. That is why it is necessary to convert the given sentence to lowercase.
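
In Python this step is just the built-in str.lower() method:

```python
sentence = "He is a good boy and he is smart"

# "He" and "he" now map to the same word
print(sentence.lower())
# he is a good boy and he is smart
```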

Tokenization

Breaking down the text into smaller units, called tokens. Tokens can be words, sentences, or even characters, depending on the requirements. Tokenization makes it easier to process and analyze the text.

For example, if we have the sentence "He is a good boy",
the word tokens will be: "He", "is", "a", "good", "boy".
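
A minimal sketch using nltk's word_tokenize and sent_tokenize (this assumes the 'punkt' resource downloaded earlier):

```python
from nltk.tokenize import sent_tokenize, word_tokenize

text = "He is a good boy. She is a nice girl."

# Sentence tokens
print(sent_tokenize(text))
# ['He is a good boy.', 'She is a nice girl.']

# Word tokens (punctuation becomes its own token)
print(word_tokenize(text))
# ['He', 'is', 'a', 'good', 'boy', '.', 'She', 'is', 'a', 'nice', 'girl', '.']
```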

Stop Word Removal:

These are words that don't carry much meaning on their own, such as articles (e.g., "a," "an," "the"), pronouns (e.g., "he," "she," "it"), and prepositions (e.g., "in," "on," "at"). They occur very frequently yet contribute little to the overall context, so removing them reduces the size of our data and speeds up computation.
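
A minimal sketch using nltk's built-in English stop word list (this assumes the 'stopwords' resource downloaded earlier):

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

tokens = word_tokenize("he is a good boy")

# Keep only the tokens that are not stop words
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['good', 'boy']
```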

Stemming and Lemmatization

Reducing words to their base or root form to consolidate words with the same meaning is known as stemming.
The problem with stemming is that the root word might not be meaningful. For example:
"Finally", "Final", and "Finalize" may get converted to "Fina", which does not have a meaning. That is where lemmatization comes in.

Lemmatization converts words to their dictionary form (lemma), while stemming reduces words to a common stem. In simpler terms, lemmatization ensures that the root word is a meaningful word present in the dictionary. For example, "running," "ran," and "runs" are all reduced to the lemma "run."

Since it has to understand the meaning and create a meaningful root word, it consumes more time than stemming and is a more difficult process.
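
A minimal sketch comparing nltk's PorterStemmer and WordNetLemmatizer (this assumes the 'wordnet' resource downloaded earlier; pos="v" tells the lemmatizer to treat the words as verbs):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "ran", "runs"]

# Stemming chops off suffixes and cannot relate "ran" to "run"
print([stemmer.stem(w) for w in words])
# ['run', 'ran', 'run']

# Lemmatization maps every form to the dictionary word "run"
print([lemmatizer.lemmatize(w, pos="v") for w in words])
# ['run', 'run', 'run']
```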

Vectorization

In the introduction, I told you that machines cannot understand textual data. They understand numbers, which is why we need to derive a numerical representation of our data while retaining the important information. But how can we achieve that exactly?

There are two commonly used techniques:

  1. Bag of Words

  2. TF/IDF

Let's understand them one by one.

Bag Of Words:

The Bag of Words (BoW) is a popular and simple representation model used in natural language processing (NLP) and information retrieval tasks. It is a way to convert textual data into numerical vectors that can be processed by machine learning algorithms.

The basic idea behind the Bag of Words model is to represent a document as an unordered collection or "bag" of its words, disregarding grammar, word order, and context. It focuses solely on the presence or absence of words and their frequency in the document.

For example:

Suppose we have three Sentences:

  • He is an intelligent boy

  • She is a nice girl

  • That girl is dating a boy.

After applying all the steps other than vectorization, these sentences become:

  • intelligent, boy

  • nice, girl

  • girl, date, boy

Now, create a table that stores the frequency of each word in each sentence:

             | f(boy) | f(intelligent) | f(girl) | f(date) | f(nice)
Sentence 1   |   1    |       1        |    0    |    0    |    0
Sentence 2   |   0    |       0        |    1    |    0    |    1
Sentence 3   |   1    |       0        |    1    |    1    |    0

Each column stores the number of occurrences of the given word in a sentence.
So the representation of the sentences will be:

[1,1,0,0,0], [0,0,1,0,1], [1,0,1,1,0] according to their frequencies.
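
Here is a minimal sketch in plain Python that builds these frequency vectors from the preprocessed sentences (the vocabulary order is fixed by hand to match the table above):

```python
from collections import Counter

# Sentences after lowercasing, stop word removal and lemmatization
docs = [
    ["intelligent", "boy"],
    ["nice", "girl"],
    ["girl", "date", "boy"],
]

vocab = ["boy", "intelligent", "girl", "date", "nice"]

# One frequency vector per sentence (Counter returns 0 for missing words)
vectors = [[Counter(doc)[word] for word in vocab] for doc in docs]
print(vectors)
# [[1, 1, 0, 0, 0], [0, 0, 1, 0, 1], [1, 0, 1, 1, 0]]
```

In practice, libraries such as scikit-learn provide a CountVectorizer that builds the vocabulary and the count matrix for you.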

Now, as we can observe, both "boy" and "intelligent" are given the same weight according to their frequency. In sentiment analysis, the word "intelligent" would be more relevant, but this technique does not take that into account. That is why we use the TF/IDF technique.

TF/IDF:
It stands for Term Frequency / Inverse Document Frequency. We multiply TF and IDF to obtain the vector representation of our words. Suppose we take the following example, where we have already applied tokenization and lemmatization.

For example:
Sent 1: good, boy
Sent 2: good, girl
Sent 3: good, boy, girl

Step 1: Create the Term Frequency table

$$TF(word,\ sentence) = \frac{\text{Number of times the word appears in the sentence}}{\text{Total number of words in the sentence}}$$

      | Sent 1 | Sent 2 | Sent 3
good  |  1/2   |  1/2   |  1/3
boy   |  1/2   |   0    |  1/3
girl  |   0    |  1/2   |  1/3

Explanation: In sentence 1, "good" appears once and there are 2 words in total, therefore TF[good][Sent 1] = 1/2. Similarly, we have filled in the other cells of the table.

Step 2: Create an Inverse Document Frequency table

$$IDF(word) = \log\left(\frac{\text{Total number of sentences}}{\text{Number of sentences containing the word}}\right)$$

Word  | IDF
good  | log(3/3) = 0
boy   | log(3/2)
girl  | log(3/2)

Now multiply the two tables: each column of the TF table is multiplied element-wise by the IDF column. For example, for sentence 1:

       | TF (Sent 1) | IDF      | TF * IDF
good   |     1/2     |    0     | 0
boy    |     1/2     | log(3/2) | 1/2 * log(3/2)
girl   |      0      | log(3/2) | 0

Doing the same for all three sentences gives the final TF/IDF vectors:

       | good | boy            | girl
Sent 1 |  0   | 1/2 * log(3/2) | 0
Sent 2 |  0   | 0              | 1/2 * log(3/2)
Sent 3 |  0   | 1/3 * log(3/2) | 1/3 * log(3/2)

Now we can observe that even if two words have the same frequency within a sentence, the numerical values assigned to them also reflect how informative each word is: a word that appears in every sentence, such as "good", is weighted down to 0, while rarer words such as "boy" and "girl" keep a higher weight.
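
To tie the worked example together, here is a minimal sketch in plain Python that reproduces the TF/IDF values in the table above:

```python
import math

# Sentences after tokenization and lemmatization
docs = [
    ["good", "boy"],
    ["good", "girl"],
    ["good", "boy", "girl"],
]
vocab = ["good", "boy", "girl"]
n_docs = len(docs)

tfidf = []
for doc in docs:
    row = []
    for word in vocab:
        tf = doc.count(word) / len(doc)                 # term frequency
        containing = sum(1 for d in docs if word in d)  # sentences containing the word
        idf = math.log(n_docs / containing)             # inverse document frequency
        row.append(tf * idf)
    tfidf.append(row)

for row in tfidf:
    print([round(v, 3) for v in row])
# [0.0, 0.203, 0.0]
# [0.0, 0.0, 0.203]
# [0.0, 0.135, 0.135]
```

In practice you would use something like scikit-learn's TfidfVectorizer; note that its default formula adds smoothing terms, so its numbers will differ slightly from the hand-computed ones.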

That's all, guys. I hope this post helped you in understanding the different steps of preprocessing textual data in NLP.