Step 1: Tokenization
Tokenization is the process of breaking down a document into individual words or tokens. The goal here is to extract the words that will later be analyzed for frequency. For each document, we create a list of tokens (words).
- Doc 1:
"good quality dog food i have bought several of the vitality"
➤ Tokens:
['good', 'quality', 'dog', 'food', 'i', 'have', 'bought', 'several', 'of', 'the', 'vitality']
➤ Word count: 11
- Doc 2:
"not as advertised product arrived labeled as jumbo"
➤ Tokens:
['not', 'as', 'advertised', 'product', 'arrived', 'labeled', 'as', 'jumbo']
➤ Word count: 8
- Doc 3:
"delight says it all this is a confection"
➤ Tokens:
['delight', 'says', 'it', 'all', 'this', 'is', 'a', 'confection']
➤ Word count: 8
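To make this concrete, here is a minimal Python sketch of the tokenization step. It uses a simple lowercase-and-regex split; a real pipeline might instead use a tokenizer from a library such as NLTK or spaCy.

```python
# Minimal tokenization sketch: lowercase each document and split
# it into word tokens with a simple regex.
import re

docs = [
    "good quality dog food i have bought several of the vitality",
    "not as advertised product arrived labeled as jumbo",
    "delight says it all this is a confection",
]

tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]

for i, tokens in enumerate(tokenized, start=1):
    print(f"Doc {i}: {tokens} (word count: {len(tokens)})")
```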
In the next steps, we’ll calculate the Term Frequency (TF) and Inverse Document Frequency (IDF) for each token in these documents.
Step 2: Term Frequency (TF)
Term Frequency (TF) measures how frequently a term appears in a specific document. The formula is:
$$\mathrm{TF}(t, d) = \frac{\text{count of term } t \text{ in document } d}{\text{total number of terms in document } d}$$
📄 Doc 1 TF:
| Term | Count | TF (count/11) |
| --- | --- | --- |
| good | 1 | 0.0909 |
| quality | 1 | 0.0909 |
| dog | 1 | 0.0909 |
| food | 1 | 0.0909 |
| i | 1 | 0.0909 |
| have | 1 | 0.0909 |
| bought | 1 | 0.0909 |
| several | 1 | 0.0909 |
| of | 1 | 0.0909 |
| the | 1 | 0.0909 |
| vitality | 1 | 0.0909 |
📄 Doc 2 TF:
| Term | Count | TF (count/8) |
| --- | --- | --- |
| not | 1 | 0.125 |
| as | 2 | 0.25 |
| advertised | 1 | 0.125 |
| product | 1 | 0.125 |
| arrived | 1 | 0.125 |
| labeled | 1 | 0.125 |
| jumbo | 1 | 0.125 |
📄 Doc 3 TF:
| Term | Count | TF (count/8) |
| --- | --- | --- |
| delight | 1 | 0.125 |
| says | 1 | 0.125 |
| it | 1 | 0.125 |
| all | 1 | 0.125 |
| this | 1 | 0.125 |
| is | 1 | 0.125 |
| a | 1 | 0.125 |
| confection | 1 | 0.125 |
This process is repeated for all words in all documents. It allows us to assess how important each word is within the context of a single document.
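Continuing the sketch from Step 1, TF can be computed with a `Counter`; the `tokenized` list comes from the earlier snippet.

```python
# Term Frequency: count of term t in document d divided by the
# total number of terms in d.
from collections import Counter

def term_frequency(tokens):
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

tf = [term_frequency(tokens) for tokens in tokenized]
print(round(tf[0]["dog"], 4))  # 0.0909 (1/11)
print(round(tf[1]["as"], 4))   # 0.25   (2/8)
```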
Step 3: Inverse Document Frequency (IDF)
Inverse Document Frequency (IDF) is a measure of how important a term is across the entire corpus (all documents). It decreases the weight of terms that appear in many documents and increases the weight of terms that appear in fewer documents. The formula for IDF is:
$$\mathrm{IDF}(t) = \log\left(\frac{\text{total number of documents}}{1 + \text{number of documents containing term } t}\right)$$
| Term | DF | IDF = log(3 / (1+DF)) | Value |
| --- | --- | --- | --- |
| good | 1 | log(3/2) | 0.4055 |
| quality | 1 | log(3/2) | 0.4055 |
| dog | 1 | log(3/2) | 0.4055 |
| food | 1 | log(3/2) | 0.4055 |
| vitality | 1 | log(3/2) | 0.4055 |
| product | 1 | log(3/2) | 0.4055 |
| jumbo | 1 | log(3/2) | 0.4055 |
| confection | 1 | log(3/2) | 0.4055 |
| the, i, of… | 1 | log(3/2) | 0.4055 |
| as | 1 | log(3/2) | 0.4055 |
IDF helps identify words that are particularly important for differentiating documents in the corpus. Words that appear in every document (like "the", "and", etc.) have a low IDF and are less informative.
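Here is the same variant of IDF used in the table above, log(N / (1 + DF)), in Python. Note that library implementations such as scikit-learn's TfidfVectorizer use a slightly different smoothing, so their numbers will not match these exactly.

```python
# IDF variant used in this article: log(N / (1 + DF(t))), where DF(t)
# is the number of documents containing term t.
import math

def inverse_document_frequency(tokenized_docs):
    n_docs = len(tokenized_docs)
    vocab = set().union(*tokenized_docs)
    df = {t: sum(t in doc for doc in tokenized_docs) for t in vocab}
    return {t: math.log(n_docs / (1 + df[t])) for t in vocab}

idf = inverse_document_frequency(tokenized)
print(round(idf["dog"], 4))  # 0.4055, i.e. log(3/2)
```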
Step 4: TF-IDF Calculation
Now that we have both TF and IDF values, we can compute the TF-IDF value for each term in each document. The formula is:

$$\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$

For example:
- TF-IDF('dog', Doc 1): Using the earlier values for TF and IDF:
- TF('dog', Doc 1) = 0.0909, IDF('dog') = 0.4055
- TF-IDF('dog', Doc 1) = 0.0909 × 0.4055 ≈ 0.0369
- TF-IDF('jumbo', Doc 2): For "jumbo" in Document 2:
- TF('jumbo', Doc 2) = 0.125, IDF('jumbo') = 0.4055
- TF-IDF('jumbo', Doc 2) = 0.125 × 0.4055 ≈ 0.0507
This step gives us a weighted importance of each term within each document, considering both how often the term appears in the document and how unique it is across the corpus.
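Putting the two sketches together, one dictionary comprehension per document reproduces the numbers above.

```python
# TF-IDF(t, d) = TF(t, d) * IDF(t), combining the TF and IDF
# dictionaries computed in the previous snippets.
tfidf = [
    {term: tf_val * idf[term] for term, tf_val in doc_tf.items()}
    for doc_tf in tf
]

print(round(tfidf[0]["dog"], 4))    # 0.0909 * 0.4055 ≈ 0.0369
print(round(tfidf[1]["jumbo"], 4))  # 0.1250 * 0.4055 ≈ 0.0507
```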
Final TF-IDF Matrix
After computing the TF-IDF values for each term in every document, we organize the results in a matrix where each row represents a document and each column represents a unique term from the corpus.
For example, here is the final TF-IDF matrix for the three example documents:
| Term | Doc 1 | Doc 2 | Doc 3 |
| --- | --- | --- | --- |
| good | 0.0369 | 0.0000 | 0.0000 |
| quality | 0.0369 | 0.0000 | 0.0000 |
| dog | 0.0369 | 0.0000 | 0.0000 |
| food | 0.0369 | 0.0000 | 0.0000 |
| i | 0.0369 | 0.0000 | 0.0000 |
| have | 0.0369 | 0.0000 | 0.0000 |
| bought | 0.0369 | 0.0000 | 0.0000 |
| several | 0.0369 | 0.0000 | 0.0000 |
| of | 0.0369 | 0.0000 | 0.0000 |
| the | 0.0369 | 0.0000 | 0.0000 |
| vitality | 0.0369 | 0.0000 | 0.0000 |
| not | 0.0000 | 0.0507 | 0.0000 |
| as | 0.0000 | 0.1014 | 0.0000 |
| advertised | 0.0000 | 0.0507 | 0.0000 |
| product | 0.0000 | 0.0507 | 0.0000 |
| arrived | 0.0000 | 0.0507 | 0.0000 |
| labeled | 0.0000 | 0.0507 | 0.0000 |
| jumbo | 0.0000 | 0.0507 | 0.0000 |
| delight | 0.0000 | 0.0000 | 0.0507 |
| says | 0.0000 | 0.0000 | 0.0507 |
| it | 0.0000 | 0.0000 | 0.0507 |
| all | 0.0000 | 0.0000 | 0.0507 |
| this | 0.0000 | 0.0000 | 0.0507 |
| is | 0.0000 | 0.0000 | 0.0507 |
| a | 0.0000 | 0.0000 | 0.0507 |
| confection | 0.0000 | 0.0000 | 0.0507 |
This matrix provides a way to numerically represent the importance of each term in each document, allowing algorithms to process this data for tasks such as classification, clustering, or similarity comparison.
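As a sketch of how the matrix itself might be assembled (assuming pandas is available for display; a plain dict of dicts works just as well):

```python
# Assemble the full document-term matrix: rows = documents,
# columns = vocabulary terms; terms absent from a document get 0.0.
import pandas as pd  # assumed available; used here only for display

vocab = sorted(set().union(*tokenized))
matrix = pd.DataFrame(
    [[doc_scores.get(term, 0.0) for term in vocab] for doc_scores in tfidf],
    index=["Doc 1", "Doc 2", "Doc 3"],
    columns=vocab,
).round(4)
print(matrix[["dog", "jumbo", "confection"]])
```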
Key Insights from the TF-IDF Matrix
- High TF-IDF values indicate important terms: Terms like "dog," "vitality," "jumbo," and "product" appear with higher values in specific documents, indicating their importance in those documents.
- Low TF-IDF values indicate common or unimportant terms: In a larger corpus, words that appear in many documents, like "the" or "i," receive a low IDF and therefore low TF-IDF values (in this tiny three-document example, every term happens to appear in only one document, so all IDF values are equal). Such stop words are often removed during pre-processing and may not appear in the matrix at all.
- Differentiating terms: The TF-IDF matrix helps us identify which words contribute most to the meaning of each document, distinguishing them from others in the corpus.
Conclusion
This TF-IDF process helps convert raw text data into numerical features that can be used for machine learning algorithms. By assigning higher importance to words that appear less frequently across documents but more often in a given document, TF-IDF helps highlight the unique and distinguishing features of text, which is essential for tasks like document classification, clustering, and information retrieval.
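As one example of such downstream use, cosine similarity between two TF-IDF vectors measures how alike two documents are. A quick sketch using the dictionaries built above:

```python
# Cosine similarity between two sparse TF-IDF vectors stored as dicts.
# Documents with no overlapping terms score 0 against each other.
import math

def cosine_similarity(vec_a, vec_b):
    dot = sum(vec_a.get(t, 0.0) * vec_b.get(t, 0.0) for t in vec_a)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b)

print(cosine_similarity(tfidf[0], tfidf[1]))  # 0.0 — no shared terms
```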