Step 1: Tokenization
Tokenization is the process of breaking down a document into individual words or tokens. The goal here is to extract the words that will later be analyzed for frequency. For each document, we create a list of tokens (words).
- Doc 1:
"good quality dog food i have bought several of the vitality"
➤ Tokens:
['good', 'quality', 'dog', 'food', 'i', 'have', 'bought', 'several', 'of', 'the', 'vitality']
➤ Word count: 11
- Doc 2:
"not as advertised product arrived labeled as jumbo"
➤ Tokens:
['not', 'as', 'advertised', 'product', 'arrived', 'labeled', 'as', 'jumbo']
➤ Word count: 8
- Doc 3:
"delight says it all this is a confection"
➤ Tokens:
['delight', 'says', 'it', 'all', 'this', 'is', 'a', 'confection']
➤ Word count: 8
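To make this concrete, here is a minimal Python sketch of the tokenization step. It uses a simple lowercase-and-regex split; a real pipeline might instead use a tokenizer from a library such as NLTK or spaCy.

```python
# Minimal tokenization sketch: lowercase each document and split
# it into word tokens with a simple regex.
import re

docs = [
    "good quality dog food i have bought several of the vitality",
    "not as advertised product arrived labeled as jumbo",
    "delight says it all this is a confection",
]

tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]

for i, tokens in enumerate(tokenized, start=1):
    print(f"Doc {i}: {tokens} (word count: {len(tokens)})")
```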
In the next steps, we’ll calculate the Term Frequency (TF) and Inverse Document Frequency (IDF) for each token in these documents.
Step 2: Term Frequency (TF)
Term Frequency (TF) measures how frequently a term appears in a specific document. The formula is:
$$\mathrm{TF}(t, d) = \frac{\text{count of term } t \text{ in document } d}{\text{total number of terms in document } d}$$
📄 Doc 1 TF:
| Term | Count | TF (count/11) |
| --- | --- | --- |
| good | 1 | 0.0909 |
| quality | 1 | 0.0909 |
| dog | 1 | 0.0909 |
| food | 1 | 0.0909 |
| i | 1 | 0.0909 |
| have | 1 | 0.0909 |
| bought | 1 | 0.0909 |
| several | 1 | 0.0909 |
| of | 1 | 0.0909 |
| the | 1 | 0.0909 |
| vitality | 1 | 0.0909 |
📄 Doc 2 TF:
| Term | Count | TF (count/8) |
| --- | --- | --- |
| not | 1 | 0.125 |
| as | 2 | 0.25 |
| advertised | 1 | 0.125 |
| product | 1 | 0.125 |
| arrived | 1 | 0.125 |
| labeled | 1 | 0.125 |
| jumbo | 1 | 0.125 |
📄 Doc 3 TF:
| Term | Count | TF (count/8) |
| --- | --- | --- |
| delight | 1 | 0.125 |
| says | 1 | 0.125 |
| it | 1 | 0.125 |
| all | 1 | 0.125 |
| this | 1 | 0.125 |
| is | 1 | 0.125 |
| a | 1 | 0.125 |
| confection | 1 | 0.125 |
This process is repeated for all words in all documents. It allows us to assess how important each word is within the context of a single document.
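Continuing the sketch from Step 1, TF can be computed with a `Counter`; the `tokenized` list comes from the earlier snippet.

```python
# Term Frequency: count of term t in document d divided by the
# total number of terms in d.
from collections import Counter

def term_frequency(tokens):
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

tf = [term_frequency(tokens) for tokens in tokenized]
print(round(tf[0]["dog"], 4))  # 0.0909 (1/11)
print(round(tf[1]["as"], 4))   # 0.25   (2/8)
```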
Step 3: Inverse Document Frequency (IDF)
Inverse Document Frequency (IDF) is a measure of how important a term is across the entire corpus (all documents). It decreases the weight of terms that appear in many documents and increases the weight of terms that appear in fewer documents. The formula for IDF is:
$$\mathrm{IDF}(t) = \log\left(\frac{\text{total number of documents}}{1 + \text{number of documents containing term } t}\right)$$
| Term | DF | IDF = log(3 / (1+DF)) | Value |
| --- | --- | --- | --- |
| good | 1 | log(3/2) | 0.4055 |
| quality | 1 | log(3/2) | 0.4055 |
| dog | 1 | log(3/2) | 0.4055 |
| food | 1 | log(3/2) | 0.4055 |
| vitality | 1 | log(3/2) | 0.4055 |
| product | 1 | log(3/2) | 0.4055 |
| jumbo | 1 | log(3/2) | 0.4055 |
| confection | 1 | log(3/2) | 0.4055 |
| the, i, of… | 1 | log(3/2) | 0.4055 |
| as | 1 | log(3/2) | 0.4055 |
IDF helps identify words that are particularly important for differentiating documents in the corpus. Words that appear in every document (like "the", "and", etc.) have a low IDF and are less informative.
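Here is the same variant of IDF used in the table above, log(N / (1 + DF)), in Python. Note that library implementations such as scikit-learn's TfidfVectorizer use a slightly different smoothing, so their numbers will not match these exactly.

```python
# IDF variant used in this article: log(N / (1 + DF(t))), where DF(t)
# is the number of documents containing term t.
import math

def inverse_document_frequency(tokenized_docs):
    n_docs = len(tokenized_docs)
    vocab = set().union(*tokenized_docs)
    df = {t: sum(t in doc for doc in tokenized_docs) for t in vocab}
    return {t: math.log(n_docs / (1 + df[t])) for t in vocab}

idf = inverse_document_frequency(tokenized)
print(round(idf["dog"], 4))  # 0.4055, i.e. log(3/2)
```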
Step 4: TF-IDF Calculation
Now that we have both TF and IDF values, we can compute the TF-IDF value for each term in each document. The formula is:

$$\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$

For example:
- TF-IDF('dog', Doc 1): Using the earlier values for TF and IDF:
- TF('dog', Doc 1) = 0.0909, IDF('dog') = 0.4055
- TF-IDF('dog', Doc 1) = 0.0909 × 0.4055 ≈ 0.0369
- TF-IDF('jumbo', Doc 2): For "jumbo" in Document 2:
- TF('jumbo', Doc 2) = 0.125, IDF('jumbo') = 0.4055
- TF-IDF('jumbo', Doc 2) = 0.125 × 0.4055 ≈ 0.0507
This step gives us a weighted importance of each term within each document, considering both how often the term appears in the document and how unique it is across the corpus.
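Putting the two sketches together, one dictionary comprehension per document reproduces the numbers above.

```python
# TF-IDF(t, d) = TF(t, d) * IDF(t), combining the TF and IDF
# dictionaries computed in the previous snippets.
tfidf = [
    {term: tf_val * idf[term] for term, tf_val in doc_tf.items()}
    for doc_tf in tf
]

print(round(tfidf[0]["dog"], 4))    # 0.0909 * 0.4055 ≈ 0.0369
print(round(tfidf[1]["jumbo"], 4))  # 0.1250 * 0.4055 ≈ 0.0507
```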
Final TF-IDF Matrix
After computing the TF-IDF values for each term in every document, we organize the results in a matrix where each row represents a document and each column represents a unique term from the corpus.
For example, here is the final TF-IDF matrix for the three example documents:
| Term | Doc 1 | Doc 2 | Doc 3 |
| --- | --- | --- | --- |
| good | 0.0369 | 0.0000 | 0.0000 |
| quality | 0.0369 | 0.0000 | 0.0000 |
| dog | 0.0369 | 0.0000 | 0.0000 |
| food | 0.0369 | 0.0000 | 0.0000 |
| i | 0.0369 | 0.0000 | 0.0000 |
| have | 0.0369 | 0.0000 | 0.0000 |
| bought | 0.0369 | 0.0000 | 0.0000 |
| several | 0.0369 | 0.0000 | 0.0000 |
| of | 0.0369 | 0.0000 | 0.0000 |
| the | 0.0369 | 0.0000 | 0.0000 |
| vitality | 0.0369 | 0.0000 | 0.0000 |
| not | 0.0000 | 0.0507 | 0.0000 |
| as | 0.0000 | 0.1014 | 0.0000 |
| advertised | 0.0000 | 0.0507 | 0.0000 |
| product | 0.0000 | 0.0507 | 0.0000 |
| arrived | 0.0000 | 0.0507 | 0.0000 |
| labeled | 0.0000 | 0.0507 | 0.0000 |
| jumbo | 0.0000 | 0.0507 | 0.0000 |
| delight | 0.0000 | 0.0000 | 0.0507 |
| says | 0.0000 | 0.0000 | 0.0507 |
| it | 0.0000 | 0.0000 | 0.0507 |
| all | 0.0000 | 0.0000 | 0.0507 |
| this | 0.0000 | 0.0000 | 0.0507 |
| is | 0.0000 | 0.0000 | 0.0507 |
| a | 0.0000 | 0.0000 | 0.0507 |
| confection | 0.0000 | 0.0000 | 0.0507 |
This matrix provides a way to numerically represent the importance of each term in each document, allowing algorithms to process this data for tasks such as classification, clustering, or similarity comparison.
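As a sketch of how the matrix itself might be assembled (assuming pandas is available for display; a plain dict of dicts works just as well):

```python
# Assemble the full document-term matrix: rows = documents,
# columns = vocabulary terms; terms absent from a document get 0.0.
import pandas as pd  # assumed available; used here only for display

vocab = sorted(set().union(*tokenized))
matrix = pd.DataFrame(
    [[doc_scores.get(term, 0.0) for term in vocab] for doc_scores in tfidf],
    index=["Doc 1", "Doc 2", "Doc 3"],
    columns=vocab,
).round(4)
print(matrix[["dog", "jumbo", "confection"]])
```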
Key Insights from the TF-IDF Matrix
- High TF-IDF values indicate important terms: Terms like "dog," "vitality," "jumbo," and "product" appear with higher values in specific documents, indicating their importance in those documents.
- Low TF-IDF values indicate common or unimportant terms: In a larger corpus, words that appear in many documents, like "the" or "i," receive a low IDF and therefore low TF-IDF values (in this tiny three-document example, every term happens to appear in only one document, so all IDF values are equal). Such stop words are often removed during pre-processing and may not appear in the matrix at all.
- Differentiating terms: The TF-IDF matrix helps us identify which words contribute most to the meaning of each document, distinguishing them from others in the corpus.
Conclusion
This TF-IDF process helps convert raw text data into numerical features that can be used for machine learning algorithms. By assigning higher importance to words that appear less frequently across documents but more often in a given document, TF-IDF helps highlight the unique and distinguishing features of text, which is essential for tasks like document classification, clustering, and information retrieval.
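As one example of such downstream use, cosine similarity between two TF-IDF vectors measures how alike two documents are. A quick sketch using the dictionaries built above:

```python
# Cosine similarity between two sparse TF-IDF vectors stored as dicts.
# Documents with no overlapping terms score 0 against each other.
import math

def cosine_similarity(vec_a, vec_b):
    dot = sum(vec_a.get(t, 0.0) * vec_b.get(t, 0.0) for t in vec_a)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b)

print(cosine_similarity(tfidf[0], tfidf[1]))  # 0.0 — no shared terms
```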