
- Raw Count: The simplest version, where the term frequency is simply the raw count of the word's occurrences in a document. IDF is calculated using the standard formula IDF(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t.
- Log Normalization: This method applies a log transformation to the term frequency to smooth the influence of frequent terms. This helps control the effect of very frequent words like "the" or "is."
- Double Normalization: Here, the log-normalized TF is divided by the log-normalized count of the document's most frequent term, which further mitigates the influence of long documents. It helps prevent longer documents from having a disproportionate effect on the model.
- Augmented Frequency: This method rescales each term's frequency relative to the most frequent term in the document, so the most common words do not dominate the TF calculation.
- Smooth IDF: In this method, smoothing is applied to the IDF calculation to prevent division by zero and help prevent words that appear in nearly all documents from having a very low IDF score.
- Boolean: This approach treats the presence or absence of a term as a binary variable (1 or 0), rather than counting the occurrences. It is useful in situations where the mere presence of a word in a document is more important than its frequency.
We'll assume we have a small corpus with the following three documents:
- Doc 1: "the cat sat on the mat"
- Doc 2: "the cat sat"
- Doc 3: "the dog sat on the mat"
We will use the term "cat" as an example for our calculations.
Corpus Overview:
- N (total number of documents) = 3
- df("cat") (number of documents containing "cat") = 2 (Doc 1 and Doc 2)
1. Raw Count:
- TF: This is simply the raw count of the word "cat" in each document.
- IDF: We compute IDF as:
IDF(t) = \log\left(\frac{N}{df(t)}\right)
For "cat":
IDF(\text{cat}) = \log\left(\frac{3}{2}\right) = 0.1761
Example TF-IDF for "cat" in Doc 1:
- TF ("cat") = 1 (it appears once in Doc 1)
- TF-IDF ("cat") = 1 * 0.1761 = 0.1761
For Doc 1, the TF-IDF value for "cat" would be 0.1761.
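A quick sketch of this calculation (the worked value log(3/2) = 0.1761 implies base-10 logarithms, so `math.log10` is used here):
```python
import math

doc1 = "the cat sat on the mat".split()
N, df_cat = 3, 2                      # from the corpus overview above

tf = doc1.count("cat")                # raw count -> 1
idf = math.log10(N / df_cat)          # log10(3/2) = 0.1761
print(round(tf * idf, 4))             # 0.1761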
2. Log Normalization:
- TF: The term frequency is log-transformed, which smooths the impact of frequent words.
TF(t) = 1 + \log(\text{count})
For "cat" in Doc 1:
TF(\text{cat}) = 1 + \log(1) = 1
For "cat" in Doc 2:
TF(\text{cat}) = 1 + \log(1) = 1
IDF remains the same as in Raw Count, since the IDF formula is unaffected by TF normalization.
- IDF ("cat") = 0.1761 (same as before).
Example TF-IDF for "cat" in Doc 1:
- TF-IDF = 1 * 0.1761 = 0.1761
For Doc 1, the TF-IDF value for "cat" would still be 0.1761.
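A minimal sketch of the log-normalized variant, continuing the same base-10 convention:
```python
import math

doc1 = "the cat sat on the mat".split()
idf_cat = math.log10(3 / 2)           # 0.1761, same IDF as in Raw Count

count = doc1.count("cat")             # 1
tf_log = 1 + math.log10(count)        # 1 + log10(1) = 1
print(round(tf_log * idf_cat, 4))     # 0.1761
```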
3. Double Normalization:
- TF: This method normalizes the term frequency to limit the effect of very high values.
TF(t) = \frac{1 + \log(\text{count})}{1 + \log(\text{max count})}
Here, max count is the highest frequency of any word in the document (in Doc 1, "the" appears twice, so max count = 2).
For "cat" in Doc 1:
TF(\text{cat}) = \frac{1 + \log(1)}{1 + \log(2)} = \frac{1}{1.3010} \approx 0.7686
Example TF-IDF for "cat" in Doc 1:
- TF-IDF = 0.7686 * 0.1761 ≈ 0.1353
For Doc 1, the TF-IDF value for "cat" would be approximately 0.1353.
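A small sketch of the same computation; `Counter` is used only to find the document's most frequent term:
```python
import math
from collections import Counter

doc1 = "the cat sat on the mat".split()
idf_cat = math.log10(3 / 2)                # 0.1761

counts = Counter(doc1)
max_count = max(counts.values())           # 2 ("the")
tf = (1 + math.log10(counts["cat"])) / (1 + math.log10(max_count))
print(round(tf, 4))                        # 0.7686
print(round(tf * idf_cat, 4))              # 0.1353
```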
4. Augmented Frequency:
- TF: The term frequency is scaled to limit the influence of very frequent words.
TF(t) = 0.5 + 0.5 \times \frac{\text{count}}{\text{max count}}
For "cat" in Doc 1:
TF(\text{cat}) = 0.5 + 0.5 \times \frac{1}{2} = 0.75
Example TF-IDF for "cat" in Doc 1:
- TF-IDF = 0.75 * 0.1761 = 0.1321
For Doc 1, the TF-IDF value for "cat" would be 0.1321.
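The augmented variant in the same sketch style:
```python
import math
from collections import Counter

doc1 = "the cat sat on the mat".split()
idf_cat = math.log10(3 / 2)                                  # 0.1761

counts = Counter(doc1)
tf_aug = 0.5 + 0.5 * counts["cat"] / max(counts.values())    # 0.5 + 0.5 * 1/2 = 0.75
print(round(tf_aug * idf_cat, 4))                            # 0.1321
```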
5. Smooth IDF:
- IDF: Adds 1 to both the total document count and the document frequency, then adds 1 to the result; this prevents division by zero and keeps words that appear in nearly every document from getting an IDF of zero.
IDF(t) = \log\left(\frac{N + 1}{df(t) + 1}\right) + 1
For "cat":
IDF(\text{cat}) = \log\left(\frac{3 + 1}{2 + 1}\right) + 1 = \log\left(\frac{4}{3}\right) + 1 = 0.1249 + 1 = 1.1249
Example TF-IDF for "cat" in Doc 1:
- TF-IDF = 1 * 1.1249 = 1.1249
For Doc 1, the TF-IDF value for "cat" would be 1.1249.
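A sketch of the smoothed IDF, still using base-10 logs to match the worked numbers above (note that libraries such as scikit-learn compute their smooth IDF with the natural log, so their values will differ):
```python
import math

doc1 = "the cat sat on the mat".split()
N, df_cat = 3, 2

idf_smooth = math.log10((N + 1) / (df_cat + 1)) + 1   # log10(4/3) + 1 = 1.1249
tf = doc1.count("cat")                                # 1 (raw count)
print(round(tf * idf_smooth, 4))                      # 1.1249
```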
6. Boolean:
- TF: The term is either present (1) or absent (0).
TF(t) = 1 \text{ if the term is present, else } 0
For "cat" in Doc 1, since it is present:
TF(\text{cat}) = 1
Example TF-IDF for "cat" in Doc 1:
- TF-IDF = 1 * 0.1761 = 0.1761
For Doc 1, the TF-IDF value for "cat" would be 0.1761.
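Finally, the boolean variant as a sketch:
```python
import math

doc1 = "the cat sat on the mat".split()
idf_cat = math.log10(3 / 2)                 # 0.1761, standard IDF as in Raw Count

tf_bool = 1 if "cat" in doc1 else 0         # presence/absence -> 1
print(round(tf_bool * idf_cat, 4))          # 0.1761
```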