
- Raw Count: The simplest version, where the term frequency is simply the raw count of the word's occurrences in a document. IDF is calculated using the standard formula IDF(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t.
- Log Normalization: This method applies a log transformation to the term frequency to smooth the influence of frequent terms. This helps control the effect of very frequent words like "the" or "is."
- Double Normalization: Here, the log-normalized TF is divided by the log-normalized count of the document's most frequent term, which further mitigates the influence of long documents. It helps prevent longer documents from having a disproportionate effect on the model.
- Augmented Frequency: This method rescales each term's frequency relative to the most frequent term in the document, so the most common words do not dominate the TF calculation.
- Smooth IDF: In this method, smoothing is applied to the IDF calculation to prevent division by zero and help prevent words that appear in nearly all documents from having a very low IDF score.
- Boolean: This approach treats the presence or absence of a term as a binary variable (1 or 0), rather than counting the occurrences. It is useful in situations where the mere presence of a word in a document is more important than its frequency.
We'll assume we have a small corpus with the following three documents:
- Doc 1: "the cat sat on the mat"
- Doc 2: "the cat sat"
- Doc 3: "the dog sat on the mat"
We will use the term "cat" as an example for our calculations.
Corpus Overview:
- N (total number of documents) = 3
- df("cat") (number of documents containing "cat") = 2 (Doc 1 and Doc 2)
1. Raw Count:
- TF: This is simply the raw count of the word "cat" in each document.
- IDF: We compute IDF as:
IDF(t) = \log\left(\frac{N}{df(t)}\right)
For "cat":
IDF(\text{cat}) = \log\left(\frac{3}{2}\right) = 0.1761
Example TF-IDF for "cat" in Doc 1:
- TF ("cat") = 1 (it appears once in Doc 1)
- TF-IDF ("cat") = 1 * 0.1761 = 0.1761
For Doc 1, the TF-IDF value for "cat" would be 0.1761.
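A quick sketch of this calculation (the worked value log(3/2) = 0.1761 implies base-10 logarithms, so `math.log10` is used here):
```python
import math

doc1 = "the cat sat on the mat".split()
N, df_cat = 3, 2                      # from the corpus overview above

tf = doc1.count("cat")                # raw count -> 1
idf = math.log10(N / df_cat)          # log10(3/2) = 0.1761
print(round(tf * idf, 4))             # 0.1761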
2. Log Normalization:
- TF: The term frequency is log-transformed, which smooths the impact of frequent words.
TF(t) = 1 + \log(\text{count})
For "cat" in Doc 1:
TF(\text{cat}) = 1 + \log(1) = 1
For "cat" in Doc 2:
TF(\text{cat}) = 1 + \log(1) = 1
IDF remains the same as in Raw Count, since the IDF formula is unaffected by TF normalization.
- IDF ("cat") = 0.1761 (same as before).
Example TF-IDF for "cat" in Doc 1:
- TF-IDF = 1 * 0.1761 = 0.1761
For Doc 1, the TF-IDF value for "cat" would still be 0.1761.
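A minimal sketch of the log-normalized variant, continuing the same base-10 convention:
```python
import math

doc1 = "the cat sat on the mat".split()
idf_cat = math.log10(3 / 2)           # 0.1761, same IDF as in Raw Count

count = doc1.count("cat")             # 1
tf_log = 1 + math.log10(count)        # 1 + log10(1) = 1
print(round(tf_log * idf_cat, 4))     # 0.1761
```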
3. Double Normalization:
- TF: This method normalizes the term frequency to limit the effect of very high values.
TF(t) = \frac{1 + \log(\text{count})}{1 + \log(\text{max count})}
Here, max count is the highest frequency of any word in the document (in Doc 1, "the" appears twice, so max count = 2).
For "cat" in Doc 1:
TF(\text{cat}) = \frac{1 + \log(1)}{1 + \log(2)} = \frac{1}{1.3010} \approx 0.7686
Example TF-IDF for "cat" in Doc 1:
- TF-IDF = 0.7686 * 0.1761 ≈ 0.1353
For Doc 1, the TF-IDF value for "cat" would be approximately 0.1353.
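A small sketch of the same computation; `Counter` is used only to find the document's most frequent term:
```python
import math
from collections import Counter

doc1 = "the cat sat on the mat".split()
idf_cat = math.log10(3 / 2)                # 0.1761

counts = Counter(doc1)
max_count = max(counts.values())           # 2 ("the")
tf = (1 + math.log10(counts["cat"])) / (1 + math.log10(max_count))
print(round(tf, 4))                        # 0.7686
print(round(tf * idf_cat, 4))              # 0.1353
```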
4. Augmented Frequency:
- TF: The term frequency is scaled to limit the influence of very frequent words.
TF(t) = 0.5 + 0.5 \times \frac{\text{count}}{\text{max count}}
For "cat" in Doc 1:
TF(\text{cat}) = 0.5 + 0.5 \times \frac{1}{2} = 0.75
Example TF-IDF for "cat" in Doc 1:
- TF-IDF = 0.75 * 0.1761 = 0.1321
For Doc 1, the TF-IDF value for "cat" would be 0.1321.
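The augmented variant in the same sketch style:
```python
import math
from collections import Counter

doc1 = "the cat sat on the mat".split()
idf_cat = math.log10(3 / 2)                                  # 0.1761

counts = Counter(doc1)
tf_aug = 0.5 + 0.5 * counts["cat"] / max(counts.values())    # 0.5 + 0.5 * 1/2 = 0.75
print(round(tf_aug * idf_cat, 4))                            # 0.1321
```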
5. Smooth IDF:
- IDF: Adds 1 to both the total document count and the document frequency, then adds 1 to the result; this prevents division by zero and keeps words that appear in nearly every document from getting an IDF of zero.
IDF(t) = \log\left(\frac{N + 1}{df(t) + 1}\right) + 1
For "cat":
IDF(\text{cat}) = \log\left(\frac{3 + 1}{2 + 1}\right) + 1 = \log\left(\frac{4}{3}\right) + 1 = 0.1249 + 1 = 1.1249
Example TF-IDF for "cat" in Doc 1:
- TF-IDF = 1 * 1.1249 = 1.1249
For Doc 1, the TF-IDF value for "cat" would be 1.1249.
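A sketch of the smoothed IDF, still using base-10 logs to match the worked numbers above (note that libraries such as scikit-learn compute their smooth IDF with the natural log, so their values will differ):
```python
import math

doc1 = "the cat sat on the mat".split()
N, df_cat = 3, 2

idf_smooth = math.log10((N + 1) / (df_cat + 1)) + 1   # log10(4/3) + 1 = 1.1249
tf = doc1.count("cat")                                # 1 (raw count)
print(round(tf * idf_smooth, 4))                      # 1.1249
```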
6. Boolean:
- TF: The term is either present (1) or absent (0).
TF(t) = 1 \text{ if the term is present, else } 0
For "cat" in Doc 1, since it is present:
TF(\text{cat}) = 1
Example TF-IDF for "cat" in Doc 1:
- TF-IDF = 1 * 0.1761 = 0.1761
For Doc 1, the TF-IDF value for "cat" would be 0.1761.
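Finally, the boolean variant as a sketch:
```python
import math

doc1 = "the cat sat on the mat".split()
idf_cat = math.log10(3 / 2)                 # 0.1761, standard IDF as in Raw Count

tf_bool = 1 if "cat" in doc1 else 0         # presence/absence -> 1
print(round(tf_bool * idf_cat, 4))          # 0.1761
```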