Sparse vs Dense vectors

What is a Sparse Vector?

A sparse vector is a vector in which most of the elements are zero.
We store only the non-zero values and their positions to save memory and computation.
A sentence converted using Bag-of-Words or TF-IDF might look like the sketch below: only 3 values are non-zero and the rest are zeros. This is a sparse representation.
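A minimal sketch of such a vector in code, using made-up counts and scipy's CSR format (the values and dimensionality are illustrative, not from a real corpus):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Illustrative Bag-of-Words count vector: only 3 of 10 positions are non-zero.
bow = np.array([[0, 2, 0, 0, 0, 1, 0, 0, 3, 0]])

sparse_vec = csr_matrix(bow)
print(sparse_vec)
# Roughly:
#   (0, 1)  2
#   (0, 5)  1
#   (0, 8)  3
# Only the non-zero values and their positions are stored.
```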

What is a Dense Vector?

A dense vector is a vector where most or all elements have non-zero values.
Every element is stored explicitly, including zeros.
A BERT embedding of a sentence might look like the sketch below: a fixed-length array of floats (768 dimensions for base BERT), learned by the model, with none skipped. This is a dense representation.
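For contrast, a minimal sketch of a dense vector; a small random array stands in for a real 768-dimensional BERT embedding:

```python
import numpy as np

# Stand-in for a BERT sentence embedding: in practice ~768 learned floats,
# here a small random vector just to show the shape of the data.
rng = np.random.default_rng(0)
dense_vec = rng.normal(size=8)

print(dense_vec.round(3))
# An 8-element float array: every position holds a value, nothing is skipped.
```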

Key Differences Between Sparse and Dense Vectors:

| Feature | Sparse Vector | Dense Vector |
| --- | --- | --- |
| Values | Mostly 0s | Mostly non-zero |
| Memory | Efficient (stores only non-zeros) | Takes more memory |
| Origin | Rule-based (BoW, TF-IDF) | Learned (embeddings, neural nets) |
| Meaning | Surface-level (word counts/freq) | Semantic-level (contextual meaning) |
| Used in | Classical ML | Deep learning / neural networks |
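To make the memory row concrete, a rough comparison (exact sizes depend on dtype and sparse format):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A 10,000-dimensional count vector with only 20 non-zero entries,
# roughly what a short document looks like under a large vocabulary.
dense = np.zeros((1, 10_000), dtype=np.float64)
dense[0, :20] = 1.0

sparse = csr_matrix(dense)
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print("dense bytes :", dense.nbytes)    # 80,000 bytes (10,000 float64 values)
print("sparse bytes:", sparse_bytes)    # a few hundred bytes for 20 non-zeros
```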

Training Process with Sparse Vectors

Let’s dive into a logistic regression example that uses sparse vectors as its input features.

1. Initialization

We first initialize the weights w and bias b, just as we would for dense vectors. The weights are learned during the training process.
Assume we have the following sparse input vector:
Sparse Vector Example:
  • X_sparse = [0,1,0,0,0,0,1] (This vector represents a document in a sparse format, where only positions 2 and 7 have non-zero values.)
For simplicity, let's initialize the weights and bias as follows:
  • Weights (w) = [0.1,−0.2,0.05,0,0,0,−0.1]
  • Bias (b) = 0.1
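A minimal sketch of this setup in Python. In sparse form, only the index/value pairs of the non-zero features need to be stored (the variable names are just for illustration):

```python
# Sparse input: keep only the non-zero positions and their values.
# Indices are 1-based here to match the text (features 2 and 7).
x_sparse = {2: 1.0, 7: 1.0}

# Weights indexed the same way, plus the bias.
w = {1: 0.1, 2: -0.2, 3: 0.05, 4: 0.0, 5: 0.0, 6: 0.0, 7: -0.1}
b = 0.1
```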

2. Prediction Calculation

The model’s prediction is computed by calculating the linear combination of the input features with the weights, plus the bias:

z = w · x + b = Σ_j w_j · x_j + b

where w_j are the weights corresponding to the features, and x_j are the feature values in the sparse vector.
For our example, we only need to focus on the non-zero values in X_sparse, which are at positions 2 and 7.
The calculation of z is short: since every other x_j is zero, we only compute the terms for the non-zero elements:

z = w_2 · x_2 + w_7 · x_7 + b = (−0.2)(1) + (−0.1)(1) + 0.1 = −0.2
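In code, the sum only has to visit the non-zero entries (repeating the setup from the initialization sketch):

```python
x_sparse = {2: 1.0, 7: 1.0}
w = {1: 0.1, 2: -0.2, 3: 0.05, 4: 0.0, 5: 0.0, 6: 0.0, 7: -0.1}
b = 0.1

# Only the non-zero features contribute to the linear combination.
z = sum(w[j] * v for j, v in x_sparse.items()) + b
print(round(z, 2))   # -0.2
```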

3. Sigmoid Activation

Now, we pass the value of z through the sigmoid function to get the predicted probability:

ŷ = σ(z) = 1 / (1 + e^(−z)) = 1 / (1 + e^(0.2)) ≈ 0.45
So, the predicted probability that this document belongs to class 1 is approximately 45%.
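A tiny sigmoid sketch confirming the number:

```python
import math

def sigmoid(z: float) -> float:
    """Squash a real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(round(sigmoid(-0.2), 2))   # 0.45
```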

4. Loss Calculation

Next, we compute the loss using the binary cross-entropy loss function:

L = −[ y · log(ŷ) + (1 − y) · log(1 − ŷ) ]

Where:
  • y is the true label (let’s say it’s 1).
  • ŷ is the predicted probability (0.45 in this case).
The loss is:

L = −log(0.45) ≈ 0.80
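The same calculation in code:

```python
import math

y = 1          # true label
y_hat = 0.45   # predicted probability from the sigmoid step

# Binary cross-entropy for a single example.
loss = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
print(round(loss, 3))   # 0.799
```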

5. Gradient Calculation and Update

To minimize the loss, we compute the gradients of the loss with respect to the weights and bias, and then update them using gradient descent.
For each weight w_j and the bias b, the gradient is computed as:

∂L/∂w_j = (ŷ − y) · x_j
∂L/∂b = (ŷ − y)

Since we only have non-zero values for x_2 and x_7, the gradients with respect to the weights corresponding to the non-zero values are:

∂L/∂w_2 = (0.45 − 1) · 1 = −0.55
∂L/∂w_7 = (0.45 − 1) · 1 = −0.55
∂L/∂b = 0.45 − 1 = −0.55

We can now update the weights using the learning rate η:

w_j ← w_j − η · ∂L/∂w_j
b ← b − η · ∂L/∂b

So the updated values for w_2, w_7, and b are:

w_2 ← −0.2 + 0.55η,  w_7 ← −0.1 + 0.55η,  b ← 0.1 + 0.55η
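A sketch of the update step. The learning rate below (0.1) is an assumed example value, since the post leaves it unspecified:

```python
lr = 0.1                    # assumed learning rate for illustration
y, y_hat = 1, 0.45          # true label and predicted probability
x_sparse = {2: 1.0, 7: 1.0}
w = {1: 0.1, 2: -0.2, 3: 0.05, 4: 0.0, 5: 0.0, 6: 0.0, 7: -0.1}
b = 0.1

error = y_hat - y           # -0.55

# Only the weights of non-zero features receive a gradient update.
for j, v in x_sparse.items():
    w[j] -= lr * error * v
b -= lr * error

print(round(w[2], 3), round(w[7], 3), round(b, 3))   # -0.145 -0.045 0.155
```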

6. Repeat the Process

This process continues iteratively over multiple training examples (documents) and epochs until the model converges, i.e., the weights and biases stabilize.
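Putting the steps together, a hedged end-to-end sketch of this loop over a tiny made-up dataset, with each document stored as a {feature_index: value} dict:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Tiny illustrative dataset: (sparse document, label) pairs.
docs = [({2: 1.0, 7: 1.0}, 1), ({1: 1.0, 3: 2.0}, 0), ({2: 2.0}, 1)]

w = {j: 0.0 for j in range(1, 8)}   # 7 features, weights start at zero
b, lr = 0.0, 0.1

for epoch in range(100):
    for x, y in docs:
        z = sum(w[j] * v for j, v in x.items()) + b
        y_hat = sigmoid(z)
        error = y_hat - y
        for j, v in x.items():      # touch only the non-zero features
            w[j] -= lr * error * v
        b -= lr * error

print({j: round(wj, 3) for j, wj in w.items()}, round(b, 3))
```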

Summary

  • Sparse Vectors: Only store non-zero values, saving memory and computational resources.
  • Training Process: For sparse vectors, only the non-zero values are involved in each computation, making it more efficient.
  • Gradient Descent Update: We compute gradients and update weights and biases just for the non-zero features in each vector.
This approach is very common in NLP tasks, where text data is usually represented in a sparse manner (e.g., bag-of-words or TF-IDF representations).
 
 

Dense Vector and Its Use in Machine Learning

Example: Logistic Regression with Dense Vectors
Let’s go through the training process using a dense input vector in a logistic regression model.

1. Initialization

We initialize the weights w and bias b. Assume the input vector is:
Dense Vector Example:
  • X_dense = [0, 1, 0, 0, 0, 0, 1]
(Still 7-dimensional, just like before, but now every element is stored explicitly as a dense array.)
Weights and bias (reusing the same values as in the sparse example):
  • Weights (w) = [0.1, −0.2, 0.05, 0, 0, 0, −0.1]
  • Bias (b) = 0.1

2. Prediction Calculation

We compute:

z = w · x + b = Σ_j w_j · x_j + b

Since this is dense, we do the full dot product over all 7 dimensions:

z = (0.1)(0) + (−0.2)(1) + (0.05)(0) + (0)(0) + (0)(0) + (0)(0) + (−0.1)(1) + 0.1 = −0.2
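A minimal numpy sketch of the dense computation, using the weights assumed above:

```python
import numpy as np

x_dense = np.array([0., 1., 0., 0., 0., 0., 1.])
w = np.array([0.1, -0.2, 0.05, 0., 0., 0., -0.1])
b = 0.1

z = np.dot(w, x_dense) + b   # full dot product over all 7 dimensions
print(round(z, 2))           # -0.2
```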

3. Sigmoid Activation

We pass z through the sigmoid function, exactly as in the sparse case:

ŷ = σ(−0.2) = 1 / (1 + e^(0.2)) ≈ 0.45

4. Loss Calculation

Using the binary cross-entropy loss:

L = −[ y · log(ŷ) + (1 − y) · log(1 − ŷ) ]

Assume true label y = 1, then:

L = −log(0.45) ≈ 0.80

5. Gradient Calculation and Update

We compute gradients for all weights, since the input is dense.
Let’s calculate:
With ŷ − y = 0.45 − 1 = −0.55, the per-feature gradients are:

| Feature j | x_j | ∂L/∂w_j = (ŷ − y) · x_j |
| --- | --- | --- |
| 1 | 0 | 0 |
| 2 | 1 | −0.55 |
| 3 | 0 | 0 |
| 4 | 0 | 0 |
| 5 | 0 | 0 |
| 6 | 0 | 0 |
| 7 | 1 | −0.55 |

Bias gradient:

∂L/∂b = (ŷ − y) = −0.55

Using learning rate η, the updates are w_j ← w_j − η · ∂L/∂w_j and b ← b − η · ∂L/∂b.
Updated weights:
  • w_2 ← −0.2 + 0.55η and w_7 ← −0.1 + 0.55η
  • All other weights have zero gradient, so they stay the same (no change).
Updated bias:
  • b ← 0.1 + 0.55η
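And the dense update step in numpy. As before, the learning rate value is an assumed example:

```python
import numpy as np

x = np.array([0., 1., 0., 0., 0., 0., 1.])
w = np.array([0.1, -0.2, 0.05, 0., 0., 0., -0.1])
b, y, y_hat, lr = 0.1, 1.0, 0.45, 0.1   # lr = 0.1 is an assumed example value

grad_w = (y_hat - y) * x    # dense gradient: zero wherever x_j is zero
grad_b = (y_hat - y)        # -0.55

w_new = w - lr * grad_w
b_new = b - lr * grad_b
print(w_new.round(3))       # [ 0.1   -0.145  0.05   0.     0.     0.    -0.045]
print(round(b_new, 3))      # 0.155
```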

6. Repeat the Process

Repeat the above steps for multiple training examples and epochs until the model converges.

Sparse vs Dense Vector Usage Across Machine Learning Models

| Model Type | Examples | Input Vector Type | Sparse or Dense? | Supports Sparse Input? |
| --- | --- | --- | --- | --- |
| Linear Models | LogisticRegression, LinearSVC | Bag-of-Words, TF-IDF | Sparse | ✅ Yes (highly optimized) |
| Naive Bayes | MultinomialNB, BernoulliNB | Bag-of-Words, TF-IDF | Sparse | ✅ Yes |
| Tree-Based Models | DecisionTree, RandomForest, XGBoost | Any numeric features | 🚫 Usually Dense | ⚠️ Partial (e.g. XGBoost has sparse optimizations) |
| K-Nearest Neighbors | KNeighborsClassifier | TF-IDF or other features | 🚫 Usually Dense | ⚠️ Technically yes, but inefficient |
| MLP / Shallow Neural Networks | MLPClassifier, Keras, PyTorch | Dense embeddings or numeric | Dense | 🚫 No — must convert to dense |
| Transformers | BERT, RoBERTa, GPT | Token embeddings | Dense | 🚫 No — only dense tensors supported |
| RNN / LSTM / GRU | NLP sequence models | Embedding sequences | Dense | 🚫 No |
| CNN (Text/Image) | TextCNN, ResNet, etc. | Dense embeddings or image tensors | Dense | 🚫 No |
| Recommendation Models | Matrix Factorization, LightFM | User-item interaction matrix | ✅ Often Sparse | ✅ Yes — optimized for sparse input |
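To illustrate the first row of the table, a minimal scikit-learn sketch: TfidfVectorizer produces a scipy sparse matrix that LogisticRegression can consume directly, while models that need dense input would require .toarray() first (the corpus and labels are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

corpus = ["cats are great", "dogs are great", "cats hate dogs", "dogs chase cats"]
labels = [1, 0, 1, 0]                         # made-up labels for illustration

X = TfidfVectorizer().fit_transform(corpus)   # a scipy sparse CSR matrix
clf = LogisticRegression().fit(X, labels)     # sparse input works directly

print(clf.predict(X[:1]))                     # predict on the first document

# Models that only accept dense tensors would need: X_dense = X.toarray()
```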
 