K-Means Algorithm
K-Means is a widely used unsupervised machine learning algorithm that partitions a dataset into K clusters. Each cluster is represented by a centroid, and each data point is assigned to the cluster whose centroid is closest to it. The K-Means algorithm iteratively optimizes the clustering by updating the centroids and reassigning the data points.

Principle of K-Means
- Initialization: Randomly select K data points as the initial centroids.
- Assignment Step: For each data point, compute its distance to each centroid (usually the Euclidean distance) and assign the point to the cluster of the nearest centroid.
- Update Step: Recalculate each cluster's centroid as the mean of all data points assigned to that cluster.
- Iteration: Repeat the Assignment and Update steps until the centroids no longer change significantly or a maximum number of iterations is reached.
Steps in the K-Means Algorithm
- Choose the value of K: The number of clusters, K, into which the data will be divided must be specified beforehand.
- Initialize the centroids: Randomly choose K data points as the initial centroids.
- Assign each data point to the nearest centroid: For each data point, compute its distance from each centroid and assign it to the closest centroid.
- Recalculate the centroids: After assigning the data points, update the centroid of each cluster by computing the mean of all data points in the cluster.
- Repeat steps 3 and 4: Repeat the assignment and update steps until convergence (the centroids no longer change significantly) or until the maximum number of iterations is reached; a minimal implementation sketch follows this list.
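These steps translate almost line-for-line into NumPy. Below is a minimal sketch (the function name `kmeans` and the `tol`/`max_iters` parameters are illustrative choices, not part of any particular library):

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-4, seed=0):
    """Minimal K-Means over an (n_samples, n_features) array X."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly choose k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points;
        # an empty cluster simply keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer change significantly.
        if np.allclose(new_centroids, centroids, atol=tol):
            break
        centroids = new_centroids
    return centroids, labels
```

The broadcast `X[:, None, :] - centroids[None, :, :]` computes every point-to-centroid difference in one step, so no explicit distance loop is needed.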
Formulas in K-Means Algorithm
1. Euclidean Distance Formula
In the K-Means algorithm, Euclidean distance is commonly used to measure the distance between a data point and the cluster centroids. For two points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, the Euclidean distance is:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Where $x$ and $y$ represent the data points, and $n$ is the number of features (dimensions).
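For instance, the distance between point A (1.0, 2.0) and point C (5.0, 8.0) from the worked example later in this post:

```python
import numpy as np

a = np.array([1.0, 2.0])   # point A
c = np.array([5.0, 8.0])   # point C
print(round(float(np.linalg.norm(a - c)), 2))  # 7.21 = sqrt((1-5)^2 + (2-8)^2)
```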
2. Centroid Calculation Formula
The centroid of a cluster is the mean of all data points assigned to it. If cluster $C_k$ contains $n_k$ data points $x_1, x_2, \dots, x_{n_k}$, the centroid is computed as:

$$\mu_k = \frac{1}{n_k} \sum_{i=1}^{n_k} x_i$$

Where $\mu_k$ is the centroid of cluster $C_k$, $n_k$ is the number of points in the cluster, and $x_i$ is the $i$-th data point in cluster $C_k$.
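A quick sketch, using the three points that end up in Cluster 1 of the worked example below:

```python
import numpy as np

cluster = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6]])  # points A, B, E
print(cluster.mean(axis=0).round(2))  # [1.17 1.47], the cluster centroid
```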
3. Objective Function (SSE)
The K-Means algorithm minimizes the sum of squared errors (SSE), which is the sum of squared distances from each data point to its assigned centroid. The objective function is:

$$SSE = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$$

Where $C_k$ is cluster $k$, $\mu_k$ is the centroid of the cluster, and $x$ is a data point assigned to cluster $C_k$.
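A direct NumPy translation of this objective (the function name `sse` is illustrative); the same quantity is what scikit-learn's `KMeans` reports as `inertia_`:

```python
import numpy as np

def sse(X, labels, centroids):
    # Sum of squared distances from each point to its assigned centroid.
    return sum(np.sum((X[labels == k] - centroids[k]) ** 2)
               for k in range(len(centroids)))
```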
4. Stopping Criteria
The stopping criteria for the K-Means algorithm are usually one of the following:
- The centroids no longer change significantly (i.e., they have converged); a tolerance check is sketched after this list.
- A maximum number of iterations is reached.
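In code, the first criterion is typically an elementwise tolerance test on how far the centroids moved between iterations. A sketch using the centroids from the worked example below (`atol` is an assumed tolerance):

```python
import numpy as np

old_centroids = np.array([[1.0, 2.0], [5.0, 8.0], [8.0, 2.0]])
new_centroids = np.array([[1.17, 1.47], [7.33, 9.00], [9.17, 2.50]])

# Converged only if no centroid moved more than the tolerance.
print(np.allclose(new_centroids, old_centroids, atol=1e-4))  # False: keep iterating
```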
Advantages and Disadvantages of K-Means
Advantages:
- Efficiency: K-Means is computationally efficient and works well for large datasets.
- Simplicity: The algorithm is simple to understand and implement.
- Applicability: It is widely used in practical problems such as image compression, customer segmentation, etc.
Disadvantages:
- Choosing K: The user must predefine the number of clusters K, which can be difficult in practice.
- Sensitivity to Initial Centroids: K-Means is sensitive to the initial selection of centroids, which can lead to suboptimal results.
- Assumption of Spherical Clusters: K-Means assumes that clusters are spherical and roughly the same size. It can struggle with non-spherical or unevenly sized clusters.
- Sensitivity to Outliers: K-Means is sensitive to outliers, as outliers can significantly affect the centroids.
Improvement Methods
- K-Means++: A method that improves the initialization of centroids, making the algorithm less sensitive to initial starting points and leading to better convergence.
- Mini-Batch K-Means: A variation of K-Means that uses small random subsets of data to update centroids, speeding up the algorithm for very large datasets.
- Optimal K Selection: Methods like the Elbow Method or Silhouette Score can be used to help select the optimal value of K, as shown in the sketch below.
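In scikit-learn these improvements are readily available; a sketch on an illustrative random dataset:

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(1000, 2))

# K-Means++ initialization is scikit-learn's default (init="k-means++").
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)

# Mini-Batch K-Means updates centroids from small random subsets of the data.
mbk = MiniBatchKMeans(n_clusters=3, batch_size=100, random_state=0).fit(X)

# Elbow Method: plot SSE (inertia_) against K and look for the bend;
# the Silhouette Score offers a complementary quality measure.
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(model.inertia_, 1), round(silhouette_score(X, model.labels_), 3))
```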
Applications of K-Means
- Customer Segmentation: Segment customers based on purchasing behavior, demographics, etc., for targeted marketing strategies.
- Image Compression: Compress an image by clustering pixels into a limited number of dominant colors.
- Anomaly Detection: Identify data points that do not fit well into any cluster, which could be considered outliers.
- Market Basket Analysis: Identify items that are frequently bought together by segmenting products.
Summary
K-Means is a simple and powerful clustering algorithm that partitions data into K clusters by assigning each data point to the closest cluster centroid and then iteratively updating the centroids. Despite its limitations, such as the need to specify K and sensitivity to outliers, it remains widely used due to its efficiency and simplicity. In practical applications, selecting an appropriate K, mitigating the impact of outliers, and using variants like K-Means++ can enhance the algorithm's performance.
K-Means Clustering Example
A worked example using a small 2D dataset and K = 3.
Dataset (2D Points)
Let’s say we have the following 9 data points:
| Point | x | y |
| --- | --- | --- |
| A | 1.0 | 2.0 |
| B | 1.5 | 1.8 |
| C | 5.0 | 8.0 |
| D | 8.0 | 8.0 |
| E | 1.0 | 0.6 |
| F | 9.0 | 11.0 |
| G | 8.0 | 2.0 |
| H | 10.0 | 2.5 |
| I | 9.5 | 3.0 |
We will cluster them into K = 3 clusters.
Step 1: Initialization
Randomly select 3 centroids:
- Centroid 1 (C1): A (1.0, 2.0)
- Centroid 2 (C2): C (5.0, 8.0)
- Centroid 3 (C3): G (8.0, 2.0)
Step 2: Assignment
We compute the Euclidean distance from each point to each centroid.
Distance Table
| Point | Distance to C1 (1.0, 2.0) | Distance to C2 (5.0, 8.0) | Distance to C3 (8.0, 2.0) | Assigned Cluster |
| --- | --- | --- | --- | --- |
| A | 0.00 | 7.21 | 7.00 | C1 |
| B | 0.54 | 7.12 | 6.50 | C1 |
| C | 7.21 | 0.00 | 6.71 | C2 |
| D | 9.22 | 3.00 | 6.00 | C2 |
| E | 1.40 | 8.41 | 7.14 | C1 |
| F | 12.04 | 5.00 | 9.06 | C2 |
| G | 7.00 | 6.71 | 0.00 | C3 |
| H | 9.01 | 7.43 | 2.06 | C3 |
| I | 8.56 | 6.73 | 1.80 | C3 |
Step 3: Update Centroids
Groupings:
- Cluster 1: A, B, E
- Cluster 2: C, D, F
- Cluster 3: G, H, I
New Centroid 1:
- A (1.0, 2.0), B (1.5, 1.8), E (1.0, 0.6)
- x̄ = (1.0 + 1.5 + 1.0)/3 = 1.17
- ȳ = (2.0 + 1.8 + 0.6)/3 = 1.47
🟢 New C1: (1.17, 1.47)
New Centroid 2:
- C (5.0, 8.0), D (8.0, 8.0), F (9.0, 11.0)
- x̄ = (5 + 8 + 9)/3 = 7.33
- ȳ = (8 + 8 + 11)/3 = 9.00
🔵 New C2: (7.33, 9.00)
New Centroid 3:
- G (8.0, 2.0), H (10.0, 2.5), I (9.5, 3.0)
- x̄ = (8 + 10 + 9.5)/3 = 9.17
- ȳ = (2.0 + 2.5 + 3.0)/3 = 2.5
🟠 New C3: (9.17, 2.5)
Step 4: Second Iteration Assignment
Recalculate distances from all points to the new centroids.
| Point | Dist to New C1 (1.17, 1.47) | Dist to New C2 (7.33, 9.00) | Dist to New C3 (9.17, 2.50) | Assigned Cluster |
| --- | --- | --- | --- | --- |
| A | 0.56 | 9.44 | 8.19 | C1 |
| B | 0.47 | 9.26 | 7.70 | C1 |
| C | 7.57 | 2.54 | 6.90 | C2 |
| D | 9.45 | 1.20 | 5.62 | C2 |
| E | 0.89 | 10.52 | 8.39 | C1 |
| F | 12.33 | 2.61 | 8.50 | C2 |
| G | 6.85 | 7.03 | 1.27 | C3 |
| H | 8.89 | 7.03 | 0.83 | C3 |
| I | 8.47 | 6.38 | 0.60 | C3 |
Cluster assignments are unchanged from the first iteration → the algorithm has converged.
Final Clusters
- Cluster 1 (C1): A, B, E → centroid at (1.17, 1.47)
- Cluster 2 (C2): C, D, F → centroid at (7.33, 9.00)
- Cluster 3 (C3): G, H, I → centroid at (9.17, 2.5)
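As a cross-check, the same partition falls out of scikit-learn (a sketch; cluster labels may come out in a different order depending on initialization):

```python
import numpy as np
from sklearn.cluster import KMeans

# The nine points A-I from the worked example.
X = np.array([
    [1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6],
    [9.0, 11.0], [8.0, 2.0], [10.0, 2.5], [9.5, 3.0],
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)                    # groups {A,B,E}, {C,D,F}, {G,H,I}, up to label order
print(km.cluster_centers_.round(2))  # (1.17, 1.47), (7.33, 9.0), (9.17, 2.5) in some order
```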