K-Means Algorithm

K-Means is a widely used unsupervised machine learning algorithm that partitions a dataset into K clusters. Each cluster is represented by a centroid, and each data point is assigned to the cluster whose centroid is closest to it. The K-Means algorithm iteratively optimizes the clustering by updating the centroids and reassigning the data points.

Principle of K-Means

  1. Initialization:
      • Randomly select K data points as the initial centroids.
  2. Assignment Step:
      • For each data point, compute its distance to each centroid (usually using Euclidean distance) and assign the point to the nearest centroid's cluster.
  3. Update Step:
      • Recalculate the centroid of each cluster by computing the mean of all data points assigned to that cluster.
  4. Iterative Process:
      • Repeat the Assignment Step and Update Step until the centroids no longer change significantly or a maximum number of iterations is reached (a usage sketch follows this list).
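In practice this loop is usually run through a library rather than written by hand. A minimal usage sketch with scikit-learn's KMeans (the toy data here is invented purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2D dataset, invented for illustration
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# n_clusters is K; n_init runs several random initializations and keeps the best
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

print(kmeans.labels_)           # cluster index assigned to each data point
print(kmeans.cluster_centers_)  # final centroids
```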

Steps in the K-Means Algorithm

  1. Choose the value of K: The number of clusters, K, must be specified beforehand. It’s the number of clusters into which the data will be divided.
  2. Initialize the centroids: Randomly choose K data points as the initial centroids.
  3. Assign each data point to the nearest centroid: For each data point, compute its distance from each centroid and assign it to the closest one.
  4. Recalculate the centroids: After assigning the data points, update the centroid of each cluster by computing the mean of all data points in the cluster.
  5. Repeat steps 3 and 4: Repeat the assignment and update steps until convergence (centroids no longer change significantly) or until the maximum number of iterations is reached; these steps are implemented in the sketch below.
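The five steps map directly onto code. A minimal from-scratch sketch in NumPy (function and parameter names are my own, not from any particular library):

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-4, seed=0):
    """Minimal K-Means for an (n_samples, n_features) array X."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: distance from every point to every centroid, then assign
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: each new centroid is the mean of the points assigned to it
        # (an empty cluster keeps its old centroid, to avoid a NaN mean)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move significantly
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids
```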

Formulas in K-Means Algorithm

1. Euclidean Distance Formula

In the K-Means algorithm, Euclidean distance is commonly used to measure the distance between a data point and the cluster centroids. For two points $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$, the Euclidean distance is:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
where $x$ and $y$ represent the data points, and $n$ is the number of features (dimensions).
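In NumPy this is a one-liner; for example, the distance from point A (1.0, 2.0) to point C (5.0, 8.0) in the worked example later in this post:

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([5.0, 8.0])
d = np.sqrt(np.sum((x - y) ** 2))  # equivalent to np.linalg.norm(x - y)
print(round(d, 2))  # 7.21
```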

2. Centroid Calculation Formula

The centroid of a cluster is the mean of all data points assigned to the cluster. Let cluster $C_k$ contain $n_k$ data points $x_1, x_2, \ldots, x_{n_k}$; the centroid is computed as:
$$\mu_k = \frac{1}{n_k} \sum_{i=1}^{n_k} x_i$$
where $\mu_k$ is the centroid of cluster $C_k$, $n_k$ is the number of points in the cluster, and $x_i$ is the $i$-th data point in cluster $C_k$.
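With a cluster's points stacked in an array, the centroid is just a column-wise mean (the values here are cluster 1 from the worked example below):

```python
import numpy as np

cluster = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6]])  # points A, B, E
centroid = cluster.mean(axis=0)  # mean of each coordinate across the cluster
print(np.round(centroid, 2))  # [1.17 1.47]
```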

3. Objective Function (SSE)

The K-Means algorithm minimizes the sum of squared errors (SSE), which is the sum of squared distances from each data point to its assigned centroid. The objective function is:
$$SSE = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$
where $C_k$ is cluster $k$, $\mu_k$ is the centroid of the cluster, and $x_i$ is a data point assigned to cluster $C_k$.
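A minimal sketch of the objective, assuming labels is an integer array of cluster assignments; scikit-learn's fitted KMeans exposes the same quantity as its inertia_ attribute:

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(np.sum((X - centroids[labels]) ** 2))
```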

4. Stopping Criteria

The stopping criteria for the K-Means algorithm are usually one of the following:
  • The centroids no longer change significantly (i.e., the centroids have converged).
  • A maximum number of iterations is reached.
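The first criterion reduces to a simple numerical test; a minimal sketch:

```python
import numpy as np

def has_converged(old_centroids, new_centroids, tol=1e-4):
    # True when no centroid moved by more than tol in any coordinate
    return np.allclose(old_centroids, new_centroids, atol=tol)
```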

Advantages and Disadvantages of K-Means

Advantages:

  • Efficiency: K-Means is computationally efficient and works well for large datasets.
  • Simplicity: The algorithm is simple to understand and implement.
  • Applicability: It is widely used in practical problems such as image compression, customer segmentation, etc.

Disadvantages:

  • Choosing K: The user must predefine the number of clusters K, which can be difficult in practice.
  • Sensitivity to Initial Centroids: K-Means is sensitive to the initial selection of centroids, which can lead to suboptimal results.
  • Assumption of Spherical Clusters: K-Means assumes that clusters are spherical and roughly the same size. It can struggle with non-spherical or unevenly sized clusters.
  • Sensitivity to Outliers: K-Means is sensitive to outliers, as outliers can significantly affect the centroids.

Improvement Methods

  1. K-Means++: A method that improves the initialization of centroids, making the algorithm less sensitive to initial starting points and leading to better convergence.
  2. Mini-Batch K-Means: A variation of K-Means that uses small random subsets of the data to update centroids, speeding up the algorithm for very large datasets.
  3. Optimal K Selection: Methods like the Elbow Method or Silhouette Score can be used to help select the optimal value of K (all three improvements appear in the sketch below).
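All three are readily available in scikit-learn; a brief sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # synthetic data

# K-Means++ initialization (the scikit-learn default)
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)

# Mini-Batch K-Means: updates centroids from small random batches
mbk = MiniBatchKMeans(n_clusters=3, batch_size=100, n_init=10,
                      random_state=0).fit(X)

# Silhouette Score to compare candidate values of K (higher is better)
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```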

Applications of K-Means

  • Customer Segmentation: Segment customers based on purchasing behavior, demographics, etc., for targeted marketing strategies.
  • Image Compression: Compress an image by clustering pixels into a limited number of dominant colors.
  • Anomaly Detection: Identify data points that do not fit well into any cluster, which could be considered outliers.
  • Market Basket Analysis: Identify items that are frequently bought together by segmenting products.

Summary

K-Means is a simple and powerful clustering algorithm that partitions data into K clusters by assigning each data point to the closest cluster centroid and then iteratively updating the centroids. Despite its limitations, such as the need to specify K and sensitivity to outliers, it remains widely used due to its efficiency and simplicity. In practical applications, selecting an appropriate K, mitigating the impact of outliers, and using variants like K-Means++ can enhance the algorithm's performance.

K-Means clustering example

A worked example using a small 2D dataset and K = 3.

Dataset (2D Points)

Let’s say we have the following 9 data points:
| Point | x    | y    |
| ----- | ---- | ---- |
| A     | 1.0  | 2.0  |
| B     | 1.5  | 1.8  |
| C     | 5.0  | 8.0  |
| D     | 8.0  | 8.0  |
| E     | 1.0  | 0.6  |
| F     | 9.0  | 11.0 |
| G     | 8.0  | 2.0  |
| H     | 10.0 | 2.5  |
| I     | 9.5  | 3.0  |
We will cluster them into K = 3 clusters.

Step 1: Initialization

Randomly select 3 centroids:
  • Centroid 1 (C1): A (1.0, 2.0)
  • Centroid 2 (C2): C (5.0, 8.0)
  • Centroid 3 (C3): G (8.0, 2.0)

Step 2: Assignment

We compute the Euclidean distance from each point to each centroid.

Distance Table

| Point | Distance to C1 (1.0, 2.0) | Distance to C2 (5.0, 8.0) | Distance to C3 (8.0, 2.0) | Assigned Cluster |
| ----- | ------------------------- | ------------------------- | ------------------------- | ---------------- |
| A     | 0.00                      | 7.21                      | 7.00                      | C1               |
| B     | 0.54                      | 7.12                      | 6.50                      | C1               |
| C     | 7.21                      | 0.00                      | 6.71                      | C2               |
| D     | 9.22                      | 3.00                      | 6.00                      | C2               |
| E     | 1.40                      | 8.41                      | 7.14                      | C1               |
| F     | 12.04                     | 5.00                      | 9.06                      | C2               |
| G     | 7.00                      | 6.71                      | 0.00                      | C3               |
| H     | 9.01                      | 7.43                      | 2.06                      | C3               |
| I     | 8.56                      | 6.73                      | 1.80                      | C3               |
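These distances can be reproduced in a few lines of NumPy:

```python
import numpy as np

points = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6],
                   [9.0, 11.0], [8.0, 2.0], [10.0, 2.5], [9.5, 3.0]])  # A..I
centroids = np.array([[1.0, 2.0], [5.0, 8.0], [8.0, 2.0]])  # C1, C2, C3

# (9, 3) matrix: distance from each point to each centroid
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
print(np.round(dists, 2))
print("assigned to:", dists.argmin(axis=1) + 1)  # 1-based centroid index
```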

Step 3: Update Centroids

Groupings:
  • Cluster 1: A, B, E
  • Cluster 2: C, D, F
  • Cluster 3: G, H, I

New Centroid 1:

  • A (1.0, 2.0), B (1.5, 1.8), E (1.0, 0.6)
  • x̄ = (1.0 + 1.5 + 1.0)/3 = 1.17
  • ȳ = (2.0 + 1.8 + 0.6)/3 = 1.47
🟢 New C1: (1.17, 1.47)

New Centroid 2:

  • C (5.0, 8.0), D (8.0, 8.0), F (9.0, 11.0)
  • x̄ = (5 + 8 + 9)/3 = 7.33
  • ȳ = (8 + 8 + 11)/3 = 9.00
🔵 New C2: (7.33, 9.00)

New Centroid 3:

  • G (8.0, 2.0), H (10.0, 2.5), I (9.5, 3.0)
  • x̄ = (8 + 10 + 9.5)/3 = 9.17
  • ȳ = (2.0 + 2.5 + 3.0)/3 = 2.5
🟠 New C3: (9.17, 2.5)

Step 4: Second Iteration Assignment

Recalculate distances from all points to the new centroids.
| Point | Dist to New C1 (1.17, 1.47) | Dist to New C2 (7.33, 9.00) | Dist to New C3 (9.17, 2.5) | Assigned Cluster |
| ----- | --------------------------- | --------------------------- | -------------------------- | ---------------- |
| A     | 0.56                        | 9.44                        | 8.18                       | C1               |
| B     | 0.47                        | 9.27                        | 7.70                       | C1               |
| C     | 7.57                        | 2.54                        | 6.90                       | C2               |
| D     | 9.45                        | 1.20                        | 5.62                       | C2               |
| E     | 0.88                        | 10.52                       | 8.39                       | C1               |
| F     | 12.34                       | 2.60                        | 8.50                       | C2               |
| G     | 6.85                        | 7.03                        | 1.27                       | C3               |
| H     | 8.89                        | 7.03                        | 0.83                       | C3               |
| I     | 8.47                        | 6.38                        | 0.60                       | C3               |
The cluster assignments are unchanged from the first iteration → the algorithm has converged.

Final Clusters

  • Cluster 1 (C1): A, B, E → centroid at (1.17, 1.47)
  • Cluster 2 (C2): C, D, F → centroid at (7.33, 9.00)
  • Cluster 3 (C3): G, H, I → centroid at (9.17, 2.5)
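As a sanity check, seeding scikit-learn's KMeans with the same initial centroids (A, C, G from Step 1) reproduces these clusters; n_init=1 since explicit initial centroids need only one run:

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6],
                   [9.0, 11.0], [8.0, 2.0], [10.0, 2.5], [9.5, 3.0]])  # A..I
init = np.array([[1.0, 2.0], [5.0, 8.0], [8.0, 2.0]])  # A, C, G from Step 1

km = KMeans(n_clusters=3, init=init, n_init=1).fit(points)
print(km.labels_)  # [0 0 1 1 0 1 2 2 2] -> {A,B,E}, {C,D,F}, {G,H,I}
print(np.round(km.cluster_centers_, 2))  # [[1.17 1.47] [7.33 9.  ] [9.17 2.5 ]]
```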