Step 1: Import Required Libraries
Explanation:
This block imports all the necessary Python libraries that will be used throughout the process. The libraries serve different purposes:
- pandas and numpy: For data handling and numerical operations.
- matplotlib.pyplot and seaborn: For visualizing data and clustering results.
- sklearn.preprocessing.StandardScaler: For standardizing features before clustering.
- sklearn.cluster.KMeans and DBSCAN: For performing clustering.
- sklearn.decomposition.PCA and sklearn.manifold.TSNE: For dimensionality reduction and visualization.
- yellowbrick: For visual diagnostics such as the elbow and silhouette methods.
- sklearn.metrics: For evaluating clustering quality using different statistical metrics.
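Since the code itself is not shown here, a minimal import block consistent with the list above might look like this (the specific yellowbrick visualizers are an assumption based on the elbow and silhouette diagnostics used later):

```python
# Data handling and numerical operations
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing and clustering
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN

# Dimensionality reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Visual diagnostics (elbow and silhouette plots)
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

# Evaluation metrics
from sklearn.metrics import silhouette_score
```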
Step 2: Standardize the RFM Dataset
Explanation:
Clustering algorithms like KMeans are sensitive to the scale of input data. If one variable has a much larger range than the others (for example, Monetary values often being much higher than Recency or Frequency), it can dominate the clustering process.
To ensure each feature contributes equally, we standardize the data using StandardScaler. This transformation converts the values to have a mean of zero and a standard deviation of one. The standardized data is then converted back to a DataFrame with the original column names for ease of interpretation.
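A minimal sketch of this step, assuming the RFM table lives in a DataFrame named rfm with Recency, Frequency, and Monetary columns (names assumed from context):

```python
# Standardize each RFM feature to mean 0, standard deviation 1
scaler = StandardScaler()
scaled_values = scaler.fit_transform(rfm[["Recency", "Frequency", "Monetary"]])

# Convert back to a DataFrame with the original column names
rfm_scaled = pd.DataFrame(
    scaled_values,
    columns=["Recency", "Frequency", "Monetary"],
    index=rfm.index,
)
rfm_scaled.head()
```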
Step 3: Determine Optimal Number of Clusters Using the Elbow Method
Explanation:
This step uses the Elbow Method to help determine the optimal number of clusters (k) for KMeans clustering. The Elbow Method calculates the within-cluster sum of squares (WCSS) for a range of k values (here from 1 to 10). By plotting WCSS against the number of clusters, we look for a point where the curve starts to flatten, which is the "elbow." This is typically the best value for k, because adding more clusters beyond this point gives diminishing returns in terms of reducing WCSS. KElbowVisualizer from the yellowbrick library automates this process and provides a visual output.
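A sketch of this diagnostic, reusing rfm_scaled from Step 2 (note that yellowbrick treats the upper bound of k as exclusive, so k=(1, 11) covers k from 1 to 10):

```python
# Plot WCSS (distortion) for k = 1..10 and mark the elbow
model = KMeans(n_init=10, random_state=42)
visualizer = KElbowVisualizer(model, k=(1, 11))
visualizer.fit(rfm_scaled)
visualizer.show()
```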
Step 4: Fit the KMeans Clustering Algorithm
Explanation:
After deciding on the number of clusters from the previous step, we fit a KMeans clustering model. The key parameters used are:
- n_clusters=5: Number of customer segments.
- max_iter=100: Maximum number of iterations allowed per run.
- n_init=10: Number of times the algorithm will run with different centroid seeds. The best result in terms of inertia will be selected.
- random_state=42: Ensures reproducibility of results.
This step assigns each record (customer) to one of the 5 clusters based on the shortest Euclidean distance to the cluster centroids.
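A sketch of the fit with the parameters listed above, again using rfm_scaled from Step 2:

```python
# Fit KMeans on the standardized RFM features
kmeans = KMeans(n_clusters=5, max_iter=100, n_init=10, random_state=42)
kmeans.fit(rfm_scaled)
```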
Step 5: Add Cluster Labels to the Dataset
Explanation:
After clustering is complete, we add a new column named Clusters to the dataset. This column contains the cluster number (0 to 4) assigned to each customer. Displaying the first few rows allows us to verify that the labels have been assigned properly.
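A sketch of this step, assuming the rfm DataFrame and the fitted kmeans model from the previous steps:

```python
# Attach the cluster label for each customer to the original RFM table
rfm["Clusters"] = kmeans.labels_
rfm.head()
```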
Step 6: View Cluster Centroids
Explanation:
This command returns the coordinates of the centroids for each cluster in the standardized RFM space.
Each centroid represents the average location of all the points in a cluster. By interpreting these centroid values, we can characterize the behavior of customers in each segment (for example, Cluster 0 might represent "High Frequency, High Monetary, Low Recency" customers).
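The command in question is presumably kmeans.cluster_centers_; a sketch that labels the output with the RFM column names (order assumed to match Step 2):

```python
# Centroid coordinates in standardized RFM space, one row per cluster
centroids = pd.DataFrame(
    kmeans.cluster_centers_,
    columns=["Recency", "Frequency", "Monetary"],
)
print(centroids)
```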
Step 7: Visualize the Clusters in Two Dimensions
Explanation:
This visualization plots customers based on their Recency and Frequency values, coloring them by the clusters they belong to.
Although the original RFM data is 3-dimensional, we use only two dimensions here (Recency and Frequency) for simplicity in visualization.
Centroids are marked with yellow stars to show the average position of each cluster. This helps visually assess the separation and compactness of the clusters.
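A sketch of this plot, assuming the column order Recency, Frequency, Monetary from Step 2 (so centroid columns 0 and 1 correspond to Recency and Frequency):

```python
plt.figure(figsize=(8, 6))

# Scatter customers in the Recency-Frequency plane, colored by cluster
sns.scatterplot(
    x=rfm_scaled["Recency"],
    y=rfm_scaled["Frequency"],
    hue=rfm["Clusters"],
    palette="viridis",
)

# Mark each cluster centroid with a yellow star
plt.scatter(
    kmeans.cluster_centers_[:, 0],
    kmeans.cluster_centers_[:, 1],
    s=300, c="yellow", marker="*", edgecolor="black", label="Centroids",
)
plt.xlabel("Recency (standardized)")
plt.ylabel("Frequency (standardized)")
plt.legend()
plt.show()
```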
Step 8: Evaluate Clustering Performance Using Silhouette Score
Explanation:
The Silhouette Score is used to measure the quality of clustering. It calculates how similar each point is to its own cluster compared to other clusters.
The score ranges from -1 to 1:
- A value above 0.7 indicates strong clustering.
- A score between 0.5 and 0.7 suggests good clustering with potential for improvement.
- A score below 0.5 may indicate poor separation between clusters.
This metric provides a quantitative way to evaluate how well the algorithm has grouped the data.
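A sketch of the computation with scikit-learn's silhouette_score:

```python
# Mean silhouette coefficient over all customers (higher is better)
score = silhouette_score(rfm_scaled, kmeans.labels_)
print(f"Silhouette Score: {score:.3f}")
```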
Additional Notes and Suggestions
Based on the silhouette score and visual separation of clusters, you can consider the following if results are not satisfactory:
- Adjust the number of clusters (k) and rerun the algorithm.
- Use different distance metrics such as cosine distance if your data structure suggests it.
- Explore other clustering algorithms like DBSCAN or Gaussian Mixture Models, which might capture non-spherical cluster structures (see the sketch after this list).
- Visualize clusters using PCA or t-SNE for better dimensionality reduction if needed.
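For example, a minimal DBSCAN sketch on the standardized data; the eps and min_samples values here are illustrative and would need tuning for a real dataset:

```python
# DBSCAN groups dense regions and labels sparse points as noise (-1)
dbscan = DBSCAN(eps=0.5, min_samples=5)
db_labels = dbscan.fit_predict(rfm_scaled)

# Inspect cluster sizes, including the noise label
print(pd.Series(db_labels).value_counts())
```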
- Author: Entropyobserver
- URL: https://tangly1024.com/article/1dcd698f-3512-80b5-a03e-d0819dcff112
- Copyright: All articles in this blog, except for special statements, adopt the BY-NC-SA agreement. Please indicate the source!