K-Means Clustering Cheat Sheet

High-Level Overview:

K-Means is an unsupervised learning algorithm that clusters data points into \( k \) groups (clusters) based on their similarities. It tries to find cluster centers (centroids) that minimize the sum of squared distances between points and their assigned cluster centroid.

Machine Learning Engineers often use K-Means for exploratory data analysis, segmenting customers, or identifying structure in unlabeled data.

Key Concepts:

  • Centroids: Representative points for each cluster, computed as the mean of points in that cluster.
  • Assignment Step: Assign each data point to the cluster with the closest centroid.
  • Update Step: Recompute centroids as the mean of all points assigned to each cluster.
  • Iterations: Alternate between assignment and update until convergence or a max iteration limit is reached.
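
The assignment and update steps can be sketched directly in NumPy. The toy 2-D points and initial centroids below are made up for illustration:

```python
import numpy as np

# Toy 2-D data: two loose groups (illustrative values)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
centroids = np.array([[0.2, 0.2], [0.8, 0.8]])  # k = 2 initial guesses

# Assignment step: index of the closest centroid for each point
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)
print(labels)  # → [0 0 0 1 1 1]

# Update step: each centroid moves to the mean of its assigned points
centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
print(centroids)  # cluster means: roughly (0.1, 0.1) and (1.0, 1.0)
```

One full iteration is exactly these two steps; K-Means simply repeats them.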

Objective Function:

\[ \min_{C_1, \ldots, C_k} \sum_{j=1}^{k} \sum_{\mathbf{x} \in C_j} \|\mathbf{x} - \boldsymbol{\mu}_j\|^2 \]

Here:

  • \( C_j \): The \( j \)-th cluster
  • \(\mathbf{x}\): A data point
  • \(\boldsymbol{\mu}_j\): The centroid (mean) of cluster \( C_j \)

The goal is to find cluster assignments that minimize the sum of squared distances of points to their cluster centroids.
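
For a given set of assignments, the objective can be evaluated directly. The points, labels, and centroids below are illustrative:

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
labels = np.array([0, 0, 0, 1, 1, 1])            # illustrative assignments
centroids = np.array([[0.1, 0.1], [1.0, 1.0]])   # cluster means

# Within-cluster sum of squared distances (the K-Means objective)
sse = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(2))
print(sse)  # ≈ 0.08
```

Each iteration of K-Means can only decrease (or leave unchanged) this quantity, which is why the algorithm is guaranteed to converge, though possibly to a local minimum.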

Algorithm Steps (K-Means):

  1. Initialization: Choose \( k \) initial centroids (often randomly from the data).
  2. Assignment Step: Assign each point to the nearest centroid to form clusters.
  3. Update Step: Recalculate the centroid of each cluster as the mean of all assigned points.
  4. Repeat: Alternate between assignment and update steps until assignments don’t change significantly or a max iteration count is reached.
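
The four steps above can be combined into a minimal from-scratch implementation. This is a sketch, not a production version; scikit-learn's `KMeans` adds smarter initialization (k-means++) and multiple restarts:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-Means: random initialization, then alternate the
    assignment and update steps until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2. Assignment: label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its assigned points
        #    (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Calling `kmeans(X, 2)` on two well-separated groups of points recovers the grouping after a handful of iterations.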

Example (Nail Polish Clustering):

Imagine you have a set of nail polishes described by features like color intensity, shine level, and drying time. You have no labels (e.g., "premium" or "regular"), but you want to group them into similar types.

Applying K-Means with \( k=3 \), you might discover:

  • Cluster 1: Highly shiny, bright polishes
  • Cluster 2: Matte, darker polishes
  • Cluster 3: Moderate shine, average brightness

K-Means helps you discover these natural groupings without any predefined labels.

Differences Compared to K-Nearest Neighbors (KNN):

  • Supervision:
    • K-Means: Unsupervised (no labels). The algorithm finds groups on its own.
    • KNN: Supervised (labels required). KNN uses known labels of neighbors to classify or regress.
  • Goal:
    • K-Means: Find cluster centers that minimize within-cluster variance.
    • KNN: Predict the label of a new point by looking at the nearest labeled points.
  • Output:
    • K-Means: Provides cluster assignments for each point, no predefined classes.
    • KNN: Provides a predicted class or value based on closest labeled neighbors.
  • Distance Usage:
    • K-Means: Uses distance to form and update clusters without prior knowledge of labels.
    • KNN: Uses distance to find neighbors among labeled examples to predict an unknown point’s label.
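
The contrast shows up directly in scikit-learn's API: `KMeans` fits on features alone, while `KNeighborsClassifier` requires labels. The toy data below is made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]])
y = np.array([0, 0, 1, 1])  # labels exist only for the KNN case

# K-Means: fits on X alone and invents its own cluster IDs
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster IDs; the numbering is arbitrary

# KNN: needs y at fit time, then predicts labels for new points
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[0.95, 0.95]]))  # → [1]
```

Note that K-Means cluster IDs carry no inherent meaning (cluster 0 on one run may be cluster 1 on another), whereas KNN's output is an actual class label.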

Step-by-Step Summary:

  1. Choose \( k \): Decide on the number of clusters.
  2. Initialize Centroids: Often randomly pick \( k \) points as initial centroids.
  3. Assignment & Update: Assign points to nearest centroid, then recalculate centroids.
  4. Iterate: Repeat until cluster assignments stabilize or a stopping criterion is met.
  5. Interpret Results: Examine the clusters discovered to understand data structure.
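
For step 1, a common heuristic for picking \( k \) is the "elbow" method: fit K-Means for several values of \( k \) and look for the point where inertia (the objective value) stops dropping sharply. A sketch on synthetic data with three obvious groups:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic 2-D data with three well-separated groups (illustrative)
X = np.vstack([rng.normal(c, 0.1, size=(20, 2)) for c in (0.0, 1.0, 2.0)])

for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 3))  # inertia falls sharply until k = 3
```

Inertia always decreases as \( k \) grows, so the "elbow" (where the decrease levels off) is the usual stopping point rather than the minimum itself.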

ML Engineer Perspective:

  • Use K-Means to explore unlabeled data and find natural groupings.
  • Good initial clustering method due to its simplicity and speed.
  • Compare outcomes with other clustering methods or use as a preprocessing step for complex tasks.

Code Example (Python with scikit-learn):


import numpy as np
from sklearn.cluster import KMeans

# Example: Nail polish features: [Color Intensity, Shine Level, Drying Time]
X = np.array([
    [0.8,  0.9,  0.5],
    [0.75, 0.85, 0.52],
    [0.65, 0.95, 0.49],
    [0.9,  0.7,  0.6],
    [0.7,  0.8,  0.53],
    [0.3,  0.4,  0.5],
    [0.4,  0.35, 0.55],
    [0.2,  0.45, 0.48],
    [0.5,  0.6,  0.58],
    [0.55, 0.65, 0.52]
])

# n_init set explicitly for consistent behavior across scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)

print("Cluster Centers:", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)

This example runs K-Means with \( k=3 \) clusters on a small dataset of nail polish features. After fitting, it prints the cluster centers and each data point’s assigned label (0, 1, or 2).
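
A fitted model can also assign new, unseen points to the nearest learned centroid via `predict`. The sketch below refits on a subset of the polish data; the new polish values are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

# Subset of the nail polish features: [Color Intensity, Shine, Drying Time]
X = np.array([[0.8, 0.9, 0.5], [0.75, 0.85, 0.52], [0.3, 0.4, 0.5],
              [0.4, 0.35, 0.55], [0.5, 0.6, 0.58], [0.55, 0.65, 0.52]])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# predict() maps new, unseen points to the nearest learned centroid
new_polishes = np.array([[0.78, 0.88, 0.51],   # hypothetical new samples
                         [0.25, 0.40, 0.50]])
print(kmeans.predict(new_polishes))
```

This is how a fitted clustering can be reused, for example to route a newly released polish into an existing segment without refitting.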

Key Takeaways:

  • K-Means is an unsupervised clustering method that groups data into \( k \) clusters.
  • It minimizes within-cluster sum of squared distances to centroids.
  • Unlike KNN, it does not require labels and is used for discovering structure, not predicting labels.
  • Simple, efficient, and widely used as a baseline clustering technique.