Session 9: Clustering Without Crying

Practical guide to unsupervised learning

Anna Smirnova, November 10, 2025

Today's Plan

15 min: Common clustering pitfalls
30 min: Hands-on clustering exercise

Goal: You can cluster data and know when it's working


Supervised vs Unsupervised

Supervised: You have labels

X = [[features]], y = [labels]  # "This is a cat", "This is a dog"
model.fit(X, y)  # Learn from labels

Unsupervised: NO labels

X = [[features]]  # No y!
model.fit(X)  # Find patterns on its own

Clustering = grouping similar things together without being told what the groups are


The Clustering Toolkit

K-Means: Fast, simple, assumes spherical clusters

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(X)

DBSCAN: Finds arbitrary shapes, handles noise

from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)

Hierarchical: Builds tree of clusters

from sklearn.cluster import AgglomerativeClustering
hclust = AgglomerativeClustering(n_clusters=3)
labels = hclust.fit_predict(X)

K-Means: The Workhorse

Algorithm:

  1. Pick k random points as cluster centers
  2. Assign each point to nearest center
  3. Move centers to mean of assigned points
  4. Repeat until centers stop moving
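
A minimal NumPy sketch of these four steps (illustrative only; kmeans_sketch is a made-up helper, and in practice sklearn's KMeans adds smarter initialization and handles edge cases like empty clusters):

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Pick k random points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Move each center to the mean of its assigned points
        #    (empty clusters are not handled here)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop when the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers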

Pros:

  • Fast and simple
  • Works well for spherical clusters

Cons:

  • Must choose k beforehand
  • Fails on non-spherical shapes
  • Sensitive to outliers

Common Error #1: Not Scaling Data

# ❌ WRONG - Features on different scales
# [[income, age], [50000, 25], [60000, 30]]
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)  # Income dominates!

# ✅ RIGHT - Scale first
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans.fit(X_scaled)

K-Means uses distance → always scale your features!


Common Error #2: Wrong Number of Clusters

# ❌ Guessing k randomly
kmeans = KMeans(n_clusters=5)  # Why 5?

# ✅ Use elbow method
import matplotlib.pyplot as plt

inertias = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 11), inertias)
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
# Look for the "elbow" - where inertia stops decreasing sharply

Elbow = sweet spot for k


Common Error #3: Using K-Means on Non-Spherical Data

# Two half-moon shapes next to each other
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=200, noise=0.05)

# ❌ K-Means fails spectacularly
kmeans = KMeans(n_clusters=2)
labels = kmeans.fit_predict(X)  # Splits each moon in half!

# ✅ DBSCAN works great
dbscan = DBSCAN(eps=0.3)
labels = dbscan.fit_predict(X)  # Finds the moons!

Know your data's shape before choosing an algorithm


DBSCAN: For Weird Shapes

Parameters:

  • eps: How close points need to be to join a cluster
  • min_samples: Minimum points to form a cluster

dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)

Special label: -1 = noise/outliers
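
For example, counting clusters and noise points after fitting (a small illustrative snippet; labels is the array returned by fit_predict above):

import numpy as np

n_noise = np.sum(labels == -1)  # points DBSCAN flagged as outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 is noise, not a cluster
print(f"{n_clusters} clusters, {n_noise} noise points")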

Pros:

  • Finds arbitrary shapes
  • Automatically detects outliers
  • Don't need to specify number of clusters

Cons:

  • Sensitive to eps and min_samples
  • Struggles with varying densities

Choosing eps for DBSCAN

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# Plot k-distance graph
neighbors = NearestNeighbors(n_neighbors=5)
neighbors.fit(X_scaled)
distances, indices = neighbors.kneighbors(X_scaled)

# Sort each point's distance to its 5th neighbor, smallest to largest
distances = np.sort(distances[:, 4], axis=0)
plt.plot(distances)
plt.ylabel('5th Nearest Neighbor Distance')

# Look for the "knee" - that's your eps!

The knee in the curve = good eps value


Evaluating Clusters (Without Labels)

Silhouette Score:

  • Measures how similar points are to their own cluster vs other clusters
  • Range: -1 to 1 (higher = better)

from sklearn.metrics import silhouette_score

labels = kmeans.fit_predict(X_scaled)
score = silhouette_score(X_scaled, labels)
print(f"Silhouette Score: {score:.2f}")
# > 0.5 = good clustering
# < 0.2 = poor clustering

Warning: High score doesn't always mean meaningful clusters!


Evaluating Clusters (With Ground Truth)

If you DO have labels (for testing):

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Adjusted Rand Index (1 = perfect, ~0 = random labeling; can dip below 0)
ari = adjusted_rand_score(y_true, labels)

# Normalized Mutual Information (0 to 1, higher = better)
nmi = normalized_mutual_info_score(y_true, labels)

Use these to compare clustering algorithms


Visualizing Clusters

2D data:

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], 
            kmeans.cluster_centers_[:, 1],
            marker='X', s=200, c='red')  # Show centers

High-dimensional data → reduce to 2D first:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='viridis')

Always visualize to sanity-check your clusters!


Hierarchical Clustering

Builds a tree (dendrogram):

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Fit clustering
hclust = AgglomerativeClustering(n_clusters=3)
labels = hclust.fit_predict(X_scaled)

# Visualize dendrogram
linkage_matrix = linkage(X_scaled, method='ward')
dendrogram(linkage_matrix)
plt.show()

Pro: Don't need to pick k beforehand (can cut the tree anywhere - see the sketch below)
Con: Slow on large datasets
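
One way to "cut the tree anywhere" is to cluster by a distance threshold instead of a fixed k. A minimal sketch - the threshold of 10.0 is purely illustrative and depends on your data's scale:

from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram wherever a merge exceeds the chosen distance
labels_cut = fcluster(linkage_matrix, t=10.0, criterion='distance')

# Same idea in sklearn: let a distance threshold decide the number of clusters
hclust_cut = AgglomerativeClustering(n_clusters=None, distance_threshold=10.0)
labels_sk = hclust_cut.fit_predict(X_scaled)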


Quick Reference: Which Algorithm?

K-Means:

  • Spherical, well-separated clusters
  • Know number of clusters
  • Need speed

DBSCAN:

  • Arbitrary shapes
  • Don't know number of clusters
  • Have outliers to detect

Hierarchical:

  • Want to explore different k values
  • Small-medium datasets
  • Need dendrogram

Start with K-Means, try DBSCAN if it fails


Debugging: "My Clusters Make No Sense"

All points in one cluster:

  • K-Means: k too small
  • DBSCAN: eps too large

Every point is its own cluster:

  • K-Means: k too large
  • DBSCAN: eps too small or min_samples too high (most points end up as noise, label -1)

Clusters split obvious groups:

  • Forgot to scale features
  • Wrong algorithm (try DBSCAN if K-Means fails)

Silhouette score is negative:

  • Data might not have natural clusters
  • Try different algorithm or k

Pro Tips

1. Always Scale

from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)

2. Try Multiple k Values

for k in range(2, 10):
    kmeans = KMeans(n_clusters=k)
    score = silhouette_score(X_scaled, kmeans.fit_predict(X_scaled))
    print(f"k={k}: score={score:.2f}")

3. Visualize Everything

  • Plot elbow curve
  • Plot silhouette scores
  • Plot actual clusters

4. Not All Data Has Clusters

  • Sometimes data is uniformly distributed
  • Clustering will still output labels, but they're meaningless
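
A quick sanity check along these lines (illustrative sketch: uniform random data has no structure, yet K-Means still returns labels and an unimpressive silhouette score):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X_uniform = rng.uniform(size=(500, 2))  # no real clusters here

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_uniform)
print(silhouette_score(X_uniform, labels))  # labels come out anyway; the score stays mediocre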

Let's Cluster Some Data!

Questions before we start?
