Clustering


Clustering is a form of unsupervised machine learning that groups data into separate categories based on the similarity of the data. There are hundreds of different ways to form clusters from data. One of the simplest is an algorithm called k-means clustering.

k-means Clustering

The k-means algorithm forms clusters by finding k groups, each with a center defined as the mean of the data in that group. To get started, we must specify how many clusters (k) we want. By default, sk-learn uses 8 clusters, but that number should be adjusted for the data we are clustering. The centers of these k clusters are called centroids. A centroid does not need to be one of the data points; it is just the average of all the data points that are part of the cluster.
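To make the idea of a centroid concrete, here is a minimal from-scratch sketch of k-means on one-dimensional data. It is an illustration only, not how sk-learn implements the algorithm: the function name `k_means_1d`, the sample points, and the starting centroids are all made up for this example.

```python
def mean(values):
    return sum(values) / len(values)

def k_means_1d(points, centroids, iterations=10):
    """Repeat two steps: assign each point to its nearest centroid,
    then move each centroid to the mean of the points assigned to it."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical data and starting centroids, chosen only for illustration.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids)  # [2.0, 11.5]
print(clusters)   # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0]]
```

In this sketch the second centroid, 11.5, is the mean of its cluster but is not one of the data points.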

Since each centroid is defined by the average (mean) of the data, all data must be numeric. This may limit the datasets we can use with k-means clustering, but many datasets that are not initially numeric can be converted into a numeric format.

In sk-learn, the k-means clustering model is sklearn.cluster.KMeans.

Clustering Example: Votes in Congress

During the 114th session of the United States Congress (2015 - 2017), the 100 senators held a total of 502 roll call votes that were recorded as part of the congressional record. A senator may vote either "Yes", "No", or "Abstain"; a senator may also be absent from the vote. To convert this data into a numeric format:

  • A "Yes" vote is added to the dataset as a 1,
  • A "No" vote is added to the dataset as a 0,
  • All other votes (abstain/absent) are added to the dataset as a 0.5.
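The encoding rules above can be sketched as a small function. The raw labels used here ("Yes", "No", "Abstain", "Absent") are assumed spellings for illustration; the actual source data may record votes differently.

```python
def encode_vote(vote):
    """Map a recorded vote to the numeric value described above."""
    if vote == "Yes":
        return 1.0
    if vote == "No":
        return 0.0
    return 0.5  # abstain or absent

# Hypothetical raw votes for one senator.
raw_votes = ["Yes", "No", "Abstain", "Absent", "Yes"]
encoded = [encode_vote(v) for v in raw_votes]
print(encoded)  # [1.0, 0.0, 0.5, 0.5, 1.0]
```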

Clustering with sk-learn

As with linear regression, we'll use the same four steps: (1) import the library, (2) initialize the model, (3) fit the data, (4) predict the outcome.

# Step 1: Import `sklearn.cluster.KMeans` 
from sklearn.cluster import KMeans

In the United States, there are two major political parties. We'll use k-means clustering to attempt to group the senators into two clusters that will hopefully correspond to the political party each senator belongs to:

# Step 2: Initialize the model
model = KMeans(2)   # k-means with two (2) clusters
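As an optional variation on Step 2, the number of clusters can be passed by keyword, which makes the code easier to read. k-means starts from randomly chosen centroids, so fixing `random_state` makes repeated runs reproducible. The values for `n_init` and `random_state` below are arbitrary choices for this sketch, not settings from the original example.

```python
from sklearn.cluster import KMeans

# Equivalent to KMeans(2), with the number of clusters named explicitly.
# n_init=10 and random_state=0 are assumed values for illustration.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
print(model.n_clusters)  # 2
```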

The "congressional votes dataset" contains 15 of the roll call votes held during the 114th Congress:

import pandas as pd
df = pd.read_csv("https://waf.cs.illinois.edu/discovery/congress.csv")
df
    name        party  state  vote1  vote2  vote3  vote4  vote5  vote6  vote7  vote8  vote9  vote10  vote11  vote12  vote13  vote14  vote15
0   Alexander   R      TN     0.0    1.0    1.0    1.0    1.0    0.0    0.0    1.0    1.0    1.0     0.0     0.0     0.0     0.0     0.0
1   Ayotte      R      NH     0.0    1.0    1.0    1.0    1.0    0.0    0.0    1.0    0.0    1.0     0.0     1.0     0.0     1.0     0.0
2   Baldwin     D      WI     1.0    0.0    0.0    1.0    0.0    1.0    0.0    1.0    0.0    0.0     1.0     1.0     0.0     1.0     1.0
3   Barrasso    R      WY     0.0    1.0    1.0    1.0    1.0    0.0    1.0    1.0    1.0    1.0     0.0     0.0     1.0     0.0     0.0
4   Bennet      D      CO     0.0    0.0    0.0    1.0    0.0    1.0    0.0    1.0    0.0    0.0     0.0     1.0     0.0     1.0     0.0
... ...         ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...     ...     ...     ...     ...     ...
95  Warner      D      VA     1.0    1.0    0.0    1.0    0.0    1.0    0.0    1.0    0.0    0.0     1.0     1.0     0.0     1.0     0.0
96  Warren      D      MA     1.0    0.0    0.0    1.0    0.0    1.0    0.0    1.0    0.0    0.0     1.0     1.0     0.0     1.0     1.0
97  Whitehouse  D      RI     1.0    0.0    0.0    1.0    0.0    1.0    0.0    1.0    0.0    0.0     1.0     1.0     0.0     1.0     1.0
98  Wicker      R      MS     0.0    1.0    1.0    1.0    1.0    0.0    1.0    0.0    1.0    1.0     0.0     0.0     1.0     0.0     0.0
99  Wyden       D      OR     1.0    0.0    0.0    1.0    0.0    1.0    0.0    1.0    0.0    0.0     1.0     1.0     0.0     1.0     1.0

We want to use only the votes to train our model, which are the columns "vote1"..."vote15":

# Step 3: Train the model with model.fit(...)
model = model.fit(df[["vote1", "vote2", "vote3", "vote4", "vote5", "vote6", "vote7", "vote8", "vote9", "vote10", "vote11", "vote12", "vote13", "vote14", "vote15"]])
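Typing all 15 column names by hand is error-prone. As a small convenience sketch, the same list can be generated with a list comprehension; the variable name `vote_columns` is introduced here only for illustration.

```python
# Build the names "vote1" through "vote15" instead of typing each one.
vote_columns = [f"vote{i}" for i in range(1, 16)]
print(vote_columns[0], vote_columns[-1], len(vote_columns))  # vote1 vote15 15
```

With the DataFrame loaded earlier, `df[vote_columns]` selects the same 15 columns as the explicit list.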

Finally, model.predict(...) will predict the cluster when provided the same 15 variables:

# Step 4: Predict the cluster with model.predict(...)
df["cluster"] = model.predict(df[["vote1", "vote2", "vote3", "vote4", "vote5", "vote6", "vote7", "vote8", "vote9", "vote10", "vote11", "vote12", "vote13", "vote14", "vote15"]])
df
    name        party  state  vote1  vote2  vote3  vote4  vote5  vote6  vote7  vote8  vote9  vote10  vote11  vote12  vote13  vote14  vote15  cluster
0   Alexander   R      TN     0.0    1.0    1.0    1.0    1.0    0.0    0.0    1.0    1.0    1.0     0.0     0.0     0.0     0.0     0.0     0
1   Ayotte      R      NH     0.0    1.0    1.0    1.0    1.0    0.0    0.0    1.0    0.0    1.0     0.0     1.0     0.0     1.0     0.0     0
2   Baldwin     D      WI     1.0    0.0    0.0    1.0    0.0    1.0    0.0    1.0    0.0    0.0     1.0     1.0     0.0     1.0     1.0     1
3   Barrasso    R      WY     0.0    1.0    1.0    1.0    1.0    0.0    1.0    1.0    1.0    1.0     0.0     0.0     1.0     0.0     0.0     0
4   Bennet      D      CO     0.0    0.0    0.0    1.0    0.0    1.0    0.0    1.0    0.0    0.0     0.0     1.0     0.0     1.0     0.0     1
... ...         ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...     ...     ...     ...     ...     ...     ...
95  Warner      D      VA     1.0    1.0    0.0    1.0    0.0    1.0    0.0    1.0    0.0    0.0     1.0     1.0     0.0     1.0     0.0     1
96  Warren      D      MA     1.0    0.0    0.0    1.0    0.0    1.0    0.0    1.0    0.0    0.0     1.0     1.0     0.0     1.0     1.0     1
97  Whitehouse  D      RI     1.0    0.0    0.0    1.0    0.0    1.0    0.0    1.0    0.0    0.0     1.0     1.0     0.0     1.0     1.0     1
98  Wicker      R      MS     0.0    1.0    1.0    1.0    1.0    0.0    1.0    0.0    1.0    1.0     0.0     0.0     1.0     0.0     0.0     0
99  Wyden       D      OR     1.0    0.0    0.0    1.0    0.0    1.0    0.0    1.0    0.0    0.0     1.0     1.0     0.0     1.0     1.0     1

In the cluster column (found at the far right), every member is assigned to cluster 0 or cluster 1. Judging from the rows shown, the model has used 0 for the "Republican party cluster" and 1 for the "Democratic party cluster". The cluster numbers themselves are arbitrary: k-means never sees the party column, so another run may swap the two labels.
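One way to check this observation is to count how often each party lands in each cluster. The sketch below does so for only the ten senators displayed in the output above, copying their party and cluster values by hand, so it says nothing about the other 90 rows.

```python
from collections import Counter

# (party, cluster) pairs for the ten senators shown in the output above.
shown_rows = [
    ("R", 0), ("R", 0), ("D", 1), ("R", 0), ("D", 1),  # rows 0-4
    ("D", 1), ("D", 1), ("D", 1), ("R", 0), ("D", 1),  # rows 95-99
]

counts = Counter(shown_rows)
for (party, cluster), n in sorted(counts.items()):
    print(f"party {party}, cluster {cluster}: {n}")
# party D, cluster 1: 6
# party R, cluster 0: 4
```

With the full DataFrame, `pd.crosstab(df["party"], df["cluster"])` would build the same kind of table for all 100 senators.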


Example Walk-Throughs with Worksheets

Video 1: k-means Clustering Examples

Video 2: k-means Clustering on New Datasets

Practice Questions

Q1: Which of the following is necessary for applying the k-means algorithm?
Q2: For every data point, the k-means algorithm assigns it to the centroid:
Q3: The k-means algorithm is a supervised learning algorithm.