Clustering

Clustering is a form of unsupervised machine learning that groups data into separate categories based on the similarity of the data. There are many different ways to form clusters. One of the simplest is an algorithm called k-means clustering.

k-means Clustering

The k-means algorithm partitions the data into k clusters, where the center of each cluster is the mean of the data points in that cluster. To get started, we must specify how many clusters (k) we want. By default, sk-learn uses 8 clusters, but that number should be adjusted for the data we are clustering. The centers of these k clusters are called the centroids. A centroid is not necessarily a data point; it is just the average of all the data points that are part of the cluster.
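To make the idea of a centroid concrete, here is a minimal sketch using NumPy. The points and the names `cluster_a`, `cluster_b`, and `new_point` are made up for illustration and are not part of the congressional dataset; the sketch assumes the points have already been split into two clusters.

```python
import numpy as np

# Hypothetical 2-D data, already split into two clusters for illustration.
cluster_a = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
cluster_b = np.array([[8.0, 9.0], [9.0, 8.0], [10.0, 10.0]])

# A centroid is the column-wise mean of the points in its cluster.
centroid_a = cluster_a.mean(axis=0)   # [2.0, 2.0]
centroid_b = cluster_b.mean(axis=0)   # [9.0, 9.0]
centroids = np.array([centroid_a, centroid_b])

# A new point belongs to the cluster whose centroid is closest
# (Euclidean distance).
new_point = np.array([4.0, 4.0])
distances = np.linalg.norm(centroids - new_point, axis=1)
nearest = int(np.argmin(distances))   # 0, i.e. cluster_a
```

Notice that the centroid of `cluster_a`, (2, 2), is not one of the three points in that cluster.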

Since each centroid is defined by the average (mean) of the data, all data must be numeric. This may limit the datasets we can use with k-means clustering, but many datasets that are not initially numeric can be converted into a numeric format.

In sk-learn, the k-means clustering model is sklearn.cluster.KMeans.
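As a quick preview of the model on a tiny example, the sketch below fits `KMeans` to four made-up points. The data is hypothetical, and `n_init` and `random_state` are set here only so that repeated runs give the same result.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: two tight groups of 2-D points.
points = np.array([[1.0, 1.0], [1.0, 2.0], [9.0, 9.0], [9.0, 10.0]])

# n_clusters is k; n_init and random_state are set only for repeatable results.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(points)

print(model.cluster_centers_)  # one row per centroid
print(model.labels_)           # cluster assigned to each training point
```

After fitting, `cluster_centers_` holds the centroids and `labels_` holds the cluster number given to each point. Which group ends up as cluster 0 and which as cluster 1 is not something we control.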

Clustering Example: Votes in Congress

During the 114th United States Congress (2015-2017), the 100 senators held a total of 502 roll call votes that were recorded as part of the congressional record. A senator may vote either "Yes", "No", or "Abstain"; a senator may also be absent from the vote. To convert this data into a numeric format:

A "Yes" vote is added to the dataset as a 1,
A "No" vote is added to the dataset as a 0,
All other votes (abstain/absent) are added to the dataset as a 0.5.
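The conversion rules above could be sketched with pandas as follows. The raw labels ("Yes", "No", "Abstain", "Absent") and the one-column DataFrame are assumptions for illustration; the dataset used below has already been converted.

```python
import pandas as pd

# Hypothetical raw votes; the labels and column name are assumptions.
raw = pd.DataFrame({"vote1": ["Yes", "No", "Abstain", "Absent"]})

encoding = {"Yes": 1.0, "No": 0.0}

# Anything not in the mapping (abstain/absent) becomes NaN, then 0.5.
numeric = raw["vote1"].map(encoding).fillna(0.5)
print(numeric.tolist())  # [1.0, 0.0, 0.5, 0.5]
```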

Clustering with sk-learn

Just as with linear regression, we'll use the same four steps: (1) import the library, (2) initialize the model, (3) fit the data, and (4) predict the outcome.

# Step 1: Import `sklearn.cluster.KMeans` 
from sklearn.cluster import KMeans

In the United States, there are two major political parties. We'll use k-means clustering to attempt to group the senators into two clusters that will hopefully be representative of the political party each senator is part of:

# Step 2: Initialize the model
model = KMeans(2)   # k-means with two (2) clusters

The "congressional votes dataset" contains 15 votes from the 114th Congress:

import pandas as pd
df = pd.read_csv("https://waf.cs.illinois.edu/discovery/congress.csv")
df

	name	party	state	vote1	vote2	vote3	vote4	vote5	vote6	vote7	vote8	vote9	vote10	vote11	vote12	vote13	vote14	vote15
0	Alexander	R	TN	0.0	1.0	1.0	1.0	1.0	0.0	0.0	1.0	1.0	1.0	0.0	0.0	0.0	0.0	0.0
1	Ayotte	R	NH	0.0	1.0	1.0	1.0	1.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0	0.0	1.0	0.0
2	Baldwin	D	WI	1.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0	0.0	0.0	1.0	1.0	0.0	1.0	1.0
3	Barrasso	R	WY	0.0	1.0	1.0	1.0	1.0	0.0	1.0	1.0	1.0	1.0	0.0	0.0	1.0	0.0	0.0
4	Bennet	D	CO	0.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	1.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
95	Warner	D	VA	1.0	1.0	0.0	1.0	0.0	1.0	0.0	1.0	0.0	0.0	1.0	1.0	0.0	1.0	0.0
96	Warren	D	MA	1.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0	0.0	0.0	1.0	1.0	0.0	1.0	1.0
97	Whitehouse	D	RI	1.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0	0.0	0.0	1.0	1.0	0.0	1.0	1.0
98	Wicker	R	MS	0.0	1.0	1.0	1.0	1.0	0.0	1.0	0.0	1.0	1.0	0.0	0.0	1.0	0.0	0.0
99	Wyden	D	OR	1.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0	0.0	0.0	1.0	1.0	0.0	1.0	1.0

We want to train our model using only the votes, which are the columns "vote1" through "vote15":

# Step 3: Train the model with model.fit(...)
model = model.fit(df[["vote1", "vote2", "vote3", "vote4", "vote5", "vote6", "vote7", "vote8", "vote9", "vote10", "vote11", "vote12", "vote13", "vote14", "vote15"]])
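Typing fifteen column names by hand is easy to get wrong. As one possible shortcut, the list can be built with a list comprehension. This sketch assumes the columns follow the vote1 through vote15 naming pattern shown in the table above; `vote_columns` is a name introduced here for illustration.

```python
# Build ["vote1", "vote2", ..., "vote15"]; assumes the column naming
# pattern shown in the table above.
vote_columns = [f"vote{i}" for i in range(1, 16)]
print(vote_columns[0], vote_columns[-1], len(vote_columns))  # vote1 vote15 15
```

With this list, `df[vote_columns]` selects the same 15 columns as the longer expression.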

Finally, model.predict(...) will predict the cluster when provided the same 15 variables:

# Step 4: Predict the cluster with model.predict(...)
df["cluster"] = model.predict(df[["vote1", "vote2", "vote3", "vote4", "vote5", "vote6", "vote7", "vote8", "vote9", "vote10", "vote11", "vote12", "vote13", "vote14", "vote15"]])

df

	name	party	state	vote1	vote2	vote3	vote4	vote5	vote6	vote7	vote8	vote9	vote10	vote11	vote12	vote13	vote14	vote15	cluster
0	Alexander	R	TN	0.0	1.0	1.0	1.0	1.0	0.0	0.0	1.0	1.0	1.0	0.0	0.0	0.0	0.0	0.0	0
1	Ayotte	R	NH	0.0	1.0	1.0	1.0	1.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0	0.0	1.0	0.0	0
2	Baldwin	D	WI	1.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0	0.0	0.0	1.0	1.0	0.0	1.0	1.0	1
3	Barrasso	R	WY	0.0	1.0	1.0	1.0	1.0	0.0	1.0	1.0	1.0	1.0	0.0	0.0	1.0	0.0	0.0	0
4	Bennet	D	CO	0.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	1.0	0.0	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
95	Warner	D	VA	1.0	1.0	0.0	1.0	0.0	1.0	0.0	1.0	0.0	0.0	1.0	1.0	0.0	1.0	0.0	1
96	Warren	D	MA	1.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0	0.0	0.0	1.0	1.0	0.0	1.0	1.0	1
97	Whitehouse	D	RI	1.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0	0.0	0.0	1.0	1.0	0.0	1.0	1.0	1
98	Wicker	R	MS	0.0	1.0	1.0	1.0	1.0	0.0	1.0	0.0	1.0	1.0	0.0	0.0	1.0	0.0	0.0	0
99	Wyden	D	OR	1.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0	0.0	0.0	1.0	1.0	0.0	1.0	1.0	1

In the cluster column (found at the far right), every senator is assigned to cluster 0 or 1. Judging from the rows shown, the model has used 0 for a "Republican Party cluster" and 1 for a "Democratic Party cluster". The label numbers themselves are arbitrary, so another run could swap them.
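Rather than reading a few rows by eye, one way to compare clusters with parties is a cross-tabulation. The sketch below uses a small made-up DataFrame named `results` as a stand-in for the `party` and `cluster` columns of `df`; the counts are illustrative and are not results from the real dataset.

```python
import pandas as pd

# Hypothetical results, standing in for the party and cluster columns of df.
results = pd.DataFrame({
    "party":   ["R", "R", "D", "D", "D", "R"],
    "cluster": [0,   0,   1,   1,   1,   0],
})

# Rows are parties, columns are cluster labels, cells are counts of senators.
table = pd.crosstab(results["party"], results["cluster"])
print(table)
```

On the real data, the same call with `df["party"]` and `df["cluster"]` would show how many senators of each party landed in each cluster.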