Clustering
Clustering is a form of unsupervised machine learning that groups data into separate categories based on the similarity of the data. There are hundreds of different ways to form clusters with data. One of the simplest is an algorithm called k-means clustering.
k-means Clustering
The k-means algorithm forms clusters by finding k groups, where the center of each group is the mean of the data in that cluster. To get started, we must specify how many clusters (k) we want. By default, scikit-learn uses 8 clusters, but that number should be adjusted for the data we are clustering. The center of each of these k clusters is called a centroid. A centroid is usually not one of the data points; it is the average of all the data points that are part of the cluster.
Since each centroid is defined by the average (mean) of the data, all data must be numeric. This may limit the datasets we can use with k-means clustering, but many datasets that are not initially numeric can be converted into a numeric format.
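To make the idea of a centroid concrete, here is a minimal NumPy sketch of the two steps k-means repeats: assign every point to its nearest centroid, then move each centroid to the mean of its points. The function name kmeans_sketch, its parameters, and the toy data are assumptions made for illustration; this is not how scikit-learn implements the algorithm.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=10, seed=0):
    """Toy k-means: alternate between assigning points and updating centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Each point joins the cluster of its nearest centroid.
        labels = distances.argmin(axis=1)
        # Each centroid becomes the mean of the points assigned to it.
        # (An empty cluster keeps its previous centroid.)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Hypothetical 2-D data: two well-separated groups of three points each.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
print(centroids)
```

Neither final centroid in this toy data coincides with an actual data point, which is the sense in which a centroid is an average rather than an observation.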
The scikit-learn k-means clustering model is sklearn.cluster.KMeans.
Clustering Example: Votes in Congress
During the 114th session of the United States Congress (2015 - 2017), the 100 senators held a total of 502 roll call votes that were recorded as part of the congressional record. A senator may vote either "Yes", "No", or "Abstain"; a senator may also be absent from the vote. To convert this data into a numeric format:
- A "Yes" vote is added to the dataset as a 1,
- A "No" vote is added to the dataset as a 0,
- All other votes (abstain/absent) are added to the dataset as a 0.5.
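The encoding rules above could be applied with pandas along the following lines. This is only a sketch: the raw vote strings and the variable names are assumptions, since the dataset used later on this page is already numeric.

```python
import pandas as pd

# Hypothetical raw votes for illustration; the exact labels are assumptions.
raw_votes = pd.Series(["Yes", "No", "Abstain", "Absent", "Yes"])

# "Yes" -> 1 and "No" -> 0; any other value is left unmapped (NaN)
# and is then filled in with 0.5.
numeric_votes = raw_votes.map({"Yes": 1.0, "No": 0.0}).fillna(0.5)
print(numeric_votes.tolist())  # [1.0, 0.0, 0.5, 0.5, 1.0]
```

Placing the "other" votes at 0.5 puts them halfway between a "Yes" and a "No", so they pull a centroid toward neither side.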
Clustering with scikit-learn
As with linear regression, we'll follow the same four steps: (1) import the library, (2) initialize the model, (3) fit the data, and (4) predict the outcome.
# Step 1: Import `sklearn.cluster.KMeans`
from sklearn.cluster import KMeans
In the United States, there are two major political parties. We'll use k-means clustering to attempt to group the senators into two clusters that will hopefully be representative of the political party each senator belongs to:
# Step 2: Initialize the model
model = KMeans(2) # k-means with two (2) clusters
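k-means starts from randomly chosen centroids, so repeated runs can number the clusters differently. If reproducible output is wanted, one option is to name the arguments explicitly and fix the random seed. The sketch below uses the real KMeans parameters n_clusters, n_init, and random_state; the toy array X and the chosen values are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two obvious groups of 0/1 "votes".
X = np.array([[0, 1, 1], [0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)

# n_clusters is k; n_init is how many random starts to try;
# random_state fixes the seed so the result is repeatable.
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0)

# fit_predict fits the model and returns each row's cluster label.
toy_labels = toy_model.fit_predict(X)
print(toy_labels)
print(toy_model.cluster_centers_)
```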
The "congressional votes dataset" contains 15 of the votes held during the 114th Congress:
import pandas as pd
df = pd.read_csv("https://waf.cs.illinois.edu/discovery/congress.csv")
df
| | name | party | state | vote1 | vote2 | vote3 | vote4 | vote5 | vote6 | vote7 | vote8 | vote9 | vote10 | vote11 | vote12 | vote13 | vote14 | vote15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alexander | R | TN | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | Ayotte | R | NH | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
2 | Baldwin | D | WI | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
3 | Barrasso | R | WY | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4 | Bennet | D | CO | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | Warner | D | VA | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
96 | Warren | D | MA | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
97 | Whitehouse | D | RI | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
98 | Wicker | R | MS | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
99 | Wyden | D | OR | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
We want to use only the votes to train our model, which are the columns "vote1" through "vote15":
# Step 3: Train the model with model.fit(...)
model = model.fit(df[["vote1", "vote2", "vote3", "vote4", "vote5", "vote6", "vote7", "vote8", "vote9", "vote10", "vote11", "vote12", "vote13", "vote14", "vote15"]])
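Typing all fifteen column names is error-prone, so the list can also be generated. The sketch below assumes the columns follow the vote1 through vote15 naming shown in the table; vote_columns, sample, and features are hypothetical names, and the one-row DataFrame is only a stand-in for the real data.

```python
import pandas as pd

# Build ["vote1", "vote2", ..., "vote15"] instead of typing every name.
vote_columns = [f"vote{i}" for i in range(1, 16)]

# Tiny stand-in DataFrame with the same column layout (placeholder values).
sample = pd.DataFrame([[0.0] * 15], columns=vote_columns)
sample.insert(0, "name", ["Example"])

# Selecting with the list keeps only the numeric vote columns.
features = sample[vote_columns]
print(features.shape)  # (1, 15)
```

With the real DataFrame, the same list could be passed to both model.fit(df[vote_columns]) and model.predict(df[vote_columns]).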
Finally, model.predict(...) will predict the cluster for each senator when provided the same 15 variables:
df["cluster"] = model.predict(df[["vote1", "vote2", "vote3", "vote4", "vote5", "vote6", "vote7", "vote8", "vote9", "vote10", "vote11", "vote12", "vote13", "vote14", "vote15"]])
df
| | name | party | state | vote1 | vote2 | vote3 | vote4 | vote5 | vote6 | vote7 | vote8 | vote9 | vote10 | vote11 | vote12 | vote13 | vote14 | vote15 | cluster |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alexander | R | TN | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
1 | Ayotte | R | NH | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0 |
2 | Baldwin | D | WI | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1 |
3 | Barrasso | R | WY | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0 |
4 | Bennet | D | CO | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | Warner | D | VA | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1 |
96 | Warren | D | MA | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1 |
97 | Whitehouse | D | RI | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1 |
98 | Wicker | R | MS | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0 |
99 | Wyden | D | OR | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1 |
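In the cluster column (found at the far right), every member is assigned to cluster 0 or 1. From observing the first few rows, the model has used 0 to identify members in the "Republican Party cluster" and 1 to identify members in the "Democratic Party cluster". The cluster numbers themselves are arbitrary labels, so a different run may swap which number goes with which party.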
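Rather than reading rows by eye, one way to check how well the clusters line up with party is a cross-tabulation. The sketch below rebuilds only the ten rows displayed in the output above (name, party, and cluster), so its counts describe those ten senators and not the full dataset; shown and table are hypothetical variable names.

```python
import pandas as pd

# The ten rows visible in the output table above.
shown = pd.DataFrame({
    "name": ["Alexander", "Ayotte", "Baldwin", "Barrasso", "Bennet",
             "Warner", "Warren", "Whitehouse", "Wicker", "Wyden"],
    "party": ["R", "R", "D", "R", "D", "D", "D", "D", "R", "D"],
    "cluster": [0, 0, 1, 0, 1, 1, 1, 1, 0, 1],
})

# Count how many senators of each party landed in each cluster.
table = pd.crosstab(shown["party"], shown["cluster"])
print(table)
```

On the full DataFrame, the same call would be pd.crosstab(df["party"], df["cluster"]); any senators counted off the main diagonal are the ones whose voting pattern looked more like the other party's.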
Example Walk-Throughs with Worksheets
Video 1: k-means Clustering Examples
Video 2: k-means Clustering on New Datasets
Practice Questions
Q1: Which of the following is necessary for applying the k-means algorithm?
Q2: For every data point, the k-means algorithm assigns it to the centroid:
Q3: The k-means algorithm is a supervised learning algorithm.