This post provides an R code-heavy, math-light introduction to selecting the \(k\) in k-means. It presents the main idea of kmeans, demonstrates how to fit kmeans in R, describes some components of the kmeans fit, and displays some methods for selecting k. In addition, the post provides some helpful functions which may make fitting kmeans a bit easier.
kmeans clustering is an example of unsupervised learning, where we do not have an output we’re explicitly trying to predict. We may have reasons to believe that there are latent groups within a dataset, so a clustering method can be a useful way to explore and describe pockets of similar observations within a dataset.
## The Algorithm
Here’s basically what kmeans (the algorithm) does (taken from K-means Clustering):
- Selects K centroids (K rows chosen at random)
- Assigns each data point to its closest centroid
- Recalculates the centroids as the average of all data points in a cluster (i.e., the centroids are p-length mean vectors, where p is the number of variables)
- Assigns data points to their closest centroids
- Continues steps 3 and 4 until the observations are not reassigned or the maximum number of iterations (R uses 10 as a default) is reached.
Here it is in gif form (taken from k-means clustering in a GIF):
## A balls and urns explanation
As a statistician, I have a hard time avoiding balls and urns when describing statistical concepts.
Suppose we have \(n\) balls, and each ball has \(p\) characteristics, like \(shape\), \(size\), \(density\), \(\ldots\), and we want to put those \(n\) balls into \(k\) urns (clusters) according to the \(p\) characteristics.
First, we randomly select \(k\) balls (\(balls_{init}\)), and assign the rest of the balls (\(n-k\)) to whichever \(balls_{init}\) it is closest to. After this first assignment, we calculate the centroid of each (\(k\)) collection of balls. The centroids are the averages of the \(p\) characteristics of the balls in each cluster. So, for each cluster, there will be a vector of length \(p\) with the means of the characteristics of the balls in that cluster.
After the calculation of the centroid, we then calculate (for each ball) the distances between its \(p\) characteristics and the centroids for each cluster. We assign the ball to the cluster with the centroid it is closest to. Then, we recalculate the centroids and repeat the process. We leave the number of clusters (\(k\)) fixed, but we allow the balls to move between the clusters, depending on which cluster they are closest to.
Either the algorithm will “converge” and between time \(t\) and \(t+1\) no reassignments will occur, or we’ll reach the maximum number of iterations allowed by the algorithm.
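To make those steps concrete, here is a minimal, from-scratch sketch of the procedure just described. This is purely illustrative: the function naive_kmeans and its arguments are invented for this post, it does not handle clusters that become empty, and in practice you should use the built-in kmeans function shown below.
naive_kmeans <- function(x, k, max_iter = 10) {
  x <- as.matrix(x)
  # step 1: choose k rows at random as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0, nrow(x))
  for (iter in seq_len(max_iter)) {
    # steps 2 and 4: assign each observation to its closest centroid (squared Euclidean distance)
    d2 <- sapply(seq_len(k), function(j) colSums((t(x) - centroids[j, ])^2))
    new_assignment <- max.col(-d2)
    # stop if nothing was reassigned
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # step 3: recompute each centroid as the p-length mean vector of its cluster
    for (j in seq_len(k)) {
      centroids[j, ] <- colMeans(x[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centroids)
}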
## kmeans in R
Here's how we use the kmeans function in R:
kmeans(x, centers, iter.max, nstart)
### arguments
- x is our data
- centers is the k in kmeans
- iter.max controls the maximum number of iterations; if the algorithm has not converged, it's good to bump this number up
- nstart controls the number of initial configurations (step 1 in the algorithm); bumping this number up is a good idea, since kmeans tends to be sensitive to initial conditions (which may remind you of sensitivity to initial conditions in chaos theory)
### values it returns
kmeans returns an object of class "kmeans" which has a print and a fitted method. It is a list with at least the following components:
- cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
- centers: A matrix of cluster centers; these are the centroids for each cluster.
- totss: The total sum of squares.
- withinss: Vector of within-cluster sum of squares, one component per cluster.
- tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
- betweenss: The between-cluster sum of squares, i.e. totss - tot.withinss.
- size: The number of points in each cluster.
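As a quick illustration of these components, here's a toy example (the built-in mtcars data and the call to scale are stand-ins chosen purely for this illustration; they are not used elsewhere in the post):
toy_fit <- kmeans(x = scale(mtcars), centers = 3, nstart = 20, iter.max = 25)
toy_fit$size          # number of observations in each cluster
toy_fit$centers       # 3 x p matrix of centroids (on the scaled variables)
toy_fit$tot.withinss  # total within-cluster sum of squares
toy_fit$betweenss + toy_fit$tot.withinss  # adds up to toy_fit$totss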
To use kmeans, we first need to specify the k. How should we do this?
## Data
For this post, we’ll be using the Boston housing data set. This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, MA.
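The b_housing object used throughout the post isn’t constructed on-screen; here is a minimal sketch of one way to recreate it, assuming it is the Boston data from the MASS package with the binary chas indicator dropped (which matches the columns printed below):
# assumption: b_housing is MASS::Boston without the binary Charles River indicator
library(MASS)
b_housing <- Boston[, names(Boston) != 'chas']
head(b_housing)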
## crim zn indus nox rm age dis rad tax ptratio black lstat medv
## 1 0.00632 18 2.31 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
## 2 0.02731 0 7.07 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
## 3 0.02729 0 7.07 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
## 4 0.03237 0 2.18 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
## 5 0.06905 0 2.18 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
## 6 0.02985 0 2.18 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
## Visualize within Cluster Error
Use a scree plot to visualize the reduction in within-cluster error:
ss_kmeans <- t(sapply(2:14,
FUN = function(k)
kmeans(x = b_housing,
centers = k,
nstart = 20,
iter.max = 25)[c('tot.withinss','betweenss')]))
plot(2:14, unlist(ss_kmeans[,1]), xlab = 'Clusters', ylab = 'Within Cluster SSE')
When we look at the scree plot, we’re looking for the “elbow”. We can see the SSE dropping, but at some point the rapid decline levels off. At what number of clusters does it stop dropping abruptly?
Stated more verbosely from Wikipedia:
The elbow method looks at the percentage of variance explained as a function of the number of clusters: One should choose a number of clusters so that adding another cluster doesn’t give much better modeling of the data. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the “elbow criterion”. This “elbow” cannot always be unambiguously identified. Percentage of variance explained is the ratio of the between-group variance to the total variance, also known as an F-test. A slight variation of this method plots the curvature of the within group variance.
We can get the percentage of variance explained by typing:
tot.ss <- sum(apply(b_housing, 2, var)) * (nrow(b_housing) - 1)
var_explained <- unlist(ss_kmeans[,2]) / tot.ss
plot(2:14, var_explained, xlab = 'Clusters', ylab = '% of Total Variation Explained')
Where does the elbow occur in the above plot? That’s pretty subjective (a common theme in unsupervised learning), but for our task we would prefer to have \(\leq 10\) clusters, probably.
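If you want something slightly more mechanical than eyeballing the plot, one rough heuristic (not part of the original analysis) is to pick the k with the largest second difference in the within-cluster SSE computed above, i.e. the sharpest bend in the curve:
wss <- unlist(ss_kmeans[, 1])  # within-cluster SSE for k = 2, ..., 14
# second differences approximate curvature; the largest one is a crude 'elbow'
elbow <- (2:14)[which.max(diff(diff(wss))) + 1]
elbow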
## Visualize AIC
We could also opt for the AIC, which basically looks at how well the clusters fit the data, while also penalizing the number of clusters in the final fit. The general rule with AIC is that lower values are better.
First, we define a function which calculates the AIC from the output of kmeans.
kmeansAIC <- function(fit){
  m = ncol(fit$centers)  # number of variables
  k = nrow(fit$centers)  # number of clusters
  D = fit$tot.withinss   # total within-cluster sum of squares
  return(D + 2*m*k)      # AIC-style penalty: 2 * (number of centroid coordinates, m*k)
}
aic_k <- sapply(2:14, FUN =
function(k)
kmeansAIC(kmeans(b_housing, centers = k, nstart = 20, iter.max = 25)))
plot(2:14, aic_k, xlab = 'Clusters', ylab = 'AIC from kmeans')
Look familiar? It is remarkably similar to looking at the SSE. This is because the main component in calculating AIC is the within-cluster sum of squared errors. Once again, we’re looking for an elbow in the plot, indicating that the decrease in AIC is not happening so rapidly.
## Visualize BIC
BIC is related to AIC in that BIC is AIC’s conservative cousin. When we evaluate models using BIC rather than AIC as our metric, we tend to select smaller models. The calculation of BIC is rather similar to that of AIC (we replace the 2 in the AIC calculation with log(n)):
kmeansBIC <- function(fit){
m = ncol(fit$centers)
n = length(fit$cluster)
k = nrow(fit$centers)
D = fit$tot.withinss
return(D + log(n) * m * k) # using log(n) instead of 2, penalize model complexity
}
bic_k <- sapply(2:14, FUN =
function(k)
kmeansBIC(kmeans(b_housing, centers = k, nstart = 20, iter.max = 25)))
plot(2:14, bic_k, xlab = 'Clusters', ylab = 'BIC from kmeans')
Once again, the plots are rather similar for this toy example.
## Wrap it all together in kmeans2
We can wrap all the previous parts together in a function to get a broad look at the fit of kmeans.
We’ll fit kmeans across a range of centers (center_range). Using the results from these fits, we’ll look at AIC, BIC, within-cluster variation, and the % of total variation explained. We can choose to spit out a table to the user (plot = FALSE) or plot each of the four metrics by the number of clusters (plot = TRUE).
kmeans2 <- function(data, center_range, iter.max, nstart, plot = TRUE){
#fit kmeans for each center
all_kmeans <- lapply(center_range,
FUN = function(k)
kmeans(data, centers = k, iter.max = iter.max, nstart = nstart))
#extract AIC from each
all_aic <- sapply(all_kmeans, kmeansAIC)
#extract BIC from each
all_bic <- sapply(all_kmeans, kmeansBIC)
#extract tot.withinss
all_wss <- sapply(all_kmeans, FUN = function(fit) fit$tot.withinss)
#extract between ss
btwn_ss <- sapply(all_kmeans, FUN = function(fit) fit$betweenss)
#extract total sum of squares
tot_ss <- all_kmeans[[1]]$totss
#put in data.frame
clust_res <-
data.frame('Clusters' = center_range,
'AIC' = all_aic,
'BIC' = all_bic,
'WSS' = all_wss,
'BSS' = btwn_ss,
'TSS' = tot_ss)
#plot or no plot?
if(plot){
par(mfrow = c(2,2))
with(clust_res,{
plot(Clusters, AIC)
plot(Clusters, BIC)
plot(Clusters, WSS, ylab = 'Within Cluster SSE')
plot(Clusters, BSS / TSS, ylab = 'Prop of Var. Explained')
})
}else{
return(clust_res)
}
}
kmeans2(data = b_housing, center_range = 2:15, iter.max = 20, nstart = 25)
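If you’d rather inspect the numbers than the plots, the same call with plot = FALSE returns the underlying table:
clust_table <- kmeans2(data = b_housing, center_range = 2:15,
                       iter.max = 20, nstart = 25, plot = FALSE)
head(clust_table)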
## Use a package to determine k
This is R after all, so surely there must be at least one package to help in determining the “best” number of clusters. NbClust is a viable option.
library(NbClust)
best.clust <- NbClust(data = b_housing,
min.nc = 2,
max.nc = 15,
method = 'kmeans')
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 7 proposed 2 as the best number of clusters
## * 2 proposed 3 as the best number of clusters
## * 12 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 4
##
##
## *******************************************************************
NbClust returns a big object with some information that may or may not be useful for this use case (I stored the rest of the output in best.clust, but the package still spit out a bunch of stuff). But it does tell you the best number of clusters as selected by a slew of indices. This function must iterate through all the possible clusters from min.nc to max.nc, so it may not be very quick, but it does give another way of selecting the number of clusters.
You may want to find a reasonable range for min.nc and max.nc before resorting to the NbClust function. If you know that 3 clusters won’t be enough, don’t make NbClust even consider it as an option.
There’s also an argument called index in the NbClust function. This value controls which indices are used to determine the best number of clusters. The calculation methods differ between indices, and if your data isn’t so nice (e.g. variables with few unique values), the function may fail. The default value is all, which is a collection of 30 (!) indices all used to help determine the best number of clusters.
It may be helpful to try different indices such as tracew, kl, dindex, or duda. Unfortunately, you’ll need to specify only one index for each NbClust call (unless you use index = 'all' or index = 'alllong').
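For instance, a single-index call might look like the sketch below (illustrative only; it simply swaps index = 'all' for one of the indices named above):
kl.clust <- NbClust(data = b_housing,
                    min.nc = 2,
                    max.nc = 15,
                    method = 'kmeans',
                    index = 'kl')
kl.clust$Best.nc  # number of clusters proposed by the KL index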
For more details look at the function’s documentation.
## Visualizing the centroids
This function helps to visualize the centroids for each cluster, which can aid in interpreting the clusters.
The arguments for this function are fit, levels, and show_N:
- fit: object returned from a call to kmeans
- levels: a character vector representing the levels of the variables in the data set used to fit kmeans; this vector allows a user to control the order in which variables are plotted
- show_N: a logical value; if TRUE, the plot will contain information about the size of each cluster, and if FALSE, a table of counts will be printed prior to the plot
To use the levels argument, the character vector you supply must have the same number of elements as the number of unique variables in the data set used to fit the kmeans. If you specify levels = c('a','b','c'), the plotting device will display (from top to bottom) 'c','b','a'. If you are not satisfied with the plotting order, the rev function may come in handy.
kmeans_viz <- function(fit, levels = NULL, show_N = TRUE){
require(ggplot2)
require(dplyr)
#extract number of clusters
clusts <- length(unique(fit$cluster))
#centroids
kmeans.table <- as.data.frame(t(fit$centers), stringsAsFactors = FALSE)
#variable names
kmeans.table$Variable <- row.names(kmeans.table)
#name clusters
names(kmeans.table)[1:clusts] <- paste0('cluster', 1:clusts)
#reshape from wide table to long (makes plotting easier)
kmeans.table <- reshape(kmeans.table, direction = 'long',
idvar = 'Variable',
varying = paste0('cluster', 1:clusts),
v.names = 'cluster')
#number of observations in each cluster
#should we show N in the graph or just print it?
if(show_N){
#show it in the graph
kmeans.table$time <- paste0(kmeans.table$time,
' (N = ',
fit$size[kmeans.table$time],
')')
}else{
#just print it
print(rbind('Cluster' = 1:clusts,
'N' = fit$size))
}
#standardize the cluster means to make a nice plot
kmeans.table %>%
group_by(Variable) %>%
mutate(cluster_stdzd = (cluster - mean(cluster)) / sd(cluster)) -> kmeans.table
#did user specify a variable levels vector?
if(length(levels) == length(unique(kmeans.table$Variable))){
kmeans.table$Variable <- factor(kmeans.table$Variable, levels = levels)
}
#make the plot
ggplot(kmeans.table, aes(x = Variable, y = time))+
geom_tile(colour = 'black', aes(fill = cluster_stdzd))+
geom_text(aes(label = round(cluster,2)))+
coord_flip()+
xlab('')+ylab('Cluster')+
scale_fill_gradient(low = 'white', high = 'grey60')+
theme_bw()+
theme(legend.position = 'none',
axis.title.y = element_blank(),
axis.title.x = element_text(size = 16),
panel.grid = element_blank(),
axis.text = element_text(size = 14),
axis.ticks = element_blank())
}
opt.kmeans <- kmeans(b_housing, centers = 4, nstart = 50, iter.max = 50)
kmeans_viz(opt.kmeans)
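A second, purely illustrative call showing the optional arguments (the ordering vector here is just the column names of the data, reversed):
kmeans_viz(opt.kmeans,
           levels = rev(names(b_housing)),  # control the top-to-bottom plotting order
           show_N = FALSE)                  # print the cluster sizes instead of annotating the plot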
## Using kmeans to predict
We can predict cluster membership using a few techniques. For the simple plug-and-play method, we can use the cl_predict function from the clue package. For those interested in a more manual approach, we can calculate the centroid distances for the new data and select whichever cluster is the shortest distance away.
I will demonstrate both techniques.
First, we’re going to select a subset of the Boston dataset to fit a kmeans on. Using the result of the kmeans fit on b_hous.train, we’ll try to predict the clusters for a “new” dataset, b_hous.test.
#rows to select
set.seed(123)
train_samps <- sample(nrow(b_housing), .7 * nrow(b_housing), replace = F)
#create training and testing set
b_hous.train <- b_housing[train_samps,]
b_hous.test <- b_housing[-train_samps,]
#fit our new kmeans
train.kmeans <- kmeans(b_hous.train, centers = 4, nstart = 50, iter.max = 50)
### Use cl_predict
The interface for getting the predicted values is fairly simple.
We’re going to use system.time to time how long it takes R to do what we want it to.
library(clue)
system.time(
test_clusters.clue <- cl_predict(object = train.kmeans, newdata = b_hous.test)
)
## user system elapsed
## 0.02 0.00 0.02
table(test_clusters.clue)
## test_clusters.clue
## 1 2 3 4
## 84 15 27 26
### By hand
Taken from this nice CrossValidated solution.
clusters <- function(x, centers) {
# compute squared euclidean distance from each sample to each cluster center
tmp <- sapply(seq_len(nrow(x)),
function(i) apply(centers, 1,
function(v) sum((x[i, ]-v)^2)))
max.col(-t(tmp)) # find index of min distance
}
system.time(
test_clusters.hand <- clusters(x = b_hous.test, centers = train.kmeans$centers)
)
## user system elapsed
## 1.81 0.00 1.81
table(test_clusters.hand)
## test_clusters.hand
## 1 2 3 4
## 84 15 27 26
all(test_clusters.hand == test_clusters.clue) #TRUE
## [1] TRUE
We see that clusters is slower than cl_predict, but they return the same result. It would be prudent to use cl_predict.
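If the observation-by-observation sapply loop bothers you, the same squared distances can be computed with a few matrix operations. Here’s a sketch of a vectorized alternative (not from the original post) that should return the same assignments:
clusters_vec <- function(x, centers) {
  x <- as.matrix(x)
  centers <- as.matrix(centers)
  # squared distances via the expansion ||x - c||^2 = ||x||^2 - 2*x.c + ||c||^2
  d2 <- matrix(rowSums(x^2), nrow(x), nrow(centers)) -
    2 * x %*% t(centers) +
    matrix(rowSums(centers^2), nrow(x), nrow(centers), byrow = TRUE)
  max.col(-d2)  # index of the nearest centroid for each row
}
table(clusters_vec(b_hous.test, train.kmeans$centers))  # should match the tables above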
## Wrapping Up
In this post I walked through the kmeans algorithm and its implementation in R. Additionally, I discussed some of the ways to select the k in kmeans. The process of selecting and evaluating choices of k will vary from project to project and depend strongly on the goals of an analysis.
It is worth noting that one of the drawbacks of kmeans clustering is that it must put every observation into a cluster. There may be anomalies or outliers present in a dataset, so it may not always make sense to enforce the condition that each observation is assigned to a cluster. A different unsupervised learning technique, such as dbscan (density-based spatial clustering of applications with noise) may be more appropriate for tasks in which anomaly detection is necessary. I hope to explore this technique in a future post. In the meantime, here’s an example of Netflix applying dbscan for anomaly detection.