K-Means, Gaussian Mixtures, and the Expectation-Maximization Algorithm

K-Means Algorithm

The k-Means algorithm is a specific example of the Expectation-Maximization (EM) algorithm. In k-Means, there are two main steps:

  1. Expectation step - we freeze the current cluster centers or “prototypes” and assign each datapoint to the cluster whose center is closest, which minimizes the sum of squared distances between each datapoint and the center of its assigned cluster

  2. Maximization step - we freeze the cluster memberships and update each cluster center to the mean of the datapoints assigned to it

We iterate between these two steps until the cluster memberships no longer change.

We can think of k-Means as a “hard clustering” algorithm where each datapoint belongs to exactly one cluster.
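To make these two steps concrete, here is a minimal NumPy sketch of the procedure. It is not the implementation used for the figures below; the function kmeans and its arguments are placeholder names.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-Means via alternating E and M steps."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen datapoints as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Expectation step: assign each datapoint to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Maximization step: move each center to the mean of its assigned datapoints.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers (and hence the memberships) no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```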

We show an example of using the k-Means algorithm for image compression, where we reduce the number of colors in an image. In this case, given a color image, our goal is to learn an encoding scheme that reduces the image to only 10 colors.

We show how, initially, all 10 colors start with the same (0.5, 0.5, 0.5) RGB value, but k-Means quickly figures out how to adjust the clusters into distinct colors. This allows us to produce an image that looks similar to the original but uses far fewer colors.
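The compression step itself can be sketched with scikit-learn's KMeans. As a stand-in for the image used in the figures, this sketch loads one of scikit-learn's bundled sample images, and it uses the library's default initialization rather than the all-(0.5, 0.5, 0.5) start described above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image

# Load a sample image and flatten it into a list of RGB pixels in [0, 1].
img = load_sample_image("china.jpg") / 255.0
pixels = img.reshape(-1, 3)

# Learn 10 prototype colors.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(pixels)

# Replace each pixel with the center of its assigned cluster.
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(img.shape)
```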

In the animation below, we see how this happens using the EM algorithm we discussed above. We show how the compressed image changes over time and how the different clusters and cluster centers (red stars) move over time to optimize the colors.


Gaussian Mixtures

There is a softer and more general counterpart to this algorithm, the Gaussian mixture model, which gives a “soft membership”: each datapoint can be a member of more than one cluster with some probability.

This too is learned via the EM algorithm above. In this case, the two steps are:

  1. Expectation step - Freeze the means, covariances, and mixture coefficients of the k Gaussian components. Compute the degree of membership of each datapoint in each cluster.

  2. Maximization step - Freeze the degrees of membership. Update the means, covariances, and mixture coefficients via maximum likelihood estimation of the Gaussian mixture model.

We iterate between these two steps until convergence.
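A minimal NumPy/SciPy sketch of these two steps might look as follows; the function gmm_em and its arguments are placeholder names, and a real implementation would also check the log-likelihood for convergence.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iters=100, seed=0, reg=1e-6):
    """Minimal EM for a Gaussian mixture with full covariances."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    means = X[rng.choice(n, size=k, replace=False)]   # initial means
    covs = np.stack([np.eye(d) for _ in range(k)])    # initial covariances
    weights = np.full(k, 1.0 / k)                     # initial mixture coefficients
    for _ in range(n_iters):
        # Expectation step: responsibilities resp[i, j] = p(component j | datapoint i).
        dens = np.column_stack([
            w * multivariate_normal.pdf(X, mean=m, cov=c)
            for w, m, c in zip(weights, means, covs)
        ])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # Maximization step: maximum-likelihood updates given the responsibilities.
        Nk = resp.sum(axis=0)
        weights = Nk / n
        means = (resp.T @ X) / Nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / Nk[j] + reg * np.eye(d)
    return weights, means, covs, resp
```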

We show an example of using the Gaussian mixture algorithm to learn 10 Gaussian distributions that could plausibly give rise to our given image. In contrast to k-Means compression, with Gaussian mixtures we can produce not only the 10 colors themselves but also combinations of these 10 colors.

In this case, we initialize the model with the clusters from the K-means algorithm above. Initializing at random can lead to instability and poor results.

The Gaussian mixture model will now figure out how to adjust the parameters (means, covariances, and mixture coefficients) that are used to combine the 10 colors into different combinations.
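With scikit-learn, this step could be sketched as below, reusing the pixels, img, and kmeans variables from the earlier k-Means snippet. We pass the k-Means centers as the initial means and then render each pixel as a responsibility-weighted combination of the 10 mean colors.

```python
from sklearn.mixture import GaussianMixture

# Fit 10 Gaussian components, warm-started at the k-Means centers.
gmm = GaussianMixture(
    n_components=10,
    covariance_type="full",
    means_init=kmeans.cluster_centers_,
    random_state=0,
).fit(pixels)

# Soft membership: each pixel gets a probability for every component,
# and its predicted color is the probability-weighted mix of the 10 mean colors.
probs = gmm.predict_proba(pixels)          # shape (n_pixels, 10)
soft_recon = (probs @ gmm.means_).reshape(img.shape)
```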

In the animation below, we see how this happens using the EM algorithm. We show how the predicted image colors change over time and how the different soft clusters and cluster centers (red stars) move over time to optimize the colors.


In contrast to our k-Means example, we cannot consider this a compression algorithm, because for each pixel we store a probability value for each of the 10 Gaussian components. This actually increases the data size: instead of a 3-element RGB vector per pixel, we now store 10 values per pixel, on top of the Gaussian means themselves.
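Continuing the same sketch, a rough count of stored values makes the size difference explicit.

```python
n_pixels = pixels.shape[0]
rgb_size = n_pixels * 3            # values stored by the original RGB image
gmm_size = n_pixels * 10 + 10 * 3  # per-pixel responsibilities plus the 10 mean colors
print(rgb_size, gmm_size)
```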

Still, this is an illustration of the power of Gaussian mixture models to model the probability distribution of our data.

Clearly, given the complexity of applying EM to Gaussian mixtures, it is much more computationally intensive than k-Means. To improve efficiency, it makes sense to first run k-Means, then use the resulting centers, per-cluster covariances, and cluster proportions as the initial means, covariances, and mixture coefficients of the Gaussian mixture. We can then run EM as usual.
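Extending the earlier scikit-learn sketch, such a warm start could be wired up as follows, assuming every k-Means cluster is non-empty. The weights_init, means_init, and precisions_init arguments are scikit-learn's initialization hooks; note that the library expects precisions, i.e. inverse covariances.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

k = 10
labels = kmeans.labels_

# Cluster proportions, centers, and per-cluster covariances from the k-Means result.
weights = np.bincount(labels, minlength=k) / len(labels)
means = kmeans.cluster_centers_
covs = np.stack([
    np.cov(pixels[labels == j].T) + 1e-6 * np.eye(3)  # small jitter keeps covariances invertible
    for j in range(k)
])

# GaussianMixture is initialized with precisions (inverse covariances).
gmm = GaussianMixture(
    n_components=k,
    covariance_type="full",
    weights_init=weights,
    means_init=means,
    precisions_init=np.linalg.inv(covs),
).fit(pixels)
```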

Note that there is one big problem with Gaussian mixture optimization via EM, or maximum likelihood in general: it is susceptible to fitting a Gaussian component to a single datapoint, with that component's variance collapsing toward 0. When this happens, the likelihood diverges to infinity. This can be addressed by regularizing the parameters of each component so that its variance cannot go to 0 and it cannot overfit a single datapoint.
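In scikit-learn's GaussianMixture, this kind of regularization is exposed through the reg_covar argument, which adds a small constant to the diagonal of every covariance matrix so that no component's variance can collapse to 0.

```python
from sklearn.mixture import GaussianMixture

# reg_covar (default 1e-6) is added to the diagonal of each covariance matrix;
# a larger value regularizes more strongly against single-datapoint components.
gmm = GaussianMixture(n_components=10, covariance_type="full", reg_covar=1e-4)
```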