Clustering as a Mixture of Gaussians

Introduction to Model-Based Clustering
There’s another way to deal with clustering problems: a model-based approach, which consists in using certain models for clusters and attempting to optimize the fit between the data and the model.
In practice, each cluster can be mathematically represented by a parametric distribution, like a Gaussian (continuous) or a Poisson (discrete). The entire data set is therefore modelled by a mixture of these distributions. An individual distribution used to model a specific cluster is often referred to as a component distribution.
A mixture model with high likelihood tends to have the following traits:
  • component distributions have high “peaks” (data in one cluster are tight);
  • the mixture model “covers” the data well (dominant patterns in the data are captured by component distributions).
Main advantages of model-based clustering:
  • well-studied statistical inference techniques available;
  • flexibility in choosing the component distribution;
  • obtain a density estimation for each cluster;
  • a “soft” classification is available.
Mixture of Gaussians
The most widely used clustering method of this kind is the one based on learning a mixture of Gaussians: we can actually consider clusters as Gaussian distributions centred on their barycentres, as we can see in the picture, where the grey circle represents the variance of the distribution.
The generative process works in this way:
  • it chooses a component (one of the Gaussians) at random, with probability P(ωi);
  • it samples a point from that Gaussian, x ~ N(μi, σ²I).
Let's suppose we have a sample of N data points:
  • x1, x2, ..., xN
We can obtain the likelihood of the sample from the mixture parameters. What we really want to maximise is P(x | μ1, ..., μK), the probability of a datum given the centres of the Gaussians. Expanding it over the components,
P(x | μ1, ..., μK) = Σi P(x | ωi, μ1, ..., μK) P(ωi)
This expression is the base used to write the likelihood function for the whole sample:
L(μ1, ..., μK) = P(x1, ..., xN | μ1, ..., μK) = Πj Σi P(xj | ωi, μ1, ..., μK) P(ωi)
Now we should maximise the likelihood function by solving ∂ log L / ∂μi = 0, but this is too difficult to do in closed form. That's why we use a simpler, iterative algorithm called EM (Expectation-Maximization).
The EM Algorithm
The algorithm which is used in practice to find the mixture of Gaussians that can model the data set is called EM (Expectation-Maximization) (Dempster, Laird and Rubin, 1977). Let’s see how it works with an example.
Suppose xk are the marks obtained by the students of a class, awarded with the following probabilities, where μ is an unknown parameter to be estimated:
x1 = 30     with probability 1/2
x2 = 18     with probability μ
x3 = 0      with probability 2μ
x4 = 23     with probability 1/2 − 3μ
First case: we observe how the marks are distributed among the students:
x1 : a students
x2 : b students
x3 : c students
x4 : d students
The probability of these observed counts is
P(a, b, c, d | μ) ∝ (1/2)^a · μ^b · (2μ)^c · (1/2 − 3μ)^d
We should maximise this function by solving ∂P/∂μ = 0. Let's instead calculate the logarithm of the function and maximise that; setting
∂ log P / ∂μ = b/μ + c/μ − 3d/(1/2 − 3μ) = 0
gives
μ = (b + c) / (6(b + c + d)).
Supposing a = 14, b = 6, c = 9 and d = 10 we can calculate that μ = 1/10.
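As a quick numerical check of the closed-form estimate above (a minimal Python sketch, using the counts of the example):

    # Check of the maximum-likelihood estimate mu = (b + c) / (6 (b + c + d))
    # for the observed counts of the example above.
    a, b, c, d = 14, 6, 9, 10
    mu = (b + c) / (6 * (b + c + d))
    print(mu)  # 0.1, i.e. mu = 1/10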
Second case: we only observe that the marks are distributed among the students as follows:
x1 or x2 : h students
x3 : c students
x4 : d students
Now the individual counts a and b are hidden: to estimate μ we need the expected value of b, but to compute that expectation we need μ. We have so obtained a circularity which is divided into two steps:
  • expectation: given the current estimate of μ, compute the expected count b = h · μ / (1/2 + μ) (and a = h − b);
  • maximization: given this b, re-estimate μ = (b + c) / (6(b + c + d)).
This circularity can be solved in an iterative way.
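A minimal sketch of this iteration in Python (the counts h, c and d and the starting value of μ are illustrative, chosen to match the example above):

    # E/M circularity for the marks example, assuming the probabilities
    # P(x1) = 1/2, P(x2) = mu, P(x3) = 2*mu, P(x4) = 1/2 - 3*mu used above.
    h, c, d = 20, 9, 10   # h students got x1 or x2; c got x3; d got x4
    mu = 0.05             # arbitrary initial guess

    for step in range(1, 11):
        b = h * mu / (0.5 + mu)           # E-step: expected number of students with mark x2
        mu = (b + c) / (6 * (b + c + d))  # M-step: maximum-likelihood mu given this b
        print(step, mu)

After a few iterations μ settles down; the E and M steps are simply alternated until the estimate stops changing.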
Let's now see how the EM algorithm works for a mixture of Gaussians (parameters estimated at the p-th iteration are marked by a superscript (p)):
  1. Initialize the parameters:

     λ(0) = { μ1(0), ..., μK(0), σ1(0), ..., σK(0), P1(0), ..., PK(0) }

     where Pj denotes the prior probability (mixing weight) of the j-th component.

  2. E-step: compute the responsibility of each component ωj for each record xk,

     P(ωj | xk, λ(p)) = P(xk | ωj, λ(p)) Pj(p) / Σi P(xk | ωi, λ(p)) Pi(p)

  3. M-step: re-estimate the parameters using the responsibilities just computed,

     μj(p+1) = Σk P(ωj | xk, λ(p)) xk / Σk P(ωj | xk, λ(p))

     σj(p+1) = sqrt( Σk P(ωj | xk, λ(p)) (xk − μj(p+1))² / Σk P(ωj | xk, λ(p)) )

     Pj(p+1) = Σk P(ωj | xk, λ(p)) / R

     where R is the number of records.

The E-step and the M-step are repeated until the parameters (or the likelihood) stop changing.
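A minimal sketch of these updates in Python for a one-dimensional mixture of two Gaussians (NumPy assumed; the toy data and the number of iterations are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(5.0, 1.0, 100)])
    R, K = len(x), 2                      # R records, K components

    mu = rng.choice(x, K, replace=False)  # initial centres
    sigma = np.full(K, x.std())           # initial standard deviations
    p = np.full(K, 1.0 / K)               # initial mixing weights P(omega_j)

    for _ in range(50):
        # E-step: responsibilities P(omega_j | x_k, lambda(p)), shape (R, K)
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        resp = dens * p
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate centres, spreads and mixing weights
        Nj = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / Nj
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nj)
        p = Nj / R

    print("centres:", mu, "weights:", p)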

Bibliography

Hierarchical Clustering Algorithms

How They Work
Given a set of N items to be clustered, and an N*N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this:
  1. Start by assigning each item to a cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
  2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one cluster less.
  3. Compute distances (similarities) between the new cluster and each of the old clusters.
  4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N. (*)
Step 3 can be done in different ways, which is what distinguishes single-linkage from complete-linkage and average-linkage clustering.
In single-linkage clustering (also called the connectedness or minimum method), we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster.
In complete-linkage clustering (also called the diameter or maximum method), we consider the distance between one cluster and another cluster to be equal to the greatest distance from any member of one cluster to any member of the other cluster.
In average-linkage clustering, we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.
A variation on average-link clustering is the UCLUS method of R. D'Andrade (1978) which uses the median distance, which is much more outlier-proof than the average distance.
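As a compact illustration of these rules (a hedged sketch, not part of the original tutorial), here they are written as small Python functions; clusters are plain lists of points and dist is any point-to-point distance function:

    from statistics import median

    def single_link(c1, c2, dist):
        return min(dist(a, b) for a in c1 for b in c2)       # shortest pairwise distance

    def complete_link(c1, c2, dist):
        return max(dist(a, b) for a in c1 for b in c2)       # greatest pairwise distance

    def average_link(c1, c2, dist):
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))  # mean distance

    def uclus_link(c1, c2, dist):
        return median(dist(a, b) for a in c1 for b in c2)    # median distance (UCLUS variant)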
This kind of hierarchical clustering is called agglomerative because it merges clusters iteratively. There is also a divisive hierarchical clustering which does the reverse, starting with all objects in one cluster and subdividing it into smaller pieces. Divisive methods are not generally available and have rarely been applied.
(*) Of course there is no point in having all the N items grouped in a single cluster but, once you have got the complete hierarchical tree, if you want k clusters you just have to cut the k-1 longest links.
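In practice the whole procedure, including the tree cut mentioned in the note above, is usually run with a library. A hedged sketch with SciPy (assuming scipy is installed; the data and k are illustrative placeholders):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    X = np.random.rand(10, 2)          # 10 items described by 2 features
    D = pdist(X)                       # condensed pairwise distance matrix

    for method in ("single", "complete", "average"):
        Z = linkage(D, method=method)  # full merge history (the hierarchical tree)
        labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into k = 3 clusters
        print(method, labels)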

Single-Linkage Clustering: The Algorithm
Let’s now take a deeper look at how Johnson’s algorithm works in the case of single-linkage clustering.
The algorithm is an agglomerative scheme that erases rows and columns in the proximity matrix as old clusters are merged into new ones.
The N*N proximity matrix is D = [d(i,j)]. The clusterings are assigned sequence numbers 0, 1, ..., (N−1) and L(k) is the level of the k-th clustering. A cluster with sequence number m is denoted (m) and the proximity between clusters (r) and (s) is denoted d[(r),(s)].
The algorithm is composed of the following steps:
  1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
  2. Find the least dissimilar pair of clusters in the current clustering, say pair (r), (s), according to

    d[(r),(s)] = min d[(i),(j)]


    where the minimum is over all pairs of clusters in the current clustering.
  3. Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set the level of this clustering to

    L(m) = d[(r),(s)]
  4. Update the proximity matrix, D, by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined in this way:

    d[(k), (r,s)] = min { d[(k),(r)], d[(k),(s)] }
  5. If all objects are in one cluster, stop. Else, go to step 2.
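The steps above translate almost directly into code. The following is a minimal Python sketch (not from the original tutorial) that stores the proximity matrix as a dictionary of dictionaries and repeatedly merges the two closest clusters:

    # Johnson's single-linkage scheme on a symmetric proximity matrix given as a
    # dict of dicts: D[a][b] = d(a, b), with d(a, a) = 0.  Labels are illustrative.
    def single_linkage(D):
        clusters = {a: dict(row) for a, row in D.items()}
        merges = []                                   # (m, level L(m), merged cluster)
        m = 0
        while len(clusters) > 1:
            # Step 2: find the least dissimilar pair of clusters (r), (s)
            r, s = min(((a, b) for a in clusters for b in clusters if a < b),
                       key=lambda pair: clusters[pair[0]][pair[1]])
            m += 1
            level = clusters[r][s]                    # Step 3: L(m) = d[(r),(s)]
            merged = r + "/" + s
            merges.append((m, level, merged))
            # Step 4: d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)]) for every other cluster (k)
            new_row = {k: min(clusters[r][k], clusters[s][k])
                       for k in clusters if k not in (r, s)}
            del clusters[r], clusters[s]
            for k, dist in new_row.items():
                clusters[k].pop(r)
                clusters[k].pop(s)
                clusters[k][merged] = dist
            clusters[merged] = new_row
        return merges                                 # Step 5: stop when one cluster remains

Applied to the distance matrix of the example below, this sketch should reproduce the merge levels 138, 219, 255, 268 and 295.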
An Example
Let’s now see a simple example: a hierarchical clustering of distances in kilometers between some Italian cities. The method used is single-linkage.
Input distance matrix (L = 0 for all the clusters):

        BA     FI     MI     NA     RM     TO
BA       0    662    877    255    412    996
FI     662      0    295    468    268    400
MI     877    295      0    754    564    138
NA     255    468    754      0    219    869
RM     412    268    564    219      0    669
TO     996    400    138    869    669      0
The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1. Then we compute the distance from this new compound object to all other objects. In single-linkage clustering the rule is that the distance from the compound object to another object is equal to the shortest distance from any member of the cluster to the outside object. So the distance from "MI/TO" to RM is chosen to be 564, which is the distance from MI to RM, and so on.
After merging MI with TO we obtain the following matrix:

         BA     FI   MI/TO    NA     RM
BA        0    662    877    255    412
FI      662      0    295    468    268
MI/TO   877    295      0    754    564
NA      255    468    754      0    219
RM      412    268    564    219      0
min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM
L(NA/RM) = 219
m = 2

         BA     FI   MI/TO  NA/RM
BA        0    662    877    255
FI      662      0    295    268
MI/TO   877    295      0    564
NA/RM   255    268    564      0

min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called BA/NA/RM
L(BA/NA/RM) = 255
m = 3

            BA/NA/RM    FI   MI/TO
BA/NA/RM        0      268    564
FI            268        0    295
MI/TO         564      295      0

min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268
m = 4

               BA/FI/NA/RM   MI/TO
BA/FI/NA/RM         0         295
MI/TO             295           0
Finally, we merge the last two clusters at level 295.
The whole process is summarized by a hierarchical tree (dendrogram) whose successive merge levels are 138, 219, 255, 268 and 295.
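For reference, a hedged sketch that reproduces the same single-linkage hierarchy with SciPy (assuming scipy is available); the third column of the linkage matrix should contain exactly the merge levels found above:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram
    from scipy.spatial.distance import squareform

    cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
    D = np.array([[  0, 662, 877, 255, 412, 996],
                  [662,   0, 295, 468, 268, 400],
                  [877, 295,   0, 754, 564, 138],
                  [255, 468, 754,   0, 219, 869],
                  [412, 268, 564, 219,   0, 669],
                  [996, 400, 138, 869, 669,   0]])

    Z = linkage(squareform(D), method="single")
    print(Z[:, 2])                   # merge levels: 138, 219, 255, 268, 295
    # dendrogram(Z, labels=cities)   # would draw the hierarchical tree (needs matplotlib)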

Problems
The main weaknesses of agglomerative clustering methods are:
  • they do not scale well: time complexity of at least O(n²), where n is the total number of objects;
  • they can never undo what was done previously.

Bibliography

Fuzzy C-Means Clustering

The Algorithm
Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. This method (developed by Dunn in 1973 and improved by Bezdek in 1981) is frequently used in pattern recognition. It is based on minimization of the following objective function:

    Jm = Σi=1..N Σj=1..C (uij)^m ||xi − cj||²

where m is any real number greater than 1, uij is the degree of membership of xi in cluster j, xi is the i-th of the d-dimensional measured data, cj is the d-dimensional centre of cluster j, and ||·|| is any norm expressing the similarity between measured data and a centre. Fuzzy partitioning is carried out through an iterative optimization of the objective function shown above, with the memberships uij and the cluster centres cj updated by:

    uij = 1 / Σk=1..C ( ||xi − cj|| / ||xi − ck|| )^(2/(m−1))          cj = Σi=1..N (uij)^m xi / Σi=1..N (uij)^m

This iteration stops when maxij | uij(k+1) − uij(k) | < ε, where ε is a termination criterion between 0 and 1 and k is the iteration step. This procedure converges to a local minimum or a saddle point of Jm. The algorithm is composed of the following steps:
  1. Initialize the membership matrix U = [uij], U(0)
  2. At the k-th step: calculate the centre vectors C(k) = [cj] using U(k)
  3. Update U(k) to U(k+1) using the membership formula above
  4. If || U(k+1) − U(k) || < ε then STOP; otherwise return to step 2.
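These four steps can be written compactly with NumPy. The sketch below is a minimal illustration of the update formulas given earlier; the data, the fuzziness coefficient m and the threshold ε are arbitrary choices:

    import numpy as np

    def fcm(X, C, m=2.0, eps=0.01, max_iter=100):
        N = len(X)
        rng = np.random.default_rng(0)
        U = rng.random((N, C))
        U /= U.sum(axis=1, keepdims=True)                   # step 1: random fuzzy partition U(0)
        for _ in range(max_iter):
            Um = U ** m
            centers = (Um.T @ X) / Um.sum(axis=0)[:, None]  # step 2: centre vectors c_j
            dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            dist = np.fmax(dist, 1e-12)                     # avoid division by zero
            U_new = 1.0 / ((dist[:, :, None] / dist[:, None, :]) ** (2 / (m - 1))).sum(axis=2)
            if np.abs(U_new - U).max() < eps:               # step 4: stopping criterion
                return centers, U_new
            U = U_new                                       # step 3: updated memberships
        return centers, U

    # Illustrative usage on one-dimensional data (reshaped to N x d)
    rng = np.random.default_rng(42)
    X = np.concatenate([rng.normal(0, 1, 50), rng.normal(6, 1, 50)]).reshape(-1, 1)
    centers, U = fcm(X, C=2)
    print(centers.ravel())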

Remarks
As already mentioned, data are bound to each cluster by means of a membership function, which represents the fuzzy behaviour of this algorithm. To do that, we simply have to build an appropriate matrix U whose elements are numbers between 0 and 1 and represent the degree of membership between the data and the centres of the clusters.
For a better understanding, we may consider this simple mono-dimensional example. Given a certain data set, suppose we represent it as distributed along an axis, as in the figure below.

Looking at the picture, we may identify two clusters in proximity of the two data concentrations. We will refer to them as 'A' and 'B'. In the first approach shown in this tutorial, the k-means algorithm, we associated each datum with a specific centroid; therefore, the membership function looked like this:
In the FCM approach, instead, the same datum does not belong exclusively to a well-defined cluster: it can sit somewhere in between. In this case, the membership function follows a smoother curve to indicate that every datum may belong to several clusters, with different values of the membership coefficient.
In the figure above, the datum shown as a red marked spot belongs more to the B cluster than to the A cluster; the value 0.2 of the membership function indicates its degree of membership to A. Now, instead of using a graphical representation, we introduce a matrix U whose elements are the values taken from the membership functions:

(a) k-means membership matrix                 (b) FCM membership matrix
The number of rows and columns depends on how many data and clusters we are considering. More precisely, we have C = 2 columns (C = 2 clusters) and N rows, where C is the total number of clusters and N is the total number of data. The generic element is denoted uij.
In the examples above we have considered the k-means (a) and FCM (b) cases. We can notice that in the first case (a) the coefficients can only be 0 or 1, which indicates that each datum can belong to one cluster only. The membership matrix must also satisfy the following properties:

    uij ∈ [0, 1]   for all i, j
    Σj=1..C uij = 1   for each datum i
    0 < Σi=1..N uij < N   for each cluster j
An Example
Here we consider the simple case of a mono-dimensional application of the FCM. Twenty data points and three clusters are used to initialize the algorithm and to compute the U matrix. The figures below (taken from our interactive demo) show the membership value of each datum for each cluster. The colour of each datum is that of the nearest cluster according to the membership function.
In the simulation shown in the figure above we have used a fuzziness coefficient m = 2 and we have imposed that the algorithm terminate when the maximum change in the memberships between two iterations is smaller than ε = 0.3. The picture shows the initial condition, where the fuzzy distribution depends on the particular position of the clusters. No step has been performed yet, so the clusters are not identified very well. Now we can run the algorithm until the stop condition is verified. The figure below shows the final condition, reached at the 8th step with m = 2 and ε = 0.3:
Is it possible to do better? Certainly: we could use a higher accuracy, but we would also have to pay with a larger computational effort. In the next figure we can see a better result obtained with the same initial conditions and ε = 0.01, but 37 steps were needed!
It is also important to notice that different initializations lead to different evolutions of the algorithm. It may well converge to the same result, but probably with a different number of iteration steps.
Bibliography