k-means and EM clustering algorithms
The final data mining assignment (which was due in the middle of exams) was to implement the k-means and EM algorithms in C. These are two pretty simple clustering algorithms, with k-means (or renditions of k-means) used in industry all the time. Both algorithms work in 2-dimensions and take an input file of data points separated by a space.
K-means works as follow:
- Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
- Assign each object to the group that has the closest centroid.
- When all objects have been assigned, recalculate the positions of the K centroids.
- Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
The EM algorithm is a bit more complicated, but has generally the same idea:

Since this was in the middle of exams, I wrote both solutions (code) in one night (7pm to 3am), so no guarantee whatsoever. Take a look; if you’d change anything let me know. The code is licensed under GPL v2.


September 22nd, 2006 at 12:16 pm
Hi! I’m working in a research project about video segmentation and like to know if I can use your source code in this. If I change something I’ll let you know, sure. And, in case you say yes, tell me how to thanks you properly in my publication.
Thanks for your attention.
September 22nd, 2006 at 3:51 pm
Hi Leilane,
The code is licensed uder GPL v2.
So go ahead, as long as your confrom to that license.
On a side note, as I said in the post I wrote them both in one night.
The EM version surely has problems, k-means is probably right…
Good luck,
George
May 24th, 2007 at 9:21 am
Hi George
Thanks for the excellent work!
Just wondering how Kmeans can be extended to more then 2 dimensions.
Thanks
Ravi
March 9th, 2009 at 3:58 am
Hi, I would like to know how kmeans will be useful in case of incremental clustering. If the information is dynamically changing, then how kmeans or nearest neighbor will handle editing correct cluster. How to handle unlabelled data? if to begin with we have 90% unlabelled data and 10% labelled then how the algorithm looks like? if this ration need to change, 80% and 20% then how will be the said algorithm change? kindly guide in this regards.
Preeti.