K-means Clustering


My seminar this year was based on an interesting topic: how parallelizing the Hierarchical Clustering algorithm yields great gains in efficiency.

Working on the paper got me thinking about parallelizing the existing data clustering and grouping algorithms.
I thought I was having one of those Eureka moments: think of all the speedups, and how the great data moguls would benefit if I designed such a unique algorithm.

So I got to work, opened Google, and typed in "Parallel K-means". I got around 35,000,000 results in 0.67 seconds.
So I thought, okay, people have obviously come up with this idea before; let's see how many have parallelized K-means using OpenMP, since maybe that could be my contribution to scientific technology. (Mind you, my meager knowledge of OpenMP wasn't holding back my confidence in my ability to parallelize something with it.)
Here too, I got some 192,000 results.

So I am not proposing any unique idea in this blog post, nor am I explaining my own code. I am reiterating an excellent use case of parallelism in the data-hungry world that exists today. I feel it is worth anyone's time to go through this extremely efficient parallel K-means code that I found.

I am going to be referencing Prof. Wei-keng Liao's code, as available at Prof. Liao's Page

I have forked a similar project on GitHub, which focuses on elaborating on Prof. Liao's work. It can be found at Serban Giuroiu's Implementation
While Serban has gone ahead and implemented K-means in CUDA, for the purposes of this post, I will only concentrate on the OpenMP version.

I am assuming, for the sake of this post, that the simple logic behind the K-means clustering algorithm is understood; I will just be elaborating on some of the OpenMP constructs used.

To get the basics out of the way: the K-means algorithm begins with the user selecting the number of clusters, k, to generate. The means of these k clusters are then initialized using one of various methods. After this, every point in the dataset is assigned to the cluster with the nearest mean, where "nearest" is measured by an arbitrary distance metric (commonly Euclidean distance).
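The assignment step above can be sketched in plain C. This is a minimal sketch, not Prof. Liao's code; the function and variable names here are my own.

```c
#include <stddef.h>
#include <float.h>

/* Squared Euclidean distance between two points of dim coordinates.
   Squared distance preserves the nearest-center ordering, so the
   square root can be skipped. */
static double dist2(const double *a, const double *b, size_t dim) {
    double d = 0.0;
    for (size_t j = 0; j < dim; j++) {
        double diff = a[j] - b[j];
        d += diff * diff;
    }
    return d;
}

/* Return the index of the cluster center nearest to point p.
   centers is a flattened k x dim array. */
static size_t nearest_center(const double *p, const double *centers,
                             size_t k, size_t dim) {
    size_t best = 0;
    double best_d = DBL_MAX;
    for (size_t c = 0; c < k; c++) {
        double d = dist2(p, &centers[c * dim], dim);
        if (d < best_d) { best_d = d; best = c; }
    }
    return best;
}
```

Each iteration of this loop is independent of the others, which is exactly what makes the assignment step a natural candidate for parallelization.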

The function omp_kmeans is the main function in the code. It is written so that the first K elements are picked as the initial K cluster center coordinates.

The number of threads depends on the maximum number of threads that the user's hardware will support. This ensures that efficiency doesn't suffer due to the overhead of creating too many threads. It is obtained by calling the omp_get_max_threads() function.
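A minimal illustration of querying that thread budget (guarded so the sketch also compiles without OpenMP; the helper name is mine):

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Fallback so the sketch also builds without -fopenmp */
static int omp_get_max_threads(void) { return 1; }
#endif

/* Size the parallel work by the hardware's thread budget, rather
   than spawning one thread per chunk of data. */
static int choose_num_threads(void) {
    int max_threads = omp_get_max_threads();
    return max_threads >= 1 ? max_threads : 1;
}
```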

The proposed approach is:
Each thread calculates partial cluster sums in its own private space, and then thread 0, the first thread, performs an array reduction over them.

 if (is_perform_atomic) {
   #pragma omp parallel for \
           private(i,j,index) \
           firstprivate(numObjs,numClusters,numCoords) \
           shared(objects,clusters,membership,newClusters,newClusterSize) \
           schedule(static) \
           reduction(+:delta)
   for (i=0; i<numObjs; i++) {
     /* find the array index of nearest cluster center */
     index = find_nearest_cluster(numClusters, numCoords, objects[i],
                                  clusters);

     /* if membership changes, increase delta by 1 */
     if (membership[i] != index) delta += 1.0;

     /* assign the membership to object i */
     membership[i] = index;

     /* update new cluster centers : sum of objects located within */
     #pragma omp atomic
     newClusterSize[index]++;
     for (j=0; j<numCoords; j++)
        #pragma omp atomic
        newClusters[index][j] += objects[i][j];
   }
 }

The private clause makes sure that each thread has its own instance of the variable; here the loop indices are private to avoid race conditions and false sharing.

The firstprivate clause specifies that each thread should have its own instance of a variable, initialized with the value the variable had before the parallel construct; this way the initialization does not have to be repeated inside every thread.

The shared clause specifies that the listed variables are shared among all threads; this maintains a single consistent copy of the required data structures.
The schedule clause selects the kind of scheduling; here static scheduling is used, which divides the iterations into roughly equal-sized chunks up front.

The reduction clause simply specifies that at the end of the parallel region, the per-thread private copies of the variable undergo a reduction with the specified operation; here the delta values are combined with addition.
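The per-thread-private-space-then-merge approach described earlier can be sketched as follows. This is my own simplified sketch of the array-reduction idea, not Prof. Liao's actual code, and the names are mine; it is guarded so it also compiles without OpenMP.

```c
#include <stdlib.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Fallbacks so the sketch also builds without -fopenmp */
static int omp_get_max_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif

/* Sum each object's coordinates into its cluster's accumulator using
   per-thread private copies, then merge the partial sums serially:
   the "array reduction" idea.  membership[] must already hold each
   object's cluster index; new_clusters (k x dim) must be zeroed. */
void accumulate_clusters(size_t n, size_t dim, size_t k,
                         const double *objects,   /* n x dim, flattened */
                         const int *membership,   /* n */
                         double *new_clusters)    /* k x dim, flattened */
{
    int nthreads = omp_get_max_threads();
    double *local = calloc((size_t)nthreads * k * dim, sizeof *local);
    if (!local) return;

    #pragma omp parallel
    {
        /* Each thread accumulates into its own private slice... */
        double *mine = local + (size_t)omp_get_thread_num() * k * dim;
        #pragma omp for
        for (long i = 0; i < (long)n; i++)
            for (size_t j = 0; j < dim; j++)
                mine[(size_t)membership[i] * dim + j] += objects[i * dim + j];
    }

    /* ...and one thread merges the per-thread partial sums. */
    for (int t = 0; t < nthreads; t++)
        for (size_t c = 0; c < k * dim; c++)
            new_clusters[c] += local[(size_t)t * k * dim + c];

    free(local);
}
```

The trade-off against the atomic version shown above: this uses O(threads × k × dim) extra memory, but avoids contention on the shared accumulators during the parallel loop.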

On executing the sequential algorithm, this is the output:

% seq_main -o -b -n 4 -i Image_data/color17695.bin

Writing coordinates of K=4 cluster centers to file
 Writing membership of N=17695 data objects to file

Performing **** Regular Kmeans (sequential version) ****
 Input file: Image_data/color17695.bin
 numObjects = 17695
 numAttributes = 9
 numClusters = 4
 threshold = 0.0010
 I/O time = 0.0266 sec
 Computation timing = 0.3572 sec

The computation time is 0.3572 seconds.
Using the OpenMP version, the output is:

% omp_main -a -o -n 4 -i Image_data/color100.txt

Writing coordinates of K=4 cluster centers to file
 Writing membership of N=100 data objects to file

Performing **** Regular Kmeans (OpenMP) ---- using array reduction ******
 Number of threads = 8
 Input file: Image_data/color100.txt
 numObjects = 100
 numAttributes = 9
 numClusters = 4
 threshold = 0.0010
 I/O time = 0.0035 sec
 Computation timing = 0.0017 sec

The computation timing is 0.0017 seconds. Note, however, that the two runs above use different inputs (17,695 objects versus 100 objects), so these timings are not directly comparable; on the same input, the OpenMP version should still outperform the sequential algorithm substantially.

Most of this post has been based on my understanding of Professor Wei-keng Liao’s and Serban Giuroiu’s work. The links have already been provided earlier in this post.

