Parallel Merge Sort

In my second post I am going to talk about the popular sorting algorithm Merge Sort.

The serial implementation of Merge Sort, is based on a divide and conquer approach. The data set is divided into recursively into smaller and smaller parts, i.e. the divide part, and then to conquer the data, the data is ordered, two elements at a time. The merge action is then called from bottom up, combining the different parts to form the final sorted array.
As deduced, this is a stable sort.

A more detailed explanation can be found at the Wikipedia Page .

It is easy to implement, easy to understand and is an efficient sorting algorithm. This is the first algorithm I am going to try and parallelise using the OpenMP tools at my disposal.

As we know, Merge sort, when executed serially, has a time complexity of the order, ϴ(n log n) in the average case. I am going to try and see if making the code run in parallel gives us an even better complexity.


All the source code used for this post can be found in GitHub Repository: Parallel Computing.

I started off by writing a simple serial algorithm, the source code for which is in a file named “merge-serial.cpp” in the GitHub repository.
The below illustrated code is the main “hotspot” of the entire algorithm due to two rudimentary observations,
1. It is called the most number of times
2. It is doing the most important work, i.e splitting, then merging.

void mergeSort(int a[], int l, int r)
        int m=(l+r)/2;
        mergeSort(a,l,m); //call 1
        mergeSort(a,m+1,r); //call 2

This algorithm has identifiable independent regions. The mergeSort function divides the array into two halves and sorts the elements recursively, following which it merges the two parts. Here, I noticed that the first two calls (call1 and call2 in comments) are essentially independent of each other, by independent I mean, they operate on two different parts of the data array. This brought me to the observation that these two tasks can be executed in parallel.
Note here, that the common variable m is only being read from, and not modified simultaneously. Hence the chances of race conditions are minimal.

Also note here, that for large values of n, it is unlikely that the two threads will update close parts of the array a, thus decreasing the probability of false sharing.

For the implementation of this I used the Parallel Sections tool available to me from OpenMP. Every function call will essentially require two regions to run in parallel, so we need just two threads to be created. We use “section” to define the two individual sections we want running in parallel.

void mergeSort(int a[], int l, int r)
        int m=(l+r)/2;
        #pragma omp parallel sections num_threads(2)
            #pragma omp section
            #pragma omp section

The entire code for your reference can be found in GitHub Repository: Parallel Computing, in a file: “merge-parallel.cpp“.

Time Complexity

I have executed and compared the above two functions on a system with specifications:

   Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz
   Cache Size: 3072 KB
   RAM: 4GB

In order to measure the differences in the time complexities of the two codes, I created random inputs for large values of n,  using the “generate.cpp” file found in  GitHub Repository: Parallel Computing.
Note: the data type of these values is integer. However, the results should scale to character, floating point and double values as well.

I have documented my results in the form of line charts.

Input Size Serial User Time Parallel User Time Real Serial Real Parallel
100 0 0 0.002 0.002
1000 0 0 0.003 0.003
10000 0.008 0.008 0.017 0.009
50000 0.048 0.036 0.105 0.09
80000 0.06 0.056 0.179


100000 0.076 0.068 0.228 0.143
1000000 0.876 0.716 2.172 1.252

Table 1

The Serial User Time and Parallel User Time columns in Table 1, are indicative of the total System time and User time, which account for the total actual CPU time our process used. This will include the time taken by all child threads created. A graph for the same is shown in Figure 1. Clearly, the parallel code is faster for larger inputs.


Figure 1

In Figure 2, I have plotted the real time taken by both the codes, which is the wall clock time, and here too the parallel code is much faster for larger inputs.


Figure 2

Clearly, making a sequential algorithm run in parallel has advantages, and especially for large values of n, parallelism would be preferred.

For smaller values, since the time difference isn’t drastic, criteria such as the overhead of creating and deleting threads come into picture.

These criteria should be kept in mind, since the main purpose of parallelism is increasing the efficiency.

While preparing this post I came across an interesting paper that meticulously explains converting serial Merge Sort into parallel using OpenMP and MPI. Interested readers may refer to it here: Shared Memory, Message Passing, and Hybrid Merge Sorts for Standalone and Clustered SMPs

A good explanation on the difference between the different times the Linux kernel displays can be found at Answer to “What do ‘real’, ‘user’ and ‘sys’ mean in the output of time(1)?”

For any queries, comments, and especially mistakes you see in what I do, kindly contact me.

Thank you.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s