A Parallel Approach to Image Processing

For this post, I am going to delve into the world of images, and learn how to process them faster.

Digital image processing is a tedious procedure involving various steps, many of which are time consuming.
Basically, every pixel of the image has to be broken down into its corresponding RGB values, and the required operations have to be performed on a pixel-by-pixel basis. I looked for a simple image processing project on GitHub and came across this repository, which performs simple operations such as generating the negative of an image.

The basic idea behind the serial algorithm is to read the pixels, and subtract the RGB values from 255.
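
As a sketch of that idea (using a stand-in Pixel struct, not the repository's actual class):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the repository's Pixel class: one byte per channel.
struct Pixel { uint8_t r, g, b; };

// Serial negative: subtract each channel value from 255, pixel by pixel.
void negate(std::vector<Pixel>& img) {
    for (std::size_t i = 0; i < img.size(); ++i) {
        img[i].r = 255 - img[i].r;
        img[i].g = 255 - img[i].g;
        img[i].b = 255 - img[i].b;
    }
}
```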

In the case of image processing, every pixel is inherently independent of the others, so computations can be performed on all pixels simultaneously, provided we have the required supporting hardware.
The hardware I am using has the following specifications:

  Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz
  Cache Size: 3072 KB
  RAM: 4GB

The original image can be found at this dropbox link
(This is because WordPress does not allow you to share BitMap images directly.)

As a first step, I executed the original serial code and got the following result:

$ time ./new -i sample.bmp -o new.bmp

real	0m0.118s
user	0m0.288s
sys	0m0.016s

Rather than jumping directly to OpenMP, I first tried implementing basic serial techniques like loop unrolling.

Loop unrolling is basically asking your for loop to perform the same operation multiple times within one iteration, thereby reducing the total number of iterations (and hence the loop-control overhead). This usually does not show a substantial improvement for small loops.
(The changes to the code have been commented in my GitHub repository .)
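
As a sketch of what unrolling by a factor of four looks like (over an assumed flat R,G,B byte buffer, not the repository's actual classes):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Unrolled-by-four negative over a flat byte buffer (R,G,B,R,G,B,...).
// A sketch only; the actual changes are commented in the forked repository.
void negate_unrolled(std::vector<uint8_t>& buf) {
    std::size_t i = 0;
    std::size_t n = buf.size();
    // Main unrolled body: four channel values per iteration.
    for (; i + 4 <= n; i += 4) {
        buf[i]     = 255 - buf[i];
        buf[i + 1] = 255 - buf[i + 1];
        buf[i + 2] = 255 - buf[i + 2];
        buf[i + 3] = 255 - buf[i + 3];
    }
    // Remainder loop for sizes not divisible by four.
    for (; i < n; ++i)
        buf[i] = 255 - buf[i];
}
```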

As expected, the timing barely changed:

$ time ./new -i sample.bmp -o unroll.bmp

real	0m0.106s
user	0m0.250s
sys	0m0.05s

One possible optimization is SIMD (single instruction, multiple data): reading multiple pixel values at a time and performing one operation on all of them (here, the subtraction from 255). This should work in principle, since pixel values are stored sequentially in memory; processing one byte at a time fills only a quarter of a 32-bit register.
The Pixel class in the Pixel.h file defines a pixel as a set of R, G, B values, which are essentially unsigned characters. To use SIMD we can operate on them as unsigned integers, so that reading from a file fetches 4 bytes at a time instead of 1.
To test this out, every subsequent call to getR, getG, getB, get, set, etc. (essentially all R, G, B manipulations) would have to be cast or converted to unsigned int.
A theoretical speedup of a factor of four to eight can be expected. I haven't carried this out myself, but I see no reason for it not to work. The interested reader can see this document for detailed instructions on implementing it.
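
Though I haven't implemented the SIMD path, one observation makes a word-at-a-time sketch easy: for an unsigned byte x, 255 - x equals ~x, so four packed channel bytes can be negated at once with a single bitwise NOT of a 32-bit word. A minimal sketch (the flat byte buffer is an assumption, not the repository's layout):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Word-at-a-time negative: since 255 - x == ~x for an unsigned byte,
// subtracting four packed channel bytes from 255 at once is just a
// bitwise NOT of the 32-bit word (no per-byte borrows to worry about).
void negate_words(std::vector<uint8_t>& buf) {
    std::size_t i = 0;
    for (; i + 4 <= buf.size(); i += 4) {
        uint32_t w;
        std::memcpy(&w, &buf[i], 4);  // memcpy handles unaligned access safely
        w = ~w;
        std::memcpy(&buf[i], &w, 4);
    }
    for (; i < buf.size(); ++i)  // leftover bytes, one at a time
        buf[i] = 255 - buf[i];
}
```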

The serial code with an O3 optimisation gives the following result:

$ g++ -O3 -o new main.cpp
$ time ./new -i sample.bmp -o new.bmp

real	0m0.034s
user	0m0.028s
sys		0m0.000s

This answer on StackOverflow serves as a good reference on when to use O3 optimisation.

Let’s see if we can do better than this.

Looking at the Image.h file, we find a write function of complexity O(n^2). Trying to optimize it, either with parallel sections (since the three operations of fetching and setting the values for R, G and B are independent) or by marking the for loop as a parallel loop, produces race conditions.
One advantage of learning parallel computing via image processing is that you can easily tell from the output whether you are doing something wrong.

This is the resulting garbled image. 😛

Clearly we had a data race. It was taken care of by declaring x as private.
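
A sketch of the shape of that fix, with illustrative names rather than the repository's (the pragma is simply ignored when compiled without OpenMP, so the code stays correct either way):

```cpp
#include <cassert>
#include <vector>

// Parallelize the outer (row) loop; the inner index x is declared private
// so each thread gets its own copy instead of racing on a shared one.
std::vector<std::vector<int>> negate_rows(const std::vector<std::vector<int>>& img) {
    std::vector<std::vector<int>> out(img.size());
    int x;
    #pragma omp parallel for private(x)
    for (int y = 0; y < (int)img.size(); ++y) {
        out[y].resize(img[y].size());  // each thread touches only its own rows
        for (x = 0; x < (int)img[y].size(); ++x)
            out[y][x] = 255 - img[y][x];
    }
    return out;
}
```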

But again, on executing it, we didn't get the desired output (efficiency-wise).

$ time ./new -i sample.bmp -o new.bmp

real	0m0.117s
user	0m0.276s
sys	0m0.028s

This is because multiple threads were writing to the line vector at the same time, which led to false sharing: even though the indices were unique, elements of the vector that sit on the same cache block bounce between cores, so repeated accesses from multiple threads lead to higher access times (due to extra cache misses and coherence traffic).

This made me look more closely at the code to find the main hotspot: the for loop in the main.cpp file, which performs most of the computation.

On executing the for loop in parallel while taking care of private and global variables, we get the following optimization.

$ time ./new -i sample.bmp -o new.bmp

real	0m0.025s
user	0m0.036s
sys		0m0.000s

Running the same code with an O3 optimisation gives us:

$ time ./new -i sample.bmp -o new.bmp

real	0m0.017s
user	0m0.032s
sys		0m0.000s

Thus, one pragma omp line was all it took.
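
In spirit, the final change looks like this (the Pixel struct and loop body are stand-ins for the repository's code, not its actual classes):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Pixel { uint8_t r, g, b; };

// The whole optimization: one pragma on the hotspot loop. Each iteration
// touches only its own pixel, so no private or shared clauses are needed.
void negate_parallel(std::vector<Pixel>& img) {
    #pragma omp parallel for
    for (long i = 0; i < (long)img.size(); ++i) {
        img[i].r = 255 - img[i].r;
        img[i].g = 255 - img[i].g;
        img[i].b = 255 - img[i].b;
    }
}
```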

This is the desired output.

All the images in this post can be found on
My DropBox Folder

All the code is in my forked repository.

For more information and reference on a parallel approach to image processing, the reader can look at:
This highly illustrative article by Henry A. Gabb and Bill Magro.

