In this post, I am going to delve into the world of images and learn how to process them faster.
Digital image processing is a tedious procedure involving various steps, many of which are time consuming.
Basically, every pixel of the image has to be broken down into its corresponding RGB values, and the required operations have to be performed on a pixel-by-pixel basis. I looked for a simple image processing project on GitHub and came across this repository, which performs simple operations such as generating the negative of an image.
The basic idea behind the serial algorithm is to read the pixels, and subtract the RGB values from 255.
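The idea can be sketched as follows (a minimal sketch: the repository's actual Pixel and Image classes differ, and the flat byte-buffer layout here is my assumption):

```cpp
#include <cstdint>
#include <vector>

// Serial negative: every channel value becomes 255 minus the input.
// Pixel layout is assumed to be a flat array of R, G, B bytes.
void negate_serial(std::vector<uint8_t>& img) {
    for (std::size_t i = 0; i < img.size(); ++i)
        img[i] = 255 - img[i];
}
```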
In the case of image processing, every pixel is inherently independent of the others, so computations can be performed simultaneously on every pixel, provided we have the required supporting hardware.
The hardware I am using has the following specifications:
Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz
Cache Size: 3072 KB
RAM: 4GB
The original image can be found at this dropbox link
(This is because WordPress does not allow you to share BitMap images directly.)
As a first step, I executed the original serial code and got the following result:
$ time ./new -i sample.bmp -o new.bmp
real    0m0.118s
user    0m0.288s
sys     0m0.016s
Rather than jumping directly to the OpenMP method of optimization, I first tried implementing basic serial techniques like loop unrolling.
Loop unrolling basically asks your for loop to perform the same operation multiple times within a single iteration, thereby reducing the total number of iterations. This usually does not show a substantial improvement for small loops.
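As a sketch, unrolling the negative loop by a factor of four could look like this (the buffer layout is assumed; this is not the repository's exact code):

```cpp
#include <cstdint>
#include <vector>

// Loop unrolled by 4: each iteration handles four channel values,
// cutting the loop-condition checks by roughly a factor of four.
void negate_unrolled(std::vector<uint8_t>& img) {
    std::size_t i = 0, n = img.size();
    for (; i + 4 <= n; i += 4) {
        img[i]     = 255 - img[i];
        img[i + 1] = 255 - img[i + 1];
        img[i + 2] = 255 - img[i + 2];
        img[i + 3] = 255 - img[i + 3];
    }
    for (; i < n; ++i)  // remainder when the size is not a multiple of 4
        img[i] = 255 - img[i];
}
```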
(The changes to the code have been commented in my GitHub repository .)
As expected, the time analysis was the following:
$ time ./new -i sample.bmp -o unroll.bmp
real    0m0.106s
user    0m0.250s
sys     0m0.05s
One possible optimization could be SIMD (single instruction, multiple data): reading multiple pixel values at a time and performing one operation on all of them (here, the subtraction from 255). This should theoretically work, since pixel values are stored sequentially in memory. Non-SIMD image processing fills only a quarter of a 32-bit register with each 8-bit channel value.
The Pixel class in the Pixel.h file defines a pixel as a set of R, G, B values, which are essentially unsigned characters. To perform SIMD, we can convert this to an unsigned integer, which should let us read 4 bytes at a time instead of 1.
To test this out, every subsequent call to getR, getG, getB, get, set, etc. (basically all R, G, B manipulations) should be type-cast or converted to unsigned int.
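A sketch of the packed idea (the packing order is my assumption, not the repository's actual Pixel class): since 255 - x is the same as x ^ 0xFF for an unsigned byte, four packed channels can be inverted with a single 32-bit XOR.

```cpp
#include <cstdint>

// Four 8-bit channels packed into one 32-bit word: one XOR against
// all-ones inverts every byte at once, because 255 - x == x ^ 0xFF.
uint32_t negate_packed(uint32_t four_channels) {
    return four_channels ^ 0xFFFFFFFFu;
}
```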
A theoretical speedup by a factor of four to eight can be expected. I haven't carried this out myself, but I see no reason why it would not work. The interested reader can consult this document for detailed instructions on implementing it.
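For the curious, here is a minimal SSE2 sketch of the same trick on 16 bytes at a time. This is my own guess at what an implementation could look like, not code from the repository:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Negate 16 channel bytes per iteration: load, XOR against all-ones
// (255 - x == x ^ 0xFF), store back. A scalar loop handles the tail.
void negate_sse2(uint8_t* data, std::size_t n) {
    const __m128i ones = _mm_set1_epi8((char)0xFF);
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i px = _mm_loadu_si128(reinterpret_cast<__m128i*>(data + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(data + i),
                         _mm_xor_si128(px, ones));
    }
    for (; i < n; ++i)  // leftover bytes
        data[i] = 255 - data[i];
}
```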
The serial code compiled with -O3 optimisation gives the following result:
$ g++ -O3 -o new main.cpp
$ time ./new -i sample.bmp -o new.bmp
real    0m0.034s
user    0m0.028s
sys     0m0.000s
This answer on StackOverflow serves as a good reference on the use case of -O3 optimisation.
Let’s see if we can do better than this.
On looking at the Image.h file, we find a write function of complexity O(n^2). On trying to optimize it, either by using parallel sections (since the three operations of fetching and setting the values for R, G, B are independent) or by declaring the for loop as a parallel loop, we get race conditions.
One advantage of learning parallel computing via image processing is that whether or not you are doing something wrong can be easily seen in the output.
This is the resulting garbled image. 😛
Clearly, we had a data race. It was taken care of by declaring x as private.
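A sketch of the fix (the loop shape and names here are assumptions modeled on the write function, not the repository's exact code): with x declared outside the loop, every thread shares a single copy of it and the pixel indices interleave, unless it is marked private.

```cpp
#include <cstdint>
#include <vector>

// Compile with -fopenmp. Without private(x), all threads share x and
// race on the inner index, producing the garbled image; private(x)
// gives each thread its own copy.
void negate_rows(std::vector<uint8_t>& img, int width, int height) {
    int x;  // declared outside the loop, as in the original code
    #pragma omp parallel for private(x)
    for (int y = 0; y < height; ++y)
        for (x = 0; x < width; ++x)
            img[y * width + x] = 255 - img[y * width + x];
}
```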
But again, on executing it, we didn't get the desired output (efficiency-wise).
$ time ./new -i sample.bmp -o new.bmp
real    0m0.117s
user    0m0.276s
sys     0m0.028s
This is because multiple threads were attempting to write to the line vector at the same time, which led to false sharing: even though the indices were unique, when neighbouring elements of the vector sit on the same cache line, each thread's write invalidates that line in the other cores' caches, leading to higher access times (due to more cache misses).
This made me look closer at the code to find the main hotspot: the for loop in the main.cpp file, which performs most of the computation.
On executing this for loop in parallel, while taking care of private and shared variables, we get the following optimization.
$ time ./new -i sample.bmp -o new.bmp
real    0m0.025s
user    0m0.036s
sys     0m0.000s
Running the same code with an O3 optimisation gives us:
$ time ./new -i sample.bmp -o new.bmp
real    0m0.017s
user    0m0.032s
sys     0m0.000s
Thus, one #pragma omp line was all it took.
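For reference, the change amounts to something like this (a sketch of the hotspot loop in main.cpp; the function and variable names are assumptions):

```cpp
#include <cstdint>
#include <vector>

// Compile with -fopenmp. Every channel negation is independent, so a
// single pragma splits the iterations across the available threads.
void negate_parallel(std::vector<uint8_t>& img) {
    const long long n = static_cast<long long>(img.size());
    #pragma omp parallel for
    for (long long i = 0; i < n; ++i)
        img[i] = 255 - img[i];
}
```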
This is the desired output.
All the images in this post can be found in my DropBox folder, and all the code in my forked repository.
For more information and references on a parallel approach to image processing, the reader can look at this highly illustrative article by Henry A. Gabb and Bill Magro.