## OpenMP and The N Queens Problem

Now that we all have multi-core CPUs (multiple cores, simultaneous multi-threading, with uniform memory, etc.) I think it’s about time we really get the hang of it.

However, correctly threading applications is hard in general, and not all applications can gain significantly from parallelism. But some applications are embarrassingly parallel by their nature, and in this case, breaking down the problems into independent sub-problem is not hard at all, often requiring little more synchronization than waiting for all worker threads to finish.

The best possible case is where all the left-hand sides are write-only and the right-hand sides read-only (and right-hand sides do not depend on left-hand sides). In this case, you only need to chunk-up the work into n threads, launch them, and wait for them to finish. And, lucky me, a lot of scientific computations (but not all, I’m not that lucky) are like that.

One possible example easily parallelizable problem is the K-means algorithm. You can reduce complexity from $O(m n\lg n)$ (for $m$ prototypes/centroids, $n$ training exemplars, you will need $O(\lg n)$ iterations to get it to converge reasonably, and each iteration costs you $O(m n)$ because, in general, finding the closest prototype amongst $m$ for a given exemplar is $O(m)$—there’s no magic to it). If you have $c$ cores, you can reduce it by a factor of $c$ by finding the nearest neighbors of $c$ exemplars at the same time. The (expected) speed-up is $O(c)$. Unfortunately, as $c$ is a constant, you only get constant speed-up.

But not all problems can be broken down so easily; some will need synchronization as some of the right-hand sides will depend on previously (or concurrently) computed left-hand sides. Remember, in Is Python Slow? I used the generalized 8 Queens problem to test the efficiency of different programming languages. The puzzle is solved by using a classical state space search (with some machine-efficient optimizations).

If we had an infinite number of cores (all accessing the same memory), we could reach all solutions for an $n\times{n}$ board in $O(n)$, since we would need to start at the root of the state search tree, spawn $n$ threads corresponding to all valid sub-problems. Each new thread would in turn spawn $n-1$ threads, each corresponding to still potentially valid solutions, and so on, until each thread can only spawn a small, constant, number of sub-problems corresponding to the last queen to place on the last row of the board. This creates $O(n!)$ threads, but reaches all solutions in linear time.

Alas!, we do not have an infinite number of cores. The solution is to split the problem into $O(c)$ sub-problems, and have each sub-problem resolved by classical recursion, without spawning more threads. You don’t get linear time, but you get an $O(c)$ speed-up, which is better than a swift kick in the butt.

*
* *

Once you have figured out what kind of parallelism you can afford for your application, the goal becomes how to get parallelism without much effort. There are many possible solutions, not all equally convenient. One is to use a language that allows seamless threading by its very nature. There’sn’t that many, and they’re not popular anyway. If you use C, C++ or another language of that family, threading is available only through additional libraries. If you use C or C++, one possibility is to use the POSIX threads (pthreads). If you’re rather a C++ person, you can use Boost threads, which are surprisingly easy. They provide thread objects (instances of class boost::thread) to which you pass any callable object to do the work.

Even easier, when the problem lends itself to it, is OpenMP. OpenMP is a C/C++ compiler extension that uses preprocessor directives (“pragmas”) to indicate where and what kind of parallelism your program will use. The simplest construct is the parallel-for. For example:

#pragma omp parallel for
for (int i=0;i<dx;i++)
solutions[i]=solve(i);


will somehow break the for in many separate threads to maximize the amount of work performed in parallel. In this simple example, we have the best possible case where all left-hand sides are write-only (we assign solutions[i] after solving the i-th problem, and all sub-problems are independent.

If the left-hand side need some kind of (simple) synchronization, we can use the reduction directive:

#pragma omp parallel for reduction(+:solutions)
for (int i=0;i<dx;i++)
solutions+=solve(i);


Here, it instructs the compiler that solutions is to be shared between sub-problems, and how: it’s a reduction using +. This will cause the compiler to maintain a number of intermediary sums (one for each thread) and combine the partial sum as threads complete.

If it’s not that convenient to simply reduce, we can revert to the previous example and transform (usually) slightly the code to accommodate write-only left-hand sides, and combine the results manually after.

*
* *

In many cases, just using a few #pragmas (and the appropriate compiler-dependent switches) will let you access, without much more effort, pretty much all the parallelism your computer can give you—well, CPU-based only, it doesn’t deal with the GPU at all. All the performance, modulo, of course, the scheduling of the threads, in which there is some inefficiency. If you do not specify how to schedule the threads (i.e., how many threads and how many iterations each should get), OpenMP will adapt to the current computer and determine, after a few rounds, how to schedule the threads for maximal speed-up.

The caveat is that it does take some time to adjust scheduling, so really short tasks will probably yield confused scheduling and a slow-down rather than a speed-up. For really short tasks, it is über important to understand how OpenMP scheduling works to get the best results. If your tasks are rather lengthy (as is searching a whole solution sub-graph, for example), then you may not need to worry too much about scheduling as the adaptation is still rather fast (100ms to 1s?) and OpenMP will adjust correctly.

*
* *

So, back to our $n$ Queens problem. The main loop, in the sequential version looks like:

void solve( int niv, int dx )
{
if (niv)
{
for (int i=0; i<dx; i++)
if (test(niv,i))
{
mark ( niv,i );
solve( niv-1, dx );
mark ( niv,i );
}
}
else
for (int i=0; i<dx; i++)
solutions += test(0,i);
}


Of course, the temptation is to put an OpenMP parallel for pragma just before the for, but OpenMP doesn’t deal well with recursion (and, even if it did, it would lead to $O(n!)$ threads being created (as we discussed above (a fork bomb (oh noes!!!)))). If you have a look at the (complete) source code (available here), you see it relies on a couple of global variables. For single-threading, it works fine, but for multi-threading, we need private copies of all those variables if we’re to avoid spending all our time locking/unlocking for update. The simplest transformation is to pass them all as arguments:

uint64_t solve( int niv, int dx,
uint64_t diag45,
uint64_t diag135,
uint64_t cols)
{
uint64_t solutions=0;
if (niv)
{
for (int i=0; i<dx; i++)
if (test(niv,i, diag45, diag135, cols))
solutions+=solve ( niv-1, dx,
diag45 | (1ull << (32+i-niv)),
diag135 | (1ull << (niv+i)),
cols | (1ull << i ));
}
else
for (int i=0; i<dx; i++)
solutions += test(0,i, diag45, diag135, cols);

return solutions;
}


thus creating a private copy for each state. The code is a lot less elegant, but it does essentially the same thing, without global variables, returning the number of solutions as a local variable (that also was global before).

So now we have a thread-friendly version, and we can launch independent threads:

uint64_t meta_solve( int dx )
{
uint64_t solutions=0;
int niv=dx-1;

#pragma omp parallel for reduction(+:solutions)
for (int i=0;i<dx;i++)
solutions+=solve(niv-1,dx,
(1ull << (32+i-niv)),  // diag45
(1ull << (niv+i)),     // diag135
(1ull << i )           // cols
);

return solutions;
}


Which instructs the compiler to launch a certain number of threads and to reduce the number of solutions.

*
* *

Is this efficient? Let us look at the times on a 4-core i7 960 @ 3.07GHz (total of 8 threads). In abscissa, you have the puzzle size and in ordinates, the time, in second, elapsed to solve the problem:

From this graph, OpenMP seems to be doing a much better job than just the sequential implementation. Let us have a look at the low end, this time using log-scale:

Until about $n=8$, the sequential implementation is much faster than the OpenMP version. Again, it’s the overhead due to thread management and adaptation to the machine’s characteristics, as we will discuss in a moment. Looking at speed-ups:

Below $n=8$, the speed-up is in fact a slow-down. But as the puzzle size grows, the speed-up gets better, but oscillates, peaking at $n=16$. Why?

*
* *

Of course, there are some inefficiencies. A part comes from the scheduling, but a part comes from how the problem is split. Since OpenMP will thread at the sub-task level (and won’t be overly smart about it), it may be that the last tasks to be executed at the same time are considerably fewer than the number of available cores. For example, if I have 8 cores and 11 tasks, it may do the first 8 in parallel, then the next 3, leading to a substantially sub-optimal performance. One could compare efficiency and speed-up for different scenarios (for example, giving only 4 cores to the process leads to higher efficiency with the last batch using 3/4 of the cores rather than only 3/8, but also to a less interesting speed-up of at most ×4 rather than at most ×8). And we see some indication of this in the speed-up graph above.

Let us also note that the speed-up isn’t quite 8 as the threads aren’t independent cores, but pairs of threads sharing the same core. A thread sharing a core with another thread will use the extra superscalar capacity of the core, and this seems to be around 30% (and speedups around ×6 are compatible with this hypothesis).

*
* *

You can d/l the OpenMP version here. Using G++:

#!/bin/bash

g++ -O3\
-Wall\
-Wextra\
-fopenmp\
super-reines-openmp.cpp\
-o super-reines-openmp


will compile the executable. If you want to contribute a Visual Studio version of the build, please share it in the comments.

### 4 Responses to OpenMP and The N Queens Problem

1. Tadel says:

Can I get the Visual Studio version?

• I do not have a visual studio version, but the modifications are minimal, if any are needed

2. This is what the work looks like with a 12 core server (24 threads. 2 x Intel Xeon E5 2620V2 which is 2 x (6 x 2,10 GHz); HT):

At the last stage for a shorter time it was like this then: http://abload.de/img/damen_24threads_phasec9un4.png

Output of:
time ./super-reines-openmp 18
18 666090624 506.779297

real 8m26.782s
user 133m16.971s
sys 0m0.933s

• That’s quite a machine you’ve got yourself!

And yes, OpenMP parallelizes for somewhat heuristically. Some threads will have less work than others, and when you have more CPU/Hardware threads than OpenMP threads, it won’t be able to figure out how to use all hardware threads.