## Drawing Random Numbers from Disjoint Ranges (Generating Random Sequences V)

On a number of previous installment of this series, I’ve discussed permutations, generating uniform random points in a triangle, on a sphere, the inversion method, even recycling expensive random bits. In this entry, I will use the rejection method to generate random numbers from disjoint ranges.

For simplicity, let us consider only the uniform case, where all values are equally likely to be drawn. So instead of drawing a number $x$ from a range, say $[1,10]$, we’re interested in the case where the set is composed from several intervals, something like $[1,10] \cup [20,30] \cup [40,50]$. We may think of drawing uniformly from $[1,50]$ and retry if we “fall in the gaps”, that is, if we draw a number in $[11,19]$ or $[31,39]$.

A first, it doesn’t seem like a bad plan, especially if the “holes” are rather small and easy to miss—that is, we have a good chance of hitting in an allowable range in a very few tries. For example, if the ranges we’re interested in look like $[1,10]$ and $[95,100]$, drawing from $[1,100]$ really doesn’t sound like a good idea.

But what if we compacted the ranges?

Instead of drawing from $[1,100]$ then use rejection, why not first analyze the permissible ranges to see that in fact, the domain contains the same numbers of values than $[1,16]$. Drawing from $[1,16]$ in one shot will allow us to do something like:

u=random(1,15);
if (u>10)
return u+84; // -11+95
else
return u;


The scheme generalizes, of course, to any number of (sub)ranges. The good news is that the if cascade that follows needs only to be $O(\lg r)$ deep, if $r$ is the number of ranges.

OK, so maybe an example with only two ranges isn’t that convincing. Let’s see how it’d work if we had, say, five ranges. We want to compact the ranges, and graphically, this would look like:

where the light gray dashed regions are unallowed ranges. Let us say that, left to right, the ranges have 20, 8, 5, 12, 5 values respectively, for a total of 50, and they correspond to ranges $[0,19]$, $[24,31]$, $[50,54]$, $[60,71]$, and $[80,84]$. We need to find where the original ranges start/end in the compacted range; but for this, we need only a cumulative sum. The first one starts at zero (one could start at 1, but that’s not very convenient for many reasons), the second starts at 20, the third at 28, the fourth at 33, the fifth one at 45. We now have our bounds for the if cascade. The code would then look like:

int fragmented_range_random()
{
int u=rand() % 50;

if (u<33)
if (u<20)
return u;
else
if (u<28)
return u-20+24;
else
return u-28+50;
else
// u>=33
if (u<45)
return u-33+60;
else
return u-45+80;
}


*
* *

So the next fair question is “when is it actually profitable to do so?”. Well, we can answer that, but we need some maths (oh noes!).

The preprocessing phase is done once, in linear time in the number of ranges. The if cascade has a depth $O(\lg r)$, where $r$ is the number of ranges. The complexity of drawing a random number with this method is therefore $O(\lg r)$.

For the case of rejection, that is, drawing on the whole range until we finally pick an allowable value, and for the needs of the demonstration, the ranges do not need to be contiguous. We only need to know the ratio of the number of allowable values to the total number of values. In the last example, we have 50 different values on a range of 85 ($[0,84]$). This gives us a probability of success of $p=\frac{50}{85}=\frac{10}{17}\approx{0.588}$. Retrying until we succeed corresponds to a geometric variable, and from classical probability theory, we know that the average number of retries before a success is given by $\frac{1}{p}$. So In average, we would need $\approx{1.7}$ retries for our example.

However, if for the example rejection works better, as the number of ranges grow, and the probability $p$ of success diminishes, the compacted version may work better. In fact, if we have

$\displaystyle\frac{1}{p}\geqslant\lg{r}$,

then the compacted range method is better—assuming we can determine if a value is acceptable in $O(1)$. Otherwise, simple rejection is better.

*
* *

The rejection method may or may not implemented easily. At one point, when the ranges are very fragmented, but with a relatively high value of $p$, one should consider simply using a bitmap to indicate in $O(1)$ whether or not the value is allowed. If you draw random numbers on 32 bits, you’d need 512MB of storage to hold the bitmap, which is cumbersome, but not that much considering the current state of affairs.

If the ranges considered do not lend themselves to a bitmap-based implementation (because the range is too big), you will have to resort to explicit testing of the drawn values. If the rule is somewhat complicated, you may still end up with an $O(\lg r)$ if cascade anyway (or even worst).

*
* *

<disgression> There are cases where the rejection method does not make sense at all. In the comments, we remarked that the volume of a sphere shrinks rapidly in relation to its bounding box. If we draw from the bounding box then reject if we draw outside the sphere, things get nasty quickly. Indeed, if $H_n=2^n$ is the volume of the hypercube in $n$ dimensions, and $V_n=\pi^{n/2}/\Gamma(\frac{n}{2}+1)$ is the volume of the sphere, we have

$p_n=\displaystyle\frac{V_n}{H_n}$

which is moderate for smallish $n$, but

$\displaystyle\lim_{n\to\infty} p_n=\lim_{n\to\infty}\frac{V_n}{H_n}=0$.

In other words, as $n$ grows, the number of retries goes to infinity.</disgression>

*
* *

If you read this far, you’re probably asking yourself what’s the point and/or when do we need to do that anyway. Well, I do have an application for this. Let’s say you have to pick a random globally routable Internet IP (v4) address. That is, you can’t draw 10.x.y.z nor 192.168.x.y, nor any of the loop-back addresses, nor reserved, nor… In fact, there are a whole lot of reserved addresses that may not make sense for a given application. If you want to draw these addresses, you could use rejection or compacted ranges.

But in this case, I have one good argument in favor of compacted ranges. First, if you do rejection with that many conditions (as in this table), you’ll end-up with a beefy if cascade anyway. If you’re going to have one, might as well not iterate over it. Moreover, the constants used in the if cascade are compile-time, so even if you have something like return u-192+2124, the compiler will fold the constants for you (and apply any optimization it sees fit).