While this sounds something like a shameful family secret, discrete inversion is only the finite-valued variation on the method of inversion for the generation of random numbers with a given distribution (as I’ve discussed quite a while ago here). The case we’ll consider here is a random variable with few possible outcomes, each with different odds

So let’s say we have a random variable to generate with few values (maybe corresponding to something like choices in a game) with different odds. Let be four of those choices, with probabilities , , , and . Clearly, drawing an uniform random variable on isn’t sufficient, on its own, to choose one outcome with the desired probabilities. We have to cut the interval into four regions^{1}, one for each symbol, and each with a length equal to the corresponding probability. That’s actually quite easily done:

Now drawing a uniform random variable on will let us “fall” into one of the region, thus choosing the corresponding value.

Since the regions have lengths equal to the probabilities, we are guaranteed that the uniform variable lands with the desired probabilities in each region. There! We have a technique to generate discrete variables with any kind of random distribution—even for those for which it would be hard to find a clever formula.

*

* *

OK, now, how do we translate that into code?

If we have a lot of values (and correspondingly a large number of probabilities), we would have to build a table of cumulative probabilities (the i^{th} entry would contain the sum of all probabilities up to the i^{th} one), and then use binary search (or better yet, interpolation search) to find where our uniformly chosen value lands.

However, if we have very few values, say, four or five, that’s a lot of work!

First, let’s setup our uniform random generator using C++11 `<random>` header:

using random_type=uint64_t; using generator_type= std::linear_congruential_engine <random_type, random_type{6364136223846793005}, // Knuth's random_type{1442695040888963407}, // Knuth's std::numeric_limits<random_type>::max()>;

Now, we must compute the cumulative (mass) function, and search it using binary search. We will do neither! Let’s rather do this:

//////////////////////////////////////// // // p contains the probability of each // class, ended by zero. If the sum is // less than one, a last, virtual choice // is added as an "else", with correct // probability. // size_t random_choice(generator_type & rand, const std::initializer_list<float> & p) { float d=rand()/(float)std::numeric_limits<random_type>::max(); auto z=p.begin(); while (z!=p.end() && (d>*z)) d-=*z++; return z-p.begin(); }

The variable `d` is uniform on with `float` accuracy. Then, we walk the list. If `d` is greater than the current probability (not the sum of all previous probabilities, just the current one), we decrease `d` by the current probability, and examine the next one. If `d` is smaller than the current probability, we stop and have chosen the value. Let’s note that the above code doesn’t suppose that the `initializer_list` is normalized: if we reach beyond the end of the list, the “else” symbol is selected. That’s me at being lazy and not wanting to worry that everything sums to exactly 1.

*

* *

Now, let’s test that code to see what happens:

int main() { generator_type rand_gen(time(0)); float probs[]={0.1,0.1,0.2,0.3}; size_t nb_tries=10000000; size_t c[10]={0}; for (size_t i=0;i<nb_tries;i++) c[random_choice(rand_gen,{0.1,0.1,0.2,0.3})]++; for (int i=0;i<5;i++) std::cout << c[i] << '\t' << c[i]/(float)nb_tries << std::endl; }

Prints:

1001059 0.100106 1001505 0.100151 2000932 0.200093 2998483 0.299848 2998021 0.299802

Showing that, on the long run, it does select the values with the right probabilities.

*

* *

One last remark on the cost of `initializer_list`. To my surprise, `initializer_list` shows no significant pessimization over the use of a simple, null-terminated, array. One reason is that initializer lists may be implemented using only two pointers over a (static allocated) array, and that doesn’t cost a lot. In any case, `initializer_list` gives some flexibility, but we’d probably also need a version taking an array (not `std::vector`!) as input. It’s not very hard to implement.

^{1} They would not need to be contiguous, but it does simplify things a lot if they are.