Discrete Inversion (Generating Random Sequences XII)

While this sounds something like a shameful family secret, discrete inversion is only the finite-valued variation on the method of inversion for the generation of random numbers with a given distribution (as I’ve discussed quite a while ago here). The case we’ll consider here is a random variable with few possible outcomes, each with different odds

So let’s say we have a random variable to generate with few values (maybe corresponding to something like choices in a game) with different odds. Let be four of those choices, with probabilities p_1=0.2, p_2=0.4, p_3=0.3, and p_4=0.1. Clearly, drawing an uniform random variable on [0,1] isn’t sufficient, on its own, to choose one outcome with the desired probabilities. We have to cut the interval [0,1] into four regions1, one for each symbol, and each with a length equal to the corresponding probability. That’s actually quite easily done:

Now drawing a uniform random variable on [0,1] will let us “fall” into one of the region, thus choosing the corresponding value.

Since the regions have lengths equal to the probabilities, we are guaranteed that the uniform variable lands with the desired probabilities in each region. There! We have a technique to generate discrete variables with any kind of random distribution—even for those for which it would be hard to find a clever formula.

*
* *

OK, now, how do we translate that into code?

If we have a lot of values (and correspondingly a large number of probabilities), we would have to build a table of cumulative probabilities (the ith entry would contain the sum of all probabilities up to the ith one), and then use binary search (or better yet, interpolation search) to find where our uniformly chosen value lands.

However, if we have very few values, say, four or five, that’s a lot of work!

First, let’s setup our uniform random generator using C++11 <random> header:

using random_type=uint64_t;
using generator_type=
  std::linear_congruential_engine
   <random_type,
    random_type{6364136223846793005}, // Knuth's
    random_type{1442695040888963407}, // Knuth's
    std::numeric_limits<random_type>::max()>;

Now, we must compute the cumulative (mass) function, and search it using binary search. We will do neither! Let’s rather do this:

////////////////////////////////////////
//
// p contains the probability of each
// class, ended by zero. If the sum is
// less than one, a last, virtual choice
// is added as an "else", with correct
// probability.
//
size_t random_choice(generator_type & rand,
                     const std::initializer_list<float> & p)
 {
  float d=rand()/(float)std::numeric_limits<random_type>::max();
  auto z=p.begin();

  while (z!=p.end() && (d>*z)) d-=*z++;
   
  return z-p.begin();
 }

The variable d is uniform on [0,1] with float accuracy. Then, we walk the list. If d is greater than the current probability (not the sum of all previous probabilities, just the current one), we decrease d by the current probability, and examine the next one. If d is smaller than the current probability, we stop and have chosen the value. Let’s note that the above code doesn’t suppose that the initializer_list is normalized: if we reach beyond the end of the list, the “else” symbol is selected. That’s me at being lazy and not wanting to worry that everything sums to exactly 1.

*
* *

Now, let’s test that code to see what happens:

int main()
 {
  generator_type rand_gen(time(0));
  
  float probs[]={0.1,0.1,0.2,0.3};
  size_t nb_tries=10000000;
  size_t c[10]={0};

  for (size_t i=0;i<nb_tries;i++)
   c[random_choice(rand_gen,{0.1,0.1,0.2,0.3})]++;

  for (int i=0;i<5;i++)
   std::cout << c[i] << '\t' << c[i]/(float)nb_tries << std::endl;
 }

Prints:

1001059	0.100106
1001505	0.100151
2000932	0.200093
2998483	0.299848
2998021	0.299802

Showing that, on the long run, it does select the values with the right probabilities.

*
* *

One last remark on the cost of initializer_list. To my surprise, initializer_list shows no significant pessimization over the use of a simple, null-terminated, array. One reason is that initializer lists may be implemented using only two pointers over a (static allocated) array, and that doesn’t cost a lot. In any case, initializer_list gives some flexibility, but we’d probably also need a version taking an array (not std::vector!) as input. It’s not very hard to implement.


1 They would not need to be contiguous, but it does simplify things a lot if they are.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: