## Of Sets and bitmaps

When we represent sets, we have many options. We can use a language-specific primitive, like std::set<T> (which is likely list- or tree-like in its detail), or use a bitmap that marks, for each element (and therefore assumes that there is an universal set that contains all elements) whether or not it is included in the set. Bitmaps are simple to implement (especially when one uses something like std::vector<bool>) but need an amount of memory proportional to the universal set, not to the actual subset you’re trying to encode.

We can also use lists, or interval lists. But which one is the most efficient? Under what conditions? Let’s have a look.

Let us start by enumerating the hypotheses we will use. First, sets are represented relative to a universal set (usually denoted $U$ in introductory texts), of cardinality $|U|=n$. The sets we represent are subsets of $U$, written $A\subseteq{}U$; therefore $0\leqslant{}k=|A|\leqslant{}|U|$. A (sub)set can be sparse (very few elements) or dense (almost all elements).

The representation we will examine:

• Bitmap. Uses $n$ bits for a universal set of $n$ elements, regardless of actual contents of sets. An empty set uses as much as a full set. Easy to implement (std::vector<bool>).
• List. Uses an amount of memory that depends on the number of elements in the set. In the experiment, I supposed that we could code each integer in the list using $\lceil \log_2 n \rceil$ bits, and pack them in memory (i.e., without padding).
• Interval List. Represents the set by compressing consecutive elements onto intervals (ex: 4,5,6,7 becomes [4,7]). Intervals use $2\lceil \log_2 n\rceil$ bits. This representation uses an amount of memory proportional on the number of intervals. Uses very little memory for almost empty or almost full subsets.

The experimental setup is as follows. The universal set contains 2000 elements. I generated random sets with $0\leqslant{k}\leqslant{2000}$ elements (by inserting $k$ bits to 1 and then shuffling them with Fisher-Yates). I generated 100 random sets for each $k$, as to get meaningful-ish averages. For each instance of an experiment (for fixed $k$, one of 100), I computed the size of the interval list on this specific bitmap (the size of the list can be computed from $k$ alone; and the bitmap size is fixed).

Varying $k$ from 0 to 2000 (with 100 instances for each $k$) yields the following graph:

Where the green line is the fixed 2000-bits bitmap size, blue is the list size, and gray+red the interval lists. Since the number of intervals varies at random (but strongly correlated to the density), we have an average, a likely minimum and a likely maximum size.

If we look at the graph, we see that the list does better than the bitmap up to approx. 200 elements. The interval list does well with very sparse and very dense sets, does worst than list up to half-density (where we have a maximal number of intervals), then, as density grows, it does much better than list: it takes advantage of consecutive elements.

*
* *

If $n$ is very large and we have very few elements (or almost all), we can favor the interval list representation; otherwise, the bitmap representation is preferable. At one bit per element it does worse only when it is extremely sparse or extremely dense. One could therefore imagine switching from one form to the other depending on the number of elements contained in the set. Determining when to switch can be done analytically, or just by comparing the list size to the bitmap size (from interval list to bitmap) or by estimating the number of gaps (and therefore the number of intervals) by counting the number of elements missing from the set (from bitmap back to interval list).

### 2 Responses to Of Sets and bitmaps

1. You omitted differential coding which is likely best in some instances.

• We could estimate the code length using, say, universal or Golomb coding, but I am not sure what the distribution of spaces is (not geometric).