Initializing Arrays

Initializing large arrays of data before use is always cumbersome but it seems to be unavoidable.

The first types of solutions fill the array with a value that denotes an empty entry. While this sounds straightforward, the choice of that value is not always easy. For floating points numbers, zero may or may not be a good candidate. If convenient (as a zero floating point value is encoded as binary zeroes filling the variable) it may be difficult in some contexts to use because zero may be valid. Fortunately, there’s always the possibility to initialize the array with NaNs, which can be tested and used correctly (but you need the POSIX functions to do that).

For integers, the situation seems more complicated as there are no reserved encoding for NaN: all the possible values represent a valid integer. Zero is favored as a NaN, but there are other solutions, which we will explore.

The second types of solutions are based on methods of marking array entries as occupied or freed.

The simplest solution is a bitmap. A bitmap (or bit arrays) will use a bit for each entry in the array, usually in a storage format that uses one bit of memory per logical bits in the bitmap. A zero bit determines that the corresponding entry is free, a one bit that it is used. This might be faster since filling a bitmap with zeroes takes a fraction of the time to fill the array. For starter, it’s at least eight times smaller, and grows even smaller as the size of the entries grows. It also has the nice side effect to manipulate less memory, and doesn’t mess up the caches if the bitmap’s not too big.

If you don’t care much about the order in which an array’s contents are stored, you can use a stack-like structure. To add an item, you just push it on top. Deleting the last item proceeds just like with a stack, you pop it by decrementing the top pointer.

#include <iostream>
#include <algorithm>

template <typename T>
class lazy_array
{
 private:

     int stack_size;
     T * stack;
     int stack_ptr;

 public:

 T & operator[](int x)
  {
   if ((x>-1) && (x<=stack_ptr))
    return stack[x];
   else
    ; // throw
  }

 void add(const T & x)
 {
  if (stack_ptr<stack_size-1)
   {
    ++stack_ptr;
    stack[stack_ptr]=x;
   }
  else
   ; // throw
 }

  int size() const { return stack_ptr+1; }

  void del(int offset)
   {
    if (offset>-1)
     {
      if (offset<stack_ptr)
       std::swap(stack[offset],stack[stack_ptr]);
      else
       ; // nothing to swap (since we
         // needn't swap the top with
         // the top)

      --stack_ptr;
     }
    else
     ; // throw
   }


 lazy_array(int stack_size_)
  : stack_size(stack_size_),
    stack(new T[stack_size]),
    stack_ptr(-1) { }

 ~lazy_array() { delete[] stack; }
};

So of course this solution asks for little extra data (a stack_ptr and the stack size stored in the class), but does adding, access, and deletion in constant time but you sacrifice ordering. You can’t search for an item, find it at position i and expect the ith entry to contain the same data next time you access it.

A solution that preserves ordering was proposed by Aho, Hopcroft, and Ullman [1] and found in Programming Pearls, 2nd ed. [2]. It does preserve ordering, but at the cost of two extra array, plus a pointer.

The two extra array serve as an interlock to determine whether or not an array element is used. The initialization (in addition to allocating memory) consists only in initializing a pointer, top, to zero. The two extra arrays are named to and from (taking the names found in [2]) and are left with whatever they may contain after allocation.

If the ith data element was previously accessed, then we have that from[i]<top and that to[from[i]]=i, that is, from[i] points to an entry that points back to it. If from[i] contains a random integer, then it doesn’t point to an entry in to that points back on itself and therefore the entry is unallocated. Even if both to and from agree, they still have to be so that from[i]<top, that is, already allocated by the algorithm. The above figure shows what such an array might look like.

The proposed scheme entirely safe, provided that we first check if from[i] points somewhere inside to (below top, in fact). If it doesn’t, then the array is clearly not initialized and this entry contains garbage left over by memory allocation. An implementation is proposed by Eli Bendersky on his blog.

While this method is clever, it does require two extra array, containing pointer-sized integers, which may mean a lot of extra memory depending on the original array’s size. If the array will be used sparsely, then a very small fraction of the arrays are accessed, which may have beneficial side effects such as being cache-friendly.

The bitmap approach is therefore not too bad it seems; while it does require to be zeroed initially, it allocates a lot less memory than Aho et al.‘s solution. Like everything else, it’s about trade-offs.


[1] Alfred V. Aho, John E. Hopcroft, Jeffrey D. Ullman — The Design and Analysis of Computer Algorithms — Addison-Wesley (1974)

[2] Jon Louis Bentley — Programming Pearls — 2nd ed., Addison-Wesley (1999)

6 Responses to Initializing Arrays

  1. petke says:

    I don’t see what you gain by using your “lazy_array” class as its not lazy. In the constructor the “new T[stack_size]” will default constructs and initialize every single item. As you are using a template type T here the compiler will help you by initializing the items to zero if T is int or float etc, and initialize to false if T is bool (all according to the standard). I would suggest you use a dynamically sized container like std::vector insted if you dont want to default initailization, and to use a boost::array or std::array if you do want a fixed size container.

    • Steven Pigeon says:

      Depending on the type T, you may or may not get a default initializer. If the items are fat, T might in fact be S *, which doesn’t have much of a default initializer.

      (although you are ultimately right; using std::vector or a similar tested and well implemented containter is a much better approach; most of the times.)

  2. Fanael says:

    Minor nitpick: you don’t really need any POSIX functions to use NaNs – these values are easily obtainable from numeric_limits, testing for them is also simple, as NaN doesn’t compare equal to anything, even to itself… What you need any POSIX functions for?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: