The Zune Freezes: More on Unit Testing

Do you remember the epic fail that bricked the Zune for a whole day on the last day of last year, a leap year? I described here and here how this error could have been entirely avoided using basic unit testing.

You probably remember (if you read the original post) that I first claimed that it’d take a few seconds to check all possible dates, but in fact it ended up taking something like 90 minutes. This week, I come back to unit testing a very large domain under a time constraint.

In that previous post, I tested the values one by one until I (well, the program) exhausted all input values. This ensures that every value passed into a given piece of code returns the correct result. But while this exhaustive testing increases the confidence one can have in one’s own code, it has at least two major shortcomings, at least in the specific form I presented.

First, it is exhaustive. At first, exhaustive sounds good: there’s no need to identify particular edge cases because you’re going to test them eventually anyway. But that also means that you’ll test plenty of essentially uninteresting cases as well. Nothing special happened on the Zune until you hit the edge case, that is, the last day of a leap year. So mindlessly testing all possible cases, while simple to write, is very slow. You’re not the one pushing bits around, of course, but 90 minutes is quite a lot of computing time to spend on a single test of a single function.

Second, it cannot readily detect infinite loops and exit gracefully, that is, notice that it is stuck and report an error. One way of constraining run-time (if you’re using a *nix box of some sort and Bash) is to use the ulimit shell built-in. For example, typing:

$ (ulimit -t 2  -St 1; date-time-unit-test )

gives:

CPU time limit exceeded

because when the soft time limit (set by -St 1) is reached, the OS starts sending SIGXCPU signals to the process. The default action for SIGXCPU is to terminate the process, and the shell then reports the CPU time limit exceeded message. If the process is really locked up (or explicitly catches or ignores SIGXCPU), it is terminated with a Killed message when the hard time limit is reached.

Using ulimit hands the responsibility for setting time limits to the shell script that launches the unit tests. The script can then report whether each unit test succeeded or failed and, in case of failure, which signal ended it (SIGSEGV, SIGXCPU, etc.), so that it is easier to isolate the bug the next morning.
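
The same division of labour can be sketched in C, for a launcher that is itself a program rather than a shell script. This is only a sketch under assumptions: run_with_cpu_limit, passing_test and stuck_test are hypothetical names that do not come from the original test suite, and the Bash-style 128+signal return value is a convention I am borrowing. The child process sets its own CPU limits with setrlimit (the system call behind ulimit -t) and runs the test; the parent waits and translates how the child ended into a status code.

```c
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-ins for real unit tests. */
void passing_test(void) { }
void stuck_test(void)
 {
  volatile unsigned long spins = 0;
  for (;;) spins++;   /* burns CPU forever */
 }

/* Runs unit_test in a child process under soft/hard CPU-time
   limits (in seconds). Returns the child's exit status,
   128+signal if a signal ended it, or -1 if the launcher
   itself failed. */
int run_with_cpu_limit(void (*unit_test)(void),
                       rlim_t soft_seconds,
                       rlim_t hard_seconds)
 {
  pid_t child = fork();
  if (child < 0) return -1;

  if (child == 0)
   {
    /* child: constrain ourselves, then run the test */
    struct rlimit limit;
    limit.rlim_cur = soft_seconds;
    limit.rlim_max = hard_seconds;
    if (setrlimit(RLIMIT_CPU, &limit) != 0) _exit(127);
    unit_test();
    _exit(0);
   }

  /* parent: wait and report how the child ended */
  int status;
  if (waitpid(child, &status, 0) < 0) return -1;
  if (WIFSIGNALED(status)) return 128 + WTERMSIG(status);
  return WEXITSTATUS(status);
 }
```

With a soft limit of 1 second and a hard limit of 2, run_with_cpu_limit(stuck_test, 1, 2) should come back with 128+SIGXCPU, while a test that returns normally yields 0. The launcher can map that value back to a signal name when it writes its report.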

Another possible technique is to use watchdog timers, if they are available in your test environment. A watchdog timer counts down a set amount of time and triggers a specific event when it reaches zero: the program may be aborted with a verbose error message, or a specific signal may be sent to it. To avoid being killed by a watchdog timer reaching zero, the program resets the timer whenever it can.

In our Zune example, the unit test program would reset the watchdog timer each time a pair of encode/decode date functions is called. If either function gets stuck in an infinite loop, the timer reaches zero and the program is terminated, maybe gracefully, giving it time to report a specific error code.

The POSIX signal SIGVTALRM is meant to do just that. On Linux, you can use signal (defined in <signal.h>) to set up a callback on SIGVTALRM. You would do something like:

sighandler_t previous_handler = signal(SIGVTALRM, my_sighandler);

to set the signal handler, then call timer_create to create the timer and finally timer_settime to set its characteristics. The body of the unit test launcher would look something like:

#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <time.h>

//////////////////////////////
//
// Handles a signal (in our case,
// SIGVTALRM)
//
void my_sighandler(int this_signal)
 {
  printf("oh noes! got a SIGVTARLM\n");

  // Bash-esque return value
  exit(128+this_signal); 
 }



//////////////////////////////
//
// Main program that sets up
// the signal handler and the
// timer, and would launch the
// unit tests as well.
//
int main()
 {
  // set event-handler
  //
  if (signal( SIGVTALRM, 
              my_sighandler)!=SIG_ERR)
   {
    timer_t timer_id;
    struct sigevent signal_event;

    // creates a timer for
    // say, 2 seconds.
    //
    signal_event.sigev_notify = SIGEV_SIGNAL;
    signal_event.sigev_signo = SIGVTALRM;
    signal_event.sigev_value.sival_ptr = &timer_id;
    
    if( timer_create( CLOCK_REALTIME,
                      &signal_event, 
                      &timer_id
                      )==0)
     {
      struct itimerspec timer_specs, timer_out_specs;
      // we created a timer!
      // let us set granularity
      // (and make it repeat only 
      // once)
      
      // initial delay
      //
      timer_specs.it_value.tv_sec=2;
      timer_specs.it_value.tv_nsec=0;

      // repeat delay
      // (zero means no repeat)
      //
      timer_specs.it_interval.tv_sec=0;
      timer_specs.it_interval.tv_nsec=0;

      if ( timer_settime( timer_id,0,
                          &timer_specs,
                          &timer_out_specs)==0)
       {

        // ok, timer is set, 
        // let's eat the CPU!
        // (let's pretend the 
        // unit test is launched
        // from here)
        //
        while(1);


        // would exit normally 
        // if reached at the end
        // of the unit test
        return 0; 
       }
      else fprintf(stderr,"error setting timer properties!\n");
     }
    else fprintf(stderr,"could not create timer!\n");
   }
  else fprintf(stderr,"could not set handler!\n");

  return 1;
 }

What you put in the signal handler is up to you, but I would suggest printing something informative and returning a standard-looking error code (such as those here).

When you compile the above code snippet to timer-test (using -lrt to link the library that contains the timer calls), you get the following output:

$ timer-test ; echo $?
oh noes! got a SIGVTARLM
154

and 154 is indeed 128+SIGVTALRM. Success!

*
* *

Exhaustive testing can take a very long time. With the Zune example, we’re lucky because the domain isn’t incredibly vast and computers are fast enough nowadays to run 2^{32} calls to a simple function (or a couple of them) in a matter of mere hours.

Since you’re not especially interested in finding all edge cases by hand, you could enumerate the test values in an order that pushes the bad cases toward the front of the list. For example, one could first enumerate timestamps that correspond to whole elapsed days, then hours (even with redundancies), then minutes, and at last seconds. The Zune leap day problem would have been caught very rapidly during the first phase of the unit test, and everybody would’ve been much happier.
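
A minimal sketch of that coarse-to-fine ordering, assuming timestamps are seconds counted from zero. Everything named here is hypothetical: check_fn is whatever round-trip check the test suite provides, coarse_to_fine is a name I made up, and fails_on_day_366 is a stand-in that fails for any timestamp falling on the 366th day, loosely mimicking the leap-year edge case rather than reproducing the Zune code.

```c
#include <stddef.h>
#include <stdint.h>

/* A check returns nonzero when the timestamp is handled
   correctly, zero when it is not. */
typedef int (*check_fn)(uint32_t timestamp);

/* Hypothetical stand-ins for a real round-trip check. */
int always_ok(uint32_t t) { (void)t; return 1; }
int fails_on_day_366(uint32_t t) { return t / 86400u != 365u; }

/* Walks [0, limit) four times: by whole days, then hours, then
   minutes, then seconds (redundant values are simply retested).
   Stops at the first failure, stores it in *first_failure and
   returns 1; returns 0 if every call passed. *calls counts how
   many times check was invoked. */
int coarse_to_fine(check_fn check, uint32_t limit,
                   uint32_t *first_failure, uint64_t *calls)
 {
  static const uint32_t strides[] = { 86400u, 3600u, 60u, 1u };

  *calls = 0;
  for (size_t s = 0; s < sizeof strides / sizeof strides[0]; s++)
   for (uint64_t t = 0; t < limit; t += strides[s])
    {
     ++*calls;
     if (!check((uint32_t)t))
      {
       *first_failure = (uint32_t)t;
       return 1;
      }
    }

  return 0;
 }
```

With the stand-in bug, the failure surfaces at timestamp 365 × 86400 = 31,536,000 after only 366 calls, where a plain sequential scan would need more than 31 million calls to get there.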
