The Zune Freezes: A Stupid, Avoidable Bug.

The few Zune users were alienated today when they discovered that their Zunes froze during boot, apparently because of a mysterious bug.

The cause of the bug was rapidly diagnosed. According to Itsnotabigtruck on Zuneboards, the bug can be traced back to a defective date conversion routine. I reproduce the code here for you:

year = ORIGINYEAR; /* = 1980 */

while (days > 365)
 {
  if (IsLeapYear(year))
   {
    if (days > 366)
     {
      days -= 366;
      year += 1;
     }
   }
  else
   {
    days -= 365;
    year += 1;
   }
 }

Can you see what’s wrong with that code?

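Well, for starters, there’s the obvious bug: on the last day of a leap year, days is exactly 366, so the outer loop keeps going (366 > 365) while the inner test (days > 366) fails. Neither days nor year changes, and it loops forever.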
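One way to repair the loop, as a minimal sketch rather than the actual firmware fix: work out the length of the current year first, then either consume the whole year or stop. The names is_leap_year and year_from_days are hypothetical stand-ins, and the sketch assumes that days is a 1-based day count starting on January 1st of ORIGINYEAR, which is what the original loop condition suggests.

```c
#include <assert.h>
#include <stdbool.h>

#define ORIGINYEAR 1980

/* Gregorian leap-year rule; a stand-in for the firmware's IsLeapYear(). */
static bool is_leap_year(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}

/* Converts a 1-based day count since January 1st, ORIGINYEAR into a year.
   On return, *days holds the 1-based day within that year. */
static int year_from_days(int *days)
{
    assert(*days >= 1); /* assumption: day numbering starts at 1 */

    int year = ORIGINYEAR;
    for (;;)
    {
        int year_length = is_leap_year(year) ? 366 : 365;

        /* The remaining days fit in the current year: we are done.
           This also covers day 366 of a leap year. */
        if (*days <= year_length)
            break;

        *days -= year_length;
        year += 1;
    }
    return year;
}
```

With this version, day 10593 (December 31st, 2008, counting January 1st, 1980 as day 1) comes out as year 2008, day 366, instead of hanging.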

But that’s a somewhat normal bug. I mean, an ordinary, silly bug that should have been caught before sending firmware updates to the poor Zune owners (they already have a Zune, isn’t that punishment enough?).

And it’s not like it’s very hard to verify a date conversion routine. You simply need to test all possible dates. I mean all of them. Of course, you’re thinking: but how can we? That’s, like, billions of possible times and dates! Yes, yes, it is. So what?

So it takes about an hour and a half on my (relatively) old AMD4000+ box to check all possible date conversions from the epoch to 2^{31}-1 seconds. 90 minutes of compute to check the correctness of the code, trap any possible bug, and not ship a defective date conversion routine that freezes the player dead.

To run the test, I cheated somewhat. First, I did not write my own time conversion routines; I used <time.h>’s localtime_r and mktime. Clearly, if I were rewriting localtime_r, I could use the standard library mktime to make sure I decode the time correctly (because I start from a time_t, expand it to a struct tm, then recompute the time_t using mktime). The verification program takes 5 minutes to write:

#include <stdio.h>
#include <time.h>

const int max_time = (1u<<31)-1; // Let's say.

int main()
 {
  // my box is 64 bits, so I don't
  // want to test 2^64-1! (int is 32
  // bits, but time_t is 64 bits!)
  //
  printf("max_time=%d\n",max_time);
  for (int time=0; time<max_time; time++)
   {
    if ((time & 0xfffff)==0)
     {
      printf("."); fflush(stdout);
     }
    time_t current_time = time;
    struct tm decoded_time;

    // decodes time. Your code
    // goes here.
    //
    localtime_r(&current_time,&decoded_time);

    // reverses the decoding
    //
    time_t recoded_time;
    recoded_time=mktime(&decoded_time);

    if (current_time!=recoded_time)
     {
      // time_t is 64 bits here, so cast and
      // print with %lld rather than %d
      //
      fprintf(stderr,"FAIL! at %lld!=%lld\n",(long long)current_time,(long long)recoded_time);
      return 1;
     }
   }

  return 0;
 }

Of course, I compile it using -O3 to make it a bit faster, and I launch it:


/home/steven/download> time a.out
max_time=2147483647
............(quite a few more)...................
real	95m53.483s
user	32m19.705s
sys	48m53.071s

And voilà, I exhaustively checked all date combinations. How hard was that… compared to the problem of unfreezing all the updated Zunes (and drowning in shame)?
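If the routine under test only deals in whole days, the same round-trip check can step one day (86400 seconds) at a time instead of one second. Here is a hedged sketch of that variant; the function name first_round_trip_failure and its parameters are hypothetical, and it still leans on the POSIX localtime_r plus the standard mktime exactly as the program above does.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Walks time values 0, step, 2*step, ... up to max_time (inclusive),
   decodes each one with localtime_r, re-encodes it with mktime, and
   returns the first value that does not survive the round trip.
   Returns -1 if every tested value round-trips. */
static long long first_round_trip_failure(long long max_time, long long step)
{
    assert(step > 0); /* a non-positive step would never terminate */

    for (long long t = 0; t <= max_time; t += step)
    {
        time_t current_time = (time_t)t;
        struct tm decoded_time;

        /* decodes time; the routine under test would go here */
        if (localtime_r(&current_time, &decoded_time) == NULL)
            return t;

        /* reverses the decoding */
        if (mktime(&decoded_time) != current_time)
            return t;
    }
    return -1;
}
```

Calling first_round_trip_failure(2147483647LL, 86400LL) tests one time value per day over the same range as the full run, which is only about 25,000 conversions.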

(from XKCD)


(And for fun, can you spot the time_t I do not check with this program?)

(Also note that to trap infinite loops, we should modify the code to include a watchdog timer, or just a counter. We know by looking at the Zune code that the number of iterations is proportional to the number of years, months, days, hours… in the current date, so it cannot loop more than a few thousand iterations. That’s also easily trappable with debug code that throws an exception (or terminates abnormally with return 1;).)
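A counter-based watchdog could look like the sketch below. It wraps the original, still buggy, loop so that a test harness gets an error back instead of a hang. The bound MAX_ITERATIONS, the -1 error value, and the names is_leap_year and year_from_days_guarded are assumptions made for illustration; none of them come from the Zune code.

```c
#include <assert.h>
#include <stdbool.h>

#define ORIGINYEAR 1980
#define MAX_ITERATIONS 10000 /* assumed bound: far more years than any real date needs */

/* Gregorian leap-year rule; a stand-in for the firmware's IsLeapYear(). */
static bool is_leap_year(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}

/* The original loop, unchanged except for the iteration counter.
   Returns the decoded year, or -1 if the watchdog trips. */
static int year_from_days_guarded(int days)
{
    assert(days >= 1); /* assumption: day numbering starts at 1 */

    int year = ORIGINYEAR;
    int iterations = 0;

    while (days > 365)
    {
        /* watchdog: too many iterations means the loop stopped making progress */
        if (++iterations > MAX_ITERATIONS)
            return -1;

        if (is_leap_year(year))
        {
            if (days > 366)
            {
                days -= 366;
                year += 1;
            }
        }
        else
        {
            days -= 365;
            year += 1;
        }
    }
    return year;
}
```

An exhaustive test can then call year_from_days_guarded for every day count of interest and treat -1 as a failure. With the bug left in place, day 10593 (December 31st, 2008) trips the watchdog instead of freezing the test.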

10 Responses to The Zune Freezes: A Stupid, Avoidable Bug.

  1. gnuvince says:

    So it’s not exactly 20 seconds, is it?

    • Steven Pigeon says:

      Maybe not 20 seconds, but still better 90 minutes than a lifetime of ridicule. When your test domain is “small”, you have no excuse whatsoever to not test it exhaustively. None. Srly.

      (to the other readers: I chatted with gnuvince last night about how long it would take to test all possible date expansions. I first estimated the time at 20 seconds, maybe 1 minute. Turns out that the time_t-based conversion routines from C’s stdlib aren’t all that incredibly fast. Not that it matters, because converting 2^{32} dates is not something you usually do.)

  2. coffee fiend says:

    the Zune meltdown would be an especially tough break for people who bought defective X-Box’s too

    • Steven Pigeon says:

      I don’t know about the Xboxes, I never owned one, so I can’t tell. I think that what we should remember about a bug like that is not so much that it should have been trapped before release, but the long-term consequences. How many Zune users are thinking “I’m never going to get a Zune ever again” because their devices “bricked”? The fact that it bricked for a day only is mostly irrelevant because we’re talking about confidence, and their trust in the product was destroyed, or at least greatly damaged, by this unexpected bricking. Especially since it came with an update, which is meant to make things work better.

  3. Steven Pigeon says:

    (also, the original discussion concerned days, therefore excluding hours, minutes, seconds. A test incrementing the time by 86400 seconds each turn runs in 50ms.)

  4. cmatthews says:

    Additionally, the errant if statement should absolutely have had an “else” (even if empty) for two reasons:

    1- Make the original coder think about the boundary condition they created with days > 366 and to document why they took no action if days <= 366

    2- Document for later coders the assumed else condition to make maintenance easier and to show that the original coder did mean to take no action on the “else” condition.

    • Steven Pigeon says:

      You know, I never thought of it explicitly, but I do write stuff like:

      if (cond)
          {
            ...do stuff...
          }
      else ; // nothing more to do for this one
      
  5. […] fail that bricked the Zune a whole day on the last day of last year, a bissextile (leap) year? I described here and here how this error could have been entirely avoided using basic unit […]

  6. […] can be dire for the project and the company. Remember, more than a year ago I told you about how avoidable the Zune bug was, why and how the bug should have been found, and how sometimes even simple testing […]

  7. […] I have said in a previous post, if your function’s domain is small enough, you should use exhaustive testing rather than […]
