Checksums (part I)

I once worked in a company specializing in embedded electronics for industrial applications. In one particular project, the device communicated through a RS-422 cable to the computer and seemed to return weird data once in a while, causing unwanted behavior in the control computer whose programming did not provide for this unexpected data. So I took upon myself to test the communication channel as it seemed that the on-board software was operating properly and did not contain serious bugs. I added a check-sum to the data packet and it turned out that some packets came in indeed corrupted despite the supposedly superior electrical characteristics of the RS-422 link.

After a few days’ work, I implemented the communication protocol that could detect and repair certain errors while reverting to a request to retransmit if the data was too damaged. I then started gathering statistics on error rate, number of retransmit, etc, and the spurious behavior on the controller’s side went away. My (metaphorically) pointy-haired boss opposed the modification because “we didn’t have any of these damn transmission errors until you put your fancy code in there“. Of course, this was an epic facepalm moment. I tried to explain that the errors have always been there, except that now they’re caught and repaired. Needless to say, it ended badly.

Notwithstanding this absurd episode, I kept using check-sum to validate data whenever no other layer of the protocol took care of transmission errors. So, this week, let us discuss check-sums and other error detection algorithms.

The simplest type of check-sum, the check sum simply adds all numbers (or bytes, or bits) and reports that sum as the check-sum. This, unfortunately, doesn’t give you much power for error detection. Indeed, it’s very easy to understand that simply computing the sum of a series of number doesn’t tell you much. There are many ways of getting a given sum in the presence of multiple errors. Even a simple inversion goes undetected.

A bit more protection is given by Luhn‘s algorithm. Luhn’s algorithm on a series of digits (let us consider only digits as it generalises to any radix anyway) starts with an initial sum of $s_0$, which may be 0 or salted. Every odd-position digit is added once to $s_0$, each even-position digit is added twice to $s_0$. For example, the digit series 6,4,2,5,1 would be summed up as $s=s_0+6+(2\times{}4)+2+(2\times{}5)+1$. Then, $s \mod{}\:10$ is returned as the check-sum. We could reformulate the operation as:

$\displaystyle s=s_0+\sum_{i=1}^n 2^{((i+1)\mod{}\:2)} x_i$

An example of this simple type of check-sum is the Canadian Social Insurance Number, whose numbers are chosen so that the check-sum is $\equiv 0 (\!\!\mod 10)$. Needless to say, the validity of the number as a social security number cannot be limited to the verification of the check-sum alone; it has to be checked against the national registry—it’s ridiculously easy to generate a SIN that passes the check-sum test.

The idea can be extended to using other multipliers than 1 and 2 in the sum. The ISBN, or International Standard Book Number, uses either 10 or 13 digits codes to uniquely identify a book. With the correct ISBN alone, you can order a book from your favorite bookshop without ambiguity. The ISBN-10 algorithm computes the sum:

$\displaystyle x_{10}=\left(\sum_{i=1}^9 i\:x_i\right)\mod\:11$

Which yield the last digit of the code (with X encoding 10). The ISBN-13, however, is a Luhn check-sum using 1 and 3 rather than 1 and 2.

Luhn’s method generalizes to any number of digits (and to other bases as well), but is not very strong. Much stronger than Luhn’s method, the CRC, or cyclical redundancy check, computes the “check-sum” as the remainder of the division of the message, considered as a big integer, by a divisor polynomial. There are number-theoretic reasons to call the divisor a “divisor polynomial” but let us not go there right now (if you’re really interested in the mathematics, see here). For all practical purpose, the divisor is a small integer. The divisor is chosen in order to give a remainder on 16, 32 or 64 bits. While the long division of a very big number (the message) by a small integer (the divisor) seems cumbersome, some efficient algorithms exist that will compute the CRC very efficiently, and without doing any arbitrary precision arithmetic.

However, if the CRC is much better than the simple check-sum and Luhn’s algorithm, it is not infallible. As it encodes the message’s structure on a very small number of bits compared to the message itself, it cannot possibly trap every errors. But it turns out that for a good divisor polynomial (there are a few standard ones), it will detect quite many errors. It detects all errors of exactly one bit. It will detect all errors of two bits if the errors are separated by more bits than the order of the polynomial; for a polynomial of order $n$, they must be $n$ bits apart. It will also detect all error bursts of $n$ bits or less. However, it doesn’t do well with long series of leading zero bits, as deletion and insertion of leading zeroes will go undetected.

The CRC is the preferred method of computing check-sums in a number of applications (from Zip files to Ethernet packets) but there’s a number of ad hoc check-sum algorithms out there, not all equally fantastic. For example, the Adler32 function is meant to be efficiently computed; but yields incomplete coverage for messages that are too short. I also presented an ad hoc hash function in a previous post that is reasonably good while amenable to efficient implementations1.

*
* *

Which brings me to the topic of check-sums and hashes. In UEID, I presented a few hash function families. I discussed secure hash functions like MD5 and SHA in the context of providing a unique descriptor to a piece of data. That’s also the goal of a check-sum. A hash function wants to yield a different value whenever the message changes, ever so lightly, which is also a the goal of a check-sum function.

It should be clear by now that a (good) check-sum and a (good) hash function are the same both the exact same thing: they map a series of bits (the message) onto a smaller series of bits (the check-sum/hash) in a way that captures as many changes as possible in the messages.

*
* *

It turns out that inventing a good check-sum/hash algorithm is extremely difficult—even Don Knuth got his ass bitten on this one2. Unless you want to invest a considerable amount of time developing and thoroughly testing your own check-sum/hash function, I would advice to use one that already exist, that have been examined by experts, and that has Open Source implementations available.

1 I tested its behavior in the context of hash table look-ups, and it exhibited a most cromulent behavior. See here (in French).

2 Donald Knuth &mdash The Art of Computer Programming, Volume 3: Sorting and Searching — Addison-Wesley, 1973.

One Response to Checksums (part I)

1. […] a number of different occasions, I briefly discussed Hash Functions, saying that if a hash function needn’t be very strong if […]