Serialization, the process by which run-time objects are saved from memory to persistent storage (typically disk) or sent across a network, requires the objects to be encoded in some efficient, preferably machine-independent, format.
One could consider XML or JSON, which are both viable options whenever simplicity or human-readability is required, or when every serialized object has a natural text-only representation. JSON, for example, provides only a limited number of basic data types: numbers, strings, booleans, arrays, and objects. What if you have a binary object? The standard approach with text encodings is to use Base64, but this results in a 33% data expansion, since every 3 source bytes become 4 encoded bytes. Base64 uses a–z, A–Z, 0–9, +, /, and = as encoding symbols, entirely avoiding commas, quotes (both single and double), braces, parentheses, newlines, and other symbols likely to interfere with the host encoding, whether XML, JSON, or CSV.
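The 3-to-4 expansion is easy to verify with Python's standard `base64` module; the three payload bytes here are arbitrary:

```python
import base64

payload = bytes([0x00, 0xFF, 0x7F])      # three arbitrary binary bytes
encoded = base64.b64encode(payload)      # four ASCII bytes: b'AP9/'
print(len(payload), "->", len(encoded))  # 3 -> 4
```

Every 24-bit group of input becomes four 6-bit symbols, hence the fixed 4/3 ratio (plus `=` padding when the input length is not a multiple of 3).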
What if you do not want (or cannot afford) the bloat that XML or JSON inflicts on your data, and are not using a language with built-in binary serialization? Evidently, you will roll your own binary encoding. But to do so, you must provide not only serialization mechanisms for the basic data types (including, one would guess, the infamous “binary blob”) but also a higher-level serialization syntax that provides for structure and, especially, error resilience.
Structure can be provided by a mechanism similar to Electronic Arts’ 1985 Interchange File Format, or IFF. This revolutionary format provides all the necessary mechanisms for structured, hierarchical object placement within a file. IFF was so far ahead of its time that it remains, even today, the basis for many multimedia file formats, including Microsoft’s RIFF and AVI formats, the Third Generation Partnership Project’s 3GP, MPEG-4 Part 14, and Apple’s QuickTime. Structure may also be much simpler, as simple as a flat encoding of successive objects, which may be perfectly à propos.
Error resilience is a complex topic: in multimedia, one wants not only to detect errors but to conceal them in some intelligent manner. Let us forget about error concealment for now and concentrate on a much simpler problem: corrupted files. One way to help recover a damaged file is to use block error-correcting codes, but these significantly increase data volume and account only for flipped bits, not for deleted or inserted data. If you cannot afford this increase but can somehow tolerate damaged or lost data, you can simply mark serialized object boundaries with synchronization markers.
The goal of a synchronization marker is to allow the reader to skip, without decoding, over damaged data until the next marker is found. Reading the marker provides a synchronization point from which deserialization may be attempted. If deserialization succeeds, the object is recovered and the next data to be read will be the next marker. If deserialization fails at some point, an error is thrown and handled, and deserialization may continue by skipping to the next marker. Although you lose objects this way, it may allow you to recover most of your data.
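Such a recovery loop can be sketched in a few lines of Python. This sketch assumes a single-byte 0xff marker (the choice made below) and a hypothetical `decode_object` callable that raises `ValueError` when handed corrupt data:

```python
MARKER = bytes([0xFF])  # assumed single-byte synchronization marker

def recover_objects(stream: bytes, decode_object):
    """Attempt to decode every marker-delimited chunk, skipping corrupt ones."""
    recovered = []
    for chunk in stream.split(MARKER):
        if not chunk:
            continue  # empty chunk: adjacent or leading/trailing markers
        try:
            recovered.append(decode_object(chunk))
        except ValueError:
            pass  # decoding failed: drop this object, resync at the next marker
    return recovered
```

Here `decode_object` stands in for whatever deserializer you use; the only assumption is that it signals failure with an exception rather than returning garbage.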
So the interesting question is: how do you create unambiguous markers when you expect your data to be corrupted?
Well, it should be quite obvious that the bit pattern you choose for the marker cannot appear in the serialized representation of objects, so the data has to be recoded to avoid using the same pattern as the marker.
Let us say we decide on a byte-sized marker, 0xff. Using escaping, we will recode the byte stream with an escape code. The escape code is a reserved symbol other than the marker, and it instructs the decoder to interpret the next byte as having a meaning different from its literal value. If you use Perl, Bash, or C, you have already seen this mechanism at work. In the expression \t, for example, \ is the escape, and t is the auxiliary code that introduces the special meaning: this is not a t, this is a tab character that causes the cursor to jump forward to the next tab stop.
Using the same idea, we can define the following recoding for the byte stream between 0xff markers:
| Encoded bytes | Meaning |
|---|---|
| 0xfe 0x00 | Escape for 0xfe |
| 0xfe 0x01 | Escape for 0xff |
| 0x00 to 0xfd | Literal value |
This encoding allows us to recode the serialized object byte stream so that 0xff never appears. The escape symbol 0xfe is never followed by 0xff itself but by 0x01, in the same way that \t encodes the ASCII code 0x09 rather than a literal t. The escape code is also never repeated, since the code that introduces it is 0x00. This provides a second line of defence against corrupted escapes.
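A straightforward implementation of this recoding, sketched in Python:

```python
MARKER = 0xFF
ESCAPE = 0xFE

def escape_stream(data: bytes) -> bytes:
    """Recode data so that the marker byte 0xff never appears in the output."""
    out = bytearray()
    for b in data:
        if b == ESCAPE:
            out += bytes([ESCAPE, 0x00])  # 0xfe -> 0xfe 0x00
        elif b == MARKER:
            out += bytes([ESCAPE, 0x01])  # 0xff -> 0xfe 0x01
        else:
            out.append(b)                 # 0x00 to 0xfd pass through literally
    return bytes(out)

def unescape_stream(data: bytes) -> bytes:
    """Invert escape_stream, raising ValueError on malformed input."""
    out = bytearray()
    it = iter(data)
    for b in it:
        if b == ESCAPE:
            code = next(it)  # StopIteration here means a truncated escape
            if code == 0x00:
                out.append(ESCAPE)
            elif code == 0x01:
                out.append(MARKER)
            else:
                raise ValueError("invalid escape sequence")
        elif b == MARKER:
            raise ValueError("unescaped marker inside object data")
        else:
            out.append(b)
    return bytes(out)
```

Note that `unescape_stream` rejects both an unknown auxiliary code and a bare 0xff, which is exactly the failure signal the recovery loop needs to resynchronize.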
You may wonder why I chose 0xfe as the escape and why not, say, 0x00. 0x00 would probably have been a bad choice, as it is likely to be relatively frequent in binary-encoded data. If the output data is varied and compressed efficiently, all bytes are in principle equally likely to be emitted, so the specific escape character is not overly critical. However, if the encoding is fairly direct, there is a high probability that 0x00 appears many times, because any numeric type that is not maxed out will have zeros in its most significant bits, leading to a higher occurrence rate of 0x00 bytes. In this case, using 0x00 as the escape code would lead to data expansion, something we are trying to avoid.
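The point about zeros in the most significant bits is easy to see with a direct encoding of a small integer, here using Python's `struct` module:

```python
import struct

raw = struct.pack("<I", 42)  # 42 as a 32-bit little-endian unsigned integer
print(raw)                   # b'*\x00\x00\x00': three of the four bytes are 0x00
```

Any stream full of small counts, lengths, or identifiers stored in fixed-width fields will exhibit the same skew toward 0x00, which is why it makes a poor escape byte for direct encodings.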
The principle can be generalized to wider markers and remains essentially the same. On 16 bits, one would use 0xffff as the marker. Since the marker itself could be corrupted, it may be wise to recode not only the byte pair 0xff 0xff but also lone 0xff bytes wherever they occur in the binary encoding.