Quite a while ago I discussed using flat arrays and address calculations to store a tree in a simple array. The trick wasn’t new (I think it’s due to Williams, 1964 [1]), but is only really practical when we consider heaps, or otherwise very well balanced trees. If you have a misshapen tree, that trick doesn’t help you. It doesn’t help you either if you try to serialize a misshapen tree to disk.

But what if we do want to serialize arbitrarily-shaped trees to disk? Is it painful? Fortunately, no! Let’s see how.

First, let’s recall what a tree traversal is. There are of course many, many ways to visit all nodes in a tree, but the three usual depth-first ways—those found in textbooks anyways—are:

- Pre-order search, where the root is visited, then the left subtree, then the right subtree;
- In-order search, where the left subtree is visited first, then the root, then the right subtree;
- and Post-order search, where the left and right subtrees are visited first, and the root last.

Either traversals can be used to enumerate all nodes in a tree, they’re all equivalent in this regard. One or the other may be more convenient for the specific application you’re considering. For example, post-order may be more suitable for the evaluation of arithmetic expressions stored as trees. If you want to enumerate the contents of an ordered/search tree, then in-order will give you the values in increasing order. Sometimes you just don’t care: you can serialize a tree in any order you want: the complexity of encoding and decoding is comparable.

But if you want to serialize a tree, you have to make sure that the deserialized tree is reconstructed correctly, that is, that all nodes point to the right nodes. You can’t really save pointers, since when you will reconstruct the tree, nothing guaranties that the same memory locations will be used. One idea might be to assign a number to each node, and store these numbers rather than pointers, and somehow use that information to resew the nodes later. That might work, but 1) we need a first scan to assign numbers, 2) we need to write those to disk, and 3) that doesn’t simplify deserialization one bit.

Furthermore, the structure of the tree itself mostly dispenses us from worrying about pointers. We only need to encode *structure*, and structure in a tree is very simple to describe: a node, in a binary tree, can have zero, one, or two child(ren). We only need to encode the presence (or absence) of children, and the content of the node, and *nothing* else. OK, let’s demonstrate. Consider this tree:

Let’s now indicate on the tree, the null pointers by a simple 0 and non-null pointers by 1s:

If a 1 is associated with a pointer, that means it points to a subtree—with at least one node. If it’s a zero, then there is no subtree associated with the pointer. This gives us the idea that we can only save one bit—0 or 1—to indicate if there’s a subtree to be decoded or not. A pre-order (because, why not) *decoding* would go as follows:

- Decode the data contained in the node (you’ll need some way of encoding the data in the node, something like JSON or a binary representation),
- If the next bit is a 1, decode (recursively) the left subtree, if it’s a zero, there is no left subtree to decode;
- If the next bit is a 1, decode (recursively) the right subtree, likewise, it if’s a zero, there is no right subtree to decode.

(In- and post-order searches merely reorder the above steps.) The decoding will need some kind of iterator, or index, that advances linearly while decoding. Encoding proceeds pretty much the same: encode the data, encode (recursively) the left subtree, encode (recursively) the right subtree.

*

* *

It sounds simple, but is it actually simple? Sometimes simple ideas turns out to be rather complicated to implement right, but not this time! The encode and decode procedure (here prefixed by `p_`, as they are private methods) are given by:

std::string p_encoded(tree_node * node) const { return p_encode_data(node->data) + (node->left? '1'+p_encoded(node->left) : "0") + (node->right? '1'+p_encoded(node->right) : "0"); } tree_node * p_decoded(const std::string & tree_code, size_t & offset) const { return new tree_node( p_decode_data(tree_code,offset), (tree_code[offset]=='1') ? p_decoded(tree_code,++offset) : nullptr, (tree_code[++offset]=='1') ? p_decoded(tree_code,++offset) : nullptr); }

The decoding procedure is a bit more complicated than the encoding because you have to check for subtrees, the “bit” is coded as a whole ASCII character in a string, and that some care must be taken to advance the “cursor” correctly.

On the above tree, the output of `p_encoded` is

`a1b1c0001d1e001f00`,

which, decoded back gives:

`(a(b(c))(d(e)(f)))`,

which is a lame parentheses-based representation of a tree.

*

* *

A more complete Proof of Concept code is found below: click to expand.

#include <iostream> #include <iomanip> #include <string> // needs to_string to be overloaded for your data // template <typename T> class tree { private: class tree_node { public: tree_node *left,*right; T data; tree_node(const T & new_data, tree_node *new_left=nullptr, tree_node *new_right=nullptr) : left(new_left), right(new_right), data(new_data) {} ~tree_node() { delete left; delete right; } }; tree_node *root; // These two should be passed as template // arguments (that's a bit tedious for a // PoC). // T p_decode_data(const std::string & tree_code, size_t & offset) const { return tree_code[offset++]; // for now, char } std::string p_encode_data(const T & data) const { // works fine for 'char' return std::string(sizeof(T),data); } std::string p_encoded(tree_node * node) const { return p_encode_data(node->data) + (node->left? '1'+p_encoded(node->left) : "0") + (node->right? '1'+p_encoded(node->right) : "0"); } tree_node * p_decoded(const std::string & tree_code, size_t & offset) const { return new tree_node( p_decode_data(tree_code,offset), (tree_code[offset]=='1') ? p_decoded(tree_code,++offset) : nullptr, (tree_code[++offset]=='1') ? p_decoded(tree_code,++offset) : nullptr); } void p_show(tree_node * node) const { if (node) { std::cout << "(" << node->data; // must have suitable overload! p_show(node->left); p_show(node->right); std::cout << ")"; } } public: void show() const { p_show(root); std::cout << std::endl; } std::string encoded() const { return p_encoded(root); } tree(const std::string & tree_code) { size_t offset=0; root=p_decoded(tree_code,offset); } tree() : root(nullptr) {} ~tree() { delete root; } }; int main() { const std::string tree_code="a1b1c0001d1e001f00"; tree<char> t(tree_code); std::cout << tree_code << std::endl; t.show(); std::cout << t.encoded() << std::endl; return 0; }

*

* *

Is this the only way of serializing a tree? No. Standish [2] presents quite a few, but none quite as simple. But it’s not exactly all that simple either: a lot of complexity may be hidden in the encoding and decoding of the data contained in the nodes. Even something as banal as a string need special care… how do we take spaces into account? Do we use quotes? Then we need an escape mechanism to introduce quotes within the string… etc. Maybe a simple length+raw data encoding is sufficient. What if the data also contains pointers?!

[1] J. W. J. Williams —

*Algorithm 232: Heap Sort*— C. ACM, vol. 7, n° 6 (1964) pp. 347–348

[2] Thomas A. Standish — *Data Structure Techniques* — Addison-Wesley (1980)