## Average node depth in a Full Tree

May 14, 2013

While doing something else I stumbled upon the interesting problem of computing the average depth of nodes in a tree. The depth of a node is the distance that separates that node from the root. You can either decide that the root is at depth 1, or you can decide that it is at depth zero, but let’s decide on depth 1. So an immediate child of the root is at depth two, and its children at depth 3, and so on until you reach leaves, nodes with no children.

So the calculation of the average node depth (including leaves) in a tree comes interesting when we want to know how far a constructed tree is from the ideal full tree, as a measure of (application-specific) performance. After searching a bit on the web, I found only incomplete or incorrect formulas, or stated with proof. This week, let us see how we can derive the result without (too much) pain.

## Breaking Caesar’s Cipher (Caesar’s Cipher, part II)

April 16, 2013

In the last installment of this series, we had a look at Caesar’s cipher, an absurdly simple encryption technique where the symmetric encryption only consists in shifting symbols $k$ places.

While it’s ridiculously easy to break the cipher, even with pen-and-paper techniques, we ended up, last time, surmising that we should be able to crack the cipher automatically, without human intervention, if only we had a reasonable language model. This week, let us have a look at how we could build a very simple language model that does just that.

## Building a Tree from a List in Linear Time (II)

April 9, 2013

Quite a while ago, I proposed a linear time algorithm to construct trees from sorted lists. The algorithm relied on the segregation of data and internal nodes. This meant that for a list of $n$ data items, $2n-1$ nodes were allocated (but only $n$ contained data; the $n-1$ others just contained pointers.

While segregating structure and data makes sense in some cases (say, the index resides in memory but the leaves/data reside on disk), I found the solution somewhat unsatisfactory (but not unacceptable). So I gave the problem a little more thinking and I arrived at an algorithm that produces a tree with optimal average depth, with data in every node, in linear time and using at most $\Theta(\lg n)$ extra memory.

## Python Memory Management (Part II)

January 8, 2013

Last week we had a look at how much memory basic Python objects use. This week, we will discuss how Python manages its memory internally, and why it goes wrong if you’re not careful.

To speed-up memory allocation (and reuse) Python uses a number of lists for small objects. Each list will contain objects of similar size: there will be a list for objects 1 to 8 bytes in size, one for 9 to 16, etc. When a small object needs to be created, either we reuse a free block in the list, or we allocate a new one.

## Python Memory Management (Part I)

January 1, 2013

[This is a piece I initially wrote while at the LISA at U de M, for the newbie coders in the lab.]

One of the major challenges in writing (somewhat) large-scale Python programs, is to keep memory usage at a minimum. However, managing memory in Python is easy—if you just don’t care. Python allocates memory transparently, manages objects using a reference count system, and frees memory when an object’s reference count falls to zero. In theory, it’s swell. In practice, you need to know a few things about Python memory management to get a memory-efficient program running. One of the things you should know, or at least get a good feel about, is the sizes of basic Python objects. Another thing is how Python manages its memory internally.

So let us begin with the size of basic objects. In Python, there’s not a lot of primitive data types: there are ints, longs (an unlimited precision version of int), floats (which are doubles), tuples, strings, lists, dictionaries, and classes.

## Stemming

July 10, 2012

A few weeks ago, I went to Québec Ouvert Hackathon 3.3, and I was most interested by Michael Mulley’s Open Parliament. One possible addition to the project is to use cross-referencing of entries based not only on the parliament-supplied subject tags but also on the content of the text itself.

One possibility is to learn embeddings on bags of words but on stemmed words to reduce the dimensionality of the one-hot vector, essentially a bitmap where the bit corresponding to a word is set to 1 if it appears in the bag of words. So, let us start at the beginning, stemming.

## (Random Musings) On Floats and Encodings

January 31, 2012

The float and double floating-point data types have been present for a long time in the C (and C++) standard. While neither the C nor C++ standards do not enforce it, virtually all implementations comply to the IEEE 754—or try very hard to. In fact, I do not know as of today of an implementation that uses something very different. But the IEEE 754-type floats are aging. GPU started to add extensions such as short floats for evident reasons. Should we start considering adding new types on both ends of the spectrum?

The next step up, the quadruple precision float, is already part of the standard, but, as far as I know, not implemented anywhere. Intel x86 does have something in between for its internal float format on 80 bits, the so-called extended precision, but it’s not really standard as it is not sanctioned by the IEEE standards, and, generally speaking, and surprisingly enough, not really supported well by the instruction set. It’s sometimes supported by the long double C type. But, anyway, what’s in a floating point number?

## Medians (Part III)

January 24, 2012

So in the two previous parts of this series, we have looked at the selection algorithm and at sorting networks for determining efficiently the (sample) median of a series of values.

In this last installment of the series, I consider an efficient (but approximate) algorithm based on heaps to compute the median.

## Medians (Part II)

January 10, 2012

In the previous post of this series, we left off where we were asking ourselves if there was a better way than the selection algorithm of finding the median.

Computing the median of three numbers is a simple as sorting the three numbers (an operation that can be done in constant time, after all, if comparing and swapping are constant time) and picking the middle. However, if the objects compared are “heavy”, comparing and (especially) moving them around may be expensive.

## Building a Balanced Tree From a List in Linear Time

January 3, 2012

The usual way of forming a search tree from a list is to scan the list and insert each of its element, one by one, into the tree, leading to a(n expected) run-time of $O(n \lg n)$.

However, if the list is sorted (in ascending order, say) and the tree is not one of the self-balancing varieties, insertion is $O(n^2)$, because the “tree” created by the successive insertions of sorted key is in fact a degenerate tree, a list. So, what if the list is already sorted and don’t really want to have a self-balancing tree? Well, it turns out that you can build a(n almost perfectly) balanced tree in $O(n)$.