Average node depth in a Full Tree

May 14, 2013

While doing something else, I stumbled upon the interesting problem of computing the average depth of nodes in a tree. The depth of a node is the distance that separates that node from the root. You can decide that the root is at depth 0 or at depth 1; let’s decide on depth 1. So an immediate child of the root is at depth 2, its children are at depth 3, and so on until you reach the leaves, the nodes with no children.
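
As a quick illustration of that definition, here is a minimal Python sketch that computes the average node depth of a small tree, with the root at depth 1. The `Node` class and the `average_depth` function are names made up for this example; they do not come from the post itself.

```python
class Node:
    """A tree node with an arbitrary number of children (hypothetical helper)."""
    def __init__(self, *children):
        self.children = list(children)


def average_depth(root):
    """Average depth over all nodes, leaves included, with the root at depth 1."""
    total, count = 0, 0
    stack = [(root, 1)]          # (node, depth) pairs still to visit
    while stack:
        node, depth = stack.pop()
        total += depth
        count += 1
        for child in node.children:
            stack.append((child, depth + 1))
    return total / count


# A root with two children, one of which has a child of its own:
# the depths are 1, 2, 2, 3, so the average is 8 / 4 = 2.0
tree = Node(Node(Node()), Node())
print(average_depth(tree))
```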

So the calculation of the average node depth (including leaves) in a tree becomes interesting when we want to know how far a constructed tree is from the ideal full tree, as a measure of (application-specific) performance. After searching a bit on the web, I found only formulas that were incomplete, incorrect, or stated without proof. This week, let us see how we can derive the result without (too much) pain.
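
To make the comparison concrete, suppose the ideal tree is a perfect binary tree with h levels (an assumption made for this sketch; the full post's derivation may be more general). Level d then holds 2^(d-1) nodes, so the depths sum to (h-1)2^h + 1 over 2^h - 1 nodes. The snippet below checks that closed form against a direct level-by-level sum; the function names are made up for the example.

```python
from fractions import Fraction


def average_depth_direct(h):
    """Sum depth * (nodes at that depth) level by level, root at depth 1."""
    total = sum(d * 2 ** (d - 1) for d in range(1, h + 1))
    return Fraction(total, 2 ** h - 1)


def average_depth_closed_form(h):
    """Closed form ((h - 1) * 2^h + 1) / (2^h - 1) for a perfect binary tree."""
    return Fraction((h - 1) * 2 ** h + 1, 2 ** h - 1)


for h in range(1, 11):
    assert average_depth_direct(h) == average_depth_closed_form(h)

print(average_depth_closed_form(3))   # 17/7
```

For large h the average tends to h - 1: about half the nodes sit on the last level, so the average depth is only about one level short of the height.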

Read the rest of this entry »


Breaking Caesar’s Cipher (Caesar’s Cipher, part II)

April 16, 2013

In the last installment of this series, we had a look at Caesar’s cipher, an absurdly simple encryption technique where the symmetric encryption consists only of shifting symbols k places.
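
As a reminder of how little there is to it, here is a minimal sketch of the cipher in Python. It assumes a lowercase 26-letter alphabet and leaves every other character untouched; the `caesar` function is a name chosen for this example.

```python
import string

ALPHABET = string.ascii_lowercase


def caesar(text, k):
    """Shift each lowercase letter k places, wrapping around; leave the rest alone."""
    shifted = ALPHABET[k % 26:] + ALPHABET[:k % 26]
    return text.translate(str.maketrans(ALPHABET, shifted))


secret = caesar("veni vidi vici", 3)
print(secret)                 # yhql ylgl ylfl
print(caesar(secret, -3))     # veni vidi vici
```

Decryption is just encryption with the opposite shift, which is why the same key serves both ways.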

While it’s ridiculously easy to break the cipher, even with pen-and-paper techniques, we ended up, last time, surmising that we should be able to crack the cipher automatically, without human intervention, if only we had a reasonable language model. This week, let us have a look at how we could build a very simple language model that does just that.
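
To give a rough idea of what "cracking automatically" can look like, here is a sketch that uses the simplest language model I can think of: independent letter frequencies. This is a stand-in chosen for illustration, not necessarily the model built in the full post. It tries all 26 keys and keeps the candidate plaintext that looks most like English. The frequency table holds commonly quoted approximate values, and all the names are made up for the example.

```python
import math
import string

ALPHABET = string.ascii_lowercase

# Approximate English letter frequencies in percent, a through z
# (commonly quoted figures, assumed good enough for this sketch).
FREQ = [8.17, 1.49, 2.78, 4.25, 12.70, 2.23, 2.02, 6.09, 6.97, 0.15,
        0.77, 4.03, 2.41, 6.75, 7.51, 1.93, 0.10, 5.99, 6.33, 9.06,
        2.76, 0.98, 2.36, 0.15, 1.97, 0.07]
LOG_P = {c: math.log(f / 100) for c, f in zip(ALPHABET, FREQ)}


def caesar(text, k):
    """Shift each lowercase letter k places, wrapping around."""
    shifted = ALPHABET[k % 26:] + ALPHABET[:k % 26]
    return text.translate(str.maketrans(ALPHABET, shifted))


def log_likelihood(text):
    """Score text under the letter-frequency model; higher is more English-like."""
    return sum(LOG_P[c] for c in text if c in LOG_P)


def crack(ciphertext):
    """Try all 26 keys and return (key, plaintext) for the best-scoring candidate."""
    key = max(range(26), key=lambda k: log_likelihood(caesar(ciphertext, -k)))
    return key, caesar(ciphertext, -key)


plain = ("it was the best of times it was the worst of times "
         "it was the age of wisdom it was the age of foolishness")
key, recovered = crack(caesar(plain, 7))
print(key, recovered == plain)
```

A model this crude needs a reasonable amount of text to work with; on very short messages, letter frequencies alone can easily pick the wrong key.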

Read the rest of this entry »


Building a Tree from a List in Linear Time (II)

April 9, 2013

Quite a while ago, I proposed a linear-time algorithm to construct trees from sorted lists. The algorithm relied on the segregation of data and internal nodes. This meant that for a list of n data items, 2n-1 nodes were allocated (but only n contained data; the n-1 others just contained pointers).

While segregating structure and data makes sense in some cases (say, the index resides in memory but the leaves/data reside on disk), I found the solution somewhat unsatisfactory (but not unacceptable). So I gave the problem a little more thinking and I arrived at an algorithm that produces a tree with optimal average depth, with data in every node, in linear time and using at most \Theta(\lg n) extra memory.
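
For readers who want something to experiment with, here is a sketch of a well-known textbook construction that has the same three properties: data in every node, linear time, and extra memory limited to a recursion stack of depth about lg n. I am not claiming it is the algorithm described in the full post. It consumes the sorted items strictly in order, and at each node the two subtree sizes differ by at most one. The names `Node`, `build_tree`, `height`, and `in_order` are made up for the example.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right


def build_tree(sorted_items):
    """Build a balanced search tree from a sorted list in O(n) time.

    Each item is consumed exactly once, in order, so the same idea works on
    a linked list; the only extra memory is the O(lg n) recursion stack.
    """
    items = iter(sorted_items)

    def build(n):
        if n == 0:
            return None
        left = build(n // 2)              # first half goes to the left subtree
        key = next(items)                 # the next item becomes this node's key
        right = build(n - n // 2 - 1)     # whatever remains goes to the right
        return Node(key, left, right)

    return build(len(sorted_items))


def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


def in_order(node):
    return [] if node is None else in_order(node.left) + [node.key] + in_order(node.right)


root = build_tree(list(range(1, 8)))
print(root.key, height(root))   # 4 3
```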

Read the rest of this entry »


Python Memory Management (Part II)

January 8, 2013

Last week we had a look at how much memory basic Python objects use. This week, we will discuss how Python manages its memory internally, and why it goes wrong if you’re not careful.

To speed up memory allocation (and reuse), Python uses a number of lists for small objects. Each list contains objects of similar size: there is a list for objects 1 to 8 bytes in size, one for objects 9 to 16 bytes, and so on. When a small object needs to be created, we either reuse a free block in the list or allocate a new one.
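
The scheme can be sketched as a toy model in a few lines of Python. This is only an illustration of the idea of per-size free lists; the real allocator lives inside the interpreter and is considerably more involved. `ToyAllocator` and `size_class` are made-up names, and the "blocks" are just labels, not memory.

```python
ALIGNMENT = 8   # width of each size class in bytes, as described in the text


def size_class(nbytes):
    """Index of the list serving a request: 1..8 -> 0, 9..16 -> 1, and so on."""
    return (nbytes - 1) // ALIGNMENT


class ToyAllocator:
    """Toy model only: one free list per size class, blocks are just labels."""

    def __init__(self):
        self.free_lists = {}     # size class -> list of free blocks
        self.created = 0         # number of brand-new blocks handed out

    def allocate(self, nbytes):
        free = self.free_lists.setdefault(size_class(nbytes), [])
        if free:
            return free.pop()    # reuse a free block from the list
        self.created += 1        # otherwise allocate a new one
        return (size_class(nbytes), self.created)

    def release(self, block):
        self.free_lists[block[0]].append(block)


pool = ToyAllocator()
a = pool.allocate(12)            # falls in the 9-to-16-byte class
pool.release(a)
b = pool.allocate(16)            # same class, so the freed block is reused
print(a is b, pool.created)      # True 1
```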

Read the rest of this entry »


Python Memory Management (Part I)

January 1, 2013

[This is a piece I initially wrote while at the LISA at U de M, for the newbie coders in the lab.]

One of the major challenges in writing (somewhat) large-scale Python programs is to keep memory usage at a minimum. Managing memory in Python is easy, as long as you just don’t care. Python allocates memory transparently, manages objects using a reference count system, and frees memory when an object’s reference count falls to zero. In theory, it’s swell. In practice, you need to know a few things about Python memory management to get a memory-efficient program running. One of the things you should know, or at least have a good feel for, is the size of basic Python objects. Another is how Python manages its memory internally.

So let us begin with the size of basic objects. In Python, there are not many primitive data types: there are ints, longs (an unlimited-precision version of int), floats (which are doubles), tuples, strings, lists, dictionaries, and classes.
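
One way to get a feel for these sizes is the standard library's `sys.getsizeof`. The snippet below is a small sketch, not necessarily the measurement method used in the full post. It is written for Python 3, where int and long have been merged into a single type, and the numbers it prints vary with the interpreter version and platform.

```python
import sys

# Sizes are in bytes and depend on the interpreter version and platform,
# so no particular values are assumed here.
samples = [0, 2 ** 100, 1.0, (), "", [], {}]
sizes = {type(obj).__name__: sys.getsizeof(obj) for obj in samples}
for name, size in sizes.items():
    print(name, size)

# getsizeof counts only the container itself, not the objects it refers to.
print(sys.getsizeof([]), sys.getsizeof([1, 2, 3]))
```

Note that the dictionary keeps one entry per type name, so the two ints in `samples` collapse into a single `int` line showing the size of the larger one.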

Read the rest of this entry »


Stemming

July 10, 2012

A few weeks ago, I went to Québec Ouvert Hackathon 3.3, and I was most interested in Michael Mulley’s Open Parliament. One possible addition to the project is to cross-reference entries based not only on the parliament-supplied subject tags but also on the content of the text itself.

One possibility is to learn embeddings on bags of words, but using stemmed words to reduce the dimensionality of the one-hot vector, which is essentially a bitmap where the bit corresponding to a word is set to 1 if the word appears in the bag of words. So, let us start at the beginning: stemming.
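
To see why stemming shrinks the vector, here is a toy sketch. The `crude_stem` function only strips a few suffixes and is a deliberately naive stand-in, not a real stemming algorithm; `bag_bitmap` and the three sample sentences are likewise made up for the example.

```python
def crude_stem(word):
    """Very rough suffix stripping; a stand-in, not a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def bag_bitmap(text, vocabulary):
    """One bit per vocabulary entry, set to 1 if that stem occurs in the text."""
    stems = {crude_stem(w) for w in text.lower().split()}
    return [1 if v in stems else 0 for v in vocabulary]


docs = ["the members voted", "a member votes", "voting on the budget"]
raw_vocab = sorted({w for d in docs for w in d.split()})
stem_vocab = sorted({crude_stem(w) for d in docs for w in d.split()})

print(len(raw_vocab), len(stem_vocab))         # 9 6
print(stem_vocab)                              # ['a', 'budget', 'member', 'on', 'the', 'vot']
print(bag_bitmap("members voting", stem_vocab))  # [0, 0, 1, 0, 0, 1]
```

Here "voted", "votes", and "voting" all collapse onto one stem, so they share a single bit instead of taking three.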

Read the rest of this entry »


(Random Musings) On Floats and Encodings

January 31, 2012

The float and double floating-point data types have been present for a long time in the C (and C++) standard. While neither the C nor the C++ standard enforces it, virtually all implementations comply with IEEE 754, or try very hard to. In fact, I do not know, as of today, of an implementation that uses something very different. But the IEEE 754-type floats are aging. GPUs have started to add extensions such as short floats, for evident reasons. Should we start considering adding new types on both ends of the spectrum?

The next step up, the quadruple-precision float, is already part of the standard but, as far as I know, not implemented anywhere. Intel x86 does have something in between, its 80-bit internal float format, the so-called extended precision, but it’s not really standard as it is not sanctioned by the IEEE standards and, surprisingly enough, generally not well supported by the instruction set. It’s sometimes supported by the long double C type. But, anyway, what’s in a floating-point number?
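
As a first peek inside, the sketch below splits a value, rounded to IEEE 754 single precision, into its sign bit, its 8-bit biased exponent, and its 23-bit mantissa. It uses Python's standard `struct` module; the `decompose` function is a name made up for the example.

```python
import struct


def decompose(x):
    """Split x, rounded to IEEE 754 single precision, into its three bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF        # 23 bits, the leading 1 is implicit
    return sign, exponent, mantissa


print(decompose(1.0))        # (0, 127, 0)
print(decompose(-2.0))       # (1, 128, 0)
print(decompose(0.15625))    # (0, 124, 2097152)
```

The last value is 1.01 in binary times 2^-3: the exponent field holds -3 + 127 = 124, and only the second mantissa bit is set.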

Read the rest of this entry »


Medians (Part III)

January 24, 2012

So in the two previous parts of this series, we looked at the selection algorithm and at sorting networks for efficiently determining the (sample) median of a series of values.

In this last installment of the series, I consider an efficient (but approximate) algorithm based on heaps to compute the median.
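
For comparison, here is the textbook way of using heaps for medians: keep the smaller half of the values in a max-heap and the larger half in a min-heap. Note that this variant is exact and keeps every value, so it is not the approximate algorithm of the full post, only a baseline to measure it against. It relies on Python's standard `heapq` module; `RunningMedian` is a made-up name.

```python
import heapq


class RunningMedian:
    """Exact running median using two heaps (textbook variant)."""

    def __init__(self):
        self.low = []    # max-heap of the smaller half, stored as negated values
        self.high = []   # min-heap of the larger half

    def add(self, x):
        # Route x through the lower half so everything in low <= everything in high.
        heapq.heappush(self.low, -x)
        heapq.heappush(self.high, -heapq.heappop(self.low))
        # Rebalance so that low holds the extra element when the count is odd.
        if len(self.high) > len(self.low):
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def median(self):
        if len(self.low) > len(self.high):
            return -self.low[0]
        return (-self.low[0] + self.high[0]) / 2


rm = RunningMedian()
for value in [5, 1, 3, 8, 7]:
    rm.add(value)
print(rm.median())   # 5
```

Each insertion costs O(lg n), and the median is then read off the heap tops in constant time.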

Read the rest of this entry »


Medians (Part II)

January 10, 2012

In the previous post of this series, we left off asking ourselves whether there is a better way of finding the median than the selection algorithm.

Computing the median of three numbers is as simple as sorting the three numbers (an operation that can be done in constant time, after all, if comparing and swapping are constant time) and picking the middle one. However, if the objects compared are “heavy”, comparing and (especially) moving them around may be expensive.
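
One way to avoid the moves altogether is to pick the median with comparisons only. The sketch below uses at most three comparisons and never swaps or copies anything; `median_of_three` is a name chosen for the example, not something taken from the full post.

```python
def median_of_three(a, b, c):
    """Return the median of three values using at most three comparisons.

    Nothing is swapped or copied; the function just returns one of its arguments.
    """
    if a < b:
        if b < c:
            return b                 # a < b < c
        return c if a < c else a     # c <= b, so the median is max(a, c)
    if a < c:
        return a                     # b <= a < c
    return c if b < c else b         # c <= a, so the median is max(b, c)


print(median_of_three(3, 1, 2))   # 2
```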

Read the rest of this entry »


Building a Balanced Tree From a List in Linear Time

January 3, 2012

The usual way of forming a search tree from a list is to scan the list and insert each of its elements, one by one, into the tree, leading to a(n expected) run-time of O(n \lg n).

However, if the list is sorted (in ascending order, say) and the tree is not one of the self-balancing varieties, insertion is O(n^2), because the “tree” created by the successive insertions of sorted keys is in fact a degenerate tree, a list. So, what if the list is already sorted and you don’t really want to have a self-balancing tree? Well, it turns out that you can build a(n almost perfectly) balanced tree in O(n).
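
The degenerate case is easy to reproduce. The sketch below inserts 100 already-sorted keys into a plain, non-balancing search tree and measures the resulting height. `Node`, `insert`, and `height` are made-up names, and both routines are written without recursion so that the list-like shape does not hit Python's recursion limit on larger inputs.

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None


def insert(root, key):
    """Plain (non-balancing) search-tree insertion, written iteratively."""
    if root is None:
        return Node(key)
    node = root
    while True:
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(key))
            return root
        node = child


def height(root):
    """Number of levels in the tree, computed without recursion."""
    levels, frontier = 0, [root] if root else []
    while frontier:
        levels += 1
        frontier = [c for n in frontier for c in (n.left, n.right) if c]
    return levels


root = None
for key in range(1, 101):        # keys arrive already sorted
    root = insert(root, key)
print(height(root))              # 100: the "tree" is really a list
```

Every key ends up as the right child of the previous one, so the k-th insertion walks past k - 1 nodes, which is where the quadratic total comes from.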

Read the rest of this entry »

