## Building a Tree from a List in Linear Time (II)

April 9, 2013

Quite a while ago, I proposed a linear-time algorithm to construct trees from sorted lists. The algorithm relied on the segregation of data and internal nodes. This meant that for a list of $n$ data items, $2n-1$ nodes were allocated, but only $n$ contained data; the $n-1$ others just contained pointers.

While segregating structure and data makes sense in some cases (say, the index resides in memory but the leaves/data reside on disk), I found the solution somewhat unsatisfactory (but not unacceptable). So I gave the problem a little more thought and arrived at an algorithm that produces a tree with optimal average depth, with data in every node, in linear time, and using only $\Theta(\lg n)$ extra memory.
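The post's algorithm is not reproduced in this excerpt, but as a minimal sketch of one way to meet those bounds, the classic in-order construction consumes the sorted items sequentially (a vector stands in here for the list) while recursing on subtree sizes, giving linear time and a $\Theta(\lg n)$-deep stack:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct Node {
    int value;
    std::unique_ptr<Node> left, right;
};

// Builds a tree over `count` items: recurse on the left subtree first,
// consume the next item in order for the root, then build the right
// subtree. Each item is touched exactly once (linear time), and the
// recursion depth is logarithmic because counts halve at each level.
static std::unique_ptr<Node> build(const std::vector<int>& items,
                                   std::size_t& next,
                                   std::size_t count)
{
    if (count == 0) return nullptr;
    std::size_t left_count = count / 2;
    auto left = build(items, next, left_count);
    auto node = std::make_unique<Node>();
    node->value = items[next++];           // items consumed strictly in order
    node->left = std::move(left);
    node->right = build(items, next, count - left_count - 1);
    return node;
}

std::unique_ptr<Node> tree_from_sorted(const std::vector<int>& items)
{
    std::size_t next = 0;
    return build(items, next, items.size());
}
```

Because items are read strictly left to right, the same scheme works on a singly linked list: advance the list pointer instead of the `next` index.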

## Shallow Constitude

March 19, 2013

In programming languages, there are constructs that are of little pragmatic importance (that is, they do not really affect how code behaves or what code is generated by the compiler) but are of great “social” importance, as they inform the programmer of the contract the code complies with.

One of those constructs in C++ is const (along with the other qualifiers), which explicitly tells the programmer that a function argument will be treated as read-only: it is safe to pass your data to it; it won't be modified. But is it anything more than security theater?
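To make the question concrete, consider two hypothetical functions (illustrative, not from the post): one honors the const contract and the compiler enforces it, while the other deliberately breaks it with const_cast, which is why the contract is partly social:

```cpp
#include <cstddef>
#include <string>

// The const reference promises the caller the argument is read-only,
// and the compiler rejects accidental writes through it.
std::size_t count_spaces(const std::string& s)
{
    std::size_t n = 0;
    for (char c : s)
        if (c == ' ') ++n;
    // s += "!";  // would not compile: s is const
    return n;
}

// ...but the promise can be broken on purpose. This is well-defined
// only when the referenced object is not itself const.
void break_promise(const std::string& s)
{
    const_cast<std::string&>(s) += "!";
}
```

The compiler catches honest mistakes; a determined (or careless) programmer can still cast the promise away.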

## Compressed Series (Part II)

March 12, 2013

Last week we looked at an alternative series to compute $e$, and this week we will have a look at the computation of $e^x$. The usual series we learn from calculus textbooks is given by

$\displaystyle e^x=\sum_{n=0}^\infty \frac{x^n}{n!}$

We can factorize the expression as
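The excerpt ends before the factorization, but a natural candidate (an assumption about where the post is headed, not its actual derivation) is the Horner-style nesting $e^x \approx 1 + x(1 + \frac{x}{2}(1 + \frac{x}{3}(\cdots)))$, which avoids computing powers and factorials separately:

```cpp
// Evaluates the series truncated after `terms` terms, innermost factor
// first: each step folds one x/n factor into the accumulator, so no
// powers or factorials are ever formed explicitly.
double exp_series(double x, int terms)
{
    double acc = 1.0;
    for (int n = terms; n >= 1; --n)
        acc = 1.0 + acc * x / n;
    return acc;
}
```

With twenty terms the result for $e^1$ already agrees with $e$ to machine precision, since the neglected tail is on the order of $1/21!$.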

## Float16

February 12, 2013

The possible strategies for data compression fall into two main categories: lossless and lossy compression. Lossless compression means that you retrieve exactly what went in after compression, while lossy means that some information was destroyed to get better compression, meaning that you do not retrieve the original data, but only a reasonable reconstruction (for various definitions of “reasonable”).

Destroying information is usually performed using transforms and quantization. Transforms map the original data onto a space where the unimportant variations are easily identified, and on which quantization can be applied without affecting the original signal too much. For quantization, the first approach is to simply reduce precision, somehow “rounding” the values onto a smaller set of admissible values. For decimal numbers, this operation is rounding (or truncating) to the $n$th digit (with $n$ smaller than the original precision). A much better approach is to minimize an explicit error function, choosing the smaller set of values in a way that minimizes the expected error (or maximum error, depending on how you formulate your problem).
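As a minimal sketch of the first, naive approach (the function name and step parameter are illustrative), a uniform quantizer rounds every value onto the nearest multiple of a chosen step:

```cpp
#include <cmath>

// Uniform quantization: snap x onto the nearest multiple of `step`.
// A coarser step means fewer admissible values (better compression)
// but a larger reconstruction error, bounded by step/2.
double quantize(double x, double step)
{
    return step * std::round(x / step);
}
```

An error-minimizing quantizer would instead place the admissible values where the data actually concentrates, rather than on a uniform grid.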

## Fat, Slim Pointers

July 31, 2012

A 64-bit address space lets us access tons more memory than a 32-bit one, but with a catch: the pointers themselves are … well, yes, 64 bits. 8 bytes. Those bytes eventually pile up to a whole lot of memory devoted to pointers if you use pointer-rich data structures. Can we do something about this?

Well, in ye goode olde dayes of 16-/32-bit computing, we had some compilers that could deal with near and far pointers: the near, 16-bit pointers being relative to one of the segments, possibly the stack segment, and the far, 32-bit pointers being absolute or relative to a segment. This, of course, made programming pointlessly complicated, as each pointer had to be used in its correct context to point to the right thing.
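A modern take on near pointers, sketched below under the assumption that the pointed-to objects all live in one common pool, stores a 32-bit offset from a base instead of a full 64-bit address, halving pointer storage at the cost of a 4-billion-entry window:

```cpp
#include <cstdint>
#include <vector>

// A "slim" pointer: 4 bytes instead of 8. The offset is only
// meaningful relative to a pool, so dereferencing needs the base,
// echoing the near-pointer-plus-segment scheme of old.
template <typename T>
struct SlimPtr {
    std::uint32_t offset;  // index into the pool

    T& deref(std::vector<T>& pool) const { return pool[offset]; }
};
```

The trade-off is the same as it ever was: the saving is real only if you can live with every slim pointer being interpreted against the right base.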

## Stemming

July 10, 2012

A few weeks ago, I went to Québec Ouvert Hackathon 3.3, and I was most interested by Michael Mulley’s Open Parliament. One possible addition to the project is to use cross-referencing of entries based not only on the parliament-supplied subject tags but also on the content of the text itself.

One possibility is to learn embeddings on bags of words, but built from stemmed words to reduce the dimensionality of the one-hot vector: essentially a bitmap where the bit corresponding to a word is set to 1 if that word appears in the bag. So, let us start at the beginning: stemming.
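As a toy illustration only (a real stemmer, such as Porter's, applies ordered rule sets with conditions), stemming can be sketched as stripping the first matching suffix from a small, made-up suffix list:

```cpp
#include <string>
#include <vector>

// Naive suffix stripping: drop the first matching suffix, provided a
// stem of at least three characters remains. The suffix list here is
// illustrative, not a linguistically sound rule set.
std::string stem(std::string word)
{
    static const std::vector<std::string> suffixes =
        { "ations", "ing", "ers", "ed", "es", "s" };
    for (const auto& suf : suffixes) {
        if (word.size() > suf.size() + 2 &&
            word.compare(word.size() - suf.size(), suf.size(), suf) == 0)
            return word.substr(0, word.size() - suf.size());
    }
    return word;
}
```

Even this crude version maps “parsing”, “tags”, and “voters” onto shorter stems, which is all dimensionality reduction asks for: collapsing inflected variants onto one coordinate.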

## Ambiguous Domain Names

May 8, 2012

Two weeks ago I attended the Hackreduce Hackathon at Notman House to learn about Hadoop. I joined a few people I knew (and some I had just met) to work on a project whose goal was to extract images from Wikipedia and see if we could correlate an image's popularity, measured as the number of references to it, with some of its intrinsic characteristics.

But two other guys I know (David and Ian) worked on a rather amusing problem: finding domain names that can be parsed in multiple, hilarious ways. I decided to redo their experiment, just for fun.
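Their actual approach is not described here, but counting the readings of a name is a classic dynamic-programming exercise; a toy sketch, with a made-up dictionary, counts the ways a (lowercase, dotless) domain name splits into dictionary words:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// ways[i] = number of segmentations of the first i characters of
// `name` into dictionary words; a name is "ambiguous" when the total
// comes out greater than one.
std::size_t segmentations(const std::string& name,
                          const std::vector<std::string>& dict)
{
    std::vector<std::size_t> ways(name.size() + 1, 0);
    ways[0] = 1;  // the empty prefix has exactly one (empty) reading
    for (std::size_t i = 1; i <= name.size(); ++i)
        for (const auto& w : dict)
            if (w.size() <= i &&
                name.compare(i - w.size(), w.size(), w) == 0)
                ways[i] += ways[i - w.size()];
    return ways[name.size()];
}
```

With a dictionary containing “experts”, “expert”, “sex”, “exchange”, and “change”, the string “expertsexchange” comes out with two readings, one of them the hilarious kind the project was after.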

## Faster Collatz

May 1, 2012

Quite a while ago, I presented the Collatz conjecture and I was then interested in the graphical representation of the problem—and not really going anywhere with it.

In this entry, let us have a look at the implementation of the Collatz function.
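As a preview of the kind of speed-up involved, here is a sketch of a step counter that collapses entire runs of divisions by two using count-trailing-zeros (GCC/Clang's `__builtin_ctzll`; not portable as written):

```cpp
#include <cstdint>

// Collatz: n -> 3n+1 if n is odd, n/2 if even. After 3n+1 the result
// is always even, and often divisible by two several times over, so
// stripping all trailing zero bits at once saves loop iterations.
std::uint64_t collatz_steps(std::uint64_t n)
{
    std::uint64_t steps = 0;
    while (n != 1) {
        if (n & 1) {
            n = 3 * n + 1;
            ++steps;
        } else {
            int z = __builtin_ctzll(n);  // number of trailing zero bits
            n >>= z;                     // divide by 2^z in one shot
            steps += z;
        }
    }
    return steps;
}
```

Each division by two still counts as a step; only the loop overhead is amortized over the whole run of even values.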

## Trigonometric Tables Reconsidered

February 28, 2012

A couple of months ago (already!) 0xjfdube produced an excellent piece on table-based trigonometric function computations. The basic idea is that you can look up the value of a trigonometric function rather than actually computing it, on the premise that computing such functions directly is inherently slow. Turns out that's actually the case, even on fancy CPUs. He discusses the precision of the estimate as a function of the size of the table and shows that you don't need a terribly large number of entries to get decent precision. He muses over the possibility of using interpolation to augment precision, speculating that it might be slow anyway.

I started thinking about how to interpolate efficiently between table entries, but then I realized that it's not fundamentally more complicated to compute a polynomial approximation of the function than to search the table and then use polynomial interpolation between entries.
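To make the table-plus-interpolation side of the comparison concrete, here is a sketch for sine on $[0, \pi/2]$ (the class name and table size are illustrative):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sine over [0, pi/2] from a lookup table with linear interpolation
// between adjacent entries. Interpolation makes the error shrink with
// the square of the table size, versus linearly for nearest-entry.
class SinTable {
    std::vector<double> entries;
    double step;
public:
    explicit SinTable(std::size_t n)
        : entries(n + 1), step(1.5707963267948966 / n)  // pi/2 over n steps
    {
        for (std::size_t i = 0; i <= n; ++i)
            entries[i] = std::sin(static_cast<double>(i) * step);
    }

    double operator()(double x) const  // x assumed in [0, pi/2]
    {
        std::size_t i = static_cast<std::size_t>(x / step);
        if (i + 1 >= entries.size()) return entries.back();
        double t = x / step - static_cast<double>(i);  // fractional part
        return (1 - t) * entries[i] + t * entries[i + 1];
    }
};
```

With 256 entries, the worst-case error is already in the $10^{-5}$ range, but note that the lookup does an index computation, two loads, and a blend: about the cost of evaluating a few terms of a polynomial directly.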

## (Random Musings) On Floats and Encodings

January 31, 2012

The float and double floating-point data types have been present in the C (and C++) standards for a long time. While neither the C nor the C++ standard enforces it, virtually all implementations comply with IEEE 754, or try very hard to. In fact, I do not know, as of today, of an implementation that uses something very different. But IEEE 754-type floats are aging. GPUs started to add extensions such as short floats, for evident reasons. Should we start considering adding new types on both ends of the spectrum?

The next step up, the quadruple-precision float, is already part of the standard but, as far as I know, not implemented anywhere. Intel x86 does have something in between for its internal 80-bit float format, the so-called extended precision, but it's not really standard: it is not sanctioned by the IEEE standards and, surprisingly enough, not well supported by the instruction set. It is sometimes exposed as the long double C type. But, anyway, what's in a floating-point number?
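One concrete way to see what's in one: pull apart the three fields of an IEEE 754 single-precision float, namely 1 sign bit, 8 exponent bits (biased by 127), and 23 mantissa bits (the helper below is illustrative):

```cpp
#include <cstdint>
#include <cstring>

struct FloatFields { std::uint32_t sign, exponent, mantissa; };

// Reinterprets the 32 bits of an IEEE 754 single and slices out the
// fields. memcpy is the well-defined way to type-pun in C++.
FloatFields decompose(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return { bits >> 31,            // 1 sign bit
             (bits >> 23) & 0xFF,   // 8 biased exponent bits
             bits & 0x7FFFFF };     // 23 mantissa bits
}
```

For 1.0f, the exponent field reads exactly the bias, 127, and the mantissa is all zeros, since the leading 1 bit of the significand is implicit.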