Stemming

10/07/2012

A few weeks ago, I went to Québec Ouvert Hackathon 3.3, and I was most interested in Michael Mulley's Open Parliament. One possible addition to the project is to cross-reference entries based not only on the parliament-supplied subject tags but also on the content of the text itself.

One possibility is to learn embeddings on bags of words, using stemmed words to reduce the dimensionality of the one-hot vector, essentially a bitmap where the bit corresponding to a word is set to 1 if that word appears in the bag. So, let us start at the beginning: stemming.
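
To make this concrete, here is a minimal sketch of the idea: a trivial suffix-stripping stemmer (a placeholder, much cruder than a real stemmer such as Porter's) feeding a one-hot bitmap over the vocabulary of stems:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Trivial suffix-stripping "stemmer": strips a few common English
// suffixes. A real stemmer (Porter's, say) is considerably more careful.
std::string stem(std::string word)
{
    for (const std::string suffix : {"ing", "ed", "es", "s"})
        if (word.size() > suffix.size() + 2
            && word.compare(word.size() - suffix.size(), suffix.size(), suffix) == 0)
            return word.substr(0, word.size() - suffix.size());
    return word;
}

// Assigns each distinct stem a bit position and builds the one-hot
// bitmap for a bag of words.
class one_hot_vocabulary
{
    std::unordered_map<std::string, size_t> index;

public:
    size_t id(const std::string & word)
    {
        const std::string s = stem(word);
        auto it = index.find(s);
        if (it == index.end())
            it = index.insert({s, index.size()}).first;
        return it->second;
    }

    std::vector<bool> encode(const std::vector<std::string> & bag)
    {
        std::vector<bool> bits(index.size(), false);
        for (const auto & w : bag)
        {
            size_t i = id(w); // may grow the vocabulary
            if (i >= bits.size())
                bits.resize(i + 1, false);
            bits[i] = true;
        }
        return bits;
    }
};
```

With such an encoding, "votes" and "voting" both map to the bit of "vot", which is exactly the dimensionality reduction we are after.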



Ambiguous Domain Names

08/05/2012

Two weeks ago I attended the Hackreduce Hackathon at Notman House to learn about Hadoop. I joined a few people I knew (and some I had just met) to work on a project whose goal was to extract images from Wikipedia and see if we could correlate an image's popularity, measured as the number of references to it, with some of its intrinsic characteristics.

But two other guys I know (David and Ian) worked on a rather amusing problem: finding domain names that can be parsed in multiple, hilarious ways. I decided to redo their experiment, just for fun.
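
Their Hadoop pipeline is described in the full post; as a standalone illustration of the core idea, a dictionary-driven segmentation that enumerates every possible parse of a name (the toy dictionary and the classic example below are placeholders) could look like this:

```cpp
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

// Recursively enumerates every way of splitting name[from..] into
// dictionary words; a domain is "ambiguous" if it has several parses.
void segmentations(const std::string & name,
                   size_t from,
                   const std::unordered_set<std::string> & dict,
                   std::vector<std::string> & current,
                   std::vector<std::vector<std::string>> & parses)
{
    if (from == name.size())
    {
        parses.push_back(current);
        return;
    }
    for (size_t len = 1; from + len <= name.size(); len++)
        if (dict.count(name.substr(from, len)))
        {
            current.push_back(name.substr(from, len));
            segmentations(name, from + len, dict, current, parses);
            current.pop_back();
        }
}

int main()
{
    // toy dictionary; a real run would load something like /usr/share/dict/words
    std::unordered_set<std::string> dict =
        { "experts", "expert", "sex", "exchange", "change" };

    std::vector<std::string> current;
    std::vector<std::vector<std::string>> parses;
    segmentations("expertsexchange", 0, dict, current, parses);

    for (const auto & parse : parses)
    {
        for (const auto & word : parse)
            std::cout << word << ' ';
        std::cout << '\n';
    }
    return 0;
}
```

The two parses of the example speak for themselves.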



Faster Collatz

01/05/2012

Quite a while ago, I presented the Collatz conjecture; at the time, I was interested in the graphical representation of the problem, without really going anywhere with it.

In this entry, let us have a look at the implementation of the Collatz function.
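
As a point of reference, here is the plain, unoptimized iteration (the speed-ups are the subject of the entry itself):

```cpp
#include <cstdint>
#include <iostream>

// Counts the steps to reach 1: n -> n/2 if n is even, 3n+1 if n is odd.
uint64_t collatz_steps(uint64_t n)
{
    uint64_t steps = 0;
    while (n != 1)
    {
        if (n & 1)
            n = 3 * n + 1; // odd (note: 3n+1 is then always even)
        else
            n >>= 1;       // even
        steps++;
    }
    return steps;
}

int main()
{
    std::cout << collatz_steps(27) << std::endl; // 111 steps
    return 0;
}
```

The remark in the comment, that 3n+1 is always even when n is odd, is precisely the kind of observation faster implementations exploit.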



Trigonometric Tables Reconsidered

28/02/2012

A couple of months ago (already!), 0xjfdube produced an excellent piece on table-based trigonometric function computation. The basic idea is that you can look up the value of a trigonometric function rather than actually computing it, on the premise that computing such functions directly is inherently slow. It turns out that's actually the case, even on fancy CPUs. He discusses the precision of the estimate as a function of the table size and shows that you don't need a terribly large number of entries to get decent precision. He muses over the possibility of using interpolation to improve precision, speculating that it might be slow anyway.

I started thinking about how to interpolate efficiently between table entries, but then I realized that computing a polynomial approximation of the function is not fundamentally more complicated than searching the table and then interpolating between entries.
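
To fix ideas, here is a generic sketch of the table-plus-linear-interpolation approach (not 0xjfdube's code; the table size is arbitrary and, for simplicity, arguments are assumed non-negative):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// sin over [0, 2*pi) from a table of samples, with linear
// interpolation between adjacent entries; assumes x >= 0.
class sin_table
{
    std::vector<double> table;

public:
    explicit sin_table(size_t size)
        : table(size)
    {
        for (size_t i = 0; i < size; i++)
            table[i] = std::sin(2 * M_PI * i / size);
    }

    double operator()(double x) const
    {
        double t = x / (2 * M_PI) * table.size();
        size_t i = (size_t)t % table.size();
        size_t j = (i + 1) % table.size();
        double f = t - std::floor(t); // fractional position between entries
        return table[i] + f * (table[j] - table[i]);
    }
};

int main()
{
    sin_table s(256);
    std::printf("%f vs %f\n", s(1.0), std::sin(1.0)); // nearly identical
    return 0;
}
```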



(Random Musings) On Floats and Encodings

31/01/2012

The float and double floating-point data types have been present in the C (and C++) standards for a long time. While neither the C nor the C++ standard enforces it, virtually all implementations comply with IEEE 754, or try very hard to. In fact, I know of no implementation today that uses something very different. But IEEE 754-type floats are aging. GPUs have started adding extensions such as short floats, for evident reasons. Should we start considering adding new types at both ends of the spectrum?

The next step up, the quadruple-precision float, is already part of the standard but, as far as I know, not implemented anywhere. Intel x86 does have something in between as its internal 80-bit float format, the so-called extended precision, but it's not really standard, as it is not sanctioned by the IEEE standards, and, surprisingly enough, not well supported by the instruction set either. It is sometimes exposed as the long double C type. But, anyway, what's in a floating-point number?
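
For the curious, here is a small sketch of how the fields of an IEEE 754 single-precision float can be pulled apart (the example value is arbitrary):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    float x = -6.25f; // -1.5625 * 2^2

    // copy the bits out (well defined, unlike pointer casts)
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));

    unsigned sign     =  bits >> 31;          //  1 bit
    unsigned exponent = (bits >> 23) & 0xff;  //  8 bits, biased by 127
    unsigned mantissa =  bits & 0x7fffff;     // 23 bits, implicit leading 1

    // prints sign=1 exponent=2 mantissa=0x480000
    std::printf("sign=%u exponent=%d mantissa=0x%06x\n",
                sign, (int)exponent - 127, mantissa);
    return 0;
}
```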



Medians (Part III)

24/01/2012

So, in the two previous parts of this series, we looked at the selection algorithm and at sorting networks for efficiently determining the (sample) median of a series of values.

In this last installment of the series, I consider an efficient (but approximate) algorithm based on heaps to compute the median.
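
The approximate algorithm itself is in the full post; as background, here is the classic exact two-heap running median, against which any approximation can be judged (it is exact, but its memory grows with the number of samples seen):

```cpp
#include <functional>
#include <iostream>
#include <queue>
#include <vector>

// Exact running median: a max-heap holds the lower half of the
// samples, a min-heap the upper half, balanced within one element.
class running_median
{
    std::priority_queue<double> lower; // max-heap
    std::priority_queue<double,
                        std::vector<double>,
                        std::greater<double>> upper; // min-heap

public:
    void add(double x)
    {
        if (lower.empty() || x <= lower.top())
            lower.push(x);
        else
            upper.push(x);

        // rebalance so that sizes differ by at most one
        if (lower.size() > upper.size() + 1)
        {
            upper.push(lower.top());
            lower.pop();
        }
        else if (upper.size() > lower.size() + 1)
        {
            lower.push(upper.top());
            upper.pop();
        }
    }

    double median() const
    {
        if (lower.size() == upper.size())
            return (lower.top() + upper.top()) / 2;
        return (lower.size() > upper.size()) ? lower.top() : upper.top();
    }
};

int main()
{
    running_median m;
    for (double x : { 5.0, 2.0, 8.0, 1.0, 9.0 })
        m.add(x);
    std::cout << m.median() << std::endl; // 5
    return 0;
}
```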



Medians (Part II)

10/01/2012

In the previous post of this series, we left off asking ourselves whether there was a better way than the selection algorithm to find the median.

Computing the median of three numbers is as simple as sorting the three numbers (an operation that can be done in constant time, after all, if comparing and swapping are constant-time) and picking the middle one. However, if the objects compared are “heavy”, comparing and (especially) moving them around may be expensive.
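
For light objects, though, the median of three reduces to a handful of comparisons; here is a sketch that uses at most three comparisons and returns a reference, so that nothing heavy is copied or moved:

```cpp
#include <iostream>

// Median of three using at most three comparisons and no swaps.
template <typename T>
const T & median3(const T & a, const T & b, const T & c)
{
    if (a < b)
    {
        if (b < c) return b;    // a < b < c
        return (a < c) ? c : a; // a, c <= b
    }
    if (a < c) return a;        // b <= a < c
    return (b < c) ? c : b;     // b, c <= a
}

int main()
{
    std::cout << median3(3, 1, 2) << std::endl; // 2
    return 0;
}
```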



Medians (Part I)

27/12/2011

In a previous installment, about filtering noise, we discussed how to use a moving average to even out irregularities in a sequence. Averaging over a window is tricky: first, the window size must make sense relative to the speed at which the signal changes (in samples); second, the average is (overly) sensitive to outliers.

One way to limit the influence of outliers for denoising is to use the median. However, computing the median is usually trickier than computing the average, or mean, and this first post (in a series of three, over the next few weeks) discusses how to compute the median efficiently using the selection algorithm.
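
The selection algorithm itself is developed in the post; as a preview, note that the C++ standard library already packages selection as std::nth_element, which runs in expected linear time:

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

// std::nth_element partially reorders the values so that the k-th
// smallest lands at position k, in expected linear time.
double median(std::vector<double> values) // by value: we reorder them
{
    size_t k = values.size() / 2;
    std::nth_element(values.begin(), values.begin() + k, values.end());
    return values[k]; // upper median for even-sized inputs
}

int main()
{
    std::cout << median({ 7.0, 1.0, 9.0, 3.0, 5.0 }) << std::endl; // 5
    return 0;
}
```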



Hash Functions (checksums, part II?)

22/11/2011

On a number of different occasions, I have briefly discussed hash functions, saying that a hash function needn't be very strong if the intent is to use it for look-up, contrary to cryptographic applications. But, unsurprisingly, it's rather difficult to get a good, fast hash function.

Coming up with a very good hash function isn't easy, but we can at least make an effort to build one by understanding what makes a hash function better, especially by explicitly testing candidates. Let us have a try at building a good hash function (for look-up) from scratch.
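
As a baseline to measure against (not the function we will build), the well-known FNV-1a hash illustrates the basic xor-then-multiply recipe:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// FNV-1a: xor a byte in, then multiply by a prime that smears the
// byte's bits across the whole word. Fast, and decent for look-up.
uint32_t fnv1a(const void * data, size_t length)
{
    const uint8_t * bytes = static_cast<const uint8_t *>(data);
    uint32_t h = 2166136261u;  // FNV offset basis
    for (size_t i = 0; i < length; i++)
    {
        h ^= bytes[i];
        h *= 16777619u;        // FNV prime
    }
    return h;
}

int main()
{
    const char * s = "hello";
    std::printf("%08x\n", fnv1a(s, std::strlen(s)));
    return 0;
}
```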



Mild Obfuscation

08/11/2011

Sometimes, you have a small bit of data, maybe something like a GUID (for which there are many possible solutions), that you have to store in a plain-text file: nothing crucial, nothing sensitive, but something you don't really want your users to poke at, even if they really mean to. In such cases, you could use encryption, but mild obfuscation may be quite sufficient, and dissuasive enough.

So, if you don’t really want strong encryption, what can you do to provide a machine-efficient encryptionnette?
