Building a large text corpus (Part I)

December 12, 2017

Getting good text data for language-model training isn’t as easy as it sounds. First, you have to find a large corpus. Second, you must clean it up!

Read the rest of this entry »

Finding dependencies for Make

April 25, 2017

How hard is it to get dependencies for your project to use in a Makefile?

Well, it depends.

Read the rest of this entry »

Choosing Random Files

February 14, 2017

This week, something short. To run tests, I needed a selection of WAV files. Fortunately for me, I’ve got literally thousands of FLAC files lying around on my computer—yes, I listen to music when I code. So I wrote a simple script that randomly chooses a number of file from a directory tree (and not a single directory) and transcode them from FLAC to WAV. Also very fortunately for me, Bash and the various GNU/Linux utilities make writing a script for this rather easy.


Read the rest of this entry »

LaTeXify C/C++ code snippets

January 26, 2016

So I’m still writing lecture notes. This time, I need to include kind of larger pieces of C or C++ code, and \LaTeX environments do not really help me a lot. Some are better than others, but you still have to escape and fancify text yourself. This is laborious and error-prone, and is an obvious target for automation. A script of some sort. The task isn’t overly complicated: highlight keywords, and escape symbols like { } _ and & that make \LaTeX unhappy. This looks like a job for

Read the rest of this entry »

Checking a LaTeX Index

March 31, 2015

This week again, LaTeX. This time, the index. At the end of a document, you will usually find an index so that, if you don’t have a magical ctrl-f, you can find something fast in the book.


In LaTeX, creating an index is really easy. You include the package makeindex, and plant \index{topic!subtopic} tags in the text (preferably just besides the word you want to index, the \index command doesn’t understand paragraphs). You add \printindex somewhere else in your document and you run pdflatex (or just latex) to get the index generated. That’s all fine except that it doesn’t provide checks. It adds to the index whatever you typed, and doesn’t give warnings if you have an entry “compression” and an entry “compresion” (because, you know, typos happen). Let’s see how we can somewhat fix that.

Read the rest of this entry »

Comparing GPSes (yet more GPS data)

February 3, 2015

In a few other entries, I’ve toyed with GPS, either getting or parsing the data with Bash, assessing or using the GPS data. However, when we use GPS, we suppose that the precision varies by brand and model&mdashsome will have greater precision—but our intuition tells us that two GPS devices of the same brand and model should perform identically. That’s what we’re used to with, or at least expect from, computers. But what about USB GPS devices?


So I got two instances of the same model+brand GPS. Let them call them GPS-1 and GPS-2. Do they perform similarly?

Read the rest of this entry »

Is Python Slow?

November 10, 2009

Python is a programming language that I learnt somewhat recently (something like 2, 3 years ago) and that I like very much. It is simple, to the point, and has several functional-like constructs that I am already familiar with. But Python is slow compared to other programming languages. But it was unclear to me just how slow Python was compared to other languages. It just felt slow.


So I have decided to investigate by comparing the implementation of a simple, compute-bound problem, the eight queens puzzle generalized to any board dimensions. This puzzle is most easily solved using, as Dijkstra did, a depth-first backtracking program, using bitmaps to test rapidly whether or not a square is free of attack1. I implemented the same program in C++, Python, and Bash, and got help from friends for the C# and Java versions2. I then compared the resulting speeds.

Read the rest of this entry »