Building a large text corpus (Part I)

December 12, 2017

Getting good text data for language-model training isn’t as easy as it sounds. First, you have to find a large corpus. Second, you must clean it up!

Read the rest of this entry »


Finding dependencies for Make

April 25, 2017

How hard is it to get dependencies for your project to use in a Makefile?

Well, it depends.

Read the rest of this entry »


r4nd0m pa$$w0rd

March 14, 2017

Let’s take it easy this week. How about we generate random passwords? That should be fun, right?
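To give a taste of the idea, here is a minimal sketch in Python (not necessarily the approach the full entry takes). It assumes the password is drawn uniformly from a fixed alphabet; the function name `random_password` and its default length are made up for illustration.

```python
import secrets
import string

# Assumed alphabet: letters, digits and ASCII punctuation.
DEFAULT_ALPHABET = string.ascii_letters + string.digits + string.punctuation


def random_password(length=16, alphabet=DEFAULT_ALPHABET):
    """Return `length` characters drawn independently and uniformly from `alphabet`."""
    if length < 1:
        raise ValueError("length must be at least 1")
    # secrets.choice uses the operating system's cryptographic random source.
    return "".join(secrets.choice(alphabet) for _ in range(length))


if __name__ == "__main__":
    print(random_password())
```

The `secrets` module is used rather than `random` because the latter is not meant for security-sensitive values.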

Read the rest of this entry »


Cleaning Scans

February 21, 2017

Scanning documents or books without expensive hardware and commercial software can be tricky. This week, I give you the script I use to clean up a scanned image (and eventually assemble many of them into a single PDF document).

Read the rest of this entry »


Choosing Random Files

February 14, 2017

This week, something short. To run tests, I needed a selection of WAV files. Fortunately for me, I’ve got literally thousands of FLAC files lying around on my computer (yes, I listen to music when I code). So I wrote a simple script that randomly chooses a number of files from a directory tree (and not a single directory) and transcodes them from FLAC to WAV. Also very fortunately for me, Bash and the various GNU/Linux utilities make writing a script for this rather easy.
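The original script is written in Bash; as a rough illustration of the same idea, here is a sketch in Python. The function names `pick_random_files` and `wav_command` are hypothetical, and the transcoding step assumes the `flac` command-line tool is installed.

```python
import random
from pathlib import Path


def pick_random_files(root, count, suffix=".flac", rng=None):
    """Pick up to `count` distinct files with `suffix` from the whole tree under `root`."""
    rng = rng or random.Random()
    # rglob walks every subdirectory, not just the top-level one.
    candidates = sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == suffix
    )
    return rng.sample(candidates, min(count, len(candidates)))


def wav_command(flac_path, out_dir):
    """Build the decoding command (assumes the `flac` tool: -d decode, -s silent, -o output)."""
    target = Path(out_dir) / (Path(flac_path).stem + ".wav")
    return ["flac", "-d", "-s", "-o", str(target), str(flac_path)]
```

Each command could then be run with `subprocess.run(wav_command(path, out_dir), check=True)`. Sampling without replacement means no file is picked twice.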

Read the rest of this entry »


Search all your Bibtex files

January 12, 2016

When I write papers or other things, I tend to create separate bib files, so that I don’t end up with a giant unsearchable and unmaintainable blob. Moreover, topics tend to be transient, and the bibliography may or may not be interesting in a few years’ time, so, if unused, it can safely sleep in a directory with the paper it’s attached to.

But once in a while, I need one of those old references, and since they’re scattered just about everywhere… it may take a while to find them again. Unless you have a script. Scripts are nice.
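What such a script might look like, sketched in Python rather than as the entry's actual script: walk a directory tree, open every `.bib` file, and report the lines that match a pattern. The name `search_bib_files` and the case-insensitive matching are assumptions made for the example.

```python
import re
from pathlib import Path


def search_bib_files(root, pattern):
    """Yield (path, line_number, line) for every matching line in .bib files under `root`."""
    regex = re.compile(pattern, re.IGNORECASE)
    for path in sorted(Path(root).rglob("*.bib")):
        if not path.is_file():
            continue
        # errors="replace" keeps the search going on files with odd encodings.
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if regex.search(line):
                    yield path, number, line.rstrip("\n")


if __name__ == "__main__":
    import sys
    # Usage: python script.py DIRECTORY PATTERN
    for path, number, line in search_bib_files(sys.argv[1], sys.argv[2]):
        print(f"{path}:{number}: {line}")
```

The output format mimics `grep -n`, so hits can be located quickly in an editor.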

Read the rest of this entry »


Optimizing JPEG for bandwidth

September 1, 2015

Optimizing web content is always complicated. On one hand, you want your users to have the best possible user experience, but on the other hand, you don’t really want to spend much bandwidth delivering the bits.

This week, let’s have a look at how we can optimize images for perceptual quality while minimizing bandwidth. We could proceed by guesswork, fiddling with the parameters until it kind of looks OK, or we can take 5 minutes to write a script that searches the parameter space for the best solution given a constraint, say, perceptual quality.
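One way such a search could work, shown as a Python sketch under stated assumptions rather than as the entry's actual script: if the perceptual score never decreases as the JPEG quality setting goes up, a binary search finds the lowest setting that still meets the target. Here `measure` is a hypothetical callable that would compress the image at a given quality and return its score.

```python
def lowest_quality_meeting(measure, threshold, low=1, high=100):
    """Return the smallest integer q in [low, high] with measure(q) >= threshold.

    Assumes measure(q) is non-decreasing in q. Returns None when even the
    highest setting falls short of the threshold.
    """
    if measure(high) < threshold:
        return None
    while low < high:
        mid = (low + high) // 2
        if measure(mid) >= threshold:
            high = mid      # mid is good enough; try lower settings
        else:
            low = mid + 1   # mid is too lossy; go higher
    return low
```

With settings from 1 to 100, this needs at most eight calls to `measure`, which matters when each call means re-encoding an image. If the score is not monotone for a particular image, a plain scan over all settings is the safer fallback.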

Read the rest of this entry »