Getting good text data for language-model training isn’t as easy as it sounds. First, you have to find a large corpus. Second, you must clean it up!
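Just to give the flavor of the cleanup step (this is only an illustrative sketch, not the actual pipeline the post describes), a few lines of shell with standard GNU utilities already go a long way: stripping stray control characters, trailing whitespace, and runs of blank lines.

```shell
# Illustrative only: a tiny text-cleanup filter built from coreutils.
# Deletes control characters (keeping tab and newline), strips trailing
# whitespace, and squeezes runs of blank lines into one.
clean_text() {
  tr -d '\000-\010\013-\037' | sed 's/[[:space:]]*$//' | cat -s
}
```

Used as a filter, e.g. `clean_text < raw.txt > clean.txt`; real corpus cleanup of course needs much more (encoding fixes, deduplication, boilerplate removal).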
Let’s take it easy this week. How about we generate some random passwords? That should be fun, right?
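The core idea can be sketched in a couple of lines (the character set and default length here are my arbitrary choices, not necessarily the post’s): draw random bytes from /dev/urandom and keep only the characters you want.

```shell
# Sketch: print a random password of $1 characters (default 16),
# keeping only letters and digits from /dev/urandom's byte stream.
gen_password() {
  local len=${1:-16}
  tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "$len"
  echo
}
```

For example, `gen_password 20` prints a 20-character alphanumeric password; extending the `tr` set adds punctuation if your password policy demands it.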
Scanning documents or books without expensive hardware and commercial software can be tricky. This week, I give you the script I use to clean up a scanned image (and eventually assemble many of them into a single PDF document).
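As a rough sketch of the kind of cleanup involved (the flags and thresholds below are my guesses, not the post’s script; it assumes ImageMagick’s `convert` is installed), one pass per page might look like this:

```shell
# Sketch: clean up one scanned page with ImageMagick (assumed installed).
# Grayscale, deskew, and stretch levels to whiten the background.
# Set DRY_RUN=1 to print the command instead of running it.
clean_scan() {
  local cmd=(convert "$1" -colorspace Gray -deskew 40% -level 20%,80% "$2")
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s ' "${cmd[@]}"; echo
  else
    "${cmd[@]}"
  fi
}
```

The cleaned pages can then be assembled with something like `convert page-*.png book.pdf` (subject to ImageMagick’s security policy for PDF output).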
This week, something short. To run tests, I needed a selection of WAV files. Fortunately for me, I’ve got literally thousands of FLAC files lying around on my computer—yes, I listen to music when I code. So I wrote a simple script that randomly chooses a number of files from a directory tree (not just a single directory) and transcodes them from FLAC to WAV. Also very fortunately for me, Bash and the various GNU/Linux utilities make writing such a script rather easy.
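The two halves of the idea can be sketched like this (a sketch, not the article’s script: it assumes GNU `find` and `shuf`, and `ffmpeg` for the actual transcoding):

```shell
#!/usr/bin/env bash
# Print <count> .flac files chosen at random from the whole tree
# rooted at <dir>.  (Assumes filenames without embedded newlines.)
pick_random_flacs() {
  find "$1" -type f -name '*.flac' | shuf -n "$2"
}

# Decode one FLAC file to a same-named WAV in <outdir>;
# assumes ffmpeg is installed.
transcode_to_wav() {
  local base
  base=$(basename "$1" .flac)
  ffmpeg -loglevel error -i "$1" "$2/$base.wav"
}
```

Gluing them together is a one-liner: `pick_random_flacs ~/Music 10 | while IFS= read -r f; do transcode_to_wav "$f" /tmp/wavs; done`.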
When I write papers or other things, I tend to create separate bib files, so that I don’t end up with a giant, unsearchable, unmaintainable blob. Moreover, topics tend to be transient, and a bibliography may or may not be interesting in a few years’ time; if unused, it can safely sleep in a directory with the paper it’s attached to.
But once in a while, I need one of those old references, and since they’re scattered just about everywhere… it may take a while to track them down. Unless you have a script. Scripts are nice.
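The gist of such a script fits in one function (again a sketch, assuming GNU grep with `--include`; the function name is mine): recursively grep only the `.bib` files under a root directory for a citation key.

```shell
# Sketch: list every .bib file under <root> ($2) that mentions
# the citation key <key> ($1), ignoring all other file types.
find_bib_entry() {
  grep -rl --include='*.bib' -- "$1" "$2"
}
```

For instance, `find_bib_entry knuth1984 ~/papers` prints the path of every bib file containing that key, however deep it sleeps in the tree.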
Optimizing web content is always complicated. On one hand, you want your users to have the best possible experience; on the other, you don’t really want to spend too much bandwidth delivering the bits.
This week, let’s have a look at how we can optimize images for perceptual quality while minimizing bandwidth. We could proceed by guesswork, fiddling with the parameters until it kind of looks OK, or we can take five minutes to write a script that searches the parameter space for the best solution given a constraint, say, perceptual quality.
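One way to sketch that search (my sketch, not necessarily the post’s approach): if the quality metric grows roughly monotonically with the encoder’s quality setting, we can binary-search for the smallest JPEG quality that still meets a target. This assumes a recent ImageMagick whose `compare` supports the SSIM metric; the function names are mine.

```shell
#!/usr/bin/env bash
# Print the SSIM between <original> ($1) and a JPEG of it encoded
# at quality <q> ($2).  Assumes ImageMagick 7's `magick` is installed.
ssim_at() {
  local orig=$1 q=$2 tmp
  tmp=$(mktemp --suffix=.jpg)
  magick "$orig" -quality "$q" "$tmp"
  magick compare -metric SSIM "$orig" "$tmp" null: 2>&1
  rm -f "$tmp"
}

# Binary-search the smallest quality in [1, 95] whose SSIM against
# <original> ($1) reaches <target> ($2), assuming SSIM is monotone
# in the quality setting.
find_smallest_quality() {
  local orig=$1 target=$2 lo=1 hi=95 mid s
  while (( lo < hi )); do
    mid=$(( (lo + hi) / 2 ))
    s=$(ssim_at "$orig" "$mid")
    if awk -v s="$s" -v t="$target" 'BEGIN { exit !(s >= t) }'; then
      hi=$mid            # good enough: try lower qualities
    else
      lo=$(( mid + 1 ))  # too lossy: raise the quality floor
    fi
  done
  echo "$lo"
}
```

For example, `find_smallest_quality photo.png 0.98` answers the question “what is the cheapest JPEG quality that still looks essentially identical?” in a handful of encodes instead of ninety-five.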