Getting good text data for language-model training isn’t as easy as it sounds. First, you have to find a large corpus. Second, you must clean it up!
About a week ago, some dude dropped on IRC that he’d beaten memcpy “by a lot”. That’d be interesting, except that we could get neither code nor test methodology out of him. But how hard can making a better memcpy be? Turns out, harder than you think!
If you think this is a typical case of “reinventing the wheel”, I mostly agree with you. But while reinventing will be hard, can improvements be made?
I’m currently working with one of my students on a laser-based range finder. To assess the precision of the device, I needed a calibration piece. Because of the setup, the piece should look like a stair.
The piece should allow a wide range of different readings, say from 1 to 10 centimeters in known increments of, say, 1 cm. The naïve way of building such a piece is to build a stair with 10 steps. However, if you do it like this, the piece is wide, cumbersomely so. Is there a better way to build it?
Von Neumann proposed the middle square method of generating pseudo-random numbers in 1949, in a paper published a bit later. The method is simple: you take a seed, say 4 digits long, you square it, and extract the middle 4 digits, which become the next seed. For example:
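A minimal sketch of the method in Python, assuming 4-digit seeds whose squares are treated as 8-digit numbers (padded with leading zeros when shorter). The function names `middle_square` and `sequence`, and the seed 1234, are picked here for illustration only:

```python
def middle_square(seed: int, digits: int = 4) -> int:
    """Square the seed and keep the middle `digits` digits of the result."""
    square = seed * seed
    # The square has up to 2*digits digits: drop the lowest digits//2 of them,
    # then keep the next `digits` digits.
    return (square // 10 ** (digits // 2)) % 10 ** digits


def sequence(seed: int, count: int, digits: int = 4) -> list[int]:
    """Return the first `count` values produced from `seed`."""
    values = []
    for _ in range(count):
        seed = middle_square(seed, digits)
        values.append(seed)
    return values


print(sequence(1234, 4))  # [5227, 3215, 3362, 3030]
```

Starting from 1234, the square is 1522756, read as 01522756, whose middle four digits are 5227; that value becomes the next seed, and so on.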
While it seems random enough, is it?
ANSI art and poor resolution may appeal to the nostalgic, those who long for the time when BBSes were still it and the IBM PC’s programmable character set was the nec plus ultra of semigraphics, but they’re not really useful. At best, we can use them to spare ourselves from using ncurses and still get some colors and effects.
However, “semigraphics” may have their use in lossy data compression, where we allow some data to be lost to gain some more compression. That may be especially true when we have very little computing power or if we want to have many simple CPUs in parallel doing the decoding.