Building a large text corpus (Part I)

December 12, 2017

Getting good text data for language-model training isn’t as easy as it sounds. First, you have to find a large corpus. Second, you must clean it up!



Reinventing the Wheel (or not)

December 5, 2017

About a week ago, some dude drops on IRC that he’s beaten memcpy “by a lot”. That’d be interesting, except that we could get neither code nor test methodology out of him. But how hard can making a better memcpy be? Turns out, harder than you think!

If you think this is a typical case of “reinventing the wheel”, I mostly agree with you. But while reinventing will be hard, can improvements be made?



Chaotic Rulers

November 28, 2017

I’m currently working with one of my students on a laser-based range finder. To assess the precision of the device, I needed a calibration piece. Because of the setup, the piece should look like a stair.

The piece should allow a wide range of different readings, say from 1 to 10 centimeters in known increments of, say, 1 cm. The naïve way of building such a piece is to build a stair with 10 steps. However, if you do it like this, the piece is wide, cumbersomely so. Is there a better way?



The Middle Square Method (Generating Random Sequences VIII)

November 21, 2017

Von Neumann proposed the middle square method of generating pseudo-random numbers in 1949, in a paper published a bit later. The method is simple: you take a seed, say 4 digits long, you square it, and extract the middle 4 digits, which become the next seed. For example:

4373 → 19123129 → 1231.
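As a minimal sketch of one step of the method, assuming 4-digit seeds and a square viewed as 8 digits (zero-padded on the left if needed), we can drop the two lowest digits and keep the next four. The function name middle_square_next is hypothetical, not taken from the original paper:

```cpp
#include <cstdint>

// One step of the middle square method for 4-digit seeds (0..9999).
// The square has at most 8 digits; dividing by 100 drops the two
// lowest digits, and the modulo keeps the four digits in the middle.
constexpr std::uint32_t middle_square_next(std::uint32_t seed)
 {
  return (seed * seed / 100) % 10000;
 }

// The example from the text: 4373 -> 19123129 -> 1231
static_assert(middle_square_next(4373) == 1231,
              "middle digits of 4373 squared");
```

Calling the function repeatedly on its own output gives the sequence of seeds.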

While it seems random enough, is it?



ANSI Art

November 14, 2017

Since we now have minimal ANSI support, we can use it. Of course, for cute things such as changing text color (red for error, green for OK, etc.), but that’s not very amusing. Let’s make some ANSI ART!!1!



Semigraphics Compression

November 7, 2017

ANSI art and poor resolution may appeal to the nostalgic, those longing for the time when BBSes were still it and the IBM PC’s programmable character set was the nec plus ultra of semigraphics, but they’re not really useful. At best, we can use them to spare ourselves from using ncurses while still getting some colors and effects.

However, “semigraphics” may have their use in lossy data compression, where we allow some data to be lost to gain some more compression. That may be especially true when we have very little computing power or if we want to have many simple CPUs in parallel doing the decoding.



Just Larger

October 31, 2017

This week, something short & sweet & probably useful for generic programming: find the next integral type just bigger than a given type. The obvious application would be that if you want to add ints, you’ll want the variable that holds the sums to be larger than int. In generic programming, however, you don’t necessarily know beforehand that the base type will be int.

Turns out, while there isn’t anything built-in to do that, it isn’t very complicated either. First, we will define a template with a series of explicit specializations; second, we will use using to extract the type seamlessly.

We will define a struct (rather than a class, merely to avoid adding public:) that defines a nested type, named type, depending on its template argument. The trick is that if T is xyz, then we define the nested type as zyx and make it public. Since we can’t (or at least, I haven’t figured out how to) have some kind of switch/case on types, we will have to define explicit specializations, one for each supported (integral) type.

The simplest possible implementation would be something like this:

////////////////////////////////////////
#include <cstdint> // for int16_t, uint8_t, etc.

// primary template is only declared: unsupported types fail to compile
template <typename T> struct just_larger_;

// some are implementation-specific (char may or may not be signed)
template <> struct just_larger_<char> { using type = int16_t; };

// some standard types
template <> struct just_larger_<int16_t>  { using type = int32_t;  };
template <> struct just_larger_<uint8_t>  { using type = uint16_t; };
template <> struct just_larger_<uint16_t> { using type = uint32_t; };

// "extracts" type
template <typename T>
 using just_larger=
  typename just_larger_<T>::type;

Of course, one would have to define specializations for all the basic types, the types from headers <cstddef>, <cstdint>, and any other platform- or implementation-specific headers.
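A sketch of what a somewhat fuller table could look like, assuming a mainstream platform where the exact-width types of <cstdint> exist and are all distinct from each other and from plain char. The 64-bit types are deliberately left without a specialization, since the standard offers nothing larger, so asking for them fails at compile time:

```cpp
#include <cstdint>

// primary template is only declared: unsupported types fail to compile
template <typename T> struct just_larger_;

// plain char is a distinct type; int16_t holds it whether char is signed or not
template <> struct just_larger_<char>          { using type = std::int16_t;  };

// signed exact-width types
template <> struct just_larger_<std::int8_t>   { using type = std::int16_t;  };
template <> struct just_larger_<std::int16_t>  { using type = std::int32_t;  };
template <> struct just_larger_<std::int32_t>  { using type = std::int64_t;  };

// unsigned exact-width types
template <> struct just_larger_<std::uint8_t>  { using type = std::uint16_t; };
template <> struct just_larger_<std::uint16_t> { using type = std::uint32_t; };
template <> struct just_larger_<std::uint32_t> { using type = std::uint64_t; };

// "extracts" type
template <typename T>
 using just_larger=
  typename just_larger_<T>::type;

// sanity check: the result really is wider than the argument
static_assert(sizeof(just_larger<std::int32_t>) > sizeof(std::int32_t),
              "just_larger must yield a wider type");
```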

The usefulness of just_larger<T> is within another template. If one of the arguments of this template is used with just_larger, then it’s simple. If, for some reason, you do not have direct access to the type, but only to, say, a field name, you may need to use decltype, a C++11 addition. decltype gives the declared type of an entity or of an arbitrary expression.
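To make both cases concrete, here is a small sketch, assuming C++14 or later (for the loop in a constexpr function). The function sum and the struct pixel are hypothetical names made up for the illustration, and the trait is repeated in minimal form so the sketch stands alone:

```cpp
#include <cstddef>
#include <cstdint>

// minimal copy of the trait, just enough for this sketch
template <typename T> struct just_larger_;
template <> struct just_larger_<std::uint8_t>  { using type = std::uint16_t; };
template <> struct just_larger_<std::uint16_t> { using type = std::uint32_t; };
template <typename T> using just_larger = typename just_larger_<T>::type;

// Case 1: the type is a template argument.
// The accumulator is one step wider than the element type T.
template <typename T, std::size_t N>
constexpr just_larger<T> sum(const T (&values)[N])
 {
  just_larger<T> total = 0;
  for (std::size_t i = 0; i < N; ++i)
   total += values[i];
  return total;
 }

// Case 2: only a field name is at hand, so decltype recovers its type.
struct pixel { std::uint8_t red, green, blue; };

using channel_sum = just_larger<decltype(pixel::red)>;
static_assert(sizeof(channel_sum) == 2, "one step wider than uint8_t");
```

Note that one step wider only protects sums of a limited number of terms: a uint16_t accumulator overflows after a few hundred large uint8_t values.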

An example of use:

#include <iostream>
#include <cstdint>
#include <typeinfo> // for typeid and type_info

#include "just_larger.hpp"

int main()
 {
  just_larger<char> z;

  std::cout
   // use from type
   << sizeof(just_larger<char>) << std::endl

   // use from variable
   << typeid(z).name() << std::endl
   << sizeof(decltype(z)) << std::endl
   << typeid(just_larger<decltype(z)>).name() << std::endl
   << sizeof(just_larger<decltype(z)>) << std::endl
   ;

  return 0;
 }

The typeid operator returns a (const) reference to a std::type_info, a class that holds some information on the type. Unfortunately, name() doesn’t print pretty names, but some implementation-specific string: with g++, for instance, int is not printed as int but as i. That’s somewhat cryptic, but enough to verify that the implementation works correctly.
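If reading mangled names gets tedious, one possible alternative is to let the compiler do the checking with static_assert and std::is_same. This is only a sketch, with the trait repeated in minimal form so that it stands alone:

```cpp
#include <cstdint>
#include <type_traits> // for std::is_same

// minimal copy of the trait, just enough for this sketch
template <typename T> struct just_larger_;
template <> struct just_larger_<char>          { using type = std::int16_t;  };
template <> struct just_larger_<std::int16_t>  { using type = std::int32_t;  };
template <> struct just_larger_<std::uint8_t>  { using type = std::uint16_t; };
template <> struct just_larger_<std::uint16_t> { using type = std::uint32_t; };
template <typename T> using just_larger = typename just_larger_<T>::type;

// Checked at compile time, with no dependence on what name() prints.
static_assert(std::is_same<just_larger<char>, std::int16_t>::value,
              "char should map to int16_t");
static_assert(std::is_same<just_larger<just_larger<char>>, std::int32_t>::value,
              "applying the trait twice should give int32_t");
```

If a mapping is wrong, compilation stops with the message given, which is easier to read than i or s.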

*
* *

We could specialize just_larger for just about anything, not just integral types. One evident generalization would be from float to double, but it can be anything that makes sense for your application. Also, maybe just_larger needs a companion template much_larger, that could ensure that large sums of a given type would not overflow.
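A sketch of both ideas follows. The float specialization is straightforward. For much_larger, I am assuming one possible reading, namely applying just_larger twice; other definitions (say, always going to the widest available type) would be just as defensible:

```cpp
#include <cstdint>

// minimal copy of the trait, just enough for this sketch
template <typename T> struct just_larger_;
template <> struct just_larger_<std::uint8_t>  { using type = std::uint16_t; };
template <> struct just_larger_<std::uint16_t> { using type = std::uint32_t; };
template <> struct just_larger_<std::uint32_t> { using type = std::uint64_t; };

// non-integral generalization: float goes to double
template <> struct just_larger_<float> { using type = double; };

template <typename T> using just_larger = typename just_larger_<T>::type;

// Hypothetical companion: two steps up instead of one.
// It fails to compile if either step has no specialization.
template <typename T>
 using much_larger=
  just_larger<just_larger<T>>;

static_assert(sizeof(much_larger<std::uint8_t>) == 4,
              "uint8_t -> uint16_t -> uint32_t");
```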