Why Is Validating Input So Hard in C?

Validating input from a file or the keyboard is probably the most difficult thing to get right in C. It is difficult in any programming language, and C really doesn’t do much to help you. There’s the standard library, mostly accessible through the two headers <stdlib.h> and <stdio.h>. However, the facilities it provides are rustic at best: they’re clunky, and they haven’t aged well.

For this post, I will limit I/O validation to grabbing input from text files, whether through a redirection, pipe, file, or console input. I may discuss binary or highly structured formats like XML in a later post, but let us first limit ourselves to a few simple cases.

The first family of routines is the ato* family: atoi, atol, etc. Their prototypes are of the form int atoi(const char *), with int replaced by long for atol, by long long for atoll, etc. The first problem is that, as we discussed in a series of previous posts, the sizes of int, long, etc., vary from platform to platform, so it is not clear how convenient these routines are for portable code.

But their greatest problem is that they do not detect errors at all. If a conversion fails, atoi returns 0, or not. If you feed it "toaster", as in atoi("toaster"), it returns zero and does not even set errno, so you cannot tell the failure apart from a legitimate "0". Worse, if you feed it a very large number, it doesn’t seem to return INT_MAX (or INT_MIN in the case of a negative number) but merely the converted value modulo 2^{32} (if your system has 32-bit integers). Even that is not guaranteed: the C standard leaves the behavior undefined when the result cannot be represented. Calling atoi("1242341243124124") returns 477919644, which is indeed 1242341243124124 mod 2^{32}, because int is 32 bits on the test platform.
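
To make the ambiguity concrete, here is a small sketch. The helper name atoi_demo is made up for this illustration; it only calls atoi on inputs whose results are well defined, and it deliberately avoids the overflow case, since that one is undefined behavior.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical demo helper: prints what atoi makes of a few inputs.
   None of these calls has any way of reporting a failure. */
void atoi_demo(void)
{
    const char *inputs[] = { "0", "toaster", "14toasters", "  -7" };

    for (size_t i = 0; i < sizeof inputs / sizeof inputs[0]; i++)
        printf("atoi(\"%s\") = %d\n", inputs[i], atoi(inputs[i]));
}
```

Both "0" and "toaster" come back as 0, and "14toasters" comes back as 14 with no hint that anything was left over.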

The next level of half-bakery comes with strtol. The prototype for strtol is:

long int strtol(const char *nptr, char **endptr, int base);

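This function can report some errors, but you still have a lot of work to do to figure out exactly what went wrong. Now, strtol does return LONG_MAX (or LONG_MIN) should overflow (underflow) occur. If so, errno is set to ERANGE and you can determine that a huge (or “huge” negative) number was entered. Since LONG_MAX and LONG_MIN are also perfectly legitimate results, you have to clear errno before the call and check it afterward to tell the two cases apart. If you try to convert from an unsupported base, errno is set to EINVAL, at least on POSIX systems; the C standard itself does not require it.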

But what if the input contains text? Well, now you have to rely on the second argument, endptr (a pointer to a pointer, actually), which receives a pointer to the first character that caused strtol to stop parsing the input. That is, if you call strtol on “14toasters”, it will succeed in converting 14 (returning the right value and leaving errno untouched, since strtol never resets errno to 0 by itself), but endptr will point to the first “t”. If no digits are found at all, it returns 0 and endptr points back at the start of the string. So you have to test the return value, errno (which may or may not be set when nothing was converted; that’s implementation-specific), and endptr, to check whether it points to white space, the end of the string, or something else.
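
Put together, the checks look something like the sketch below. The names check_strtol and parse_status are invented for this example, and the policy of accepting trailing white space but nothing else is an assumption; adjust it to whatever your input format allows.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical status codes for this sketch. */
enum parse_status {
    PARSE_OK,
    PARSE_NO_DIGITS,      /* nothing could be converted          */
    PARSE_OUT_OF_RANGE,   /* value does not fit in a long        */
    PARSE_TRAILING_JUNK   /* digits followed by unexpected text  */
};

/* Converts s to a long in base 10, running every test strtol leaves to us. */
enum parse_status check_strtol(const char *s, long *out)
{
    char *end;

    errno = 0;                          /* strtol never clears errno itself */
    long value = strtol(s, &end, 10);

    if (end == s)
        return PARSE_NO_DIGITS;         /* endptr did not move */
    if (errno == ERANGE)
        return PARSE_OUT_OF_RANGE;      /* value was clamped to LONG_MAX/LONG_MIN */

    while (isspace((unsigned char)*end))
        end++;                          /* assumption: trailing white space is fine */
    if (*end != '\0')
        return PARSE_TRAILING_JUNK;

    *out = value;
    return PARSE_OK;
}
```

With this, "14" and "14 \n" are accepted, "14toasters" is reported as trailing junk, and "toaster" as having no digits. That is three separate tests for a single conversion.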

The file-oriented input routines, such as scanf, and their string-based equivalents like sscanf, are a tiny bit better, maybe. First, they provide some pattern-matching capabilities (albeit quite limited). Second, they return the number of items successfully converted, or EOF should the end of input be reached before the first successful conversion. They are no stricter than strtol about trailing garbage, though: sscanf("14toasters","%d",&my_int) returns 1, stores 14 in my_int, and silently leaves “toasters” unread. (Mind the argument order: the input string comes first, then the format. With the two swapped, the call returns 0 and leaves my_int unchanged, which is easy to mistake for validation.) If an error occurs, you still have to deal with the remaining input in some way; yup, you’re on your own.
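
If you want sscanf to tell you whether anything follows the number, one option is the %n conversion, which stores how many characters have been consumed so far. The sketch below uses a made-up helper name, scan_int_strict, and assumes that surrounding white space is acceptable. It does nothing about out-of-range values, which scanf-style conversions do not report.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 and stores the value if s holds exactly one
   integer (optionally surrounded by white space), 0 otherwise. */
int scan_int_strict(const char *s, int *out)
{
    int value;
    int consumed = 0;

    /* "%d" converts the number, the space skips trailing white space,
       and "%n" records how far we got. %n does not count as a conversion. */
    if (sscanf(s, "%d %n", &value, &consumed) != 1)
        return 0;
    if (s[consumed] != '\0')
        return 0;                /* something is left over, e.g. "toasters" */

    *out = value;
    return 1;
}
```

Here scan_int_strict("14", &x) succeeds, while scan_int_strict("14toasters", &x) fails and leaves x alone.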

*
* *

So, what would it take to make a function like strtol behave correctly? First, I would change its prototype to:

error_t strtol( const char * s, 
                int base,
                long * result,
                const char ** endptr );

and make it always return an error code (not affecting errno, which is global for the current thread). I would make it stricter, so that the string “14toasters” isn’t recognized as 14 but as an invalid number, maybe by returning an error such as EINVDELIM (for invalid delimiter). That would also make it easier to test for errors: if there’s an error, the function returns something different from zero. Then you can investigate what went wrong, instead of making a couple of tests just to detect whether something went wrong.
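
As a rough sketch of that interface, here is a wrapper built on top of the existing strtol. Everything except the parameter order and the name EINVDELIM is an assumption of mine: the function is called strict_strtol to avoid clashing with the real one, the conv_error type and its other codes are invented, and white space is arbitrarily treated as the only valid delimiter.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical error codes; only the EINVDELIM idea comes from the text. */
typedef enum {
    CONV_OK = 0,
    CONV_EBASE,       /* unsupported base                    */
    CONV_ENODIGITS,   /* nothing to convert                  */
    CONV_ERANGE,      /* value does not fit in a long        */
    CONV_EINVDELIM    /* number followed by a bad delimiter  */
} conv_error;

conv_error strict_strtol(const char *s, int base, long *result,
                         const char **endptr)
{
    const int saved_errno = errno;   /* so the caller's errno survives */
    char *end = (char *)s;
    conv_error err = CONV_OK;
    long value = 0;

    if (base != 0 && (base < 2 || base > 36)) {
        err = CONV_EBASE;
    } else {
        errno = 0;
        value = strtol(s, &end, base);
        if (end == s)
            err = CONV_ENODIGITS;
        else if (errno == ERANGE)
            err = CONV_ERANGE;
        else if (*end != '\0' && !isspace((unsigned char)*end))
            err = CONV_EINVDELIM;    /* assumption: only white space delimits */
    }

    errno = saved_errno;
    if (endptr != NULL)
        *endptr = end;               /* where parsing stopped, for diagnostics */
    if (err == CONV_OK)
        *result = value;             /* result is only written on success */
    return err;
}
```

The caller then needs a single test, if (strict_strtol(s, 10, &n, NULL) != CONV_OK), and only looks at the specific code when something did go wrong.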

*
* *

Merely changing a function like strtol doesn’t do much to solve the general problem: the C standard library gives programmers very little help with validating input. If scanf is a bit better, it offers only a very restricted set of input format specifiers, and it doesn’t help you with unbounded input either. What do you think happens when you read a string with scanf? Well, if you set a field width, it will read up to that many characters, but then you have to pick the limit. Is 200 OK? What if you would have needed 205? With a limit set, scanf reads successfully up to the limit, and even if the input is truncated, it will “succeed”.
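
Here is a small sketch of that silent truncation, and of one way to notice it. The helper read_word7 is hypothetical, and the limit of 7 characters is chosen only to keep the example short. It again leans on %n to find out where the conversion stopped.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: reads one white-space-delimited word of at most
   7 characters from input. Returns 1 if a word was read, 0 otherwise.
   *truncated is set to 1 when the word was cut short by the width limit. */
int read_word7(const char *input, char word[8], int *truncated)
{
    int consumed = 0;

    if (sscanf(input, "%7s%n", word, &consumed) != 1)
        return 0;

    /* If the next unread character is not white space or the end of the
       string, the word was longer than the field width allowed. */
    unsigned char next = (unsigned char)input[consumed];
    *truncated = (next != '\0' && !isspace(next));
    return 1;
}
```

On "toasters!" this reads "toaster" and returns 1, exactly as scanf alone would; only the extra truncated flag tells you that part of the word is still sitting in the input.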

If you haven’t set a limit, it will sooner or later cause problems because the destination storage for the string will be overrun. The GNU implementation has a non-standard extension that will allocate the string for you (and you can also set a size limit), but this extension interferes with C99 format specifiers (its a modifier clashes with C99’s %a floating-point conversion) and, by its very nature of being an extension, is of limited interest.
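
Without relying on any extension, the usual workaround is to grow the buffer yourself. The sketch below is one way to do it with nothing but fgets and realloc; the function name read_line_alloc and the starting capacity of 64 are arbitrary choices for the illustration, and it assumes the input contains no embedded NUL characters.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads one line of any length from fp into a malloc'd
   buffer, with the newline stripped. Returns NULL at end of file when nothing
   was read, or on allocation failure. The caller frees the result. */
char *read_line_alloc(FILE *fp)
{
    size_t cap = 64, len = 0;
    char *buf = malloc(cap);

    if (buf == NULL)
        return NULL;

    while (fgets(buf + len, (int)(cap - len), fp) != NULL) {
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n') {
            buf[len - 1] = '\0';        /* complete line: drop the newline */
            return buf;
        }
        if (len + 1 == cap) {           /* buffer full: double it and go on */
            char *bigger = realloc(buf, cap * 2);
            if (bigger == NULL) {
                free(buf);
                return NULL;
            }
            buf = bigger;
            cap *= 2;
        }
    }

    if (len == 0) {                     /* end of file before any character */
        free(buf);
        return NULL;
    }
    return buf;                         /* last line had no trailing newline */
}
```

A 205-character line comes back whole, and you can then run whatever conversion and validation you like on the complete string. It is a couple of dozen lines for something the library arguably should have offered from the start.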

No, what we need is to rethink the conversion routines altogether; a complete tabula rasa. And while we’re at it, include complete and easy support for internationalization, another thing the C library isn’t very good at.
