UEID: Unique Enough IDs

Generating unique, unambiguous, IDs for data is something we do often, but we do not always know what level of uniqueness is really needed. In some cases, we want to be really sure that two instances of the same ID identify two copies of the same object or data. In other cases, we only want to be reasonably sure. In other cases, yet, we just assume that collisions—two different objects yielding the same ID—are very unlikely, and, if the need be, we can proceed to further testing to establish equality.

There are many ways of generating IDs, each with varying levels of confidence on uniqueness and differing applications.

To name a few:

  • Custom Table-Lookup Hash Function. We do it every time we roll up yet another implementation of a hash lookup table, and yet, it’s very rarely well done. Too simple hash functions simply fail at randomizing bits enough to thin out collisions. However, if it is very important that we choose a good hash function, we can avoid the extreme of expensive cryptographic hash functions. Mixing a classical hash function with a (fixed) table of random numbers can lead to good results yet with very little code/computation. One example of which is the Zobrist Hash Function. This function, from some of my previous work (in French, though):
     inline unsigned
      hash(const unsigned char *buffer,
           size_t size)
      {
       static const unsigned EXPAND[256]=
       #include "hash-random-table";
    
       unsigned h=0;
    
       for (int i=0;i<size; i++)
        h=(h ^ (h >> 1) )
           +EXPAND[ buffer[i] ]; 
    
       return h;
      }
     

    does also very well for lookup tables applications. It is not affected much by the length of the data and, depending on the contents of the random table, spreads bits quite evenly across the unsigned range (whatever unsigned may be in your implementation).

    However, in hash tables, collisions can be dealt with, so it is not that important to have truly unique IDs, it suffice to have IDs that have a low probability of being generated for two different pieces of data.

  • MD5. The MD5 message digest algorithm produces a 128 bits (or 16 bytes) value from your data, pretty much like a hash table lookup hash function but in a way that makes it really difficult to find the original data (or a plausible substitution) from the 128 hashed bits. This algorithm ensure a very, very low probability of collision, although, theoretically, collisions can still occur. However, the MD5 hashing algorithm is known to be vulnerable to forgery, that is, it possible to forge messages with a given MD5 hash relatively easily. When I mean relatively, I mean by using a very sophisticated algorithm exploiting a weakness in MD5, and hours of CPU time (see here for details). In normal situations, however, MD5 is perfectly safe to use for caches and other lookups where you cannot afford collisions.
  • The SHA Family. Contrary to MD5, the SHA hash functions family has no known weaknesses that can lead to forgery. SHA-256, for example, yields a 256 bits signature. This greatly reduces the probability of a collision, but now your data has a 256 bits (or 32 bytes) signature attached to it. It may represent too much data to store, depending on your application.
  • RFC-4122 UUID. A UUID, or universally unique identifier is a very long (32 bytes, or 256 bits) series of bits that are most probably unique. RFC-4122 describes the flavors of UUIDs. However, unlike the previous methods, based on hashing, UUIDs do not depend on your data. Rather, they are either generated pseudo-randomly (as for Version 4), or based on your computer supposedly unique features like your primary Ethernet card’s MAC address. The Linux kernel offers the procfs pseudo-file /proc/sys/kernel/random/uuid that yields (RFC 4122, Version 4) UUIDs that look like e56b51fd-4d7c-4d81-bbce-2579d2020861.
  • UEID. The Unique Enough ID is any human-readable (most likely) unique signature you can build from your data. Maybe you cannot hash a whole record to ID it, because some part of it changes. Maybe it makes more sense to have the ID human-readable, for whatever reason. For example, this could correspond to a log file where some fields remain the same and some change depending on whatever happened when the log line was generated. You could hash on the fixed fields and bundle all log lines with the same hash together as pertaining to the same series of events. Indeed, instead of having an ID such as 182f0ed1cb934660, it may be preferable to have the easily readable “310.260.613.0-http-get-www.domain.com-file.gif” as an UEID to track file transfer from an Apache log 1,2.

The hash functions (custom, MD5, SHA) can be used whenever the main goal is to cache data/files with a very low rate of collision, or where collisions are assumed so unlikely as to be virtually impossible. Using MD5, for example, in a setting where the signature cannot be forged and submitted to your application, is perfectly safe. In fact, even if MD5 is subject to attack, there are, to my knowledge, no known “real world” collisions. They just never happened. Yet. If you feel a bit paranoid, you can go for SHA-256, but you double the signature size for very little real gain—assuming, of course, a non-cryptographic setting.

The UUID algorithms described by RFC-4122 are more suited for objects that will persist and be exchanged between different systems yet must be uniquely identified. For example, it may be convenient for you to identify your computer by a UUID in a network rather than by an IP address that can change when roaming. It could also be a licence number issued once. You can figure out applications where a UUID would be useful, I am sure.

The UEID can be used whenever there is uncertainty about the original data, or when you can tolerate a higher probability of error error.

There are other uses for UUID, such as include guards in languages such as C and C++. Although not as elegant as carefully chosen symbol names, UUIDs prevent include guard collisions (as would the quite standard #pragma once) a very direct, simple way.


UUID are accessible in many languages. Java offers java.util.uuid, Python uuid, PHP uniqid, and libuuid offers UUID functionalities to C and C++. libuuid is part of (most? all?) Linux distributions. On Windows, one would use the UuidCreate function found in Rpc.h


Adding the following to your .emacs file

(defun insert-uuid
       ()
       (interactive)
       (shell-command "cat /proc/sys/kernel/random/uuid" t)
       )
(defun insert-UUID 

       ()
       (interactive)
       (shell-command "tr a-z\- A-Z\_ < /proc/sys/kernel/random/uuid" t)
      )

will allow you to invoke M-x insert-uuid and M-x insert-UUID to insert RFC 4122 Version 4 UUIDs in RFC format or capitalized, with underscores rather than dashes, respectively. Sorry for the VI(M) users, I do not know the VI(M) equivalent.


1 An “IP V4.5″ address seen in CSI:Miami episode 13, season 4, Silencer

2 As far as I know, Apache is connectionless, so it is not easy to track this kind of information from the logs: one has to reassemble the many gets to reconstruct the transfer details and historic.

2 Responses to UEID: Unique Enough IDs

  1. [...] incomplete coverage for messages that are too short. I also presented an ad hoc hash function in a previous post that is reasonably good while amenable to efficient [...]

  2. [...] a number of different occasions, I briefly discussed Hash Functions, saying that if a hash function needn’t be very strong if the intend is to use [...]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 65 other followers

%d bloggers like this: