Some time ago, a friend was trying to find an efficient way (storage and time complexity) to find collisions for a (secure) hash function. Hashing random keys until you get a collision is reminiscent of von Mises’ birthday paradox.

In its simplest form, the birthday paradox states that, amongst (randomly) gathered people, the probability that (at least) two share a birthday increases counter-intuitively fast with the number of gathered people. This means that even if your hash function is random (i.e., strong), you may not have to try a very large number of random keys before finding collisions. Or does it?

Let us forget about people. Let us consider hash functions. Suppose we have $N$, the number of possible values the hash will take. Typically, we have $N$ a large number, say $2^{128}$. Let $n$ be the number of keys to hash. The probability of not having a collision after $n$ hashes is:

$$\prod_{i=0}^{n-1} \frac{N-i}{N} = \frac{N}{N}\cdot\frac{N-1}{N}\cdots\frac{N-(n-1)}{N}$$

(Because at first, you have zero items drawn so far, and there are $N$ hashes still free, thus the probability is $\frac{N}{N}$ of not having a collision. Then you draw one, and have $\frac{N-1}{N}$ chances of not having a collision, since $N-1$ are still available, and so on until you draw the $n$th hash, and you have $n-1$ keys already drawn from a possible set of $N$, giving you a chance of $\frac{N-(n-1)}{N}$ of not having a collision.) And therefore, the probability $p$ of having a collision is simply:

$$p = 1 - \prod_{i=0}^{n-1} \frac{N-i}{N}.$$

Whenever $n$ and $N$ are largish, we can safely approximate using:

$$p \approx 1 - e^{-\frac{n^2}{2N}}.$$
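To get a feel for how good the approximation is, here is a minimal Python sketch that compares the exact product with the exponential form. The function names are made up for this illustration, and the exact version simply loops over the product, so it is only practical for moderate $n$.

```python
import math

def collision_probability_exact(n: int, N: int) -> float:
    """Probability of at least one collision among n uniform draws from N values."""
    if n > N:
        return 1.0  # pigeonhole: more keys than possible hash values
    p_no_collision = 1.0
    for i in range(n):
        # the (i+1)-th draw must avoid the i values already taken
        p_no_collision *= (N - i) / N
    return 1.0 - p_no_collision

def collision_probability_approx(n: int, N: int) -> float:
    """Approximation 1 - exp(-n^2 / (2N)); expm1 keeps precision for tiny probabilities."""
    return -math.expm1(-(n * n) / (2.0 * N))

if __name__ == "__main__":
    for n, N in [(23, 365), (77163, 2**32)]:
        print(n, N,
              collision_probability_exact(n, N),
              collision_probability_approx(n, N))
```

For the classical 23 people and 365 days, the exact value is about 0.507 and the approximation gives about 0.516; the gap shrinks as $N$ grows.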

So, knowing $N$ (for example $N = 2^{128}$), how large does $n$ have to be to have a collision? OK, let us say only to have a probability $p$ of collision?

Well, we just solve

$$p = 1 - e^{-\frac{n^2}{2N}}$$

for $n$.

After a bit of pain, we find

$$n = \sqrt{2N \ln\frac{1}{1-p}}.$$
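The “bit of pain” is only a few lines of algebra. A sketch of the steps, assuming the exponential approximation is accurate (that is, $n$ much smaller than $N$):

```latex
\begin{align*}
p &= 1 - e^{-\frac{n^2}{2N}} \\
1 - p &= e^{-\frac{n^2}{2N}} \\
\ln(1 - p) &= -\frac{n^2}{2N} \\
n^2 &= -2N \ln(1 - p) = 2N \ln\frac{1}{1-p} \\
n &= \sqrt{2N \ln\frac{1}{1-p}}
\end{align*}
```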

* * *

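OK, let’s say now that we want a 50% chance of finding a collision, how large should $n$ be? Well, now we have $N = 2^{128}$, $p = \frac{1}{2}$, and we solve for $n$, finding that

$$n = \sqrt{2 \cdot 2^{128} \ln 2}$$

and finally

$$n \approx 2.17 \times 10^{19}.$$

We observe that $n \approx 1.18\sqrt{N}$ (for $p = \frac{1}{2}$, anyways).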

* * *

Therefore, before reasonably expecting a collision with probability $p = \frac{1}{2}$ using $N = 2^{128}$, we would have to hash about $2^{64}$ keys, which… may take a while.
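To put a number on “a while”, here is a small Python sketch that evaluates $\sqrt{2N \ln\frac{1}{1-p}}$ for a given hash width. The function name and the rate of one billion hashes per second are assumptions made for illustration, not measurements.

```python
import math

def keys_for_collision(bits: int, p: float = 0.5) -> float:
    """Approximate number of random keys needed to reach collision probability p
    when hashes are uniform over 2**bits values."""
    N = 2.0 ** bits
    return math.sqrt(2.0 * N * math.log(1.0 / (1.0 - p)))

# Hypothetical hashing rate, chosen only to give an order of magnitude.
HASHES_PER_SECOND = 1e9
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def years_to_collision(bits: int, p: float = 0.5) -> float:
    """Time needed at the assumed rate, ignoring storage and lookup costs."""
    return keys_for_collision(bits, p) / HASHES_PER_SECOND / SECONDS_PER_YEAR

if __name__ == "__main__":
    for bits in (32, 64, 128):
        n = keys_for_collision(bits)
        print(f"{bits:3d} bits: n ~ {n:.4g} (2^{math.log2(n):.2f}), "
              f"{years_to_collision(bits):.3g} years at the assumed rate")
```

At the assumed rate, the 128-bit case comes out at several hundred years, and that ignores the cost of remembering every hash seen so far.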

This entry was posted on Friday, March 30th, 2012 at 22:31 and is filed under algorithms, Mathematics.

Interesting.

This assumes, of course, that hash values are truly random over 128-bit. That’s a strong requirement.

A slightly more realistic requirement would be to assume 32-bit hash values (as in, for example, Java). Your formula then gives me 77163. Not quite as good, right?

But what about truly random hashing?

I would argue that there is no such thing in practice. For example, the best we ever do in actual software over strings is to have pairwise independence (i.e., strong universality). For a mathematical justification, see…

Daniel Lemire, The universality of iterated hashing over variable-length strings, Discrete Applied Mathematics 160 (4-5), 2012.

http://arxiv.org/abs/1008.1715

and

Daniel Lemire and Owen Kaser, Recursive n-gram hashing is pairwise independent, at best, Computer Speech & Language 24 (4), pages 698-710, 2010.

http://arxiv.org/abs/0705.4676

Formally speaking, some people get 4-wise and 5-wise independence using tabulation, but I would argue that just getting “random hashing” (at all!) in software is quite something. I haven’t seen tabulation hashing ever used…

The Ruby and Perl languages do have random hashing… but that is about it.

Most software uses deterministic hashing and, in that case, the probability of having a collision is either 1 or 0.
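As an illustration of a deterministic 32-bit hash where a collision is certain rather than probable, here is a Python re-implementation of the polynomial used by Java’s `String.hashCode` (multiply by 31, add the next character). This is a sketch for ASCII strings only; Java works on UTF-16 code units, which this version does not try to reproduce for characters outside the Basic Multilingual Plane.

```python
def java_string_hash(s: str) -> int:
    """Sketch of Java's String.hashCode for ASCII input: h = 31*h + c, wrapped to 32 bits."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep the low 32 bits
    # reinterpret as a signed 32-bit integer, as Java's int would be
    return h - (1 << 32) if h >= (1 << 31) else h

if __name__ == "__main__":
    for key in ("Aa", "BB", "AaAa", "BBBB"):
        print(key, java_string_hash(key))
```

The strings "Aa" and "BB" always hash to the same value, and so does any pair of strings built by concatenating those two blocks, so the probability of a collision between them is exactly 1.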

Final plug: Strongly universal hashing over strings is quite fast… see http://arxiv.org/abs/1202.4961 But it is inconvenient because… well, you need to read my paper to see the downside. ;-)

That’s mostly the point: to get good hashing, you need more bits, even if it’s strong hashing. But that’s intuitive in some sense: you wouldn’t be surprised to need very few trials to get collisions on, say, 4 bits, even though you throw a perfectly random die to get your bits. If you have a very small number of bits, collisions are easy to understand. It’s when the number of bits grows that the result becomes counter-intuitive. You kind of expect the number of trials needed to get a collision to be somewhat linear in $N$ and instead it’s on the order of $\sqrt{N}$. That’s the core of von Mises’ argument.
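The square-root growth is easy to check empirically for small bit widths. The sketch below draws uniformly random values with Python’s `random` module (a stand-in for an idealized hash, not a real hash function) and averages the number of draws until the first repeat. For uniform draws, that average is known to be close to $\sqrt{\pi N / 2}$, about $1.25\sqrt{N}$. The function names and the seed are arbitrary choices for this illustration.

```python
import math
import random

def draws_until_collision(bits: int, rng: random.Random) -> int:
    """Draw uniform values of the given width until one repeats; return the draw count."""
    seen = set()
    while True:
        value = rng.getrandbits(bits)
        if value in seen:
            return len(seen) + 1  # the repeated draw counts too
        seen.add(value)

def mean_draws(bits: int, trials: int, seed: int = 12345) -> float:
    """Average number of draws to the first collision over several trials."""
    rng = random.Random(seed)
    return sum(draws_until_collision(bits, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    for bits in (4, 8, 12, 16):
        N = 2 ** bits
        print(f"{bits:2d} bits: measured {mean_draws(bits, 2000):8.1f}, "
              f"sqrt(pi*N/2) = {math.sqrt(math.pi * N / 2):8.1f}")
```

Each extra 4 bits multiplies $N$ by 16 but the number of draws only by about 4.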

(and yes, that’s quite a bit of shameless plugs.)

Indeed, it is an interesting result.

My own point (and sorry for the plugs) is that actual results using actual random hash functions may differ substantially from your estimate because hash values are not really random.

Could be better, could be worse.

It is also maybe interesting to note that most hash values aren’t random at all! Cryptographic hash functions are certainly not random… they are very much deterministic.

I have formalized my thoughts on this issue as a blog post:

Hashing and the Birthday paradox: a cautionary tale

http://lemire.me/blog/archives/2013/06/17/hashing-and-the-birthday-paradox-cautionary-tale/

Thanks for the nice idea.

[…] to the Birthday paradox. How bad is it? Assuming that hashing is perfectly random, Steven Pigeon worked out some of the mathematics. To have a significant (50%) risk to find a collision when using 128-bit […]