Some time ago, a friend was trying to find an efficient way (storage and time complexity) to find collisions for a (secure) hash function. Hashing random keys until you get a collision is reminiscent of von Mises’ birthday paradox.
In its simplest form, the birthday paradox states that, amongst (randomly) gathered people, the probability that (at least) two share a birthday increases counter-intuitively fast with the number of gathered people. This means that even if your hash function is random (i.e., strong), you may not have to try a very large number of random keys before finding collisions. Or does it?
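To see just how fast it grows, here is a small Python check of the classic formulation (assuming 365 equally likely birthdays; the function name is mine): with only 23 people, the probability of a shared birthday already exceeds one half.

```python
# Probability that at least two people out of `people` share a birthday,
# assuming 365 equally likely birthdays.
def shared_birthday_probability(people: int, days: int = 365) -> float:
    p_no_collision = 1.0
    for i in range(people):
        p_no_collision *= (days - i) / days
    return 1.0 - p_no_collision

for people in (10, 23, 50, 70):
    print(people, round(shared_birthday_probability(people), 4))
# 10 -> 0.1169, 23 -> 0.5073, 50 -> 0.9704, 70 -> 0.9992
```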
Let us forget about people. Let us consider hash functions. Suppose we have $m$, the number of possible values the hash will take. Typically, $m$ is a large number, say $m=2^{64}$. Let $n$ be the number of keys to hash. The probability of not having a collision after $n$ hashes is:

$$\bar{p}(n,m)=\prod_{i=0}^{n-1}\frac{m-i}{m}=\frac{m!}{m^n\,(m-n)!}$$
(Because at first, you have zero items drawn so far, and all $m$ hashes are still free, thus the probability of not having a collision is $\frac{m}{m}=1$. Then you draw one, and have $\frac{m-1}{m}$ chances of not having a collision, since $m-1$ values are still available, and so on until you draw the $n$th hash, when you have $n-1$ keys already drawn from a possible set of $m$, giving you a chance of $\frac{m-(n-1)}{m}$ of not having a collision.) And therefore, the probability of having a collision is simply:

$$p(n,m)=1-\bar{p}(n,m)=1-\frac{m!}{m^n\,(m-n)!}$$
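For hash-sized $m$, the factorials above are far too large to evaluate directly; below is a sketch of one way around that (the log-gamma trick and the example values are mine, not part of the original argument):

```python
import math

# Exact collision probability p(n, m) = 1 - m! / (m**n * (m - n)!),
# evaluated through log-gamma so the factorials never overflow.
def collision_probability(n: int, m: int) -> float:
    log_no_collision = (
        math.lgamma(m + 1) - math.lgamma(m - n + 1) - n * math.log(m)
    )
    return 1.0 - math.exp(log_no_collision)

print(collision_probability(23, 365))        # classic birthdays: ~0.507
print(collision_probability(10_000, 2**32))  # 10,000 keys, 32-bit hash: ~0.012
```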
Whenever $m$ and $n$ are largish, we can safely approximate $p(n,m)$ using

$$p(n,m)\approx 1-e^{-\frac{n(n-1)}{2m}},$$

which follows from $1-x\approx e^{-x}$ for small $x$, since $\bar{p}(n,m)=\prod_{i=0}^{n-1}\left(1-\frac{i}{m}\right)\approx e^{-\frac{1}{m}\sum_{i=0}^{n-1}i}=e^{-\frac{n(n-1)}{2m}}$.
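A quick numerical check, with an arbitrarily chosen 32-bit-sized $m$, shows how close the approximation gets to the exact product:

```python
import math

M = 2**32  # an example table size; any largish m behaves similarly

def p_collision_exact(n: int, m: int) -> float:
    p_no = 1.0
    for i in range(n):
        p_no *= (m - i) / m
    return 1.0 - p_no

def p_collision_approx(n: int, m: int) -> float:
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * m))

for n in (1_000, 10_000, 65_536):
    print(n, p_collision_exact(n, M), p_collision_approx(n, M))
# the two values agree to several significant digits for each n
```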
So, knowing $m$ (for example, $m=2^{64}$), how large does $n$ have to be to have a collision? OK, let us say, to have only a probability $p$ of collision?
Well, we just solve

$$p=1-e^{-\frac{n(n-1)}{2m}}$$

for $n$.
After a bit of pain, we find

$$n\approx\frac{1}{2}+\sqrt{\frac{1}{4}+2m\ln\frac{1}{1-p}}.$$
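As a sketch, the inverted formula can be evaluated directly (the function name and the sample probabilities below are illustrative assumptions, with the running example $m=2^{64}$):

```python
import math

# Number of keys to hash before the collision probability reaches p,
# for a hash with m possible values:
#   n ~ 1/2 + sqrt(1/4 + 2 m ln(1 / (1 - p)))
def keys_needed(p: float, m: int) -> float:
    return 0.5 + math.sqrt(0.25 + 2.0 * m * math.log(1.0 / (1.0 - p)))

m = 2**64
for p in (0.01, 0.10, 0.50):
    print(f"p = {p:.2f}: about {keys_needed(p, m):.3e} keys")
```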
OK, let’s say now that we want a 50% chance of finding a collision, how large should $n$ be? Well, now we have $p=\frac{1}{2}$, $m=2^{64}$, and we solve for $n$, finding that

$$n\approx\frac{1}{2}+\sqrt{\frac{1}{4}+2m\ln 2}\approx\sqrt{2\ln 2}\,\sqrt{m}\approx 1.1774\sqrt{m}.$$
We observe that $n\approx\sqrt{m}$ (for $p=\frac{1}{2}$, anyways).
Therefore, before reasonably expecting a collision with probability $\frac{1}{2}$ using $m=2^{64}$, we would have to hash about $\sqrt{m}=2^{32}$ keys, which… may take a while.
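As a final sanity check, a short sketch (re-stating the same keys_needed formula as above; the hash sizes are illustrative) confirms that at $p=\frac{1}{2}$ the ratio $n/\sqrt{m}$ stays pinned near $\sqrt{2\ln 2}\approx 1.1774$, whatever the size of the hash:

```python
import math

def keys_needed(p: float, m: int) -> float:
    return 0.5 + math.sqrt(0.25 + 2.0 * m * math.log(1.0 / (1.0 - p)))

# At p = 1/2, the number of keys needed is always about 1.1774 * sqrt(m).
for bits in (32, 64, 128):
    m = 2**bits
    n = keys_needed(0.5, m)
    print(f"{bits}-bit hash: n ~ {n:.3e}, n / sqrt(m) = {n / math.sqrt(m):.4f}")
```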