17 min read

Share this article:

Here’s something that may well have happened to you, but if it hasn’t, it’s easy enough to imagine, and likely enough to be realistic.

You’re on the road, and you’ve just walked into a meeting room at a customer site, where three or four people are waiting for you, including someone called Emily you haven’t met before.

Before you get started, one of them smiles at you and says, “This won’t take a moment,” before ushering in a colleague carrying a cake and some party balloons and announcing cheerily that it’s Emily’s birthday.

“That’s funny,” you say, “because it’s my birthday, too,” at which point everyone laughs and wishes the pair of you *Happy Birthday*.

At this point, someone inevitably says, “What an awesome coincidence! I wonder what the chance was of that happening?”

Intriguingly the nature of that question depends on whether it’s you or Emily asking it, or someone else in the meeting.

(To keep things simple, we’ll ignore the detail of the birthday specifically being today, rather than any other day of the year, and focus just on the issue of matching birthdays.)

After all, you and Emily will almost certainly be wondering, “What’s the chance of another person in the very same meeting having my birthday?”

But the others will be thinking of something slightly different, namely, “What’s the chance of two people in this room having the same birthday?”

They’re very similar types of question, but the answers are weirdly and dramatically different.

In fact, many people end up with their intuition suggesting the answer to the more restrictive question (the chance of **a birthday that is exactly the same as your own**), even if they’re working out the answer to the more general question (the chance of **any two people in a group sharing a birthday**).

Indeed, the answers are so different that many people struggle to believe them, even after they’ve been told what they are.

That’s why it’s become known as the *Birthday Paradox*, because the answers seem so counterintuitive.

It’s a fascinating puzzle, so let’s take it on, avoiding any heavy mathematics and just relying on plus, minus, times and divide!

We’ll start by thinking about some of the terminology and rules we might need.

*Probabilities*, or chances to use a less formal word, are numbers between 0 and 1 that denote the likelihood of something coming true or being correct, where 0 means impossible, 1 means certain, and 1/2 or 0.5 is the mid-point, meaning that either outcome is equally likely, as for example when you toss a coin.

And because things can only either *happen* or *not happen* (except perhaps in those weird quantum computers that we wrote about recently), the two numbers in phrases like *it’s a fifty-fifty thing* or *I rate their chances at 60-40* always add up to 100.

That’s because the sum of all the possible probabilities for any event must end up at 100%, or 1, given that they’re the only possible outcomes, and exactly one of them will turn out to be true.

Those two tiny zeros in a percentage sign are there because it was originally a medieval *ligature*, or written abbreviation, concocted to save time and space. `50%`

is just shorthand for `50/100`

, or `0.5`

as we usually write it today. Writing 100% isn’t merely ‘a bit like saying 1’, but an *exact numeric equivalent* for it.

For example, you’ve got chance of 1/6, or about 16.67%, of rolling a 3 with a single die, which we’ll denote by using `P(...)`

, with `P`

serving as an abbreviation for the *probability* of whatever is between the round brackets, and write: `P(roll 3) = 1/6`

.

The chance of rolling any of the other five numbers is 5/6, or about 83.33%.

If we reduce the outcomes for any event to two mutually exclusive options in this way, which we’ll denote as `X`

and `not X`

respectively, and if we keep in mind that the chances of all possible outcomes must add up to 1, or 100%, we get this handy result:

It’s often easier to work out the chance that something will *not happen at all* than to juggle all the different ways that it could happen, so the rule that `P(X) = 1 − P(not X)`

can be incredibly useful in working out probabilities.

When it comes to figuring out shared birthdays in any group of `N`

people, we can start by thinking about the chances at each end of the problem: when there’s only one person, or, for that matter, no one; and when there are as many people as there are days in a year.

(Whether there are 365 or 366 days in a year doesn’t make much difference to the calculations, so let’s ignore leap years and conveniently assume there are 365 days in every year.)

When `N = 0`

or `N = 1`

, there aren’t two people to share a birthday, so a matched birthday is impossible, so that `P(0) = 0`

and `P(1) = 0`

.

(We’re using `P(N)`

as shorthand for the probability of finding a matched birthday with `N`

people in the group.)

When `N = 365`

, it’s conceivable that every person will have a different birthday, though it feels highly unlikely, but even if they do, when one more person arrives to make `N = 366`

, there simply aren’t enough birthdays in the year for everyone to have a different one.

At least one shared birthday is inevitable, so that `P(366) = 1`

, and by the same argument, `P(367) = 1`

, and so on for any number of additional people.

The probability graph for a shared birthday in a group of N people (we’ll go from 0 to 400 to keep the numbers round) has therefore started out with the blue marks shown here:

What about the empty space between `N = 1`

and `N = 366`

?

We might assume that the probabilities would change steadily, corresponding to a straight line on the graph, with the half-way point in probabilities sitting half-way through the number of days in a year (182.5), which would make 183 people the ‘break-even’ point where the chance of a repeated birthday goes above 50%:

Or we might imagine a more graceful curve with no kinks or corners, such as the sinusoidal shape of a sine or cosine graph, like the sound wave from a pure note such as a tuning fork:

Fitting a sinusoidal curve between the 0% and 100% ends of the graph gives a result that at least looks pleasing to the eye, overlaid here in pink:

As we shall find, the real graph (technically, because there can only ever be a whole number of people in the room, the graph forms a staircase rather than a curve) is interestingly shaped, but doesn’t form a regular or easily predicted pattern as shown above..

Let’s return to the specific version of the Birthday Question, namely, “Do any of the other people in the room right now have the same birthday as me?”

For this, we’ll start by finding the chance that we match exactly zero other birthdays, which is simpler and easier than working out the chance that we do match the first person, combined with the chance that although we didn’t match the first, we do match the second, and so on.

The chance we don’t match the first person’s birthday is the probability that theirs occurs on one of the other 364 days of the year, so `P(no match once) = 364/365`

.

For the next person, we need the same chance applied again to the first chance, so that `P(no match twice) = 364/365 × 364/365`

, and so on.

If there are `N`

people in the room, then there are `N−1`

people excluding ourselves, and therefore the probability we want is 364/365 multiplied by itself `N−1`

times, or `P(no match at all) = (364/365)`

.^{N−1}

If that’s the chance we match no one else’s birthday, then the chance we do match someone is clearly `1 − (364/365)`

, using the formula ^{N−1}`P(X) = 1 − P(not X)`

.

This gives us a logarithmic probability curve that strictly speaking never reaches 100%, though *in the limit*, as the jargon puts it, the value gets so close to 100% that it is considered mathematically indistinguishable:

As the graph shows, the break-even point, where the probability of a match goes above 50% for the first time, comes when there are **253 people in the group**.

That might seem surprisingly high, at least if your initial guess was that the half-way point would come at 183 people, given that 183 is just over half the number of days in a year.

Now let’s switch to the general version of the Birthday Question, where we aren’t dealing with whether someone in the group matches our birthday in particular, but whether there is a shared birthday anywhere in the group.

If the answer of 253 people in the previous section seemed slightly counter-intuitive, just wait and see what happens next!

In this calculation, we have to go through all possible pairings of people in the whole group, not just every pairing of ourselves with each of the other `N−1`

people.

In other words, once we’ve checked our own birthday against every one else’s, we need to take those `N−1`

other people, and treat them as a sub-group for possible matches.

If we then pick out one person from that group to match against, we need to check their birthday against the other `N−2`

in their group.

Then we can take a third person from that sub-group of `N−2`

people, and so on until we are left with a group of just two people.

We check for a match between that final pair, and then we’re done.

The total number of pairs we need to check is therefore `(N−1) + (N−2) + (N−3) + ⋯ + 3 + 2 + 1`

, a number that increases in proportion to the *square* of the number of people present, not merely linearly:

So, when we take into account all possible birthday pairs in a group of `N`

people, not just of the `N−1`

people who might pair up specifically with us, we can expect the chance of a match to rise quickly and steeply as more and more people join the group.

In the previous formula, we were multiplying our cumulative chance of no matches by a constant value of 364/365 each time, giving us the consistent light blue logarithmic curve above.

The chance of no matches was continually decreasing, but at an ever-decreasing rate.

But when we work out our new progression, where all pairs of people in the entire group need to have no matches, we get a different formula.

For the first person into the room, any of the 365 possible birthdays will do because there’s no one to worry about matching with.

The probability of no match is therefore 365/365 = 100%, which denotes certainty.

The second person needs to have a birthday from the remaining 364 unused birthdays in order not to match; the third person has 363 days left that won’t match; then 362 days, and so on.

Of course, when the 366th person shows up, there are no unused birthdays left to keep the matchless streak going, so the chance then gets stuck at zero forever, as we worked out in the very first graph above.

This time, the formula involves starting with the first probability of 365/365, then multiplying by 364/365, then by 363/365, then by 362/365, and so on for as many people as there are in the group.

Once again, we have a continually decreasing cumulative sequence, but the number by which we are multiplying by is decreasing every time as well, so that the rate of decrease begins to increase (you may want to read this sentence twice):

For the first few entries in the list, the probabilities seem to be changing slowly, but those ever-decreasing multipliers on top of the division signs soon start to have a dramatic effect on the cumulative total, as we can see if we look at the whole sequence.

Note that in this graph we are plotting `1 − P(no match)`

in order to obtain `P(match)`

, so the probabilities go upwards to 100% rather than downwards to 0%:

With 12 people in the room, the chance of a shared birthday has reached 16%; with 18 people we have a better than 33% chance of a double birthday, and **at just 23 people we pass the break-even point** at which it’s more likely that there will be a shared birthday than not.

Perhaps even more amazingly, a room with just 50 people in it has a 97% chance of a shared birthday; at 57 people we’re over 99% ; and with 97 people, the chance that no two people in the room share a birthday drops below one in a million.

Don’t feel bad if your initial inclination was to guess that the break-even point would come when there were half as many people in the room as there are days in the year, or 183 people.

That’s a common mistake, motivated for the very reasons we suggested at the start.

In fact, with 183 people in the room, the probability of a shared birthday is so close to certainty that if you try to calculate it using regular double-precision floating point arithmetic (what’s known as *IEE 754 binary64 format*, accurate to at least 15 decimal digits), there aren’t enough decimal places to tell that the answer isn’t exactly 100%.

And that, admittedly in quite some detail, is the problem often referred to as the Birthday Paradox.

Strictly speaking, it’s not a paradox because it doesn’t actually contradict itself.

But it gets that name anyway, for two reasons:

**Most people’s first guess**is a lot higher than 23.**Many people find the final answer hard to accept,**even after it’s explained.

Interestingly, as long as everyone declares their birthday privately and doesn’t collude with anyone else in the room to alter the odds deliberately, it doesn’t matter whether they make up a fake birthday or not, because there are only 365 different dates to choose from anyway.

The odds are more complex and slightly different when you take leap years into account, but the break-even point is still just 23 people.

Before we finish, we need to tie the Birthday Paradox back to cybersecurity.

Above, we were working with collections with up to 365 different items to choose from, namely the number of days in a regular year.

But the Birthday Paradox often involves very much larger numbers, for example in cybersecurity code that uses cryptographic hashes as “fingerprints” to verify files or passwords without needing to store the original files in their entirety, or in data processing code that uses non-cryptographic hashes as fingerprints to speed up finding and matching data items in a collection.

Years ago, for example, when downloads weren’t as reliable as they are now, a short-and-simple digital fingerprint known as CRC-32 was widely used as a cross-check for detecting damaged files.

If the file you just downloaded didn’t come up with the same CRC-32 checksum that the publisher listed, you knew that your modem had encountered an error during the process and that your download was corrupted.

As the name suggests, a CRC-32 is just a 32-bit number, so it can’t uniquely tag every possible file.

There are only 2^{32} (about 4 billion) different numbers you can represent with 32 bits, so if you have more than 4 billion files in your collection, it’s not merely unlikely but actually impossible to avoid what’s known as a *collision*, or two files with the same fingerprint.

But what if you used a CRC-32 hash to check your modest stash of already-downloaded files to see if you already had a copy of the new file and therefore didn’t need to download it again?

(Downloads used to be both slow and expensive, especially if you were doing them via a long-distance dialup connection.)

How would you manage the risk that you’d match the wrong file because one of the samples in your collection just happened to collide with the CRC-32 of the new download?

Well, we already know the answer to that, or at least the shape of the answer.

The risk of a mismatch increases with the number of files already in your collection, and the probability follows the logarithmic curve we saw above for birthdays, though with a different range on the X-axis.

But if your local file collection contains a few thousand files, rather than billions, the risk of a collision stays very modest, as we can see if we zoom into the bottom left corner of the previous graph:

If you have thousands or tens of thousands of files, you might decide that the graph above reflects a manageable risk.

From there, it’s an easy step to assume that it’s similarly safe to use your database of CRC-32 checksums for related ‘file recognition’ purposes, for example as a quick way of finding and eliminating duplicate files from your collection.

Simply sort the list of CRC-32 fingerprints into numerical order, and then run through it looking for duplicated fingerprints, and remove the repeated files.

Easy!

But we know from the Birthday Paradox that the probabilities that apply when we’re checking *for a specific file to match* just don’t hold when we are looking *for any two files that match*.

The probability that two files already in our collection will produce a hash collision, thus seeming the same when they aren’t and inviting us to conclude incorrectly that we can delete one of them, follows the paradoxically unintuitive behaviour we saw before:

Note that with the X-axis values running from 0 to 40 million files, which is the range we started with above, the graph reaches the level of ‘indistinguishable from 100%’ so quickly that it appears to jump up vertically at the left-hand end.

Only by magnifying the X-axis 50 times do we get a clear picture of just how inadequate a short checksum is when used for more than a tiny handful of items.

Even if intuition seems to be shouting that a 32-bit checksum is adequate because “the chance of guessing a 32-bit number correctly is only 1 in 4 billion”, the truth is wildly different:

The break-even point is clearly seen to be just below 80,000 files, which is a surprisingly modest number.

In the green graph above, where we calculated the chance of a specific new download colliding with an existing file, the risk of a collision against a collection of 80,000 files was just 18 in a million.

If you have 200,000 files on your computer, which is not unlikely these days, there’s a better than 99% chance that at least one pair of files will suffer a 32-bit hash collision, even though, with 200,000 hashes in your list, you have used up at most 200K/4G, or less than a miserly 0.005%, of all available hashes.

By the way, many computer scientists will tell you that the break-even point for the Birthday Paradox can be calculated using the square root of the number of objects in your collection, so that a 32-bit hash reaches its fifty-fifty point for collisions at `√2`

.^{32} = 2^{16} = 65,536

(To take the square root of a power of two, simply halve the exponent, so that `√2`

, and so on. Loosely speaking, the square root of an N-bit number has N/2 bits.)^{128} = 2^{64}

Some ‘experts’ even seem to think that `√N`

is *an exact formula* for the point at which the chance of a match gets to 50%, but we can see from the examples above that it’s merely an approximation, and an under-estimate at that.

The break-even point for 365 birthdays was 23 people, but `√365 = 19.1`

; for CRC-32 hashes, the break-even was just under 80,000 (calculated explicitly it comes out at 77,164), but `√2`

.^{32} = 65,536

There is no direct formula that avoids looping around and repeatedly multiplying and dividing, which is impractical for working out the break-even point for large collections, such as when you’re using a 128-bit or a 256-bit hash.

But a very close approximation to the Birthday Paradox break-even point, should you need to compute one, can be calculated as follows for a group of `N`

items:

Calculate `√N`

and then multiply by `1.177`

.

**Don’t blindly trust your intuition.**In cryptographic challenges or cybersecurity problem solving, always check and always verify before you accept conclusions. Big numbers don’t always put cracks and hacks out of reach, and ‘obvious’ answers might not be correct.**You may need bigger hashes than you think.**The SHA-2 and SHA-3 algorithms already provide fast and convenient hashes that are cryptographic and anywhere from 256 to 512 bits (16 to 32 bytes) long.**Mathematics and probability can be fun.**Don’t be afraid to dig in and learn more, even if you haven’t looked at a mathematical equation since you were at school. We got right into this one just using plus, minus, times and divide!

Why not ask how SolCyber can help you do your cybersecurity calculations correctly?

Paul Ducklin is a respected expert with more than 30 years of experience as a programmer, reverser, researcher and educator in the cybersecurity industry. Duck, as he is known, is also a globally respected writer, presenter and podcaster with an unmatched knack for explaining even the most complex technical issues in plain English. Read, learn, enjoy!

*Featured image of birthday ballons by Adi Goldstein via Unsplash.*

Share this article:

What it requires is practicality and reason.

It’s time to move beyond the inadequacies of current managed services and experience true security management.

No more paying for useless bells and whistles.

No more time wasted on endless security alerts.

No more dealing with poor automated services.

No more services that only detect but don’t respond.

No more breaches caused by all of the above.

PORTAL

Join our newsletter to stay up to date on features and releases.

CONTACT

SolCyber XDR++™

SolCyber MDR++™

SolCyber Extended Coverage™

SolCyber Foundational Coverage™

Free Demo

8255