Compiling the Birthday Problem into Rust
The Birthday Problem demonstrates that, in a group of 23 people, the odds of two people sharing a birthday exceed 50%. You can read the linked article for an explanation, but the probability is about 50.73% that, for a 365-day year and a group of 23 people, two people will share a birthday.
You can extrapolate this problem to different ranges other than 365, of course. Given range N and population X, in which each X occupies a single slot of N, at what value of X does the probability of two members sharing a slot rise above 50%?
At work the other day, we were facing a birthday problem scenario for a range of 100. Any collisions would silently overwrite data. The calling threads had no context of each other, and the API offered no collision detection. As we were weighing our options, a coworker decided to figure out what our tipping value was. He wrote the following Python script:
#!/usr/bin/env python3
import sys
p = 0.5
n = int(sys.argv[1])
x = 1.0
np = 1 - p
for i in range(n):
x *= (n - i) / n
if x <= np:
print(i + 1, 1.0 - x)
break
This script takes a single, numeric parameter as its range. It then loops from zero to the range, exclusive. For each iteration, it calculates the probability of a collision. If that probability exceeds 50%, it prints the number and the probability as a decimal. Passing 100 to this program nets the following:
13 0.557225039544664
So for range 100, the probability of collision exceeds 50% with 13 participants, and that percentage is 55.72%.
Suddenly, writing software to calculate birthday problem percentages seemed more interesting than the actual problem we were supposed to solve. I quickly translated that Python script to Rust. I was about to send the Rust version to my colleagues, when I paused. Why are we waiting until run time to calculate these probabilities? The probabilities for a given range don’t change. We could calculate all the values up front, put them in a lookup table, and then merely do a lookup at run time. O(1)!
I vaguely remembered that Rust offers some hooks into compilation, so I kagied my way to this:
Generating static arrays during compile time in Rust
Perfect. I decided to mimic the output of the Python script. I’d store the calculations in an array of tuples. An array requires a fixed maximum range, of course. I decided to make the maximum range configurable at compile time, as the linked article does. I’d check at run time that the passed parameter doesn’t exceed the length of the array. Once that check is done, the code to spit out the answer becomes:
let (value, odds) = BIRTHDAY_VALUES[range as usize - 1];
println!("{} {}", value, odds);
The magic (well, the calculations) takes place in build.rs
, which cargo
runs when you build. The next section talks about the steps in the build code.
The Build Code
Code to run at compile time goes in a file called build.rs
at the root of your Rust project. This code has a main
function, which executes when Cargo compiles your code. In main
, we
- Determine the maximum value for our calculation range
- Create a string that contains the code for an array of tuples
- Write the string to a Rust code file
- Tell Cargo how to detect when to rerun our build file
Determining the Maximum Value
We use an environment variable, NUM_VALUES
, to determine our maximum range. In its absence, we default to 1000. That code looks like this:
let num_values = usize::from_str_radix(
env::var("NUM_VALUES")
.unwrap_or("1000".to_string())
.as_str(),
10,
)
.unwrap();
Creating the String
We want to create a string that creates a static array of tuples. When finished, that string will look something like this:
static BIRTHDAY_VALUES: [(u32, f64); 1000] = [
(1, 0.0000000000000000),
(2, 0.5000000000000000),
(3, 0.7777777777777778),
...
(38, 0.5100251098084975),
(38, 0.5096705921517373),
(38, 0.5093165369503566),
];
The string creates a static array of 1000 elements. It’s zero-based, of course, so the zeroth element corresponds to range 1, and the 999th element corresponds to the range 1000.
Each entry in the string is a tuple that contains the number of people required to tip the percentage above 50%, and the odds of a collision at that number of people. Interpret the string above like this:
- Range 1 (zeroth element) is a special case: even with all 1 people, the odds never hit 50% (they stay at 0%)
- Range 2 (first element): once you hit 2 people, you have a 50% chance of a collision
- Range 3 (second element): 3 people are required to tip the odds, but they tip them all the way to 78%
- Range 1000 (999th element): 38 people are required to favor a collision, and it’s a 51% chance
The code for printing this string is pretty straightforward. The only minor gotcha is to remember to print all 16 digits of the odds.
let mut arr = format!("static BIRTHDAY_VALUES: [(u32, f64); {}] = [\n", num_values).to_string();
for i in 1..=num_values {
let (value, odds) = calculate_birthday_value(i as u32);
arr.push_str(&format!("({}, {:.16}),\n", value, odds));
}
arr.push_str("];\n");
We’ve moved the one complexity of the code — calculating the values for the tuple — into its own function called calculate_birthday_value
. It looks like this:
fn calculate_birthday_value(range: u32) -> (u32, f64) {
let mut odds = 1.0;
for i in 0..range {
odds *= (range - i) as f64 / range as f64;
if odds <= 0.5 {
return (i + 1, (1.0 - odds));
}
}
(range, 0.0)
}
Writing the String to a Rust File
When Cargo runs a build script, such as the one we’re creating, it creates several environment variables. You can find more information here. One of the environment variables Cargo creates is OUT_DIR
, defined as “The folder in which all output and intermediate artifacts should be placed (see link for reference).” We use OUT_DIR
to determine where to write the string we’ve created, and we call the file birthday_values.rs
. Here’s that code:
let out_dir = env::var("OUT_DIR").unwrap();
let dest_path = Path::new(&out_dir).join("birthday_values.rs");
fs::write(&dest_path, arr).unwrap();
Detecting Changes
We can avoid rerunning our build script unless NUM_VALUES changes, so we add this to our main function:
println!("cargo:rerun-if-env-changed=NUM_VALUES");
The Run-time Code
At run-time, we perform three actions:
- Get the passed argument, defaulting it to 365 when absent
- Test that the requested value is within the range of our lookup table
- Look up the values in the array
We require one bit of magic to include the birthday_values.rs
file we create at build time: the appropriately-named include!
macro, like this:
include!(concat!(env!("OUT_DIR"), "/birthday_values.rs"));
The main function executes the three actions, as outlined above:
fn main() {
let args = env::args().collect::<Vec<String>>();
let range = args.get(1).map_or(365, |arg| arg.parse().unwrap());
if range > BIRTHDAY_VALUES.len() as u32 {
eprintln!("Range too large, maximum is {}", BIRTHDAY_VALUES.len());
std::process::exit(1);
}
let (value, odds) = BIRTHDAY_VALUES[range as usize - 1];
println!("{} {}", value, odds);
}
When I compile this on my macOS machine in release mode, the resulting executable weighs in at 445k.
You can find all the code here: https://gitlab.com/hoop33/birthday
Conclusion
Static calculations in predictable ranges beg for lookup tables. Rust makes creating lookup tables easy with its hooks into the build process. Was it worth it? You tell me — here’s a comparison of the Rust version vs the Python version on each planet in our Solar System:
Planet | Days in Year1 | Number of People | Odds | Python Time | Rust Time |
Mercury | 88 | 12 | 0.543872883290264 | 0.131 | 0.005 |
Venus | 225 | 18 | 0.5025938339844269 | 0.137 | 0.005 |
Earth | 365 | 23 | 0.5072972343239857 | 0.131 | 0.005 |
Mars | 687 | 32 | 0.519666360035727 | 0.136 | 0.005 |
Jupiter | 4,333 | 78 | 0.5020335858396545 | 0.136 | 0.005 |
Saturn | 10,759 | 123 | 0.5034303495990434 | 0.137 | 0.005 |
Uranus | 30,687 | 207 | 0.5016003710386033 | 0.148 | 0.005 |
Neptune | 60,190 | 209 | 0.5020874868344309 | 0.135 | 0.005 |
OK, so we’ve learned that Rust is faster than Python. We had fun, though, didn’t we?
And the original problem with collisions? Collision detection was added to the API we were using, so we just had to implement our logic in a retry loop.