German Tanks, Wood Sticks, and Loaded Dice

Nadim Kawwa
DataDrivenInvestor

War, what is it good for? Absolutely nothing.

Statistics, what are they good for? Absolutely everything.

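In this post, we will explore how data scientists use a method called bootstrapping to hack their way into statistics. The term is used loosely here: we simulate a random experiment many times and read the answer off the results, an approach more commonly called Monte Carlo simulation. The method is popular since it can deliver quick results that are close to an analytical solution.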

We will go over three examples. For each, we will:

  • State the problem
  • Set up the experiment
  • Implement it in Python code
  • Validate the result analytically

Let’s begin with our first problem: Tanks!

German Tanks

Photo by Stephanie LeBlanc on Unsplash

The German Tank Problem is a classic estimation problem that stems from a real one faced by the Allied Forces during World War II. This variation of the problem is from a post on fivethirtyeight.com.

Problem Statement

You are a British spy trying to record how many tanks the Germans have. You know that every new tank produced is given a serial number, with the smallest number being 1. So the first tank built has a serial number of 1, the second 2, and so on…

You jot down the serial numbers that you spotted. On the way back you get ambushed and lose the information. All you remember is that the smallest number is 22 and the largest is 114.

How many tanks do the Germans have?

Experiment Setup

Here we know little, but we do know for sure how far away we are from the true minimum, here 1. What happens if our spy saw the tanks many times and got ambushed on each occasion?

Solution in Python Code

Implementing it in code is as follows:
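One way to sketch this repeated-sighting experiment is given here. It assumes a hypothetical true fleet size of 100 tanks; that value, the trial count, the seed, and the function name `simulate_gaps` are all illustrative choices, not part of the original problem. The sketch records how many unseen tanks lie below the smallest sighted serial and above the largest one.

```python
import random
from statistics import mean


def simulate_gaps(n_tanks, sample_size, trials=10_000, seed=0):
    """Return the mean number of unseen tanks below the smallest
    and above the largest sighted serial number."""
    rng = random.Random(seed)
    serials = range(1, n_tanks + 1)
    low_gaps, high_gaps = [], []
    for _ in range(trials):
        # One spying trip: sight `sample_size` distinct tanks at random.
        seen = rng.sample(serials, sample_size)
        low_gaps.append(min(seen) - 1)         # serials below the minimum
        high_gaps.append(n_tanks - max(seen))  # serials above the maximum
    return mean(low_gaps), mean(high_gaps)


# Assumed setup: 100 tanks in total, 10 serial numbers per trip.
low, high = simulate_gaps(n_tanks=100, sample_size=10)
print(f"mean gap below min: {low:.1f}, mean gap above max: {high:.1f}")
```

The two averages should come out close to each other. Changing `sample_size` to 20 shrinks both gaps.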

Here you saw that we assumed our spy snapped 10 serial numbers. The histogram distribution is, therefore:

What happens if we snapped 20 serials?

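We can see that with a bigger sample size the observed minimum and maximum get closer and closer to the edges; this is the key takeaway from the experiment. In both cases, we can infer some kind of symmetry.

This leads us to say that we are as far away from the minimum as we are from the maximum. The smallest serial we saw, 22, sits 21 above the true minimum of 1, so we place the true maximum 21 above the largest serial we saw. The number of tanks that the Germans have is then about 114 + (22 − 1) = 135.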

Analytical Solution

The Wikipedia entry lists more than one way to solve this problem. For example, a frequentist approach would yield a result in the same ballpark:
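The standard frequentist estimate is m + m/k − 1, where m is the largest serial seen and k is the number of serials sighted. Our spy no longer remembers k, so the sketch below simply assumes the sample sizes of 10 and 20 used in the simulations; the function name `frequentist_estimate` is illustrative.

```python
def frequentist_estimate(largest_serial, sample_size):
    """Minimum-variance unbiased estimate of the fleet size: m + m/k - 1."""
    return largest_serial + largest_serial / sample_size - 1


# Assumed sample sizes; the true number of sightings is unknown.
for k in (10, 20):
    print(f"k={k}: about {frequentist_estimate(114, k):.1f} tanks")
```

Both estimates fall somewhat below the symmetry-based figure, because the symmetry argument leans on the observed minimum while this estimator uses only the maximum and the sample size.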

Wood sticks

Photo by Markus Spiske on Unsplash

Problem Statement

We have a stick made of wood of some random length L. We pick two random points along the length and cut the stick at those points. We now have 3 smaller sticks.

What’s the probability we can form a triangle from those 3 sticks?

Experiment Setup

What makes a triangle? If we remember geometry class, we can draw upon the necessary conditions for a triangle to exist. Among those, we can make use of the triangle inequality.

Given a triangle of sides a, b, and c the following inequalities must hold:

  • a + b > c
  • b + c > a
  • a + c > b

So if we pick two points on our stick, those three conditions are necessary and sufficient for a triangle to take shape.

Solution in Python Code

In reality, the length of the stick does not matter. When speaking of individual stick lengths, we refer to each length as a fraction of the total length. We also know that every point along the stick is equally likely to be picked.

Run the experiment many times, as follows:

  • Pick two random points from a uniform distribution
  • Get the length of each portion
  • Check for triangle inequality
  • Record successes
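The steps above can be sketched as follows, treating the stick as having length 1. The function names `can_form_triangle` and `triangle_probability`, the trial count, and the seed are assumptions made for illustration.

```python
import random


def can_form_triangle(a, b, c):
    """Check the three triangle inequalities."""
    return a + b > c and b + c > a and a + c > b


def triangle_probability(trials=100_000, seed=0):
    """Estimate the chance that a stick cut at two random points
    yields three pieces that form a triangle."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        # Two uniform cut points on a stick of length 1, in sorted order.
        x, y = sorted((rng.random(), rng.random()))
        # Lengths of the left, middle, and right pieces.
        a, b, c = x, y - x, 1.0 - y
        if can_form_triangle(a, b, c):
            successes += 1
    return successes / trials


print(triangle_probability())
```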

Running the code above, we get about 0.25.

Analytical Solution

There are several methods of solving this analytically, and we won’t go into the details here, although we can solve it using a lot of integrals as seen here. We note that all solutions return a probability of 0.25, matching our experimental results.

Loaded Dice

Photo by Brett Jordan on Unsplash

Problem Statement

A bag contains 9 dice: 8 are fair and one is loaded. The loaded die will always return a 6, while a fair die returns values from 1 to 6.

We draw a die at random from the bag, roll it twice, and get a 6 on both rolls.

What’s the chance we picked the loaded die?

Experiment Setup

Let’s assume our dice are numbered from 1 to 9, and the ninth one is the loaded die.

  • Pick a random number between 1 and 9
  • If it’s a nine, record it as a cheat success
  • If it’s anything else, pick two integers between 1 and 6, and record a fair success if it’s two sixes
  • Divide the cheat successes by the total number of successes

Solution in Python Code

We implement the experiment above as follows:
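A minimal sketch of this experiment is given here, with the loaded die numbered 9 to match the nine-dice bag in the problem statement. The function name `loaded_given_two_sixes`, the trial count, and the seed are illustrative assumptions.

```python
import random


def loaded_given_two_sixes(trials=200_000, n_dice=9, seed=0):
    """Estimate P(loaded die | two sixes) by simulation."""
    rng = random.Random(seed)
    cheat = 0  # double sixes that came from the loaded die
    fair = 0   # double sixes that came from a fair die
    for _ in range(trials):
        die = rng.randint(1, n_dice)  # the last die is the loaded one
        if die == n_dice:
            cheat += 1  # the loaded die always rolls two sixes
        else:
            rolls = (rng.randint(1, 6), rng.randint(1, 6))
            if rolls == (6, 6):
                fair += 1
    # Only trials that produced two sixes count toward the estimate.
    return cheat / (cheat + fair)


print(round(loaded_given_two_sixes(), 3))
```

Trials where a fair die fails to roll two sixes are simply discarded, which is what conditioning on the observed outcome means in a simulation.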

Running the code block, we obtain about 0.820.

Analytical Solution

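The last problem can be easily solved with Bayes’ Theorem. Given two events A and B, such that P(B) is not zero:

P(A | B) = P(B | A) × P(A) / P(B)

Rewriting the above equation to fit our example:

P(loaded | two sixes) = P(two sixes | loaded) × P(loaded) / P(two sixes)

A loaded die always rolls a 6, so two sixes are certain:

P(two sixes | loaded) = 1

In a bag of 8 fair dice and 1 loaded die, the probability of picking the loaded die is:

P(loaded) = 1/9

The last part is the trickiest: what’s the chance of getting two sixes from a die drawn at random? Remember to distribute the probability over both kinds of die:

P(two sixes) = P(two sixes | loaded) × P(loaded) + P(two sixes | fair) × P(fair) = 1 × 1/9 + 1/36 × 8/9 = 11/81

We now have our terms and can find the probability:

P(loaded | two sixes) = (1 × 1/9) / (11/81) = 9/11 ≈ 0.818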
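The arithmetic can also be checked exactly with Python's `fractions` module. This is only a verification sketch, and the variable names are illustrative.

```python
from fractions import Fraction

p_loaded = Fraction(1, 9)              # one loaded die among nine
p_fair = 1 - p_loaded                  # eight fair dice among nine
p_66_given_loaded = Fraction(1)        # the loaded die always rolls a 6
p_66_given_fair = Fraction(1, 6) ** 2  # two independent fair rolls

# Total probability of rolling two sixes with a randomly drawn die.
p_66 = p_66_given_loaded * p_loaded + p_66_given_fair * p_fair

# Bayes' theorem: P(loaded | two sixes).
posterior = p_66_given_loaded * p_loaded / p_66
print(posterior, float(posterior))  # 9/11, roughly 0.818
```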

Success! Our analytical result closely matches our simulation!

Conclusion

I hope you enjoyed these problems. You can find more details about the implementation in this Jupyter notebook.

When confronting these types of exercises, think of the experiment constraints, ask yourself what a reasonable result might be, and do many trials.

We end with one last riddle: two of the three exercises showed up in real data science interview questions. Can you guess which?
