A Data Story of Cholera & Simulation
We return to 1854, when cholera gripped Victorian London. In a previous doodle, we applied the K-Means algorithm to mathematically derive the Broad Street pump as the spatial centroid of the outbreak. But clustering only tells us where the data sits. To understand how it got there, we need a more dynamic approach.
Now, we are going to use a method that embraces randomness to find certainty: Monte Carlo simulation.
Imagine we want to understand a chaotic event—like a sudden disease outbreak—but we only have scattered, random fragments of information. How do we make sense of the chaos?
Let's start with a classic mathematical trick: calculating the constant Pi (π) by exploiting randomness within a defined geometric frame.
Look at the simulation. Imagine we are blindly throwing 10,000 darts at a square board with a circle painted on it.
Take a circle of radius r inscribed in a square of side 2r. The circle's area is πr² and the square's area is (2r)² = 4r². If we divide the circle's area by the square's area, the r² mathematically cancels out, leaving a fixed ratio of exactly π / 4.
Here is where it gets clever: we can find this ratio without any geometry! We just count our darts. The ratio of Red Darts (inside the circle) to the total number of darts thrown will approximate π / 4.
By algebraically rearranging our formula (simply multiplying our dart ratio by 4), we can estimate Pi. The approximation relies on 10,000 random placements of darts, so the method is fundamentally probabilistic: no single run is guaranteed to produce the true value of Pi (3.141...), but the estimate tightens as the number of darts grows.
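The dart experiment can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the visualisation; it uses the standard quarter-circle variant, which yields the same π / 4 ratio, and the function name and seed are our own.

```python
import random

def estimate_pi(n_darts: int, seed: int = 0) -> float:
    """Estimate pi by throwing darts at the unit square [0, 1) x [0, 1).

    A quarter circle of radius 1 covers pi/4 of the square, so the
    fraction of darts landing inside it approximates pi/4.
    """
    rng = random.Random(seed)  # seeded so the run is reproducible
    inside = 0
    for _ in range(n_darts):
        x, y = rng.random(), rng.random()  # one random dart
        if x * x + y * y <= 1.0:  # landed inside the quarter circle
            inside += 1
    return 4 * inside / n_darts  # rearrange: ratio * 4 approximates pi

print(estimate_pi(10_000))
```

Each run with a different seed lands near, but rarely exactly on, 3.141..., and the spread narrows as `n_darts` grows.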
Using stochastic (randomly determined) sampling to estimate deterministic values is called the Monte Carlo Method. We are going to use this exact principle to reverse-engineer the 1854 Soho Cholera outbreak.
Dr. John Snow recorded the locations of deaths (red dots) and water pumps (yellow dots) during the 1854 Soho cholera outbreak.
We've imported this exact dataset from a previous doodle (https://johnsnow.vercel.app/). Notice the distinct clustering around the Broad Street pump at the centre.
In that approach, the model simply loops through each data point and assigns it to exactly one of K clusters, where K is fixed in advance. In this case, simple eyeballing would suggest that there is only one cluster.
Now let's turn to the Monte Carlo method. We begin by selecting a single random victim's location. From this point, a green agent begins to move, or "stumble," very slowly across the map. Each individual step is chosen entirely at random: up, down, left, or right.
This erratic, "drunken" pattern occurs because the algorithm is entirely stateless, meaning it has no memory of where it has been. In physics and statistics, this is known as the Markov Property: the next move depends only on the current location, not on the path taken to get there.
While this happens, the counters for the yellow pumps sit at zero. If the green agent stumbles into the vicinity of a pump by pure geometric chance, that yellow pump’s counter increments, and the agent immediately resets to its starting location to begin its walk once more.
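The single-agent loop described above can be sketched as a lattice random walk. All coordinates, the vicinity radius, and the function name here are invented for illustration; the real simulation runs on the mapped Soho street grid.

```python
import random

def random_walk_hits(start, pumps, n_walks=200, max_steps=1_000,
                     radius=1, seed=0):
    """Run repeated lattice random walks from `start` and count which
    pump's vicinity each walk stumbles into first.

    Markov property: every step depends only on the current cell,
    never on the path taken to reach it (the walker is stateless).
    """
    rng = random.Random(seed)
    hits = {p: 0 for p in pumps}  # yellow pump counters start at zero
    moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # up, down, left, right
    for _ in range(n_walks):
        x, y = start  # the agent resets here after every hit
        for _ in range(max_steps):
            dx, dy = rng.choice(moves)  # purely random step
            x, y = x + dx, y + dy
            # Did the walker enter the vicinity of any pump?
            near = next((p for p in pumps
                         if abs(x - p[0]) + abs(y - p[1]) <= radius), None)
            if near is not None:
                hits[near] += 1
                break  # reset: a fresh walk begins from `start`
    return hits

# A victim at the origin, one pump close by and one far away.
hits = random_walk_hits(start=(0, 0), pumps=[(2, 0), (10, 0)])
print(hits)
```

Even though every step is random, the nearby pump racks up far more hits, purely because the walk originates closer to it.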
Now we pick a second victim's location and start another random walk at the same slow speed.
The paths accumulate over the streets, and occasional hits tally up on the pumps simply because of random wandering around the origin point.
Finally, we release the remaining 500+ agents simultaneously. Use the speed bar below to watch them emerge at an increasing pace. While individual random walks are chaotic, an emergent pattern solidifies over time: the Broad Street pump accumulates the most hits. This demonstrates that stochastic sampling over geographical territory reveals the system's underlying attractor.
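Scaling the walk up to many victims can be sketched the same way: release a batch of walkers from every victim location and tally which pump each one reaches first. The cluster layout and pump positions below are invented stand-ins, not the historical coordinates.

```python
import random

def release_agents(victims, pumps, walks_per_victim=50,
                   max_steps=500, radius=1, seed=1):
    """Release a batch of random walkers per victim and tally which
    pump's vicinity each walk reaches first."""
    rng = random.Random(seed)
    tally = {p: 0 for p in pumps}
    moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    for start in victims:
        for _ in range(walks_per_victim):
            x, y = start
            for _ in range(max_steps):
                dx, dy = rng.choice(moves)
                x, y = x + dx, y + dy
                hit = next((p for p in pumps
                            if abs(x - p[0]) + abs(y - p[1]) <= radius),
                           None)
                if hit is not None:
                    tally[hit] += 1
                    break
    return tally

# Victims clustered around the origin; a central "Broad Street" stand-in
# pump at (0, 0) and a rival pump far from the cluster at (20, 20).
victims = [(dx, dy) for dx in (-3, 0, 3) for dy in (-3, 0, 3)]
tally = release_agents(victims, pumps=[(0, 0), (20, 20)])
print(tally)
```

No individual walk knows anything about the outbreak, yet the aggregate tally overwhelmingly favours the pump at the centre of the victim cluster, which is exactly the attractor effect the simulation visualises.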
In this single-source scenario, the dominance of the Broad Street pump is expected. However, the true value of this method lies in its application to multi-variable systems that are computationally intractable using traditional equations. By sampling millions of random "what-if" paths, we can approximate solutions to problems where no closed-form formula exists.
This same logic, extended as Monte Carlo Tree Search (MCTS), powers systems like Google DeepMind's AlphaGo, which simulates vast numbers of potential game continuations to identify the most promising moves. In structural biology, similar simulations model the stochastic folding of proteins, helping us map the fundamental building blocks of life. From pricing complex financial derivatives to training reinforcement learning agents in PyTorch, Monte Carlo methods provide the computational framework for turning chaotic noise into actionable probability.