Results of “Experiment on Bernoulli processes”

Published on October 26, 2025 9:47 PM GMT

Two weeks ago I posted an experiment for priors on Bernoulli processes. I gave you all way too much data, though, so I don’t think it worked out to be a very good experiment.
This post provides the results and reveals the hidden experiment class.
Motivation Behind the Experiment
The experiment was meant to test induction. Suppose you are a Bayesian, and you only have a small number of observations about some time-invariant law. What is the correct posterior you should have after these observations? If you just use the bare observed frequencies, then you will be much too overconfident if you see only yesses or nos. What you need to do is start from some prior, and then update from that.
But what is the correct prior? Some people have proposed Jeffreys priors or Laplace’s rule of succession as non-informative priors. Do these actually work? I wanted to see what you guys could come up with.
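For reference, the Beta-family priors mentioned above all give closed-form predictive probabilities. A minimal sketch (the function name is mine):

```python
def beta_predictive(k, n, a=1.0, b=1.0):
    """P(next trial is R | k Rs observed in n trials) under a Beta(a, b) prior.
    a = b = 1 gives Laplace's rule of succession; a = b = 1/2 is the
    Jeffreys prior for a Bernoulli parameter."""
    return (k + a) / (n + a + b)

print(beta_predictive(4, 4))            # Laplace: 5/6 after four straight Rs
print(beta_predictive(4, 4, 0.5, 0.5))  # Jeffreys: 0.9
```

Note how both priors pull the bare observed frequency of 100% down toward 50%, which is exactly the overconfidence correction described above.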
The only way to determine whether a prior is a good prior is by running many different experiments and testing whether the posterior after some number of trials is correct across those experiments. No single experiment is enough. So, I made a dataset of experiments and asked you to induct on the final trial given the previous four. Naturally, the test depends on how well you can predict me. But observing the provided data can help narrow down the class of experiments I was drawing from. Unfortunately, I gave you guys way too much data, and so it didn’t really matter what prior you used. I think the experiment would have been much more interesting if I only provided 100 experiments, and then computed the true marginals on a held-out dataset.
Results
The correct answers were 11.01%, 32.49%, 50.00%, 67.41%, and 88.95%. Here’s a table of the total number of experiments for each count of Rs in the first four trials:

| Rs in First Four Trials | # Experiments | # Final Trial is R | Marginal Frequency |
|---:|---:|---:|---:|
| 0 | 253,099 | 27,856 | 11.01% |
| 1 | 166,571 | 54,125 | 32.49% |
| 2 | 161,832 | 80,924 | 50.00% |
| 3 | 165,890 | 111,825 | 67.41% |
| 4 | 252,608 | 224,700 | 88.95% |

Now that the challenge is over, I’ve posted all of the data online.
Most of you were very close, only off by a few parts in 10,000. I was especially impressed by Cleo Nardo’s submission. She guessed 10.6%, 34.8%, 50%, 65.2%, and 89.4% without even looking at the CSV, all within a couple of percentage points of the correct values.
The Class of Experiments
There were four types of experiments:

1. A fair coin. Each trial had exactly a 50% chance of giving R.
2. A slider coin. A coin bias p was chosen from [0,1] uniformly at random, and then each trial had a p chance of giving R.
3. A filtered slider coin. Like in case (2), a coin bias p was chosen uniformly from [0,1]. Unlike in case (2), the coin was accepted only if its first three flips were all R (half the time) or all L (the other half of the time). Coins were redrawn until one passed the filter, and that coin was then used for the experiment.
4. A frog hop. A frog is placed on the middle lily pad of a line of seven lily pads. Every minute, it hops left or right. Eventually, it hops off the line of lily pads to the right or the left. Each trial’s result is which side it hops off. An experiment is constructed by choosing fixed hop probabilities for each lily pad independently and uniformly from [0,1].

The four types of experiments were mixed with weights 10%, 10%, 10%, and 70%.
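Putting the four generators and the mixture weights into code, here is a sketch of how I read the sampling process (all function names are mine, and treating case (3)'s filter flips as discarded is an assumption, consistent with the Beta mixture described below):

```python
import random

def frog_exit_is_right(hop_right_probs):
    """Simulate one frog trial: start on the middle of the 7 pads (index 3)
    and hop until falling off either end; return True if it exits right."""
    pos = 3
    while 0 <= pos <= 6:
        pos += 1 if random.random() < hop_right_probs[pos] else -1
    return pos > 6

def sample_experiment():
    """Draw one experiment from the 10/10/10/70 mixture and run five trials,
    returning a list of 'R'/'L' results."""
    u = random.random()
    if u < 0.10:                       # (1) fair coin
        trial = lambda: random.random() < 0.5
    elif u < 0.20:                     # (2) slider coin
        p = random.random()
        trial = lambda: random.random() < p
    elif u < 0.30:                     # (3) filtered slider coin
        while True:                    # redraw until a coin passes the filter
            p = random.random()
            want_r = random.random() < 0.5
            if all((random.random() < p) == want_r for _ in range(3)):
                break
        trial = lambda: random.random() < p
    else:                              # (4) frog hop
        pads = [random.random() for _ in range(7)]
        trial = lambda: frog_exit_is_right(pads)
    return ["R" if trial() else "L" for _ in range(5)]
```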
The second and third cases can both be viewed as drawing the Bernoulli probability from a mixture of Beta distributions (so that the correct prior is a mixture of Beta distributions). For case (2), the Beta distribution is Beta(1,1), and for case (3), the distribution is an even mixture of Beta(1,4) and Beta(4,1).
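To see why the filter yields these components: conditioning the uniform $\operatorname{Beta}(1,1)$ draw on three R flips multiplies the density by $p^3$,

$$\pi(p \mid RRR) \propto 1 \cdot p^3 \implies p \mid RRR \sim \operatorname{Beta}(4, 1),$$

and by symmetry conditioning on three L flips gives $\operatorname{Beta}(1, 4)$. Since each half of the filter occurs half the time, the prior for case (3) is the even mixture of the two.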
Cases (1) and (4) do not have Beta priors. Case (4) in particular is tricky. The probability for its Bernoulli process is the right-absorption probability of a Markov chain, which involves multiplying and adding many random variables together to calculate.
Calculating the True Marginal Probabilities
The marginals in the first three cases can be calculated analytically very easily. I think the fourth case can also be analytically solved, but I chose to use a Monte Carlo simulation instead. To get better accuracy, I used two tricks:

Instead of actually running the trials for each experiment, I calculated the chain's absorption probability exactly to get the Bernoulli probability p, which allowed me to compute the exact marginal probabilities (and their weights) for that experiment.
I used PyTorch and the GPU to parallelize 10 billion experiments.
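The first trick can be made concrete with the standard gambler's-ruin formula for a birth-death chain with absorbing boundaries (my own sketch of the computation, not the author's PyTorch code; the function name is mine, and it assumes each hop probability is strictly between 0 and 1):

```python
def exit_right_prob(q):
    """Exact probability the frog eventually exits to the right, starting
    from the middle pad (index 3) of 7 pads, where q[i] is pad i's
    right-hop probability. Uses the gambler's-ruin formula for a
    birth-death chain absorbed at positions -1 and 7."""
    rho = [(1.0 - qi) / qi for qi in q]   # left/right odds at each pad
    prods = [1.0]                          # prods[i] = rho_0 * ... * rho_{i-1}
    for r in rho:
        prods.append(prods[-1] * r)
    return sum(prods[:4]) / sum(prods)     # exit-right prob from pad 3

print(exit_right_prob([0.5] * 7))  # fair hops: exactly 0.5 by symmetry
```

Given this p, the probability of i Rs in the first four trials is C(4, i) p^i (1 − p)^(4−i) and the final-trial marginal is p, so each simulated experiment contributes exact weights rather than noisy trial outcomes.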

Combining everything together, I got the true marginal probabilities to six significant figures (so five correct digits). They are 11.0810%, 32.5015%, 49.9997%, 67.4990%, and 88.9188%.
Who wins?
Because you all were so close, I decided to resort to the true probabilities to determine a winner. My scoring function is

$$\sum_{i=0}^{4} w_i \left[ p_i \ln\frac{p_i}{q_i} + (1 - p_i) \ln\frac{1 - p_i}{1 - q_i} \right],$$

where $w_i$ is the true probability of $i$ Rs in the first four trials, $p_i$ is the true $i$th marginal probability, and $q_i$ is your $i$th guessed marginal probability. (This is a weighted average of the KL divergence from the true marginals to your guesses.)
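In code, the scoring function is a one-liner (a sketch; `score` is my name for it):

```python
import math

def score(w, p, q):
    """Weighted sum over i of KL(Bernoulli(p_i) || Bernoulli(q_i)),
    where w[i] weights each term by the probability of seeing i Rs
    in the first four trials."""
    return sum(
        wi * (pi * math.log(pi / qi) + (1 - pi) * math.log((1 - pi) / (1 - qi)))
        for wi, pi, qi in zip(w, p, q)
    )
```

The score is zero exactly when every guess matches the true marginal, is always nonnegative, and grows as guesses drift from the truth, so lower is better.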
The scores for the public submissions are

| Name | Score |
|---|---:|
| Unnamed | 2.999e-07 |
| One | 3.017e-07 |
| DaemonicSigil | 5.433e-07 |
| James Camacho | 2.840e-04 |
| Cleo Nardo | 4.520e-04 |

It’s very close, but Unnamed wins. Congratulations!
Resolving the Manifold Market
As promised, I resolved the Manifold market using the marginals for the original 1,000,000 trials. Manifold Markets didn’t allow me to resolve to a sub-percentage precision, so I randomly rounded a percentage x to either its ceiling or its floor with probabilities {x} and 1−{x}, where {x} is the fractional part of x.
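The rounding scheme can be sketched as (function name mine):

```python
import math
import random

def randomized_round(x):
    """Round x up with probability equal to its fractional part {x},
    otherwise down, so the expected value of the result is exactly x:
    ceil(x) * {x} + floor(x) * (1 - {x}) = floor(x) + {x} = x."""
    frac = x - math.floor(x)
    return math.ceil(x) if random.random() < frac else math.floor(x)
```

For example, 11.08% resolves to 12% about 8% of the time and to 11% the rest, so the resolution is an unbiased estimate of the true marginal despite the whole-percent restriction.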