How To Create Non-Reproducible Results in Academic Research: A Story of Survivorship Bias

I had a discussion today with a friend about the recent news that researchers could not reproduce a significant fraction of the results published in academic journals.

He believed that this indicated rampant cheating in the academic community. I disagreed with him, though.

According to Hanlon’s Razor, we should never attribute to malice that which is adequately explained by stupidity; so, I argued that we should prefer to believe in massive incompetence over some evil grand conspiracy.

I used a very simple thought experiment to illustrate this.

Stupid Is as Stupid Does


Suppose that some postgraduate student wants to research the following question: “Does spanking children increase their likelihood of going to jail?”

Suppose that after a few months of collecting data, our postgraduate student found that spanked people went to jail at a rate that is 50% higher than non-spanked people.  Does this suggest that spanking children increases the chance of committing crime?

Well, that depends on how many people we included in our sample.

Suppose that the data looked like this:


              Jail    No Jail
Spanked          6         94
Not Spanked      4         96

(Here we assume 100 people in each group, 200 in total.)
This hypothetical study included 200 people, of which 10 went to jail. If spanking a child has no effect, then we would expect to see the same number of people from both groups (i.e. spanked and not spanked) go to jail. However, in this case we see that 2 extra spanked people went to jail.

Does this suggest that we have a measurable effect worth investigating?

The answer is no, because we could have observed this outcome by mere chance.

Suppose we take any group of 200 people, tag two of them, and randomly assign all 200 people to 4 cells. The probability that both of our tagged people would fall into one cell is around 30%.

We use that fact to claim that our results are not significant.
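That rough probability can be sanity-checked with a quick Monte Carlo sketch. This assumes the simplest model (each person lands in one of the 4 cells independently and uniformly at random, which is my assumption, not the post’s); under that model the estimate comes out near 25%, the same ballpark as the figure above.

```python
import random

def same_cell_probability(trials: int = 100_000) -> float:
    """Estimate the chance that two tagged people, each assigned
    uniformly at random to one of 4 cells, land in the same cell.
    (Under this uniform model the other 198 people don't matter.)"""
    hits = sum(random.randrange(4) == random.randrange(4) for _ in range(trials))
    return hits / trials
```

Under this model the exact answer is 1/4, since wherever the first tagged person lands, the second matches it with probability 1 in 4.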

However, suppose that the data looked like the following, instead:


              Jail    No Jail
Spanked        600      9,400
Not Spanked    400      9,600

(Again split evenly: 10,000 people in each group, 20,000 in total.)
In this case, we have 200 extra spanked people who went to jail. The probability that all 200 tagged people would fall into one group by chance is less than 0.01%.

This type of data would suggest that this phenomenon is worth investigating.

This provides an illustration of the basic theory of “null hypothesis testing” which is also known as “significance testing”.

We use this method to measure the probability of observing data at least as extreme as ours under the assumption that the null hypothesis is correct.

Statisticians call this measurement a “p-value”, and from what I hear, the social sciences heavily use it to determine what they publish and don’t publish.

With “significance testing”, we want to reject the null hypothesis, and we use the p-value as the means to reject it. Therefore, we want small p-values, since small p-values provide good evidence against the null hypothesis.

Further, there exists a consensus among some researchers that we should consider p-values less than 5% as significant.

By claiming that we should use 5% as the cutoff for significance testing, the academic community necessarily accepts false positives 5 times out of 100.

In the examples I gave above, we did not have a significant result with 200 sampled people because we had a 30% (i.e. 30 times out of 100) chance to produce that result by accident. However, we did have a significant result with 20,000 sampled people because we had a 0.01% (i.e. 1 time out of 10,000) chance to produce that result by accident.
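The post doesn’t show the underlying calculation, but one standard way to put a p-value on the two examples above is an exact binomial test, assuming (as the text implies) jailed counts of 6 vs. 4 in the small sample and 600 vs. 400 in the large one. This is a sketch of the idea, not necessarily the test a real study would use.

```python
from math import comb

def binomial_p_value(spanked_jailed: int, total_jailed: int) -> float:
    """Two-sided exact binomial test. Under the null hypothesis
    (spanking has no effect), each jailed person is equally likely to
    come from either group, so the spanked-and-jailed count follows
    Binomial(total_jailed, 0.5)."""
    n, k = total_jailed, spanked_jailed
    k = max(k, n - k)  # fold onto the upper tail; the null distribution is symmetric
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)

small = binomial_p_value(6, 10)      # well above 5%: not significant
large = binomial_p_value(600, 1000)  # far below 5%: significant
```

The small sample gives a p-value of roughly 0.75, while the large sample’s p-value is vanishingly small, matching the post’s verdicts on the two examples.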

Now, suppose that we have 10 postgraduate students researching the question “Does spanking children increase their likelihood of going to jail?”, and they all aim for a 5% p-value.

With 10 postgraduate students, the probability that at least one of them will get a false positive is roughly 40%.

If we suppose that 20 postgraduate students ran this experiment, then the probability that at least one of them will get a false positive is roughly 64%.

If we suppose that 100 postgraduate students ran this experiment, then the probability that at least one of them will get a false positive is roughly 99%.
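These figures follow from one line of arithmetic: with independent experiments on a true null hypothesis, the chance that none comes up “significant” is 0.95 raised to the number of students.

```python
def p_at_least_one_false_positive(num_students: int, alpha: float = 0.05) -> float:
    """Chance that at least one of several independent experiments on a
    true null hypothesis crosses the significance threshold by luck."""
    return 1 - (1 - alpha) ** num_students

for n in (10, 20, 100):
    print(n, round(p_at_least_one_false_positive(n), 2))
```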

Essentially, if you have enough postgraduate students running the same experiment then you’ll get a “significant” result for pretty much any research question.

Mediocrity: It Takes A Lot Less Time And Most People Won’t Notice The Difference Until It’s Too Late


The incentives created by “publish or perish” in academia essentially guarantee that we will have a significant number of research papers based on false positives.

In this case, everyone ignores the results that don’t look sexy even when they are true, and focuses only on the sexy results even when they are false.

This is a special case of survivorship bias.

Sadly, we have an easy way to guard against this phenomenon that the journals apparently do not use: REPEAT THE EXPERIMENT.

This works because, at a 5% significance level, we have only a 1 in 400 chance of generating two false positives in a row.
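The 1-in-400 figure is just the significance threshold squared:

```python
def p_consecutive_false_positives(replications: int, alpha: float = 0.05) -> float:
    """Chance that several independent runs of an experiment on a true
    null hypothesis all come up 'significant' by luck."""
    return alpha ** replications
```

With alpha = 0.05, two runs give 0.05 × 0.05 = 0.0025, i.e. 1 in 400.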

The fact that so many articles made it past the peer-review process exposes just how lazy some of our “academics” really are.


What do probabilities measure, and why does it matter?

Wikipedia defines measurement as “the assignment of numbers to objects or events. It is a cornerstone of most natural sciences, technology, economics, and quantitative research in other social sciences”.

For example, a gram measures mass, a meter measures distance, and a liter measures volume.

So, what does a percent measure?

Well, it can measure at least 3 different things:

  • frequency
  • belief
  • vagueness

Each measure gives rise to different disciplines.

A brief history

The history of probability and statistics is really the history of people. For that reason, I’d like to focus on the people involved in the various schools of thought.

The Frequentists

The frequentist tradition views probability as the ratio of “successes” to “total outcomes”. Pierre de Fermat, Blaise Pascal, Galileo, Gauss, and Jacob Bernoulli devised methods associated with this view of probability.

As a consequence of this view, probability only makes sense when applied to large collections of objects or events. For example, according to this view, it does not make sense to talk about the probability of a single coin flip. You can only talk about the frequency of heads vs. tails given a large collection of coin flips.
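A minimal simulation makes the point: the frequency of heads only means something once the collection of flips is large, and it settles near 0.5 for a fair coin.

```python
import random

def heads_frequency(num_flips: int) -> float:
    """The frequentist notion of probability: the ratio of heads to
    total flips, which only stabilizes for large collections of flips."""
    heads = sum(random.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips
```

For a handful of flips the frequency wobbles wildly; for tens of thousands it hugs 0.5.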

The Bayesians

The Bayesian tradition views probability as a measure of partial knowledge.

Consider the example of a coin flip. If we flip a coin then that coin will either land heads or tails. There is no in-between. However, we can have a partial belief in the outcome, even though the outcome can only be heads or tails.

Thomas Bayes developed this theory from a very simple equation we now know as Bayes’ Theorem. With Bayes’ theorem we can measure the probability of an individual event conditional on something else we already know.
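As a sketch, here is the theorem applied to a made-up example. The prior and the two conditional probabilities below are hypothetical numbers chosen for illustration, not anything from the post.

```python
def bayes_update(prior: float, p_evidence_given_h: float,
                 p_evidence_given_not_h: float) -> float:
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E), where P(E) is
    computed by summing over the two ways the evidence can occur."""
    p_evidence = (p_evidence_given_h * prior
                  + p_evidence_given_not_h * (1 - prior))
    return p_evidence_given_h * prior / p_evidence

# Hypothetical numbers: a 1% prior belief, evidence that shows up in
# 90% of cases where H is true and 5% of cases where it is false.
posterior = bayes_update(0.01, 0.90, 0.05)
```

Even strong-looking evidence only lifts a 1% prior to about a 15% posterior, which is the kind of single-event belief update the frequentist view cannot express.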

In the 1920s, John Maynard Keynes built on that foundation by developing methods that we now know as “objective” Bayesianism. Under this school of thought, given the same amount of information, everyone should have the same belief which we can measure with probability.

Around the same time, Frank Ramsey and Bruno de Finetti proposed that you cannot really measure what people should believe; you can only measure what they actually do believe. As a result, probabilities measure the subjective beliefs of an individual. Naturally, these methods belong to “subjective” Bayesianism.

Fuzzy Logic

Lotfi Zadeh invented fuzzy logic (fuzzy set theory) in the 1960s as a means of measuring “truth”.

Zadeh asked what would happen if we allowed partial (fuzzy) truth. That question leads to very different methods and approaches.

Consider a pot of cooking rice. We want to know if the rice is finished. However, there is no sharp boundary between rice that is cooked and rice that is only partially cooked. In this case, we say that the rice is sort of cooked because the boundary between the two categories is “fuzzy”.
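A membership function makes this concrete. The time thresholds below are made-up illustration values; the point is that membership in the set “cooked” is a degree between 0 and 1 rather than a yes/no answer.

```python
def cooked_membership(minutes: float, raw_until: float = 5.0,
                      done_at: float = 20.0) -> float:
    """Fuzzy membership: the degree to which the rice belongs to the
    set 'cooked', ramping linearly from 0 (raw) to 1 (fully cooked).
    The time thresholds are hypothetical illustration values."""
    if minutes <= raw_until:
        return 0.0
    if minutes >= done_at:
        return 1.0
    return (minutes - raw_until) / (done_at - raw_until)
```

Halfway through the ramp the rice is “0.5 cooked”, a statement about vagueness, not about frequency or belief.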

Why do these distinctions matter?

These are not distinctions without a difference. Improper use of probabilistic language can actually lead to confusion and bad decisions.

Consider the following scenario.

Using random sampling, you can say something about the parameters of a distribution within a certain accuracy and precision. In the frequentist tradition, we measure accuracy with error bounds and precision with confidence intervals.

Consider the error bound as a measure of the result’s reliability and the confidence interval as a measure of the method’s reliability.

For example, I could say that 50% of California voters will vote yes on a proposition, within a 3% error bound at a 95% confidence interval. In this case, the 3% error bound says that the true proportion lies somewhere between 47% and 53%, and the 95% confidence interval says that the method I used to produce that range will be correct 95 times out of 100.
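Assuming the usual normal-approximation interval for a proportion (an assumption on my part; other interval methods exist), a short function shows how the 3% error bound relates to sample size:

```python
from math import sqrt

def margin_of_error(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a
    proportion; z = 1.96 corresponds to 95% confidence."""
    return z * sqrt(p_hat * (1 - p_hat) / n)
```

At an estimated proportion of 0.5, roughly 1,068 respondents give a 3% margin. Polling more people shrinks the error bound, while the 95% figure describes the reliability of the method, not of any single interval.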

However, I have seen people state that the 95% confidence interval means that they are “95% sure” that the parameter actually lies within that interval. That is completely wrong.

This might seem like a minor distinction, but that distinction has major consequences in decision making: calculating a degree of belief is very different from calculating a confidence interval.

I’ll leave that discussion to another blog post.

Hollywood does not have a liberal agenda. It’s actually much worse.

I had a conversation a while back with a friend about how “Hollywood” tries to push a “liberal agenda”. I argued that “Hollywood” does not really care about a “liberal agenda”. Instead, they simply want to make money on a “liberal agenda”.

I used the following thought experiment to prove my point.

Suppose you are a “greedy television executive”. According to your internal statistics, in the last fiscal year, liberals accounted for 53% of your ad revenues and conservatives 47%. Further, suppose liberals have a click-through rate 8 times higher than conservatives.

This means that liberals engage with your site more than conservatives, even though you essentially make the same amount of money from both demographics.

Now, let me present some extra information which, taken together, significantly changes the implications of that 53%.

First, users between the ages of 18 and 35 click almost all of your ads.

Second, that sub-population makes up 12% of your total population.

Third, within that sub-population, males make up 8%.

Multiplying these percentages reveals that liberal males between 18 and 35 make up roughly 1% of your population.

This means that 1% of your user base generates close to 53% of your revenue. As a consequence, someone inside that group can make you 26 times more money than someone outside that group.
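The percentage arithmetic above can be checked in a couple of lines, using the post’s own hypothetical figures:

```python
# The post's hypothetical figures.
age_18_to_35 = 0.12   # share of all users, who click almost all the ads
male_share = 0.08     # males within that sub-population
revenue_share = 0.53  # share of revenue the post attributes to this group

# 0.12 * 0.08 = 0.0096, i.e. roughly 1% of the user base accounting
# for just over half of the revenue.
group_share = age_18_to_35 * male_share
```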


Now, suppose that you — the “greedy television executive” — have 1 million dollars to spend on generating content. You can “be objective” or “pander to a group”.

What do you do?

I’ll let you think about that.