How To Create Non-Reproducible Results in Academic Research: A Story of Survivorship Bias

I had a discussion today with a friend about the recent news that some researchers couldn’t reproduce a significant amount of the results in academic journals.

He believed that this indicated rampant cheating in the academic community. I disagreed with him, though.

According to Hanlon’s Razor, we should never attribute to malice that which we can attribute to stupidity; so, I argued that we should prefer to believe in massive incompetence instead of some evil grand conspiracy.

I used a very simple thought experiment to illustrate this.

Stupid Is as Stupid Does


Suppose that some postgraduate student wants to research the following question: “Does spanking children increase their likelihood of going to jail?”.

Suppose that after a few months of collecting data, our postgraduate student found that spanked people went to jail at a rate 50% higher than non-spanked people. Does this suggest that spanking children increases the chance of committing a crime?

Well, that depends on how many people we included in our sample.

Suppose that the data looked like this:

              Jail   No Jail   Total
Spanked          6        94     100
Not Spanked      4        96     100
Total           10       190     200

This hypothetical study included 200 people, of whom 10 went to jail. If spanking has no effect, then we would expect to see roughly the same number of people from both groups (i.e. spanked and not spanked) go to jail. However, in this case we see that 2 extra spanked people went to jail.

Does this suggest that we have a measurable effect worth investigating?

The answer is no, because we could have observed this outcome by mere chance.

To see why, assume spanking has no effect and pick which 10 of the 200 people go to jail purely at random. The chance that at least 6 of those 10 land in the spanked half is roughly a third.

A result that likely to arise by accident is one we cannot call statistically significant.
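To put a concrete number on "could have happened by chance", here is a sketch of the exact calculation (a one-sided hypergeometric tail, the same quantity a one-sided Fisher's exact test reports): under the null hypothesis, pick which 10 of the 200 people go to jail at random, and ask how often at least 6 of them land in the spanked group of 100.

```javascript
// log(n!) via summation -- fine for the small n used here.
function logFactorial(n) {
  var s = 0;
  for (var i = 2; i <= n; i++) s += Math.log(i);
  return s;
}

function logChoose(n, k) {
  return logFactorial(n) - logFactorial(k) - logFactorial(n - k);
}

// P(X = k) for X ~ Hypergeometric: k spanked people among the `jailed`
// draws, from `total` people of whom `spanked` were spanked.
function hypergeomPmf(k, total, spanked, jailed) {
  return Math.exp(
    logChoose(spanked, k) +
    logChoose(total - spanked, jailed - k) -
    logChoose(total, jailed)
  );
}

// P(at least 6 of the 10 jailed fall in the spanked half of 200 people).
var p = 0;
for (var k = 6; k <= 10; k++) p += hypergeomPmf(k, 200, 100, 10);
console.log(p.toFixed(2)); // about 0.37 -- far too likely to call significant
```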

However, suppose that the data looked like the following, instead:

              Jail   No Jail    Total
Spanked        600     9,400   10,000
Not Spanked    400     9,600   10,000
Total        1,000    19,000   20,000

In this case, 200 more spanked people than non-spanked people went to jail. Under the null hypothesis, the chance of a gap at least this large arising by accident is far below 0.01%.

This type of data would suggest that something significant about this phenomenon is worth investigating.
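Running the same one-sided tail calculation on the larger table (at least 600 of the 1,000 jailed people landing in the spanked half of 20,000) gives a vanishingly small probability. A sketch:

```javascript
// Precompute log(n!) for n = 0..20000 so the tail sum stays fast.
var N = 20000;
var logFact = new Array(N + 1);
logFact[0] = 0;
for (var i = 1; i <= N; i++) logFact[i] = logFact[i - 1] + Math.log(i);

function logChoose(n, k) {
  return logFact[n] - logFact[k] - logFact[n - k];
}

// P(X = k): k spanked people among the 1,000 jailed, drawn from
// 20,000 people of whom 10,000 were spanked.
function pmf(k) {
  return Math.exp(
    logChoose(10000, k) + logChoose(10000, 1000 - k) - logChoose(20000, 1000)
  );
}

// One-sided tail: P(X >= 600).
var p = 0;
for (var k = 600; k <= 1000; k++) p += pmf(k);
console.log(p); // far below 0.01% -- essentially impossible by chance
```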

This illustrates the basic theory of “null hypothesis testing”, also known as “significance testing”.

We use this method to measure the probability of observing data at least as extreme as ours under the assumption that the null hypothesis is true.

Statisticians call this measurement a “p-value”, and from what I hear, the social sciences heavily use it to determine what they publish and don’t publish.

With “significance testing”, we want to reject the null hypothesis, and we use the p-value as the means to reject it. Therefore, we want small p-values, since small p-values provide good evidence against the null hypothesis.

Further, a common convention among researchers treats p-values below 5% as significant.

By using 5% as the cutoff for significance testing, the academic community necessarily accepts false positives 5 times out of 100.

In the examples I gave above, we did not have a significant result with 200 sampled people because there was roughly a one-in-three chance of producing that result by accident. However, we did have a significant result with 20,000 sampled people because there was far less than a 0.01% (i.e. 1 in 10,000) chance of producing that result by accident.

Now, suppose that we have 10 postgraduate students researching the question “Does spanking children increase their likelihood of going to jail?”, and they all use a 5% significance threshold.

With 10 postgraduate students, the probability that at least one of them will get a false positive is roughly 40%.

With 20 postgraduate students, the probability that at least one of them will get a false positive is roughly 64%.

With 100 postgraduate students, the probability that at least one of them will get a false positive is roughly 99%.
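The arithmetic behind those figures is just the complement rule: if each experiment has a 5% false-positive rate, the chance that n independent experiments all stay clean is 0.95^n. A sketch:

```javascript
// P(at least one false positive among n independent tests at alpha = 0.05).
var alpha = 0.05;
[10, 20, 100].forEach(function (n) {
  var pAtLeastOne = 1 - Math.pow(1 - alpha, n);
  console.log(n + " students: " + (100 * pAtLeastOne).toFixed(0) + "%");
});
// 10 students: 40%
// 20 students: 64%
// 100 students: 99%
```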

Essentially, if you have enough postgraduate students running the same experiment then you’ll get a “significant” result for pretty much any research question.

Mediocrity: It Takes A Lot Less Time And Most People Won’t Notice The Difference Until It’s Too Late


The incentives created by “publish or perish” in academia essentially guarantee that we will have a significant number of research papers based on false positives.

In this case, everyone ignores the results that don’t look sexy even when they are true, and focuses only on the sexy results even when they are false.

This is a special case of survivorship bias.

Sadly, there is an easy check against this phenomenon that the journals apparently do not use: REPEAT THE EXPERIMENT.

This works because, at a 5% significance level, there is only a 1 in 400 chance of generating two false positives in a row.
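That 1-in-400 figure is just the significance level squared, since the two experiments are independent and each must produce its own 5% fluke:

```javascript
// Two independent experiments must both produce a false positive.
var alpha = 0.05;
var pTwoInARow = alpha * alpha;
console.log("1 in " + Math.round(1 / pTwoInARow)); // 1 in 400
```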

The fact that so many such articles made it past the peer-review process exposes just how lazy some of our “academics” really are.

Single Assertions Do Not (Necessarily) Check Single Expectations

A friend recently criticized the way that I write unit tests.

According to him, we should only test one expectation at a time; so, we can only have one assertion per test.

Now, I completely agree with the “test only one expectation at a time” philosophy. However, I deny that it implies we should only have one assertion per test.

Unfortunately, I’ve noticed that a large portion of developers buy into this mantra, and under some circumstances it actually produces very good results. However, it also has the potential to block developers from writing clear and meaningful tests.

Confusing Strategy for Tactics

For this discussion, I will use the arrange-act-assert paradigm of software testing.

Suppose that I wanted to remove duplicate items from a JavaScript array. The following function could provide that functionality.
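A minimal sketch of such a function, keeping the first occurrence of each value (assuming strict-equality de-duplication is all we need):

```javascript
// uniq: return a copy of the array with duplicate values removed,
// keeping the first occurrence of each value.
function uniq(list) {
  return list.filter(function (item, index) {
    // Keep an item only if this is the first index at which it appears.
    return list.indexOf(item) === index;
  });
}

console.log(uniq([0, 1, 1, 2, 3, 3])); // [ 0, 1, 2, 3 ]
```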

In order to test the “uniq” function, I could use the following Jasmine code.

Superficially, this test seems to follow the “one expectation; one assertion” rule. However, consider what it means for the test to fail.

What postconditions could make this unit test fail? Off the top of my head, I can think of the following scenarios:

  • The result is not an array
  • The array has more or fewer than 4 elements
  • Index zero of the array is not 0
  • Index one of the array is not 1
  • Index two of the array is not 2
  • Index three of the array is not 3

This implies that the following unit test is at least as powerful.

Tasting the Forbidden Fruit

Even though the first unit test had only one assertion, it actually checked at least 6 different postconditions.

This shows that the “one expectation; one assertion” rule does not hold in general; so, feel free to use as many assertions as you want and need.

Personally, I find that aiming for “one expectation; one assertion” makes me write cleaner tests. However, I’ve also found situations where it makes me write complex or cumbersome tests. My mileage with “one expectation; one assertion” varies depending on the type of test.

I would recommend that everyone at least aim for the ideal of “one expectation; one assertion”, but only insofar as it helps you write good tests. Beyond that point, you should throw it away.