Measuring User Experience with ScalaCheck, Selenium WebDriver, and Six Sigma

I recently stumbled upon an idea that I think can measure defects in user experience, and I want to put it down in writing so I have a starting point for further research.  

The germ of this idea took root in my mind after my last blog post.

In my last blog post, I applied traditional software engineering principles to developing JavaScript SPAs, and I used automated testing of user stories as an example.

I also happen to have a project where I use ScalaCheck to automate generative tests for machine learning algorithms and data pipelining architectures.

Further, I happen to have some experience with Six Sigma from my days working as a defense contractor.

By combining the different disciplines of (a) user story mapping, (b) generative testing, and (c) six sigma, I believe that we can measure the “defects of user experience” inherent in any “system under test”.

Let’s discuss each discipline in turn.

User Story Mapping

User story mapping is an approach to requirements gathering that uses concrete examples of “real world” scenarios to avoid ambiguity.

Each scenario clearly defines the context of the system and how the system should work in a given case, and ideally, describes something that we can easily test with an automated testing framework.

For example, here is a sample “create account” user story

One of the limitations of testing user stories is that they cannot give you a measure of the correctness of your application. This is because to “prove” program correctness with programmatic tests we would need to check every single path through our program.

However, to be fair, the goal of user stories is to gather requirements and provide an “objective” measurement system that developers, product, and QA can agree on in advance.

Nevertheless, we still need a means of providing some measure of “program correctness”.

Enter Generative Testing.

Generative Tests

Generative testing tests programs using randomly generated data. This enables you to provide a probabilistic measurement of program correctness. However, this assumes that you know how to set up an experimental design that you can use to measure the accuracy of your program.

For example, the ScalaCheck documentation provides the following snippet of code that tests the Java String class.

If you run ScalaCheck with StringSpecification as input, then ScalaCheck randomly generates strings and checks whether the properties that you defined in StringSpecification are true.

Here is the result that ScalaCheck would provide if you ran it with StringSpecification as input.

We can see that ScalaCheck successfully ran 400 tests against StringSpecification.
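As a rough, language-neutral illustration of the same idea, here is a hand-rolled sketch in Python. It is not the ScalaCheck snippet referred to above and it does not use ScalaCheck's API: the random_string generator, the three properties, and the 100 trials per property are all assumptions made for illustration.

```python
import random
import string


def random_string(rng, max_len=20):
    """Hypothetical generator: a random string of printable characters."""
    length = rng.randint(0, max_len)
    return "".join(rng.choice(string.printable) for _ in range(length))


# Each property receives three random strings and returns True when it holds.
# (Not every property needs all three inputs.)
PROPERTIES = {
    "startswith": lambda a, b, c: (a + b).startswith(a),
    "concatenate": lambda a, b, c: len(a + b) == len(a) + len(b),
    "substring": lambda a, b, c: (a + b + c)[len(a):len(a) + len(b)] == b,
}


def run_properties(trials=100, seed=0):
    """Check every property against `trials` random inputs; return pass counts."""
    rng = random.Random(seed)
    passed = {name: 0 for name in PROPERTIES}
    for name, prop in PROPERTIES.items():
        for _ in range(trials):
            a, b, c = (random_string(rng) for _ in range(3))
            if prop(a, b, c):
                passed[name] += 1
    return passed


if __name__ == "__main__":
    trials = 100
    for name, count in run_properties(trials=trials).items():
        print(f"{name}: passed {count} of {trials} tests")
```

Each pass or fail here is one trial, which is the raw material for the statistical analysis that follows.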

Let’s do a little statistical analysis to figure out what this means about the String class.

According to one view of statistics, every phenomenon has a “true” probability distribution which is really just an abstract mathematical formula, and we use data collection methods to estimate the parameters of the “true” distribution.

We will assume this is the case for this discussion.

Suppose that we do not know anything about the String class. Under this assumption, the maximum entropy principle dictates that we assign 1-to-1 odds to every test that ScalaCheck runs.

That basically means that we should treat every individual test like a coin flip. This is known as a Bernoulli trial.

Now, some really smart guy named Wassily Hoeffding figured out a formula that we could use to bound the accuracy and precision of an experiment based exclusively on the number of trials. We, unsurprisingly, call it Hoeffding’s inequality.

I will not bother explaining the math. I’m sure that I’d do a horrible job at it.

It would also be boring.

I will instead give a basic breakdown of how the number of trials relates to the accuracy and precision of your experiment.

Number of trials   Margin of error   Confidence level
185                10%               95%
265                10%               99%
738                5%                95%
1060               5%                99%
2952               2.5%              95%
4239               2.5%              99%
18445              1%                95%
26492              1%                99%

These are the smallest numbers of trials that satisfy the two-sided form of Hoeffding’s inequality for each combination.

The margin of error measures the accuracy of our experiment and the confidence level measures the precision of our experiment.

Consider the margin of error as a measurement of the experimental result’s reliability, and the confidence level as a measurement of the experimental method’s reliability.

For example, if I had an experiment that used 185 trials and I obtained a point estimate of 50% then this would mean that the “real” value is somewhere between 40% and 60% and that the experiment itself would be correct at least 95 times out of 100.

In other words, up to 5% of the time an experiment like this one would produce an estimate that misses the “real” value by more than the margin of error.

Now that I have explained that, let us apply this concept to our StringSpecification object. Based on the fact that we had 400 successful runs we can say that the String class’s “true” accuracy is roughly between 92% and 100%, and that there is at most a 1% chance that I am completely wrong.
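For anyone who wants to check numbers like these, here is a minimal sketch of the calculation. It assumes the two-sided form of Hoeffding’s inequality, P(|estimate − truth| ≥ margin) ≤ 2·exp(−2·n·margin²), with a natural logarithm; the function names are made up for illustration, and the figures it produces depend entirely on that assumed form.

```python
import math


def hoeffding_trials(margin, confidence):
    """Smallest n such that 2 * exp(-2 * n * margin**2) <= 1 - confidence."""
    alpha = 1.0 - confidence
    return math.ceil(math.log(2.0 / alpha) / (2.0 * margin ** 2))


def hoeffding_margin(trials, confidence):
    """Margin of error guaranteed by the same bound for a fixed number of trials."""
    alpha = 1.0 - confidence
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * trials))


if __name__ == "__main__":
    # Trials needed for each margin / confidence combination.
    for margin in (0.10, 0.05, 0.025, 0.01):
        for confidence in (0.95, 0.99):
            n = hoeffding_trials(margin, confidence)
            print(f"margin {margin:.1%}, confidence {confidence:.0%}: {n} trials")
    # Margin of error for a fixed budget of 400 trials at 99% confidence.
    print(f"400 trials at 99%: margin {hoeffding_margin(400, 0.99):.1%}")
```

The bound is distribution-free, so it is conservative: it never needs to know anything about the String class beyond the fact that each test either passes or fails.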

Easy. Right?

I totally understand if you didn’t understand a single thing of what I just said. Are you still reading?

You might be able to set up an experimental design and measure the results if you are a statistician. However, it is probably beyond the ability of most people.

It would be nice if there were some general set of methods that we could apply in a cookie cutter way, but still have robust results.

Enter Six Sigma.

Six Sigma

Officially, Six Sigma is a set of techniques and tools for process improvement; so, I do not believe that it is generally applicable to software engineering. However, there are a few six sigma techniques that I think are useful.

For example, we could probably use DPMO to estimate how often our system would create a bad user experience (this is analogous to creating a bad part in a manufacturing process).

DPMO stands for defects per million opportunities, and it is defined by the formula

DPMO = (number of defects / (number of units × opportunities per unit)) × 1,000,000

Let’s suppose that we decided to use scalacheck to test user stories with randomly generated values.

This would immediately open up the prospect of measuring “user experience” using DPMO.

For example, let’s consider the scenario “Valid Account Information” for the feature “Create Account”.

According to the scenario, there are two things that would make this test fail:

  • not seeing the message “Account Created”
  • not seeing the link to the login screen

Suppose that we ran 200 randomized tests based on this user story, and had 5 instances where we did not see the message “Account Created” and 2 instances where we did not see the link to the login screen.

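This means we have 7 defects across 200 samples with 2 opportunities each, or 400 opportunities in total. Therefore, DPMO = (7 / (200 × 2)) × 1,000,000 = 0.0175 × 1,000,000 = 17,500, which implies that if we left our system in its current state then we can expect to see 17,500 defects for every 1,000,000 opportunities. Since every created account carries 2 opportunities, that works out to roughly 35,000 defects for every 1,000,000 created accounts.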
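The same arithmetic as a small sketch, in case you want to plug in your own counts. The function name and parameter names are hypothetical; the only inputs are the number of defects, the number of samples, and the number of opportunities per sample from the scenario above.

```python
def dpmo(defects, units, opportunities_per_unit):
    """Defects per million opportunities."""
    total_opportunities = units * opportunities_per_unit
    # Multiply before dividing to keep the floating-point result clean.
    return defects * 1_000_000 / total_opportunities


if __name__ == "__main__":
    # 5 missing "Account Created" messages + 2 missing login links = 7 defects,
    # observed over 200 randomized runs with 2 opportunities each.
    print(dpmo(defects=5 + 2, units=200, opportunities_per_unit=2))
```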

Notice how much easier the math is compared to the math for pure generative testing.

These are formulas and techniques that an average person could understand and make decisions from.

That is huge.


This is just me thinking out loud and exploring a crazy idea. However, my preliminary research suggests that this is very applicable. I’ve gone through a few other six sigma tools and techniques and the math seems very easy to apply toward the generative testing of user stories.

I did not provide a concrete example of how you would use scalacheck to generatively test user stories because I didn’t want it to distract from the general concept. In a future blog post, I will use an example scenario to walk through a step-by-step process of how it could work in practice.

Stay tuned.


How To Create Non-Reproducible Results in Academic Research: A Story of Survivor Bias

I had a discussion today with a friend about the recent news that some researchers couldn’t reproduce a significant number of the results in academic journals.

He believed that this indicated rampant cheating in the academic community. I disagreed with him, though.

According to Hanlon’s Razor, we should never attribute to malice that which we can attribute to stupidity; so, I argued that we should prefer to believe in massive incompetence instead of some evil grand conspiracy.

I used a very simple thought experiment to illustrate this.

Stupid Is as Stupid Does


Suppose that some postgraduate student wants to research the following question: “Does spanking children increase their likelihood of going to jail?”.

Suppose that after a few months of collecting data, our postgraduate student found that spanked people went to jail at a rate that is 50% higher than non-spanked people.  Does this suggest that spanking children increases the chance of committing crime?

Well, that depends on how many people we included in our sample.

Suppose that the data looked like this:


[Table not recovered: a 2×2 breakdown of the 200 people by group (Spanked, Not Spanked) and outcome (Jail, No Jail).]

This hypothetical study included 200 people, of which 10 went to jail. If spanking a child has no effect, then we would expect to see the same number of people from both classes (i.e. spanked and not spanked) go to jail. However, in this case we see that 2 extra spanked people went to jail.

Does this suggest that we have a measurable effect worth investigating?

The answer is no, because we could have observed this outcome by mere chance.

Suppose we take any group of 200 people, tag two of them, and randomly assign all 200 people to 4 cells. The probability that both of our tagged people would fall into one cell is around 30%.

We use that fact to claim that our results are not significant.

However, suppose that the data looked like the following, instead:


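[Table not recovered: a 2×2 breakdown of the 20,000 people by group (Spanked, Not Spanked) and outcome (Jail, No Jail).]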

In this case, we have 200 extra people who have been spanked and went to jail. The probability that all 200 tagged people would randomly fall into one group is less than .01%.

This type of data would suggest that something significant about this phenomenon is worth investigating.
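For readers who want to put a number on “mere chance”, here is a hedged sketch that swaps in a standard test, the one-sided Fisher exact test, in place of the tagging argument used above. The counts are assumptions, not the original data: I take two equal-sized groups, with 6 of 100 spanked and 4 of 100 non-spanked people going to jail in the small study, and 600 of 10,000 versus 400 of 10,000 in the large one, which matches the totals and the “50% higher” rate quoted in the text. The function names are made up for illustration.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def one_sided_p_value(jailed_a, size_a, jailed_b, size_b):
    """Chance that group A holds at least `jailed_a` of the jailed people
    when jail is unrelated to group membership (hypergeometric upper tail)."""
    total = size_a + size_b
    jailed = jailed_a + jailed_b
    log_denominator = log_comb(total, size_a)
    upper = min(jailed, size_a)
    return sum(
        exp(log_comb(jailed, k)
            + log_comb(total - jailed, size_a - k)
            - log_denominator)
        for k in range(jailed_a, upper + 1)
    )


if __name__ == "__main__":
    # Assumed counts (see the paragraph above); not the original tables.
    small = one_sided_p_value(6, 100, 4, 100)
    large = one_sided_p_value(600, 10_000, 400, 10_000)
    print(f"200 people:    p = {small:.3f}")
    print(f"20,000 people: p = {large:.3g}")
```

Under these assumptions the test reaches the same conclusions as the argument above: the small study is nowhere near significant, and the large one is far below the .01% mark.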

This provides an illustration of the basic theory of “null hypothesis testing” which is also known as “significance testing”.

We use this method to measure the probability of observing some data when we assume that the null hypothesis is correct.

Statisticians call this measurement a “p-value”, and from what I hear, the social sciences heavily use it to determine what they publish and don’t publish.

With “significance testing”, we want to reject the null hypothesis, and we use the p-value as the means to reject it. Therefore, we want small p-values, since small p-values provide good evidence against the null hypothesis.

Further, there exists a consensus among some researchers that we should consider p-values less than 5% as significant.

By claiming that we should use 5% as the cutoff for significance testing, the academic community necessarily accepts false positives 5 times out of 100 whenever the null hypothesis is actually true.

In the examples I gave above, we did not have a significant result with 200 sampled people because we had a 30% (i.e. 30 times out of 100) chance to produce that result by accident. However, we did have a significant result with 20,000 sampled people because we had a .01% (i.e. 1 time out of 10,000) chance to produce that result by accident.

Now, suppose that we have 10 postgraduate students researching the question “Does spanking children increase their likelihood of going to jail?”, and they all use 5% as the cutoff for their p-values.

If spanking actually has no effect, then with 10 postgraduate students the probability that at least one of them will get a false positive is roughly 40%.

If we suppose that 20 postgraduate students did this experiment then the probability that at least one of them will get a false positive is roughly 64%.

If we suppose that 100 postgraduate students did this experiment then the probability that at least one of them will get a false positive is roughly 99%.

Essentially, if you have enough postgraduate students running the same experiment then you’ll get a “significant” result for pretty much any research question.
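The percentages above come from one line of arithmetic, sketched here under the assumption that the experiments are independent and that there is no real effect to find. The function name is hypothetical.

```python
def chance_of_spurious_result(experiments, cutoff=0.05):
    """Probability that at least one of several independent experiments
    crosses the significance cutoff purely by chance."""
    return 1.0 - (1.0 - cutoff) ** experiments


if __name__ == "__main__":
    for students in (1, 10, 20, 100):
        chance = chance_of_spurious_result(students)
        print(f"{students:>3} students: {chance:.0%}")
```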

Mediocrity: It Takes A Lot Less Time And Most People Won’t Notice The Difference Until It’s Too Late


The incentives caused by “publish or die” in academia essentially guarantee that we will have a significant number of research papers based on false positives.

In this case, everyone ignores the results that don’t look sexy even when they are true, and focuses only on the sexy results even when they are false.

This is a special case of survivorship bias.

Sadly, we have an easy way to check against this phenomenon that apparently the journals do not use: REPEAT THE EXPERIMENT.

This works because we have only a 1 in 400 chance of generating two false positives in a row when we use a 5% cutoff.
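The 1 in 400 figure can be checked with exact fractions. This sketch assumes the repeated experiments are independent and that there is no real effect; the function name is made up for illustration.

```python
from fractions import Fraction


def chance_all_spurious(repeats, cutoff=Fraction(1, 20)):
    """Probability that every one of `repeats` independent experiments
    crosses a 5% cutoff by chance alone."""
    return cutoff ** repeats


if __name__ == "__main__":
    for repeats in (1, 2, 3):
        print(f"{repeats} run(s): {chance_all_spurious(repeats)}")
```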

The fact that so many articles made it past the peer-review process exposes just how lazy some of our “academics” really are.