Measuring User Experience with ScalaCheck, Selenium WebDriver, and Six Sigma

I recently stumbled upon an idea that I think can measure defects in user experience, and I want to put it down in writing so I have a starting point for further research.  

The germ of this idea took root after my last blog post, in which I applied traditional software engineering principles to developing JavaScript SPAs, using automated testing of user stories as an example.

I also happen to have a project where I use ScalaCheck to automate generative tests for machine learning algorithms and data pipelining architectures.

Further, I happen to have some experience with Six Sigma from my days working as a defense contractor.

By combining the different disciplines of (a) user story mapping, (b) generative testing, and (c) Six Sigma, I believe that we can measure the “defects of user experience” inherent in any “system under test”.

Let’s discuss each discipline in turn.

User Story Mapping

User story mapping is an approach to requirements gathering that uses concrete examples of “real world” scenarios to avoid ambiguity.

Each scenario clearly defines the context of the system and how the system should behave in a given case, and ideally describes something that we can easily test with an automated testing framework.

For example, here is a sample “create account” user story, sketched here in Gherkin-style syntax (the exact steps are illustrative):
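```gherkin
Feature: Create Account

  Scenario: Valid Account Information
    Given I am on the "Create Account" page
    When I enter a valid username and password
    And I press the "Create Account" button
    Then I should see the message "Account Created"
    And I should see a link to the login screen
```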

One of the limitations of testing user stories is that they cannot give you a measure of the correctness of your application. This is because to “prove” program correctness with programmatic tests, we would need to check every single path through the program.

However, to be fair, the goal of user stories is to gather requirements and provide an “objective” measurement system that developers, product, and QA can agree to in advance.

Nevertheless, we still need a means of providing some measure of “program correctness”.

Enter Generative Testing.

Generative Tests

Generative testing tests programs using randomly generated data. This enables you to provide a probabilistic measurement of program correctness. However, it assumes that you know how to set up an experimental design that you can use to measure the accuracy of your program.

For example, the ScalaCheck documentation provides the following snippet of code that tests the Java String class.
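```scala
import org.scalacheck.Properties
import org.scalacheck.Prop.forAll

object StringSpecification extends Properties("String") {

  property("startsWith") = forAll { (a: String, b: String) =>
    (a + b).startsWith(a)
  }

  property("endsWith") = forAll { (a: String, b: String) =>
    (a + b).endsWith(b)
  }

  property("substring") = forAll { (a: String, b: String) =>
    (a + b).substring(a.length) == b
  }

  property("substring") = forAll { (a: String, b: String, c: String) =>
    (a + b + c).substring(a.length, a.length + b.length) == b
  }
}
```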

If you run ScalaCheck with StringSpecification as input, it will randomly generate strings and check whether the properties you defined in StringSpecification hold.

Here is the result that ScalaCheck would provide if you ran it with StringSpecification as input.
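```
+ String.startsWith: OK, passed 100 tests.
+ String.endsWith: OK, passed 100 tests.
+ String.substring: OK, passed 100 tests.
+ String.substring: OK, passed 100 tests.
```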

We can see that ScalaCheck successfully ran 400 tests against StringSpecification (100 tests for each of the four properties).

Let’s do a little statistical analysis to figure out what this means about the String class.

According to one view of statistics, every phenomenon has a “true” probability distribution which is really just an abstract mathematical formula, and we use data collection methods to estimate the parameters of the “true” distribution.

We will assume this is the case for this discussion.

Suppose that we do not know anything about the String class. Under this assumption, the maximum entropy principle dictates that we assign 1-to-1 odds to every test that ScalaCheck runs.

That basically means that we should treat every individual test like a coin flip. This is known as a Bernoulli trial.

Now, some really smart guy named Wassily Hoeffding figured out a formula that we could use to bound the accuracy and precision of an experiment based exclusively on the number of trials. We, unsurprisingly, call it Hoeffding’s inequality.
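For reference, here is the two-sided form of the bound for n Bernoulli trials, where p̂ is the observed pass rate, p is the “true” rate, ε is the margin of error, and 1 − α is the confidence level:

```latex
\Pr\big(|\hat{p} - p| \ge \varepsilon\big) \le 2e^{-2n\varepsilon^{2}}
\qquad\Longrightarrow\qquad
n \ge \frac{\ln(2/\alpha)}{2\varepsilon^{2}}
```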

I will not bother explaining the math. I’m sure that I’d do a horrible job at it.

It would also be boring.

I will instead give a basic breakdown of how the number of trials relates to the accuracy and precision of your experiment.

number of trials    margin of error    confidence level
              80                10%                 95%
             115                10%                 99%
             320                 5%                 95%
             460                 5%                 99%
            2560               2.5%                 95%
            3680               2.5%                 99%
            8000                 1%                 95%
           11500                 1%                 99%

The margin of error measures the accuracy of our experiment, and the confidence level measures the precision of our experiment.

Consider the margin of error as a measurement of the experimental result’s reliability, and the confidence level as a measurement of the experimental method’s reliability.

For example, if I ran an experiment that used 80 trials and obtained a point estimate of 50%, then the “real” value would be somewhere between 40% and 60%, and the experiment itself would be correct 95 times out of 100.

In other words, 5% of the time an experiment like this one would generate completely bogus numbers.

Now that I have explained that, let us apply this concept to our StringSpecification object. Based on the fact that we had 400 successful runs, we can objectively say that the String class’s “true” accuracy is roughly between 95% and 100%, and that there is only about a 1% chance that I am completely wrong (400 trials sits near the 460-trial row of the table above).

Easy. Right?

I totally understand if you didn’t follow a single thing I just said. Are you still reading?

You might be able to set up an experimental design and measure the results if you are a statistician. However, it is probably beyond the ability of most people.

It would be nice if there was some general set of methods that we could apply in a cookie cutter way, but still have robust results.

Enter Six Sigma.

Six Sigma

Officially, Six Sigma is a set of techniques and tools for process improvement, so I do not believe that it is generally applicable to software engineering. However, there are a few Six Sigma techniques that I think are useful.

For example, we could probably use DPMO to estimate how often our system would create a bad user experience (this is analogous to creating a bad part in a manufacturing process).

DPMO stands for “defects per million opportunities”, and it is defined by the formula

DPMO = (number of defects / (number of units × opportunities per unit)) × 1,000,000

Let’s suppose that we decided to use ScalaCheck to test user stories with randomly generated values.

This would immediately open up the prospect of measuring “user experience” using DPMO.

For example, let’s consider the scenario “Valid Account Information” for the feature “Create Account”.

According to the scenario, there are two things that would make this test fail:

  • not seeing the message “Account Created”
  • not seeing the link to the login screen
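Here is a rough sketch, using ScalaCheck to drive Selenium WebDriver, of what a randomized test of this scenario might look like. The URL, element ids, and link text below are placeholders I have assumed for illustration, not a real application:

```scala
import org.scalacheck.{Gen, Prop, Properties}
import org.openqa.selenium.{By, WebDriver}
import org.openqa.selenium.firefox.FirefoxDriver

object CreateAccountSpecification extends Properties("CreateAccount") {

  // Generate plausible account details (simplified for illustration).
  val accountGen: Gen[(String, String)] = for {
    username <- Gen.nonEmptyListOf(Gen.alphaChar).map(_.mkString)
    password <- Gen.listOfN(10, Gen.alphaNumChar).map(_.mkString)
  } yield (username, password)

  property("Valid Account Information") = Prop.forAll(accountGen) {
    case (username, password) =>
      // Spinning up a browser per test is slow, but keeps the sketch simple.
      val driver: WebDriver = new FirefoxDriver()
      try {
        driver.get("http://localhost:8080/create-account")        // assumed URL
        driver.findElement(By.id("username")).sendKeys(username)  // assumed ids
        driver.findElement(By.id("password")).sendKeys(password)
        driver.findElement(By.id("submit")).click()

        // The two failure conditions from the scenario above:
        val sawMessage   = driver.getPageSource.contains("Account Created")
        val sawLoginLink = !driver.findElements(By.linkText("Log in")).isEmpty
        sawMessage && sawLoginLink
      } finally {
        driver.quit()
      }
  }
}
```

Every failed run counts as an observed defect, which is exactly the raw material that DPMO needs.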

Suppose that we ran 200 randomized tests based on this user story, and had 5 instances where we did not see the message “Account Created” and 2 instances where we did not see the link to the login screen.

This means we have 7 defects across 200 samples with 2 opportunities each (400 opportunities total). Therefore, DPMO = (7 / (200 × 2)) × 1,000,000 = 0.0175 × 1,000,000 = 17,500, which implies that if we left our system in its current state, we could expect to see 17,500 defects for every 1,000,000 opportunities (that is, for every 500,000 accounts created).
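As a sanity check, here is a throwaway helper (the names are my own) that performs this calculation:

```scala
// DPMO = defects / (units * opportunities per unit) * 1,000,000
def dpmo(defects: Int, units: Int, opportunitiesPerUnit: Int): Double =
  defects.toDouble / (units * opportunitiesPerUnit) * 1000000

println(dpmo(7, 200, 2)) // prints 17500.0
```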

Notice how much easier the math is compared to the math for pure generative testing.

These are formulas and techniques that an average person could understand and make decisions from.

That is huge.

Conclusion

This is just me thinking out loud and exploring a crazy idea. However, my preliminary research suggests that the approach is workable. I’ve gone through a few other Six Sigma tools and techniques, and the math seems very easy to apply toward the generative testing of user stories.

Beyond the rough sketch above, I did not walk through a concrete example of using ScalaCheck to generatively test user stories, because I didn’t want it to distract from the general concept. In a future blog post, I will use an example scenario to walk through a step-by-step process of how it could work in practice.

Stay tuned.

Data Science and the Answer to the Ultimate Question of Life, the Universe, and Everything

In “The Hitchhiker’s Guide to the Galaxy”, Douglas Adams tells the story of hyper-intelligent pan-dimensional beings who build a computer named Deep Thought to calculate “the Answer to the Ultimate Question of Life, the Universe, and Everything.” After seven and a half million years, Deep Thought outputs an unintelligible answer: 42.

When they probed Deep Thought for more information, it told them that they did not understand the answer because they did not understand what they had asked.

The moral: make sure you have a good question before you start looking for an answer.

So is the case with “data science”.

You can employ the most sophisticated data science techniques with the right data crunching technologies, but without clear goals you can’t make sense of the numbers.

Based on this principle, I believe that business analysts contribute the most to the success of any “data science” project: they know what to ask, and they know what an answer should look like.

Unfortunately, I’ve seen many organizations invest heavily in machine learning experts and statisticians who don’t understand the business. They are simply building another Deep Thought that will return unactionable results like “42”.

All this could have been avoided if more people just read science fiction.

Is your software project failing? Blame management

I just read the Chaos Manifesto 2013: Think Big, Act Small paper from the Standish Group. It blew my mind.

Let me give a brief introduction to the Standish Group and the Chaos Manifesto before I elaborate on my mind-blowing revelation.

Who’s the Standish Group and Why Should You Care?

The Standish Group has been collecting information on IT projects since 1985. That history allows them to give unique and informed commentary on what does and does not work when building software.

The Chaos Manifesto provides their insight and perspective on software projects based on the data they have. The Chaos Manifesto 2013: Think Big, Act Small paper does so for “small projects”, drawing on 50,000 projects collected since 2002.

I would recommend that everyone involved in building software read it (particularly if you have a management role).

The paper claims that there exist 10 “Factors of Success” for a project, and that some factors matter more than others. Each of the factors has a very technical meaning, and I will not cover them in this post. I challenge you to download the PDF and read it for yourself.

I have replicated a table that summarizes their findings below.

Factors of Success              Points
Executive management support        20
User involvement                    15
Optimization                        15
Skilled resources                   13
Project management expertise        12
Agile process                       10
Clear business objectives            6
Emotional maturity                   5
Execution                            3
Tools and infrastructure             1

The Lesson To Be Learned

I’d like you to notice just how little the individual developer actually matters according to the Standish Group.

Within this paradigm, the individual developer only has control over their own knowledge, skill, and emotional maturity, which fall under the “skilled resources” and “emotional maturity” categories.

That means that developers can really only contribute at most 18% to the success of a project (13 points for skilled resources plus 5 for emotional maturity). Another 72% belongs to factors under management’s control.

That blew my mind.

However, once I really thought about it, I realized just how consistent it was with all of my experiences.

All of my best (and successful) projects had good management, and most of my worst (and failing) projects had horrible management.

For example, I have been in many situations where very important aspects of my job were completely out of my control, or I had to work in an environment or with people and tools that made me very unproductive.

This revelation both humbles and horrifies me.

It humbles me because I understand just how much those above me contribute to my success.

It horrifies me because I understand just how much those above me contribute to my success.

I wish that management really understood just how much their actions (or lack thereof) affect their people.

The Mathematics of Disruptive Innovation

In my post How Rent-Seeking Will Kill the Hollywood Studios, I stated that the Hollywood studios have implemented a strategy of milking their existing markets instead of trying to capture a new market.

I also mentioned that this business strategy would ultimately kill the Hollywood studios.

In this post, I’d like to offer a thought experiment to show why retreating up-market cannot sustain a company when a new market undermines its existing business model.

Suppose we have two companies: Company A and Company B.

Suppose Company A has a customer base of 100,000,000 people in an existing market (call it market B), and Company B has no customer base.

Suppose that some new technology creates market A with 10,000,000 new customers, and that this market grows at 10% per year. Also, market A undermines market B at a rate of 5% per year.

Suppose that Company A captures 10% of market A per year, and still captures 100% of the diminishing market B.

Suppose that Company B captures 90% of market A per year, and does not have any investment in market B.

We can model this situation with the following equations.

Company A = 10^8 × 0.95^year + 10^7 × 1.10^year × 0.10

Company B = 10^7 × 1.10^year × 0.90

Plugging in values for year yields the following results:

year      Company A      Company B
   0    101,000,000      9,000,000
   1     96,100,000      9,900,000
   5     78,988,603     14,494,590
  10     62,467,436     23,343,682
  20     42,576,092     60,547,499
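If you want to check the arithmetic yourself, here is a small script (the function names are mine) that reproduces the table from the two equations above:

```scala
// Customer base for each company after a given number of years.
def companyA(year: Int): Double =
  1e8 * math.pow(0.95, year) + 1e7 * math.pow(1.10, year) * 0.10

def companyB(year: Int): Double =
  1e7 * math.pow(1.10, year) * 0.90

for (year <- Seq(0, 1, 5, 10, 20))
  println(f"year $year%2d: A = ${companyA(year)}%,.0f  B = ${companyB(year)}%,.0f")
```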

I admit that I’ve created a very simple model.

First, we have no exact number for how much a new market can undermine an existing business, beyond the fact that it must do so by a value greater than zero: it could be 0.01% or greater than 10%.

Second, there are hundreds of variables working together to determine the growth rate of a specific market, which prevents it from being constant over any length of time.

However, the simulation does show the long-term damage caused by bad business strategy. Any decline in a market can severely damage a company over the long run and make it vulnerable to bankruptcy or acquisition.

Ultimately, market forces will punish any company that fights disruption. Fighting it can only delay the inevitable by slowing the rate of decline.

How Rent-Seeking Will Kill the Hollywood Studios

Having worked in the entertainment industry for 3 years, I have learned the internal workings of how movie studios make money. After countless hours discussing business models and technology, I’ve come to believe that the Hollywood studios have strategically positioned themselves on the wrong end of a disruptive innovation.

First, let me tell you the classic story of how disruptive innovation destroys companies.

A brief example

By definition, disruptive innovation creates a new market by applying a different set of values, which ultimately (and unexpectedly) overtakes an existing market (source: http://en.wikipedia.org/wiki/Disruptive_innovation).

When disruptive innovation begins to undermine existing business models, large corporations do not typically have the political will to kill their affected business sectors. Instead, they try to defend their products from disruption (especially, if the disruption affects their high profit margin products).

Once they realize that they can no longer defend against the disruption, they try to squeeze as much money as possible from their existing business models. After that, they can either retreat somewhere they feel the disruption cannot reach, or somehow fold the new market into their existing business models. However, by that time, anything they do will be “too little and too late”.

Take the example of Eastman Kodak.

Eastman Kodak invented the business of selling inexpensive cameras. Their business model worked so well that at one point, they commanded 90% of film sales and 85% of camera sales in the US.

When consumers started to demand digital photography products, Kodak did not move fast enough to meet that demand. At that time, digital photography had very low profit margins, and it also undercut Kodak’s existing high-margin film business. By the time digital photography generated high profit margins, other companies had already outmaneuvered Eastman Kodak in that space. Eastman Kodak could not do anything to take the existing market share from their competitors.

Ironically, Eastman Kodak invented digital photography in 1975. If they had pushed digital photography early, they could have easily owned the emerging digital photography market. However, they dropped the product because they feared it would threaten their photographic film business.

The studios’ problem

Enter the giant Hollywood studios (e.g., Disney, Warner Brothers, Universal, NBC, etc.).

Hollywood movies have “release windows”, and each release window has a different set of players who take a cut of the profits.

The different “release windows” typically go in this order: (a) theaters, (b) DVD/Blu-ray, (c) pay-per-view, (d) premium cable channels, (e) network and cable TV, and (f) syndicated TV.

Before a movie ever gets released, the “profits” have already been divided among the different players in each “release window” through contractual agreement. This situation makes it nearly impossible to adapt to a disruption because everyone depends on the studios for content.

For example, movie theaters threatened to boycott the movie “Tower Heist” when Universal tried to experiment with releasing it on VOD in some markets just three weeks after its theatrical debut. Universal intended to test the viability of “premium VOD”. Rather than risk a complete loss, Universal capitulated to the theaters, and never attempted to enter the “premium VOD” market again.

This situation places all the Hollywood studios in the exact same losing position as Eastman Kodak: trying to defend their nameplates from disruption instead of adapting to it.

Eastman Kodak had an entire business model built around film development that would kick and scream if Kodak did anything to undermine it. However, at some point building a digital camera became so easy and inexpensive that pretty much anyone could do it.

Why buy film when your cell phone already has a built-in camera? Oh, you need high-quality photos? Well, consider the various digital SLR cameras from Nikon, Canon, Sony, and … not Kodak.

The rise of digital photography made consumers value film less. To consumers, digital photography added value because it provided more convenience at less cost with more quality. However, Kodak anchored themselves to a business model that opposed the new value system.

Similarly, the rise of video on demand has changed the values of consumers. However, none of the studios can adapt to the new market without undermining their already existing business units. At most, they can only do half-hearted attempts of addressing what consumers really want.

Consider the case of UltraViolet and Digital Copy, which give you access to video streaming as long as you purchase a DVD/Blu-ray copy. This goes at least part way toward closing the value gap. However, it does not go far enough, because it does not really add value from the consumer’s viewpoint. It is simply another way for the studios to “collect rent”.

The studios think that they are “adding value” when they “collect rent”. However, customers don’t see it that way. To consumers, video piracy and Netflix add value because they make consumers less dependent on the studios.

That does not mean that the studios should not have created their own streaming services, though. Given the circumstances, they made the best move they could. Providing that service themselves could not have made their situation any worse, so why not? However, it simply is not enough to get them the traction they need in the new marketplace.

Now the studios can only milk their nameplates for as much money as possible and attempt to retreat upmarket.

For example, Disney and WB do not really make money selling movies. They make money selling cult memberships.

The studios can always count on their franchises to attract a few very loyal consumers who will buy almost anything packaged with a certain branding. For example, I could literally take a pile of bullshit, wrap it in a Game of Thrones package, sell it, and someone would buy it. See here.

While this has worked very well over the last couple of years, it does not seem like a very sustainable business model.

The moral of the story is buy more Netflix stock.

Updated: I corrected a misspelling based on a commenter’s observation that “Kodac” should be “Kodak”