Measuring User Experience with ScalaCheck, Selenium WebDriver, and Six Sigma

I recently stumbled upon an idea that I think can be used to measure defects in user experience, and I want to put it down in writing so I have a starting point for further research.

The germ of this idea took root in my mind after my last blog post.

In that post, I applied traditional software engineering principles to developing JavaScript SPAs, and I used automated testing of user stories as an example.

I also happen to have a project where I use ScalaCheck to automate generative tests for machine learning algorithms and data pipelining architectures.

Further, I happen to have some experience with Six Sigma from my days working as a defense contractor.

By combining the different disciplines of (a) user story mapping, (b) generative testing, and (c) Six Sigma, I believe that we can measure the “defects of user experience” inherent in any “system under test”.

Let’s discuss each discipline in turn.

User Story Mapping

User story mapping is an approach to requirements gathering that uses concrete examples of “real world” scenarios to avoid ambiguity.

Each scenario clearly defines the context of the system and how the system should work in a given case, and ideally, describes something that we can easily test with an automated testing framework.

For example, here is a sample “create account” user story.
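A rough sketch in given/when/then form (the exact steps are illustrative; the “Valid Account Information” scenario comes back in the Six Sigma section below):

```gherkin
Feature: Create Account

  Scenario: Valid Account Information
    Given I am on the "Create Account" page
    When I submit a valid username, email address, and password
    Then I should see the message "Account Created"
    And I should see a link to the login screen
```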

One of the limitations of testing user stories is that they cannot give you a measure of the correctness of your application. This is because to “prove” program correctness with programmatic tests we would need to check every single path through our program.

However, to be fair, the goal of user stories is to gather requirements and provide an “objective” measurement system that developers, product, and QA can agree to in advance.

Nevertheless, we still need a means of providing some measure of “program correctness”.

Enter Generative Testing.

Generative Tests

Generative testing tests programs using randomly generated data. This enables you to provide a probabilistic measurement of program correctness. However, this assumes that you know how to set up an experimental design that you can use to measure the accuracy of your program.

For example, the ScalaCheck documentation provides the following snippet of code that tests the Java String class.
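A sketch along the lines of the ScalaCheck user guide’s String example, written here with four properties so that a default run (100 tests per property) lines up with the 400 tests mentioned below:

```scala
import org.scalacheck.Prop.forAll
import org.scalacheck.Properties

object StringSpecification extends Properties("String") {

  // Concatenating b onto a keeps a as a prefix.
  property("startsWith") = forAll { (a: String, b: String) =>
    (a + b).startsWith(a)
  }

  // Concatenating b onto a keeps b as a suffix.
  property("endsWith") = forAll { (a: String, b: String) =>
    (a + b).endsWith(b)
  }

  // Dropping the first a.length characters of a + b gives back b.
  property("substring") = forAll { (a: String, b: String) =>
    (a + b).substring(a.length) == b
  }

  // The length of a concatenation is the sum of the lengths.
  property("length") = forAll { (a: String, b: String) =>
    (a + b).length == a.length + b.length
  }
}
```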

If you run ScalaCheck with StringSpecification as input, it will randomly generate strings and check whether the properties that you defined in StringSpecification hold.

Here is the result that ScalaCheck would provide if you ran it with StringSpecification as input.
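With the four-property sketch above, the output would look something like this (ScalaCheck runs 100 tests per property by default):

```
+ String.startsWith: OK, passed 100 tests.
+ String.endsWith: OK, passed 100 tests.
+ String.substring: OK, passed 100 tests.
+ String.length: OK, passed 100 tests.
```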

We can see that ScalaCheck successfully ran 400 tests against StringSpecification.

Let’s do a little statistical analysis to figure out what this means about the String class.

According to one view of statistics, every phenomenon has a “true” probability distribution which is really just an abstract mathematical formula, and we use data collection methods to estimate the parameters of the “true” distribution.

We will assume this is the case for this discussion.

Suppose that we do not know anything about the String class. Under this assumption, the maximum entropy principle dictates that we assign 1-to-1 odds to every test that ScalaCheck runs.

That basically means that we should treat every individual test like a coin flip. This is known as a Bernoulli trial.

Now, some really smart guy named Wassily Hoeffding figured out a formula that we could use to bound the accuracy and precision of an experiment based exclusively on the number of trials. We, unsurprisingly, call it Hoeffding’s inequality.
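For the record, here is the two-sided form of the bound for n independent Bernoulli trials, where p is the true success probability, p̂ is the observed success rate, and ε is the margin of error:

$$P\big(\,|\hat{p} - p| \ge \varepsilon\,\big) \le 2e^{-2n\varepsilon^{2}}$$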

I will not bother explaining the math. I’m sure that I’d do a horrible job at it.

It would also be boring.

I will instead give a basic breakdown of how the number of trials relates to the accuracy and precision of your experiment.

Number of trials   Margin of error   Confidence level
80                 10%               95%
115                10%               99%
320                5%                95%
460                5%                99%
2560               2.5%              95%
3680               2.5%              99%
8000               1%                95%
11500              1%                99%

The margin of error measures the accuracy of our experiment and the confidence level measures the precision of our experiment.

Consider the margin of error as a measurement of the experimental result’s reliability, and the confidence level as a measurement of the experimental method’s reliability.

For example, if I had an experiment that used 80 trials and I obtained a point estimate of 50% then this would mean that the “real” value is somewhere between 40% and 60% and that the experiment itself would be correct 95 times out of 100.

In other words, 5% of the time an experiment like this one would generate completely bogus numbers.

Now that I have explained that, let us apply this concept to our StringSpecification object. Based on the fact that we had 400 successful runs (close to the 460-trial row in the table above), we can objectively say that the String class’s “true” accuracy is roughly between 95% and 100%, and that there is only about a 1% chance that I am completely wrong.

Easy. Right?

I totally understand if you didn’t understand a single thing of what I just said. Are you still reading?

You might be able to set up an experimental design and measure the results if you are a statistician. However, it is probably beyond the ability of most people.

It would be nice if there was some general set of methods that we could apply in a cookie cutter way, but still have robust results.

Enter Six Sigma.

Six Sigma

Officially, Six Sigma is a set of techniques and tools for process improvement; so, I do not believe that it is generally applicable to software engineering. However, there are a few Six Sigma techniques that I think are useful.

For example, we could probably use DPMO to estimate how often our system would create a bad user experience (this is analogous to creating a bad part in a manufacturing process).

DPMO stands for Defects Per Million Opportunities, and it is defined by the following formula:
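$$\text{DPMO} = \frac{\text{number of defects}}{\text{number of units} \times \text{opportunities per unit}} \times 1{,}000{,}000$$

In our case, a “unit” will be one randomized test run and an “opportunity” will be one way a scenario can fail.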

Let’s suppose that we decided to use ScalaCheck to test user stories with randomly generated values.

This would immediately open up the prospect of measuring “user experience” using DPMO.

For example, let’s consider the scenario “Valid Account Information” for the feature “Create Account”.

According to the scenario, there are two things that would make this test fail:

  • not seeing the message “Account Created”
  • not seeing the link to the login screen

Suppose that we ran 200 randomized tests based on this user story, and had 5 instances where we did not see the message “Account Created” and 2 instances where we did not see the link to the login screen.

This means we have 7 defects across 200 samples with 2 opportunities per sample. Therefore, DPMO = (7 / (200 × 2)) × 1,000,000 = 0.0175 × 1,000,000 = 17,500, which implies that if we left our system in its current state then we could expect roughly 17,500 defects for every 1,000,000 opportunities (with two opportunities per created account).
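The arithmetic is simple enough to fold straight into a test harness. A tiny sketch (the function name and signature are hypothetical, not part of any library):

```scala
// Hypothetical helper: defects per million opportunities.
def dpmo(defects: Long, units: Long, opportunitiesPerUnit: Long): Double =
  defects.toDouble / (units * opportunitiesPerUnit) * 1000000

// The example above: 7 defects across 200 samples with 2 opportunities each.
dpmo(7, 200, 2)  // == 17500.0
```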

Notice how much easier the math is compared to the math for pure generative testing.

These are formulas and techniques that an average person could understand and make decisions from.

That is huge.

Conclusion

This is just me thinking out loud and exploring a crazy idea. However, my preliminary research suggests that this is very applicable. I’ve gone through a few other six sigma tools and techniques and the math seems very easy to apply toward the generative testing of user stories.

I did not provide a concrete example of how you would use ScalaCheck to generatively test user stories because I didn’t want it to distract from the general concept. In a future blog post, I will use an example scenario to walk through a step-by-step process of how it could work in practice.

Stay tuned.

How Frameworks Shackle You, and How to Break Free (Part Deux)

In my last post, I talked about how over-reliance on a framework creates immobile code, and how you can use the dependency inversion principle to break the dependency on a framework.

I also mentioned that I would create a demo application to demonstrate how to do this.

However, before I live up to that promise, I have to introduce you to a little theory.

Principles of Package Management

Mobile code does not just happen. You have to design it.

To build mobile code we typically group code into reusable packages that follow proper package design principles.

For this discussion, I will only consider the principles of package cohesion:

  • The Release/Reuse Equivalency Principle
  • The Common Reuse Principle
  • The Common Closure Principle

Robert Martin codified these principles in his book “Agile Software Development: Principles, Patterns, and Practices”. This book is the gold standard for agile software development.

The Release/Reuse Equivalency Principle

The Release/Reuse Equivalency Principle says that “the granule of reuse is the granule of release”.

This principle makes an equivalence between reusability and releasability.

This equivalence has two major implications:

  • you can only release code that is reusable
  • you can only reuse code that is releasable

The reverse is also true:

  • You cannot release code that is not reusable.
  • You cannot reuse code that is not releasable.

This principle puts a very heavy burden on the maintainer of a package, and that burden forces the package maintainer to have a package release strategy.

Package maintainers generally follow the “semantic versioning” strategy.

Semantic versioning has very strict rules related to “semantic version numbers”.

Semantic version numbers follow the pattern x.y.z, where x, y, and z are integers.

Each position conveys a particular meaning (hence the name “semantic versioning”).

The first number is the major version.

We usually start a package with version 0. We should consider a version 0 package as unfinished, experimental, and subject to heavy change without much concern for backwards compatibility.

Starting from major version 1, we consider the published API stabilized and the package trustworthy from that moment on. Each subsequent increment of the major version marks a release that breaks backwards compatibility.

The second part of the version number is the minor version.

We increment the minor version when we add new functionality to the package or deprecate parts of the public API. A minor release promises your clients that the package will not break backwards compatibility; it only adds new ways of using the package.

The last part of the version number is the patch version.

Starting from zero, we increment it for each patch released for the package. A patch can be either a bug fix or some refactored private code.

Further, a package maintainer has the option to add metadata after the release numbers. Typically they will use it to classify a package as being in a particular state: alpha, beta, or rc (release candidate).

For example, these items could be the releases of a package:

  • 2.10.2
  • 3.0.0
  • 3.1.0
  • 3.1.1-alpha
  • 3.1.1-beta
  • 3.1.1-rc.1
  • 3.1.1-rc.2
  • 3.1.1

These numbers communicate the following to a client:

  • release 2.10.x has two patches. The patches may have been bug fixes or refactors. We would need to look at the changelog or commit logs to determine the importance of each patch.
  • After release 2.10.x, the package maintainer decided to break backwards compatibility. The package maintainer signaled to clients that the package breaks backwards compatibility by creating the 3.0.0 release.  
  • At 3.1.0, the package maintainer introduced new features that did not break compatibility with 3.0.0.
  • 3.1.1-alpha signals that the package maintainer started a patch of release 3.1.0. However, the package maintainer does not want to call the patch stable. Someone may have submitted a bug report to the package maintainer for release 3.1.0, and the package maintainer may have started the initial phases of fixing the bug. In this scenario, the package maintainer likely added some testing code to isolate the particular bug, or validate that the bug is fixed.
  • 3.1.1-beta suggests that the package maintainer completed the “feature”. Most likely this signals that the package maintainer’s automated tests pass.
  • 3.1.1-rc.1 suggests that the patch passed the package maintainer’s internal QA and can potentially be released as a stable version. The package maintainer would likely tell clients to run their integration tests against this release, and additional manual QA would likely happen against this release as well.
  • 3.1.1-rc.2 suggests that the package maintainer found regression errors in 3.1.1-rc.1. It may indicate that an integration test failed for a client. The package maintainer may have fixed issues that a client reported and released the fix as 3.1.1-rc.2.
  • 3.1.1 signals that the package maintainer has successfully patched the 3.1.0 release.

The Common Reuse Principle

The Common Reuse Principle states that “code that is used together should be grouped together”. The reverse is also true: “code that is not used together should not be grouped together.”

A dependency on a package implies a dependency on everything within the package; so, when a package changes, all clients of that package must verify that they work with the new version.

If we package code that we do not use together then we will force our clients to go through the process of upgrading and revalidating their package unnecessarily.

By obeying the Common Reuse Principle, we provide the simple courtesy of not making our clients work harder than necessary.

The Common Closure Principle

The Common Closure Principle says that “code that changes together, belongs together”.

A good architect will divide a large project into a network of interrelated packages.  

Ideally, we would like to minimize the number of packages affected by every change request, because when we minimize the number of affected packages we also minimize the work to manage, test, and release those packages. The more packages that change in any given release, the greater the work to rebuild, test, and deploy the release.

When we obey the Common Closure Principle we force a requirements change to happen to the smallest number of packages possible, and prevent irrelevant releases. 

Automated Tests are Mandatory

While clean code and a clean design for your package are important, it’s more important that your package behaves well.

In order to verify the proper behavior of a package, you must have automated tests.

There exist many different opinions on the nature and extent of automated tests, though:

  • How many tests should you write?
  • Do you write the tests first, and the code later?
  • Should you add integration tests, or functional tests?
  • How much code coverage does your package need?

In my opinion, the particulars of how you write tests or the extent of your tests are situational. However, you must have automated tests. This is non-negotiable.

If you don’t have automated tests then you are essentially telling your clients this:

I do not care about you. This works for me today, and I do not care about tomorrow. Use this package at your own peril.

Putting the Principles to Practice

Now that we have the principles, we can start to apply them.

In a future post, I will create a basic application that uses the package design principles that I described. Further, I will compose the components into different frameworks to demonstrate how to migrate the application between frameworks.

Stay tuned.

How Frameworks Shackle You, and How to Break Free

I sometimes hate software frameworks. More precisely, I hate their rigidity and their cookie-cutter systems of mass production.

Personally, when someone tells me about a cool new feature of a framework, what I really hear is “look at the shine on those handcuffs.”

I’ve defended this position on many occasions. However, I really want to put it down on record so that I can simply point people at this blog post the next time they ask me to explain myself.

Why Use a Framework?

Frameworks are awesome things.

I do not dispute that.

Just off the top of my head, I can list the following reasons why you should use a framework:

  1. A framework abstracts low level details.
  2. The community support is invaluable.
  3. The out of the box features enable you not to reinvent the wheel.
  4. A framework usually employs design patterns and “best practices” that enable team development.

I’m sure that you can add even better reasons to this list.

So why would I not be such a big fan?

What Goes Wrong with Frameworks?

VENDOR LOCK-IN.

Just to reinforce the point, let me say that again: VENDOR LOCK-IN.

There is a natural power asymmetry between the designer/creators of a framework and the users of that framework.

Typically, the users of the framework are slaves to the framework, and the framework designers are the masters. It doesn’t matter whether the framework is designed by an open source community or a large corporation.

In fact, this power asymmetry is why vendors and consultants can make tons of money on “free” software: you are really paying them for their specialized knowledge.

Once you enter into a vendor lock-in, the vendor, consulting company, or consultant can charge you any amount of money they want. Further, they can continue to increase their prices at astronomical rates, and you will have no choice but to pay it.

Further, the features of your application become more dependent on the framework as the system grows larger. This has the added effect of increasing the switching cost should you ever want to move to another framework.

I’ll use a simple thought experiment to demonstrate how and why this happens.

Define a module as any kind of organizational grouping of code. It could be a function, class, package, or namespace.

Suppose that you have two modules: module A and module B. Suppose that module B has features that module A needs; so, we make module A use module B.

In this case we can say that module A depends on module B, and we can visualize that dependency with the following picture.

[Figure: module A depends on module B]

Suppose that we introduce another module C, and we make module B use module C. This makes module B depend on module C, and consequently makes module A depend on module C.

[Figure: module A depends on module B, which depends on module C]

Suppose that I created some module D that uses module A. That would make module D depend on module A, module B, and module C.

[Figure: module D depends on module A, which depends on module B, which depends on module C]

This demonstrates that dependencies are transitive, and every new module we add to the system will progressively make the transitive dependencies worse.

“What’s so bad about that?” you might ask.

Suppose that we make a modification to module C. We could inadvertently break module D, module A, and module B since they depend on module C.

Now, you might argue that a good test suite would prevent that … and you would be right.

However, consider the amount of work to maintain your test suite.

If you changed module C then you would likely need to add test cases for it, and since dependencies are transitive you would likely need to alter/add test cases for all the dependent modules.

That is an exponential growth of complexity; so, good luck with that. (See my post “Testing software is (computationally) hard” for an explanation). 

It gets worse, though.

The transitive dependencies make it nearly impossible to reuse your modules.

Suppose you wanted to reuse module A. Well, you would also need to reuse module B and module C since module A depends on them. However, what if module B and module C don’t do anything that you really need, or even worse, do something that impedes you?

You have forced a situation where you can’t use the valuable parts of the code without also using the worthless parts of the code.

When faced with this situation, it is often easier to create an entirely new module with the same features of module A. 

In my experience, this happens more often than not.

There is a module in this chain of dependencies that does not suffer this problem: module C.

Module C does not depend on anything; so, it is very easy to reuse. Since it is so easy to reuse, everyone has the incentive to reuse it. Eventually you will have a system that looks like this.

[Figure: many modules in the system all depending, directly or transitively, on module C]

Guess what that center module is: THE FRAMEWORK.

[Figure: the same dependency graph with the framework at the center]

This is a classic example of code immobility.

Code mobility is a software metric that measures the difficulty in moving code. 

Immobile code signals that the architecture of the system doesn’t support decoupling.

You can argue that the framework has so many reusable components that we ought to couple directly to it.

Ultimately, we value code reuse because we want to save time and resources, and reduce redundancy by taking advantage of already existing components.

Isn’t that exactly what the framework gave us?

Well, yes and no.

It depends on what you want to reuse.

Are you trying to reuse infrastructure code? By all means, please use a framework for that.

Are you trying to reuse domain specific business logic? You probably don’t want to couple your business logic directly to a framework.

An example of coupling your business logic directly to a framework is your typical MVC-ORM design. I’ve already explained this in my blog post “MongoDB, MySQL, and ORMS: when and where to use them”; so, I will not elaborate on it, here.

The Best of Both Worlds  

So it seems that we are at an impasse: we want to reuse infrastructure code from a framework, but we also want our business logic to be independent of it.

Can’t we have both?

Actually, we can.

The direction of the arrows in our dependency graph is what makes our code immobile.

For example, if module A depends on module B then we have to use module B every time we use module A.

However, if we inverted the dependencies such that module A and module B both depended on a common module — call it module C — then we could reuse module C since it has no dependencies. This makes module C a reusable component; so, this is where you would place your reusable code.

The Dependency Inversion Principle exists to enable this pattern.

The typical repository pattern is the perfect example of how inverting dependencies can decouple your business logic from a framework. I have already talked about it in my post “Contract Tests: or, How I Learned To Stop Worrying and Love the Liskov Substitution Principle”; so, I won’t elaborate on it, here.
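To make the shape of this concrete, here is a minimal sketch of the idea in Scala. The names (User, UserRepository, RegisterUser, InMemoryUserRepository) are illustrative, not taken from the posts mentioned above: the business logic depends only on an abstraction that it owns, and the framework-facing code supplies an implementation.

```scala
// Domain module: no framework dependencies at all.
final case class User(id: Long, email: String)

trait UserRepository {
  def save(user: User): Unit
  def findByEmail(email: String): Option[User]
}

// Business logic depends only on the abstraction above.
final class RegisterUser(repository: UserRepository) {
  def register(id: Long, email: String): Either[String, User] =
    repository.findByEmail(email) match {
      case Some(_) => Left("Account already exists")
      case None =>
        val user = User(id, email)
        repository.save(user)
        Right(user)
    }
}

// Infrastructure module: an implementation the framework (or a test) wires in.
final class InMemoryUserRepository extends UserRepository {
  private var users = Map.empty[String, User]
  def save(user: User): Unit = users += (user.email -> user)
  def findByEmail(email: String): Option[User] = users.get(email)
}
```

Swapping the in-memory implementation for a database-backed one (or for another framework’s ORM) does not touch RegisterUser at all.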

So how would you structure your application to support the inverted dependencies?

Well, there are multiple ways you can legitimately do it, but I use the strategy of placing the bulk of the application’s code in external libraries.

Under this model, the framework’s views and controllers  only serve as a delivery mechanism for the application. 
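For example, in an sbt project the layout might look roughly like this (the module names are illustrative):

```scala
// build.sbt (sketch): business logic lives in framework-free modules;
// the web module is only a thin delivery mechanism.
lazy val domain = (project in file("domain"))

lazy val infrastructure = (project in file("infrastructure"))
  .dependsOn(domain) // e.g. repository implementations

lazy val web = (project in file("web"))
  .dependsOn(domain, infrastructure) // controllers/views delegate to the domain
```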

Consequently, we can easily move our code to another framework because the components do not depend on the framework in any way.

As an added benefit, we can test the bulk of the application independent of the framework. 

In fact, this is your typical Component Oriented Architecture.

For example, standard Component Oriented Architecture will break large C++ applications into multiple dll files, large JAVA applications into multiple jar files, or large .NET applications into multiple assemblies.

There are rules about how you structure these packages, however.

You could argue that we would spend too much time with up front design when designing this type of system.

I agree that there is some up-front cost, but that cost is easily offset by the time savings on maintenance, running tests, adding features, and anything else that we want to do to the system in the future.

We should view writing mobile code as an investment that will pay big dividends over time.

I plan to create a very simple demo application that showcases this design in the near future. Stay tuned.