“The Surgical Team” vs “The Full-Stack Developer”

I recently had a conversation with an architect about the wisdom contained inside Fred Brook’s “The Mythical Man-Month”. We both agreed that it contains timeless principles and practices.

At some point, we discussed how his “surgical team” concept might apply to modern development, and we both concluded that we should structure teams around the “surgical team” concept.

I want to tell this story because it clearly demonstrates how we can apply timeless principles to any “modern” problems.

In “The Mythical Man-Month”, Fred Brooks makes two observations::

  • Communication overhead is the single largest problem in software development: the more developers on the project, the bigger the communication overhead.
  • There is a huge productivity gap among software developers: the good developers are very very good. The bad ones are very very bad.

Given these limitations, Dr. Brooks says that we should organize large software development projects into multiple “surgical teams”.

Each “surgical team” has a single “chief programmer” (i.e. the surgeon) who does most of the delicate work.

Much like a surgeon, the “chief programmer” has staff of specialists with more mundane roles to support him.

Contrast this to the current fad of “full-stack” development.

In this model, we build a teams that can handle everything from end-to-end and that can deliver features independently of each other. This is effectively treating people like interchangeable “man-months”, which Dr. Brooks debunks as a “myth”.

I do not want to berate “full-stack” developers, though. They have their place in the world (probably as support for an architect or chief programmer)..

However, a strategy of only hiring full-stack developers does seem very efficient to me because it does not address the two biggest problems that the “Mythical Man Month” addresses.


Vertical Slicing and Product Backlog Management with the Gherkin Syntax

TL;DR: To create a product backlog, “vertically slice” your user stories by grouping similar scenarios. Estimate the work and value of these slices instead of the user story.   

I’ve seen a rise in the demand for “full-stack” developers in the last couple years.

The agile concept of “vertical slicing” made these types of positions very popular.

In a traditional team structure, each person on a team will have knowledge of one layer of an application. When the team attempts to complete some feature, they will have to split the feature into tasks corresponding to layers and then distribute the task to the proper people.

We call this “horizontal slicing”.

If you had a team of “full-stack” developers then you could simply assign a feature to a developer and you could expect them to complete the feature end-to-end with little to no help or coordination.

We call this “vertical slicing”.

This works great in theory, but it has a lot of challenges in practice.

One of these challenges is the exact means of creating high quality product backlog with “vertical slices”.

Enter User Stories

I typically see teams create vertical slices based on “User Stories” and the Gherkin syntax.

The following two code snippets provide examples for two fictitious features: (a) Create Account and (b) Login.

Unfortunately, this sometimes creates very large product backlog items that the team can not deliver quickly.

In particular, I’ve seen this happen when the business dictates that teams can only release  polished and bug-free features.

For example,  it could take a very long time for a team to complete the scenarios “Internal Server Error” and “Wait Too Long” for the feature “Create Account’. Further, those scenarios may not deliver much business value compared to the necessary work.  

In comparison, it could take a very short time for a team to complete the scenario “Valid Account Creation”, and that scenario might have very high business value.

This illustrates that coupling all scenarios together can impede the early and frequent releases we need to create a tight feedback loop between developers and testers or users.

Slice Your Slices

User Stories are not bad, though. We just need a better way to generate vertical slices for our product backlog.

Notice that each user story has multiple scenarios, and that we can conceptually break up each user story into individual scenarios.


Based on this principle, we can create vertical slices by grouping scenarios based on business value.

For example, we could slice our features in the following way.

Feature Vertical Slice Scenario
Create Account Basic Valid Account Creation
Business Rule Violations Duplicate Username
Duplicate Email
User Input Errors Not a Strong Password
Passwords Do Not Match
System Problems Long Wait Time
Internal Server Error
Login Basic Valid Username/Password
Business Rule Violations 1 Invalid Username/Password
Business Rule Violations 2 Too Many Incorrect Attempts
System Problems Long Wait Time
Long Wait Time

Each of these “vertical slices” become product backlog items that we can individually estimate and prioritize.

For example, our fictitious product team could prioritize the “vertical slices” in the following way.

  1. Create Account – Basic
  2. Login – Basic
  3. Login – Business Rule Violations 1
  4. Create Account – User Input Errors
  5. Create Account – Business Rule Violations
  6. Login – Business Rule Violations 2
  7. Create Account – System Problems
  8. Login – System Problems

This allows a more granular approach to creating product backlog items.

As an added benefit, you can leave a “user story” largely undefined so long as you already have its highest priority slices within your product backlog.

This allows you to “groom” your user stories in a “just in time” way.

For example, we created 4 “vertical slices” of the feature “Create Account” in the example above. However, as an alternative, we could simply create the first slice “Create Account – Basic” and not bother with further analysis until someone completes that slice. This could have saved everyone from spending unnecessary time in a grooming session.

I am only providing an illustration, though. Ultimately, the end result depends on the situation and the interaction between team members.


Measuring User Experience with ScalaCheck, Selenium WebDriver, and Six Sigma

I recently stumbled upon an idea that I think can measure defects in user experience, and I want to put it down in writing so I have a starting point for further research.  

The germ of this idea took root in my mind after my last blog post.

In my last blog post, I applied traditional software engineering principles to developing javascript SPAs, and I used automated testing of user stories as an example.

I also happen to have a project where I use scalacheck to automate generative tests for machine learning algorithms and data pipelining architectures.

Further, I happen to have some experience with six sigma from my days working as a defense contractor.

By combining the different disciplines of (a) user story mapping, (b) generative testing, and (c) six sigma, I believe that we can measure the “defects of user experience” inherent in any “system under test”.

Let’s discuss each discipline in turn.

User Story Mapping

User story mapping is an approach to requirements gathering that uses concrete examples of “real world” scenarios to avoid ambiguity.

Each scenario clearly defines the context of the system and how the system should work in a given case, and ideally, describe something that we can easily test with an automated testing framework.

For example, here is a sample “create account” user story

One of the limitations of testing user stories is that they cannot give you a measure of the correctness of your application. This is because to “prove” program correctness with programatic tests we would need to check every single path through our program. 

However, to be fair, the goal of user stories is to gather requirements and provide an “objective” measurement system by which developers, product, and qa can agree to in advance. 

Nevertheless, we still need a means of providing some measure of “program correctness”.

Enter Generative Testing.

Generative Tests

Generative testing tests programs using randomly generated data. This enables you to provide a probabilistic measurements of program correctness. However, this assumes that you know how to setup an experimental design that you can use to measure the accuracy of your program.

For example, the scalacheck documentation provides the following snippet of code that tests the java string class.

If you run scalacheck with StringSpecification as input then scalacheck would randomly generate strings and check whether the properties that you defined in StringSpecification are true.

Here is the result that scalacheck would provide if you ran it with StringSpecification as input.

We can see that scalacheck successfully ran 400 tests against StringSpecification.

Let’s do a little statistical analysis to figure out what this means about our the string class.

According to one view of statistics, every phenomenon has a “true” probability distribution which is really just an abstract mathematical formula, and we use data collection methods to estimate the parameters of the “true” distribution.

We will assume this is the case for this discussion.

Suppose that we do not know anything the String class. Under this assumption, the maximum entropy principle dictates that we assign a 1 to 1 odds to every test that scalacheck runs.

That basically means that we should treat every individual test like a coin flip. This is known as a Bernoulli trial.

Now, some really smart guy named Wassily Hoeffding figured out a formula that we could use to bound the accuracy and precision of an experiment based exclusively on the number of trials. We, unsurprisingly, call it Hoeffding’s inequality.

I will not bother explaining the math. I’m sure that I’d do a horrible job at it.

It would also be boring.

I will instead give a basic breakdown of how the number of trials relate to the accuracy and precision of your experiment.

number of trials margin of error confidence interval
80 10% 95%
115 10% 99%
320 5% 95%
460 5% 99%
2560 2.5% 95%
3680 2.5% 99%
8000 1% 95%
11500 1% 99%

The margin of error measures the accuracy of our experiment and the confidence interval measures the precision of our experiment.

Consider the margin of error as a measurement of the experimental results reliability, and the confidence interval as a measurement of the experimental method’s reliability.

For example, if I had an experiment that used 80 trials and I obtained a point estimate of 50% then this would mean that the “real” value is somewhere between 40% and 60% and that the experiment itself would be correct 95 times out of 100.

In other words, 5% of the time an experiment like this one would generate completely bogus numbers.

Now that I have explained that, let us apply this concept to our StringSpecification object. Based on the fact that we had 400 successful runs we can objectively say that the String class’s “true” accuracy is roughly between 95% – 100%, and that there is only a 1% chance that I am completely wrong.

Easy. Right?

I totally understand if you didn’t understand a single thing of what I just said. Are you still reading?

You might be able to set-up an experimental design and measure the results if you are a statistician. However, it is probably beyond the ability of most people.

It would be nice if there was some general set of methods that we could apply in a cookie cutter way, but still have robust results.

Enter Six Sigma.

Six Sigma

Officially, Six Sigma is a set of techniques and tools for process improvement; so, I do not believe that it is generally applicable to software engineering. However, there are a few six sigma techniques that I think are useful.

For example, we could probably use DPMO to estimate how often out system would create a bad user experience (this is analogous to creating a bad part in a manufacturing process).

DPMO stands for Defects per million opportunities, and it is defined by the formula

Let’s suppose that we decided to use scalacheck to test user stories with randomly generated values.

This would immediately open up the prospect of measuring “user experience” using DPMO.

For example, let’s consider the scenario “Valid Account Information” for the feature “Create Account”.

According to the scenario, there are two things that would make this test fail:

  • not seeing the message “Account Created”
  • not seeing the link to the login screen

Suppose that we ran 200 randomized tests based on this user story, and had 5 instances where we did not see the message “Account Created” and 2 instances where we did not see the link to the login screen.

This means we have 7 defects out of 2 opportunities from 200 samples. Therefore, DMPO = (7 / (200*2)) = 0.0175 * 1,000,000 = 17,500, which implies that if we left our system in its current state then we can expect to see 17,500 defects for every 1,000,000 created accounts.

Notice how much easier the math is compared to the math for pure generative testing.

These are formulas and techniques that an average person could understand and make decisions from.

That is huge.


This is just me thinking out loud and exploring a crazy idea. However, my preliminary research suggests that this is very applicable. I’ve gone through a few other six sigma tools and techniques and the math seems very easy to apply toward the generative testing of user stories.

I did not provide a concrete example of how you would use scalacheck to generatively test user stories because I didn’t want it to distract from the general concept. In a future blog post, I will use an example scenario to walk through a step-by-step process of how it could work in practice.

Stay tuned.

How To Build Testable and Framework Agnostic Single Page Applications with Node.js

In a previous blog post, I spoke about the package principles of cohesion:

  • The Release/Reuse Equivalency Principle
  • The Common Closure Principle
  • The Common Reuse Principle

In this blog post, I will use the principles of package cohesion to build cohesive npm packages, and I will use these npm packages to build testable and framework agnostic single page applications.

For this demonstration, we will create a fake analytics dashboard. Here is what the final product will look like.

Screen Shot 2016-01-31 at 10.14.36 PMGathering Requirements

We will describe the application requirements using the “User Story Mapping” approach.

The following snippet describes the feature of our application

We want our “system under test” to be testable; so, we need to place some architectural restrictions to guarantee “testability”.

At a minimum, a testable system must have the qualities of (a) observability, and (b) controllability.

Observability refers to the property that I can arbitrarily observe the states of the “system under test”, and controllability refers to the property that I can put the “system under test” into an arbitrary state.

The two properties are inherently related to each other: you can’t fully observe a “system under test” that you can’t fully control, and there we don’t care to fully control a “system under test” that we can’t fully observe.

I claim (but will not prove) that these architectural restrictions guarantee “testability”.

In order to support the “controllability” requirement of our architecture, we will use the following background block to load fake data into our “system under test” for every new test run.


By design, a user story will have various scenarios, and we will use the scenarios to determine what needs to be “observable”.

For this demonstration we will consider the following scenarios:

  • Visit the visualization page
  • Change the selected date on the page
  • Aggregate the items by month
  • Aggregate items by year

This snippet shows the scenario description for “visit visualization page”

This snippet shows the scenario description for “change the selected date on the page”:

This snippet shows the scenario description for “aggregate the items by month”:

This snippet shows the scenario description for “aggregate items by year”:

You can see the complete feature file here.

Testing the User Story

The following code snippets uses cucumber.js and babel (an es6 to es5 transpiler) to test the user stories. I use babel because it lets me use generators and coroutines to simplify the asynchronous nature of selenium code.

This snippet initializes the “system under test” based on the data provided in the background block for the user story. This satisfies the “controllability” property of our architecture.

This snippet executes the action(s) that change the state of the “system under test” for the behavior “When I visit the page at <some_url>”:

This snippet executes the action(s) that change the state of the “system under test” for the behavior “When I select the date <some_date>”:

This snippet executes the actions(s) that change the state of the “system under test” for the behavior “When I select to aggregate by <some_aggregation_type>”:

This snippet executes the post conditions checks of the “system under test” for the block “Then I should see a bar chart for <some_title>”.

Note: I will skip the rest of the blocks for brevity’s sake.

In the acceptance tests, I use a page object to encapsulate the data retrieval algorithms for each page.

For example, the following code snippet shows how the page object encapsulates the algorithm that “observes” when the page has finished loading.

This test looks for a DOM element with the html attribute “qa-chart-type”, and then waits for it to change.

We have to do this because the SPA fetches data using a http request when if first loads, and we want to detect when the SPA receives a response.

However, since we cannot easily detect when the SPA receives an http response, we make the SPA communicate this to our test script by creating and updating special “Quality Assurance” html attributes.

This makes testing with selenium much easier and maintainable.

I also created a page object factory to easily change implementations of the page object. It uses the same principles of a “contract test” that I mentioned in my blog post “Contract Tests: or, How I Learned to Stop Worrying and Love the Liskov Substitution Principle”.  

Let’s suppose that we implemented our application using (a) Backbone.js, and (b) React.js.

In order to test a Backbone implementation, we could use the following code

In order to test a React implementation, we could use the following code

This is what running the cucumber application looks like when I test a React implementation:


You can see the complete test suite here

Building Our Components

Suppose that you had the following directory structure for a JAVA application

  • src/
    • use_case001/
      • command/
        • File001.java
        • File002.java
        • File003.java
      • delegate/
        • File001.java
        • File002.java
      • model/
        • File001.java
        • File002.java
    • use_case002/
      • File001.java
      • File002.java
    • use_case003/
      • File001.java
      • File002.java
      • File003.java
    • controller/
      • Controller001.java
      • Controller002.java
    • view/
      • View001.jsp
      • View002.jsp
    • model/
      • Model001.java
      • Model002.java

This is an example of a “monolithic” or “big ball of mud” application architecture.

Architects typically have nothing but contempt for this architecture due to its well known rigidity, fragility, immobility, and viscosity.

These are well known application level architectural properties:

  • rigidity measures how easy an architecture can respond to change.
  • fragility measure how likely something will break when making a change
  • mobility measures how easy an architecture can support moving code
  • viscosity measure how easy an architecture supports maintaining the original design

When faced with a monolithic architecture, most architects I know would refactor the monolith into something that looks like this.

  • src/
    • controller/
      • Controller001.java
      • Controller002.java
    • view/
      • View001.jsp
      • View002.jsp
    • model/
      • Model001.java
      • Model002.java
  • lib/
    • use_case001.jar
    • use_case002.jar
    • use_case003.jar

Each .jar file is a self contained components that is independently testable, releasable, and versioned.

You might think that simply moving code from a directory to a .jar file does not significantly improve the design. However, when applied properly, it has a huge influence on reducing rigidity, fragility, viscosity, and increasing mobility.

We will use this approach, but in this case we will use npm modules as components: a npm module is equivalent to a JAVA .jar file; so, all the same rules apply.

I will use the discipline of Responsibility Driven Design to guide the cohesion strategy for our components. Responsibility Driven Design makes component “responsibility” the most important organizing principle for code.

I claim (but will not prove) that organizing our components by “responsibility” will force our components to obey (a) the Reuse-Release Equivalency Principle, (b) the Common Closure Principle, and (c) the Common Reuse Principle.

From the scenarios in the user story, I can identify the following responsibilities:

  • Bar Chart Visualization
  • User Interface Interaction
  • Data Retrieval
  • QA HTML Attribute Generation

Therefore, I will respectively create the four following packages:

  • analytics-chart
  • analytics-facade
  • analytics-service
  • qa-locator-utility

Create the “Analytics Chart” Component

The actual details of how I created the component are not important; so, I will not talk about it.

However, I feel that I should mention the unit tests I wrote for the visualizations because it isn’t a common thing.

I created unit tests to prove partial correctness of my application using the approach I describe in my previous blog post.

For example, the following code snippet tests the visualization for some items.

This test verifies that certain DOM elements appear in the svg canvas. I consider it a very weak test because there is a good chance that it could pass and the visualization could still be wrong. However, it is also a cheap test to write; so, there is still value in writing it.

You can see the completed npm package here.

Create the “Analytics Facade” Component

This component encapsulates all the user interface “glue logic” (i.e. identifying which screen or page to display), and we use the command pattern and delegation pattern to promote loose coupling between the user interface and this component.

For example, these are the cucumber features that I wrote for this component.

From a client’s perspective, they only need to know how to execute a command from the facade, and how to delegate tasks to the facade. The cucumber tests above provide a very high level description of what types of commands and delegates this facade exposes.

For the sake of completeness, I will also show you the step definitions for this feature.

You can see the completed npm package here.

Create the “Analytics Service” Component

This component simply makes a request to our remote server and return the json results. It isn’t very interesting; so, I will not discuss it.

You can see the completed npm package here.

Create the “QA Locator Utility” Component

Officially, we do not need this component, but we would like to have an easy method to locate elements in our window’s DOM using xpath. This component encapsulates any algorithms that we might need to simplify the process of creating html attributes for the purpose of writing acceptance tests.

I don’t consider it a very interesting component; so, I will not discuss it. 

You can see the completed npm package here.

Composing the Components Into A Single Page Application Framework

To demonstrate how to migrate the application between frameworks, I will compose the components into (a) Backbone 1.2, and (b) React 0.14.

The following component diagram shows the general relationships between the components and framework.

Screen Shot 2016-02-03 at 11.27.24 PM

Backbone 1.2 and React 0.14

We have placed the bulk of the applications functionality inside of framework independent components. Therefore, in order to create a backbone version or a react version of the application, we have to create a special component that coordinates between our “independent” components and the framework.

In the case of React, we created the component backbone_app.

In the case of Backbone, we create the component react_app.

For example, here is the entry point for our Backbone application:

Here is the entry point for our React application:

The parallels in the code are very apparent. I will not go into detail into the code bases because it is not important. I only care about the “shape” of the design.

You can find the Backbone implementation here, and the React implementation here if you really care to dig into the code, however.


This approach might seem very complicated for such a simple application … and it is.

In reality, this approach only makes sense once your application reaches a certain scale. However, simple applications have a tendency of becoming big applications, and it is very expensive to put a test harness on an already existing system with no tests.

Personally, I recommend you always build an application with decoupled testable components from the very beginning because that investment will pay large dividends in the long run.

Further, decoupling your components from the framework has huge benefits even if you have no intention of switching frameworks.

One practical “use case” is with automated testing.

When you decouple your components from the framework then you can test the bulk of your application independent of the framework. This significantly simplifies the overall testing process.

In fact, the package principles were partially created to minimize the work of creating and maintaining tests.

In a future post, I will show you how you can use a Continuous Integration tool to automatically run your tests whenever someone releases a new version of a component.

That should demonstrate the practicality of strictly following package principles.

How Frameworks Shackle You, and How to Break Free (Part Deux)

In my last post, I talked about how over-reliance of a framework creates immobile code, and how you can use the dependency inversion principle to break the dependency on a framework.

I also mentioned that I would create a demo application to demonstrate how to do this.

However, before I live up to that promise, I have to introduce you to a little theory.

Principles of Package Management

Mobile code does not just happen. You have to design it.

To build mobile code we typically group code into reusable packages that follow proper package design principles.

For this discussion, I will only consider the principles of package cohesion:

  • The Release/Reuse Equivalency Principle
  • The Common Reuse Principle
  • The Common Closure Principle

Robert Martin codified these principles in his book “Agile Software Development: Principles, Patterns, and Practices”. This book is the gold standard for agile software development.

The Release/Reuse Equivalency Principle

The Release/Reuse Equivalency Principle says that “the granule of reuse is the granule of release”.

This principle makes an equivalence between reusability and releasability.

This equivalence has two major implications:

  • you can only release code that is reusable
  • you can only reuse code that is releasable

The reverse is also true:

  • You cannot release code that is not reusable.
  • You cannot reuse code that is not releasable.

This principle puts a very heavy burden on the maintainer of a package, and that burden forces package maintainer to have a package release strategy.

Package maintainers generally follow the “semantic versioning” strategy.

Semantic versioning has very strict rules related to “semantic version numbers”.

Semantic version numbers consist of the pattern of x.y.z where x, y, and z are integers.

Each position conveys a particular meaning (hence the name “semantic versioning”).

The first number is the major version.

We usually start a package with version 0. We should considered a version 0 package as unfinished, experimental, heavily changing without too much care for backwards compatibility.

Starting from major version 1, we consider the published API as stabilized and that the package has a certain trustworthiness from that moment. Every next increment of the major version marks the moment that parts of the code breaks backward compatibility.

The second part of the version number is the minor version.

We increment the minor versions when we add new functionality to the package or deprecate parts of the public API. A minor release promises your clients that the package will not break backwards compatibility. A minor version only adds new ways of using the package.

The last part of the version number is the patch version.

Starting with version 0 it is incremented for each patch that is released for the package. This can be either a bug fix, or some refactored private code.

Further, a package maintainer has the option to add meta-data after the release numbers. Typically they will use it to classifying packages as having a particular state: alpha, beta, or rc (release candidate).

For example, these items could be the releases of a package:

  • 2.10.2
  • 3.0.0
  • 3.1.0
  • 3.1.1-alpha
  • 3.1.1-beta
  • 3.1.1-rc.1
  • 3.1.1-rc.2
  • 3.1.1

These number communicate the following to a client:

  • release 2.10.x has two patches. The patches may have been bug fixes or refactors. We would need to look at the changelog or commit logs to determine the importance of each patch.
  • After release 2.10.x, the package maintainer decided to break backwards compatibility. The package maintainer signaled to clients that the package breaks backwards compatibility by creating the 3.0.0 release.  
  • At 3.1.0, the package maintainer introduced new features that did not break compatibility with 3.0.0.
  • 3.1.1-alpha signals that the package maintainer started a patch of release 3.1.0. However, the package maintainer does not want to call the patch stable. Someone may have submitted a bug report to the package maintainer for release 3.1.0, and the package maintainer may have started the initial phases of fixing the bug. In this scenario, the package maintainer likely added some testing code to isolate the particular bug, or validate that the bug is fixed.
  • 3.1.1-beta suggests that the package maintainer completed the “feature”. Most likely this signals that the package maintainer’s automated tests pass.
  • 3.1.1-rc.1 suggests that the package passed manual QA and that the package manager can potentially release the patch as a stable version. The package manager would likely tell clients to run their integration tests against this release. Manual QA likely happen against this release, also.
  • 3.1.1-rc.2 suggests that the package maintainer found regression errors in 3.1.1-rc.1. It may indicate that an integration test failed for a client. The package manager may have fixed issues that a client reported and released the fix as 3.1.1-rc.2.
  • 3.1.1 signals that the package maintainer has successfully patched the 3.1.0 release.

The Common Reuse Principle

The Common Reuse Principle states that “code that is used together should be group together”. The reverse is also true: “code that is not used together should not be grouped together.”

A dependency on a package implies a dependency on everything within the package; so, when a package changes, all clients of that package must verify that they work with the new version.

If we package code that we do not use together then we will force our clients to go through the process of upgrading and revalidating their package unnecessarily.

By obeying the Common Reuse Principle, we provide the simple courtesy of not making our clients work harder than necessary.

The Common Closure Principle

The Common Closure Principle says that “code that changes together, belong together”.

A good architect will divide a large project into a network of interrelated packages.  

Ideally, we would like to minimize the number of packages for every change request because when we minimize the number of effected packages we also minimize the work to manage, test, and release those packages. The more packages that change in any given release, the greater the work to rebuild, test, and deploy the release. 

When we obey the Common Closure Principle we force a requirements change to happen to the smallest number of packages possible, and prevent irrelevant releases. 

Automated Tests are Mandatory

While clean code and clean design of your package is important, it’s more important that your package behaves well.

In order to verify the proper behavior of a package, you must have automated tests.

There exists many different opinions on the nature and extent of automated tests, though:

  • How many tests should you write?
  • Do you write the tests first, and the code later?
  • Should you add integration tests, or functional tests?
  • How much code coverage does your package need?

In my opinion, the particulars of how you write tests or the extent of your tests are situational. However, you must have automated tests. This is non-negotiable.

If you don’t have automated tests then you are essentially telling your clients this:

I do not care about you. This works for me today, and I do not care about tomorrow. Use this package at your own parel.  

Putting the Principles to Practice

Now that we have the principles, we can start to apply it.

In a future post, I will create a basic application that uses the package design principles that I described. Further, I will compose the components into different frameworks to demonstrate how to migrate the application between frameworks.

Stay tuned.

Contract Tests or: How I Learned to Stop Worrying and Love The Liskov Substitution Principle

We software developers often regret past design decisions because we get stuck with their consequences. As an industry, we face this challenge so much that we have a name for it: accidental complexity.

Developers introduce “accidental complexity” when they design interfaces or system routines that unnecessarily impedes future development.

For example, I might decide to use a database to persist application state, but later I might realize that using a database introduces scalability problems. However, by the time I realize this, all my business logic depends on the database; so, I can’t easily change the persistence mechanism because I coupled it with the business logic.

In this hypothetical example, I only care about persisting application state, but I don’t necessarily care how I persist it or where I persist it. I could theoretically use any persistence mechanism. However, I “accidentally” coupled myself to a database, and that cause the “accidental complexity”.

I see this frequently happen with applications that use Object Relational Mappers (ORMs). Consider the following code snippet:

The Person class has special annotations from the Doctrine ORM framework. It allows me to “automatically” persist information to a database based on the annotation values. This significantly simplifies the persistence logic.

For example, if I wanted to save a new Person to a database, I could do it quite easily with the following code snippet.

However, as a consequence of our “simplification”, I have also coupled two separate responsibilities: (a) the business logic, and (b) the persistence logic.

Unfortunately, any complex situation will force us to make difficult trade-offs, and sometimes we don’t always have the information we need to make proper decisions. This situation can force us to make early decisions that unnecessarily introduce “accidental complexity”.

Fortunately, we have a tool that can let us defer implementation details: the Abstract Data Type (ADT).

Abstraction to the Rescue

An ADT provides me the means to separate “what” a module does from “how” a module does it — we define a module as some “useful” organization of code.

Object oriented programming languages typically use interfaces and classes to implement the concept of an ADT.

For example, suppose that I defined a Person class in PHP

I could use an interface to define a PersonRepository to signal to the developer that this module will (a) returns a Person object from persistent storage, and (b) save a Person object to persistant storage. However, this interface only signals the “what”; not the “how”.

Suppose that I wanted to use a MySql database to persist information. I could do this with a MySqlPersonRepository class that implements the PersonRepository interface.

This class defines (a) “how” to find a Person from a database, and (b) “how” to save a Person to a database.

If I wanted to change the implementation to MongoDb then I could potentially use the following class.

This would enable us to write code like the following.

Notice how I don’t make any reference to a particular style of persistence in the code above. This “separation of concerns” allows me to switch implementations at runtime by changing the definition of AppFactory::getRepositoryFactory.

To use a MySql database, I could use the following class definition.

and if we wanted to use a MongoDb datastore then we could use the following class definition

By simply changing one line of code, I can change how the entire application persists information.

I HAVE THE POWER … of the Liskov Substitution Principle

Recall, the original thought experiment: I originally used a MySql database to persist application state, but later needed to use MongoDb. However, I could not easily move the persistence algorithms because I coupled the business logic to the persistence mechanism (i.e. ORM).

Mixing the two concerns made it hard to change persistence mechanism because it also required changing the business logic. However, when I separated the business logic from the persistence mechanism, I could make independent design decisions based on my needs.

The power to switch implementations comes from the “Liskov Substitution Principle”.

Informally, the Liskov Substitution Principle states that if I bind a variable X to type T then I can substitute any subtype S of T to the variable X.

In the example above, I had a type PersonRepository, and two subtypes (a) MySqlPersonRepository, and (b) MongoDbPersonRepository. The Liskov Substitution Principle states that I should be able to substitute either subtype for a variable PersonRepository.

We call this “behavioral subtyping”. This type of subtyping differs from traditional subtyping because behavioral subtyping demands that the runtime behaviors of subtypes behave in a consistent way to the parent type.

Everybody (and Everything) Lies

Just because a piece of code claims to do something, does not imply that it actually does do it. When dealing with real implementations of an ADT, we need to consider that our implementations could lie.

For example, I could accidentally forgot to save the id of the Person object properly in the MongoDB implementation; so, while I intended to follow the “Liskov Substitution Principle”, my execution failed to implement it properly.

Unfortunately, we cannot rely on the compiler to catch these errors.

We need a way to test the runtime behaviors of classes that implement interfaces. This will verify that we at least have some partial correctness to our application.

We call these “contract tests”.

Trust But Verify

Assume that we wanted to place some behavioral restrictions on the interface PersonRepository. We could design a special class with the responsibility of testing those rules.

Consider the following class

Notice how we use the abstract function “getPersonRepository” in each test. We can defer the implementation of our PersonRepository to some subclass of PersonRepositoryContractTest, and execute our tests on the subclass that implements PersonRepositoryContractTest.

For example, we could test the functionality of a MySql implementation using the following code:

and if we wanted to test a MongoDB implementation then we could use the following code:

This shows that we can reuse all the tests we wrote. Now we can easily test an arbitrary number of implementations.


Of course, in practice, there are many different ways of implementing contract tests; so, you may not want to use this particular method. I only want you to take away the fact that not only can you implement contract tests, but that you can do it in a simple and natural way.

How To Measure Intuition in Agile Project Planing and Estimation

Developers will often estimate the time and cost of some work using their intuition. This implies that intuition acts like a measuring device.

Further, the statistical revolution brought us the concept that all measurements and measuring tools have inherent uncertainty and error, and attempts to deal with it.

Students typically learn this concept in physics class by measuring the acceleration of gravity. When the student runs a sequence of experiments to determine the acceleration of gravity they get a collection of different results. From that collection of data, they learn how to estimate the “true” acceleration of gravity and the error bound associated with their estimate.

Using that concept, I would like to tackle the problem of how to measure intuition with respect to agile project management.

A simple thought experiment

Let’s use the following thought experiment to illustrate how to measure “intuition”.

Suppose I predicted the amount of time it takes to finish some collection of user stories. Also suppose that I gave my confidence (measured in percent) in achieving these results within that time.

By comparing my predictions against what actually happens, we could estimate the quality of my intuition.

For example, estimates predicted at a 50% level mean that I expect to make correct predictions 50% of the time and incorrect predictions 50% of the time. Therefore, if I get 100% of my 50% predictions right then I incorrectly assigned my predictions to the 50% confidence level, but if I get 50% of my 50% predictions correct then I accurately assessed them.

You can apply the same reasoning for the 60% level, 70% level, etc …

For the sake of illustration, suppose that I tabulated the results of my predictions for user stories along with the results in the following table.


Predicted Time

Confidence Level

Actual Time



8 hours


8 hours



8 hours


9 hours



8 hours


8 hours



8 hours


8 hours



8 hours


8 hours



8 hours


9 hours



8 hours


8 hours



8 hours


8 hours



8 hours


8 hours



8 hours


9 hours



8 hours


8 hours



8 hours


8 hours



8 hours


8 hours



8 hours


8 hours



8 hours


9 hours



8 hours


8 hours



8 hours


8 hours


From this we can gather the following information

  • I correctly predicted 1/2 (50%) at the 50% confidence level
  • I correctly predicted 3/4 (75%) at the 60% confidence level
  • I correctly predicted 3/4 (75%) at the 70% confidence level
  • I correctly predicted 4/5 (80%) at the 80% confidence level
  • I correctly predicted 1/1 (100%) at the 90% confidence level
  • I correctly predicted 1/1 (100%) at the 100% confidence level

Let’s look at the regression line associated with this data.


In the figure above, the x axis represents “true” accuracy while the y axis represents predicted accuracy. Each point represents intuition at a particular “confidence level”. The dashed line represents “perfect” intuition; so, we ideally want
point as close to the dashed line as possible. The blue line is the regression line associated for all the points and represents a persons overall intuition.

Through this interpretive framework, the data suggest that I generally have under-confident estimates.

Now, suppose the results ended up looking like the following, instead:

  • I correctly predicted 1/2 (50%) at the 50% confidence level
  • I correctly predicted 2/4 (50%) at the 60% confidence level
  • I correctly predicted 2/4 (50%) at the 70% confidence level
  • I correctly predicted 3/5 (60%) at the 80% confidence level
  • I correctly predicted 1/1 (100%) at the 90% confidence level
  • I correctly predicted 1/1 (100%) at the 100% confidence level

The chart would then change to the following:


In this case, the regression line suggests that I have over-confident estimates.

Some Caveats

I used the discussion and examples purely for illustrative purposes. I want to appeal more to your intuition rather than providing something very mathematically rigorous.

Potential Applications

I can imagine many different applications to this framework. A couple applications from the top of my head include:

  • Suppose that we had a poker planning session and we had a difference between how people scored a user story. A project manager could use the quality of someones intuition to make decisions about project planning.
  • Someone could use their the regression line to help calibrate their own intuition (similar to how scientists have to calibrate their instruments). If someone knew that they had a tendency to over-estimate or under-estimate at a certain
    confidence level then they could theoretically use that information to train their intuition.
  • Suppose our team failed to meet our estimates 3 times in a row. However, suppose that we also only had a 50% confidence in those estimates. In this case, we can still consider our estimates as correct because we had a 1/8 (12%) chance of making 3 incorrect estimates in a row.
  • Failure to meet estimates becomes a source of information that helps improve estimates. Since we’ve treated estimates as a random variable, we’ve acknowledged that uncertainty and error exist. However, we also have a way to measure it, and use it to make predictions.


This is all pretty much theoretical, but I think that it might have useful applications. I will spend time thinking about it and I will continue to publish my thoughts and results.