I recently finished reading “Out of the Crisis” by W. Edwards Deming. It blew my mind because some of his insights into organizational theory and management apply so well to our modern software industry (the book was written in the 80s).

Deming believed that management had to measure and analyze everything statistically. He spent many pages justifying this belief and provided great illustrations on how to apply it to particular situations.

After reading the book, I realized just how badly the software industry handles project estimation, and that Deming had an approach that was better than what we use today.

## Comparing Different Agile Estimation Styles

Let’s first talk about the way we currently do things.

Consider 3 of the most popular agile estimation methods that I have seen: (a) Fibonacci, (b) T-shirt sizing, and (c) Ad-hoc estimation.

Fibonacci (aka Planning Poker) uses numbers from the Fibonacci sequence to estimate a unit of work. For example, if a project manager asked you to estimate the time it would take you to complete a user story on a kanban board, then he would only allow you to choose from the numbers 1, 2, 3, 5, 8, 13, 21, etc.

T-Shirt sizing uses the concept of shirt sizes to estimate a unit of work. For example, if a project manager asked you to estimate the time it would take you to complete a user story then he could only allow you to choose from the options of XS, S, M, L, XL, XXL.

Ad-hoc estimation requires that someone make up a “realistic” estimate for a unit of work. For example, if a project manager asked you to estimate the time it would take you to complete a user story, then he could expect you to say 12 hours and hold you to that number.

To help illustrate how you would use each “system”, let us suppose we had the following user stories for a sprint:

Story ID | Story |
---|---|
1 | As a coach, I can enter the names and demographic information for all swimmers on my team. |
2 | As a coach, I can define practice sessions. |
3 | As a swimmer, I can see all of my times for a specific event. |
4 | As a swimmer, I can update my demographics information. |

The following table illustrates the way you could estimate the size of each user story using each “system”.

Story ID | Story | Fibonacci Estimate | T-Shirt Estimate | Ad-Hoc Estimate |
---|---|---|---|---|
1 | As a coach, I can enter the names and demographic information for all swimmers on my team. | 3 | Small | 3 |
2 | As a coach, I can define practice sessions. | 5 | Medium | 4 |
3 | As a swimmer, I can see all of my times for a specific event. | 5 | Medium | 4 |
4 | As a swimmer, I can update my demographics information. | 5 | Medium | 4 |

## Testing the Goodness of Fit

I can see one major weakness with all of these methods: I do not know how to measure how well each estimate fits reality.

For example, suppose I “correctly” estimated the user stories (a) 75% of the time using Fibonacci, (b) 85% of the time using T-Shirt sizing, and (c) 50% of the time with Ad-Hoc Estimation.

Can I confidently claim that T-shirt sizing outperforms all the other methods?

We do not have a proper apples-to-apples comparison, so I say we cannot make any general claims about effectiveness. Also, even if we could compare the approaches, how can we be sure that one of the other methods won't perform better at some point in the future?

In fact, I can’t think of any satisfactory way of solving this problem, which is exactly why I value Deming’s appeal to statistically measure and analyze everything.

## Why We Need To Use Statistics

If you could statistically measure and analyze everything, then you could lean on decades' worth of statistical methods. That would remove so much of the vagueness and ambiguity that I deal with on a regular basis.

Let me give an example to illustrate my point.

Suppose I used a probability distribution function or a collection of probability distribution functions to determine the time estimates for the user stories. The following table illustrates what that would look like:

Story Estimation from a Probability Distribution

Story ID | Story | Expected Value | Standard Deviation | Actual Value |
---|---|---|---|---|
1 | As a coach, I can enter the names and demographic information for all swimmers on my team. | 5 | 2.5 | 5.5 |
2 | As a coach, I can define practice sessions. | 4 | 2 | 4 |
3 | As a swimmer, I can see all of my times for a specific event. | 3 | 1.5 | 3.5 |
4 | As a swimmer, I can update my demographics information. | 3 | 1.5 | 3.5 |

If we had such a situation, then we could compare the observed results to the predicted values using a standard chi-square goodness-of-fit test.

Let’s run through a chi square test using this specific example.

Suppose we have a hypothesis H0 that this distribution function fits the data. We can run the calculations to determine the likelihood of that hypothesis.

Chi^2 = sum of (actual - expected)^2 / expected = (5.5-5)^2/5 + (4-4)^2/4 + (3.5-3)^2/3 + (3.5-3)^2/3 = 0.25/5 + 0 + 0.25/3 + 0.25/3 ≈ .050 + 0 + .083 + .083 ≈ .22

Using the chi-square table for 3 degrees of freedom, we see that the p-value (the probability of seeing a discrepancy at least this large if H0 were true) is above 95%. So we fail to reject the hypothesis and provisionally “accept” the probability distribution function, since we have no reason to believe that it isn’t true.
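For anyone who wants to reproduce the test, here is a minimal Python sketch. It uses the conventional Pearson form of the statistic, which divides each squared difference by the expected value, and it follows the post in assuming 3 degrees of freedom. The function names are my own, and the tail-probability formula is a closed form that only holds for 3 degrees of freedom; a statistics library would be the better choice for anything more general.

```python
import math

# Expected and actual values copied from the table above.
expected = [5.0, 4.0, 3.0, 3.0]
actual = [5.5, 4.0, 3.5, 3.5]


def pearson_chi_square(observed, expected_values):
    """Sum of (observed - expected)^2 / expected over all stories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected_values))


def chi_square_sf_3df(x):
    """Upper-tail probability P(X >= x) for a chi-square variable.

    Assumption: this closed form is valid for exactly 3 degrees of freedom.
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


statistic = pearson_chi_square(actual, expected)
p_value = chi_square_sf_3df(statistic)
print(f"chi-square = {statistic:.3f}, p-value = {p_value:.3f}")
```

The statistic comes out at roughly 0.22 and the p-value at roughly 0.97, which is consistent with the "above 95%" reading from the table.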

Under this scenario, we would have high confidence in our estimates in the future, and if our estimates start to get bad we would also know the bounds of the error.
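One way to make "the bounds of the error" concrete is to express each miss in units of the story's standard deviation. The sketch below does that for the table above. The two-standard-deviation threshold is my own assumption for illustration, not something Deming prescribes, and the function names are hypothetical.

```python
# (story id, expected value, standard deviation, actual value) from the table above.
stories = [
    (1, 5.0, 2.5, 5.5),
    (2, 4.0, 2.0, 4.0),
    (3, 3.0, 1.5, 3.5),
    (4, 3.0, 1.5, 3.5),
]


def z_score(expected, std_dev, actual):
    """How many standard deviations the actual value lies from the estimate."""
    return (actual - expected) / std_dev


def within_bounds(expected, std_dev, actual, k=2.0):
    """True if the actual value is within k standard deviations (k is an assumed threshold)."""
    return abs(z_score(expected, std_dev, actual)) <= k


z_scores = [z_score(e, s, a) for _, e, s, a in stories]
all_within = all(within_bounds(e, s, a) for _, e, s, a in stories)

for (story_id, _, _, _), z in zip(stories, z_scores):
    print(f"story {story_id}: z = {z:.2f}")
print("all stories within 2 standard deviations:", all_within)
```

Every story in the example lands well inside one standard deviation of its estimate, so none of them would be flagged.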

In my opinion, this is a much better position to be in than the one that we are in right now, which is pretty much the equivalent of educated guessing.

Notice, however, that I never specified what our “magic” probability distribution function actually was. Our ability to make statistical estimates depends on our ability to find it.

In my next post, I intend to detail some methods that MIGHT help us find good probability distributions.

Stay tuned.

I really like Deming. Unfortunately, traditional top-down management companies continue to resist his ideas at all levels, for a simple reason: accepting Deming’s proposal that the customer is the best judge of quality, and that all efforts and structures inside a company should be viewed with the customer in mind, destroys the classic org chart that almost all companies in America are built around. It’s like asking someone to give up a career’s worth of effort for the sake of the customer’s benefit.

Mostly, it’s all about me, how can I get ahead, how can I climb higher on the management ladder as set forth to me by the company’s own org chart.

It’s a meaningless pursuit, but I’ve barely made a dent in convincing anyone of this fact. If you like Deming, you might also like Peter Drucker. Drucker was very radical in framing business activity around the customer’s perspective. It sounds so simple, but given the way companies in the U.S. are organized, you would think that the CEO and/or the major company shareholders were the final customer, as most U.S. companies are structured to pander to ever higher levels of management.

I totally agree.

However, I don’t necessarily blame managers for acting that way. They are simply responding to incentives that are inherent to the system itself. At the end of the day, people will care more about themselves than the company. If you set up a political environment that promotes a careerist mentality, then you’ll get more of it.

I’ve come to value the inverted pyramid model of management for this reason. Inverting the responsibilities also inverts the incentives, and I believe that you’ll necessarily get less dysfunction.

The software development industry would probably be much more successful if more statistics and data were collected and used wisely. But the problem with these kinds of detailed estimation techniques is that they are too complicated to be employed continuously by most developers or project managers, imo. And besides, trying very hard to provide accurate estimates is not really worth the effort in agile development and planning. It’s more important to plan little by little, and frequently adjust your plans as you get further along in development.

With systems like poker planning the point is not to provide accurate estimates per se, but mainly to discuss the user stories or features in order to further hash out any questions or uncertainties, priorities and so on.

When it comes to being able to make plans and forecasts that span more than a single sprint or iteration (like 2-4 weeks), data is certainly important, but we can use simple analysis techniques such as burndown or burnup charts with the “yesterday’s weather” method. No need for statistical wizardry, just rigorous follow-up of a few core data metrics.
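As a rough sketch of the “yesterday’s weather” idea in Python: forecast the next sprint from the average of the most recent sprints, then divide the remaining backlog by that number. The velocities, the backlog size, and the three-sprint window below are all hypothetical values chosen for illustration.

```python
import math


def yesterdays_weather(velocities, window=3):
    """Forecast the next sprint's velocity as the mean of the last `window` sprints."""
    recent = velocities[-window:]
    return sum(recent) / len(recent)


def sprints_remaining(backlog_points, velocity):
    """Whole sprints needed to burn down the backlog at the given velocity."""
    return math.ceil(backlog_points / velocity)


past_velocities = [21, 18, 24, 20]  # hypothetical story points completed per sprint
backlog_points = 120                # hypothetical remaining backlog

forecast = yesterdays_weather(past_velocities)
remaining = sprints_remaining(backlog_points, forecast)
print(f"forecast velocity: {forecast:.1f} points per sprint")
print(f"sprints remaining: {remaining}")
```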

I get where you are coming from. I would have made a similar argument before reading Deming’s book.

I changed my opinion due to a change in values. Deming (and I) believed that management must commit to change and that they have the responsibility to promote and support new ways of attacking problems. He emphasized things like continuous improvement, cooperation throughout an organization, and concentration on “quality” rather than “numbers”.

In order to facilitate those values you need a means of measuring how well you meet those values.

For example, Deming believed that 95% of all variability in manufacturing was actually due to the workplace environment, and he promoted the idea that managers should treat areas of the business with low variability as stable, and the areas of business with high variability as unstable. From these premises, he showed that a business can get huge ROI on focusing on process improvements in the unstable areas.

That is why Deming demands managers statistically measure and analyze everything. It is their means of determining where they need to focus on process improvement.

My use of null hypothesis testing as an illustration of Deming’s value system is a bit ironic, actually. Deming didn’t like null hypothesis testing because he felt it was useless. Simply rejecting a null hypothesis didn’t give anyone enough information to use for process improvement. For him, managers should focus on statistical methods that can measure how well an intervention worked, or how stable a particular area of the business actually is.

I just used a null hypothesis test as an illustration because it’s easy.

Couldn’t agree more. And I think a lot of his values are getting a renaissance now (or at least wider recognition) with all the buzz around Lean and so on.

If I am reading this correctly, you are saying we should collect actual data per user story to compare against the estimates so we know how accurate our estimates were, just so that we know how well we estimate?

I have tried to use statistical methods in the past and spent a lot of time on it only to find the results inconclusive, i.e. a waste of time.

Estimates are just that – estimates. Software development is not an exact science and trying to get estimates more precise is another waste of time.

Story points are relative not absolute estimates, meaning there IS no estimated v actual; it’s just a points value. You could look back and say something took us 5 days not 4, so it should have been 3 points not 2 but by then it’s too late anyway. You are ignoring the use of velocity as a measure of progress and predictor of the future. It’s simple and it works. Read Mike Cohn’s Agile Estimating and Planning.

You had an inaccurate takeaway. I might not have explained my intentions very well in the introduction. Deming’s 14 points of management impressed me, and I wanted to see if you could apply his principles to software. However, that necessarily implies that I apply statistics to estimates.

I don’t know exactly how to go about doing that, but I have a few ideas about it. It’s pretty much just a research thing for me. For example, I could measure the accuracy of the estimates themselves, or I could measure the accuracy of the human making them. Measuring the former would require applying a completely different set of statistical principles than measuring the latter.

Now, is the approach practical? Maybe, maybe not. That’s what I’m trying to figure out.

Also, I have read “Agile Estimating and Planning”. The user stories I used as an illustration came from his book.