I recently finished reading “Out of the Crisis” by W. Edwards Deming. It blew my mind because so many of his insights into organizational theory and management apply to our modern software industry, even though the book was written in the 1980s.
Deming believed that management had to measure and analyze everything statistically. He spent many pages justifying this belief and provided great illustrations of how to apply it to particular situations.
After reading the book, I realized just how poorly the software industry does project estimation, and that Deming had an approach that was better than what we use today.
Comparing Different Agile Estimation Styles
Let’s first talk about the way we currently do things.
Consider three of the most popular agile estimation methods that I have seen: (a) Fibonacci, (b) T-shirt sizing, and (c) Ad-hoc estimation.
Fibonacci (aka Planning Poker) uses numbers from the Fibonacci sequence to estimate a unit of work. For example, if a project manager asked you to estimate the time it would take you to complete a user story on a kanban board, then he would only allow you to choose from the numbers 1, 2, 3, 5, 8, 13, 21, and so on.
T-shirt sizing uses the concept of shirt sizes to estimate a unit of work. For example, if a project manager asked you to estimate the time it would take you to complete a user story, then he would only allow you to choose from the options XS, S, M, L, XL, and XXL.
Ad-hoc estimation requires that someone make up a “realistic” estimate for a unit of work. For example, if a project manager asked you to estimate the time it would take you to complete a user story, then he could expect you to say 12 hours and hold you to that number.
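To make the mechanics concrete, here is a minimal Python sketch of how each “system” constrains a raw gut-feel estimate. The hour thresholds I chose for the t-shirt buckets are hypothetical values for illustration, not part of any standard:

```python
# Fibonacci and t-shirt sizing force a raw estimate into an allowed bucket;
# ad-hoc estimation just keeps the raw number as-is.
FIBONACCI = [1, 2, 3, 5, 8, 13, 21]
# Hypothetical upper bounds (in hours) for each t-shirt size
TSHIRT = [(2, "XS"), (4, "S"), (8, "M"), (16, "L"), (32, "XL"), (float("inf"), "XXL")]

def fibonacci_estimate(hours):
    # Snap to the closest allowed Fibonacci number
    return min(FIBONACCI, key=lambda f: abs(f - hours))

def tshirt_estimate(hours):
    # First size whose upper bound covers the raw estimate
    return next(size for bound, size in TSHIRT if hours <= bound)

raw = 12  # an ad-hoc estimate in hours
print(fibonacci_estimate(raw), tshirt_estimate(raw))  # → 13 L
```

Notice that each system rounds away information differently, which is part of why comparing them is hard.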
To help illustrate how you would use each “system”, let us suppose we had the following user stories for a sprint:
| Story ID | Story |
| --- | --- |
| 1 | As a coach, I can enter the names and demographic information for all swimmers on my team. |
| 2 | As a coach, I can define practice sessions. |
| 3 | As a swimmer, I can see all of my times for a specific event. |
| 4 | As a swimmer, I can update my demographics information. |
The following table illustrates the way you could estimate the size of each user story using each “system”.
| Story ID | Story | Fibonacci Estimate | T-Shirt Estimate | Ad-Hoc Estimate |
| --- | --- | --- | --- | --- |
| 1 | As a coach, I can enter the names and demographic information for all swimmers on my team. | 3 | Small | 3 |
| 2 | As a coach, I can define practice sessions. | 5 | Medium | 4 |
| 3 | As a swimmer, I can see all of my times for a specific event. | 5 | Medium | 4 |
| 4 | As a swimmer, I can update my demographics information. | 5 | Medium | 4 |
Testing the Goodness of Fit
I can see one major weakness with all of these methods: I do not know how to measure how well each estimate fits reality.
For example, suppose I “correctly” estimated the user stories (a) 75% of the time using Fibonacci, (b) 85% of the time using T-Shirt sizing, and (c) 50% of the time with Ad-Hoc Estimation.
Can I confidently claim that T-shirt sizing outperforms all the other methods?
We do not have a proper apples-to-apples comparison, so I say we cannot make any general claims about effectiveness. And even if we could compare each approach, how can we be sure that one of the other methods won't perform better at some point in the future?
In fact, I can’t think of any satisfactory way of solving this problem, which is exactly why I value Deming’s appeal to statistically measure and analyze everything.
Why We Need To Use Statistics
If you could statistically measure and analyze everything, then you could lean on decades' worth of statistical methods. That would remove so much of the vagueness and ambiguity that I deal with on a regular basis.
Let me give an example to illustrate my point.
Suppose I used a probability distribution function, or a collection of probability distribution functions, to determine the time estimates for the user stories. The following table illustrates what that would look like:
Story Estimation from a Probability Distribution

| Story ID | Story | Expected Value | Standard Deviation | Actual Value |
| --- | --- | --- | --- | --- |
| 1 | As a coach, I can enter the names and demographic information for all swimmers on my team. | 5 | 2.5 | 5.5 |
| 2 | As a coach, I can define practice sessions. | 4 | 2 | 4 |
| 3 | As a swimmer, I can see all of my times for a specific event. | 3 | 1.5 | 3.5 |
| 4 | As a swimmer, I can update my demographics information. | 3 | 1.5 | 3.5 |
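To see what estimating from a distribution could look like in practice, here is a small simulation sketch. I am assuming a normal distribution with the means and standard deviations from the table; the real distribution would have to be found empirically:

```python
import random
import statistics

# Hypothetical model: each story's completion time is drawn from a normal
# distribution with the (expected value, standard deviation) from the table.
stories = {
    1: (5.0, 2.5),
    2: (4.0, 2.0),
    3: (3.0, 1.5),
    4: (3.0, 1.5),
}

random.seed(7)  # reproducible draws
for story_id, (mean, std_dev) in stories.items():
    # Simulate many sprints' worth of outcomes for this story
    draws = [random.gauss(mean, std_dev) for _ in range(10_000)]
    print(f"Story {story_id}: sample mean = {statistics.mean(draws):.2f}, "
          f"sample std dev = {statistics.stdev(draws):.2f}")
```

With a model like this, an estimate is no longer a single number but a distribution we can test against observed outcomes.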
If we had such a situation, then we could compare the observed results to the predicted values using a standard chi-square goodness of fit test.
Let’s run through a chi-square test using this specific example.
Suppose we have a null hypothesis H0 that this distribution function fits the data. We can run the calculations to determine how plausible that hypothesis is.
Chi^2 = (5.5-5)^2/5 + (4-4)^2/4 + (3.5-3)^2/3 + (3.5-3)^2/3 = 0.25/5 + 0 + 0.25/3 + 0.25/3 = 0.05 + 0.083 + 0.083 ≈ 0.22
Using the chi-square table for 3 degrees of freedom, we see that the probability of observing results at least this far from the expected values, assuming H0 is true, is above 95%. So we fail to reject the hypothesis, and we provisionally “accept” the probability distribution function since we have no reason to believe that it isn’t true.
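As a check, the statistic and its p-value can be computed directly. Here is a sketch using the standard formula, the sum of (observed − expected)² / expected, with the p-value for 3 degrees of freedom obtained from the chi-square distribution's closed form:

```python
import math

# Observed (actual) completion times and expected values from the table above
observed = [5.5, 4.0, 3.5, 3.5]
expected = [5.0, 4.0, 3.0, 3.0]

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_sf_df3(x):
    # Survival function (p-value) of the chi-square distribution with
    # 3 degrees of freedom, via its closed form in terms of erf
    return 1.0 - math.erf(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

p_value = chi2_sf_df3(chi_sq)
print(f"chi^2 = {chi_sq:.2f}, p-value = {p_value:.3f}")  # → chi^2 = 0.22, p-value = 0.975
```

A p-value this high means the observed times are entirely consistent with the hypothesized distribution, so we have no grounds to reject it.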
Under this scenario, we would have high confidence in our future estimates, and if our estimates started to drift we would also know the bounds of the error.
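Those error bounds fall straight out of the fitted distribution. Continuing the hypothetical normal model with the values from the table above, roughly 95% of outcomes should land within 1.96 standard deviations of the mean:

```python
# Hypothetical 95% prediction intervals under a normal model, using the
# (expected value, standard deviation) pairs from the table above
stories = {1: (5.0, 2.5), 2: (4.0, 2.0), 3: (3.0, 1.5), 4: (3.0, 1.5)}

for story_id, (mean, std_dev) in stories.items():
    low, high = mean - 1.96 * std_dev, mean + 1.96 * std_dev
    print(f"Story {story_id}: ~95% of completion times in [{low:.1f}, {high:.1f}]")
```

An estimate with explicit bounds tells a project manager not just what to expect, but how surprised to be when reality misses.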
In my opinion, this is a much better position to be in than the one that we are in right now, which is pretty much the equivalent of educated guessing.
Notice, however, that I never actually specified what our “magic” probability distribution function was. Our ability to make statistical estimates depends on our ability to find it.
In my next post, I intend to detail some methods that MIGHT help us find good probability distributions.