How To Measure Intuition in Agile Project Planning and Estimation

Developers will often estimate the time and cost of some work using their intuition. This implies that intuition acts like a measuring device.

Further, the statistical revolution taught us that all measurements and measuring tools carry inherent uncertainty and error, and it gave us ways to deal with that.

Students typically learn this concept in physics class by measuring the acceleration of gravity. When the student runs a sequence of experiments to determine the acceleration of gravity they get a collection of different results. From that collection of data, they learn how to estimate the “true” acceleration of gravity and the error bound associated with their estimate.

Using that concept, I would like to tackle the problem of how to measure intuition with respect to agile project management.

A simple thought experiment

Let’s use the following thought experiment to illustrate how to measure “intuition”.

Suppose I predicted the amount of time it takes to finish some collection of user stories. Also suppose that I gave my confidence (measured in percent) in achieving these results within that time.

By comparing my predictions against what actually happens, we could estimate the quality of my intuition.

For example, predictions made at the 50% confidence level mean that I expect to be right 50% of the time and wrong 50% of the time. Therefore, if I get 100% of my 50% predictions right, then I assigned those predictions too low a confidence level; but if I get about 50% of them right, then I assessed them accurately.

You can apply the same reasoning for the 60% level, 70% level, etc …

For the sake of illustration, suppose that I tabulated my predictions for a set of user stories, along with the actual outcomes, in the following table.

Story   Predicted Time   Confidence Level   Actual Time   Result
1       8 hours          50%                8 hours       Success
2       8 hours          50%                9 hours       Fail
3       8 hours          60%                8 hours       Success
4       8 hours          60%                8 hours       Success
5       8 hours          60%                8 hours       Success
6       8 hours          60%                9 hours       Fail
7       8 hours          70%                8 hours       Success
8       8 hours          70%                8 hours       Success
9       8 hours          70%                8 hours       Success
10      8 hours          70%                9 hours       Fail
11      8 hours          80%                8 hours       Success
12      8 hours          80%                8 hours       Success
13      8 hours          80%                8 hours       Success
14      8 hours          80%                8 hours       Success
15      8 hours          80%                9 hours       Fail
16      8 hours          90%                8 hours       Success
17      8 hours          100%               8 hours       Success

From this table we can gather the following information:

  • I correctly predicted 1/2 (50%) at the 50% confidence level
  • I correctly predicted 3/4 (75%) at the 60% confidence level
  • I correctly predicted 3/4 (75%) at the 70% confidence level
  • I correctly predicted 4/5 (80%) at the 80% confidence level
  • I correctly predicted 1/1 (100%) at the 90% confidence level
  • I correctly predicted 1/1 (100%) at the 100% confidence level
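These tallies can be computed mechanically. Here is a minimal Python sketch (the records below simply re-encode the table above; the function name is my own):

```python
from collections import defaultdict

# (stated confidence, prediction succeeded?) pairs, re-encoded from the table above
predictions = [
    (0.5, True), (0.5, False),
    (0.6, True), (0.6, True), (0.6, True), (0.6, False),
    (0.7, True), (0.7, True), (0.7, True), (0.7, False),
    (0.8, True), (0.8, True), (0.8, True), (0.8, True), (0.8, False),
    (0.9, True),
    (1.0, True),
]

def accuracy_by_confidence(records):
    """Group predictions by stated confidence level and compute the hit rate."""
    buckets = defaultdict(list)
    for confidence, success in records:
        buckets[confidence].append(success)
    return {c: sum(hits) / len(hits) for c, hits in sorted(buckets.items())}

print(accuracy_by_confidence(predictions))
# e.g. the 60% bucket yields 3/4 = 0.75
```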

Let’s look at the regression line associated with this data.

figure_1

In the figure above, the x axis represents “true” accuracy while the y axis represents predicted accuracy. Each point represents intuition at a particular “confidence level”. The dashed line represents “perfect” intuition; so, we ideally want every point as close to the dashed line as possible. The blue line is the regression line fitted to all the points and represents a person’s overall intuition.

Through this interpretive framework, the data suggest that my estimates are generally under-confident.
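The regression line itself is just an ordinary least-squares fit through the points. A sketch in plain Python, using the tallied results above (“true” accuracy on x, stated confidence on y):

```python
def least_squares(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# "true" accuracy (x) and stated confidence (y) from the first results list
true_accuracy = [0.5, 0.75, 0.75, 0.8, 1.0, 1.0]
stated_confidence = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

slope, intercept = least_squares(true_accuracy, stated_confidence)
print(slope, intercept)  # the fitted line sits mostly below y = x: under-confidence
```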

Now, suppose the results ended up looking like the following, instead:

  • I correctly predicted 1/2 (50%) at the 50% confidence level
  • I correctly predicted 2/4 (50%) at the 60% confidence level
  • I correctly predicted 2/4 (50%) at the 70% confidence level
  • I correctly predicted 3/5 (60%) at the 80% confidence level
  • I correctly predicted 1/1 (100%) at the 90% confidence level
  • I correctly predicted 1/1 (100%) at the 100% confidence level

The chart would then change to the following:

figure_2

In this case, the regression line suggests that I have over-confident estimates.

Some Caveats

The discussion and examples above are purely illustrative. I want to appeal to your intuition rather than provide something mathematically rigorous.

Potential Applications

I can imagine many different applications of this framework. A few off the top of my head include:

  • Suppose we had a poker planning session and people disagreed on how to score a user story. A project manager could use the measured quality of someone’s intuition to make decisions about project planning.
  • Someone could use their regression line to help calibrate their own intuition (similar to how scientists calibrate their instruments). If someone knew that they had a tendency to over-estimate or under-estimate at a certain confidence level, then they could theoretically use that information to train their intuition.
  • Suppose our team failed to meet our estimates 3 times in a row. However, suppose that we also only had 50% confidence in those estimates. In this case, the misses are still consistent with our stated confidence, because there was a (1/2)^3 = 1/8 (12.5%) chance of making 3 incorrect estimates in a row.
  • Failure to meet estimates becomes a source of information that helps improve estimates. Since we’ve treated estimates as random variables, we’ve acknowledged that uncertainty and error exist. However, we also have a way to measure them and use them to make predictions.
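The streak argument in the third bullet is a simple binomial calculation. A sketch, assuming each estimate independently succeeds with its stated confidence:

```python
def streak_miss_probability(confidence, misses):
    """Probability of missing `misses` estimates in a row when each
    estimate independently succeeds with probability `confidence`."""
    return (1 - confidence) ** misses

print(streak_miss_probability(0.5, 3))  # (1/2)^3 = 0.125, i.e. 12.5%
```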

Conclusion

This is all pretty much theoretical, but I think that it might have useful applications. I will spend time thinking about it and I will continue to publish my thoughts and results.

Use Statistics for Agile Project Estimation

I recently finished reading “Out of the Crisis” by W. Edwards Deming. It blew my mind, because some of his insights into organizational theory and management apply so well to our modern software industry (the book was written in the 80s).

Deming believed that management had to measure and analyze everything statistically. He spent many pages justifying this belief and provided great illustrations on how to apply it to particular situations.

After reading the book, I realized just how poorly the software industry does project estimation, and that Deming had a better approach than what we use today.

Comparing Different Agile Estimation Styles

Let’s first talk about the way we currently do things.

Consider 3 of the most popular agile estimation methods that I have seen: (a) Fibonacci, (b) T-shirt sizing, and (c) Ad-hoc estimation.

Fibonacci (aka Poker Planning) uses numbers from the Fibonacci sequence to estimate a unit of work. For example, if a project manager asked you to estimate the time it would take you to complete a user story on a kanban board, then he would only allow you to choose from the numbers 1, 2, 3, 5, 8, 13, 21, etc …

T-Shirt sizing uses the concept of shirt sizes to estimate a unit of work. For example, if a project manager asked you to estimate the time it would take you to complete a user story, then he would only allow you to choose from the options XS, S, M, L, XL, XXL.

Ad-hoc estimation requires that someone make up a “realistic” estimate for a unit of work. For example, if a project manager asked you to estimate the time it would take you to complete a user story, then he could expect you to say 12 hours and hold you to that number.

To help illustrate how you would use each “system”, let us suppose we had the following user stories for a sprint:

Story ID   Story
1          As a coach, I can enter the names and demographic information for all swimmers on my team.
2          As a coach, I can define practice sessions.
3          As a swimmer, I can see all of my times for a specific event.
4          As a swimmer, I can update my demographics information.

The following table illustrates the way you could estimate the size of each user story using each “system”.

Story ID   Story                                                                                        Fibonacci Estimate   T-Shirt Estimate   Ad-Hoc Estimate
1          As a coach, I can enter the names and demographic information for all swimmers on my team.   3                    Small              3
2          As a coach, I can define practice sessions.                                                  5                    Medium             4
3          As a swimmer, I can see all of my times for a specific event.                                5                    Medium             4
4          As a swimmer, I can update my demographics information.                                      5                    Medium             4

Testing the Goodness of Fit

I can see one major weakness with all of these methods: I do not know how to measure how well each estimate fits reality.

For example, suppose I “correctly” estimated the user stories (a) 75% of the time using Fibonacci, (b) 85% of the time using T-Shirt sizing, and (c) 50% of the time with Ad-Hoc Estimation.

Can I confidently claim that T-shirt sizing outperforms all the other methods?

We do not have a proper apples-to-apples comparison; so, I say we cannot make any general claims about effectiveness. Also, even if we could compare each approach, how can we be sure that one of the other methods won’t perform better at some point in the future?

In fact, I can’t think of any satisfactory way of solving this problem, which is exactly why I value Deming’s appeal to statistically measure and analyze everything.

Why We Need To Use Statistics

If you could statistically measure and analyze everything, then you could lean on decades’ worth of statistical methods. That would remove so much of the vagueness and ambiguity that I deal with on a regular basis.

Let me give an example to illustrate my point.

Suppose I used a probability distribution function or a collection of probability distribution functions to determine the time estimates for the user stories. The following table illustrates what that would look like

Story Estimation from a Probability Distribution

Story ID   Story                                                                                        Expected Value   Standard Deviation   Actual Value
1          As a coach, I can enter the names and demographic information for all swimmers on my team.   5                2.5                  5.5
2          As a coach, I can define practice sessions.                                                  4                2                    4
3          As a swimmer, I can see all of my times for a specific event.                                3                1.5                  3.5
4          As a swimmer, I can update my demographics information.                                      3                1.5                  3.5
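Before running a formal test, a quick sanity check is whether each actual time fell within one standard deviation of its expected value. A sketch using the (hypothetical) numbers from the table:

```python
# (expected value, standard deviation, actual value) per story, from the table above
stories = {
    1: (5, 2.5, 5.5),
    2: (4, 2.0, 4.0),
    3: (3, 1.5, 3.5),
    4: (3, 1.5, 3.5),
}

def within_one_sigma(expected, sd, actual):
    """True when the actual time lies inside the one-standard-deviation band."""
    return abs(actual - expected) <= sd

for story_id, (mu, sd, actual) in stories.items():
    print(story_id, within_one_sigma(mu, sd, actual))  # every story prints True
```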

If we had such a situation, then we could compare the observed results to the predicted values using a standard chi-square goodness-of-fit test.

Let’s run through a chi square test using this specific example.

Suppose we have a hypothesis H0 that this distribution function fits the data. We can run the calculations to determine the likelihood of that hypothesis.

Chi^2 = (5.5-5)^2/5 + (4-4)^2/4 + (3.5-3)^2/3 + (3.5-3)^2/3 = 0.25/5 + 0 + 0.25/3 + 0.25/3 = 0.05 + 0.083 + 0.083 = 0.22

(Note that each squared deviation gets divided by the expected value, not the observed one.)

Using the chi-square table for 3 degrees of freedom, we see that the probability of observing a statistic at least this large under H0 exceeds 95%. So we fail to reject the hypothesis and provisionally “accept” the probability distribution function, since we have no reason to believe that it isn’t true.

Under this scenario, we would have high confidence in our estimates in the future, and if our estimates start to get bad we would also know the bounds of the error.
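The chi-square calculation is easy to script. A sketch in plain Python; the 0.352 critical value (lower tail, 3 degrees of freedom, 5% level) is assumed from a standard chi-square table rather than computed here:

```python
def chi_square_statistic(expected, observed):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for e, o in zip(expected, observed))

expected = [5, 4, 3, 3]        # expected times from the distribution
observed = [5.5, 4, 3.5, 3.5]  # actual times

stat = chi_square_statistic(expected, observed)

# Lower-tail 5% critical value for 3 degrees of freedom, from a standard table.
CRITICAL_5_PCT = 0.352

print(stat, stat < CRITICAL_5_PCT)  # a small statistic means we fail to reject H0
```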

In my opinion, this is a much better position to be in than the one that we are in right now, which is pretty much the equivalent of educated guessing.

Notice, however, that I never actually specified what our “magic” probability distribution function actually was. Our ability to make statistical estimates depends on our ability to find it.

In my next post, I intend to detail some methods that MIGHT help us find good probability distributions.

Stay tuned.

Is your software project failing? Blame management

I just read the Chaos Manifesto 2013: Think Big, Act Small paper from the Standish Group. It blew my mind.

Let me give a brief introduction to the Standish Group and the Chaos Manifesto before I elaborate on my mind-blowing revelation.

Who’s the Standish Group and Why Should You Care?

The Standish Group has been collecting information on IT projects since 1985. That history allows them to give unique and informed commentary on what does and does not work when building software.

The Chaos Manifesto provides their insight and perspective into software projects based on the data they have. The Chaos Manifesto 2013: Think Big, Act Small paper provides that for “small projects” based on 50,000 projects since 2002.

I would recommend that everyone involved in building software read it (particularly if you have a management role).

The paper claims that there are 10 “Factors of Success” for a project, and that some factors matter more than others. Each of the factors has a very technical meaning that I will not cover in this post. I challenge you to download the PDF and read it for yourself.

I have replicated a table that summarizes their findings below.

Factors of Success             Points
Executive management support   20
User involvement               15
Optimization                   15
Skilled resources              13
Project management expertise   12
Agile process                  10
Clear business objectives      6
Emotional maturity             5
Execution                      3
Tools and infrastructure       1

The Lesson To Be Learned

I’d like you to notice just how little the individual developer actually matters according to the Standish Group.

Within this paradigm, the individual developer only has control of their own knowledge, skill, and emotional maturity, and those fall under the “skilled resources” and “emotional maturity” categories.

That means that developers can really only contribute at most 18% to the success of a project. The other 82% belongs to management.

That blew my mind.

However, once I really thought about it, I realized just how consistent it was with all of my experiences.

All of my best (and successful) projects had good management, and most of my worst (and failing) projects had horrible management.

For example, I have been in many situations where very important aspects of my job were completely out of my control, or I had to work in an environment or with people and tools that made me very unproductive.

This revelation both humbles and horrifies me.

It humbles me because I understand just how much those above me contribute to my success.

It horrifies me because I understand just how much those above me contribute to my success.

I wish that management really understood just how much their actions (or lack thereof) affect their people.