So far in this series, we’ve been building a Plan bottom-up from its small components, the Tasks. We’ve explored how to minimize the unpredictable, making sure we know what will hit us in the future so that fewer interruptions and surprises occur during the execution of the Plan. We’ve seen how each Task is born into uncertainty. Minimizing uncertainty means making clear what we do, when we do it, how we do it and what is required to do it. Various kinds of effort are needed to make a Task certain enough (95% certain).
A Plan, being a collection of Tasks, also has uncertainties to remove and unpredictables to handle. In this chapter, we’ll explore the two combined.
Things that we Know
We’re more than halfway through the modeling process of Planning. Let’s take another step forward and model the relationship between Tasks and the Plan via Predictions & Uncertainties: a model that helps us classify Tasks according to how predictable and how certain they are.
Each quadrant is actionable:
Predictable & Certain Enough Tasks – We are as sure as we can be about these (who/what/how/when), so there is nothing left but to get them done. Commit these Tasks to the Plan.
Highly Uncertain & Highly Unpredictable Tasks – We cannot commit to Tasks that cannot be known in advance. We should leave a High Variance Buffer outside the Plan for these kinds of Tasks.
Certain Enough but Highly Unpredictable Tasks – These are the real-time incidents, the downtime. Again, we cannot commit to what we do not know. But even without knowing, we can guarantee in advance that less effort will be required. Monitoring and tracing allow us to catch and handle real-time issues faster. So does Continuous Deployment. Both result in a faster recovery time, that is, a smaller mean and lower variance of the time to recover (MTTR), when facing the unpredictable. Leave a Low Variance Buffer outside the Plan for these kinds of Tasks.
Predictable but Highly Uncertain Tasks – Some Tasks are self-contradictory: reducing the uncertainty is the majority of the Task itself. Most of the effort on a Bug, for example, goes into tracing it, understanding it and reproducing it. The fix itself is mostly marginal compared to this. In these cases, it is better not to reduce the uncertainty and to Fence the Effort instead: “In the next Sprint we’re going to invest up to X days into fixing our top priority bugs”. Yes, I’m honestly saying do not estimate bugs in advance. It’s non-beneficial.
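The four quadrants above can be read as a small decision table. The following is a minimal sketch of that reading, not a prescribed tool: the names `Action` and `classify` are hypothetical, and it assumes each Task has already been judged as predictable or not, and certain enough (95%) or not.

```python
from enum import Enum


class Action(Enum):
    """What to do with a Task, one value per quadrant (names are illustrative)."""
    COMMIT = "commit to the Plan"
    FENCE_EFFORT = "fence the effort"
    LOW_VARIANCE_BUFFER = "cover with the Low Variance Buffer"
    HIGH_VARIANCE_BUFFER = "cover with the High Variance Buffer"


def classify(predictable: bool, certain_enough: bool) -> Action:
    """Map a Task's two properties to the action of its quadrant."""
    if predictable and certain_enough:
        return Action.COMMIT                # known in advance and well understood
    if predictable:
        return Action.FENCE_EFFORT          # e.g. bugs: known to come, costly to estimate
    if certain_enough:
        return Action.LOW_VARIANCE_BUFFER   # real-time incidents with fast recovery
    return Action.HIGH_VARIANCE_BUFFER      # neither known in advance nor understood


# A bug is predictable but highly uncertain, so its effort is fenced.
bug_action = classify(predictable=True, certain_enough=False)
```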
But that’s not the end of it at all. The process of predicting and uncovering the uncertainties of each Task independently has an effect on the Plan in its entirety: it optimizes the sizes of the predictable/certain quadrants.
The outcome is a more predictable and more certain Plan.
The Committed Plan is highly certain, and so is the effort required for it. The Estimated Tasks have gone through the process of uncovering uncertainties, and are certain enough (95%). Tasks whose uncertainties should not be reduced (bugs) are all fenced by a constant number, which is 100% certain.
A Distributed Buffer
Actually, the Low Variance Buffer should not exist. Yes, we’ve made it all the way here just to say that there is a buffer we should not take. But we have to make sure we meet the conditions to do that.
Truly unpredictable incidents do occur, but the act of predicting also minimizes their frequency. And we’ve seen that Continuous Deployment, Monitoring and Tracing have the potential to reduce the effort of reacting to a matter of mere hours. Even if we have 5 such incidents a month (and that’s a lot!), they would add up to 1-2 days. A team of 5 has 100 days of effort a month (20 per person), so these incidents add up to a buffer of 1%-2%. That is a negligible buffer to take.
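The arithmetic behind that 1%-2% can be sketched in a few lines. The function name `buffer_share` and its parameters are hypothetical; the numbers are the ones used in the paragraph above.

```python
def buffer_share(incident_days: float, team_size: int,
                 work_days_per_person: int = 20) -> float:
    """Fraction of a team's monthly effort that is spent on incidents."""
    total_effort_days = team_size * work_days_per_person  # 5 * 20 = 100 days
    return incident_days / total_effort_days


# 1-2 days of incidents a month for a team of 5:
low = buffer_share(incident_days=1, team_size=5)   # 1 / 100
high = buffer_share(incident_days=2, team_size=5)  # 2 / 100
print(f"{low:.0%} to {high:.0%}")  # prints "1% to 2%"
```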
Lastly, this Low Variance Buffer does exist, but it is spread across the Estimated Tasks. Remember that each Task is a Change. Change entails Feedback, which entails Change (The Wheel of Change). We can add sub-tasks to the Task to accommodate that by Planning for Change. Remember also that each Change has the potential to cause Instability. That is predictable, and each Estimated Task already includes a variance/buffer just for that, added when its uncertainty was minimized.
As such, we can entirely remove the Low Variance Buffer and commit more Tasks to the Plan. That is a Plan that gets more done, with the right expectations.
We’re left with the High Variance Buffer. Maybe this, finally, is the one we call the maintenance buffer? It isn’t.
There is no Maintenance
In order to understand what the High Variance Buffer is, we first need to understand what it isn’t. Let’s classify our Tasks at a very high level according to their expected outcome for the business:
- Move the business forward – product-driven/customer facing Tasks, such as Features, User Stories, Products. (Bugs that directly affect customers belong here).
- Loss prevention – Tasks that we know will cause a loss if not eventually done at the right time. (Bugs that indirectly affect customers or affect the team belong here).
- Inefficiency removal – Tasks that increase our Velocity and give us more capacity to do more Tasks.
I suspect the last two are what we commonly refer to as maintenance, but all three are predictable Tasks that affect the business. All belong to the Committed Plan and are not a part of the High Variance Buffer. It is not work vs. maintenance. Maintenance is part of the work itself.
That answers the question we started this series with: “how much of a buffer to leave for maintenance?”. The answer is none, because there is no maintenance. The only thing we do leave a buffer for is the truly unpredictable incidents.
Another truly unpredictable event, though not an incident, is a sudden change in priorities during execution. It could be “this loss prevention Task must be done immediately”, or “this feature must be done now or the customer will bail”. Priorities change, that’s fine. But that’s not what we leave a buffer for, because these are predictable Tasks as well. As they also affect the business, they should change the Committed Plan. The buffer is left for incidents alone until nearly the end of the Sprint.
Now that we know what we should leave a buffer for, we still need to know how big a buffer it should be. If we want smaller buffers, we know what we have to do. The better our monitoring is, the lower the frequency of incidents and the smaller the buffer needed. Monitoring is a predictable Task with the outcome of loss prevention.
It’s highly uncertain what the next Sprint’s buffer should be. It’s 100% certain what the previous buffer was. A good rule of thumb is to take a buffer equal to the time spent on incidents in the last Sprint, or a moving average of the last few Sprints. Meaning it should be measured. Being 1 day over or under the expected buffer is just 1% out of 100 days of effort a month, and that is negligible. If the worst happens, and a week of the entire team is spent on a disastrous incident – it is what it is. We’d need to let the Committed Plan go. There is no need to prepare for it; it is better to replan when it happens.
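The rule of thumb above translates into a short calculation. This is only a sketch under assumptions: the function name `next_buffer`, the default window of 3 Sprints and the sample history are all made up for illustration, and the incident days per Sprint are assumed to have been measured.

```python
from typing import Sequence


def next_buffer(incident_days_per_sprint: Sequence[float], window: int = 3) -> float:
    """Buffer for the next Sprint: moving average of incident days over the last `window` Sprints.

    With window=1 this is simply the time spent on incidents in the last Sprint.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if not incident_days_per_sprint:
        return 0.0  # nothing measured yet, so no evidence for a buffer
    recent = incident_days_per_sprint[-window:]
    return sum(recent) / len(recent)


# Hypothetical measurements, oldest Sprint first (days spent on incidents).
history = [2.0, 1.0, 3.0, 1.5]
last_sprint_only = next_buffer(history, window=1)  # 1.5
moving_average = next_buffer(history, window=3)    # (1.0 + 3.0 + 1.5) / 3
```

A longer window smooths out a single bad Sprint; a window of 1 reacts immediately to the latest measurement.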
Do notice that the Sprint’s length itself changes these assumptions. An engineer’s month always has ~20 work days, so 1 day is always ~5% of it. If the Sprint is a week, that same day is 20% of the Sprint. So the shorter the Sprint, the more likely incidents are to break the Committed Plan.
We have one more gap to bridge: the gap between a Task and an Estimated Task, which is to practically remove its uncertainty and properly uncover the Task’s work. More on this in the next chapter.
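To make the effect of Sprint length concrete, here is a small sketch that compares the weight of one lost day across Sprint lengths. The function name `share_of_sprint` is hypothetical, and a two-week Sprint is assumed to have 10 work days.

```python
def share_of_sprint(lost_days: float, sprint_work_days: int) -> float:
    """Fraction of one engineer's Sprint consumed by `lost_days`."""
    return lost_days / sprint_work_days


# One lost day in a month-long, two-week and one-week Sprint.
for sprint_work_days in (20, 10, 5):
    share = share_of_sprint(1, sprint_work_days)
    print(f"{sprint_work_days:>2} work days: {share:.0%}")
# prints:
# 20 work days: 5%
# 10 work days: 10%
#  5 work days: 20%
```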