Agile, Waterfall, or Hybrid: An IF4IT Framework for Choosing Delivery Methodology
Chapter 5. Fail Fast — A Shared Principle, Applied Differently
A widespread misconception holds that Agile fails fast and Waterfall does not. This is wrong, and understanding why it is wrong is essential to understanding the framework. Both Agile and Waterfall pursue the principle of failing fast — of detecting and correcting failure as early and as cheaply as possible. They differ not in whether they fail fast, but in where, relative to the delivery of the Product or Service, they locate the point at which failure is detected.
Agile locates the failure point at or near delivery. A delivered increment is itself permitted to be the thing that surfaces a problem, because a failed increment is cheap to detect and cheap to correct. The “fast” in failing fast refers to the speed of the iteration loop: deliver, discover, correct, deliver again. Agile does not abandon verification before delivery, but it does not need to drive the probability of a production failure to near zero, because in Agile-shaped work a production failure is inexpensive and quickly recoverable.
Waterfall fails fast as well, but it deliberately relocates the failure point to before delivery. The stages of a Waterfall delivery — simulation, emulation, integration testing, user acceptance testing, and the like — exist precisely so that failure is detected early, within a contained stage, while it is still inexpensive to correct, rather than in the delivered Product or Service where it would be catastrophic. Each stage is a fail-fast checkpoint, and the sequence of stages is a sequence of progressively more realistic opportunities to detect failure before release. Waterfall’s stages are not bureaucratic overhead. They are fail-fast discipline, relocated to before delivery because that is where failure must be caught when failure in production cannot be tolerated.
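The idea that each stage is a fail-fast checkpoint can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name first_failing_stage, the STAGES list, and the fields of the design record are hypothetical and are not part of the IF4IT framework. The sketch runs the stages in order and stops at the first one that detects a failure, so the failure is caught inside a contained stage and never reaches the delivered Product or Service.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A stage is a (name, check) pair; the check returns True when the stage passes.
Stage = Tuple[str, Callable[[Dict[str, bool]], bool]]


def first_failing_stage(design: Dict[str, bool], stages: List[Stage]) -> Optional[str]:
    """Run the stages in order and stop at the first one that detects a failure.

    Returns the name of the failing stage, or None if every stage passes
    and the work may be released.
    """
    for name, check in stages:
        if not check(design):
            return name  # failure caught here, before delivery
    return None


# Hypothetical stages and a hypothetical design record, for illustration only.
STAGES: List[Stage] = [
    ("simulation", lambda d: d["logic_ok"]),
    ("emulation", lambda d: d["timing_ok"]),
    ("integration testing", lambda d: d["interfaces_ok"]),
    ("user acceptance testing", lambda d: d["accepted"]),
]

design = {"logic_ok": True, "timing_ok": False, "interfaces_ok": True, "accepted": True}
print(first_failing_stage(design, STAGES))  # emulation
```

In this sketch the later, more realistic stages never run once an earlier stage has failed, which mirrors the point that the earliest contained stage is the least expensive place to find the problem.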
The real distinction, then, is this: where is the cheapest place for this work to fail? When failure in production is inexpensive and quickly recoverable, the cheapest place to fail is production itself — and that is Agile. When failure in production is ruinously expensive or otherwise intolerable, production is the most expensive possible place to fail, and the failure point must be relocated earlier, into staged verification before release — and that is Waterfall. This is the reasoning that underlies the framework’s gating indicator, Consequence of Failure. The location of the cheapest failure point is determined by the cost of failing in production, and that cost is what the Consequence of Failure indicator measures.
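The question of where work is cheapest to fail can be written down as a small decision rule. The Python sketch below is a simplified illustration, not the framework's Consequence of Failure indicator: the Work record, its fields, the function cheapest_failure_point, and every cost figure are assumptions invented for the example. It compares the cost of a failure in production with the cost of catching the same failure in staged verification, and treats an unrecoverable production failure as ruling out production as a failure point altogether.

```python
from dataclasses import dataclass


@dataclass
class Work:
    """Hypothetical description of a piece of work; all fields are illustrative."""
    name: str
    production_failure_cost: float   # cost of one failure in the delivered product
    staged_verification_cost: float  # cost of catching the same failure before release
    recoverable: bool                # can a production failure be corrected afterwards?


def cheapest_failure_point(work: Work) -> str:
    """Return where failure is cheapest to detect: 'production' or 'pre-delivery'."""
    if not work.recoverable:
        # An intolerable failure must never be discovered in production.
        return "pre-delivery"
    if work.production_failure_cost <= work.staged_verification_cost:
        return "production"
    return "pre-delivery"


# Invented figures, chosen only to show the rule at three levels of severity.
examples = [
    Work("web page copy change", 1.0, 5.0, True),
    Work("integrated circuit", 1_000_000.0, 50_000.0, True),
    Work("surgical robot control software", float("inf"), 500_000.0, False),
]

for work in examples:
    print(work.name, "->", cheapest_failure_point(work))
```

Under these assumed figures, only the first item resolves to production as its failure point, which corresponds to Agile-shaped work; the other two resolve to pre-delivery, which corresponds to Waterfall-shaped work.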
Two examples illustrate the principle across a range of severity. The first is the design of an integrated circuit. Failure of a chip in production is not unthinkable, but it is ruinously expensive: correcting an error after fabrication requires a chip turn (going back, fixing the error, retesting, and manufacturing a corrected version), which is enormously costly and time-consuming. The entire discipline of pre-fabrication verification, including simulation and emulation, exists so that errors are found before the chip is fabricated. The chip teaches that when failure in production is economically severe, the detection of failure must be relocated into the stages that precede manufacture.
The second example is more extreme. Consider the software that controls a six-armed robotic surgical system. Such software must not fail during surgery, because the cost of failure is counted in human life and is therefore beyond measure. There is no opportunity to discover the problem in production and correct it in the next iteration. Failure must be driven out of production entirely, through exhaustive verification before the system is ever used in a live procedure. The surgical system teaches that when the cost of failure in production is beyond measure, the discipline of failing before delivery becomes absolute. The chip and the surgical system illustrate the same principle at two intensities: the chip because production failure is ruinously expensive, the surgical system because production failure is unthinkable. In both, the response is the same: relocate the detection of failure to before delivery.
One implication of this principle should be made explicit, because it is easily missed. Waterfall-shaped work may be, and very often is, internally iterative. Engineers refine designs through repeated cycles of analysis, simulation, and test; teams revisit work as new information emerges from staged verification. What Waterfall-shaped work cannot do — and the reason it is not Agile-shaped — is rely on production failure as its learning mechanism when the consequence of that failure is severe. The iteration happens, but it happens against simulations, emulations, integration tests, and other staged checks rather than against the delivered Product or Service. The distinction between Agile and Waterfall is not the presence or absence of iteration; it is where the iteration is permitted to surface a problem.
Copyright for the International Foundation for Information Technology (IF4IT): 2008 - Present