Recent experiences with agile coaching and reading have brought me to the concept of resiliency, which I have found to hold fascinating implications for modern product development and project management, especially as it relates to resilient plans. The idea has been studied in biology, ecology, engineering, and psychology, to name just a few fields.
Simply stated, resiliency is the capability of a system to recover to a stable, functioning state after failure or adverse events. The specifics vary widely based on the field of study; in this post I’d like to first talk about some of the common elements of resilient systems and then look more closely at how they apply to software development and IT project management. I believe resiliency offers several key insights into how to operate more effectively.
Let me first illustrate with an example from biology to distinguish between two frequently confused concepts: robustness and resiliency. The blue whale is the largest animal in the ocean; it is so massive that it has no natural predators. Before commercial whaling, there were an estimated 300,000 of these whales swimming through the oceans of the planet. Today, even after 50 years of moratorium on commercial whaling, there are only about 5,000. The blue whale is an excellent example of a robust organism. Its great speed and large size allow it to resist natural predators, and even early whalers, quite effectively. However, once confronted with a challenge it could not resist, the blue whale was unable to adapt and quickly declined.
To find an example of a resilient organism, we may want to look much smaller. Let’s take the example of the termite, an animal that we humans continue to try to eliminate. Termites operate as colonies; while we may kill some of them, and we frequently do, the colony as a whole continues to thrive in spite of these minor setbacks. Indeed, if we were to judge the success of a species by its combined biomass (the combined mass of all living members of the species), we would be humbled by the resilient termite.
| Species | Biomass (in millions of tons) |
|---|---|
| Blue whale (before commercial whaling) | 36 |
| Blue whale (today) | <1 |
What Makes a System Resilient?
Not all resilient systems are the same, but many share common characteristics which help them respond quickly and recover. Zolli and Healy do an excellent job of detailing some of the key concepts, but here’s a quick summary.
- Decoupled and modular systems: As opposed to redundant systems, which simply store duplicate capacity, resilient systems frequently have a decoupled and decentralized structure. Numerous organizations, ranging from elite military teams down to swarming insects, leverage this concept to work with multiple, autonomous groups in which individuals are interchangeable. In modern projects, this looks very similar to the team-based structures of methodologies like Scrum and eXtreme Programming, where work is planned, managed, and executed jointly by small teams as opposed to being centrally managed.
- Rapid feedback and dynamic reorganization: Resilient systems must be responsive to change, and that can only happen if they have rapid sources of feedback to detect changes very early. In an earlier post, I discussed how the simple behaviors of ants can lead to advanced detection and optimization of work roles, as well as navigation routes.
- Swarming: When changes are detected, resilient systems may swarm to respond to a threat or change. Swarming behavior allows disproportionate energy to be exerted at a key area. This may be a human body sending antibodies to fight an infection, ants clustering to defend a nest, or even a committed team focusing its energy on clearing an impediment.
- Diverse at the Edge, but Simple at the Core: Resilient systems frequently have quite a bit of complexity, but are bound together by simple common strands. The Internet consists of numerous computers of various designs running innumerable different operating systems, but they all communicate and interface through a simple set of shared protocols. Agile teams are made up of people who have numerous diverse skills, yet they work together based on shared purpose, team commitment, joint ownership of code, agreed-upon criteria for work, and paired execution.
Why Favor Resilient Plans?
As countless failed projects have demonstrated, there are innumerable unknown variables that can undermine even the best-laid plan. In fact, the sheer number of low-probability, high-impact events, or “black swans”, shows the limits of a robust but rigid system. We can never plan for all contingencies, and the number of uncommon but profoundly impactful risks all but guarantees that we will experience some event that we could not have specifically anticipated. This was demonstrated quite nicely by the mathematician J. E. Littlewood, who calculated that the average person should expect to observe a “one-in-a-million” event roughly once every 35 days. This is not a purely negative situation either. There are frequently opportunities that are impossible to specifically predict, but quite valuable if they can be rapidly exploited.
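Littlewood’s 35-day figure is easy to check with a few lines of arithmetic. The sketch below assumes the premises usually cited for his argument, which are not stated in this post: a person notices about one event per second and is alert for about eight hours a day. The function name and parameters are my own, chosen for illustration.

```python
SECONDS_PER_HOUR = 3600


def days_until_rare_event(odds=1_000_000, events_per_second=1.0, alert_hours_per_day=8.0):
    """Expected number of days before observing an event with 1-in-`odds` chance.

    Assumed premises (not from this post): one observed event per second,
    eight alert hours per day.
    """
    # Number of events a person observes in one day under these assumptions.
    events_per_day = events_per_second * SECONDS_PER_HOUR * alert_hours_per_day
    # On average, one event in `odds` is the rare one.
    return odds / events_per_day


if __name__ == "__main__":
    print(f"{days_until_rare_event():.1f} days")
```

Under these assumptions a person observes 28,800 events a day, so a million events take a little under 35 days. Changing either assumption shifts the number, but not the conclusion: across enough events, “rare” things happen routinely.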
These concepts do not appear to conflict with the basic tenets of agile software development, but rather they slightly refocus the frame. Many people think of agile as working well in an environment where the question of what to build is still open, and indeed agile offers much in this area. Resiliency seems to be more focused on the kind of team, project, and program you want to run. No matter how well we think we know a domain, invariably some change will emerge, and then the question becomes, regardless of methodology: how will we react?
Resilient systems allow us to recover from those inevitable unexpected occurrences. This leads us to some apparently counter-intuitive conclusions. Most of us are led to believe that prevention is always the ultimate goal. To this day, I join many people in the Scrum field in recommending that teams methodically confront risk and burn it down. However, my experience has shown that no matter how well we confront risk, this type of approach will not address those unpredicted and rare events. In fact, for those who do not follow an agile or lean process, too much focus on risk management and mitigation may actually make a project more brittle. Projects that try to agree on the architecture entirely upfront or attempt to arrange delivery in a way that optimizes testing (without investing in automation) will find themselves undertaking a very fragile project. Their plans may be made more robust with further analysis, buffers, and mitigation plans, but they will still be unable to recover once a key assumption is invalidated or a buffer exhausted. It seems Ben Franklin wasn’t entirely right when he advised, “an ounce of prevention is worth a pound of cure”.
In the domain of complex business and organizational problems, there are too many ailments for us to prevent them all. Resiliency gives us a vocabulary and concepts that we can incorporate into our projects to ensure that not only are they agile, but that they are able to rebound when we can’t adjust fast enough to avoid adverse events. I’m sure I’ll have more to write on this soon, but I’ve found this to be a very rich vein of research and have been quite encouraged to see the strong parallels between what resiliency engineering has found so far and what we within the agile software development realm are discovering as well.