The Basics of Software Resilience and Security Chaos Engineering

The software resilience transformation I pioneered with my book — coalescing¹, defining, and innovating the principles, practices, and patterns needed to pursue a comprehensive resilience strategy — is gathering momentum in mindshare. This means humans, in their general propensity towards least effort, want the cliff notes² version of what this “software resilience” and “security chaos engineering” stuff means. This is a reasonable request, which I am happy to oblige in this post.

Below are the chapter takeaways from my book, Security Chaos Engineering: Sustaining Resilience in Software and Systems. My hope is these summaries can serve as a cliff notes study guide on the mosaic of concepts within the tome³ — at least the basics — and make the software resilience approach more accessible to curious beings wanting to do software security better.

As a sneaky bonus, this post serves as a suitable citation for those of you who are only allowed to cite public / non-commercial sources (which my book is not):

Shortridge, Kelly. “The Basics of Software Resilience and Security Chaos Engineering.” Sensemaking by Shortridge (blog). January 4, 2024. https://kellyshortridge.com/blog/posts/security-chaos-engineering-sustaining-software-systems-resilience-cliff-notes/

tl;dr — My standard definition of Security Chaos Engineering is “a socio-technical transformation that enables the organizational ability to gracefully respond to failure and adapt to evolving conditions.” This applies to my concept of Platform Resilience Engineering, too. Really, it’s all about sustaining resilience in practice.

Resilience in Software and Systems

Takeaways from Chapter 1

All of our software systems are complex. Complex systems are filled with variety, are adaptive, and are holistic in nature.
Failure is when systems — or components within systems — do not operate as intended. In complex systems, failure is inevitable and happening all the time. What matters is how we prepare for it.
Failure is never the result of one factor; there are multiple influencing factors working in concert. Acute and chronic stressors are factors, as are computer and human surprises.
Resilience is the ability for a system to gracefully adapt its functioning in response to changing conditions so it can continue thriving.
Resilience is the foundation of security chaos engineering. Security Chaos Engineering (SCE) is a set of principles and practices that help you design, build, and operate complex systems that are more resilient to attack (and other types of failures, too).
The five ingredients of the “resilience potion” include understanding a system’s critical functionality; understanding its safety boundaries; observing interactions between its components across space and time; fostering feedback loops and a learning culture; and maintaining flexibility and openness to change.
Resilience is a verb. Security, as a subset of resilience, is something a system does, not something a system has.
SCE recognizes that a resilient system is one that performs as needed under a variety of conditions and can respond appropriately both to disturbances — like threats — as well as opportunities. Security programs are meant to help the organization anticipate new types of hazards as well as opportunities to innovate to be even better prepared for the future (whether new incidents, market conditions, or more).
There are many myths about resilience, four of which we covered: the conflation of resilience with robustness (the ability to “bounce back” to normal after an attack); the belief that we can and should prevent failure (which is impossible); the myth that the security of each component adds up to the security of the whole system (it does not); and the idea that creating a “security culture” fixes the “human error” problem (it never does).
SCE embraces the idea that failure is inevitable and uses it as a learning opportunity. Rather than preventing failure, we must prioritize handling failure gracefully — which better aligns with organizational goals, too.

Systems-Oriented Security

Takeaways from Chapter 2

If we want to protect complex systems, we can’t think in terms of components. We must infuse systems thinking in our security programs.
No matter our role, we maintain some sort of “mental model” about our systems — assumptions about how a system behaves. Because our systems and their surrounding context constantly evolve, our mental models will be incomplete.
Attackers take advantage of our incomplete mental models. They search for our “this will always be true” assumptions and hunt for loopholes and alternative explanations (much like lawyers).
We can proactively find loopholes in our own mental models through resilience stress testing. That way, we can refine our mental models before attackers can take advantage of inaccuracies in them.
Resilience stress testing is a cross-discipline practice of identifying the confluence of conditions in which failure is possible; financial markets, healthcare, ecology, biology, urban planning, disaster recovery, and many other disciplines recognize its value in achieving better responses to failure (versus risk-based testing). In software, we call resilience stress testing “chaos experimentation.” It involves injecting adverse conditions into a system to observe how the system responds and adapts.
The E&E Approach is a repeatable, standardized means to incrementally transform toward resilience. It involves two tiers of assessment: evaluation and experimentation. The evaluation tier is a readiness assessment that solidifies the first three resilience potion ingredients: understanding critical functions, mapping system flows to those functions, and identifying failure thresholds. The experimentation tier harnesses learning and flexibility: conducting chaos experiments to expose real system behavior in response to adverse conditions, which informs changes to improve system resilience.
The “fail-safe” mindset is anchored to prevention and component-based thinking. The “safe-to-fail” mindset nurtures preparation and systems-based thinking. Fail-safe tries to stop failure from ever happening (impossible) while safe-to-fail proactively learns from failure for continuous improvement. The fail-safe mindset is a driver of the status quo cybersecurity industry’s lack of systems thinking, its fragmentation, and its futile obsession with prediction.
Security Chaos Engineering (SCE) helps organizations migrate away from the security theater that abounds in traditional cybersecurity programs. Security theater is performative; it focuses on outputs rather than outcomes. Security theater punishes “bad apples” and stifles the organization’s capacity to learn; it is manual, inefficient, and siloed. Instead, SCE prioritizes measurable success outcomes, nurtures curiosity and adaptability, and supports a decentralized model for security programs.
RAV Engineering (or RAVE) reflects a set of principles (repeatability, accessibility, and variability) that support resilience across the software delivery lifecycle. When an activity is repeatable, it minimizes mistakes and is easier to mentally model. Accessible security means stakeholders don’t have to be experts to achieve the security outcomes we want. Supporting variability means sustaining our capacity to respond and adapt gracefully to stressors and surprises in a reality defined by variability.

Architecting and Designing for Software Resilience

Takeaways from Chapter 3

Our systems are always “becoming,” an active process of change. What started as a simple system we could mental-model with ease will become complex as it grows and the context around it evolves.
When architecting and designing a system, your responsibility is not unlike that of Mother Nature: to nurture your system so it may recover from incidents, adapt to surprises, and evolve to succeed over time.
We — as individuals, teams, and organizations — only possess finite effort and must prioritize how we expend it. The Effort Investment Portfolio concept captures the need to balance our “effort capital” across activities to best achieve our objectives.
When we allocate our Effort Investment Portfolio during design and architecture, we must consider the local context of the entire sociotechnical system and preserve possibilities for both software and humans within it to adapt and evolve over time.
There are four macro failure modes for complex systems that can inform how we allocate effort when architecting and designing systems. We especially want to avoid the Danger Zone quadrant—where tight coupling and interactive complexity combine — because this is where surprising and hard-to-control failures, like cascading failures, manifest.
We can invest in looser coupling to stay out of the Danger Zone. In this chapter, I covered numerous opportunities to architect and design for looser coupling; the best opportunities depend on your local context.
Tight coupling is sneaky and may only be revealed during an incident; systems often inadvertently become more tightly coupled as changes are made and we excise perceived “excess.” We can use chaos experiments to expose coupling proactively and refine our design accordingly.
We can also invest in linearity to stay out of the Danger Zone. We described many opportunities to architect and design for linearity, including isolation, choosing “boring” technology, and functional diversity. The right opportunities depend, again, on your local context.
Scaling the sociotechnical system is where coupling and complexity especially matter. When immersed in the labyrinthine nest of teams and software interactions in larger organizations, we must tame tight coupling (by investing in looser coupling) and find opportunities to introduce linearity — or else find our forward progress crushed.
Experiments can generate evidence of how our systems behave in reality so we can refine our mental models during design and architecture. If we do our jobs well, our systems will grow and therefore become impossible to mentally model on our own. We can leverage experimentation to regain confidence in our understanding of system behavior.
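
To make the Danger Zone takeaway above a little more concrete, here is a minimal Python sketch of a design-review helper. It is an illustration only, not a method from the book: the `Component` fields, the example component names, and the simplification of coupling and complexity into yes/no flags are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical description of a system component for design review."""
    name: str
    tightly_coupled: bool        # failures propagate quickly to dependents
    interactively_complex: bool  # interactions are non-linear and hard to predict


def in_danger_zone(component: Component) -> bool:
    """True when tight coupling and interactive complexity combine."""
    return component.tightly_coupled and component.interactively_complex


def danger_zone_candidates(components: list[Component]) -> list[str]:
    """Names of components worth prioritizing for looser coupling or more linearity."""
    return [c.name for c in components if in_danger_zone(c)]


# Example inventory (all names are made up for illustration).
components = [
    Component("shared-auth-library", tightly_coupled=True, interactively_complex=True),
    Component("static-marketing-site", tightly_coupled=False, interactively_complex=False),
    Component("batch-report-job", tightly_coupled=True, interactively_complex=False),
]
print(danger_zone_candidates(components))
```

In practice, coupling and complexity are matters of degree and are often only revealed by experiments or incidents, so a checklist like this is a conversation starter rather than a verdict.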

Building and Delivering for Software Resilience

Takeaways from Chapter 4

When we build and deliver software, we are implementing intentions described during design, and our mental models almost certainly differ between the two phases. This is also the phase where we possess many opportunities to adapt as our organization, business model, market, or any other pertinent context changes.
Who owns application security (and resilience)? The transformation of database administration serves as a template for the shift in security needs; it migrated from a centralized, siloed gatekeeper to a decentralized paradigm where engineering teams adopt more ownership. We can similarly transform security.
There are four key opportunities to support critical functionality when building and delivering software: defining system goals and guidelines (prioritizing with the “airlock” approach); performing thoughtful code reviews; choosing “boring” technology to implement a design; and standardizing “raw materials” in software (like memory safe languages).
We can expand safety boundaries during this phase with a few opportunities: anticipating scale during development; automating security checks via CI/CD; standardizing patterns and tools; and performing dependency analysis and vulnerability prioritization (the latter in a quite contrary approach to status quo cybersecurity).
There are four opportunities for us to observe system interactions across spacetime and make them more linear when building and delivering software and systems: adopting Configuration as Code; performing fault injection during development; crafting a thoughtful test strategy (prioritizing integration tests over unit tests to avoid “test theater”); and being especially cautious about the abstractions we create.
To foster feedback loops and learning during this phase, we can implement test automation; treat documentation as an imperative (not a nice-to-have), capturing both why and when; implement distributed tracing and logging; and refine how humans interact with our processes during this phase (keeping realistic behavioral constraints in mind).
To sustain resilience, we must adapt. During this phase, we can support this flexibility and willingness to change through five key opportunities: iteration to mimic evolution; modularity, a tool wielded by humanity over millennia for resilience; feature flags and dark launches for flexible change; preserving possibilities for refactoring through (programming language) typing; and pursuing the strangler fig pattern for incremental, elegant transformation.
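
As one way to picture “feature flags and dark launches for flexible change” from the list above, here is a small Python sketch. Everything in it is hypothetical: the `FLAGS` dictionary, the flag names, and the two validator functions are stand-ins invented for the example, and a real system would more likely load flags from configuration or a flag service.

```python
import logging

logger = logging.getLogger("dark_launch")

# Hypothetical in-memory flag store (an assumption for this sketch).
FLAGS = {
    "new_token_validator.dark_launch": True,  # run the new path silently
    "new_token_validator.serve": False,       # serve results from the new path
}


def old_validate(token: str) -> bool:
    """Existing behavior that users currently depend on."""
    return token.startswith("tok_")


def new_validate(token: str) -> bool:
    """Candidate replacement with a stricter rule."""
    return token.startswith("tok_") and len(token) >= 8


def validate(token: str) -> bool:
    """Serve the old result unless the new path is switched on; optionally run
    the new path in the shadows and log disagreements so we can learn first."""
    if FLAGS["new_token_validator.serve"]:
        return new_validate(token)

    old_result = old_validate(token)
    if FLAGS["new_token_validator.dark_launch"]:
        try:
            if new_validate(token) != old_result:
                logger.warning("validator mismatch for token of length %d", len(token))
        except Exception:
            # A failure in the dark path must never affect the user-facing path.
            logger.exception("dark-launched validator raised")
    return old_result
```

The point of the design is reversibility: flipping `new_token_validator.serve` back to `False` restores the old behavior without another deploy, which preserves the option to change course.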

Operating and Observing for Software Resilience

Takeaways from Chapter 5

Operating and observing the system is the phase where we can witness system behavior as it runs in production, which can reveal where our mental models are inaccurate. It is when we can glean valuable insights about our systems and incorporate this data into our feedback loops.
Security is woven into all three key aspects of reliability that reflect user expectations: availability, performance, and correctness.
Site reliability engineering (SRE) goals and security goals overlap to a significant degree, making those teams natural allies in solving reliability and resilience challenges. A key difference is that SRE understands that moving quickly is correlated with reducing the impact of failure; security must adopt this mindset too.
Attackers can directly measure success and immediately receive feedback, giving them an asymmetric advantage. We must strive to replicate this for our goals too.
To measure operational success, we can borrow established metrics like the DORA metrics and craft thoughtful SLOs that help us learn more about the system.
Success is an active process, not a one-time achievement. We must support graceful extensibility: the capability to anticipate bottlenecks and “crunches,” learn about evolving conditions, and adapt responses to stressors and surprises as they change.
We want to mimic the interactive, overlapping, and decentralized sensitivities of biological systems in our observability strategy. In particular, we want to observe system interactions across space and time. We must maintain the ability to reflect on three key questions: How well is the system adapted to its environment? What is the system adapted to? What is changing in the system’s environment?
Tracking when a system is repeatedly stretching toward its limit (“thresholding”) helps us uncover the system’s boundaries of safe operation. Increasingly “laggy” recovery from disturbances in both the socio and technical parts of the system can indicate erosion of adaptive capacity.
Attack observability refers to collecting information about the interaction between attackers and systems. It involves tracing attacker behavior to reveal how it looks in reality versus our mental models. Deception environments can facilitate attacker tracing, fuel a feedback loop for resilient design, and serve as an experimentation platform.
A scalable system is a safer system. System signals used to measure scalability can be used as indicators of attack too; we discussed many, including autoscaling replica count, heartbeat response time, and resource consumption.
Being a gatekeeper to growth is not an effective way to achieve security outcomes. Scalability forces high-friction processes and procedures to adapt to growth, which is healthy for sustaining resilience.
We should apply the concept of toil, from SRE, to security. For any task that a computer can perform better than a human — like those requiring accurate repetition — we should automate it. Doing so frees up effort capital that we can expend on higher-value activities that leverage human strengths like creativity and adaptability.
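
The “thresholding” takeaway above (tracking when a system repeatedly stretches toward its limit) can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the `ThresholdTracker` class, the 90% “near limit” mark, the 20% alert ratio, and the sample values are all invented for the example rather than taken from the book.

```python
from collections import deque


class ThresholdTracker:
    """Tracks how often a metric stretches near a known limit within a
    sliding window of recent samples."""

    def __init__(self, limit: float, near_fraction: float = 0.9, window: int = 100):
        self.limit = limit
        self.near = limit * near_fraction    # the "close to the limit" mark
        self.samples = deque(maxlen=window)  # oldest samples fall off automatically

    def record(self, value: float) -> None:
        self.samples.append(value)

    def near_limit_ratio(self) -> float:
        """Fraction of recent samples at or above the near-limit mark."""
        if not self.samples:
            return 0.0
        hits = sum(1 for v in self.samples if v >= self.near)
        return hits / len(self.samples)

    def is_thresholding(self, alert_ratio: float = 0.2) -> bool:
        """True when the system is repeatedly stretching toward its limit."""
        return self.near_limit_ratio() >= alert_ratio


# Example: requests per second against an assumed capacity of 1000.
tracker = ThresholdTracker(limit=1000.0, window=10)
for rps in [400, 950, 500, 920, 300, 910, 450, 480, 470, 460]:
    tracker.record(rps)
print(tracker.near_limit_ratio(), tracker.is_thresholding())
```

The same kind of counter could be pointed at recovery times to watch for the increasingly “laggy” recovery mentioned above, though choosing the limit and the window is a judgment call that depends on local context.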

Responding and Recovering for Software Resilience

Takeaways from Chapter 6

Incidents are like a pop quiz. To prepare for them and ensure we can respond with grace, we must practice incident response activities—and can do so through chaos experimentation.
The Effort Investment Portfolio applies to incident response too. Effort expended earlier in the software delivery lifecycle will reduce the effort required when responding to incidents (this does not mean “shift left,” at least in its popularized / monetized form).
Humans often feel an impulse toward action (action bias), which can reduce effectiveness during incident response. Practicing “watchful waiting” can curtail knee-jerk reactions.
There is no “best practice” for all incidents. The best we can do is practice incident response activities to nurture human responders’ adaptive capabilities.
Repeated practice of response activities through chaos experimentation can turn incidents from stressful, scary situations into confidence-building, problem-solving scenarios.
Recovering from incidents requires adaptation, and learning is a prerequisite for this adaptation. Learning from incidents to develop memory of failure is about community, so if we blame community members for the incident, we will struggle to learn.
A blameless culture helps organizations stay in a learning mindset — uncovering problems early and gaining clarity around incidents — rather than play the “blame game.” It encourages people to speak up about issues without fear of being punished for doing so.
There are two contributing factors always worth discussing during incident review: relevant production pressures and system properties.
Humans at the “sharp end,” who interact directly with the system, are often blamed for incidents by humans at the “blunt end,” who influence the system but interact indirectly (like administrators, policy prescribers, or system designers). The disconnect between the two can be summarized as the delta between “work-as-practiced” and “work-as-imagined.”
The cybersecurity industry often (unproductively) blames users for causing failures, as evidenced by the acronym PEBKAC: problem exists between keyboard and chair. A more useful heuristic is PEBRAMM: problem exists between reality and mental model. An error represents a starting point for investigation; it is a symptom that indicates we should reevaluate design, policy, incentives, constraints, or other system properties.
There are numerous biases that tempt us to blame human error during incidents, which hinders our capacity to constructively learn from and adapt to failure. With hindsight bias, we allow our present knowledge to taint our perception of past events (the “I knew it all along” effect). With outcome bias, we judge the quality of a decision based on its eventual outcomes. The just-world hypothesis refers to our preference for believing the world is an orderly, just, and consequential place. All of these biases warp our perception of reality.
During incident review, use neutral practitioner questions to stay curious and intellectually honest. Neutral practitioner questions re-create the context surrounding an event and ask practitioners what actions they would take given this context. This helps sketch a portrait of local rationality: the reasonable course of action in the presence of contextual trade-offs and constrained information-processing capabilities.

Platform Resilience Engineering

Takeaways from Chapter 7

At the “meta-design” level, we can sustain resilience through organizational structure and practices—transforming from a siloed security program into a platform engineering model (“platform resilience engineering”).
We must be aware of production pressures and how they tip sociotechnical systems toward failure. Production pressures involve the incentivization of less expensive and more efficient work, with quality (and security as its subset) as the typical sacrifice.
A platform engineering approach to resilience treats security as a product with end users, as something created through a process that provides benefits to a market (with internal teams as our customers). Platform Engineering teams identify real problems, iterate on a solution, and prioritize usability to promote adoption. Resilience is a natural fit for their purview.
Any product requires a long-term vision — a unifying theme for all your projects toward a defined end. The vision tells a story of what is being built and why.
Treating resilience — including security — as a product starts with identifying the right user problems to tackle. To accurately define user problems, we must understand their local context. We must understand how our users make tradeoffs under pressure, maintain curiosity about the workarounds they create, and respect the limitations of their brains’ computational capacity (“cognitive load”).
Security solutions become less reliable as their dependence on human behavior increases. The Ice Cream Cone Hierarchy of Safety Solutions helps us prioritize how we design security solutions, from best to least effective. Starting from the top of the cone, we can eliminate hazards by design; substitute less hazardous methods or materials; incorporate safety devices and guards; provide warning and awareness systems; and, last and least effective, apply administrative controls (like guidelines and training).
There are two possible paths we can pursue when solving user problems: the control strategy or the resilience strategy. The control strategy designs security programs based on what security humans think other humans should do; it is convenient for the Security team at the expense of others’ convenience. The resilience strategy promotes and designs security based on how humans actually behave; success is when our solutions align with the reality of work-as-done. The control strategy makes users responsible for security while the resilience approach makes those designing security programs and solutions responsible for it.
We should build minimum viable products (MVPs) and pursue an iterative change model informed by user feedback.
We should gain consensus about our plans for solving resilience problems — from vision through to implementation of a specific solution — and ensure stakeholders understand the why behind our solutions. Success is solving a real problem in a way that delivers consistent value.
To facilitate solution adoption, we must plan for migration and pave the road for our customers to adopt what we’ve created for them (hence the strategy of creating “paved roads”). We should never force solutions on other humans; if that is the only way to drive adoption, then it is a failure of our design, strategy, and communication.
Measuring product success is necessary for our feedback loops, but can be tricky. If we design solutions for use by engineering teams, the SPACE framework offers numerous success criteria we can measure. In general, we should be curious about the factors contributing to success and failure for our internal customers.
Any metrics related to how “secure” or “risky” something is, like percentage of “risk coverage,” are busywork based on measuring the (highly subjective) unmeasurable. We need to measure our program’s success—and any solutions we design as part of it—based on tangible, realistic goals.
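
The Ice Cream Cone Hierarchy of Safety Solutions described above is an ordering, so it can be sketched as one. The Python below is only an illustration: the tier names paraphrase the takeaway, while the `rank_solutions` helper, the example hazard (memory safety bugs in a parser), and the tier assigned to each candidate are my own assumptions and judgment calls.

```python
from enum import IntEnum


class SafetyTier(IntEnum):
    """Tiers from the takeaway above, ordered from most to least effective."""
    ELIMINATE_BY_DESIGN = 1
    SUBSTITUTE_LESS_HAZARDOUS = 2
    SAFETY_DEVICES_AND_GUARDS = 3
    WARNING_AND_AWARENESS = 4
    ADMINISTRATIVE_CONTROLS = 5


def rank_solutions(candidates: dict[str, SafetyTier]) -> list[str]:
    """Order candidate solutions so those relying least on human behavior come
    first. Ties keep insertion order because sorted() is stable."""
    return sorted(candidates, key=lambda name: candidates[name])


# Hypothetical candidates for reducing memory safety bugs in a parser.
candidates = {
    "secure coding training": SafetyTier.ADMINISTRATIVE_CONTROLS,
    "sandbox the parser process": SafetyTier.SAFETY_DEVICES_AND_GUARDS,
    "rewrite parser in a memory safe language": SafetyTier.SUBSTITUTE_LESS_HAZARDOUS,
    "compiler warnings for unsafe calls": SafetyTier.WARNING_AND_AWARENESS,
}
for name in rank_solutions(candidates):
    print(candidates[name].name, "-", name)
```

A ranking like this only says which options to consider first; cost, user problems, and local context still decide what gets built.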

Security Chaos Experiments

Takeaways from Chapter 8

Experimentation is a cycle of discovery and learning, which is what drives scientific progress. Resilience stress tests (aka security chaos experiments) are like applying the scientific method to software and systems security.
Early adopters of security chaos experimentation learned three key lessons: first, it’s fine to start in nonproduction environments because you can still learn a lot; second, use past incidents as inspiration for experiments and to leverage organizational memory; third, make sure to publish and evangelize your experimental findings because expanding adoption will become your hardest challenge (the technical work is comparatively easy).
To set chaos experiments up for success, especially the first time, we need to socialize the experiment with relevant stakeholders. Investing in the right messaging and framing at the beginning will reduce friction later.
The next step is designing an experimental hypothesis. Hypotheses typically take the form of: “In the event of the following X condition, we are confident that our system will respond with Y.”
Once we have a hypothesis, we can design our experiment so we uncover the behavior about which we want to learn. There are numerous considerations: where we conduct the experiment, how we measure success, potential impacts, fallback procedures, and more.
Documenting a precise, exact experiment design specification (“spec”) is critical. Our goal with the spec is for our organization to gain a luculent understanding of why we’re conducting this experiment, when and where we’re conducting it, what it will involve, and how it will unfold.
Launching an experiment is not unlike a feature release. Our preparation in socializing the experiment, designing the hypothesis, and defining the experiment specifications makes this one of the easier phases.
What evidence we collect when conducting an experiment is defined by the spec; we should already know what we’re monitoring and what evidence we expect.
The first step after we’ve collected evidence is confirming we collected the evidence we sought from the experiment. The second step is to analyze the data with regard to the hypothesis. Our goal is to compare our observations with our predictions — to verify and refine our mental models of the system, which informs what actions we can take to sustain its resilience to adversity.
We should communicate our experimental findings through release notes. Most stakeholders don’t need lots of detail; we should synthesize and summarize our experimental insights, highlighting any action items. Once those action items are performed, we can rerun the experiment.
After the first run of any experiment, you can automate it for continuous use. Because our systems, and the reality around them, are constantly changing, we must continuously generate evidence lest it grow stale.
Game days, a more manual form of conducting a security chaos experiment, can help more hesitant organizations ease into chaos experimentation.
There is no end to the kinds of security chaos experiments you can conduct in your systems. In this chapter, I enumerated many applicable to production infrastructure, build pipelines, service-oriented environments, and Windows environments.
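
The hypothesis, spec, and analysis steps from the takeaways above can be sketched as a small data structure. This is a minimal Python illustration under stated assumptions: the `ExperimentSpec` fields, the `analyze` helper, its result strings, and the example scenario are all hypothetical, and a real spec would carry far more detail (timing, stakeholders, potential impacts, and so on).

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentSpec:
    """Hypothetical, minimal experiment spec for illustration."""
    name: str
    condition: str          # the X in the hypothesis
    expected_response: str  # the Y in the hypothesis
    environment: str        # where the experiment runs
    fallback: str           # how to stop the experiment and restore the system
    evidence_sought: list[str] = field(default_factory=list)

    def hypothesis(self) -> str:
        return (f"In the event of {self.condition}, we are confident that "
                f"our system will respond with {self.expected_response}.")


def analyze(spec: ExperimentSpec, observations: dict[str, bool]) -> str:
    """First confirm we collected the evidence we sought, then compare the
    observations with the prediction."""
    missing = [e for e in spec.evidence_sought if e not in observations]
    if not spec.evidence_sought or missing:
        return "inconclusive: missing evidence for " + (", ".join(missing) or "everything")
    if all(observations[e] for e in spec.evidence_sought):
        return "hypothesis supported"
    return "hypothesis refuted: refine mental models and design"


spec = ExperimentSpec(
    name="unauthorized-port-open",
    condition="an unauthorized port being opened on a staging host",
    expected_response="an alert and automatic reversion of the change",
    environment="staging",
    fallback="manually close the port and disable the experiment job",
    evidence_sought=["alert_fired", "change_reverted"],
)
print(spec.hypothesis())
print(analyze(spec, {"alert_fired": True, "change_reverted": False}))
```

A refuted hypothesis is still a successful experiment in the sense described above: it exposes a gap between the mental model and reality, which becomes an action item and a reason to rerun the experiment later.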

Case Studies

Takeaways from Chapter 9

It’s hard to capture the cliff notes for the case studies Aaron Rinehart compiled from UnitedHealth Group, Verizon, OpenDoor, Cardinal Health, Accenture Global, and Capital One. Plus, it’s the only chapter I didn’t write on my own in the book, so I feel I wouldn’t do it justice.

But, my personal takeaways from the case studies are:

Collaboration is key; to succeed in security, we must not only establish healthy communication with other teams but be open to them teaching us, too — especially platform engineering and SRE teams, who possess a wealth of experience we can leverage in our resilience journey.
The resilience / SCE transformation is not exclusive to a certain type of organization. Fortune 10, highly-regulated organizations can pursue it. So can smaller, scrappy startups. And this speaks to something I really tried to emphasize in the book: the resilience approach isn’t about revolutionizing X, Y, Z overnight and doing them perfectly; it’s about iteratively changing how you do things towards more resilient outcomes. It’s about experimentation at basically all levels, whether experimenting with new modes of collaboration with software engineering teams, conducting resilience stress tests, or trying any of the zillion strategies I described in the book. For some organizations, it’ll be easier to iteratively migrate to memory safe languages; for others, it’ll be easier to migrate to running workloads on immutable infrastructure or implementing integration testing or so many other things that can make a difference outcome-wise.
A learning culture is critical. This doesn’t mean we defenestrate all caution so it flails on the restless winter winds. It means we conduct small experiments to generate evidence and inform what we do next. It means we assume by default that most humans are just trying to do their jobs, and that means we don’t automatically blame them when something goes wrong, nor do we treat our colleagues like our adversaries (as it turns out, cyber criminals are our real adversaries⁴, who knew!). We look critically at how our systems and processes are designed, then brainstorm how to iteratively improve them over time. If a system is confusing or cumbersome to use, then that’s a design problem, not a “human error” problem. Nearly every case study in this chapter highlights the importance of psychological safety in the transformation towards resilience for a reason.

Enjoy this post? You might like my book, Security Chaos Engineering: Sustaining Resilience in Software and Systems, available at Amazon, Bookshop, and other major retailers online.

¹ The book has like nine million citations, so in no way do I think I am solely responsible for software resilience being a thing. However, to my knowledge, I am the only person to have synthesized disparate research across dozens of disciplines; filled in gaps both philosophically and practically; extended all of that with tons of original contributions from concepts to specific activities; and packaged it into an end-to-end strategy for organizations of all kinds to adopt. While I’ve often been the Agitator in the Agitator-Innovator-Orchestrator model of change, the book is my attempt at being the Innovator for the movement – conceptualizing and communicating, at great length, potential solutions to the problems wrought by traditional cybersecurity and poor software quality.
² I am purposefully using the misnomer “cliff notes” to avoid getting sued, jk hopefully
³ Presenting the takeaways as digestible bullet points also makes it easy for you to copy and paste into your next LinkedIn Thought Leadership post (with attribution :) so you can dazzle your followers with your newfound knowledge as they doomscroll.
⁴ Yes, I am aware nation state actors exist. But per the Verizon Data Breach Investigations Report year after year, nation states are like less than 5% of incidents, while money-driven cyber crimes are like 95%. Your threat model should start with cyber criminals before you assume you’re gonna get a comex- / PinkiePie-style chain of 0day thrown your way.