69 Ways to F*** Up Your Deploy

Co-authored by Kelly Shortridge and Ryan Petrich

We hear about all the ways to make your deploys so glorious that your pipelines poop rainbows and services saunter off into the sunset together. But what we don’t see as much is folklore of how to make your deploys suffer.¹

Where are the nightmarish tales of our brave little deploy quivering in their worn, unpatched boots – trembling in a realm gory and grim where pipelines rumble towards the thorny, howling woods of production? Such tales are swept aside so we can pretend the world is nice (it is not).

To address this poignant market painpoint, this post is a cursed compendium of 69 ways to fuck up your deploy. Some are ways to fuck up your deploy now and some are ways to fuck it up for Future You. Some of the fuckups may already have happened and are waiting to pounce on you, the unsuspecting prey. Some of them desecrate your performance. Some of them leak data. Some of them thunderstrike and flabbergast, shattering our mental models.

All of them make for a very bad time.

We’ve structured this post into 10 themes of fuckups plus the singularly horrible fuckup of manual deploys. For your convenience, these themes are linked in the Table of Turmoil below so you can browse between soul-butchering meetings or existential crises. We are not liable for any excess anxiety provoked by reading these dastardly deeds… but we like to think this post will help many mortals avoid pain and pandemonium in the future.

The Table of Turmoil:

Identity Crisis
Loggers and Monitaurs
Playing with Deployment Mismatches
Configuration Tarnation
Statefulness is Hard
Net-not-working
Rolls and Reboots
Disorganized Organization
Business Illogic
The Audacity of Spacetime
Manual Deploys

Identity Crisis

Permissions are perhaps the final boss of Deployment Dark Souls; they are fiddly, easily forgotten, and never forgiven by the universe.

1. Allow all access.

“Allow all access” is simple and makes deployment easy. You’ll never get a permission failure! It makes for infinite possibilities! Even Sonic would wonder at our speed!

A gif of Sonic drawn by someone who really cannot draw well. The effect is humorous. Sonic is running on a road with the caption Gotta go fast.

And indeed, dear reader, what wonder allow * inspires… like a wonder for what services the app actually talks to and what we might need to monitor; a wonder for what data the app actually reads and modifies; a wonder for how many other services could go down if the app misbehaved; and a wonder for exactly how many other teams we might inconvenience during an incident.

Whether for quality’s sake or security’s, we should not trade simplicity today for torment tomorrow.
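One way to keep allow * from sneaking into a deploy is to lint the policy before it ships. Below is a minimal sketch, assuming an AWS-IAM-style policy expressed as a Python dict; the `find_wildcards` helper and the sample policy are hypothetical, so adapt the shape to whatever your policy engine actually uses.

```python
def find_wildcards(policy):
    """Return (statement index, field) pairs for Allow statements that grant '*'."""
    findings = []
    for i, stmt in enumerate(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        for field in ("Action", "Resource"):
            values = stmt.get(field, [])
            if isinstance(values, str):
                values = [values]
            if "*" in values:
                findings.append((i, field))
    return findings


# Hypothetical policy: the first statement is scoped, the second is allow-all.
policy = {
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject"], "Resource": ["arn:aws:s3:::app-assets/*"]},
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
    ]
}

print(find_wildcards(policy))  # [(1, 'Action'), (1, 'Resource')]
```

A check like this only flags the bare wildcard; it says nothing about whether the scoped statements are scoped correctly.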

2. Keys in plaintext.

Key management systems (KMS) are complex and can be ornery. Instead of taming these complex beasts – requiring persistence and perhaps the assistance of Athena herself to ride the beast onward to glory – it can be tempting to store keys in plaintext where they are easily understandable by engineers and operators.

If anything goes wrong, they can simply examine the text with their eyeballs. Unfortunately, attackers also have eyeballs and will be grateful that you have saved them a lot of steps in pwning your prod. And if engineers write the keys down somewhere for manual use in an “emergency” or after they’ve left the company… thoughts and prayers.

3. Keys aren’t in plaintext, they’re just accessible to everyone through the management system.

You’ve already realized storing keys in plaintext is unwise (F.U. #2) and upgraded to a key management system to coordinate the dance of the keys. Now you can rotate keys with ease and have knowledge of when they were used! Alas, no one set up any roles or permissions and so every engineer and operator has access to all of the keys.

At least you now have logs of who accessed which keys so you can see who possibly leaked or misused a key when it happens, right? But how useful are those logs when they are simply a list of employees that are trusted to make deploys or respond to incidents?

4. No authorization on production builds.

The logical conclusion of fully automated deployments is being able to push to production via SCM operations (aka “GitOps”). Someone pushes a branch, automation decides it was a release, and now you have a “fun” incident response surprise party to resolve the accidental deploy.

One option is to enforce sufficient restrictions on who can push to which branches and in what circumstances. Or, you can go on a yearlong silent meditation retreat to cultivate the inner peace necessary to be comfortable with surprise deployments.

The common “mitigation” “plan” is to only hire devs who have a full understanding of how git works, train them properly² on your precise GitOops workflow, and trust that they’ll never make a mistake… but we all know that’s just your lizard brain’s reckless optimism telling logic to stop harshing its vibes. Make it go sun on a rock somewhere instead.

5. Keys in the deploy script… which is checked into GitHub.

Sometimes build tooling is janky and deployment tooling even jankier. After all, you don’t ship this code to users, so it’s okay if it’s less tidy (or so we tell ourselves). Because working with key management systems can be frustrating, it’s tempting to include the keys in the script itself.

Now anyone who has your source code can deploy it as your normal workflow would. Good luck maintaining an accurate history of who deployed what and when, especially when the “who” is the intern who git clone’d the codebase and started “experimenting” with it.
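A small step away from keys-in-the-script is to have the script refuse to run unless a credential is injected at deploy time. This is a sketch under assumptions: the `DEPLOY_KEY` variable name and the `load_deploy_key` helper are hypothetical, and in practice your CI system or key management system would be what populates the environment.

```python
import os


def load_deploy_key(env=os.environ, name="DEPLOY_KEY"):
    """Fetch the deploy key from the environment, failing loudly if it is absent."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; refusing to deploy without a credential")
    return key
```

Nothing secret lives in the repository this way, so the intern who clones it gets a script that cannot deploy anything on its own.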

6. New build uses assets or libraries on a dev’s personal account.

You’ve decided that developers should be free to choose the best libraries and tools necessary to get the job done, and why shouldn’t they? For many, this will be a homegrown monstrosity that has no tests or documentation and is written in their Own Personal Style™. The dev who chose it is the only one who knows how to use it, but it’s convenient for them.

But is it the most convenient choice for everyone else? What about when the employee leaves and shutters their GitHub account? The supply chain attack³ is coming from inside the house!

7. Temporary access granted for the deployment isn’t revoked.

As MilTOR Freedmem quipped years ago, “Nothing is so permanent as a temporary access token.”

The deployment is complicated and automating all of the steps is a lot of work, so the logical path is to deploy the service manually just this once. The next quarter, there’s an incident and to get the system operational again, it’s quickest to let the team lead log in and manually repair it.

But after the access is added, it’s all too easy to overlook removing the access. Employees would never take shortcuts or abuse their access, right? And their accounts or devices could never be compromised by attackers, right?
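One way to make "temporary" mean temporary is to give every grant an expiry at creation time, so forgetting to revoke it is harmless. The sketch below is illustrative only; the `Grant` type, the role names, and the four-hour window are assumptions, not any particular access system's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    user: str
    role: str
    expires_at: datetime  # every grant carries its own end date


def grant_temporary(user, role, ttl, now):
    """Create a grant that stops working once `ttl` has elapsed."""
    return Grant(user, role, now + ttl)


def is_active(grant, now):
    return now < grant.expires_at


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
grant = grant_temporary("team-lead", "prod-admin", timedelta(hours=4), now)
print(is_active(grant, now + timedelta(hours=1)))  # True
print(is_active(grant, now + timedelta(days=90)))  # False
```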

8. Former employees can still deploy.

Leadership claims your onboarding and offboarding checklists are exhaustive and followed perfectly every time. And, indeed, your resilience and security goals rely on them being followed perfectly. A safety job well done! No one will be able to deploy your application after they’ve put in notice!

What’s that? That wasn’t part of your checklist? Or did you skip over that item because rotating the keys is too hard when the employees who quit are too essential and baked into too many systems?

You’ve replaced those keys but they aren’t destroyed and aren’t revoked and don’t expire, so your only hope now is the org didn’t piss off the employees enough for them to YOLO rage around in prod. Sure, former employees have always expressed goodwill towards your company and no one has ever left disgruntled… but would you bet on that staying true?

9. Your app uses credentials associated with the account of the employee you just fired.

Credentials aren’t just something engineers and operators share among themselves. If you’re extra lucky, they’ll bake their own credentials into the software or services, and when they leave or transfer to a new department, the system will fail as soon as their permissions are revoked. Maybe sharing isn’t caring.

10. Migrating to a new login token format forces everyone to log in again.

Some businesses run on engagement. The more users interact with the platform, the more they induce others to interact, which means more advertising messages you can show them with a more precise understanding of what they might buy. Teams track engagement metrics closely and every little design change is justified or rescinded by how it performs on these metrics. It’s a merry-go-round of incentives and dark patterns.

But one day you migrate to a new login token format or seed, forcing everyone to log in again and the metrics are fucked because many users don’t want to go to the trouble. Those fantastic growth numbers you hoped would bolster your company’s next VC round no longer exist because you broke the cycle of engagement addiction.

Loggers and Monitaurs

Logging and monitoring are essential, which is why getting them wrong wounds us like a Minotaur’s horn through the heart.

11. Logs on full blast.

Systems are hard to analyze without breadcrumbs describing what happened, so logging is an essential quality of an observable system.

Ever-lurking in engineering teams is the natural temptation to log more things. You might need some information in a scenario you haven’t thought of yet, so why not log it? It will be behind the debug level anyway, so it does no harm in production…

…until someone needs to debug a live instance and turns the logging up to 11. Now the system is bogged down by a deluge of logging messages full of references to internal operations, data structures, and other minutiae. The poor soul tasked with understanding the system is looking for hay in a needlestack.

Worse, someone could enable debugging in pre-production where traffic isn’t as high⁴ and not notice before deploying to the live environment. Now all your production machines are printing logs with CVS receipt-levels of waste, potentially flooding your logging system. If you’re extra unlucky, some of your shared logging infrastructure is taken offline and multiple teams must declare an incident.
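A guard at startup can stop a debug setting from riding along into production. This is a minimal sketch, assuming the service knows its environment by name and that "prod" is what you call the live one; `resolve_log_level` is a hypothetical helper.

```python
import logging

LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def resolve_log_level(requested, environment):
    """Clamp the log level in production so DEBUG cannot ship by accident."""
    try:
        level = LEVELS[requested.upper()]
    except KeyError:
        raise ValueError(f"unknown log level: {requested!r}") from None
    if environment == "prod" and level < logging.INFO:
        return logging.INFO
    return level
```

Whether to clamp silently or refuse to start is a judgment call; refusing is noisier, but at least it makes the misconfiguration impossible to miss.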

12. Logs on no blast.

Who doesn’t want peace and quiet? But when logs are quiet, the peace is potentially artificial.

Logs could be configured to the wrong endpoint or fail to write for whatever reason; you wouldn’t even be aware of it because the error message is in the logs that you aren’t receiving. Logs could also be turned off; maybe that’s an option for performance testing⁵.

Either way, you better hope that the system is performing properly and that you planned adequate capacity. Because if the system ever runs hot or hits a bottleneck, it has no way of telling you.

13. Logs being sent nowhere.

Your log pipelines were set up years ago by employees long gone. Also long gone is the SIEM to which logs were being sent. Years go by, an incident happens, and during investigation you realize this fatal mistake. Your only recourse is locally-saved logs, which, for capacity reasons, are woefully itsy bitsy and you are the spider stuck in a spout, awash in your own tears.

A meme from a Wonder Woman movie. In the first panel, she uses her lasso to constrain a man; she says: the lasso of Hestia compels you to tell the truth. In the second panel, the man grins; he says: our logs aren’t being sent anywhere. In reaction, in the third panel, Wonder Woman looks thoroughly disturbed.

14. Canary is dead, but you didn’t realize it so you deployed to all the servers anyway and caused downtime.

You’ve been doing this DevOps thing awhile and have a mature process that involves canary deployments to ensure even failed updates won’t incur downtime for users. Deployments are routine and refined to a science. Uptime clearly matters to you. Only this time, the canary fails in a way that your process fails to notice.

An alternative scenario is that some part of the process wasn’t followed and a dead canary is overlooked. You miss the corpse that is your new version and kill the entire flock.

Having a process and system in place to prevent failure and then completely ignoring it and failing anyway likely deserves its own achievement award. Do you need a better process, or do you need to fix the tools? How can you avoid this in the future? This will be furiously debated in the post-mortem, which, if blameful rather than blameless, will likely result in this failure repeating within the next year.
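One failure mode worth encoding in the tooling is the canary that reports nothing at all. A sketch of a promotion gate that treats silence as death follows; the thresholds and the `canary_is_healthy` helper are hypothetical, and real gates usually compare against a baseline rather than a fixed error rate.

```python
def canary_is_healthy(samples, min_samples=20, max_error_rate=0.01):
    """Decide whether to promote a canary from a list of request outcomes.

    `samples` is a list of booleans, True meaning the request succeeded.
    A canary that reports too few samples is treated as dead, not healthy.
    """
    if len(samples) < min_samples:
        return False
    error_rate = samples.count(False) / len(samples)
    return error_rate <= max_error_rate


print(canary_is_healthy([]))                          # False: no data is not good news
print(canary_is_healthy([True] * 100))                # True
print(canary_is_healthy([True] * 90 + [False] * 10))  # False
```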

15. System fails silently.

A system is crying out for help. Its calls are sent into the cold, uncaring void. Its lamentable fate is only discovered months later when a downstream system goes haywire or a customer complains about suspiciously missing data.

“How could it be failing for so long?” you wonder as you stand it back up before adding a “please monitor this” ticket to the team’s backlog that they’ll definitely, totes for sure get to in the next sprint.

16. New version puts sensitive data in logs.

Yay, the new version of the service writes more log data to make it easier to operate, monitor, and debug the service should something ever go wrong! But, there’s a catch: some of the new log messages include sensitive data such as passwords or credit card details. This may not even be purposeful. Perhaps it logs the contents of the incoming request when a particular logging mode is enabled.

Unfortunately, there are very specific rules that businesses of your type must follow when handling certain types of data and your logging pipeline doesn’t follow any of them. Now your near-term plans are decimated by the effort to clean up or redact logs that you otherwise wouldn’t have to if the engineer that added that logging knew about the data handling requirements. By the way, the IPO is in a few months. XOXO.
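A last line of defense is to scrub obvious secrets before a record reaches any handler. The sketch below uses Python's standard `logging.Filter`; the two patterns are assumptions about what your sensitive data looks like (13 to 16 digit card numbers, and `password=<value>` pairs), and pattern-based redaction will never catch everything.

```python
import logging
import re

# Assumption: card numbers are 13 to 16 digit runs, and passwords show up as "password=<value>".
CARD_RE = re.compile(r"\b\d{13,16}\b")
PASSWORD_RE = re.compile(r"(password=)\S+", re.IGNORECASE)


class RedactingFilter(logging.Filter):
    """Scrub obvious secrets from a record before it is emitted."""

    def filter(self, record):
        message = record.getMessage()  # message with its arguments already merged in
        message = CARD_RE.sub("[REDACTED CARD]", message)
        message = PASSWORD_RE.sub(r"\1[REDACTED]", message)
        record.msg, record.args = message, None
        return True  # keep the record, now sanitized
```

Attaching the filter to each handler (`handler.addFilter(RedactingFilter())`) is generally safer than attaching it to a logger, since a logger's filters are not applied to records propagated up from its child loggers.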

Playing with Deployment Mismatches

There were assumptions about what you deployed and those assumptions were wrong.

17. What you deployed wasn’t actually what you tested.

Builds are automated and we tested the output of the previous build, so what’s the harm of rebuilding as part of the deployment process? Not so fast.

Unless your build is reproducible, the results you receive may be somewhat different. Dependencies may have been updated. Docker caching may give you a newer (or older, surprisingly!) base image⁶. Even something as simple as changing the order of linked libraries⁷ could result in software that differs from what was tested.

Configurations fall prey to this, too. “Well, it works with allow-all!” Right, but it doesn’t work in production because the security policy is different in pre-prod. Or, the new version requires additional permissions or resources which were configured manually in the test environment… but, cranial working memory is terribly finite, and thus they were forgotten in prod.

There are numerous solutions to this problem (like reproducible builds or asset archiving), but you may not bother to employ them until a broken production deploy prompts you to. And some of the solutions descend into a stupid sort of fatalism: “If we don’t have fully reproducible builds, we don’t have anything, there’s no point to any of this.” And then Nietzsche rolls in his grave.
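Short of fully reproducible builds, a cheap version of asset archiving is to record the digest of the artifact you tested and refuse to deploy anything else. The following is a sketch, with `sha256_of` and `assert_same_artifact` as hypothetical helpers; where the tested digest is stored is left to your pipeline.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assert_same_artifact(path, tested_digest):
    """Raise unless the file at `path` is byte-for-byte what was tested."""
    actual = sha256_of(path)
    if actual != tested_digest:
        raise RuntimeError(f"refusing to deploy {path}: digest {actual} was never tested")
```

This does not make the build reproducible; it only guarantees that the bytes going to production are the bytes that went through test.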

18. Not testing important error paths.

We have to move fast. New features. Tickets. Story points. Ship, ship, ship. Developers with the velocity of a projectile. Errors? Bah, log them and move on.

If something is incorrect, surely it will be noticed in test or be reported by users – spoken by someone who has never faced an angry customer because their data was leaked or discovered their lowest rung employee fuming with resentment when they see the company’s fat margins.

Alas, too often we see a new version which forgets to check auth cookies, roles, groups, and so forth because devs test it as admin with the premium enterprise plan, but forget that lowly regular users on the free tier can’t do and see everything.

19. Untested upgrade path.

Your infrastructure is declarative, but the world is not. The app works in isolation, but doesn’t accept the data from the previous version or behaves weirdly when faced with it.

Possibly the schema has changed, but the migration path for existing data (like user records) was never tested. You didn’t test it because you recreated your environment each time. The new version no longer preserves the same invariants as the old version and you watch in horror as other components in the system topple one by one.

Possibly you’re using a NoSQL database or some other data store for which there isn’t a schema and now the work of data migration falls on the application… but no one designed or tested for that.

Or, maybe you’re pushing a large number of updates to a rarely used part of your networking stack. For those that are all-in on infrastructure as code (IaC), supporting old schema, data, and user sessions can be a thorny problem.

20. “It’s a configuration change, so there’s no need to test.”

A shocking number of outages spawn from what is, in theory, a simple little configuration change. “How much could one configuration change break, Michael?”

Many teams overlook just how much damage configuration changes can engender. Configuration is the essential connective tissue between services and just about anything that can be configured can cause breakage when misconfigured.

21. Deployment was untested because the fix is urgent.

The clock is ticking and sweat is sopping your brow. Something must be done to avoid an outage or data loss or some other negative consequence. This fix is at least something and this something seems like it should work⁸. You deploy it now because time is of the essence. It fails and you now have less time or have caused more mess to clean up.

Only in hindsight do you realize a better option was available. Or, maybe the option you chose was the best one, but you made a small mistake. Was the haste worth it?

Urgency changes your decision-making. It’s a well-intentioned evolutionary design of your brain that causes unfortunate side effects when dealing with computer systems. In fact, “urgency” could probably be its own macro class of deploy fails given its prevalence as a factor in them.

22. App wasn’t tested on realistic data.

“Everything works in staging! How could it have failed when we pushed it live? I thought we did everything right by testing the schema migration with our test data and load testing the new version.”

Narrator: The software engineer is in their natural habitat. Observe how they pull at their own hair, a hallmark of their species to signal that something has distressed them. It is very difficult to replicate everything that’s happening in production in an artificial test environment without some sort of replication or replay system. This vexes our otherwise clever engineer.

“It causes a crash!? What kind of deranged mortal would have an apostrophe in their name? Oh, it’s common in some cultures? Hmmm…”

If you keep your service online as you deploy, you should really test your upgrade path under simulated load. If you don’t, you can’t be sure if your planned upgrade process will work or how long it will take.
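The post doesn't say what made the apostrophe fatal, but one common culprit is SQL assembled by string interpolation. The sketch below uses an in-memory SQLite database and a hypothetical user name to show the failure and the parameterized alternative; treat it as an illustration of why realistic test data matters, not a diagnosis of the incident above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

name = "Miles O'Brien"  # hypothetical user with a perfectly ordinary name

# String interpolation breaks on the apostrophe (and invites SQL injection).
try:
    conn.execute(f"INSERT INTO users (name) VALUES ('{name}')")
except sqlite3.OperationalError as exc:
    print("interpolated query failed:", exc)

# A parameterized query hands the value to the driver, apostrophe and all.
conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
print(conn.execute("SELECT name FROM users").fetchall())
```

Test data made entirely of "Alice" and "Bob" would sail through the interpolated version without a hint of trouble.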

23. Deploying to the wrong environment accidentally.

When you make deployments easy, it is possible to make deploying to prod too easy. And easy to use doesn’t necessarily mean easy to understand. When a slip of the finger results in code going live, you may want to consider just how far you’ve taken automation and if other parts of your process need to catch up.

Because one day, a sleep-deprived Future You is going to run a deploy script where you have to pass in an environment name and you will type dve instead of dev. Once it dawns on you that the deploy system falls back to “prod” as the default, adrenaline shocks you awake with the force of 9000 espressos and you will never sleep again.

The regrettable reality is that internal tools often offer terrible UX because engineers refuse to give themselves nice things (including therapy). These tools, akin to a rusty sword with no hilt, make these sorts of failures tragically common. The rise of platform engineering is hopefully an antidote to this phenomenon, treating software engineers as user personas worthy of UX investments, too.
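The dve-instead-of-dev scenario above comes down to a tool that guesses when it should refuse. Here is a minimal sketch of the stricter behavior, assuming three hypothetical environment names: there is no default, and anything that is not an exact match is an error.

```python
ENVIRONMENTS = ("dev", "staging", "prod")  # hypothetical environment names


def resolve_environment(name):
    """Return the environment name only if it is an exact, known match."""
    if name not in ENVIRONMENTS:
        raise ValueError(f"unknown environment {name!r}; expected one of {ENVIRONMENTS}")
    return name
```

A deploy script built on this would fail on `dve` before touching anything, which costs a few seconds of retyping rather than a night of sleep.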

24. No real pre-production environment.

You have a staging environment (congrats!), but it’s an ancient clone from production which has seen so many failed builds, bizarre testing runs, and manual configs that it bears only a pale resemblance to the system it’s supposed to epitomize. It gives you confidence that your software could deploy successfully, but not much else.

You wish you could tear it down and rebuild it anew, but everyone’s busy and it’s never quite important enough for someone to start working on it rather than some other task. Thus you’re doomed to clean up small messes that could be caught by a true staging environment.

At the next DevOps conference you attend, every keynote speaker refers to the “fact” that “everyone” has a “high-fidelity” staging environment (“obviously”) as you weep in silence.

25. Production OS has a different version than pre-prod OS and the app fails to start.

Production systems are incredibly important and we must patch frequently to keep them in compliance. But the same diligence isn’t applied to pre-prod, development, build and other environments.

The systems in these environments may therefore be wildly out of date and the software they produce may be incompatible with the up-to-date, patched production system. Systems will drift so far from the standard that QA systems look like an alternate reality from production and make you a believer in the multiverse hypothesis.

26. Backup botch-ups.

A production deploy requires a backup because hot damn have we fucked it up so many times and a backup makes everyone feel more confident. The administrator responsible for performing the backup writes the backup over the live system, causing an outage. Furthermore, because the data was overwritten by the botched backup, any existing backups are not recent.

Backup fuckups happen more than anyone admits and when they go down, they go down hard. Recovering from them is rough because no one thinks it will happen to them.

Lesser failures in this category include saturating the disk or network IO of the host taking the backup or filling the disk – each perfectly capable of causing an outage, too.

27. Audit logs are turned off.

Audit logs are accidentally turned off during a configuration change or as part of a software upgrade and now the system is out of compliance. No one notices until the auditors ask for the audit logs months later and a wave of panic ripples through the teams involved.

Will we fail the audit? Will customers drop us? How much revenue is impacted? Will we still get raises at our quarterly review? Will I have to switch to getting artisanal roasted bean elixirs every other day?

28. Iceberg dependencies.

Only the simplest services run entirely isolated without any other dependencies. When done right, dependencies are properly documented and the infrastructure dependencies of each component are clear. Even better, the dependencies are specified declaratively, rendering it impossible for the human-generated documentation to drift from the machine specification.

But in less auspicious cases, the dependencies are hazy and can even form chains which loop back on themselves like a branching ouroboros eating its own rotting tails. Debugging a production incident for a system with unknown dependencies is software archeology where the only treasure is tears.

The infrastructure upgrade toppled some of the apps and services running on top of it, but the people deploying the upgrade lack context on those casualties and you all wonder when the Jigsaw puppet will come into view and reveal this has all been a grand experiment to pit you against each other.

“We upgraded the OS, clearly everything will be fine!” My brother in christ your system fetches Kerberos creds automatically on boot, but your first boot on a fresh host fails because the Kerberos fetch infra depends on a QA host that was decommed 6 months ago!

And then there’s the ultimate iceberg dependency: DNS. If DNS is borked or misconfigured, all sorts of thorny problems can emerge.

29. Enabling a new feature in vendor software without load testing it.

Vendors make all sorts of claims about the behavior of their wares. It’s fast and stable. It migrates its data format. It slices and dices. It follows semver. Should you believe them? In a word, no.

Configuration Tarnation

Playing god with your environments does not always result in intelligent design.

30. Per-environment configuration isn’t updated ahead of the deploy.

Per-environment configuration is a fact of life. Hostnames, instance counts, and other configuration settings will be necessarily different between environments. Keeping these up to date can be a challenge and it’s all too easy to overlook updating the production template when new configs must be added.

New configuration values are often copied from a staging template into the production one without appropriate adjustments like switching the hostname. You will wonder which evil eldritch god you pissed off when deploying to prod takes down both the production and staging environments. This is so frequent and yet! and yet.
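Both mistakes described above (a key missing from the production template, and a staging hostname copied into it) can be caught by comparing the two configs before the deploy. This is a sketch under assumptions: configs are flat dicts, host settings use keys ending in `_host`, and `check_prod_config` is a hypothetical helper.

```python
def check_prod_config(staging, prod, shared_ok=frozenset()):
    """Return a list of problems found when comparing prod config to staging.

    `shared_ok` names host keys that are allowed to be identical in both.
    """
    problems = []
    for key in sorted(staging.keys() - prod.keys()):
        problems.append(f"missing in prod: {key}")
    for key in sorted(staging.keys() & prod.keys()):
        if key.endswith("_host") and prod[key] == staging[key] and key not in shared_ok:
            problems.append(f"prod still points at staging: {key}")
    return problems


staging = {"db_host": "db.staging.internal", "cache_host": "cache.staging.internal", "workers": 2}
prod = {"db_host": "db.prod.internal", "cache_host": "cache.staging.internal"}
print(check_prod_config(staging, prod))
# ['missing in prod: workers', 'prod still points at staging: cache_host']
```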

31. Deployed new configuration, but forgot to restart the associated services.

Deploying a configuration change is easy: apply the configuration, restart the service. You might think it should be easy to remember the steps when there’s only two of them, but it’s easy to overlook for quick deployments. Only later do you realize you set a new config variable in prod without applying it to the prod instances.

Design patterns like the D.I.E. triad can help — there’s no way for infrastructure to drift if it’s redeployed from scratch on each deployment. And, of course, automated deployments can help, too.

32. Feature flag fuckups.

Feature flags are a simple and amazing way to explode the number of system states you must test. N flags make for 2^N combinations. Are all of them tested? Are you sure they’re all set correctly? Do the people who test your application have the same flags as the unwashed masses? Are there old feature flags in your app that should be retired? What could happen if they were activated mistakenly? (just ask Knight Capital).
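To make the 2^N arithmetic concrete, here is a short sketch that enumerates every on/off combination for a handful of flags. The flag names are made up for illustration.

```python
from itertools import product

flags = ["new_checkout", "dark_mode", "beta_search"]  # hypothetical flag names

# Every assignment of True/False to every flag.
combinations = [dict(zip(flags, values)) for values in product([False, True], repeat=len(flags))]

print(len(combinations))  # 8, i.e. 2**3
print(2 ** 20)            # 1048576 states once you reach 20 flags
```

Three flags are testable exhaustively; twenty are not, which is one argument for retiring old flags promptly.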

Maybe you push a release before the company holiday party and deploy the entire release successfully… until, a few intoxicants in, you realize you forgot to flip the feature flag and now you’re crying in the bar’s bathroom shakily singing along to Mariah Carey (though you suspect the “baby” she desperately wants for xmas isn’t a feature flag).

It’s also possible that you do the exact opposite and flip the flag too soon. Maybe a new product is accidentally announced early, deflating all the carefully constructed marketing plans leading up to the company conference and leading customers to ask why the new feature is “broken.” You had just regained the respect of the customer support team too…

Perhaps the new feature simply uses too many resources and you haven’t scaled your infrastructure appropriately. Or maybe the freemium gate is broken and everyone gets access to premium features. Good luck explaining to customers why you now have to take away their new shiny feature unless they pay up for it…

33. Delayed failures.

Faulty configuration may not necessarily cause failures immediately. Only after you do some other, seemingly unrelated operation does the fault cause any symptoms. As in medicine, it can be difficult to untangle exactly which faults cause which symptoms.

For systems like load balancers or orchestrators, a bad configuration can remain in place and as long as the system is stable, the misconfiguration will cause no ill effects. But one day when you decommission a cluster as planned, another cluster immediately shits itself – suffering a total outage baffling everyone – and only after many painful hours of debugging do you realize its healthchecking was configured against the one you decommed.

If the team owning that other cluster has poor monitoring hygiene, they may only discover their service is dead much later. But the outage gods care not for your mortal troubles and will do nothing to ease the pain of what is now a multi-day incident all due to faulty health checks.

34. What lies beneath.

The layers far underneath your application can still cause your deployment to fail.

Orchestrator fails? Your service is dead. Operating system fails? Dead. Disk controller fails? Dead. BGP? Dead. DNS? Dead. Backhoe cuts the backbone to your sole datacentre? Dead. NVMe subtly violates DMA protocol? Dead. NIC driver fails or goes rogue? Dead. Baseboard management controller borks? Dead. Deploy a bunch of new machines into a cluster with a bad BIOS? This may shock you, but: dead.

35. Scheduled failures.

Deployments may appear to succeed only to fail hours or days later if you have periodic background jobs or the ability to schedule tasks. The deployment isn’t successful until these jobs and tasks run successfully.

Perhaps you deploy a busted systemd timer which causes all your nodes to self-destruct after 8 hours… and only discover this “fun” fact after you deploy to your first tranche in prod. See also: the dreaded slow memory leak.

Another variant is the odd date/time bug which causes the application to malfunction only in leap years or when daylight saving time changes. If you’re not swift with your incident response, the incident resolves itself and you’re left scratching your heads until someone realizes it’s because the clocks rolled back.

Do you bother fixing the bug? Or do you hope to find another job before the next orbital period elapses?
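For a concrete instance of the leap-year variety, consider code that computes "one year from today" by bumping the year. It works for 1,460 days in a row and then throws on 29 February. The sketch below shows the naive version and one possible fix; falling back to 28 February is a design choice, not the only reasonable answer.

```python
from datetime import date


def naive_one_year_later(d):
    return d.replace(year=d.year + 1)  # raises ValueError on 29 February


def one_year_later(d):
    """Fall back to 28 February when the target year has no 29 February."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


print(one_year_later(date(2024, 2, 29)))  # 2025-02-28
```

Running the test suite with a pinned "today" of 29 February is a cheap way to find these before the calendar does.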

36. Accidentally push components beyond their limits.

Components may have poorly documented or undocumented limitations or may simply become unusably slow when assigned more work than they were designed for. Does your database have a limit on the number of connections? Better not scale the number of clients beyond that number, then!

Is it a deployment failure? Yes, if a deployment pushes the system beyond its limits, which is more likely to happen when you add new v2 replicas before retiring old v1 replicas.
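The arithmetic behind this is worth doing before the rollout rather than during it. Here is a sketch with made-up numbers: eight replicas, ten pooled connections each, and a database that accepts at most 100 connections.

```python
def peak_connections(old_replicas, new_replicas, pool_size):
    """Connections held while old and new replicas run side by side."""
    return (old_replicas + new_replicas) * pool_size


max_connections = 100  # hypothetical database limit
steady = peak_connections(8, 0, 10)
rollout = peak_connections(8, 8, 10)
print(steady, steady <= max_connections)    # 80 True
print(rollout, rollout <= max_connections)  # 160 False
```

The steady state fits comfortably; the moment both versions are up at once does not.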

In the microservices world, this can manifest as running so many k8s jobs without deleting them that all the k8s operations on jobs begin taking tens of seconds because the cluster is bogged down with so much cluster metadata. Is the inevitable conclusion of microservices simply more microservice instances and metadata than actual work and user data? Makes u think.

37. Builds always use the latest version of a library.

Some well-meaning person may decide that builds for a piece of software always use the latest version of its dependencies. This ensures that whenever you release, you always have the latest security patches.

This sounds wise until one of the dependencies causes a subtle API breakage and your app fails to function. Or, any of your dependencies’ authors could decide “fuck this, I’m not maintaining this open source project anymore and giving corporations free labor” and push a dead version of a package.

A photograph of a ginger cat sauntering away from a large fire behind it. The cat is labeled: random dev in Nebraska pushing a dead version of their open-source package. The fire is labeled: all modern digital infra.

Now you’re unable to build new versions of the app until someone resolves the dependency situation. Worse, if that fed-up developer has pulled their old versions out of spite or frustration with the pain of maintaining OSS, and if you haven’t archived builds of old versions, then you may not even be able to deploy at all. And this is how you end up cursing a random dev you hadn’t even heard of until just now when you should be taking your lunch break.
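One guard against the always-latest habit is to fail the build when a dependency is not pinned to an exact version. The sketch below assumes a pip-style requirements file where pins use `==`; `unpinned` is a hypothetical helper and ignores niceties such as `-r` includes or hashes.

```python
def unpinned(requirements_text):
    """Return requirement lines that do not pin an exact version with '=='."""
    loose = []
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and "==" not in line:
            loose.append(line)
    return loose


example = """
requests==2.31.0
flask>=2.0        # floats to whatever is newest
left-pad
"""
print(unpinned(example))  # ['flask>=2.0', 'left-pad']
```

Pinning protects you from surprise upgrades; it does not protect you from an author pulling old versions, which is what archiving your builds and dependencies is for.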

Statefulness is Hard

Mere mortals cannot maintain accurate mental models of data in distributed systems. Even the divines struggle.

38. An irreversible process fails part way through.

Some irreversible process fails part way through your deploy. Possibly it was a migration or some other critical step during your deployment. For whatever reason, this step didn’t happen when deploying to the other environments; it only happened in the one environment that matters most.

What state is the system actually in? Should you roll back? If you try to roll back, will it even work? You’re in uncharted waters under shrouded stars.

Data migrations are often a one-way process. Have you tried migrating all of your existing data to see what happens? How long does it take? Do you have backups? Could you even use the backups, or would restoring result in yet more downtime?

If you don’t know the answers to these questions, you might find yourself deploying an ORM/data model layer which automatically migrates read-only database values to a new format and somehow corrupts the records, resulting in you frantically trying to patch and deploy a fix before too much of your DB becomes unreadable.

Or perhaps you set --timeout 10 on your ORM migration with the innocent assumption that “10” here refers to seconds. It’s 10 milliseconds. There are no down migrations. And migrations can be arbitrary JS and therefore not guaranteed to be atomic or idempotent and now you’ve started a slow-motion train crash that you cannot stop. One hour of scheduled downtime becomes 18 hours. Your youth and zeal are irreversibly drained.
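One way to avoid guessing units is to never pass a bare number around. A minimal sketch, assuming the tool ultimately wants integer milliseconds; `to_millis` and `MIGRATION_TIMEOUT` are hypothetical names, not part of any ORM.

```python
from datetime import timedelta


def to_millis(timeout):
    """Convert an explicit timedelta into whole milliseconds."""
    if not isinstance(timeout, timedelta):
        raise TypeError("pass a timedelta so nobody has to guess the unit")
    return timeout // timedelta(milliseconds=1)


# The unit is spelled out where the value is defined, not inferred at the call site.
MIGRATION_TIMEOUT = timedelta(seconds=10)

if __name__ == "__main__":
    print(to_millis(MIGRATION_TIMEOUT))  # 10000
```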

39. Distributed data vore.

Distributed storage / database systems require careful understanding of their operational characteristics if you are to operate them safely. They can be used to achieve better uptime, reliability, and possibly even lower latency if operated within their safety margins… but they also require more care and feeding than traditional databases with an authoritative primary and can be quite temperamental.

If operated incorrectly, distributed storage systems can silently lose data or disagree on the data they contain if nodes aren’t retired correctly or if an insufficient number of nodes remain healthy. Do you know enough about your data storage layers to operate them safely? Or when you next roll-reboot your Elasticsearch cluster will it silently eat 30% of your data for seemingly no reason at all? The customers now complaining that all their graphs are 30% too low are certainly not silent.

When you deployed new database nodes to prod, did you assume the cluster would rebalance on them? Oopsies, it didn’t! And thus when you decommissioned the old nodes, you destroyed 99% of your data in the process. There are not enough oofs in this universe to reflect this oofiness.

40. Cache is an unhealthy monarchy.

If caches aren’t healthy, rolling restart instructions aren’t followed or are insufficient and the system fails to start.

Where to begin? Let’s start with why caches exist in the first place: to avoid repeated execution of expensive computations by storing a mapping between inputs and their results in memory (aka “caching them”). Caches will typically discard infrequently-used results automatically to make space for frequently-used results, and can be asked to drop any results that are no longer valid. How can this go wrong? Oh so many ways!

First off, just like a database schema, the format of data in the cache might not be compatible with the new version of the app. Similarly, when there is more than one application instance, the old version of the app will run alongside the new version and could see cache entries from its successors. This can cause problems where either the old version or the new version of the app could malfunction from improper data. Deploying a single canary can cause every instance of the other version to fail.

Have your engineers thought about cross-version compatibility? Do they reject linear notions of spacetime and thus believe compatibility is a blasphemous act against the holographic principle? “Spacetime is just an abstraction,” they tell you coolly while sipping their matcha latte.⁹ You are tempted to remind them that money is also an abstraction and therefore they should abstain from it, too, but it’s faster if you just fix it yourself.
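One common way to survive mixed versions is to wrap every cached value in an envelope that records its format version and to treat anything unfamiliar as a miss. This is a sketch under assumptions: JSON-serializable values, a hypothetical `SCHEMA_VERSION` constant, and made-up helper names.

```python
import json

SCHEMA_VERSION = 2  # hypothetical: bump whenever the cached shape changes


def encode_entry(value):
    """Wrap a value with the format version that wrote it."""
    return json.dumps({"v": SCHEMA_VERSION, "data": value})


def decode_entry(raw):
    """Return the cached value, or None (a miss) if the entry is unreadable or from another version."""
    try:
        envelope = json.loads(raw)
    except (TypeError, ValueError):
        return None
    if not isinstance(envelope, dict) or envelope.get("v") != SCHEMA_VERSION:
        return None
    return envelope.get("data")
```

The trade-off is that a legitimately cached `None` is indistinguishable from a miss here, and every mismatched entry becomes extra work for the backend, which is where the rest of this fuckup picks up.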

Second, the keys might change. If version A of an app uses one nomenclature for keys but its successor (version B) uses another, version B will operate as if the cache is empty. The app now must perform much more work to populate the cache in the new format. If both versions of the app are running simultaneously, they will fight for space in the cache – and the cache is limited in how much data it can hold by necessity. Now the cache has a lower hit ratio and more requests must go through the more costly “uncached” path.

Third, a common ReCoMmEnDaTiOn is to flush caches when deploying new versions of software (“it’s a caching issue, clear your browser cache” said the frontend dev to the product manager as the PM rolled their eyes). This can be dangerous when using a shared cache since so much extra work must now be performed with every request.

With healthy cache hit ratios commonly being in the 90% range for some workloads, that means the part of the application beyond the cache must handle ten times the throughput until the cache is rebuilt. Could you handle a sudden 10x increase in your workload?
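The 10x figure falls out of simple arithmetic: at a hit ratio h, only a fraction 1 - h of requests reach the backend, so an empty cache multiplies backend load by 1 / (1 - h). A small sketch, assuming request volume stays constant and every request misses right after the flush; the function name is hypothetical.

```python
def origin_load_multiplier(hit_ratio):
    """How many times more traffic the backend sees when the cache is suddenly empty."""
    if not 0 <= hit_ratio < 1:
        raise ValueError("hit_ratio must be in [0, 1)")
    return 1 / (1 - hit_ratio)


if __name__ == "__main__":
    print(round(origin_load_multiplier(0.90), 2))  # 10.0
    print(round(origin_load_multiplier(0.99), 2))  # 100.0
```

Note how unkind the curve is: the better your cache performs on a normal day, the harder the flush hits.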

Net-not-working

We make piles of thinking sand talk to each other through light and wonder why weird shit happens.

41. Accidental self-DoS.

The accidental self-DoS could be due to many reasons. Maybe new versions of the application inhibit the CDN’s ability to cache, but this non-functional requirement wasn’t recorded anywhere. Maybe a new analytics feature inundates the application backend with data collected to appease the whims of product management. Maybe a new retry mechanism is being used for failed requests, causing traffic amplification if the backend becomes even a little sluggish.

The end result is the same: the new version of the app swamps the backend service and causes downtime. Engineers tirelessly work to restore service by standing up more instances or filtering the unnecessary traffic the application created for itself.

You ask your devs what happened and they say, “Well, it didn’t work with the CDN so we added cache-busting headers to make it work.” You nod quietly while gazing into the abyss.

42. Poorly configured caching.

The previous version of the app configured common static assets with a long cache duration. This caches the asset for long periods of time in CDNs and in users’ browsers. Fabulous! The app loads more quickly for users, especially those that visit frequently.

You build a new version of the app with new cached assets. The new version looks great in staging and dev, where testers are unlikely to have stale cached assets. But when you deploy it to production, you receive reports from your most fervent supporters that the app “looks weird.” It’s a Frankenstein’s monster mismatch of static assets from the old and new versions and behaves unpredictably.

Before enough understanding of what has happened filters through to the development team, all of the stale caches expire and the dev marks the JIRA ticket closed. The issue repeats again when you release the next minor redesign.

Due to the nature of CDNs and prod websites, there’s a category of people for which this is a persistent problem and they should be able to fix it… and yet can’t. The entirely avoidable fuckup is a formidable beast.
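A widely used fix is to put a hash of the file's content in its name, so a changed asset gets a new URL and a stale copy can never be mistaken for it. The sketch below assumes POSIX-style asset paths; `fingerprinted_name` and the 12-character digest length are illustrative choices, not any bundler's actual behavior.

```python
import hashlib
from pathlib import PurePosixPath


def fingerprinted_name(path, content):
    """Return the asset path with a short content hash inserted before the extension."""
    digest = hashlib.sha256(content).hexdigest()[:12]
    p = PurePosixPath(path)
    return str(p.with_name(f"{p.stem}.{digest}{p.suffix}"))


if __name__ == "__main__":
    # Different bytes produce a different file name, so long cache durations stay safe.
    print(fingerprinted_name("static/app.css", b"body { color: black; }"))
    print(fingerprinted_name("static/app.css", b"body { color: hotpink; }"))
```

The HTML that references these names still needs a short cache duration, otherwise the stale page keeps pointing at the stale assets.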

43. A hurricane of reconnections foments a flash flood.

You disconnect clients simultaneously during your deployment, leading to them all trying to reconnect simultaneously shortly thereafter. Your system was never designed to handle a flash flood of connections, so it stays down until it’s scaled manually well beyond what it was originally budgeted for.

Someone throws a ticket to add exponential backoff with randomization to the bottom of the client team’s backlog. Years pass and it happens again as their backlog only grows.
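For the record, that backlog ticket is not a lot of code. Here is a minimal sketch of exponential backoff with full jitter; the base delay, the cap, and the function name are assumptions for illustration.

```python
import random


def backoff_delay(attempt, base=0.5, cap=60.0, rng=random):
    """Seconds to wait before reconnect attempt `attempt` (0-based), using full jitter."""
    ceiling = min(cap, base * (2 ** attempt))  # exponential growth, bounded by the cap
    return rng.uniform(0, ceiling)             # randomize so clients do not reconnect in lockstep


if __name__ == "__main__":
    for attempt in range(5):
        print(attempt, round(backoff_delay(attempt), 2))
```

The randomization is the important part: backoff without jitter only reschedules the flash flood for slightly later.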

44. DoS yourself via CDN purge.

Purging a CDN with a cache hit ratio of 90% results in an immediate 10x throughput increase to the origin. Did you deploy the required additional capacity?

It’s such an easy button to press, too. Some CDNs don’t put a glass case around the button nor require administrator permission to press it. Pressing it immediately grants you the rank of “rogue developer” and now you’ve given your security team a reason to require ten more hours of annual security awareness training. Your access to the secret cool kids Slack channel is purged, too.

45. Accidental network isolation.

Adjust some network config you read about on Stack Overflow and suddenly the site is down and no one has access to the systems that can bring it back up and ahhhhh. You frantically call your AWS or colo account rep to see what they can do as your mobile device buzzes incessantly.

The essence of this fuckup is that the outage locks you out of the systems which need to be accessed to resolve the outage. This can be something as simple as firewall rules or as complex as unicast BGP configurations across complicated multi-vendor networks that lock everyone out of your data centers.

46. The orchestrator goes down with the ship.

A core service on which your orchestrator depends is down. You would normally use the orchestrator to deploy the service, but since the service is down, the orchestrator no longer functions. Now someone must dig out the dusty documentation on the old manual way to do this as the clock is ticking. Does the manual way even still work? Who even has access?

Elsewhere, you put Consul into the deploy path six months into its lifespan and it packet-storms itself into oblivion, taking down not only service discovery but also your ability to deploy anything or even log into nodes.

Rolls and Reboots

“No plan of operations reaches with any certainty beyond the first encounter with production” – Helmchart von Faultke¹⁰

47. No rollback plan.

It’s truly shocking how often orgs don’t have a rollback plan. But just like your mom told you about jumping off bridges, just because everyone is doing it doesn’t mean it isn’t dangerous.

There’s more than one way to handle this properly, like CD with canaries, blue / green deploys, full rollback of everything… but to not have a strategy for this at all and YOLO it? If only we gatekept less against liberal arts majors to fill this chasm of critical thinking.

A special mention goes to the untested rollback plan, too. “We have a complicated deployment that went smoothly in staging, pre-prod, and every other environment, so why would we ever need to roll back?” you say. “It can’t possibly fail in production,” you say.

You’d be correct 9 out of every 10 times… but how many times do you deploy a year again? So, you painstakingly craft a rollback plan for your deployments, but never test it since it’s unlikely to be used. And how little confidence you have in your rollback plans leads to this next fuckup.
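The arithmetic behind that question can be sketched as follows, assuming each deploy succeeds independently with the same probability (a simplification; real failures are correlated). The function name is made up.

```python
def chance_of_needing_rollback(success_rate, deploys):
    """Probability that at least one of `deploys` deployments fails."""
    return 1 - success_rate ** deploys


if __name__ == "__main__":
    for deploys in (1, 10, 50):
        print(deploys, round(chance_of_needing_rollback(0.9, deploys), 3))
```

At a 90% success rate and 50 deploys a year, the chance of needing that untested rollback plan at least once comes out above 99%.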

48. Forward “fixing.”

Something didn’t go as planned, so you decide to roll forward with some new plan you came up with on the spot instead of rolling back – and then something fails in the roll forward.

This is a fuckup sprouting from the “developers are optimistic by nature” problem. A deployment fails on what you believe is some minor technicality. And then you fail to resist the temptation of making a “quick fix” to patch it while on the call and build a new version of the software so your team can ship…

…But it might not be a quick fix and you’re proposing deploying something completely untested straight to production. Somehow the SRE team is okay with this, or maybe they’re hesitant but let it slide since there are already too many hills on which they must die.

Either way, you’re risking your uptime and stress for deploying a little earlier than you otherwise would. A worthy heuristic for this might be: because developers appear to be optimistic by nature, even the “tiniest” of hotfixes are incomplete and require more testing.

49. Scheduled tasks build up while the system is down for maintenance and DoS the system upon startup.

Your system has a job queue with workers that are carefully tuned not to consume too much money and still complete their work. While the maintenance page is up, the workers are shut off. Deploying the app takes longer than expected and scheduled tasks pile up. The original pool of workers is no longer sufficient to process the backlog of scheduled tasks and people waiting on their results find your team to be insufficient.
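How long the backlog takes to clear depends on the spare capacity, not the total capacity. A rough sketch, assuming steady arrival and processing rates; all names and numbers are hypothetical.

```python
def drain_minutes(backlog, arrival_per_min, capacity_per_min):
    """Minutes until the backlog is gone, or None if the workers never catch up."""
    spare = capacity_per_min - arrival_per_min
    if spare <= 0:
        return None  # the queue grows forever at these rates
    return backlog / spare


if __name__ == "__main__":
    # Two hours of maintenance at 100 tasks/min leaves 12,000 queued tasks.
    # Workers tuned for 120 tasks/min have only 20 tasks/min to spare.
    print(drain_minutes(12_000, 100, 120))  # 600.0
```

Two hours of downtime turning into ten hours of catch-up is the kind of number worth knowing before the maintenance page goes up.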

50. Circular dependencies in infrastructure.

Circular infra dependencies result in a particularly nefarious failure pattern. If anything in the chain ever goes down completely, it’s impossible to stand the system back up without yolo-rushing a new version of a component to break the chain. For instance, perhaps you store the latest deployed revision on your own host, which means you can’t access it when something goes wrong.

You may design your system nicely, but time inexorably marches forward without regard for your intentions. This failure is an emergent property of all the changes people make over time. It’s an iceberg failure that only emerges when another failure has already emerged and is plaguing you. That is to say, circular infra dependencies result in a particularly nefarious failure pattern…
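If the infra dependencies are written down anywhere machine-readable, a periodic cycle check can surface the loop before an outage does. A minimal sketch using depth-first search over a hand-maintained dependency map; the component names are invented for illustration.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of names, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = {}
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for dep in deps.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GREY:  # reached a node already on the current path: a cycle
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


if __name__ == "__main__":
    infra = {
        "deploy-tool": ["artifact-store"],
        "artifact-store": ["cluster"],
        "cluster": ["deploy-tool"],
    }
    print(find_cycle(infra))  # ['deploy-tool', 'artifact-store', 'cluster', 'deploy-tool']
```

The hard part is not the graph algorithm but keeping the map honest as people change the system over time.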

Disorganized Organization

No amount of fancy automation can truly save you from disorganized organizational processes.

51. No one wants to write docs.

Raise your hand if you’ve ever worked at a company with great internal documentation. Try to recall when you’ve ever read truly complete and up-to-date deployment documentation. For many of you (most of you, even), nothing comes to mind, right?

The closest might be a well-commented deployment script and some associated high level description. Perhaps it’s a design doc that you trust to be sort of right but cannot assuage your suspicion that the implemented system has drifted away from it. If you trust your documentation to be 100% accurate when deploying software, you’re going to have a bad time because it’s inevitable that there will be errors in it.

And because no one wants to write docs, numerous fuckups occur. You followed outdated or misleading docs on how to make the release, which fucked up the deploy. You forgot to update customer-facing docs and they configured something incorrectly and now all your other customers are suffering from the outage.

You forgot to send release notes which, wait, how is that a fuckup? Oh right, the account manager for your largest customer added in terms about releasing their requested feature by a certain date (without telling anyone in product or engineering of this, naturally) and now you’re re-negotiating their multi-year contract and giving them a serious discount to stay which is going to be difficult for your CEO to explain on the next earnings call.

52. People only get rewarded for diving saves.

People are congratulated for resolving the downtime or for catching a failure as it’s happening, but no one is rewarded for anticipating failures ahead of time.

The CEO wants things to be shipped now so everything is a rush to get half-baked features out the door quickly. But that causes quality problems elsewhere. At least half the deploys have an emergency “oh shit something is borked” follow-up deploy. And either you roll forward or the app limps along and languishes in a janky existence for the next five days until someone builds the fix and ships it.

Whoever ships the fix is lauded for restoring sanity, but it never should have been broken in the first place. And everyone knows if they had chosen to roll back, the CEO would’ve been angry because his little gamification feature wouldn’t have been there for five days. You suffer, your team suffers, your customers suffer, but bossman is happy and the bleary-eyed engineer who spent days on the recovery gets a pat on the back. Well done, naive salaryman.

Then a conference is coming up; your CEO and CMO demand a splashy announcement for it. That means your Q3 deploys are now beginning-of-Q2 deploys… which is in two weeks. You ship a ton of stuff that is half-baked and barely strung together, but the press release goes out (along with the press releases of all your competitors in an unnavigable sea of babblespeak that the market largely ignores).

The team is congratulated while the architect cries in the bathroom grieving their multiple quarters of work of carefully planned releases as support tickets now pile up with customer complaints about how features are broken. By end of year, half the features are still being “stabilized” and the other half are mothballed.

53. No process for rarely-performed tasks.

A task is rarely performed, so there’s no documentation on it. Regrettably, someone must perform the task now and today the universe has decided for that person to be you. You go to look for documentation and find nothing. You look at the code for the systems involved and it’s unintelligible. You git log the associated files and discover that everyone involved with the system has already moved on. You wonder if you should move on, too.

When disparate teams try to coordinate on rarely-performed tasks a special sort of confusion emerges.

54. Have to build a replica for noobs who can’t write queries.

It’s deemed necessary for internal data analysts to be able to run queries against production data so they can serve customers and forecast future business (or other such violations of linear time). They’re granted read-only credentials to the production database because that should be sufficient. Later, you are paged because the service is down and the database is wedged.

You discover that one of the data analyst’s queries is taking up way too much memory and has locked a critical table. You kill the query, sever access, and prepare for hell in the morning. In the end, you deploy a replica so the internal teams can query production data without killing the production database. Leaders considered it too expensive to set up originally, but how expensive was the outage and all the effort which went into restoring service?

55. Layer 8 denial of service.

Once upon a time, you and your team decided to rewrite an app because your company’s business model changed and thus very little of it was still useful. You also didn’t like Ruby, so you decided to rewrite it in Scala because Scala was hot and everyone on the team wanted to learn Scala. Great, let’s trust our important business function to people learning a new language!

The first version of the app was supposed to be deployed alongside the Ruby version and coexist with it. That deployment failed and also caused the Ruby app to fail. Repairing that took 8 hours of downtime. Naturally, the sysadmin didn’t particularly appreciate having to stay for an extra 8 hours on a Friday because your team wanted to deploy outside of business hours.

A month later, you try again. It deployed successfully! …But the migration for the user accounts fucked up. You could use the new app, but no one had accounts for it other than the root account. A week later, you try again with a script to deploy all the user accounts – and that was successful.

Later, your team discovers the v1 of the app is very slow when actual work is done in it. So, you switch to using CloudSearch to “optimize” part of the app. And it does! …Except CloudSearch is eventually consistent and now users complain that when they add something to the app and click refresh, it doesn’t show up until 30 seconds later.

Your team rushes a hotfix to undo the CloudSearch integration and restore the previous functionality. The sysadmin says no. You gave them less than a day’s notice to deploy this new version, even though your team knew about it for a week while you worked on undoing the integration. You will be lucky if you ship anything else the rest of the year now.

tl;dr the sysadmin is fed up and doesn’t trust anything your team deploys now.

56. Engineers take key bumps of YOLO in prod.

Your company prides itself on being a meritocracy with a flat hierarchy, which is why senior leaders (like your boss) can disregard deploy processes – like making a production fix for a bug on the production node and recompiling, re-introducing the bug on the subsequent deploy because they never fixed the issue in tree.

This travesty is an argument in favor of making manual deployments impossible or difficult (see #69), but there’s no guarantee that any proposed safeguards would avoid veto by the Director of YOLO Engineering who is responsible for the fuckup in the first place. Because it’s never their fault, is it?

There’s also a coding variant to this fuckup: someone yolo-typing new code into a live virtual machine. They hot patch at the Erlang console because they relish living in sin. It might be called performance art if it wasn’t fated to desecrate service performance.

An AI render of the pope wearing a stylish puffy coat, giving him the appearance of a hip hop artist with ample swagger in juxtaposition with his role as pope. The image is captioned: your manager on their way to hot patch at the Erlang console.

That anyone would be allowed to do this assuredly reflects organizational dysfunction. It is so bonkers to be able to just like, write code on a production box and expect that it works. It is a pathological level of optimism. It is suspiciously reminiscent of the Pyro in TF2 who runs around burning everyone to a crisp with a flamethrower while, from their deranged vantage, they are showering the world in glittering rainbows and bubbles and whimsy.

“Well, I’d never do that!” you say, thinking this doesn’t apply to you. And then you’d proceed to attach VisualVM to the JMX port and yolo some gc tuning. Or you’d run some exploratory bash or SQL on the prod instance to get some data without having tested it fully in a test environment. Maybe you aren’t debugging in prod, but using tracing or performance analysis tools in prod to debug problems or tune settings without having tried first in QA at the very least makes you a co-conspirator and likely a Staff YOLO Engineer (maybe even Senior Staff if you continue to do it after reading this! Don’t let your dreams be memes).

57. Cloud credits are about to run out so you rush deploys to reduce your AWS bill.

You have to scale down really quickly because your cloud credits ran out and you can no longer afford your infra… which means you were spending money you didn’t have for a long time because Papa Bezos was your sugar daddy for a bit. As you scale down in a panic, you fail to load test the new database and regret not just selling out at one of the tech giants. Now your organization has successfully reduced costs… but also revenue.

58. Behavior in your dependents fucks up your deploy.

It’s trivial to mentally model your service in isolation; the rest of the world is immutable and your deployment is the only change in motion. In reality, other teams are hurling themselves at their OKRs, your sales team is onboarding new accounts, and your data integrations are data pipelines haphazardly built with popsicle sticks and glue. Like nature, the system is in constant flux and no matter how confident you are in your deploy, an unexpected shift in the system elsewhere can result in your system failing.

Maybe another team has worse deployment hygiene than you do and they yolo’d a version straight to prod without giving you a chance to integrate with it. Maybe they’re hotfixing an incident themselves and your service is collateral damage. Maybe a data partner changes their data format without announcing it (see #51) and every system in the path falls flat on its face.

It’s not your fault, but it is your problem. Scream into a pillow and sing lamentations to your pet or whatever you need to do to process your grief and move on to acceptance. Because if you want to prevail, you must be nimble and maintain the capacity to recover from unexpected failure.

Business Illogic

The deployment may pass your tests but it can still break your business logic.

59. Breaking API change for a partner.

Your team finally tackles tech debt and deploys the new, shiny, streamlined version of the API. A few hours later, a partner is screaming at your CTO because they were using the API in a way you never fathomed was even possible and their integration no longer works due to your change.

Another time, you’re celebrating the successful update of the auth method in your SaaS app. It passed all tests, got approval from the security team, and nothing broke after deployment… but, as you’ll soon realize upon wading into a shit show the next morning, you forgot to tell customers about the auth method update. Everyone built access using a certain type of token and switching the service to use a new method completely broke customer access. Guess who will be blamed for lower renewal numbers this quarter?

The “funny” thing about breaking API changes is devs will often argue what is or isn’t breaking. Semver this, semver that. It still takes the same signature and they only fixed a “bug” in the behavior of the other parameters… but what if the other software was relying on that behavior? Now it’s different and different is bad when customers rely on things staying the same.

60. Compliance calamity.

Compliance stuff is boring but it matters. Some subtle design, layout, wording, or data retention change in a highly regulated part of the system causes it to no longer be in compliance with one of the onerous compliance regimes it must be a part of for the business to remain viable.

For instance, your payment flow changed and now you’re no longer in compliance with PCI. This remains undiscovered until much later, as most failures of this type are. If you’re unlucky it’s the auditor who discovers it and you’re now buried in paperwork. Or you erode trust by violating user expectations about how you handle their data.

61. robots.txt that inhibits search engine indexing and traffic plummets as a result.

You change something in a way that results in search engines or other traffic sources deranking or delisting you. Maybe it’s as subtle as borking the preview cards; sure, the links still work, but it’s no longer as clickbaity to the ever-shortening attention spans of the plebeian spectators. Congratulations, you just killed your traffic source and meal ticket!

Everyone frantically tries to figure out what is going wrong as bank accounts drain. It might not even be something you changed — sometimes giants simply roll over in their sleep and crush smaller players. But it could also be that you messed up the robots.txt and are now poor.

The Audacity of Spacetime

Deploying the system at scale is different than deploying the little test sandbox version of it.

62. Deployment assumes all servers are updated at the same time, but they’re not.

This fuckup is so, so common. It breaks the simplified, but wrong, mental model that users talk to a single server and only ever to that one server. It’s a useful model because it simplifies a bunch of things and is mostly true; when it’s not true, it’s often fine to overlook the effects. But, occasionally, the effects are catastrophic and nothing behaves properly until reality settles.

63. A new deployment begins while a previous one is still in progress.

Canaries and staged multi-region deploys can, by design, take a while. Kick off a new deploy before the previous one finishes and your upgrade is only partially tested and deployed, resulting in an outage.

Most of the fuckups on this list are due to immature processes. But this one emerges as your processes begin to mature. Observing how your failures transform over time can elucidate your progress, a kind of mindfulness that is admittedly difficult to cultivate when feeling the crushing weight of disappointment.

64. Multi-stage deploys of unrelated components.

You’ve had so many deployment failures in the past and every deployment has been painful. Some well-meaning person has decided that deployments need to be surveilled with hawkish intensity. Deployment frequency plummets accordingly and every deployment is a potpourri of changes that various stakeholders demand go live.

Good ol’ batch deploys take forever. People get burned out or fatigued and then naturally make mistakes. Or it’s not their component and they don’t have skin in the game¹¹ and consequently are careless when handling it.

When failure does transpire, everyone’s frustration inflames. It’s either their component that failed and they’re frustrated at the lack of care by their peers, or it’s not their component and they’re frustrated that they have to be on this stupid Zoom call until 04:00.

The answer is probably splitting the deploys out; the only reason not to do separate deploys is likely organizational process or dysfunction (see also: Disorganized Organization).

65. Accidentally deploy more than you thought you did.

You’ve put a ton of work into automating your deployments. The automated tooling is effective and deploys exactly what you asked of it – but what you asked of it didn’t match your expectations.

Perhaps you thought you were deploying a branch containing only a hotfix, but it was started from the wrong base branch. Or maybe you thought you were asking it to target only a few canary nodes, but accidentally rolled the whole fleet. Perhaps the automation tries its best to make all of the servers consistent by ensuring changes must be deployed in the same sequence. Whatever it was, automation ruthlessly executed your command and now you’re scrambling to recover.

In many organizations, it’s difficult to justify improving the safety and user experience of internal tools since it doesn’t directly affect customers and “just” makes the system confusing for our engineers working with it. The silver lining is this outage will at least make the case that developer experience is important.
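One small piece of internal-tool safety is making the automation state its blast radius and refuse when it exceeds what the operator said they intended. A hedged sketch; `guard_blast_radius` and its parameters are hypothetical and not part of any real deploy tool.

```python
def guard_blast_radius(targets, fleet_size, max_fraction):
    """Return targets unchanged, or raise if they exceed the intended share of the fleet."""
    if fleet_size <= 0:
        raise ValueError("fleet_size must be positive")
    fraction = len(targets) / fleet_size
    if fraction > max_fraction:
        raise RuntimeError(
            f"refusing to deploy to {len(targets)} of {fleet_size} hosts "
            f"({fraction:.0%}); limit is {max_fraction:.0%}"
        )
    return targets


if __name__ == "__main__":
    canaries = ["web-01", "web-02"]
    print(guard_blast_radius(canaries, fleet_size=100, max_fraction=0.05))  # ['web-01', 'web-02']
```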

66. Zombie hosts.

Your new version operates under the assumption that the fleet is only running the new version and all instances speak the same protocol. But in reality, some hosts came back from the dead (i.e. maintenance) running an old version of the software after the deployment completed.

Now you have a zombie apocalypse on your hands with nothing to defend yourself but your laptop. You now regret choosing the ultraportable version rather than the hefty tank boi. And just like zombies, zombie hosts can sneak up on you when you least expect it, long after your deployment is complete when the post-apocalyptic landscape that is your prod environment seems almost serene.
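A basic defense is to keep comparing what each host reports against what you believe you deployed, well after the deploy is "done." A minimal sketch, assuming you can collect a host-to-version mapping from somewhere; the names and version strings are invented.

```python
def find_zombies(reported_versions, expected_version):
    """Return the sorted hosts whose reported version differs from the expected one."""
    return sorted(
        host for host, version in reported_versions.items()
        if version != expected_version
    )


if __name__ == "__main__":
    fleet = {"web-01": "2.4.0", "web-02": "2.4.0", "web-03": "2.3.1"}
    print(find_zombies(fleet, "2.4.0"))  # ['web-03']
```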

67. Running out of cloud resources.

One fine morning, you discover you’ve run out of the specific instance type your service needs. Like, there are literally no more i3.16xlarge instances that exist for you to purchase in this universe (or possibly just the availability zone).

It turns out you are their largest customer, which, of course, the vendor never made clear for strategic reasons. Scaling beyond the capabilities of a vendor inevitably results in downtime. Either you convince the vendor to git gud or you patch to make the app creak along as you frantically build a migration path to a substitute, disrupting the roadmap in the process.

Or, on a Zoom meeting with a bloated attendee list, a dev notes that the app is slower: “I refactored the code to make it easier to read, but now it’s slower, so we need 3x the servers to run it.” You swallow bile. Lucille Bluth asks in your head, “How much could one server cost, Michael?”

If you have rollbacks, you should be fine. If you have autoscaling, you can just pay to address this problem. But nothing can help you automatically scale your tolerance to bullshit or roll back your life choices.

68. Proactively overloading your systems.

Scaling one part of the system puts pressure on other parts… and now they’re failing. You now must deal with an outage somewhere you weren’t expecting, all because you were proactive in anticipating capacity you’d need in the future. Worse, if that capacity is required right this millisecond, you face the dilemma of choosing which part of the system to sacrifice temporarily while you figure out how to fix the bottleneck.

Manual Deploys

69. Manual deploys.

Manual deploys are truly terrible. If there is a villain in the story of DevOps, it is manual deploys. They are not the serpent in the garden promising forbidden knowledge. Manual deploys are the Diablo boss that probably smells like rotten onions and toe fungus IRL and whose only purpose is to destroy any and all life.

Not convinced yet? Here are reasons A through Z to stop living in Clown Town. Each should be enough to convince you to automate at least the tedious parts of your deploys. Please, we beg you on behalf of humanity and reason, automate all the repetitive tasks you can, even if your org has an aversion to it. Humans are not meant for executing the same thing the same way every time.

An engineer walks into a bar, has two beers, and now is deploying to the entire cluster as they order a third. The bartender says, “You know, if you used an orchestrator, you could order something stronger.” That bartender’s name? Q. Burr-Netty.

Backups of the database probably don’t work. Every time you take a snapshot, it’s someone reading the docs off a DigitalOcean post on how to back up MySQL.

Copy pasta is always served with failsauce. Copying a config from an existing build to a new one, then forgetting to change the version number. Copying SSH authorized keys between machines… and if you’re managing them like that, it’s probably append-only which means your old ops people still have access to your prod servers.

Disk management as a matryoshka doll of disasters: capacity management, failing to provision enough space¹², IOPS management, SAN management and all the babysitting required for distributed disks, we probably don’t need to go on.

Expiration of certificates or domains, the tech tragicomedy. You know this will happen again in a year. You see the rhino charging towards you in the distance but there’s always something more urgent to do until it’s too late.

Forget to smoke test the whole environment. You perform manual tests but they only hit the “good” servers. Luck favors the automated.

GeoDNS routing with manual region switching so you can take down a data center and update it without any traffic… but actually DNS takes a while to propagate so you still have a trickle of traffic coming in (does anyone care that much about those lost requests?).

Handling hardware failures is nigh on impossible. Are your systems even failing over?

Improper sequence when deploying components. Just like your dance moves, the order of your deploy steps is all wrong.

Jumpbox that people use as a dumping ground for random assets they need in prod, like random JAR files or Debian packages, movies they torrent at the office that they want to get on their home machine, random database dumps that people need for various purposes…

Keen to have the deploy done, you do not wait for changes to propagate, the cache to become warm, nor the system to become healthy. “No, sir, the engineer really worth having won’t wait for anybody.” ~ F. Scoff Gitzgerald¹³

Lonesome server runs the wrong version because you forgot to update all the servers. Or, you forgot one region when you’re doing multi-region updates.

Mismatched component versions. It’s very easy to do when you’re slinging deploys manually and how many database servers do we have again? Is Tantalum down or decommissioned? This IP naming scheme makes no sense. Is it even a database server?

Not copying code to all of the servers and not removing the old code from them, leading to conflicts worse than the tantrums on your executive team.

Overlook which environment you’re in. It’s an easy thing to overlook, so if it happens, it’s probably a process failure: there should be far more safeguards in place to stop someone from accidentally farting about in prod. Ideally, you shouldn’t even be able to make this mistake.

Provision users manually. Not only is it a pain in the ass, it is also fraught with peril.

Quarrels between IP addresses and hostnames that rival a Real Housewives reunion special.

Rotate the password or keys, but forget to update the service config with the new password. You rotate the password, so of course you have to update the config, but there may be numerous configs and it can be easy to miss one if it’s not documented or automated.

Smoke tests aren’t performed after manual production deploy. If you’re doing deploys the wrong way (i.e. the manual way), smoke tests are a way to mitigate some of the issues – but you must remember to actually conduct them.

Trusting that your on-call team will be paged despite never testing the paging plan.

Updating the monitoring system is overlooked. If you autoscale, the system managing the autoscaling will monitor the hosts itself. If you add a host manually to a system that doesn’t autoscale, you probably want that host to register with the agent that’s supposed to do the monitoring.

VPN that is a single-point-of-failure and held together with duct tape and twine. The VPN is required to access the network to do the deploys but apparently making it not suck is not required.

Wait for DNS propagation? Who has time for that?

X11 and RDP-based deploys where a tired sysadmin remotely logs into the virtual desktop of a system that shouldn’t even have a graphical environment and haphazardly drags files around until the new release is live. The commands can’t even be audited because there were no commands, only mouse movements.

Your sysadmin does maintenance on the database so that it can stay up, but in the morning you discover the settings they’ve changed cause the database to no longer run its background maintenance processes and you’ve just deferred your downtime until later.

ZIP or JAR file is copied from the developer’s laptop and now you have no record of what was deployed.
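
Several of the items above (I, K, F and S, Z) come down to steps a script will happily repeat without getting bored: deploy in dependency order, wait for health before moving on, smoke test every host rather than the lucky ones, and record exactly what was shipped. Here is a rough sketch of those four pieces, assuming Python 3.9 or later. Every function name and component name is hypothetical, and the health and smoke checks are injected callables because this sketch does not know what "healthy" means for your system.

```python
import hashlib
import time
from graphlib import TopologicalSorter
from typing import Callable, Iterable, Mapping


def deploy_order(depends_on: Mapping[str, Iterable[str]]) -> list[str]:
    """Order components so each one deploys after everything it depends on.

    depends_on maps a component to the components that must go first.
    A dependency cycle raises graphlib.CycleError instead of guessing.
    """
    return list(TopologicalSorter(depends_on).static_order())


def wait_until_healthy(
    is_healthy: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_healthy until it returns True or timeout_s has elapsed.

    Returns True if the check passed, False if the deadline was reached.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_healthy():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


def smoke_test_all(hosts: Iterable[str], check: Callable[[str], bool]) -> list[str]:
    """Run check against every host individually and return the ones that fail."""
    return [host for host in hosts if not check(host)]


def artifact_digest(data: bytes) -> str:
    """SHA-256 hex digest of the exact bytes being deployed, for the deploy record."""
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    # Hypothetical components: the app needs the schema migrated and the cache up.
    print(deploy_order({"app": ["db-migrate", "cache"], "db-migrate": [], "cache": []}))
```

None of this replaces a real orchestrator. The point is only that each of these mistakes becomes much harder to make once the step lives in code instead of in someone's memory.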

Thank you to the following co-conspirators for their contributions to this list: C. Scott Andreas, Matthew Baltrusitis, Zac Duncan, Dr. Nicole Forsgren, Bea Hughes, Kyle Kingsbury, Toby Kohlenberg, Ben Linsay, Caitie McCaffrey, Mikhail Panchenko, Alex Rasmussen, Leif Walsh, Jordan West, and Vladimir Wolstencroft.


¹ This brings to mind Vonnegut’s advice of “Be a sadist. No matter how sweet and innocent your leading characters, make awful things happen to them—in order that the reader may see what they are made of.”
² As “Duskin” rightly noted in an investigation of a fire at an ammonia plant back in 1979: “If you depend only on well-trained operators, you may fail.”
³ These sorts of seldom-used libraries are much less likely to be poisoned than the mainstream libraries which occasionally have CVEs, but infosec folks ambulance chase off them until our sanity is flattened and bloodied like roadkill.
⁴ And why should traffic in pre-prod be as high as prod? Replaying all traffic to pre-production all the time is expensive af! So it’s a reasonable assumption, in isolation.
⁵ but oh honey why are you performance testing an option that’s faster than what you’ll actually deploy??
⁶ The options are even more misleading than you might expect. --no-cache only inhibits the cache for layers created by the Dockerfile and does not skip the image cache. You need --pull to skip the image cache.
⁷ Usually linkers order the objects they’re instructed to link by the order they’re presented. If you specify the order, you’ll always get the same order. If you have Make or whatever build system you’re using send the linker all the .o files in the directory, it will send them in the order the filesystem lists them, which can change depending on some internal filesystem properties (usually what order their metadata was last written). Usually it doesn’t matter, but maybe the code has some undefined behavior based on the layout of the code itself. Maybe there are static initializers that get run in a different order and some data structure is corrupted before the program even starts doing anything useful.
⁸ Action bias is a bitch. See also a recent paper I co-authored: Opportunity Cost of Action Bias in Cybersecurity Incident Response
⁹ I did not have to come at myself this hard. (That’s what they said).
¹⁰ The original quote by Helmuth von Moltke is “No plan of operations reaches with any certainty beyond the first encounter with the enemy’s main force.” from Kriegsgeschichtliche Einzelschriften (1880). It is commonly quoted as “No plan survives first contact with the enemy.”
¹¹ “Skin in the game” is such a strange idiom. It makes me think of skeletonless fleshlings flailing around on a football pitch trying to flop wobbling meatflaps at the ball. Neurotypical lingo never ceases to amaze.
¹² You might think that with the growth of data collection, machine learning, and other ~~flagrant privacy violations~~ business intelligence practices that data storage is the primary dimension of capacity planning. This is often not the case. In the last decade or so, capacity has grown phenomenally but throughput and latency have not kept pace. As a result, IOPS and throughput are more commonly the bottleneck that needs planning while storage capacity is overprovisioned. On the cloud, allocated throughput and IOPS are assigned based on volume size, so it’s common to see vast overprovisioning of volume size to realize sufficient IOPS. It also occurs on storage SANs, where the number and capacity of disks are selected to match the required sustained read and write rates. All of this is phenomenally complicated but as a first approximation, IOPS and throughput matter more than storage capacity for many use cases.
¹³ Paraphrased from Chapter 2 of This Side of Paradise by F. Scott Fitzgerald: https://www.bartleby.com/115/22.html