A cyberpunk painting of a cat with goggles tampering with a pipeline in a data center. Everything is bathed in neon in shades of electric pink, vivid purple, shocking cyan, and lime green. The hacker cat looks intent, one paw on the pipeline to ensure all the fluid bits are pilfered into its canister.

The cybersecurity discourse is, of late, festooned with fear mongering about vulnerabilities in build pipelines. If an attacker exploits a vuln in our build pipeline, are we doomed? No, because it’s pointless for them to do so. But there is a real problem revealed by this clucking and clamoring: many security professionals (and vendors) don’t know how build pipelines work.

The twisted security tale they’ve spun is: One horrible day, our build infrastructure reads attacker-controlled data that triggers exploitation of a vulnerability. Yet, to achieve this, the attacker must gain access to our build system; if they can access the build system, they can change what it does and what gets built. Why do they need to exploit a vulnerability when they’ve already cinched their victory? Even male peacocks aren’t this wasteful.

Here’s how the real story unfolds. I, a nefarious attacker, want to corrupt the software builds coming out of BlandCorp’s GitHub Actions build farm. I’m already versed in how most build processes work at modern enterprises because attacking them is part of my job[1].

BlandCorp, like many enterprises, runs the Actions runner inside numerous pods on Kubernetes (or some equivalent build runner inside other build infrastructure). These pods receive builds from BlandCorp’s GitHub server and then run the build steps specified in the target repository, such as:

  • checkout the code
  • install the language toolchain
  • fetch the dependencies
  • build the software
  • run the automated tests
  • upload the resulting artifact[2]
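In GitHub Actions terms, a workflow implementing the steps above might look roughly like this – a hedged sketch, where the project type (Node), the action versions, and the artifact path are illustrative assumptions rather than BlandCorp’s real configuration:

```yaml
# Illustrative workflow only -- not BlandCorp's actual pipeline.
name: build
on: [push]
jobs:
  build:
    runs-on: self-hosted              # BlandCorp's Kubernetes-backed runners
    steps:
      - uses: actions/checkout@v4     # checkout the code
      - uses: actions/setup-node@v4   # install the language toolchain
        with:
          node-version: 20
      - run: npm ci                   # fetch (and verify) the dependencies
      - run: npm run build            # build the software
      - run: npm test                 # run the automated tests
      - uses: actions/upload-artifact@v4   # upload the resulting artifact
        with:
          name: app
          path: dist/
```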

Build pipelines are not like web applications, which are pervasively interactive and take input by design. Build infrastructure is designed to grab the source artifacts, perform some work to verify and transform those artifacts, and then deploy the results of that work somewhere.

The main interaction with the outside world is grabbing the artifacts. Build pipelines don’t ingest form fields or input from a command line; a build pipeline does its thing very well but its thing, in the grand scheme of things, is limited.

If there’s a vulnerability in the build pipeline I want to exploit as an attacker, I must find a way to interact with it. This interactivity is designed to be impossible – a testament to the efficacy of design-based solutions for security and reliability. Security vendors will not tell you this for somewhat obvious reasons. Vendors want to scare you about build pipeline vulnerabilities because if it were possible to exploit them, it would be dire, and they want you to pay them to soothe your fears.

If not through exploitation, then how does the story unfold? Imagine I have a code execution vuln I want to exploit; to trigger it, I must control the data the build pipeline reads. But if I can change that data, I can already commit code – so I may as well write code that does what I want. As Raymond Chen said decades ago, “You shouldn’t be surprised that allowing people to run code lets them run code.”

Or, if I can change the software that runs the build runners, I can replace it with a malicious version. I don’t need to exploit a vulnerability at all because I already have the access I want to gain control over BlandCorp’s build infrastructure.

So, as an attacker, I can ferret my way into BlandCorp’s build infrastructure through three primary paths:

  1. tampering with the source code
  2. substituting the dependencies or language toolchain
  3. corrupting the underlying runner that performs the work

How do I reason about these paths as an attacker?

Path #1: Tampering with the source code

The most direct and obvious way for me to tamper with the source code is to commit new code to whatever I want to tamper with, like a component that will be built by BlandCorp’s build runner. This is also likely the least stealthy way to compromise a build pipeline.

I, as an attacker, cannot simply submit a pull request (PR) with my malicious modification; or maybe I can, but it’s very unlikely to be approved by a human involved with the project. BlandCorp likely has branch protection, too, which prevents me from force-pushing my malicious code. This displeases me as an attacker.

Path #2: Substituting the dependencies or language toolchain

Next is the most expensive path. The dependencies and language toolchain are where, as an attacker, I can inject data into the build process (like substituting or replacing the dependencies). But BlandCorp’s runner, like any runner, will fetch dependencies from their upstream locations on the internet and cryptographically verify them to ensure they match what developers expect.

Thus, to tamper with this software, I must incinerate tens of millions of dollars of CPU time to find a hash collision and meddle-in-the-middle the build workers. As an attacker, this also displeases me.

Path #3: Corrupting the underlying runner that performs the work

The runner that performs all this work in the build process is not cryptographically verified. But if we trust GitHub (or an equivalent vendor) to store our source code, we should trust them to run it, too. Verifying the underlying infrastructure and keeping it safe is a lot of work; if we do it ourselves instead, we don’t gain any additional assurance against tampering.

To corrupt the underlying runner, I (the attacker) must invest ample time, money, and cognitive effort to either:

  1. Compromise BlandCorp, who maintains a stuffy “no SaaS allowed” policy and thus self-hosts; if I compromise BlandCorp to gain enough access to their self-hosted build infra to tamper with it, I’m already deep inside BlandCorp (so no need to exploit a vuln unless I want to flirt with future incident responders)
  2. Compromise GitHub itself (or an equivalent vendor), specifically in a way that allows me to successfully modify the GitHub Actions code or infrastructure as befits my devious schemes.

For either option, I can social engineer a developer or admin to poach their credentials or gain access to their machine, from which I can pivot (with varying degrees of difficulty depending on their IAM architecture). In the CircleCI compromise, attackers stole customers’ keys in this fashion (by pwning a CircleCI dev’s laptop) – a terrifying scenario for customers. But, for the purposes of this post, it’s worth noting the attackers didn’t corrupt the underlying runner: they had already gained access to the resources they wanted, so why pursue something harder?[3]

I’m not spending tens of millions of dollars in either case, but this option likely leaves me wanting something easier as an attacker.

The caveat

But ay, here’s the rub[4]. BlandCorp might take shortcuts, or they might use vendors that take shortcuts. One of those shortcuts is skipping the verification steps – or not applying them to some component included in the build.

What do these verification steps involve? The worker (the tool in the build step that downloads dependencies) verifies a cryptographic hash provided in the application’s source code (see path #1), usually right after the asset is downloaded and before it’s extracted or used. The cryptographic hash is stored in a lock file that is versioned alongside the application source code.[5]
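A minimal sketch of that check, assuming a SHA-256-based lock file. The file name and contents below are stand-ins, and a locally created file substitutes for the network download:

```shell
# Compare a fetched artifact's SHA-256 against the hash pinned in the
# lock file; any mismatch means the artifact is not what developers
# reviewed, and the build must not use it.
verify_artifact() {
  artifact="$1"
  pinned="$2"   # the hash recorded in the lock file
  actual="$(sha256sum "$artifact" | awk '{print $1}')"
  [ "$actual" = "$pinned" ]
}

# Stand-in for a downloaded dependency (no network in this sketch).
printf 'pretend tarball bytes' > dep-1.2.3.tgz
good="$(sha256sum dep-1.2.3.tgz | awk '{print $1}')"

verify_artifact dep-1.2.3.tgz "$good" && echo "hash ok: safe to extract"
verify_artifact dep-1.2.3.tgz "0000tampered" || echo "hash mismatch: build stops"
```

The design point is that the pinned hash travels with the source code, so changing it requires the same access as path #1.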

So far, so good. Let’s zoom out to the build steps themselves. Each CI system has its own little language for describing how to build a project. These CI systems want to bequeath us the freedom to build whatever we want, like building something custom with a bash script. But this freedom allows us to do things we shouldn’t do, like download unverified files from the internet and run them.

Thus, the special trust you must maintain is that whoever writes the build steps doesn’t include randomly fetching data from the internet in those steps. Generally, they don’t – it’s very uncommon, both because it’s frowned upon by all parties and because verifying things is the default. You’d have to go out of your way as a developer to write build steps that download unverified data from a remote location. So, you know, don’t.

It only takes a single step of “download data and install it, without verifying it” to poison your builds – as we witnessed in the Codecov compromise. Codecov offered an install process of “copy this line of code into your build pipeline” – specifically bash <(curl -s https://codecov.io/bash) – and that line of code (now deprecated) downloaded a script from their website and ran it. No one likes this.
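The boring alternative is to pin the script to a hash recorded at review time and refuse to run anything that doesn’t match. A sketch under assumptions – the installer contents are placeholders, and a local file stands in for the curl download so the example is self-contained:

```shell
# In a real pipeline the first step would be something like:
#   curl -fsSL "$INSTALLER_URL" -o installer.sh
# A local file stands in for that download so this sketch is self-contained.
printf 'echo "installer ran"\n' > installer.sh

# This hash would normally live in your repo, recorded when a human
# reviewed the script; the sketch computes it only for demonstration.
pinned="$(sha256sum installer.sh | awk '{print $1}')"

# Verify before executing -- never pipe a fresh download straight into a shell.
if echo "$pinned  installer.sh" | sha256sum -c --quiet -; then
  sh installer.sh
else
  echo "installer.sh does not match the reviewed hash; refusing to run" >&2
fi
```

If Codecov’s servers start serving a different script, the hash check fails and the build stops instead of running attacker-controlled code.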

The reason why security professionals detest this installation process is obvious; they are paranoid, even when unwarranted, so they distrust most code downloads. But software engineers also dislike this form of installation process because it destabilizes and jeopardizes reliability.

Security is a subset of software quality

I mention reliability because reliability is why build systems are designed this way – a way that frustrates attackers by design. Engineers want to ensure that when they make a build, test it, and deem it correct, the build they subsequently deploy to production won’t be meaningfully different from the one they tested. Security may not be the primary motivator, but it benefits from this stringent reliability requirement.

Much of what we seek from a security perspective is enveloped by reliability. Security is ultimately a subset of software quality. This is a lesson that more security professionals should heed, especially those that protest that software engineers “don’t care about security.”

Reliability is also why many software engineers feel that the less security teams meddle in the build process (and other parts of software delivery), the better – the higher quality, the more secure – it would be. Many of the things I read about “securing” build pipelines are half-baked and result in less reliable software, which means less secure software.

Is adding more things with opaque and unverified steps in your build pipelines a good thing? Check your security vendors’ install processes, too; how many used (or still use) CodeCov’s same approach to shove their scanners and wares into your pipelines? Glass houses, pots and kettles, etc.

Instead of barking up errant trees, security professionals should seek opportunities to invest in reliability with auxiliary security benefits so everyone wins. When we propose security “solutions” that destabilize reliability – like some newer security solutions requiring you to completely renovate your build pipelines to accommodate them – our colleagues will be baffled by our audacity. Understand the thing you are trying to “secure” before you thrust yourself in it.

If a security professional isn’t familiar with reliability in the context of software, that’s an urgent problem. If we hope to “secure” the software delivery process, we need to understand the innovations around reliability – whether site reliability engineering or software quality – that enable all this high-falutin’ modern software stuff in the first place.

Start with the wiki on reproducible builds. Build reproducibility is something we care about for reliability purposes, but most cybersecurity teams today aren’t equipped to assess it. That must change.
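A toy way to see what that property demands: run the same “build” twice on identical input and compare the outputs byte-for-byte. Here gzip stands in for a build step; its -n flag, which omits the embedded timestamp and file name, is exactly the kind of small determinism detail reproducible builds must pin down:

```shell
printf 'the same source code\n' > src.txt

# Two "builds" of identical input. gzip normally records metadata like
# the input's timestamp in its output; -n omits it so identical inputs
# yield bit-identical outputs.
gzip -nc src.txt > build1.gz
gzip -nc src.txt > build2.gz

if cmp -s build1.gz build2.gz; then
  echo "reproducible: both builds are byte-identical"
else
  echo "not reproducible: outputs differ" >&2
fi
```

Real builds have many more sources of nondeterminism (timestamps, paths, parallelism ordering), but the check is the same: identical inputs must yield identical artifacts.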

Tech leaders and software engineers should consider where reliability investments may impart security benefits. We should teach our security colleagues about our software reliability efforts – especially how these investments exasperate attackers and impede their objectives (by skyrocketing the effort attackers must invest). Brainstorm how to further exacerbate attacker frustrations through these innovations.

Create a decision tree of how attackers might compromise your build infrastructure and capture existing design-based mitigations (as described above). It may spare you from unreasonable demands to fix CriTiCaL SuPeR uRgEnT bugs that can’t be exploited through any reasonable means.

My hope is that both communities can find common ground by thinking more about security solutions by design – but that starts with understanding our organizations’ systems and what purpose they serve. Otherwise, I fear the entrenched “vuln scan all the things” monomania will deepen and waste our precious time and effort on tilting at windmills.

Enjoy this post? You might like my book, Security Chaos Engineering: Sustaining Resilience in Software and Systems, available at Amazon, Bookshop, and other major retailers online.

Thanks to Alex Rasmussen, C. Scott Andreas, Camille Fournier, Leif Walsh, and Ryan Petrich for feedback.

  1. If only security people took this understanding as seriously as attackers.

  2. If BlandCorp integrates security cruft into their build pipelines, there might be steps like: run the vuln scanner or generate the SBOM ticket (alas).

  3. This is why I’ve been saying for quite a few years now that IAM is the hardest security problem related to modern infra. Most security vendors in that area are not very helpful (especially the Identity Posture Hygiene Surface ones). Solutions like time-based access feel more promising.

  4. A nod, of course, to Hamlet, Act III, Scene I: https://poets.org/poem/hamlet-act-iii-scene-i-be-or-not-be

  5. If you want to update a dependency on your local machine, your local build tooling (like npm install) automates the process for you. The engineer selects the new version they want and asks their local build tooling to install it for them. The local build tooling downloads the requested version from the upstream repository, then records the new version and its cryptographic hash in the project’s lock file. The engineer commits this lock file – as well as any other changes needed to use the new version of the dependency – in their PR, which their peers will review before merging it into the codebase.