Kubelist Podcast: CI is the new bottleneck

Depot co-founder and CEO Kyle Galbraith joined the Kubelist Podcast to talk about how Depot went from making container builds faster to a complete build acceleration platform — and why CI is the bottleneck of AI-era software engineering. We've collected some highlights from the conversation.

From motocross to Depot

Kyle's path into engineering didn't start with engineering.

My story is probably similar to most engineers-turned-founders — it all started with video games. Way back in the day, I grew up racing motocross. I was on two wheels from the time I was six years old, racing dirt bikes around the country. I fell in love with a computer game called Motocross Madness and started hacking mods into it.

I thought about going to film school in Southern California, but I ended up taking a computer science elective at community college. I was like, "Actually, I really enjoy this. This is my preferred thing." I went through the whole CS program while working full-time as a software engineer for a startup back in Portland.

I've worked in startups my entire career, and from that first one I was dead-set on starting my own someday. After a few years I did a short stint of consulting. I absolutely hated it. It was a peek behind the curtain that I did not need in my technical life. You don't need to know that some technologies holding up billion-dollar companies are built on toothpicks.

That's when Jacob — Depot's other co-founder — and I met at a nonprofit called Thorn.

The original problem

Jacob and Kyle left Thorn together and joined Era Software, building an Elasticsearch replacement. The seed of Depot came from a pain point they kept hitting as platform engineers.

We'd always faced specific problems as platform engineers building out platforms-as-a-service. We just started building a better way to build container images. That was the first product Depot ever built.

The idea was: building a container image inside GitHub Actions is painfully slow for two reasons. One, saving and loading layer cache over the network to the GitHub Actions cache API is — for lack of a better way of putting it — a dumpster fire. Two, building multi-platform images is rough because you have to emulate ARM if you're on an x86 runner.

We built the first prototype, rolled it out in beta in May 2022, had our first 10 customers within a week, and then spent May to July building the first public version. When we launched, it was something like 25 customers in the first week. That's nothing nowadays, but back then it was, "Oh — other engineers actually have this problem. This could be more than a side project."

Two insights that made it work

Behind those early 5–10x speedups: a fork of BuildKit and a willingness to move builds off the CI runner.

There were really two insights, and they go back to the original pain points.

One, building a container image inside a CI runner is slow because you don't have your previous layer cache. You can persist it off the machine over the network, but networks are slow — they're even slower inside CI systems, and they're flaky. The first real unlock was: we can fork BuildKit, the build engine behind Docker.

We have our own fork of BuildKit. Nowadays that fork isn't really a fork anymore — it's our own container build product, our own build engine. A lot has changed inside that, all the way down to the internal metadata database, which was this weird archived repository inside BuildKit. We replaced it with proper SQLite.

The container build product runs on EC2. We shift the container image build off the CI runner and onto a remote machine. The original prototype used EBS: we wrote your layer cache to an EBS volume. When your build is over, we kill the machine and keep the EBS volume, and when you do another build, we reattach the volume, so your cache is instantly available. (Except for the "EBS factor.") Nowadays Depot's container build product doesn't use EBS.

In 2022, we were talking 5–10x speedup from layer caching alone. Then we added multi-platform builds — take the Intel portion, build it on an Intel machine; take the ARM portion, build it on an ARM machine; skip emulation; merge the image back. You get one image that runs anywhere. That's where you get into 40–60x faster, because you're skipping the emulation. On GitHub Actions, you're running in a shared, constrained environment, generally on top of QEMU. Self-hosted ARM runners are very, very slow to spin up.
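As a rough illustration of the "merge the image back" step, the sketch below assembles an OCI image index (the multi-platform manifest list) from two per-architecture manifest descriptors. The digests and sizes are made up, the function name is hypothetical, and nothing here claims to be how Depot's build engine performs the merge.

```python
import json

OCI_INDEX = "application/vnd.oci.image.index.v1+json"
OCI_MANIFEST = "application/vnd.oci.image.manifest.v1+json"


def build_image_index(per_platform):
    """Combine per-architecture manifest descriptors into one OCI image index.

    per_platform maps an architecture name to a (digest, size) pair produced
    by a native build on that architecture.
    """
    manifests = []
    for arch, (digest, size) in sorted(per_platform.items()):
        manifests.append({
            "mediaType": OCI_MANIFEST,
            "digest": digest,
            "size": size,
            "platform": {"architecture": arch, "os": "linux"},
        })
    return {"schemaVersion": 2, "mediaType": OCI_INDEX, "manifests": manifests}


# Hypothetical digests standing in for the results of two native builds.
index = build_image_index({
    "amd64": ("sha256:" + "a" * 64, 1234),
    "arm64": ("sha256:" + "b" * 64, 1240),
})
print(json.dumps(index, indent=2))
```

A client pulling this index picks the entry whose platform matches its own, which is why one tag can run anywhere.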

Getting the first customers

Depot's go-to-market has been technical writing since day one.

We lean very heavily into writing deeply technical content on our blog. Depot has always been a pretty open book on the technical side. There aren't really many secrets about how we accelerate things, and that approach gave us material that resonated on Hacker News and various subreddits. That helped build the initial customer base.

We applied to YC on a whim; we had no real plans to actually start a company and do fundraising. But when we applied, we sweated the details. That's one thing many people don't get about YC even today: they just apply with "I have this idea, Twitter for cats with AI-generated images, you'll fund this, right?" We sweated the details. What's the actual business here?

We got into the Winter '23 batch — the last hybrid batch. Jacob and I didn't move to San Francisco. I literally interviewed for YC on the floor of my totally empty house back in Portland, because I had already sold everything. I was moving to France that weekend.

YC really selects for founders who will be resilient. Being a founder is not a walk in the park. You have trials and tribulations not daily but hourly once you start hitting scale. So you have to reflect: when have you been through some shit and come out the other side? That's "sweating the details" inside a YC application.

The biggest thing we got out of YC was the network — being able to reach any founder in the group in a non-spammy way. And YC instills something in you: you think big, then you talk to them and they say, "Okay, but how would you 10x that? How would you think even bigger?" You thought you'd reached the limit of your imagination — you 10x'd it, 100x'd it in your head before you came to them — and they ask, how would you 100x it in six months? I carry that with me today. Whether it's a revenue number or a usage metric, I always ask: what would I have to do to 100x this in some constrained timeline?

How Depot works

The current container build product looks very different from the prototype. Different storage, different provisioning, same goal: builds that feel instant.

Today, our build engine is our own. We've simplified it down to: what does it take to build a container image as quickly as possible inside a cloud environment? That goes all the way down to scheduling. BuildKit has the concept of deduplicating builds — a single BuildKit build can actually be 10 different container image builds, and it has the ability to dedupe that work. We've modified that to make it faster.
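To make the deduplication idea concrete, here is a toy sketch. It is not BuildKit's solver: it assumes each step is keyed by a hash of its instruction plus the key of the step before it, so two builds that share a prefix only execute the shared steps once. The function names and the `results` dict are hypothetical.

```python
import hashlib

executed = []


def step_key(instruction, parent_key=""):
    """Identify a step by its own definition plus everything it builds on."""
    return hashlib.sha256((parent_key + "\n" + instruction).encode()).hexdigest()


def build(instructions, results):
    """Run a chain of instructions, reusing any step already in `results`."""
    key = ""
    for instruction in instructions:
        key = step_key(instruction, key)
        if key not in results:
            executed.append(instruction)  # stand-in for doing the real work
            results[key] = f"output of {instruction}"
    return key


shared = {}
build(["FROM alpine", "RUN apk add curl", "COPY api /app"], shared)
build(["FROM alpine", "RUN apk add curl", "COPY worker /app"], shared)
print(executed)
```

The second build only adds its final step to `executed`, because the first two steps hash to keys that are already present.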

It's not EBS anymore. We run our own Ceph storage cluster — purely for throughput. EBS doesn't have great throughput at scale.

For both our GitHub Actions Runners and our container build product, we've created our own provisioning system. We couldn't just use auto-scaling off the shelf from Amazon, because auto-scaling is too slow. When somebody does a depot build, or runs on a Depot GitHub Actions runner, they expect that to start in one to three seconds. They don't want to wait for it to launch.

So we keep machines around that have been warmed and flashed with the AMI. If you're not familiar with how AMIs work inside Amazon, the AMI is effectively streamed off S3 — as blocks are read, those blocks are streamed off S3 into the machine. That wouldn't be performant for starting a brand-new instance, because we're talking about an AMI that's 70–80 GB in size. So we start the machine with that AMI, read all the relevant blocks into the machine state, and then stop it. When we go to start it again, it boots up instantly.

Snapshotting an EBS volume or AMI is also very slow — it can take 10 to 60 minutes. So instead we maintain a pool of warmed machines. When a build request comes in, we pull one out, start it, run the build.

A local build, remote

One of Depot's best features was a happy accident.

Originally we built depot build as a drop-in replacement for docker build — people could just swap it into their GitHub Actions or CircleCI workflow. Then we realized: wait, I can do the same thing locally too. I can move my local container image build off my machine and onto a remote build host.

That unlocked something we hadn't considered: the layer cache is now shared across machines. You could build the image on your machine, I could run depot build, we go against the same BuildKit host with the same cache, and I can reuse your build results. That was a total accident — but a really cool one.

The tradeoff: a local build means the build context has to flow over the wire to the remote BuildKit. We sync only what's changed in the build context. Similarly, when you build locally, you want to run the image — so you have to pull it back down. BuildKit by default assumed the result stays in the cache or gets pushed onward to a registry, but there's a third option: load it back.

So we wrote our own sync on the load side. BuildKit's default load was naive — it always sent everything back. We made load behave more like docker pull: only send back the layers that have changed. We built that ourselves.
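Both directions come down to the same comparison: look at content digests and transfer only what the other side lacks. The sketch below shows that comparison for a build context and for a load; the helper names and sample data are hypothetical, and it is not Depot's sync protocol.

```python
import hashlib


def digest(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


def to_transfer(wanted, already_present):
    """Return the digests in `wanted` that the receiver does not have yet."""
    return [d for d in wanted if d not in already_present]


# Build context direction: only changed files go up to the builder.
context = {"Dockerfile": b"FROM alpine\n", "app.py": b"print('v2')\n"}
builder_has = {digest(b"FROM alpine\n"), digest(b"print('v1')\n")}
upload = [name for name, data in context.items()
          if digest(data) not in builder_has]

# Load direction: only layers the local machine lacks come back down.
image_layers = ["sha256:base", "sha256:deps", "sha256:app-v2"]
local_layers = {"sha256:base", "sha256:deps", "sha256:app-v1"}
download = to_transfer(image_layers, local_layers)

print(upload, download)
```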

A registry of our own

Depot has built its own container registry, a side experiment called Depot.ai, and the world's most elaborate solution to AWS egress.

Depot.ai isn't really a product — more like an experiment. Pre-AI-craze, it built popular open-source models as images, hosted in our own registry, using a format called eStargz. When you say FROM depot.ai/some-model, you can COPY out a specific file. eStargz is smart enough to look up the file's index in the layer and only send back that file, instead of pulling the whole layer.
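The per-file lookup can be sketched with a much-simplified layer layout. This is not the real eStargz format (which stores tar entries and keeps its table of contents inside the layer itself); it only assumes that each file is compressed as an independent gzip member and that an index records where each member starts, so a single file can be fetched with one range read.

```python
import gzip
import json


def build_layer(files):
    """Pack files as independent gzip members and record where each starts."""
    blob, toc = b"", {}
    for name, data in files.items():
        member = gzip.compress(data)
        toc[name] = {"offset": len(blob), "size": len(member)}
        blob += member
    return blob, json.dumps(toc)


def read_range(blob, offset, size):
    # Stand-in for an HTTP Range request against the registry.
    return blob[offset:offset + size]


def fetch_file(blob, toc_json, name):
    """Look the file up in the index and decompress only its byte range."""
    entry = json.loads(toc_json)[name]
    return gzip.decompress(read_range(blob, entry["offset"], entry["size"]))


layer, toc = build_layer({
    "model.bin": b"\x00" * 4096,
    "config.json": b'{"hidden_size": 8}',
})
print(fetch_file(layer, toc, "config.json"))
```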

Back when we started Depot, you pushed to your own registry. Now we have our own, built over Tigris and fronted with various CDNs. Before that, we had a version where layer blobs were stored in Tigris but the manifest was stored in ECR. That was a Rube Goldberg machine — you had layer blobs distributed worldwide thanks to Tigris, but a pinch point at ECR. Generally fine, until us-east-1 goes down, and now your manifests are unreachable, which means you can't fetch the blobs.

We recommend customers push to our registry. Performance is better: pushing is faster because it's inside our network. We use Tigris's "siphon out of S3" concept: the registry writes to our own S3 bucket and we replicate to Tigris. If you do a depot pull from your machine, it goes to the closest Tigris edge location. The side benefit to us: it doesn't touch the AWS account, so we don't have to think about egress. And our infrastructure is smart enough to know: if a container build is happening inside our AWS account and you have FROM registry.depot.dev/my-image, we don't go all the way out to Tigris to fetch it. We know exactly where that image lives.

Security: nuke it from orbit

Build hosts run arbitrary code as root. Depot's answer: never trust a machine twice.

Two common misconceptions about how we run things.

One: people assume we leave machines on. That would be wildly inefficient — you might run one container build then nothing for three hours.

The other: people assume we stop and reuse. But if you know anything about Docker builds, they require the highest level of access on the machine — effectively root. We can't trust it. You could have tainted it, stashed something in memory, anything. Reusing would be a major security hole.

What we do instead: we nuke it from orbit. All build hosts and GitHub Actions runners — the backing EC2 instances — are single-tenant. We launch it, it runs your job, then we kill it. That's why our provisioning system has to maintain a fleet of compute: pull one out, instantly start it, run, kill it.
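The "pull one out, start it, run, kill it" lifecycle can be modeled with a small in-memory pool. This is a toy sketch under stated assumptions: the class and method names are hypothetical, no AWS API is called, and warming a machine is reduced to creating an object.

```python
from collections import deque
from dataclasses import dataclass
from itertools import count


@dataclass
class Machine:
    machine_id: int
    state: str = "stopped"  # pre-warmed, then stopped until needed


class WarmPool:
    """Toy model of a pool of pre-warmed, single-use machines."""

    def __init__(self, target_size):
        self._ids = count(1)
        self.target_size = target_size
        self.idle = deque()
        self.refill()

    def _warm_new_machine(self):
        # Stand-in for: boot from the AMI, read the needed blocks, then stop.
        return Machine(machine_id=next(self._ids))

    def refill(self):
        while len(self.idle) < self.target_size:
            self.idle.append(self._warm_new_machine())

    def acquire(self):
        # Starting an already-warmed machine is the fast path.
        machine = self.idle.popleft() if self.idle else self._warm_new_machine()
        machine.state = "running"
        self.refill()
        return machine

    def release(self, machine):
        # Single-use: the machine is terminated, never returned to the pool.
        machine.state = "terminated"


pool = WarmPool(target_size=3)
m = pool.acquire()
pool.release(m)
print(m.state, len(pool.idle))
```

The design point is that `release` never puts a machine back in `idle`; capacity is restored only by warming fresh machines.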

I've been on a number of EC2 service team calls about how we use the EC2 API. Depot is making tens of millions of API calls to EC2 per day. We have to be in the top 10% of daily volume for fresh EC2 instances.

Why not rack and stack?

Every infra company asks at some point.

I believe you're buying a different problem if you rack and stack. You're solving two problems at the same time: the software business and a real estate problem. How do you maintain enough capacity for upcoming demand? You'd have to over-provision massively. You're seeing this with all the frontier lab companies: even if they don't hit a revenue number, they've already earmarked money for servers that are five years away.

Spot instances are interesting for build acceleration, because spot can be pulled out from under you. One of the trickiest things about operating in our space: we're literally talking about a Linux VM where anybody can do anything inside. If you could box that in — only certain things happen inside the VM — then a 30-second eviction timer is something you could engineer around. But because it can be anything (somebody could be running a massive database migration inside the runner), you don't want to kill that machine. There are interesting companies doing live migration — snapshotting memory, replaying it into a new machine so you fail over at the memory level. At some scale that's worth it, but there are a lot of other levers you can pull first.

Beyond container builds

Building one fast thing showed Depot how to build a lot of fast things.

We built the container image build product and focused on it for a year after YC. But it became apparent that all the building blocks we'd assembled could be applied to other builds. Instead of focusing on the individual build, why not focus on the entire CI workflow?

So we built our own managed GitHub Actions Runners with our own tech baked in — 3–10x faster than GitHub-hosted runners. We optimized the runner binary and short-circuited a bunch of stuff. We built a system that doesn't rely on GitHub's webhooks to know when a job needs to run, because webhooks are slow and very flaky.

Pro tip: go read GitHub's docs on webhooks. You'll quickly find they're "best effort" — meaning they won't always be delivered. Which is extremely problematic when you want to run a job.

Depot CI

The newest product takes Depot's "control everything" thesis to its logical conclusion.

Building and scaling the managed GitHub Actions Runners revealed what's challenging about that offering: the dependency on GitHub. You can only accelerate 30% of the workflow because the other 70% still lives with GitHub. Their plumbing. GitHub actually delivering the job to the runner. The runner reporting back to the mothership.

So we built our own CI engine: Depot CI. What if Depot controlled everything? From the ground up: Depot's compute, on Depot's infrastructure, with Depot's caching, connected to all the other Depot products — with a programmable interface. Everything can be done via an API or CLI command.

The power of that: I can give it to any agent, and the agent can write code, trigger CI, monitor whether it's green, dump the logs if it's red, and fix it itself.
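That loop is easy to express once CI is callable from code. The sketch below uses a fake CI and a fake agent so it runs without any external service; `trigger_ci`, `propose_fix`, and `RunResult` are hypothetical names, not Depot's API or CLI.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    status: str  # "green" or "red"
    logs: str = ""


def fix_until_green(trigger_ci, propose_fix, max_attempts=3):
    """Trigger CI and feed failure logs back to the agent until it passes."""
    for attempt in range(1, max_attempts + 1):
        result = trigger_ci()
        if result.status == "green":
            return attempt
        propose_fix(result.logs)
    raise RuntimeError("CI still red after %d attempts" % max_attempts)


# Fake CI and agent, for illustration only.
state = {"bug_fixed": False}


def fake_ci():
    if state["bug_fixed"]:
        return RunResult("green")
    return RunResult("red", logs="AssertionError in test_login")


def fake_agent(logs):
    state["bug_fixed"] = "AssertionError" in logs


attempts = fix_until_green(fake_ci, fake_agent)
print(attempts)
```

Capping the loop with `max_attempts` keeps a confused agent from retrying forever.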

Depot CI uses a different architecture from the EC2 product. It runs on bare metal inside AWS, using Cloud Hypervisor under the hood, with our own bit of spice on top. Depot CI is built on top of our own sandboxes: we've built sandboxing over these metal hosts. That's fundamentally different from one-time-use EC2 instances.

The driver: we got really good at optimizing how fast an EC2 instance could start. The general number for an on-demand EC2 instance is 30 to 60 seconds, depending on AMI and machine size. We got that to two seconds. But I want it to be 200 milliseconds. That's where you get into a fundamentally different architecture — a micro VM or sandbox on a metal host. There are things you can do to start it significantly quicker that you just can't do with a virtualized service like EC2.

Two seconds for an EC2 instance — that's ripping out everything at the kernel level you don't need. EC2 hosts start all kinds of random things you don't care about. It comes back to: how do you warm the machine?

One thing that's unique about Depot CI: it understands other CI syntaxes. It speaks GitHub Actions today. We're working on our own SDK so you can define your own CI language in code, and we're working on GitLab syntax. We translate it into our own intermediate representation, then turn that IR into the individual sandbox commands.

A lot of people think a CI engine is directly tied to a YAML syntax. We've fundamentally broken that. You can bring a YAML syntax, we'll do the translation, and it just runs. For people adopting Depot CI, they literally drag-and-drop from the .github folder into a .depot folder and it works.
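The decoupling of syntax from engine can be pictured as a small compiler: a front end per syntax, one shared intermediate representation, and a back end that emits commands. The sketch below assumes an already-parsed GitHub Actions workflow (a plain dict); the `Step` type and the command strings are hypothetical and do not reflect Depot's actual IR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    job: str
    kind: str      # "shell" or "action"
    payload: str


def from_github_actions(workflow):
    """Front end: lower a parsed GitHub Actions workflow into IR steps."""
    steps = []
    for job_name, job in workflow["jobs"].items():
        for step in job["steps"]:
            if "run" in step:
                steps.append(Step(job_name, "shell", step["run"]))
            else:
                steps.append(Step(job_name, "action", step["uses"]))
    return steps


def to_sandbox_commands(steps):
    """Back end: turn IR steps into (hypothetical) sandbox commands."""
    prefix = {"shell": "exec", "action": "run-action"}
    return [f"{s.job}: {prefix[s.kind]} {s.payload}" for s in steps]


workflow = {
    "jobs": {
        "test": {
            "runs-on": "ubuntu-latest",
            "steps": [
                {"uses": "actions/checkout@v4"},
                {"run": "make test"},
            ],
        }
    }
}
commands = to_sandbox_commands(from_github_actions(workflow))
print(commands)
```

Supporting another syntax would mean adding another front end that produces the same `Step` list; the back end stays untouched.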

One thing we learned early: secrets are tricky to pull back out of GitHub Actions manually. Some orgs have hundreds of secrets — nobody wants to copy-paste those into a new CI system. So we figured out a way to run a one-time GitHub Actions workflow during a Depot migration that ports the secrets over. What might be hours of manual work becomes five seconds.

The five products

We have five products today:

Depot CI — our own CI engine.
Container Image Build — original product. 40–60x faster.
GitHub Actions Runners — managed runners, 3–10x faster than GitHub-hosted.
Depot Registry — our own container registry.
Depot Cache — remote caching service. The cache performance gains of our container layer cache, applied to other build tools like Bazel, Turborepo, and Gradle.

There's network latency in there, but when you stitch all of these together, you get a compounding effect on build performance.

Running EC2 at scale

Lessons from running EC2 at a scale where edge cases stop being edge cases.

Everybody is feeling the CPU crunch. We have a strategic advantage being inside AWS. But here's what shocked me at scale: many people doing small things with EC2 (launching 10, 20, 30 instances a day) come to rely on the fact that when you launch an instance, it's good. At Depot's scale, we're talking millions of EC2 instances per day. It's not uncommon for us to launch one and have it be bad: corrupt, EBS screwed, local instance storage not operating.

Once Depot reached a certain scale, we built systems to detect that. PlanetScale's team has fantastic blog posts about seeing this at their scale. You pull a machine out, build your own health checks to confirm it's good. We had one case where we'd determine the machine wasn't good, give it back, make the API call again — and get back the same flipping machine.

So we have to reverse-engineer APIs. Credit to Amazon — why would they cover that scenario? It's not the 90% case. But we have to build a system: pull out an instance, it's bad, hold onto it until we pull one out that's good, then give it back. This is error correcting at the infrastructure level.
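The "hold the bad one until you get a good one" trick looks roughly like this. The provider here is a fake that always hands out the first free instance, which reproduces the "same machine again" behavior; `launch`, `release`, and the health check are hypothetical callables, not EC2 API calls.

```python
def acquire_healthy(launch, is_healthy, release, max_tries=5):
    """Hold bad instances until a good one arrives, then give them back."""
    held = []
    try:
        for _ in range(max_tries):
            instance = launch()
            if is_healthy(instance):
                return instance
            held.append(instance)  # keep it so the API cannot hand it back
        raise RuntimeError("no healthy instance found")
    finally:
        # Runs on success and on failure: return every held bad instance.
        for bad in held:
            release(bad)


# Fake provider: always hands out the first free instance.
free = ["i-bad", "i-good"]
released = []


def launch():
    return free.pop(0)


def release(instance):
    released.append(instance)
    free.insert(0, instance)


instance = acquire_healthy(launch, lambda i: i != "i-bad", release)
print(instance, released)
```

Releasing the bad instance immediately would make the next `launch()` return it again; holding it is what forces the provider to hand out a different one.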

CI is the new bottleneck

What changes when one engineer writes code at 20x the throughput.

CI has always been critically important, but it's also been "not product work, not feature work — you're not directly delivering value to customers via CI." That's the take engineering teams have historically had.

But now, a single engineer with five agents at their side can author code at 20x the throughput they could three years ago. Carry that out to even a five-person team — the throughput at the code-authoring stage is massive. But look at what happens after committing. All of that code and all of that velocity flows through one thing: CI.

We've taken this new technology and bent it into our existing paradigm, which is a very human-centric paradigm. It goes back decades — commit code, open a PR, CI runs. I might be reviewing two or three PRs a day. Maybe five or ten deployments a day. But now we have hundreds of PRs a day, all going through CI, all still needing to be reviewed.

What CI is changing into is the verification layer for all of this code being authored. Agents are writing more and more code. Engineers can't really review all of it — even with AI agents reviewing code. CI sits in a unique space where it can be the verification substrate: can I trust this code? Is it high quality? Can it go to production?

A lot of what we're working on over the next six months: how do we unlock that? How do we surface it to engineers in a high-level way — they don't need to go all the way into the details — but also automatically surface it back into the agents writing the code? Right now there's a clunky workflow where the human has to copy the error out of CI and paste it back into the agent that wrote the code. CI should just know that code failed, and the agent should fix it itself. The loop continues.

Everybody's looking at this through the same paradigm we used for software engineering yesterday — but we're not doing that software engineering anymore. We're doing a totally different type of software engineering. It doesn't mean software engineers are going away — not in a million years. We need more engineers than ever. But we now need to manage all of these workstreams. These are asynchronous workstreams we're doing with machines. We need to define a new paradigm for that — not bend our existing one into that technology.

Listen to the full conversation on the Kubelist Podcast.