What value can a Linux Distribution (Vendor) provide in the Container and Kubernetes age
Preface
I have recently taken a look at a few containerized services that are quite popular in the Kubernetes world, most notably Istio. I investigated how Istio is being built and what the necessary pieces are to assemble the whole container image.
The complexity of Istio's build process was quite astonishing for someone coming from the "traditional" distribution side of the world, where you'd provide all dependencies as individual RPMs or Debian packages (or any other package manager of your choice) and then build the final binary from those. In contrast, Istio's Makefile starts by grabbing some of its dependencies as containers from Docker Hub instead of source tarballs. Fortunately, you can turn this behavior off rather easily. Unfortunately, the Makefile will then proceed to download prebuilt binaries instead 😔.
Digging further uncovered that the first of these binaries is envoy, a proxy for cloud native applications. In contrast to Istio, it is written in C++ (and not in Go), uses Bazel as its build system and drags in about 80 new direct dependencies. These include just about everything, ranging from small system libraries like zlib to real heavyweights like LLVM or three distinct WebAssembly runtimes. All of these are pinned to specific versions and (of course) pull in even more dependencies. The "classic" approach of unbundling everything is simply not applicable here: there are far too many components, so vendoring the dependencies is realistically the only option. But even that turned out to be non-trivial, as Envoy's dependencies were not set up to support offline builds [1] (probably; I did not dig too deep into this).
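To make the pinning a bit more concrete, here is a minimal sketch of what a vendoring helper could look like: it downloads each pinned source archive and verifies it against a recorded checksum, so that the actual build can later run offline. This is not Envoy's or Bazel's actual mechanism; the lock file format, URL and checksum below are purely hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Minimal sketch of pre-fetching pinned dependencies for an offline build.

This is not how Envoy or Bazel actually vendor their dependencies; it only
illustrates the general idea: every dependency is pinned to an exact source
archive plus checksum, and everything is fetched up front so that the real
build can run without network access.
"""
import hashlib
import pathlib
import urllib.request

# Hypothetical lock file entries: (name, source URL, expected SHA-256).
# The URL and checksum are placeholders, not real values.
PINNED_DEPS = [
    ("zlib", "https://example.org/zlib-1.3.1.tar.gz",
     "0000000000000000000000000000000000000000000000000000000000000000"),
]

VENDOR_DIR = pathlib.Path("vendor")


def fetch_and_verify(name: str, url: str, expected_sha256: str) -> pathlib.Path:
    """Download one pinned archive and refuse it if the checksum does not match."""
    VENDOR_DIR.mkdir(exist_ok=True)
    target = VENDOR_DIR / url.rsplit("/", 1)[-1]
    if not target.exists():
        urllib.request.urlretrieve(url, target)
    digest = hashlib.sha256(target.read_bytes()).hexdigest()
    if digest != expected_sha256:
        raise RuntimeError(f"checksum mismatch for {name}: got {digest}")
    return target


if __name__ == "__main__":
    for dep_name, dep_url, dep_sha in PINNED_DEPS:
        print("vendored", fetch_and_verify(dep_name, dep_url, dep_sha))
```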
But Istio's dependency chain is actually not the main topic of this blog post. It rather became painfully apparent that our modern cloud native infrastructure has reached a complexity level where rebuilding it with the methods and tooling of your favorite Linux Distribution is becoming increasingly infeasible. So if a Linux Distribution Vendor can realistically only replicate upstream's build, how can an Enterprise actually add value here? Or to be more concrete: what do companies like SUSE, Red Hat or Canonical have to do so that an enterprise would rather use (and pay for!) their images instead of leveraging upstream's builds?
I believe that there are multiple options.
Securing the supply chain
Securing the supply chain has become even more important after the SolarWinds debacle of 2021 and appears to be the obvious value add for an Enterprise vendor. However, given my previous illustration, is that even remotely realistic?
I believe that our "old" methods of building software will be less and less feasible in the future and we will not be able to package every single dependency of something like Istio and Envoy. But the question is: do we actually have to? Should we really "package the world"?
Therefore we have to ask ourselves: what is the actual benefit of unbundling everything into individual packages? Besides pleasing our inner perfectionist, it allows your distribution's security team to track what is currently in the distribution (and its artifacts) and whether any vulnerabilities are known for these components. Additionally, it allows you to patch security issues and bugs of individual packages independently of each other and to track this whole process. Thus, you can provide your customers with the guarantee that no known vulnerabilities are present in the distribution and on their machines (provided that they were kept up to date).
This whole process is nowadays moving more and more upstream: open source projects themselves are leveraging databases of known vulnerabilities and employing automated updates (e.g. via dependabot on GitHub). Nevertheless, this does not mean that a security team provides less value. I would argue that it is more important than ever! Given the incredible breadth and depth of today's applications' dependency trees, it is highly valuable to be able to track what exactly is in your software (i.e. to have an up to date Software Bill of Materials). Having this attestation from an independent and trustworthy source gives your customers the confidence that their software is not susceptible to known vulnerabilities.
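As a rough illustration of what such tracking could look like, here is a minimal sketch that cross-checks a CycloneDX-style JSON SBOM against an advisory feed. The file name, the advisory data and the exact-version matching are simplified assumptions; a real security team would work with a proper vulnerability database and version-range logic.

```python
"""Minimal sketch: cross-check an SBOM against known-vulnerable versions.

Assumes a CycloneDX-style JSON SBOM (a "components" list with "name" and
"version" fields) and a hypothetical advisory feed that maps package names
to affected versions.
"""
import json


def affected_components(sbom_path: str, advisories: dict[str, set[str]]) -> list[str]:
    """Return 'name version' for every SBOM component with a known advisory."""
    with open(sbom_path) as fh:
        sbom = json.load(fh)
    hits = []
    for component in sbom.get("components", []):
        name = component.get("name")
        version = component.get("version")
        if version in advisories.get(name, set()):
            hits.append(f"{name} {version}")
    return hits


if __name__ == "__main__":
    # Hypothetical advisory feed and SBOM file name, for illustration only.
    demo_advisories = {"zlib": {"1.2.11"}}
    for hit in affected_components("istio-sbom.json", demo_advisories):
        print("known vulnerability in:", hit)
```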
If a vulnerability is discovered, things unfortunately become more complicated. As nearly every modern programming stack relies on pinning and vendoring its dependencies, patching individual vulnerabilities becomes more difficult: bundled dependencies must be patched individually in every package, duplicating the required work. However, this is still much less work than unbundling the full dependency chain and maintaining thousands of packages just for a single container.
A hybrid approach?
Bundling/vendoring your dependencies makes life much easier for upstream, but it complicates matters for downstream distributions. In a perfect world, where every upstream project keeps its dependencies at the latest version, bundling wouldn't be that much of an issue [2]. But this world is not perfect: some upstream projects lack the tooling or the manpower to stay up to date all the time. Some projects get abandoned; maintainers quit, burn out, move on, or maybe they just decided to hike the Appalachian Trail and will return in half a year. Irrespective of the reason, you're now stuck with a vulnerable library in your bundle and have to fix it.
Here a hybrid approach could help: instead of either bundling or unbundling everything, we could unbundle only the critical parts of the application. Critical parts could for instance be system libraries, components with a known sub-optimal security history, or libraries that directly handle untrusted input (i.e. where a security issue is most likely to cause damage). This approach is neither new nor my invention: if I recall correctly, the long term goal of openSUSE's NodeJS packaging tools is to support exactly this scenario.
A Linux Distribution Vendor would thus have to assess the dependency chain of a package, decide which dependencies are critical and need to be maintained separately, and unbundle these. This will require new tooling to be developed, which would hook into existing programming language package managers (like pip, npm, bundler, etc.) and replace some of the dependent libraries with those provided by the distribution.
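As a very rough sketch of what such tooling could do, the following snippet takes pip-style requirement pins and overrides only a hand-picked set of critical packages with distro-maintained versions, leaving everything else exactly as upstream pinned it. The package names and versions are hypothetical, and this is neither openSUSE's NodeJS tooling nor any existing distro tool.

```python
"""Minimal sketch of the hybrid idea: keep most dependencies exactly as
pinned by upstream, but override a hand-picked set of critical ones with
the versions maintained by the distribution.

The override table and the pip-style requirement lines are purely
hypothetical illustrations.
"""

# Critical dependencies the distribution maintains itself, mapped to the
# distro-provided version (hypothetical values).
DISTRO_OVERRIDES = {
    "cryptography": "42.0.5",
}


def rebundle(requirements: list[str]) -> list[str]:
    """Replace pins of critical packages with the distro-maintained version,
    leaving every other pin untouched."""
    result = []
    for line in requirements:
        name = line.partition("==")[0].strip()
        override = DISTRO_OVERRIDES.get(name)
        result.append(f"{name}=={override}" if override else line)
    return result


if __name__ == "__main__":
    upstream_pins = ["requests==2.31.0", "cryptography==41.0.0", "idna==3.4"]
    print("\n".join(rebundle(upstream_pins)))
```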
This would reduce the overall maintenance burden to far more manageable levels, while the distribution vendor could focus on the really critical parts of the software stack and ensure that these are audited and maintained.
Relevant Security Issues
We touched on the increasing breadth and depth of today's applications. This leads to a new problem with respect to security: the signal to noise ratio of vulnerability reports is constantly decreasing as dependency chains grow. Especially in programming language ecosystems where tiny libraries are the norm (Node.js is notorious for this, but Rust to a certain extent as well), you frequently run into the issue that something in your dependency chain is vulnerable, but it is in a component where it is impossible to trigger this vulnerability. The prime example would be a library that is vulnerable to untrusted input, but you're only ever feeding it well-known, trusted input.
But what has all this to do with Enterprise Distribution vendors? Assuming that you ship a Node.js application in a container and your customer runs npm audit or yarn audit, they'll probably be greeted with a frighteningly long list of vulnerabilities [3]. You as an Enterprise Distribution vendor, or more specifically your security team, would now provide the customer with a list of these vulnerabilities, accompanied by reasoning why they are not an issue (and thus why you have not patched them).
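A minimal sketch of how such an assessment could be attached to audit findings is shown below. It assumes a simplified JSON report containing findings with "id" and "module" fields (the real npm audit --json and yarn audit outputs are far richer) and a hypothetical, hand-maintained table of vendor assessments.

```python
"""Minimal sketch: annotate audit findings with a vendor security assessment.

The report format, file name and advisory IDs are simplified placeholders,
not the actual `npm audit --json` or `yarn audit` output.
"""
import json

# Hypothetical, hand-maintained vendor assessments, keyed by advisory ID.
VENDOR_ASSESSMENTS = {
    "GHSA-xxxx-example": "ReDoS in a parser that only ever sees trusted, static config files.",
}


def annotate(report_path: str) -> None:
    """Print each finding together with the vendor's verdict, if one exists."""
    with open(report_path) as fh:
        findings = json.load(fh)["findings"]
    for finding in findings:
        verdict = VENDOR_ASSESSMENTS.get(finding["id"])
        status = f"not affected: {verdict}" if verdict else "needs triage / patch"
        print(f'{finding["module"]} ({finding["id"]}): {status}')


if __name__ == "__main__":
    annotate("audit-report.json")
```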
This is by no means a simple task, as even a rather simple Node.js application will have a lot of dependencies and could have quite a few (potentially absolutely harmless) vulnerabilities. Auditing them all takes a considerable amount of time, but on the other hand, if you're running a bank, an insurance company or a space ship, you want to be absolutely sure that your software is not vulnerable.
Ensure a clean upgrade path
So far we have covered the supply chain and security aspects of an Enterprise Distribution vendor. However, most vendors additionally provide very long support cycles (10-15 years in the case of SUSE and Red Hat), where they guarantee that their customers' systems can be patched and that updates will not break existing applications and systems.
Given our previous dive into the incredible size of the dependency chain, can an enterprise realistically maintain something like Envoy or Istio for well over a decade without upgrading to the next major version, only backporting fixes? In my humble opinion: absolutely not. I don't intend to downplay the tremendous amount of work and engineering ingenuity that Enterprises invest into maintaining LTS code streams. However, we shouldn't fool ourselves: modern applications contain far too many components that are moving at an incredible pace. It is the development version that gets most of the attention, testing and security fixes, not the maintenance branches (if they even exist). Everyone who has contributed to a larger piece of software will confirm that even verifying whether a vulnerability or a bug is present becomes non-trivial once the main development branch and the LTS code stream have sufficiently diverged [4].
But why do customers want LTS releases? Well, because you just have to set up a system once and can then leave it running for the next decade with a guarantee that your operating system is maintained and your deployed applications will keep running. And if not, you can pick up your phone, call support and get the issue fixed by someone else. If an Enterprise vendor can no longer guarantee the maintenance of an application for a long time, why not improve the upgrade path instead? Imagine you are an Enterprise and you wish to deploy and pay for a maintained version of Grafana. Which option would you choose:
- an LTS version of Grafana that is stuck at version x.y for the next 5 years, or
- the current version of Grafana with a supported upgrade path?
I would most certainly pick the second option, and I believe that the focus of Enterprise Distribution vendors will have to shift more towards guaranteeing a smooth upgrade. This is not only because it will become next to impossible to maintain a huge software application over the long haul, but also because the cloud native ecosystem is moving at a much faster pace than "traditional" applications. It is certainly nice to have a supported version of e.g. the Kubernetes control plane, but if it relies on a version of Kubernetes that is no longer used by any Kubernetes distribution out there, then why would you even think about running it? The same goes for an outdated version of Kubernetes itself: even if it is maintained, if your developers' tooling no longer works with it because it is "too ancient", you will most likely not want to run it in production either.
Wrap up
Long story short, the big value add of an Enterprise Distribution vendor will most likely be ensuring that upgrades are smooth and supported, so that customers can move to the "latest and greatest" version of their applications. The QA cost of this is going to be tremendous: it will require a lot of new tooling and surely also completely new approaches to QA, but I believe that this is really the only way forward. Customers relying on Cloud Native applications will not want to stay on an old version forever, and they would much rather pay for a smooth upgrade than have to live with the old release for "all eternity".
Footnotes:
[1] Bazel itself supports offline builds by caching the sources of your dependencies, but since Bazel recipes can essentially execute arbitrary code during the build, there's no guarantee that this will actually work. See also the Bazel documentation on offline builds if you're interested in this topic.
[2] Even the size increase due to duplicated dependencies could be solved via aggressive deduplication of filesystem blocks by the package manager (in theory at least). This could become more relevant for package managers for "immutable" base operating systems that are e.g. based on ostree, where the deduplication would only have to be computed once on image creation and could therefore be very thorough and expensive.
[3] You can find a more in-depth coverage of this in a blog post by Dan Abramov.
[4] Although you often have a simple reproducer for a bug, the bug is usually caused by a mistake buried deep in the call stack. Verifying whether the bug is still there can become tricky if it only occurs under unique conditions that might not be easily reproducible in the maintenance branch. Fixing the issue superficially by simply ensuring that the reproducer no longer triggers a crash/bug can be simple, but ensuring that the underlying cause is actually gone can turn out to be a tremendous amount of work. This is especially the case if you are not one of the main developers who is highly familiar with the code, but "just" a maintenance engineer who has to patch the LTS branch.