One of the most important principles that working on Observability has taught me is this: by default, anything you do not observe is broken.
No qualifications, no hedging: it is broken right now, for the simple reason that you cannot trust that it is not. If it were broken, you would have no way of knowing; so for all practical purposes, it is reasonable to treat it as broken from the outset.
The baseline for a robust product – especially one that is cloud-native and distributed – is to have observability in place that gives you confidence beyond reasonable doubt that, if something were broken, you would know about it as soon as it happened, and ideally a little sooner than that.
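As a minimal sketch of this default, assume each component reports a timestamp for its last telemetry signal. The names below (`MAX_SIGNAL_AGE`, `assumed_status`, the component names) are hypothetical and chosen only for illustration. The point is that "never observed" and "observed too long ago" both collapse into the same answer.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical threshold: how old a signal may be before we stop trusting it.
MAX_SIGNAL_AGE = timedelta(minutes=5)


def assumed_status(last_signal: Optional[datetime], now: datetime) -> str:
    """Classify a component purely by how recently it was observed."""
    if last_signal is None:
        # Never observed: we cannot show that it works, so assume it does not.
        return "assume-broken"
    if now - last_signal > MAX_SIGNAL_AGE:
        # Observed once, but the evidence is stale.
        return "assume-broken"
    return "observed"


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "api": now - timedelta(seconds=30),
    "worker": now - timedelta(hours=2),
    "cron-job": None,
}
statuses = {name: assumed_status(seen, now) for name, seen in last_seen.items()}
print(statuses)
```

Note that a fresh signal only earns the label "observed", not "healthy": whether the component is actually fine is for the telemetry itself to say.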
We also know that no application survives contact with real users unscathed (paraphrasing Helmuth von Moltke). That is why we do load testing with tools like k6. By simulating load and system stress, we can establish confidence that an application will not fail when exposed to large volumes of production traffic.
Load alone, however, is not the only source of instability and outages; infrastructure reliability must also be accounted for. What happens if your network begins dropping packets, pods become unavailable, or a service your application depends on stops responding or returns faulty data?
To extend Moltke’s observation: no application survives contact with cloud infrastructure unscathed. Load testing verifies resilience under production-like usage; chaos testing verifies resilience under production-like deployment conditions. By simulating infrastructure failures – network jitter, pod crashes, worker node failures – we can establish that the system is robust not only under usage pressure, but also when running on top of inherently unreliable infrastructure.
| Concern | Practice that addresses the concern |
|---|---|
| Will it work on my machine? | Unit, integration (feature …) tests |
| Will it work on my cloud? | |
| - Will it work under production load? | Load testing |
| - Will it work on unstable infrastructure? | Chaos testing |
| - Would I know it if it did not work? | Observability |
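To make the chaos-testing row concrete, here is a toy, in-process illustration of fault injection: a wrapper makes a dependency fail at a configurable rate, and we count how often a simple retry policy still succeeds. It is only a sketch of the idea. Real chaos tools inject faults at the infrastructure level, and every name here (`flaky`, `call_with_retries`, `fetch_price`) is hypothetical.

```python
import random


class DependencyError(Exception):
    """Raised when the simulated dependency drops a call."""


def flaky(func, failure_rate, rng):
    """Wrap func so that a fraction of calls fail, like a lossy network."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise DependencyError("injected fault")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func, attempts):
    """Try func up to `attempts` times; re-raise the last error if all fail."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except DependencyError as err:
            last_error = err
    raise last_error


def fetch_price():
    """Stand-in for a call to a service the application depends on."""
    return 42


rng = random.Random(7)  # seeded so that runs are repeatable
unreliable_fetch = flaky(fetch_price, failure_rate=0.3, rng=rng)

outcomes = {"ok": 0, "failed": 0}
for _ in range(1000):
    try:
        call_with_retries(unreliable_fetch, attempts=3)
        outcomes["ok"] += 1
    except DependencyError:
        outcomes["failed"] += 1
print(outcomes)
```

With a 30% failure rate and three attempts, a small share of calls still fails outright. That residue is what the observability row is about: it should show up in your telemetry rather than go unnoticed.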
At Canonical, we are working on day-2 operations automation across the entire open-source cloud operations stack – from deploying an application on a cloud and ensuring it interoperates correctly, all the way to observing it and subjecting it to load and chaos testing. The goal is to reach a level of confidence where deploying on a Friday afternoon is a routine, low-risk activity: not because the system will never break, but because we trust that we will receive an alert before it does, and that we will have sufficient data available to begin remediation immediately.
Juju primer
Juju is Canonical’s cloud operator driver. Full documentation and technical definitions are available at canonical.com/juju. It is useful to think of Juju as an operating system for the cloud, where a “cloud” is any collection of abstracted storage, networking, and compute resources, whether public, private, or a development environment. Charms are the applications that run on Juju (juju deploy postgres, and so on).
For those coming from Kubernetes, a charm is a sidecar operator that operates alongside your regular application or service pod.
For those coming from VM-based clouds, a charm is a co-located process that operates alongside your regular application or service package (snap, deb, or custom binary).
Juju is model-driven: you declaratively define the topology of your deployment, for example “I want a Tempo instance and a Ceph cluster with S3, and I want Tempo to use that Ceph cluster as storage”. Juju then reconciles towards this desired state, abstracting away the specifics of the underlying substrate.
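The reconciliation idea can be sketched in a few lines. This is not how Juju is implemented. It is a toy model, with hypothetical names (`Model`, `plan`, `apply`), that computes the difference between a desired and an actual topology and turns it into steps.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    """Toy stand-in for a deployment model: applications and integrations."""
    apps: set = field(default_factory=set)
    integrations: set = field(default_factory=set)  # frozensets of two app names


def plan(desired, actual):
    """Return the ordered steps that move `actual` towards `desired`."""
    steps = []
    for app in sorted(desired.apps - actual.apps):
        steps.append(("deploy", app))
    for pair in sorted(sorted(p) for p in desired.integrations - actual.integrations):
        steps.append(("integrate", *pair))
    for pair in sorted(sorted(p) for p in actual.integrations - desired.integrations):
        steps.append(("remove-integration", *pair))
    for app in sorted(actual.apps - desired.apps):
        steps.append(("remove-app", app))
    return steps


def apply(steps, actual):
    """Apply each step to the toy model in place."""
    for action, *names in steps:
        if action == "deploy":
            actual.apps.add(names[0])
        elif action == "integrate":
            actual.integrations.add(frozenset(names))
        elif action == "remove-integration":
            actual.integrations.discard(frozenset(names))
        elif action == "remove-app":
            actual.apps.discard(names[0])


# Desired state: Tempo and Ceph, with Tempo using Ceph as storage.
desired = Model(apps={"tempo", "ceph"}, integrations={frozenset({"tempo", "ceph"})})
actual = Model(apps={"tempo"})

steps = plan(desired, actual)
print(steps)
apply(steps, actual)
print(plan(desired, actual))  # nothing left to do once the states match
```

The useful property is that the operator only ever edits `desired`: the steps are derived, and running the loop again after convergence produces no work.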
Where a conventional operating system has application developers encoding business logic (“this is how you edit images”, “this is how you serve a database”) in a binary distributed via a package, in Juju charm developers encode operational logic (“this is how you scale Postgres”, “this is how you add TLS to this server”) in charm code distributed as a charm package on Charmhub. In practice, a charm is a YAML specification with Python code that describes how a given application should be installed, configured, and operated in response to Juju model state and model changes.
For a more detailed introduction to Juju, see this introductory tutorial.
Observability in Juju
For over four years, the Observability team at Canonical has been developing COS (Canonical Observability Stack): an opinionated, scalable solution for monitoring cloud workloads. Further information is available at documentation.ubuntu.com/observability.
COS can ingest all OpenTelemetry signals – logs, metrics, profiles, and traces – and use them to populate dashboards and emit alerts suited to your monitoring requirements.
Both external users and internal Canonical services can bootstrap the substrate, Juju, and COS using Terraform; Juju then takes over, with charms handling all operational concerns such as scaling, backups, upgrades, and integration logic.
Load testing in Juju
To run load tests against a staging cloud, add the k6-k8s charm – along with its integrations to any load-bearing applications in your deployment – to your declarative Terraform plan and run terraform apply. Then execute juju run k6-k8s start to simulate large numbers of virtual users exercising your APIs.
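For readers new to the concept of virtual users, the sketch below shows the idea in plain Python. It is not a k6 script and does not use the k6-k8s charm: a thread pool plays the users, and `handle_request` is a hypothetical stand-in for the API under test, where a real test would send HTTP requests.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request():
    """Stand-in for the API under test; a real test would send HTTP requests."""
    time.sleep(0.002)
    return 200


def virtual_user(iterations):
    """One simulated user: call the endpoint repeatedly, recording latency."""
    results = []
    for _ in range(iterations):
        started = time.perf_counter()
        status = handle_request()
        results.append((status, time.perf_counter() - started))
    return results


def run_load(users, iterations):
    """Run `users` virtual users concurrently and flatten their samples."""
    with ThreadPoolExecutor(max_workers=users) as pool:
        per_user = list(pool.map(virtual_user, [iterations] * users))
    return [sample for results in per_user for sample in results]


def percentile(samples, fraction):
    """Simple rank-based percentile over a non-empty list of numbers."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(fraction * len(ordered)))
    return ordered[index]


samples = run_load(users=20, iterations=5)
latencies = [latency for _, latency in samples]
error_rate = sum(1 for status, _ in samples if status != 200) / len(samples)
p95 = percentile(latencies, 0.95)
print(f"requests={len(samples)} error_rate={error_rate:.2%} p95={p95 * 1000:.1f}ms")
```

Error rate and tail latency are the kind of numbers a load test is usually judged on. A dedicated tool adds ramp-up profiles, thresholds, and reporting on top of this basic loop.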
Chaos testing with LitmusChaos on Juju
Last year, Canonical charmed the Litmus control plane and established it as the foundation of our Canonical Chaos Engineering Platform. By adding the litmus-operators module to your Terraform plan, you can provision a control plane that is immediately usable on any Kubernetes-based cloud.
This enables you to launch the Chaoscenter and begin defining experiments and probes to introduce controlled failures into your infrastructure. With COS running alongside and gathering telemetry, you gain precise visibility into what breaks, how it breaks, and whether the alerts you expect to receive are actually being triggered.
Add k6 to the mix and you have a comprehensive validation scenario: network packet loss, pod failures, and dependent-service degradation occurring simultaneously while a substantial volume of virtual users exercises your platform. Throughout this process, you can systematically evaluate:
- How does the system behave under combined stress?
- How does the observability stack respond?
- Are any signals missing?
- Were unnecessary alerts generated?
- Were expected alerts absent?
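The last two questions reduce to a set comparison between the alerts you expected an experiment to trigger and the alerts that actually fired. A minimal sketch follows, with hypothetical alert names and a hypothetical `evaluate_alerts` helper. How you collect the fired alerts depends on your alerting setup and is not shown.

```python
def evaluate_alerts(expected, fired):
    """Compare the alerts an experiment should trigger with those that fired."""
    expected, fired = set(expected), set(fired)
    return {
        "confirmed": sorted(expected & fired),   # expected and received
        "missing": sorted(expected - fired),     # expected but absent
        "unexpected": sorted(fired - expected),  # noise, or a real surprise
    }


# Hypothetical alert names for a pod-kill experiment under load.
expected = {"PodCrashLooping", "HighErrorRate"}
fired = {"HighErrorRate", "DiskAlmostFull"}

report = evaluate_alerts(expected, fired)
for outcome, names in report.items():
    print(f"{outcome}: {names}")
```

An empty "missing" list is the pass condition for the observability stack. Anything under "unexpected" deserves a look too, since it is either alert noise or a failure mode you did not anticipate.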
If this test passes and your observability stack holds up, you can proceed with production deployments with significantly greater confidence.
Chaos testing in Juju: 26.04
At Canonical, our development cycles align with Ubuntu releases. In the 26.04 cycle, we introduced a new charm to our Litmus collection: litmus-infrastructure-k8s. This allows Juju users to manage declaratively not only the Litmus control plane deployment and operation, but also the provisioning of Chaos Infrastructure. Adding this charm to your Terraform plan, in the same Juju model and Kubernetes namespace as the system under test, and integrating it with the Chaoscenter charm produces a fully software-defined Chaos Infrastructure ready to execute experiments.
The future of chaos testing in Juju
Several directions for future development are under consideration. As this capability is newly released, user feedback will be an important input in shaping the roadmap.
One area of interest is a Juju fault injector that would enable experiments operating at the Juju level rather than at the cloud substrate level; for example, adding or removing relations, scaling applications up or down, running actions, or modifying configuration options. This would enable a higher-order form of chaos engineering specifically suited to validating Juju-based products.
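No such injector exists yet, so the following is purely a thought experiment. It shows how an experiment might be drawn from the Juju-level operations listed above. `Fault`, `FAULT_KINDS`, and `draw_faults` are hypothetical names, and the sketch only builds a plan; it does not call Juju.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    """One Juju-level disturbance aimed at a single application."""
    kind: str
    target: str


# The operations named in the text, as hypothetical fault kinds.
FAULT_KINDS = (
    "add-relation",
    "remove-relation",
    "scale-up",
    "scale-down",
    "run-action",
    "change-config",
)


def draw_faults(apps, count, seed):
    """Pick `count` random faults; the seed makes an experiment repeatable."""
    rng = random.Random(seed)
    return [Fault(rng.choice(FAULT_KINDS), rng.choice(apps)) for _ in range(count)]


experiment = draw_faults(apps=["tempo", "ceph", "k6"], count=4, seed=2604)
for fault in experiment:
    print(f"{fault.kind} -> {fault.target}")
```

Seeding the random generator matters more than it may seem: a chaos experiment that cannot be replayed is hard to turn into a regression test.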
A second direction involves attaching chaos experiment definitions directly to charms. This would allow operators to define declaratively not just the deployment, observability, and testing stack topology, but also the chaos experiment definitions themselves. In practice, this would mean teaching a charm to tell the Litmus infrastructure how it should be chaos-tested, and shipping built-in, opinionated fault and probe definitions that let cloud administrators quickly start validating their deployments with Litmus.
Acknowledgements
We would like to thank the Litmus team for developing an excellent open-source Chaos Engineering tool, and the CNCF mentorship programme for contributing observability instrumentation for Litmus itself.