Improving Ubuntu Infrastructure Status Reporting

Hey everyone,

Recently, we’ve experienced a few infrastructure reliability issues that impacted various Ubuntu services. We know firsthand how frustrating it is when the tools you need to do distro work are unresponsive. What makes these situations even more annoying is our current lack of clear, public-facing reporting and dashboards. Without a centralized way to surface the state of our infra, it is hard to know what is broken, what is degraded, and what is currently being worked on.

This recent friction highlighted a clear technical debt in our observability story, prompting us to take a step back and look at how we can improve the situation—both in the short term and the long term.

Our First Step: Ubuntu Engineering Upptime

As an immediate, short-term fix to improve transparency, we looked at available options and decided to set up a GitHub “Upptime” project for our services.

You can view the repository and the generated status page here:
canonical/ubuntu-engineering-upptime

What is Upptime?

Upptime is an open-source uptime monitor and status page powered entirely by GitHub. It uses GitHub Actions to run scheduled synthetic checks against our endpoints, GitHub Issues for incident reporting, and GitHub Pages to host a live status dashboard.

Currently, this provides us with a clean, public overview of whether our main services are reachable, alongside historical response times and uptime percentages.

It also creates issues when services are down (and auto-resolves them once the service is restored).
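For anyone curious what a "synthetic check" boils down to, here is a minimal Python sketch of the idea. It is not Upptime's actual implementation. The names `CheckResult`, `http_status` and `check_endpoint` are made up for illustration, and treating any 2xx/3xx status as "up" is an assumption of this sketch.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    url: str
    up: bool
    status: Optional[int]  # None when no HTTP response was received
    response_ms: float


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises 4xx/5xx responses as exceptions; keep the status code.
        return err.code


def check_endpoint(url: str, fetch: Callable[[str], int] = http_status) -> CheckResult:
    """Probe one endpoint from the outside and time the request."""
    start = time.monotonic()
    try:
        status: Optional[int] = fetch(url)
    except OSError:
        # DNS failures, refused connections and timeouts end up here.
        status = None
    elapsed_ms = (time.monotonic() - start) * 1000
    # Assumption of this sketch: any 2xx or 3xx status counts as "up".
    up = status is not None and 200 <= status < 400
    return CheckResult(url, up, status, elapsed_ms)
```

A scheduled job would call `check_endpoint` for each monitored URL and record the result. The `fetch` parameter exists only so the logic can be exercised without network access.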

The Limitations

While Upptime is a great first step, it is strictly a “black-box” monitoring tool. It is excellent at telling us if a web frontend returns an HTTP 200 OK, but it falls short of giving us the complete picture.

For instance, Upptime doesn’t tell us anything about the state of our backends. A service’s frontend might be up and routing traffic, but the backend worker processes could be failing to update reports.

Going forward: Juju Charms and COS

To address the missing backend statuses, we are leaning into the work done in recent Ubuntu cycles to transition and encapsulate our legacy services into Juju Charms.

By charming our services, we aren’t just improving deployment; we are creating a standardized platform for managing the entire operational lifecycle of these applications. Currently, our charms are doing a great job at running the services, but they lack built-in monitoring and reporting. Adding better observability into these charms is something we plan to work on during the upcoming cycles.

To achieve this, we plan to integrate our charmed services with the Canonical Observability Stack (COS).

How COS Delivers Value

COS bundles standard monitoring tools (like Prometheus and Grafana) to give us a look inside our applications, rather than just pinging them from the outside.

By integrating our backend infrastructure with COS, we can achieve two major improvements:

  • Actionable Alerting: We can set up intelligent, automated alerts for when a backend process silently fails or a job queue gets stuck.
  • Surfacing Meaningful Metrics: Instead of relying on a simple “green/red” status, we will be able to surface actual operational data. For example, we can provide dashboards showing exactly when a specific report was last successfully updated, or the current status of the backend.
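To make the "last successfully updated" idea concrete, here is a rough Python sketch of how a backend could expose a freshness metric in the Prometheus text exposition format and how staleness could be derived from it. The metric name `report_last_success_timestamp_seconds`, the `report` label and both function names are hypothetical; none of them come from our charms or from COS.

```python
from typing import Dict, List

# Hypothetical metric name, chosen for this sketch only.
METRIC = "report_last_success_timestamp_seconds"


def render_last_success(report_times: Dict[str, float]) -> str:
    """Render one gauge sample per report in Prometheus text format."""
    lines = [
        f"# HELP {METRIC} Unix time of the last successful report update.",
        f"# TYPE {METRIC} gauge",
    ]
    for name, timestamp in sorted(report_times.items()):
        lines.append(f'{METRIC}{{report="{name}"}} {timestamp}')
    return "\n".join(lines) + "\n"


def stale_reports(
    report_times: Dict[str, float], now: float, max_age_seconds: float
) -> List[str]:
    """Return the names of reports whose last success is older than max_age_seconds."""
    return sorted(
        name
        for name, timestamp in report_times.items()
        if now - timestamp > max_age_seconds
    )
```

With a metric like this scraped by Prometheus, an alert could fire on an expression along the lines of `time() - report_last_success_timestamp_seconds > 3600`, which catches a worker that has silently stopped even while the frontend still returns 200.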

This transition will take some time, but it will fundamentally shift our infrastructure from being reactive to proactive, providing a much more robust foundation for the distro.

Feedback

Please take a look at the Ubuntu Engineering Upptime page and let us know if any services are currently missing from the list.

We also welcome any questions you might have about the current status or the plan for future cycles.


As someone who quite often uses packages.ubuntu.com, I want to point out that failures may go unnoticed between the scheduled checks. The site currently shows an average response time of more than six long seconds, and requests may well end in error 503 (service unavailable); I think I’ve seen 502 (bad gateway) too. I have the feeling that it’s experiencing a permanent brownout, quite probably due to “Ai” scraper bots. Just saying that the picture may be very incomplete.

And I cannot help but implore you, Canonical, to reevaluate your position on employing such “Ai” tools, for they, as a phenomenon in general, are by my reckoning the very cause of these issues, which are akin to soft DDoS attacks. By using LLMs (large language models) you are inadvertently part of the problem, because that in turn creates demand for ever more training data and an ever higher frequency of acquiring it. If nothing is done about that, the internet may well lose the arms race that is already ensuing. Just look at the outrageous prices of RAM, and of silicon more generally. That’s all because of high demand from the LLM industry, so they clearly have the upper hand.

And, as if by telepathy, I get a “draft error” from discourse, because “drafts offline” happened while typing these lines. :wink:

As a stopgap you could try something like “Anubis”, as mentioned in that LWN article. packages.debian.org seems to have something like it, but it’s too fast for me to say what, only that it’s something “Fastly”.

This is not a service run by Canonical; it was put in place by a community member who runs it on their own machines. Canonical exclusively uses Launchpad for everything and does not really have a lot of influence on packages.u.c …

(there is a link at the very bottom “Report a Bug” where you should be able to reach the maintainer)


Can we please bring data, not feelings, to topics like this?

packages.ubuntu.com has had those kinds of errors forever - long before LLMs became the enemy du jour, even if they may be a contributor now.

As I understand it, the code for the Ubuntu incarnation of the packages site is here: https://salsa.debian.org/webmaster-team/packages (in the ubuntu-master branch).

Perhaps offer a pull request that adds the kind of interstitial “anti-LLM” thing you’d like to see implemented here.

I don’t know how it’s deployed, or where, sadly. But it’s certainly a greatly appreciated community-contributed service.


Wait a minute, if packages.u.c is not an official part of Ubuntu’s infrastructure, then why is it on that ‘Upptime’ chart, with an official looking Canonical logo and on the official 2nd-level domain, no less?

You don’t see the irony between those lines? :wink:
Where is your data to support the “forever” statement? My data is pretty much inside that LWN article and the ones it links in the opening section. Why would packages.u.c be special in somehow not being affected? Plus, my past experience says you made up that “forever”.

My first-hand experience is that I have been contributing to Ubuntu for around 20 years, and worked at Canonical for 9 years. I used packages.ubuntu.com daily as part of my work and community activities. It was incredibly frustrating, and has been for a decade or more. This was all pre-LLM. Hence, me saying it’s been like this “forever”, where 20 years is about forever in terms of Ubuntu.

Fair enough, I must have hit it at the most opportune times then. I wouldn’t have said anything, if I hadn’t noticed a remarkable difference. In recent months it’s all but unusable.


Since I just clicked said “Report a Bug” button on packages.u.c and got this:

plus the “obligatory” :(

You’ve asked for feedback: I really don’t like all this infantilization going on lately. “Uh oh!” makes me feel like I’m being mocked, adding insult to injury, and it is not befitting a professional outfit like Canonical/Ubuntu. Just get back to more anodyne messaging. Trying desperately to put a “funny” spin on bad news may have the opposite effect. What’s next, “Uh oh!” when a certain vehicle on “autopilot” plowed into a semi-truck? Please, we’re all grownups, and sometimes sugarcoating is ill-advised. Just give it to me straight. At least I’ll know that there are no toddlers involved in fixing the issue, because that’s what “Uh oh!” says to me.

I’m talking about Launchpad, if that wasn’t immediately obvious.

The link goes to the pkg-website bug tracker on Launchpad (Bugs : pkg-website). Just try it there directly (or contact the maintainer through the project overview page).

That was not at all my point, but thanks. Plus, I did click https://bugs.launchpad.net/pkg-website/+filebug

Hmm, then I didn’t get your point. From your post above I thought you wanted to talk to the person owning the page about LLM protection tools; you can reach them through the LP project page or via a bug report …

My last point was about the infantile messaging in that “temporarily unavailable” page I was greeted with. And I mentioned that it was on LP, precisely because that’s as officially Canonical as it gets.

Well, then there is also Bugs : Launchpad itself … not sure if the LP maintainers read along here in the community forum … (you might not see it often but if occasionally a tab crashes in your browser you get a very similar message, which makes me feel like the “infantility” as you call it is kind of a standard in many places nowadays, it does not seem to be limited to LP)

Sorry, still not the point. I wanted to address Canonical and provided feedback on a more general issue, which is the trend towards ever more infantile language in business matters such as error messages. And since this topic is entitled “Improving Ubuntu Infrastructure Status Reporting”…

And yes, Firefox is a shining example of this. Doesn’t mean it’s a good one to follow. :wink: