API metrics for disaster recovery

pedro-rib · September 9, 2024, 12:37pm

Project	LXD
Title	API metrics for disaster recovery
Status	Completed
Author(s)	@pedro-rib
Approver(s)	@tomp
Type	Implementation
Internal ID	LX083

Abstract

This specification proposes API metrics for failure detection on LXD servers, as part of a disaster recovery framework in MicroCloud. The goal is to define metrics that are useful for detecting failures on each LXD server in a MicroCloud. These metrics would be exposed through /1.0/metrics on each LXD server, alongside the existing metrics, and consumed by the Canonical Observability Stack (COS).

Rationale

MicroCloud, and LXD in particular, already provide many tools for disaster recovery, such as instance and volume backups, snapshots and LXD’s disaster recovery tool. Although these features help with recovering after a failure, they do not address the problem of efficiently detecting the failure itself.

For this reason, this specification proposes API metrics that convey real-time information about each server’s activity through API request rates. Users can set appropriate thresholds on these metrics and configure automated actions and alerts using COS or any third-party solution that consumes them.

Specification

Overview

These metrics are designed to be scraped by Prometheus (which is included in COS) at regular intervals, giving an overview of the behavior of each LXD server in a MicroCloud over time.

The new metrics would be included as part of LXD’s internal metrics, accessible through GET /1.0/metrics on each node in a MicroCloud. Each metric value is therefore per node, which allows failures to be identified on any individual node.

Lastly, the API rate metrics cover all endpoints of the main REST API. Requests against the LXD server that use an invalid URL are also counted.

The docs relevant to these metrics can be found here.

Proposed metrics

All proposed metrics include a label named entity_type that indicates the main type of resource each endpoint operates on. In this first version of the API metrics, the label takes the following values:

instance for /1.0/containers and /1.0/instances endpoints.

network for /1.0/network-zones, /1.0/network-allocations, /1.0/network-acls and /1.0/networks endpoints.

storage_pool for /1.0/storage-pools and /1.0/storage-volumes endpoints.

identity for /1.0/auth and /1.0/certificates endpoints.

image for /1.0/images endpoints.

cluster_member for /1.0/cluster endpoints.

project for /{version}/projects endpoints.

profile for /{version}/profiles endpoints.

warning for /{version}/warnings endpoints.

operation for /{version}/operations endpoints.

server for /{version}/events, /{version}/metrics, /{version} and /{version}/resources endpoints. This is also the default value when a URL does not match any other type, as is expected for requests using invalid URLs.

With this information, the metric values can be filtered by resource type, giving the user a clearer picture of which failures may be happening on the LXD server.
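The mapping above can be sketched as a simple lookup on the second path segment. This is only an illustration of the pattern, not LXD's implementation: the ENTITY_TYPES table and the entity_type_for function are hypothetical names, and matching on the second segment alone is an assumption.

```python
# Hypothetical lookup table built from the list above; not LXD's actual code.
ENTITY_TYPES = {
    "containers": "instance",
    "instances": "instance",
    "network-zones": "network",
    "network-allocations": "network",
    "network-acls": "network",
    "networks": "network",
    "storage-pools": "storage_pool",
    "storage-volumes": "storage_pool",
    "auth": "identity",
    "certificates": "identity",
    "images": "image",
    "cluster": "cluster_member",
    "projects": "project",
    "profiles": "profile",
    "warnings": "warning",
    "operations": "operation",
    "events": "server",
    "metrics": "server",
    "resources": "server",
}


def entity_type_for(path):
    """Return the entity_type label for a request path (illustrative only)."""
    # Drop any query string, then split "/1.0/instances/c1" into segments.
    segments = [s for s in path.split("?", 1)[0].split("/") if s]
    # segments[0] is the API version; segments[1] names the resource family.
    if len(segments) >= 2:
        return ENTITY_TYPES.get(segments[1], "server")
    # "/1.0" itself and anything unrecognizable fall back to "server".
    return "server"
```

With this sketch, entity_type_for("/1.0/instances/c1/state") yields instance, while an invalid URL such as "/not/a/real/url" falls back to server.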

Additionally, for the rest of this specification, a request is considered completed if either:

The request handler finished without spawning any asynchronous operation.
All the operations spawned while handling a request are done and the request handler finished.

The metrics that will be included are:

1. Total completed requests: A counter type metric named lxd_api_requests_completed_total. In addition to entity_type, it carries a result label, described below. Represents the number of requests completed by an LXD server since it was started. The value starts at 0 and is incremented by 1 for each completed request on the LXD API, regardless of the status of the response.

This is computed using an in-memory variable and therefore resets to 0 each time the server restarts. Prometheus can handle this reset: functions such as rate() and increase() account for counter resets, so the computed totals keep growing from the last known value before the restart. The drop in the metric’s value caused by an unexpected restart can also be detected by the observability tool and interpreted as a failure.

This metric also includes one additional label named result. This label reflects the status code of the response and can have the following values:
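To illustrate how a consumer can tolerate and detect these resets, here is a minimal sketch that works on a list of scraped counter values. It is similar in spirit to the PromQL increase() and resets() functions but much simpler (no extrapolation, no timestamps); the function names and sample values are made up for the example.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total growth of a counter across consecutive scrapes, tolerating resets."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr < prev:
            # The counter went backwards, so the server restarted and the
            # counter began again from 0: everything in curr is new growth.
            total += curr
        else:
            total += curr - prev
    return total


def counter_resets(samples: Sequence[float]) -> int:
    """Number of times the counter decreased between consecutive scrapes."""
    return sum(1 for prev, curr in zip(samples, samples[1:]) if curr < prev)


# Made-up scrape values: the server restarted between the 2nd and 3rd scrape.
scrapes = [10, 15, 3, 8]
print(counter_increase(scrapes))  # 13.0
print(counter_resets(scrapes))    # 1
```

A non-zero reset count over a window in which no restart was planned is the kind of signal an alert could be built on.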

client_error, for completed requests that responded with a 4xx status code, signaling that the request resulted in a client error. This makes it easy to filter such requests out when setting thresholds for server errors, or to identify failures on the client side;
server_error, for requests that responded with a 5xx status code or that spawned an asynchronous operation that failed, signaling that handling the request resulted in a server error. An operation is considered failed if it ends with the operation status code 400 (Failure);
succeeded, for all remaining requests.
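The classification above could look roughly like the following sketch. The function name, its parameters, and the choice to check for server errors first are assumptions made for illustration; the spec does not define the precedence between a 4xx response and a failed operation.

```python
# Operation status code that the spec treats as a failed operation.
OPERATION_FAILURE = 400


def classify_result(http_status, operation_statuses=()):
    """Return the result label for a completed request (illustrative only).

    http_status is the HTTP status code of the response; operation_statuses
    holds the final status codes of any operations the request spawned.
    """
    # Assumption: a failed operation or a 5xx response takes precedence.
    if http_status >= 500 or OPERATION_FAILURE in operation_statuses:
        return "server_error"
    if 400 <= http_status < 500:
        return "client_error"
    return "succeeded"
```

For example, a request answered with 202 whose operation later ends with status 400 is counted as server_error, while a plain 404 response is counted as client_error.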

2. Ongoing requests: A gauge type metric named lxd_api_requests_ongoing with no additional labels aside from the previously mentioned entity_type. Represents the number of ongoing requests on an LXD server at a given point in time. It is also computed using an in-memory variable, incremented by 1 each time a request starts being handled and decremented by 1 each time a request is completed, regardless of the status code of the response and the outcome of any spawned operations. This metric is useful for spotting a higher than expected number of pending requests, which may indicate long response times or an overload on a particular LXD server.

Although there is currently an lxd_operations metric that indicates ongoing operations, not every request spawns an operation, and some failures can happen before the operation related to a request is created. It is therefore important to also include a metric for monitoring ongoing requests.
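The increment and decrement behavior of the gauge can be sketched with a lock-protected in-memory counter per entity_type. The OngoingRequests class and its track method are hypothetical names; LXD itself is written in Go, so this only illustrates the bookkeeping, including the decrement on failure.

```python
import threading
from collections import Counter
from contextlib import contextmanager


class OngoingRequests:
    """In-memory gauge of ongoing requests, keyed by entity_type (sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter()

    @contextmanager
    def track(self, entity_type):
        # Increment when the request starts being handled.
        with self._lock:
            self._counts[entity_type] += 1
        try:
            yield
        finally:
            # Decrement when the request completes, whatever the outcome.
            with self._lock:
                self._counts[entity_type] -= 1

    def value(self, entity_type):
        with self._lock:
            return self._counts[entity_type]


gauge = OngoingRequests()
with gauge.track("instance"):
    print(gauge.value("instance"))  # 1 while the request is being handled
print(gauge.value("instance"))      # 0 once it has completed
```

Using try/finally means the gauge returns to its previous value even if the handler raises, which matches the requirement that the decrement is independent of the request's outcome.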

API changes

To include the new metrics in the LXD API, a metrics_api_requests API extension will be introduced. This extension changes GET /1.0/metrics, adding the new metrics at the beginning of the endpoint’s response in the following format:

# HELP lxd_api_requests_completed_total The total number of completed requests.
# TYPE lxd_api_requests_completed_total counter
lxd_api_requests_completed_total{result="succeeded",entity_type="image"} NUMBER_OF_SUCCEEDED_REQS_ON_IMAGES
lxd_api_requests_completed_total{result="succeeded",entity_type="instance"} NUMBER_OF_SUCCEEDED_REQS_ON_INSTANCES
lxd_api_requests_completed_total{result="server_error",entity_type="storage_pool"} NUMBER_OF_SERVER_ERROR_REQS_ON_STORAGE_POOLS
…

# HELP lxd_api_requests_ongoing The number of ongoing requests on LXD REST API.
# TYPE lxd_api_requests_ongoing gauge
lxd_api_requests_ongoing{entity_type="image"} NUMBER_OF_ONGOING_REQS_ON_IMAGES
...
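As a rough illustration of how a consumer might use this output, the sketch below parses sample lines in the text format shown above and computes the share of completed requests that ended in a server error. It assumes the entity_type and result label names from the Proposed metrics section, uses made-up numeric values, and relies on a deliberately simplified parser (no escaped quotes in label values, no timestamps). In practice Prometheus does the scraping and the ratio would be expressed as a query or alert rule.

```python
import re

# Simplified patterns for 'name{label="value",...} number' sample lines.
SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')

# Made-up example values, not real measurements.
EXAMPLE = """\
# HELP lxd_api_requests_completed_total The total number of completed requests.
# TYPE lxd_api_requests_completed_total counter
lxd_api_requests_completed_total{result="succeeded",entity_type="image"} 8
lxd_api_requests_completed_total{result="succeeded",entity_type="instance"} 90
lxd_api_requests_completed_total{result="server_error",entity_type="storage_pool"} 2
# TYPE lxd_api_requests_ongoing gauge
lxd_api_requests_ongoing{entity_type="image"} 1
"""


def parse_samples(text):
    """Return a list of (metric name, labels dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE.match(line)
        if match is None:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def server_error_ratio(samples):
    """Fraction of completed requests whose result label is server_error."""
    total = errors = 0.0
    for name, labels, value in samples:
        if name != "lxd_api_requests_completed_total":
            continue
        total += value
        if labels.get("result") == "server_error":
            errors += value
    return errors / total if total else 0.0


print(server_error_ratio(parse_samples(EXAMPLE)))  # 0.02
```

A threshold on this ratio, evaluated per node, is one way the thresholds mentioned in the Rationale section could be applied.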