[Spec] improve error and warning visibility

Index: US057
Title: cloud-init improve error and warning visibility
Status: Pending Review
Authors: @holmanb
Type: Standard
Created: 2023-05-03

Abstract

This spec aims to provide a better and more unified user interface for introspecting cloud-init errors. This will enable consumers of cloud-init to replace heuristic parsing with a standardized machine-readable interface.

Rationale

Reporting of success or failure of cloud-init is currently limited to a simple pass/fail boolean via cloud-init status. This leaves users and other tools blind to the various failure modes that cloud-init has. These modes include:

  • Use of deprecated schema keys
  • Use of deprecated features
  • External commands called by cloud-init that failed
  • Non-fatal tracebacks (which are often considered a bug)
  • Warnings
  • Errors
  • Critical failure

This information is currently logged. However, log files are not in a machine-readable format, and the low signal-to-noise ratio of logs makes it hard to understand and respond to unexpected cloud-init behavior. Users therefore cannot respond with nuance to cloud-init’s various states of degradation, and tooling cannot reliably interact with or react to the different failure states. Furthermore, heuristic parsing of logs is easily broken if a user provides a non-default logging configuration or if the logging content/format changes.

Scope

1. Provide better introspection from cloud-init.

Current state: Users can run cloud-init and then check the logs for a traceback. By default, cloud-init doesn’t report a degraded state via the CLI status command.

Future state: Provide rich error information via a stable machine-readable command line interface.

2. Provide guidance and assistance to known consumers of cloud-init status information.

Communicate these changes in cloud-init and assist known cloud-init consumers with consumption of this interface: the CPC build team, Snap, Juju, Subiquity, MAAS, and others might be able to implement more intelligent error handling with this information.

3. Document best practices for interacting with cloud-init’s new exported error statuses.

Expect to develop these best practices while interacting with cloud-init’s consumers.

Implementation

Cloud-init will collect and persist recoverable errors during system boot.

The command line status command will produce richer human-readable and machine-readable data, containing fatal and recoverable errors from the most recent boot.

Recoverable errors are defined as messages logged at or above a WARNING level.
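
The collection step described above could be sketched with a standard Python logging handler. This is a minimal illustration and not cloud-init's actual implementation: the class name RecoverableErrorCollector, the logger name "example", and the numeric value 35 chosen for the DEPRECATED level are all assumptions made for the sketch.

```python
import logging
from collections import defaultdict

# Assumption: DEPRECATED sits between WARNING (30) and ERROR (40).
DEPRECATED = 35
logging.addLevelName(DEPRECATED, "DEPRECATED")


class RecoverableErrorCollector(logging.Handler):
    """Hypothetical handler that keeps messages logged at WARNING or above."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.by_level = defaultdict(list)

    def emit(self, record):
        # Group messages by level name, e.g. "WARNING" or "DEPRECATED".
        self.by_level[record.levelname].append(record.getMessage())


log = logging.getLogger("example")
log.setLevel(logging.DEBUG)
log.propagate = False
collector = RecoverableErrorCollector()
log.addHandler(collector)

log.info("below the threshold, so not collected")
log.warning("No template found")
log.log(DEPRECATED, "Key 'ca-certs' is deprecated")

print(dict(collector.by_level))
```

Grouping by level name mirrors the shape of the 'recoverable_errors' object in the proposed JSON output, where each key is a log level and each value is a list of messages.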

Current state: recoverable errors vs non-recoverable errors

critical failure - If cloud-init is unable to complete, the service returns with exit code 1, and error messages are visible in the log files and in the output of cloud-init status --format json under the top level 'errors' key.

recoverable failure - In the case that cloud-init is able to complete yet something goes awry, the service returns with exit code 0 and messages are visible in the log files.

Future state: recoverable errors vs non-recoverable errors

critical failure - If cloud-init is unable to complete, error messages will now additionally be visible in the output of cloud-init status --format json within the 'errors' key nested under the module-level keys: 'init-local', 'init', 'modules-config', 'modules-final'.

recoverable failure - In the case that cloud-init is able to complete yet something goes awry, the service will now return with exit code 2, and error messages will be visible in the output of cloud-init status --format json under the top level 'recoverable_errors' key as well as within the 'recoverable_errors' key nested under the module-level keys: 'init-local', 'init', 'modules-config', 'modules-final'.
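
A consumer could translate the exit codes described above into a coarse outcome before inspecting the JSON. The following is a minimal sketch under that reading of the spec; the function name interpret_exit_code and the outcome labels are hypothetical and are not part of cloud-init.

```python
def interpret_exit_code(returncode: int) -> str:
    """Map the exit codes described in this spec to a coarse outcome label."""
    outcomes = {
        0: "success",              # completed with nothing to report
        1: "critical failure",     # cloud-init was unable to complete
        2: "recoverable failure",  # completed, but something went awry
    }
    # Any other code is not covered by this spec.
    return outcomes.get(returncode, "unknown")


for code in (0, 1, 2, 3):
    print(code, interpret_exit_code(code))
```

Treating unlisted codes as "unknown" rather than as success or failure leaves the decision to the caller, which seems the safer default for tooling that may run against different cloud-init versions.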

Current output

1. Current status

$ cloud-init status
status: done

2. Current verbose status

$ cloud-init status --long
status: done
boot_status_code: enabled-by-generator
last_update: Mon, 09 Oct 2023 20:51:46 +0000
detail:
DataSourceNoCloud [seed=/var/lib/cloud/seed/nocloud-net][dsmode=net]

3. Current Machine Readable status

Note that “_schema_version” and “schemas” keys will be eliminated in upstream cloud-init to avoid unnecessary verbosity of output. If a different meaning for duplicate keys is required, then a v2 can be added.

$ cloud-init status --format json
{
  "_schema_version": "1",
  "boot_status_code": "enabled-by-generator",
  "datasource": "nocloud",
  "detail": "DataSourceNoCloud [seed=/var/lib/cloud/seed/nocloud-net][dsmode=net]",
  "errors": [],
  "last_update": "Mon, 09 Oct 2023 20:51:46 +0000",
  "schemas": {
    "1": {
      "boot_status_code": "enabled-by-generator",
      "datasource": "nocloud",
      "detail": "DataSourceNoCloud [seed=/var/lib/cloud/seed/nocloud-net][dsmode=net]",
      "errors": [],
      "last_update": "Mon, 09 Oct 2023 20:51:46 +0000",
      "status": "done"
    }
  },
  "status": "done"
}

Proposed output

1. Proposed status

<unchanged>

2. Proposed verbose status

$ cloud-init status --long
status: done
extended_status: degraded done
boot_status_code: enabled-by-generator
last_update: Tue, 10 Oct 2023 18:16:42 +0000
detail:
DataSourceNoCloud [seed=/var/lib/cloud/seed/nocloud-net][dsmode=net]
recoverable_errors:
DEPRECATED:
    - Deprecated cloud-config provided: ca-certs:  Deprecated in version 22.3. Use ``ca_certs`` instead.
    - Key 'ca-certs' is deprecated in 22.1 and scheduled to be removed in 27.1. Use 'ca_certs' instead.

3. Proposed machine readable status

$ cloud-init status --format json
{
  "boot_status_code": "enabled-by-generator",
  "datasource": "nocloud",
  "detail": "DataSourceNoCloud [seed=/var/lib/cloud/seed/nocloud-net][dsmode=net]",
  "errors": [],
  "extended_status": "degraded done",
  "init": {
    "errors": [],
    "finished": 1698279442.4062886,
    "recoverable_errors": {
      "WARNING": [
        "Failed at merging in cloud config part from part-001: empty cloud config"
      ]
    },
    "start": 1698279441.3664
  },
  "init-local": {
    "errors": [],
    "finished": 1698279439.4033117,
    "recoverable_errors": {},
    "start": 1698279438.9879673
  },
  "last_update": "Thu, 26 Oct 2023 00:17:31 +0000",
  "modules-config": {
    "errors": [],
    "finished": 1698279450.2446203,
    "recoverable_errors": {
      "WARNING": [
        "No template found in /etc/cloud/templates for template named sources.list.ubuntu.deb822",
        "No template found in /etc/cloud/templates for template named sources.list",
        "No template found, not rendering /etc/apt/sources.list.d/ubuntu.sources"
      ]
    },
    "start": 1698279449.9259806
  },
  "modules-final": {
    "errors": [],
    "finished": 1698279451.0209844,
    "recoverable_errors": {},
    "start": 1698279450.8273187
  },
  "recoverable_errors": {
    "WARNING": [
      "Failed at merging in cloud config part from part-001: empty cloud config",
      "No template found in /etc/cloud/templates for template named sources.list.ubuntu.deb822",
      "No template found in /etc/cloud/templates for template named sources.list",
      "No template found, not rendering /etc/apt/sources.list.d/ubuntu.sources"
    ]
  },
  "stage": null,
  "status": "done"
}
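
In the sample above, the top-level 'recoverable_errors' object looks like the per-stage objects merged in boot order. The sketch below reproduces that merge on a trimmed copy of the sample. The function name aggregate_recoverable_errors and the stage ordering are assumptions drawn from the sample; the spec does not state how cloud-init computes the aggregate.

```python
import json

# Assumed boot order of the stage keys shown in the proposed output.
STAGES = ("init-local", "init", "modules-config", "modules-final")


def aggregate_recoverable_errors(status: dict) -> dict:
    """Merge per-stage recoverable errors into one level -> messages mapping."""
    merged = {}
    for stage in STAGES:
        stage_errors = status.get(stage, {}).get("recoverable_errors", {})
        for level, messages in stage_errors.items():
            merged.setdefault(level, []).extend(messages)
    return merged


# Trimmed copy of the proposed output, keeping only the keys used here.
sample = json.loads("""
{
  "init": {"recoverable_errors": {"WARNING": [
    "Failed at merging in cloud config part from part-001: empty cloud config"]}},
  "init-local": {"recoverable_errors": {}},
  "modules-config": {"recoverable_errors": {"WARNING": [
    "No template found in /etc/cloud/templates for template named sources.list"]}},
  "modules-final": {"recoverable_errors": {}},
  "recoverable_errors": {"WARNING": [
    "Failed at merging in cloud config part from part-001: empty cloud config",
    "No template found in /etc/cloud/templates for template named sources.list"]}
}
""")

print(aggregate_recoverable_errors(sample) == sample["recoverable_errors"])
```

A check like this can serve consumers as a sanity test that the per-stage and top-level views agree before they rely on only one of them.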

A user wanting to see which recoverable errors occurred can simply run the following; an empty object means that none were recorded:

$ cloud-init status --format json | jq .recoverable_errors
{}

To see the recoverable errors for a specific stage:

$ cloud-init status --format json | jq .init.recoverable_errors
{
  "WARNING": [
    "Failed at merging in cloud config part from part-001: empty cloud config"
  ]
}

To see the aggregate recoverable errors from all stages:

$ cloud-init status --format json | jq .recoverable_errors
{
  "WARNING": [
    "Failed at merging in cloud config part from part-001: empty cloud config",
    "No template found in /etc/cloud/templates for template named sources.list.ubuntu.deb822",
    "No template found in /etc/cloud/templates for template named sources.list",
    "No template found, not rendering /etc/apt/sources.list.d/ubuntu.sources"
  ]
}

Or to check for errors in a specific boot stage, here showing deprecation messages:

$ cloud-init status --format json | jq .init.recoverable_errors
{
  "DEPRECATED": [
    "Deprecated cloud-config provided:\nca-certs:  Deprecated in version 22.3. Use ``ca_certs`` instead.",
    "Key 'ca-certs' is deprecated in 22.1 and scheduled to be removed in 27.1. Use 'ca_certs' instead."
  ]
}
}
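
Tools that do not want to depend on jq could perform the same lookups in Python. The sketch below is a rough equivalent of the jq queries above; the function name recoverable_errors is hypothetical, and the sample string is hand-written to match the shape of the proposed output rather than captured from a real system.

```python
import json
from typing import Optional


def recoverable_errors(status_json: str, stage: Optional[str] = None) -> dict:
    """Return recoverable errors for one stage, or the top-level aggregate."""
    status = json.loads(status_json)
    # Mirrors `jq .recoverable_errors` and `jq .<stage>.recoverable_errors`.
    section = status if stage is None else status.get(stage, {})
    return section.get("recoverable_errors", {})


MESSAGE = (
    "Key 'ca-certs' is deprecated in 22.1 and scheduled to be removed "
    "in 27.1. Use 'ca_certs' instead."
)

# Hand-written sample shaped like the proposed `--format json` output.
SAMPLE = json.dumps({
    "init": {"recoverable_errors": {"DEPRECATED": [MESSAGE]}},
    "modules-final": {"recoverable_errors": {}},
    "recoverable_errors": {"DEPRECATED": [MESSAGE]},
    "status": "done",
})

print(recoverable_errors(SAMPLE))                         # aggregate view
print(recoverable_errors(SAMPLE, stage="init"))           # one stage
print(recoverable_errors(SAMPLE, stage="modules-final"))  # empty object
```

In practice the JSON string would come from running cloud-init status --format json, for example through subprocess.run with captured output. A missing stage key returns an empty dict here instead of raising, which keeps callers tolerant of output that omits a stage.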

Appendix A: States of cloud-init

Definitions of cloud-init extended_status can be found in cloudinit/cmd/status.py::UXAppStatus.
Consumers of cloud-init that want to make use of this output can expect to parse the following states:

"not running"
"running"
"done"
"error"
"degraded done"
"degraded running"
"disabled"
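
A consumer could model these states as an enumeration so that unexpected strings fail loudly. The sketch below mirrors only the strings listed above. The names ExtendedStatus and is_degraded are hypothetical and are not taken from cloudinit/cmd/status.py, whose UXAppStatus definition may differ.

```python
from enum import Enum


class ExtendedStatus(Enum):
    """Hypothetical consumer-side mirror of the states listed above."""

    NOT_RUNNING = "not running"
    RUNNING = "running"
    DONE = "done"
    ERROR = "error"
    DEGRADED_DONE = "degraded done"
    DEGRADED_RUNNING = "degraded running"
    DISABLED = "disabled"


def is_degraded(extended_status: str) -> bool:
    """Return True for the two degraded states; raise ValueError if unknown."""
    state = ExtendedStatus(extended_status)
    return state in (ExtendedStatus.DEGRADED_DONE, ExtendedStatus.DEGRADED_RUNNING)


print(is_degraded("degraded done"))
print(is_degraded("done"))
```

Looking a value up with ExtendedStatus(...) raises ValueError for strings outside the list, so a future state added by cloud-init would surface as an explicit error, not be silently misclassified.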

Appendix B: Classes of recoverable errors

All messages logged at a level of WARNING or higher (including cloud-init’s builtin DEPRECATED log level) will be exported via this interface. These recoverable errors are categorized by the level at which they are logged, which may be one of:

WARNING
DEPRECATED
ERROR
CRITICAL
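
Consumers that want a single summary value could pick the most severe category present. The sketch below assumes that the order in which the categories are listed above is ascending severity, which the spec does not state explicitly; the name most_severe is hypothetical.

```python
from typing import Optional

# Assumption: the listing order above is ascending severity.
SEVERITY_ORDER = ("WARNING", "DEPRECATED", "ERROR", "CRITICAL")


def most_severe(recoverable_errors: dict) -> Optional[str]:
    """Return the highest-severity category that has at least one message."""
    present = [level for level in SEVERITY_ORDER if recoverable_errors.get(level)]
    return present[-1] if present else None


print(most_severe({"WARNING": ["w"], "DEPRECATED": ["d"]}))
print(most_severe({}))
```

Categories with an empty message list are ignored, so an object such as {"ERROR": []} is treated the same as an empty object.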