Index | US057 |
---|---|
Title | cloud-init improve error and warning visibility |
Status | Pending Review |
Authors | @holmanb |
Type | Standard |
Created | 2023-05-03 |
Abstract
This spec aims to provide a better and more unified user interface for introspecting cloud-init errors. This will enable consumers of cloud-init to replace heuristic parsing with a standardized machine-readable interface.
Rationale
Reporting of success or failure of cloud-init is currently limited to a simple pass/fail boolean via `cloud-init status`. This leaves users and other tools blind to the various failure modes that cloud-init has. These modes include:
- Use of deprecated schema keys
- Use of deprecated features
- External commands called by cloud-init that failed
- Non-fatal tracebacks (which are often considered a bug)
- Warnings
- Errors
- Critical failure
This information is currently logged; however, log files are not machine-readable, and the low signal-to-noise ratio of the logs makes it hard to understand and respond to unexpected cloud-init behavior. Users therefore cannot interact with cloud-init’s various states of degradation with nuance, and tooling cannot reliably detect or react to the different failure states. Furthermore, heuristic parsing of logs breaks easily if a user provides a non-default logging configuration or if the logging content or format changes.
Scope
1. Provide better introspection from cloud-init.
Current state: Users can run cloud-init and then check the logs for a traceback. By default, cloud-init doesn’t report a degraded state via the status CLI command.
Future state: Provide rich error information via stable machine-readable command line interface.
2. Provide guidance and assistance to known consumers of cloud-init status information.
Communicate these changes in cloud-init and assist known cloud-init consumers with adopting this interface: the CPC build team, Snap, Juju, Subiquity, MAAS, and others might be able to implement more intelligent error handling with this information.
3. Document best practices for interacting with cloud-init’s new exported error statuses.
Expect to develop these best practices while interacting with cloud-init’s consumers.
Implementation
Cloud-init will collect and persist recoverable errors during system boot.
The command-line status command will produce richer human-readable and machine-readable data, containing fatal and recoverable errors from the most recent boot.
Recoverable errors are defined as messages logged at or above a WARNING level.
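The collection rule above can be illustrated with a standard Python logging handler. This is only a minimal sketch, not cloud-init's implementation: `RecoverableErrorHandler` is a hypothetical name, and the numeric value 35 for the DEPRECATED level is an assumption chosen so that it sits between WARNING and ERROR.

```python
import logging
from collections import defaultdict

# Assumption for this sketch: DEPRECATED sits between WARNING (30) and ERROR (40).
DEPRECATED = 35
logging.addLevelName(DEPRECATED, "DEPRECATED")


class RecoverableErrorHandler(logging.Handler):
    """Hypothetical handler that keeps messages logged at WARNING or above."""

    def __init__(self):
        # The handler level filters out anything below WARNING.
        super().__init__(level=logging.WARNING)
        self.recoverable_errors = defaultdict(list)

    def emit(self, record):
        # Group each message under the name of the level it was logged at.
        self.recoverable_errors[record.levelname].append(record.getMessage())


log = logging.getLogger("recoverable-demo")
log.setLevel(logging.DEBUG)
log.propagate = False
handler = RecoverableErrorHandler()
log.addHandler(handler)

log.info("routine message, not collected")
log.warning("No template found for %s", "sources.list")
log.log(DEPRECATED, "Key 'ca-certs' is deprecated")

collected = dict(handler.recoverable_errors)
print(collected)
```

Grouping by level name at collection time yields the same shape as the 'recoverable_errors' mapping in the proposed output.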
Current state: recoverable errors vs non-recoverable errors
critical failure - If cloud-init is unable to complete, the service returns with exit code 1, and error messages are visible in the log files and in the output of `cloud-init status --format json` under the top-level 'errors' key.
recoverable failure - In the case that cloud-init is able to complete yet something goes awry, the service returns with exit code 0 and messages are visible in the log files.
Future state: recoverable errors vs non-recoverable errors
critical failure - If cloud-init is unable to complete, error messages will now additionally be visible in the output of `cloud-init status --format json` within the 'errors' key nested under the module-level keys: 'init-local', 'init', 'modules-config', 'modules-final'.
recoverable failure - In the case that cloud-init is able to complete yet something goes awry, the service will now return with exit code 2, and error messages will be visible in the output of `cloud-init status --format json` under the top-level 'recoverable_errors' key as well as within the 'recoverable_errors' key nested under the module-level keys: 'init-local', 'init', 'modules-config', 'modules-final'.
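A consumer could branch on the exit codes described above roughly as follows. This is a minimal sketch: `classify_exit_code` is a hypothetical helper, and treating exit code 0 as a clean run is an assumption inferred from the text rather than something it states outright. How the exit code is obtained (from the service unit or from a CLI invocation) is left to the caller.

```python
# Exit codes from the future-state description; the meaning of 0 is an assumption.
EXIT_MEANINGS = {
    0: "success",
    1: "critical failure",
    2: "recoverable failure",
}


def classify_exit_code(returncode):
    """Map an exit code to a failure class (hypothetical helper)."""
    return EXIT_MEANINGS.get(returncode, "unknown")


for code in (0, 1, 2, 3):
    print(code, classify_exit_code(code))
```

Returning "unknown" for unlisted codes keeps the consumer from silently treating an unexpected value as success.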
Current output
1. Current status
$ cloud-init status
status: done
2. Current verbose status
$ cloud-init status --long
status: done
boot_status_code: enabled-by-generator
last_update: Mon, 09 Oct 2023 20:51:46 +0000
detail:
DataSourceNoCloud [seed=/var/lib/cloud/seed/nocloud-net][dsmode=net]
3. Current Machine Readable status
Note that “_schema_version” and “schemas” keys will be eliminated in upstream cloud-init to avoid unnecessary verbosity of output. If a different meaning for duplicate keys is required, then a v2 can be added.
$ cloud-init status --format json
{
"_schema_version": "1",
"boot_status_code": "enabled-by-generator",
"datasource": "nocloud",
"detail": "DataSourceNoCloud [seed=/var/lib/cloud/seed/nocloud-net][dsmode=net]",
"errors": [],
"last_update": "Mon, 09 Oct 2023 20:51:46 +0000",
"schemas": {
"1": {
"boot_status_code": "enabled-by-generator",
"datasource": "nocloud",
"detail": "DataSourceNoCloud [seed=/var/lib/cloud/seed/nocloud-net][dsmode=net]",
"errors": [],
"last_update": "Mon, 09 Oct 2023 20:51:46 +0000",
"status": "done"
}
},
"status": "done"
}
Proposed output
1. Proposed status
<unchanged>
2. Proposed verbose status
$ cloud-init status --long
status: done
extended_status: degraded done
boot_status_code: enabled-by-generator
last_update: Tue, 10 Oct 2023 18:16:42 +0000
detail:
DataSourceNoCloud [seed=/var/lib/cloud/seed/nocloud-net][dsmode=net]
recoverable_errors:
DEPRECATED:
- Deprecated cloud-config provided: ca-certs: Deprecated in version 22.3. Use ``ca_certs`` instead.
- Key 'ca-certs' is deprecated in 22.1 and scheduled to be removed in 27.1. Use 'ca_certs' instead.
3. Proposed machine readable status
$ cloud-init status --format json
{
"boot_status_code": "enabled-by-generator",
"datasource": "nocloud",
"detail": "DataSourceNoCloud [seed=/var/lib/cloud/seed/nocloud-net][dsmode=net]",
"errors": [],
"extended_status": "degraded done",
"init": {
"errors": [],
"finished": 1698279442.4062886,
"recoverable_errors": {
"WARNING": [
"Failed at merging in cloud config part from part-001: empty cloud config"
]
},
"start": 1698279441.3664
},
"init-local": {
"errors": [],
"finished": 1698279439.4033117,
"recoverable_errors": {},
"start": 1698279438.9879673
},
"last_update": "Thu, 26 Oct 2023 00:17:31 +0000",
"modules-config": {
"errors": [],
"finished": 1698279450.2446203,
"recoverable_errors": {
"WARNING": [
"No template found in /etc/cloud/templates for template named sources.list.ubuntu.deb822",
"No template found in /etc/cloud/templates for template named sources.list",
"No template found, not rendering /etc/apt/sources.list.d/ubuntu.sources"
]
},
"start": 1698279449.9259806
},
"modules-final": {
"errors": [],
"finished": 1698279451.0209844,
"recoverable_errors": {},
"start": 1698279450.8273187
},
"recoverable_errors": {
"WARNING": [
"Failed at merging in cloud config part from part-001: empty cloud config",
"No template found in /etc/cloud/templates for template named sources.list.ubuntu.deb822",
"No template found in /etc/cloud/templates for template named sources.list",
"No template found, not rendering /etc/apt/sources.list.d/ubuntu.sources"
]
},
"stage": null,
"status": "done"
}
Therefore a user wanting to see which recoverable errors occurred can simply run the following; an empty object means that no recoverable errors were recorded:
$ cloud-init status --format json | jq .recoverable_errors
{}
To see the recoverable errors for a specific stage:
$ cloud-init status --format json | jq .init.recoverable_errors
{
"WARNING": [
"Failed at merging in cloud config part from part-001: empty cloud config"
]
}
To see the aggregate recoverable errors from all stages:
$ cloud-init status --format json | jq .recoverable_errors
{
"WARNING": [
"Failed at merging in cloud config part from part-001: empty cloud config",
"No template found in /etc/cloud/templates for template named sources.list.ubuntu.deb822",
"No template found in /etc/cloud/templates for template named sources.list",
"No template found, not rendering /etc/apt/sources.list.d/ubuntu.sources"
]
}
Or to check for errors in a specific boot stage, here on a system whose user data contains the deprecated 'ca-certs' key:
$ cloud-init status --format json | jq .init.recoverable_errors
{
"DEPRECATED": [
"Deprecated cloud-config provided:\nca-certs: Deprecated in version 22.3. Use ``ca_certs`` instead.",
"Key 'ca-certs' is deprecated in 22.1 and scheduled to be removed in 27.1. Use 'ca_certs' instead."
]
}
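The jq queries above can also be expressed in Python for tooling that already parses the JSON. The sketch below uses a trimmed copy of the proposed output (only one of the modules-config warnings is kept); `stage_errors` and `merge_stage_errors` are hypothetical helper names, and merging in stage order is an assumption about how the aggregate key is assembled.

```python
import json

STAGES = ("init-local", "init", "modules-config", "modules-final")

# Trimmed copy of the proposed machine-readable output.
SAMPLE = """
{
  "init": {"errors": [], "recoverable_errors": {"WARNING": [
    "Failed at merging in cloud config part from part-001: empty cloud config"]}},
  "init-local": {"errors": [], "recoverable_errors": {}},
  "modules-config": {"errors": [], "recoverable_errors": {"WARNING": [
    "No template found, not rendering /etc/apt/sources.list.d/ubuntu.sources"]}},
  "modules-final": {"errors": [], "recoverable_errors": {}},
  "recoverable_errors": {"WARNING": [
    "Failed at merging in cloud config part from part-001: empty cloud config",
    "No template found, not rendering /etc/apt/sources.list.d/ubuntu.sources"]},
  "status": "done"
}
"""


def stage_errors(status, stage):
    """Return the recoverable errors recorded for one stage."""
    return status.get(stage, {}).get("recoverable_errors", {})


def merge_stage_errors(status):
    """Combine per-stage recoverable errors, grouped by level, in stage order."""
    merged = {}
    for stage in STAGES:
        for level, messages in stage_errors(status, stage).items():
            merged.setdefault(level, []).extend(messages)
    return merged


status = json.loads(SAMPLE)
init_errors = stage_errors(status, "init")
merged = merge_stage_errors(status)
print(json.dumps(merged, indent=2))
```

Note that in jq itself, stage names containing a hyphen need the bracket form, for example `jq '.["init-local"].recoverable_errors'`.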
Appendix A: States of cloud-init
Definitions of cloud-init extended_status can be found in cloudinit/cmd/status.py::UXAppStatus.
Consumers of cloud-init that want to make use of this output can expect to parse the following states:
"not running"
"running"
"done"
"error"
"degraded done"
"degraded running"
"disabled"
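A consumer might validate and split these values as sketched below. This is a consumer-side illustration, not the UXAppStatus definition itself: `split_extended_status` is a hypothetical helper, and reading the "degraded " prefix as "base state plus recoverable errors" is an assumption based on the proposed output, where status "done" appears alongside extended_status "degraded done".

```python
KNOWN_STATES = frozenset({
    "not running",
    "running",
    "done",
    "error",
    "degraded done",
    "degraded running",
    "disabled",
})

DEGRADED_PREFIX = "degraded "


def split_extended_status(extended_status):
    """Return (base_state, degraded) for a known extended_status value."""
    if extended_status not in KNOWN_STATES:
        raise ValueError(f"unknown extended_status: {extended_status!r}")
    if extended_status.startswith(DEGRADED_PREFIX):
        return extended_status[len(DEGRADED_PREFIX):], True
    return extended_status, False


print(split_extended_status("degraded done"))
print(split_extended_status("running"))
```

Rejecting unknown values makes it obvious when a newer cloud-init release adds a state that the consumer has not yet been taught to handle.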
Appendix B: Classes of recoverable errors
All errors logged at a level of WARNING or higher (including cloud-init’s builtin DEPRECATED log level) will be exported via this interface. These recoverable errors are categorized by the level at which they are logged, which may be one of:
WARNING
DEPRECATED
ERROR
CRITICAL
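Because the errors are keyed by class, a consumer can choose which classes it reacts to. The following sketch assumes a consumer that tolerates deprecations but acts on everything else; `actionable_errors` and its `ignored` parameter are hypothetical names for illustration.

```python
RECOVERABLE_CLASSES = ("WARNING", "DEPRECATED", "ERROR", "CRITICAL")


def actionable_errors(recoverable_errors, ignored=("DEPRECATED",)):
    """Drop ignored classes and empty lists; reject unknown class names."""
    result = {}
    for level, messages in recoverable_errors.items():
        if level not in RECOVERABLE_CLASSES:
            raise ValueError(f"unexpected class: {level!r}")
        if level not in ignored and messages:
            result[level] = list(messages)
    return result


sample = {
    "DEPRECATED": ["Key 'ca-certs' is deprecated in 22.1 and scheduled to be removed in 27.1. Use 'ca_certs' instead."],
    "WARNING": ["No template found, not rendering /etc/apt/sources.list.d/ubuntu.sources"],
}
print(actionable_errors(sample))
```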