[Spec] Definition of an "online" system

Title: Definition of network-online.target
Internal ID: FO020
Status: Pending Review
Author: @slyon

Abstract

With this specification we’re trying to formalize the definition of “online” in the context of systemd’s network-online.target, so we can implement a common behavior across the Distro and decide about the online state of given services based on this definition.

Rationale

Certain applications require an “online” network connection before they can be started. They achieve this by ordering their systemd service After=network-online.target and pulling the target in via a Wants=network-online.target dependency. This is generally discouraged, as any application should try to handle the (changing) network conditions dynamically, but it is unavoidable in some cases.

Pulling network-online.target into the boot transaction can lead to a delayed boot sequence and to confusion due to the wide variety of definitions of an “online”/“up”/“up-and-running” network (Link-layer up, IPv4/6 assigned, global route available, DHCP responded, DNS resolved, LAN reachable, WAN/internet reachable, …).

In cloud images network-online.target is pulled in automatically, as cloud-config.service and cloud-final.service define a Wants=network-online.target dependency.

As an example, the nfs-server.service (src:nfs-utils) needs the network to be “online” and DNS set up for name resolution, in order to mount the user’s NFS share, defined by hostname (LP: #1918141).

Specification

Status quo

Currently, network-online.target depends on systemd-networkd-wait-online.service for networkd status and/or NetworkManager-wait-online.service for NetworkManager status, respectively. Usage of networkd and NetworkManager at the same time is discouraged, as there are slight differences (like differing definitions of “online”) which can lead to confusion (e.g. LP: #19516).

In case of networkd, “/lib/systemd/systemd-networkd-wait-online” (using default parameters) defines the “network-online” logic, while “/usr/bin/nm-online -s” defines that logic in case NetworkManager is in use.

systemd-networkd-wait-online’s logic [systemd-networkd-wait-online (8)]:

  • By default, it will wait for all links it is aware of, which are managed by systemd-networkd and are configured as “RequiredForOnline=yes” (the default) to be fully configured or failed, and for at least one link to be online. Here, “online” means that the link’s operational state is equal to or higher than “degraded” (i.e. has a link-local IP). By default the loopback interface is ignored.
  • The default timeout is 120 seconds. Once it is hit, an error “Failed to start Wait for Network to be Configured” is logged and “systemd-networkd-wait-online.service” is marked as failed; booting continues (after the delay) without the networking being “online”.
  • The operational status is one of the following:
    • Missing: the device is missing
    • Off: the device is powered down
    • No-carrier: the device is powered up, but it does not yet have a carrier
    • Dormant: the device has a carrier, but is not yet ready for normal traffic
    • Degraded-carrier: for bond or bridge master, one of the bonding or bridge slave network interfaces is in off, no-carrier, or dormant state
    • Carrier: the link has a carrier, or for bond or bridge master, all bonding or bridge slave network interfaces are enslaved to the master
    • Degraded: the link has carrier and addresses valid on the local link configured
    • Enslaved: the link has carrier and is enslaved to bond or bridge master network interface
    • Routable: the link has carrier and routable address configured
  • The setup status is one of the following:
    • Pending: udev is still processing the link, we don’t yet know if we will manage it
    • Failed: networkd failed to manage the link
    • Configuring: in the process of retrieving configuration or configuring the link
    • Configured: link configured successfully
    • Unmanaged: networkd is not handling the link
    • Linger: the link is gone, but has not yet been dropped by networkd
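
The default waiting logic described above can be sketched roughly as follows. This is only an illustration of the prose, not the actual systemd implementation: the `Link` class and both function names are hypothetical, and the state ordering is simply the order in which the operational states are listed above.

```python
from dataclasses import dataclass

# Operational states in ascending order, following the list above.
OPERATIONAL_STATES = [
    "missing", "off", "no-carrier", "dormant", "degraded-carrier",
    "carrier", "degraded", "enslaved", "routable",
]


@dataclass
class Link:
    """Hypothetical snapshot of one link as seen by networkd."""
    name: str
    operational_state: str
    setup_state: str
    required_for_online: bool = True  # RequiredForOnline=yes is the default


def link_is_online(link, minimum="degraded"):
    """A link counts as online if its operational state is >= `minimum`."""
    states = OPERATIONAL_STATES
    return states.index(link.operational_state) >= states.index(minimum)


def wait_online_satisfied(links):
    """Return True once the default wait-online condition would be met."""
    # Loopback is ignored by default; unmanaged links are not waited for.
    considered = [
        l for l in links
        if l.name != "lo"
        and l.setup_state != "unmanaged"
        and l.required_for_online
    ]
    # Every considered link must be fully configured or failed ...
    all_settled = all(l.setup_state in ("configured", "failed") for l in considered)
    # ... and at least one of them must be online.
    any_online = any(link_is_online(l) for l in considered)
    return all_settled and any_online
```

Note that under this model a single link with only a link-local address (“degraded”) is enough to satisfy the check, which is one of the gaps the definition below tries to close.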

nm-online’s logic [nm-online (1)], using the -s/--wait-for-startup parameter:

  • Startup is considered complete once NetworkManager has activated (or attempted to activate) every auto-activate connection (autoconnect=true) which is available given the current network state. By default, connections have the ipv4.may-fail and ipv6.may-fail properties set to yes; this means that NetworkManager waits for one of the two address families to complete configuration before considering the connection activated.
  • The default timeout is 30 seconds

Ubuntu’s definition of “online”

We consider a system to be “online” when ALL of the following conditions are met:

  • all non-optional interfaces MUST be up on link layer (optional: no in netplan sense)
    • including completion of ipv6 RA link-local and/or ipv4 link-local if enabled on the interface, except if those are explicitly marked as “optional” (see “MUST NOT” section below)
  • at least one interface MUST be up on the link layer and have received layer 3 (IP) configuration
    • incl. IP address of at least one address family and corresponding routes (ignoring IPv6/IPv4 link local addresses), cf. systemd’s RequiredFamilyForOnline= configuration
  • there MUST be a default route for at least one configured address family
  • discovery of the default routes for all other configured address families MUST have been attempted (succeeded or failed) – Including routes provided via DHCP, IPv6 RA, OSPF, BGP, … (if available/enabled)
  • DNS MUST be configured

The status of “online” MUST NOT be delayed or blocked by the following:

  • link status or configuration of interfaces that are marked optional (optional: yes in netplan sense)
  • address sources that are defined as “optional” for an interface (optional-addresses in netplan sense, e.g. [ipv4-ll, dhcp6])
  • Configuration of a default route for an address family of which no interfaces have addresses defined
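
To make the MUST / MUST NOT rules above easier to discuss, here is one possible reading of them as a predicate. Everything in it is an assumption made for illustration: the `Interface` and `SystemState` classes, the field names, and the choice to fold link-local completion into a single `link_ready` flag are not part of any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class Interface:
    name: str
    optional: bool = False   # "optional: true" in the netplan sense
    # Link layer is up and any non-optional link-local addressing has completed.
    link_ready: bool = False
    # Families ("ipv4", "ipv6") with a non-link-local address and routes.
    configured_families: set = field(default_factory=set)


@dataclass
class SystemState:
    interfaces: list
    # Per family: "found", "failed" (attempted, nothing found) or "pending".
    default_route: dict = field(default_factory=dict)
    dns_configured: bool = False


def is_online(state):
    required = [i for i in state.interfaces if not i.optional]
    # All non-optional interfaces must be up; optional ones never block.
    if not all(i.link_ready for i in required):
        return False
    # At least one interface must be up and have layer 3 configuration.
    with_l3 = [i for i in state.interfaces if i.link_ready and i.configured_families]
    if not with_l3:
        return False
    # A default route must exist for at least one configured family.
    families = set().union(*(i.configured_families for i in with_l3))
    if not any(state.default_route.get(f) == "found" for f in families):
        return False
    # Route discovery for the other configured families must have been attempted.
    # Families without any addresses are never looked at, so they cannot block.
    required_families = set().union(*(i.configured_families for i in required))
    if any(state.default_route.get(f, "pending") == "pending" for f in required_families):
        return False
    # DNS must be configured.
    return state.dns_configured
```

For example, a dual-stack eth0 with an IPv4 default route and a failed IPv6 route discovery counts as online once DNS is configured, while the same system with IPv6 discovery still pending does not.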

Reaching of network-online.target:

  • A “wait-online daemon” should be running in the background, checking for the definition of online, according to this specification.
    • This can be either a (modified) version of “systemd-networkd-wait-online”, “nm-online”, a new daemon listening to netlink like “netplan-wait-online”, or a combination of those
  • The wait-online service should exit with a success return code once the “online” state as described in this spec is reached; until then, it shall keep running indefinitely.
    • This blocks the starting of services pulling in network-online.target via a Wants= or Requires= dependency on purpose, in cases where networking is not available
    • This has the potential to block the whole boot process if services pull in network-online.target and, at the same time, order themselves After=network-online.target and Before=multi-user.target (or a similar higher-level target, even indirectly through other service dependencies or starting order). Such services need to be identified and fixed, as the network being down should never delay or block the overall boot process. It should only block services that actually depend on the networking being “online”, while continuing to boot any other services and reaching the final target in parallel.
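
The intended exit behavior of such a wait-online service can be sketched as a simple loop. This is a minimal sketch only: `wait_online` and its parameters are hypothetical, and a real daemon would more likely react to netlink events than poll on a timer.

```python
import time


def wait_online(is_online, poll_interval=1.0, sleep=time.sleep):
    """Block until is_online() returns True, then return exit status 0.

    There is deliberately no timeout: while the system is not online,
    the function keeps waiting indefinitely.
    """
    while not is_online():
        sleep(poll_interval)
    return 0
```

The `is_online` callable stands for whatever check implements this specification; the `sleep` parameter exists only so the loop can be exercised without real delays.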

Q/A (from previous discussions)

  • (@slyon) What about WiFi (on Desktop)?

    • those interfaces SHOULD be marked as “optional: true” or not be defined at all (“renderer: NetworkManager” for all interfaces), so they can be ignored by the waiting logic (@slyon)
  • (@xnox) What if network connectivity drops and gets re-established? Should we bounce network-online.target (aka restart it)? We can declare units to be restarted when network-online.target is restarted, if they are otherwise incapable of dynamically detecting networking loss and resumption themselves.

    • Fix this in a “stage 2” attempt, focus on fixing network-online.target bringup for now (@vorlon)
  • (~any upstream) Do we really want to adopt Ubuntu’s new definition of “online”?

    • It doesn’t have to be adopted upstream for us to be making Ubuntu better for our users. The goal is to get this agreed with the systemd upstream community; but that should not be a blocker. (@vorlon)
  • We should not try to change systemd-networkd-wait-online’s definition of “online” but only extend the tool (upstream, if possible) using “STATE_TAGS”, allowing to define and reach Ubuntu’s state of online from an external definition (e.g. netplan.io YAML). (@slyon)

  • (@rbasak) Some packages can be configured differently. For example, “named” serving DNS authoritatively bound to particular interfaces might need to wait until the “network is up”. But for “named” configured as a local recursive resolver, it’s the opposite.

    • We can only handle a package’s default configuration case here. There will always be cases where if you configure one thing in one place, you’re expected/required to configure another thing in another place to match. (@rbasak)
  • (@vorlon) Does the Desktop Team need input on the changes to the definition of nm-online?

  • (@vorlon) if we are using NM which has support for detecting captive portals, is it a requirement that we have gotten through the captive portal?

    • IMO we should not block on captive portals, as there isn’t really any way to get through those during boot. (@slyon)

Further information

Thank you for working on this! It’d be great to have this better specified. We get bug reports on a regular basis, but it is never clear to what extent they require changes in packaging or just differences in how users should configure their systems.

I’ve been collecting these reports using the tag network-online-ordering. Currently there are 14. It might be worth going through these and considering those use cases against this spec to help validate that the spec provides a reasonable answer for all of these user stories. I’d also be interested to know if the packages are all correctly defining service units with respect to this spec. I might do this myself if I have the time because I don’t think I can have an opinion on the spec before considering these!

Question: in your answer to “What about WiFi (on Desktop)”, are you saying that on the default desktop (NetworkManager for all interfaces), network-online.target will fire before a laptop is really online at all (eg. when I’m somewhere remote without connectivity)? I think that might be the case already; I’d just like to make sure I understand, and suggest that if it’s the case then the spec should point it out explicitly because it’s not what a user would understand to mean “online”.

On this point, it’s quite common for users to configure a specific bind address, find that the service then fails to start on boot, and then conclude that our packaging is wrong because we didn’t define Wants=network-online.target in packaging. The counterargument is that, based on the above quote, it’s necessary for the user to also add Wants=network-online.target as a local override in this case.

It’d be nice to define if packaging should configure network-listening daemons with Wants=network-online.target in the case that they work by default without that, but fail if the user configures a specific bind address. Or should all network-listening daemons ship with Wants=network-online.target regardless? Or never, unless something doesn’t work by default otherwise?

network-online.target should still be avoided as a dependency whenever not strictly necessary. A service that can be configured to bind to a specific address, but doesn’t by default, should not depend on this target as it will make the boot slower/less reliable for users in the default case.

Thank you for pointing this out! I think checking the spec against those bug reports / user stories should be very helpful in validating the spec. But please be aware that the spec is not yet (fully) implemented in Ubuntu. Therefore, behavior on current real systems will be closer to what is described in the “Status quo” section than to the “definition of online” section.

Indeed, NetworkManager (i.e. nm-online -s) seems to be pretty lax with waiting for the online state (cf. nm-online(1)):

Wait for NetworkManager startup to complete, rather than waiting for network connectivity specifically. Startup is considered complete once NetworkManager has activated (or attempted to activate) every auto-activate connection which is available given the current network state.

So AFAIU, if you’re remote without wifi signal, NM will attempt to activate that connection profile, fail, then mark “startup complete” and activate network-online.target. According to the new “definition of online” this is NOT what we want: network-online.target should stay in a “pending” state, while booting of multi-user.target or graphical.target continues in parallel. (This has the potential to “block” the boot process, though, if some services pull in network-online.target and order themselves After=network-online.target and Before=multi-user.target at the same time.)
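
To spot the problematic combination mentioned in the parenthesis, a check along these lines could be run over unit definitions. It is only a sketch: the dict-based unit representation and the `may_block_boot` name are hypothetical, and it looks at direct dependencies only, so indirect orderings through other units would be missed.

```python
# Targets that mark the end of a normal boot (assumed list, for illustration).
BOOT_TARGETS = ("multi-user.target", "graphical.target")
ONLINE = "network-online.target"


def may_block_boot(unit):
    """unit: dict mapping 'Wants', 'Requires', 'After', 'Before' to lists of unit names."""
    pulls_in = ONLINE in unit.get("Wants", []) + unit.get("Requires", [])
    ordered_after = ONLINE in unit.get("After", [])
    ordered_before_boot = any(t in unit.get("Before", []) for t in BOOT_TARGETS)
    # Only the combination of all three can hold up the boot targets.
    return pulls_in and ordered_after and ordered_before_boot
```

A unit that merely wants and orders itself after network-online.target, without a Before= on a boot target, is not flagged: it would wait for the network on its own while the rest of the boot proceeds.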

This has implications if some packages were to ship Wants=network-online.target by default, which I believe is already the case. Such a package would never start its service on a laptop that is offline. This might break user expectations. Some users would expect a service to be available on localhost even on a laptop that is offline. For example, right now in Jammy, lighttpd, inspircd, openbsd-inetd, squid all declare this. I haven’t fully tested if this means that the service would be unavailable. dovecot does seem to be in this list but I noticed that it also declares a socket unit for example, so I’m not sure why dovecot is started by default at all. But it’s something to consider. lighttpd for example is perhaps the clearest in the above list in terms of services that users could reasonably expect to be available locally even when offline.

Maybe we will decide in Ubuntu that packaging that declares After/Wants=network-online.target is buggy in the general case. But then, if we decide that services should generally be available on localhost if that makes any sense at all, including on offline laptops, the problem will be maintaining that Ubuntu-specific delta against Debian packages that already do it, since we’d have made it an Ubuntu-specific problem.

I’m not trying to argue either way here. I have yet to form an opinion, and am just pointing out the implications of going in that direction.