Intermittent network hang in CLOSE-WAIT during cloud-init

When deploying largish models with Juju using the LXD provider, I sometimes (roughly 1 in 10 deployments) get containers or VMs that hang in cloud-init while downloading packages.

Typically I’d see some apt processes hanging around:

root         588  0.0  0.0  18932 10368 ?        S    07:17   0:00 /usr/bin/apt-get --option=Dpkg::Options::=--force-confold --option=Dpkg::options::=--force-unsafe-io --assume-yes --quiet update
_apt         609  0.0  0.0  23164 10240 ?        S    07:17   0:00 /usr/lib/apt/methods/http
_apt         610  0.0  0.0  23168 10368 ?        S    07:17   0:00 /usr/lib/apt/methods/http
_apt         612  0.0  0.0  16088  6784 ?        S    07:17   0:00 /usr/lib/apt/methods/gpgv
_apt         731  0.0  0.0  24480  9856 ?        S    07:17   0:00 /usr/lib/apt/methods/store

And a connection stuck in CLOSE-WAIT, i.e. the remote end has already closed the connection but the local http method never closes its side:

ss -tnp
State                   Recv-Q               Send-Q                             Local Address:Port                              Peer Address:Port              Process              
CLOSE-WAIT              1                    0                                   10.35.33.133:46812                                10.0.0.22:3142               users:(("http",pid=609,fd=3))
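
(In case it’s useful, a quick way to confirm the method is genuinely wedged rather than just slow, using the pid/fd from the output above:)

# attach to the http method; a blocking read/select that never returns means it's stuck
strace -p 609
# inspect the fd the stuck socket is attached to
ls -l /proc/609/fd/3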

From LXD's point of view the container is up and running (lxc info):

lxc info --show-log juju-9ed1bb-7
Name: juju-9ed1bb-7
Status: RUNNING
Type: container
Architecture: x86_64
PID: 1208709
Created: 2024/10/04 07:17 UTC
Last Used: 2024/10/04 07:17 UTC

Resources:
  Processes: 32
  CPU usage:
    CPU usage (in seconds): 14
  Memory usage:
    Memory (current): 118.45MiB
  Network usage:
    eth0:
      Type: broadcast
      State: UP
      Host interface: vethc3feacd3
      MAC address: 00:16:3e:7a:24:7a
      MTU: 1500
      Bytes received: 1.70MB
      Bytes sent: 28.85kB
      Packets received: 740
      Packets sent: 248
      IP addresses:
        inet:  10.35.33.133/24 (global)
        inet6: fd42:753e:cfcf:5e23:216:3eff:fe7a:247a/64 (global)
        inet6: fe80::216:3eff:fe7a:247a/64 (link)
    lo:
      Type: loopback
      State: UP
      MTU: 65536
      Bytes received: 2.18kB
      Bytes sent: 2.18kB
      Packets received: 20
      Packets sent: 20
      IP addresses:
        inet:  127.0.0.1/8 (local)
        inet6: ::1/128 (local)

Log:

lxc juju-9ed1bb-7 20241004071735.232 WARN     idmap_utils - ../src/src/lxc/idmap_utils.c:lxc_map_ids:165 - newuidmap binary is missing
lxc juju-9ed1bb-7 20241004071735.232 WARN     idmap_utils - ../src/src/lxc/idmap_utils.c:lxc_map_ids:171 - newgidmap binary is missing
lxc juju-9ed1bb-7 20241004071735.235 WARN     idmap_utils - ../src/src/lxc/idmap_utils.c:lxc_map_ids:165 - newuidmap binary is missing
lxc juju-9ed1bb-7 20241004071735.235 WARN     idmap_utils - ../src/src/lxc/idmap_utils.c:lxc_map_ids:171 - newgidmap binary is missing
lxc juju-9ed1bb-7 20241004073302.579 WARN     idmap_utils - ../src/src/lxc/idmap_utils.c:lxc_map_ids:165 - newuidmap binary is missing
lxc juju-9ed1bb-7 20241004073302.579 WARN     idmap_utils - ../src/src/lxc/idmap_utils.c:lxc_map_ids:171 - newgidmap binary is missing
lxc juju-9ed1bb-7 20241004074848.228 WARN     idmap_utils - ../src/src/lxc/idmap_utils.c:lxc_map_ids:165 - newuidmap binary is missing
lxc juju-9ed1bb-7 20241004074848.228 WARN     idmap_utils - ../src/src/lxc/idmap_utils.c:lxc_map_ids:171 - newgidmap binary is missing

Killing the associated apt process lets cloud-init continue as normal, though.
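
For the record, a minimal sketch of that workaround (assuming the cache port 3142 from the ss output above; adjust to taste):

# kill whatever process is holding a CLOSE-WAIT connection to the cache
for pid in $(ss -tnpH state close-wait '( dport = :3142 )' | grep -oP 'pid=\K[0-9]+'); do
  kill "$pid"
done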

This is on noble, kernel 6.8.0-44-generic.

Anyone seen this before?

The port 3142 on a private IP caught my attention (10.0.0.22:3142): that's apt-cacher-ng's default port. Could it be that you're running a local apt-cacher(-ng), and the little daemon collapses under load?

Good eye: indeed, I'm running apt-cacher-ng there so that my line doesn't get clogged (as much). I didn't see it collapse in the sense of crashing, though; it was still responsive when I ran apt-get against it from another machine.
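
(For reference, the kind of spot check I mean, run from another machine; the cache address is the one from the ss output above, and the InRelease path is just an example:)

# apt through the cache as an http proxy
apt-get -o Acquire::http::Proxy="http://10.0.0.22:3142" update

# or a single request via curl
curl -sx http://10.0.0.22:3142 -I http://archive.ubuntu.com/ubuntu/dists/noble/InRelease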

Back when I was using apt-cacher-ng, I would sometimes see similar behaviour when running updates on many instances in parallel. I've since moved to an NGINX instance doing on-demand caching and have not had that issue.

Interesting, I might try that. I assume you configured nginx with proxy_cache and proxy_pass pointing at an upstream archive server, or something along those lines?

Yeah, pretty much. Here are a few snippets which I hope will be useful:

sites-enabled/ubuntu-mirrors:

server {
  listen 80;
  listen [::]:80;
  server_name archive.ubuntu.com;
  access_log /var/log/nginx/archive.access.log apt-mirror;
  error_log  /var/log/nginx/archive.error.log;

  # apt proxy config
  include conf.d/apt-proxy.inc;

  # by-hash and pool files are immutable by definition
  location ~ ^/(ubuntu/(?:dists/.*/by-hash|pool)/.*) {
    proxy_ignore_headers Cache-Control;
    proxy_cache_valid 200 30d;
    proxy_cache_revalidate off;
    include conf.d/archive-backend.inc;
  }

  location ~ ^/(.+) {
    include conf.d/archive-backend.inc;
  }
}

conf.d/apt-proxy.inc:

# cache
proxy_cache apt-cache;
proxy_cache_valid 200 301 302 5m;
proxy_cache_revalidate on;
proxy_cache_lock on;

# NGINX does not cache responses if proxy_buffering is set to off. It is on by default.
proxy_buffering on;

# indicates if it was a MISS, HIT, REVALIDATE, STALE, etc
add_header Cache-Status $upstream_cache_status;

# persistent connections
proxy_http_version 1.1;
proxy_set_header "Connection" "";

conf.d/archive-backend.inc:

set $upstream_host 'us.archive.ubuntu.com';
proxy_pass http://$upstream_host/$1;

I use .inc snippets as I reuse those include files elsewhere, but that's obviously not required.
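
One caveat: the snippets above also assume a few directives in the http context that I haven't shown. A minimal sketch of what those would look like (the paths, sizes, and resolver address are just placeholders):

# nginx.conf (http context)

# cache zone referenced by "proxy_cache apt-cache" above
proxy_cache_path /var/cache/nginx/apt levels=1:2 keys_zone=apt-cache:10m
                 max_size=50g inactive=60d use_temp_path=off;

# log format referenced by the access_log line (just an example)
log_format apt-mirror '$remote_addr [$time_local] "$request" '
                      '$status $body_bytes_sent $upstream_cache_status';

# proxy_pass with a variable host is resolved at runtime, so a resolver is required
resolver 127.0.0.53 valid=300s;

And since the server_name is archive.ubuntu.com, presumably clients just resolve that name to the caching box via DNS or a hosts entry, so nothing needs to change on the apt side.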
