Failed update validation for device

Hi! I look after a bunch of VMs running Ubuntu 20.04 on s390x. These VMs host Travis CI workers. Each worker takes a build request off a queue and spins up an LXD container to run the build script in. I quite often see builds fail with messages like this:

message:time="2023-07-12T00:59:22Z" level=error msg="failed to update the container config" err="Failed update validation for device \"eth0\": IP address \"192.168.0.3\" already defined on another NIC" job_id=8513524 job_path=cn/cn-workspace/jobs/8513524 pid=836904 processor=******** repository=cn/cn-workspace self=backend/lxd_provider uuid=********

msgtype:failed to update the container config

The request then gets put back on the queue and is picked up by another worker, which may or may not hit the same problem. The request bounces among the workers until it reaches one that can launch the container properly. Sometimes the build even succeeds on a worker that had previously hit the problem with that same build.

Obviously, multiple requeues waste time and resources, and I’d like to avoid that if possible. So what I’d like to know is what the problem actually is, what might cause it, and how to stop it happening, if that is possible.

I have no idea what is going on under the hood (this is something I inherited when a colleague left).

Thanks!

lxd version 5.15
uname -a
Linux worker-01 5.15.0-76-generic #83~20.04.1-Ubuntu SMP Wed Jun 21 20:23:49 UTC 2023 s390x s390x s390x GNU/Linux

It sounds like your CI is trying to set a static DHCP reservation for an instance NIC that is conflicting with another instance’s NIC reservation.
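For illustration only (your CI presumably sets this through the LXD API rather than the CLI, and the instance name here is made up), such a reservation is attached to the instance’s NIC device, and LXD refuses a second device that claims the same address:

# Hypothetical example: pin 192.168.0.3 to the eth0 NIC of an instance called "travis-job-a"
lxc config device override travis-job-a eth0 ipv4.address=192.168.0.3
# Show the devices (and any ipv4.address reservation) currently set on an instance
lxc config device show travis-job-a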

Is an instance a container?
From what I understand there should be only one container running on the VM at a time, spun up for the duration of the build and then deleted.

Sounds like one of them is hanging around (could be stopped, but still existing) by the time the next one is updated.
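A rough way to check (sketch only; substitute the address from your error message) is to loop over every instance, running or stopped, and see which one still has that IP reserved on a device:

# Sketch: find any instance, running or stopped, that still reserves 192.168.0.3 on a NIC device
for c in $(lxc list --format csv -c n); do
    lxc config device show "$c" | grep -q '192.168.0.3' && echo "$c still reserves 192.168.0.3"
done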

You are not wrong! I just ran
sudo lxc list -c n | grep travis -c
on the worker VM that had the most recent occurrence of the error, and it turns out that there are 483 containers, 9 of which are running. That is not what I was expecting. Something is clearly not right with the housekeeping. I’m going to do some more digging to see if there is a pattern among the containers that have been left running, but in the meantime, is there a quick way to clean up all the stopped ones?

I think lxc stop -f --all should work to stop the running ones.

But we don’t have a lxc delete --all because it would be very dangerous.

You can provide multiple instance names to the lxc delete command though.
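As a sketch (assuming the leftover containers all have "travis" in their names, as your grep suggests; review the name list before piping it into anything destructive), you could select the stopped ones and hand them to lxc delete in one go:

# List stopped instances whose names contain "travis", then pass them to a single lxc delete call
# Check the output of everything before the final pipe first!
lxc list --format csv -c ns | awk -F, '$2 == "STOPPED" && $1 ~ /travis/ {print $1}' | xargs -r lxc delete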
