Hi! I look after a bunch of VMs running Ubuntu 20.04 on s390x. These VMs host Travis CI workers. Each worker takes a build request off a queue and spins up an LXD container to run the build script in. I quite often see builds fail with messages like this:
message:time="2023-07-12T00:59:22Z" level=error msg="failed to update the container config" err="Failed update validation for device \"eth0\": IP address \"192.168.0.3\" already defined on another NIC" job_id=8513524 job_path=cn/cn-workspace/jobs/8513524 pid=836904 processor=******** repository=cn/cn-workspace self=backend/lxd_provider uuid=********
msgtype:failed to update the container config
The request then gets put back on the queue and is picked up by another worker, which may or may not hit the same problem. The request bounces among the workers until it reaches one that can launch the container properly. Sometimes the build even succeeds on a worker that had previously failed with this same build.
Obviously, multiple requeues waste time and resources, and I’d like to avoid that if possible. So what I’d like to know is what the problem actually is, what might cause it, and how to stop it happening, if that is possible.
I have no idea what is going on under the hood (this is something I inherited when a colleague left).
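In case it helps narrow things down, here is a minimal sketch of a check that could be run on an affected worker to see which instance, if any, already has the address from the error pinned on a NIC. It assumes that `lxc list --format json` returns a list of instances carrying `name`, `devices` and `expanded_devices` fields, and that a static address appears as the `ipv4.address` key on a device of type `nic`. The helper names `find_nics_with_ip` and `list_instances` are made up for illustration. It only looks at configured addresses, not at DHCP leases.

```python
import json
import subprocess


def find_nics_with_ip(instances, ip):
    """Return (instance name, device name) pairs whose NIC config pins `ip`."""
    matches = []
    for inst in instances:
        # Assumption: expanded_devices also contains devices inherited from
        # profiles; fall back to the instance's own devices if it is absent.
        devices = inst.get("expanded_devices") or inst.get("devices") or {}
        for dev_name, dev in devices.items():
            if dev.get("type") == "nic" and dev.get("ipv4.address") == ip:
                matches.append((inst.get("name"), dev_name))
    return matches


def list_instances():
    """Ask the local LXD daemon for all instances as parsed JSON."""
    result = subprocess.run(
        ["lxc", "list", "--format", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # 192.168.0.3 is the address reported in the error message above.
    for name, dev in find_nics_with_ip(list_instances(), "192.168.0.3"):
        print(f"{name}: {dev}")
```

If this prints an instance left over from an earlier build, that would at least suggest the address is still held by a container that was never cleaned up, though that is a guess on my part.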
Thanks!
lxd version 5.15
uname -a
Linux worker-01 5.15.0-76-generic #83~20.04.1-Ubuntu SMP Wed Jun 21 20:23:49 UTC 2023 s390x s390x s390x GNU/Linux