LXD container instance lazy modules loading

Index LX070
Title LXD CT instance lazy kernel modules loading
Subteam LXD
Status Completed
Authors Aleksandr Mikhalitsyn
Stakeholders Thomas Parrott
Type Implementation
Created Feb 28, 2024

Abstract

Make it possible to use modprobe inside the container.

Fortunately, most software these days runs fine inside containers and gracefully handles the lack of permission to load modules. Still, features like this are meant to make the move to containers easier (for example, when a proprietary or closed-source application tries to load a module and fails).

Possible usage scenario:

lxc launch ubuntu:jammy myct
lxc exec myct apt install linux-generic-hwe-22.04
lxc exec myct modprobe -v overlay

As you can see, from the container user's perspective everything works as if we were inside a VM or on a bare-metal Linux installation.

We don’t want to intercept the delete_module syscall, as unloading modules is usually not something that software in a container needs to do. Modules should remain unloadable only by the root user on the host.

Rationale

We already have a CT instance option, linux.kernel_modules. It specifies a list of Linux kernel modules to load before the container instance starts. Sometimes it is useful to load kernel modules on demand from inside the container, for example because not every module is needed all the time and the required set depends on the workload.

Of course, this does not mean that we want to let a user inside the container load module binaries supplied by that user; that would be an obvious security hole. Instead, we only use the binary passed from the container to detect the module name, and then load that module from a trusted module binary on the host filesystem. We introduce a new option, linux.kernel_modules.load, with the values boot and ondemand. In addition, when linux.kernel_modules.load=ondemand, the existing linux.kernel_modules list serves as the list of kernel modules that are allowed to be loaded. The default value is boot (the current behavior). Taken together, this means we are not lifting any existing restrictions: the host system administrator still needs to configure the instance carefully and decide explicitly what is allowed and what is not.

Specification

Design

  1. We need to intercept the init_module/finit_module syscalls using seccomp (we already have all the infrastructure in place in LXD).

  2. Permission checks:

  • Check that lazy loading is enabled for the container instance.
  • Check the capability of the user inside the container (must have CAP_SYS_MODULE inside).

  3. We need to deal with the syscall parameters and get access to the module ELF contents.

  4. Parse the ELF and extract the .modinfo section from it.

Example:

readelf -p .modinfo /lib/modules/6.5.0-21-generic/kernel/net/netfilter/nft_log.ko
readelf: Warning: Separate debug info file /usr/lib/modules/6.5.0-21-generic/kernel/net/netfilter/nft_log.ko found, but CRC does not match - ignoring

String dump of section '.modinfo':
  [     0]  description=Netfilter nf_tables log module
  [    2b]  alias=nft-expr-log
  [    3e]  author=Patrick McHardy <kaber@trash.net>
  [    67]  license=GPL
  [    73]  srcversion=5273BF1A794F4B0CDDEE430
  [    96]  depends=nf_tables
  [    a8]  retpoline=Y
  [    b4]  intree=Y
  [    bd]  name=nft_log
  [    ca]  vermagic=6.5.0-21-generic SMP preempt mod_unload modversions

Then we can do something like:

readelf -p .modinfo /lib/modules/6.5.0-21-generic/kernel/net/netfilter/nft_log.ko | grep -o -P "(?<=name\=)(.*)"

to extract the module name. Of course, this is for demonstration only. All of the ELF parsing can be done with the help of https://pkg.go.dev/debug/elf
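As a rough illustration of the same extraction in Go, here is a minimal sketch built on debug/elf. The function names parseModinfo and moduleNameFromFile are hypothetical and not taken from the LXD code base, and the sketch assumes an uncompressed .ko file. It relies on the .modinfo section being a sequence of NUL-terminated key=value strings, which is what the readelf string dump shows.

```go
package main

import (
	"bytes"
	"debug/elf"
	"fmt"
)

// parseModinfo splits the raw bytes of a .modinfo section into key/value
// pairs. Entries are NUL-terminated "key=value" strings. For keys that
// repeat (such as "alias"), only the first value is kept in this sketch.
func parseModinfo(data []byte) map[string]string {
	info := make(map[string]string)
	for _, entry := range bytes.Split(data, []byte{0}) {
		key, value, found := bytes.Cut(entry, []byte("="))
		if !found || len(key) == 0 {
			continue // padding or malformed entry
		}
		if _, seen := info[string(key)]; !seen {
			info[string(key)] = string(value)
		}
	}
	return info
}

// moduleNameFromFile (hypothetical helper) opens an ELF kernel module and
// returns the "name" field of its .modinfo section.
func moduleNameFromFile(path string) (string, error) {
	f, err := elf.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	sec := f.Section(".modinfo")
	if sec == nil {
		return "", fmt.Errorf("%s: no .modinfo section", path)
	}
	data, err := sec.Data()
	if err != nil {
		return "", err
	}
	name, ok := parseModinfo(data)["name"]
	if !ok || name == "" {
		return "", fmt.Errorf("%s: no name field in .modinfo", path)
	}
	return name, nil
}

func main() {
	// Hand-written sample that mirrors a few fields of the readelf dump.
	sample := []byte("license=GPL\x00depends=nf_tables\x00name=nft_log\x00")
	fmt.Println(parseModinfo(sample)["name"])
}
```

Keeping the byte-level parsing separate from the file handling makes it possible to exercise the parser on in-memory data without touching a real module.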

  5. Check that the module with the extracted name is in the allowlist.

  6. Run modprobe <mod name> on the host.

That’s it.
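The allowlist check and the modprobe call could look roughly like the sketch below. It assumes that linux.kernel_modules is a comma-separated list. Two details are assumptions of this sketch rather than part of the spec: treating "-" and "_" as equivalent (as modprobe itself does) and passing "--" so that a name starting with "-" cannot be read as an option. The helper names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize trims whitespace and maps '-' to '_' so that "nft-log" and
// "nft_log" compare equal. This normalization is an assumption of the
// sketch, modeled on how modprobe treats module names.
func normalize(name string) string {
	return strings.ReplaceAll(strings.TrimSpace(name), "-", "_")
}

// isAllowed reports whether name appears in the comma-separated allowlist
// (the value of linux.kernel_modules).
func isAllowed(name, allowlist string) bool {
	want := normalize(name)
	if want == "" {
		return false
	}
	for _, entry := range strings.Split(allowlist, ",") {
		if normalize(entry) == want {
			return true
		}
	}
	return false
}

// modprobeArgs builds (but does not run) the host-side command line.
// The "--" ends option parsing before the module name.
func modprobeArgs(name string) []string {
	return []string{"modprobe", "--", name}
}

func main() {
	allowlist := "overlay,nft_log" // hypothetical linux.kernel_modules value
	for _, name := range []string{"overlay", "nft-log", "vfat"} {
		if !isAllowed(name, allowlist) {
			fmt.Printf("%s: denied\n", name)
			continue
		}
		fmt.Printf("%s: would run %v\n", name, modprobeArgs(name))
	}
}
```

The important property is that only the name comes from the container; the module binary that modprobe loads is always resolved on the host.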

One important caveat: the Go debug/elf package is not hardened against adversarial input. Its documentation says so explicitly:
https://pkg.go.dev/debug/elf#hdr-Security

This is not a big deal, because this code will run as a separate helper process, and we can protect ourselves from possible attacks by dropping capabilities in the helper that parses the user-supplied (untrusted) module ELF file.

We also need to ensure that if a user inside the container feeds the debug/elf library and the interception code input they cannot handle, this does not affect init_module/finit_module seccomp processing for other instances. This matters because the LXD daemon has a single interception processing server (see seccomp.NewSeccompServer), and a malicious user must not be able to stop the whole seccomp processing server from working.
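To make the failure-isolation requirement concrete, here is a simplified in-process sketch. It is not the helper-process design described in this spec: it only shows the property we want, namely that a panic or a stall while parsing one request turns into an error for that request and leaves the caller free to serve others. The name parseIsolated is hypothetical. A goroutine that times out keeps running and cannot be killed, and recover does not help against memory exhaustion, which is part of why a separate helper process is the stronger option.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTimeout = errors.New("module parsing timed out")

type result struct {
	name string
	err  error
}

// parseIsolated runs parse in its own goroutine, converts a panic into an
// error, and gives up after timeout so the caller is never blocked by a
// single misbehaving request.
func parseIsolated(parse func() (string, error), timeout time.Duration) (string, error) {
	// Buffered so the goroutine can always deliver its single result,
	// even if the caller has already timed out and stopped listening.
	done := make(chan result, 1)
	go func() {
		defer func() {
			if r := recover(); r != nil {
				done <- result{err: fmt.Errorf("module parsing panicked: %v", r)}
			}
		}()
		name, err := parse()
		done <- result{name: name, err: err}
	}()

	select {
	case res := <-done:
		return res.name, res.err
	case <-time.After(timeout):
		return "", errTimeout
	}
}

func main() {
	good := func() (string, error) { return "nft_log", nil }
	bad := func() (string, error) { panic("malformed ELF") }
	for _, parse := range []func() (string, error){good, bad} {
		name, err := parseIsolated(parse, time.Second)
		fmt.Printf("name=%q err=%v\n", name, err)
	}
}
```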

API changes

Introduce an API extension to signal that the LXD daemon supports the new instance config option linux.kernel_modules.load, which has two possible values. The default value is boot, which keeps the old behavior. If the value is ondemand, the existing linux.kernel_modules instance config option is treated as an allowlist of modules.
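The semantics of the two config keys can be summarized with a small sketch. It assumes the instance config is available as a string map and that linux.kernel_modules is comma-separated; the function names loadMode, splitModules and plan are hypothetical and only illustrate the intended behavior.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	loadBoot     = "boot"
	loadOnDemand = "ondemand"
)

// loadMode returns the effective linux.kernel_modules.load value, treating
// an unset key as "boot" and rejecting unknown values.
func loadMode(config map[string]string) (string, error) {
	switch v := config["linux.kernel_modules.load"]; v {
	case "", loadBoot:
		return loadBoot, nil
	case loadOnDemand:
		return loadOnDemand, nil
	default:
		return "", fmt.Errorf("invalid linux.kernel_modules.load value %q", v)
	}
}

// splitModules parses the comma-separated linux.kernel_modules value,
// dropping empty entries.
func splitModules(value string) []string {
	var mods []string
	for _, m := range strings.Split(value, ",") {
		if m = strings.TrimSpace(m); m != "" {
			mods = append(mods, m)
		}
	}
	return mods
}

// plan returns the modules to load before the instance starts and the
// modules that may be loaded later on request from inside the container.
func plan(config map[string]string) ([]string, []string, error) {
	mode, err := loadMode(config)
	if err != nil {
		return nil, nil, err
	}
	mods := splitModules(config["linux.kernel_modules"])
	if mode == loadOnDemand {
		return nil, mods, nil
	}
	return mods, nil, nil
}

func main() {
	config := map[string]string{
		"linux.kernel_modules":      "overlay,nft_log",
		"linux.kernel_modules.load": "ondemand",
	}
	atBoot, onDemand, err := plan(config)
	fmt.Println(atBoot, onDemand, err)
}
```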

Types

TBD

Routes

No new routes will be introduced.

CLI changes

TBD

Database changes

No database changes.

I’d suggest opening with this part - the reason for doing it.

whitelist is on the list of offensive terms, suggest allow list instead.

I wonder if instead of an allow list, we re-use the linux.kernel_modules setting for the list of modules allowed, and introduce a new setting linux.kernel_modules.load=dynamic (defaulting to boot) which would indicate when/how the modules in linux.kernel_modules are loaded?

What do you think? Do you see a benefit to being able to load some modules at start time and others on demand?

That’s an interesting point.

Sometimes it can be useful, but I would go for simplicity. I like your idea with linux.kernel_modules.load=dynamic thing.

As we are introducing a new config setting, that will require an API extension for feature detection.

Does this part need updating to reflect that linux.kernel_modules will become the allow list when linux.kernel_modules.load=dynamic? And that by default linux.kernel_modules.load will use the value boot, meaning the existing behaviour of loading them at boot?

Please could you expand on this bit a little, what does “it won’t affect init_module/finit_module seccomp processing for other instances.” mean?

Should we clarify also that we are not intending to implement module removal interception (for the reasons we may unload something already needed on the host)?

yes, I have added a clarification.

we have a single interception processing server in LXD daemon (see seccomp.NewSeccompServer) and we need to ensure that malicious user can’t prevent the whole seccomp processing server from working.

This bit needs tweaking to not mention the previously proposed new api extension config key.

sure. Fixed! Thanks!

Thanks, looks good to me!

Implementation PR https://github.com/canonical/lxd/pull/13151

I decided not to go with init_module() interception and to implement only finit_module interception. This syscall is considered the modern, race-free way to load modules, and it has been available for more than 8 years. It just makes no sense to implement both of them.
