Index | LX070 |
---|---|
Title | LXD CT instance lazy kernel modules loading |
Subteam | LXD |
Status | Completed |
Authors | Aleksandr Mikhalitsyn |
Stakeholders | Thomas Parrott |
Type | Implementation |
Created | Feb 28, 2024 |
Abstract
Make it possible to use modprobe inside the container.
Fortunately, most software these days runs fine inside containers and gracefully handles the lack of permission to load modules. Still, a feature like this makes it easier to adapt workloads to containers, for example when a proprietary or closed-source application tries to load a module and fails.
Possible usage scenario:
lxc launch ubuntu:jammy myct
lxc exec myct apt install linux-generic-hwe-22.04
lxc exec myct modprobe -v overlay
As you can see, from the container user's perspective everything works as if we were inside a VM or on a bare-metal Linux installation.
We don’t want to intercept the delete_module syscall, as unloading modules is usually not something that software in the container needs to do. Modules should remain unloadable only by the root user on the host.
Rationale
We already have a CT instance option, linux.kernel_modules, which specifies a list of Linux kernel modules to load before the container instance starts. Sometimes it is useful to load kernel modules on demand from inside the container, for example because not every module is needed all the time and the required set depends on the workload.
Of course, this does not mean that we want to allow a user inside the container to load module binaries supplied by that user; that would be an obvious security hole. Instead, we only use the binary supplied from the container to detect the module name, and then load that module from a trusted module binary on the host filesystem. We introduce a new option, linux.kernel_modules.load, with two possible values: boot and ondemand. In addition, when linux.kernel_modules.load=ondemand, the existing linux.kernel_modules list is used as the list of kernel modules that are allowed to be loaded. The default value is boot (the current behavior). Taken together, this means that we are not lifting any existing restrictions: the host system administrator still needs to configure the instance carefully and decide explicitly what is allowed and what is not.
Specification
Design
- We need to intercept the init_module/finit_module syscalls using seccomp (LXD already has all the infrastructure in place).
- Permission checks:
  - Check that lazy loading is enabled for the container instance.
  - Check the capabilities of the user inside the container (the caller must have CAP_SYS_MODULE inside the container).
- We need to process the syscall parameters and get access to the module ELF contents.
- Parse the ELF and extract the .modinfo section from it.
Example:
readelf -p .modinfo /lib/modules/6.5.0-21-generic/kernel/net/netfilter/nft_log.ko
readelf: Warning: Separate debug info file /usr/lib/modules/6.5.0-21-generic/kernel/net/netfilter/nft_log.ko found, but CRC does not match - ignoring
String dump of section '.modinfo':
[ 0] description=Netfilter nf_tables log module
[ 2b] alias=nft-expr-log
[ 3e] author=Patrick McHardy <kaber@trash.net>
[ 67] license=GPL
[ 73] srcversion=5273BF1A794F4B0CDDEE430
[ 96] depends=nf_tables
[ a8] retpoline=Y
[ b4] intree=Y
[ bd] name=nft_log
[ ca] vermagic=6.5.0-21-generic SMP preempt mod_unload modversions
Then we can do something like:
readelf -p .modinfo /lib/modules/6.5.0-21-generic/kernel/net/netfilter/nft_log.ko | grep -o -P "(?<=name\=)(.*)"
to extract the module name. Of course, this is only for demonstration; the actual ELF parsing can be done with Go's debug/elf package: https://pkg.go.dev/debug/elf
- Check that the module with the extracted name is in the allowlist.
- Do modprobe <mod name>.
That’s it.
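The name extraction step above could look roughly like the following Go sketch. It assumes the module image is available through an io.ReaderAt and that .modinfo holds NUL-separated key=value entries, as the readelf dump suggests. The function names parseModinfo and moduleName are hypothetical and are not taken from the LXD code base.

```go
package main

import (
	"bytes"
	"debug/elf"
	"fmt"
	"io"
)

// parseModinfo splits the raw contents of a .modinfo section into key/value
// pairs. Entries are NUL-separated "key=value" strings; a key such as
// "alias" may appear more than once, so values are collected in a slice.
func parseModinfo(data []byte) map[string][]string {
	info := make(map[string][]string)
	for _, entry := range bytes.Split(data, []byte{0}) {
		key, value, ok := bytes.Cut(entry, []byte("="))
		if !ok {
			continue // skip empty padding and malformed entries
		}
		info[string(key)] = append(info[string(key)], string(value))
	}
	return info
}

// moduleName (hypothetical helper) returns the "name" field found in the
// .modinfo section of an untrusted module image.
func moduleName(r io.ReaderAt) (string, error) {
	f, err := elf.NewFile(r)
	if err != nil {
		return "", fmt.Errorf("parse ELF: %w", err)
	}
	defer f.Close()

	sec := f.Section(".modinfo")
	if sec == nil {
		return "", fmt.Errorf("no .modinfo section")
	}
	data, err := sec.Data()
	if err != nil {
		return "", fmt.Errorf("read .modinfo: %w", err)
	}

	// Refuse images with a missing, empty, or ambiguous name.
	names := parseModinfo(data)["name"]
	if len(names) != 1 || names[0] == "" {
		return "", fmt.Errorf("expected exactly one module name, found %d", len(names))
	}
	return names[0], nil
}

func main() {
	// Sample bytes mirroring part of the readelf dump above.
	sample := []byte("license=GPL\x00depends=nf_tables\x00name=nft_log\x00")
	fmt.Println(parseModinfo(sample)["name"])
}
```

Only the extracted name is used afterwards; the bytes supplied from the container are never handed to the kernel.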
One important caveat is that Go's debug/elf package is not hardened against adversarial input, as its documentation states explicitly: https://pkg.go.dev/debug/elf#hdr-Security. This is not a big problem, because the code that parses the user-supplied (untrusted) module ELF file will run as a separate helper process, and we can protect ourselves from possible attacks by dropping capabilities in that process.
We also need to ensure that if a user inside the container upsets the debug/elf library or the interception code, this does not affect init_module/finit_module seccomp processing for other instances. This matters because the LXD daemon has a single interception processing server (see seccomp.NewSeccompServer), so we must make sure that a malicious user cannot prevent the whole seccomp processing server from working.
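One way to keep a misbehaving parse from stalling the shared server is to run it in a child process with a deadline. The Go sketch below illustrates that idea only. The helper command line, the stdin/stdout protocol, the timeout value and the function name moduleNameViaHelper are all assumptions made for illustration, and dropping capabilities in the child is not shown.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// helperTimeout bounds how long a single parse may take (hypothetical value).
var helperTimeout = 2 * time.Second

// moduleNameViaHelper feeds an untrusted module image to a helper process on
// stdin and returns the module name the helper prints on stdout. The helper
// command line (argv) and this protocol are assumptions of the sketch.
func moduleNameViaHelper(argv []string, image []byte) (string, error) {
	if len(argv) == 0 {
		return "", fmt.Errorf("empty helper command")
	}

	ctx, cancel := context.WithTimeout(context.Background(), helperTimeout)
	defer cancel()

	// CommandContext kills the helper once the deadline expires, so a parser
	// that hangs on crafted input cannot hold up the caller indefinitely.
	cmd := exec.CommandContext(ctx, argv[0], argv[1:]...)
	cmd.Stdin = bytes.NewReader(image)

	out, err := cmd.Output()
	if err != nil {
		// A crash (for example a panic in the parser) or a timeout ends here
		// and only fails this one request.
		return "", fmt.Errorf("module helper failed: %w", err)
	}

	name := strings.TrimSpace(string(out))
	if name == "" {
		return "", fmt.Errorf("module helper returned no name")
	}
	return name, nil
}

func main() {
	// "cat" stands in for the real helper here: it echoes stdin back.
	name, err := moduleNameViaHelper([]string{"cat"}, []byte("nft_log\n"))
	fmt.Println(name, err)
}
```

With this shape, a panic or hang caused by a crafted ELF file is confined to the child process and surfaces as an ordinary error for that one syscall.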
API changes
Introduce an API extension to signal that the LXD daemon supports the new instance config option linux.kernel_modules.load, which has two possible values. The default value is boot, which keeps the old behavior. If the value is ondemand, the existing linux.kernel_modules instance config option is treated as an allowlist of modules.
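The semantics of the two options can be sketched in Go as follows. This is an illustration under assumptions, not the LXD implementation: the function names are hypothetical, the instance config is modeled as a plain map, linux.kernel_modules is taken to be a comma-separated list, and treating '-' and '_' as equivalent (as modprobe does) is a choice made for this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeModuleName trims whitespace and maps '-' to '_', mirroring how
// modprobe treats the two characters as equivalent in module names.
func normalizeModuleName(name string) string {
	return strings.ReplaceAll(strings.TrimSpace(name), "-", "_")
}

// loadMode validates a linux.kernel_modules.load value. An unset value
// falls back to "boot", the default described above.
func loadMode(value string) (string, error) {
	switch value {
	case "", "boot":
		return "boot", nil
	case "ondemand":
		return "ondemand", nil
	default:
		return "", fmt.Errorf("invalid linux.kernel_modules.load value %q", value)
	}
}

// mayLoadOnDemand (hypothetical helper) reports whether the instance config
// permits loading the named module on demand.
func mayLoadOnDemand(config map[string]string, module string) (bool, error) {
	mode, err := loadMode(config["linux.kernel_modules.load"])
	if err != nil {
		return false, err
	}
	if mode != "ondemand" {
		return false, nil // "boot": nothing may be loaded from inside
	}

	want := normalizeModuleName(module)
	if want == "" {
		return false, nil
	}
	for _, entry := range strings.Split(config["linux.kernel_modules"], ",") {
		if normalizeModuleName(entry) == want {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	config := map[string]string{
		"linux.kernel_modules":      "overlay, nft_log",
		"linux.kernel_modules.load": "ondemand",
	}
	for _, m := range []string{"overlay", "nft-log", "vfat"} {
		ok, err := mayLoadOnDemand(config, m)
		fmt.Println(m, ok, err)
	}
}
```

An instance that leaves linux.kernel_modules.load unset keeps the current behavior: nothing can be loaded from inside the container.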
Types
TBD
Routes
No new routes will be introduced.
CLI changes
TBD
Database changes
No database changes.