Problem Description: We have a server that dumps data to a mounted CIFS share and then compresses it. After we upgraded the server from Ubuntu 22.04 to 24.04 we ran into memory issues: over time all memory is used up and the OOM killer is activated. The only way to free the RAM is a reboot.
How to Reproduce
I found a way to easily reproduce the problem:
mount any CIFS share under /mnt/tstshr
cd /mnt/tstshr
create a small file: fallocate -l 1k d1.data
keep zipping the file in an endless loop: while true ; do zstd --force d1.data ; done
watch all memory disappear
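Put together as a single script (a minimal sketch; it assumes the share is already mounted at /mnt/tstshr and that zstd is installed):

# reproduce the leak and watch available memory shrink
cd /mnt/tstshr || exit 1
fallocate -l 1k d1.data                # small test file
while true; do
    zstd --force d1.data               # re-compress the same file forever
    grep MemAvailable /proc/meminfo    # this figure drops steadily while the loop runs
done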
Relevant System Information: VMware VM, Ubuntu 24.04 LTS with the Ubuntu Pro livepatch kernel
# uname -a
Linux xerius 6.8.0-54-generic #56-Ubuntu SMP PREEMPT_DYNAMIC Sat Feb 8 00:37:57 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
directio is not a recognized mount.cifs option. I get mount error(22): Invalid argument when I add it.
cache=none resolves the memory leak, but slows the CIFS mount down (mongodump used to take 1 hour; with caching disabled it takes 3.5 hours). I cannot use this workaround.
The default cache mode is cache=strict, and that is the one leaking memory. I am going to try cache=loose and see.
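For reference, the cache mode is just a mount option; //server/share, the mount point and the credentials file below are placeholders:

# default behaviour, equivalent to specifying cache=strict explicitly (the leaking mode here)
mount -t cifs //server/share /mnt/tstshr -o credentials=/etc/cifs-creds,vers=3.0,cache=strict
# stops the leak, but with the performance penalty described above
mount -t cifs //server/share /mnt/tstshr -o credentials=/etc/cifs-creds,vers=3.0,cache=none
# the mode I am going to test next
mount -t cifs //server/share /mnt/tstshr -o credentials=/etc/cifs-creds,vers=3.0,cache=loose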
As far as reproducing the issue is concerned, I can report that we have also suffered from this memory leak, eventually invoking oom-killer and requiring a restart to release the memory. Tools such as free or top were not showing any alarming available/used memory figures; however, “available” memory calculated as free + buffers + caches was tracking the leak (shrinking steadily), and oom-killer was invoked once the leaked amount came close to the total RAM.
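One way to track that computed figure over time (a minimal sketch, assuming the default free -m column layout of total, used, free, shared, buff/cache, available):

# log “available” computed the old-fashioned way: free + buff/cache, in MiB
while true; do
    printf '%s ' "$(date '+%F %T')"
    free -m | awk '/^Mem:/ {print "free + buff/cache (MiB):", $4 + $6}'
    sleep 60
done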
We run on Azure, and the kernel 6.8.0-1021-azure must include some Azure customizations, so our suspicions were initially directed that way. @ostolc’s original post alerted us that the issue is probably not particular to the “azurized” kernel.
In our normal operation, the leak may take a week or longer to cause a crash, but a test program that constantly writes (and then deletes) random-content files in the fileshare leaked a couple of GB per hour.
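A minimal sketch of such a write/delete loop (the mount point, file name and size are placeholders, not the exact test program):

# continuously write and delete random-content files on the CIFS share
while true; do
    dd if=/dev/urandom of=/mnt/fileshare/leaktest.dat bs=1M count=64 status=none
    rm -f /mnt/fileshare/leaktest.dat
done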
Using this test program, we were able to pinpoint two configurations, one that led to the leak and one that did not:
                  leak                                          no leak
dist              Ubuntu 24.04.1 LTS                            Ubuntu 24.04.1 LTS
kernel            6.8.0-1021-azure                              6.8.0-1007-azure
arch              x86_64                                        x86_64
cifs version      2.47                                          2.47
cifs srcversion   73D51B6C2121A09B28D7AEF                       5278A8B7337E9F0C894C52C
cifs vermagic     6.8.0-1021-azure SMP mod_unload modversions   6.8.0-1007-azure SMP mod_unload modversions
The file system was mounted with the options nofail,vers=3.0,dir_mode=0777,file_mode=0777,serverino. We can also confirm that cache=none on the leaking system remedied the leak at the cost of performance.
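As an illustration, an /etc/fstab entry with those options would look roughly like this (server, share and mount point are placeholders; authentication options are omitted):

//server/share  /mnt/fileshare  cifs  nofail,vers=3.0,dir_mode=0777,file_mode=0777,serverino  0  0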
I hope this post contributes to finding and handling the issue.
Second update on using cache=loose: the “slower” memory leak is just an illusion; the OOM killer now activates at 45% used RAM instead of at 90% with cache=strict.
root@xerius:~# dmesg -T | grep "Out of memory"
[Mon Mar 10 06:28:35 2025] Out of memory: Killed process 2982889 (landscape-packa) total-vm:231340kB, anon-rss:9044kB, file-rss:13208kB, shmem-rss:0kB, UID:116 pgtables:500kB oom_score_adj:0
[Mon Mar 10 06:28:35 2025] Out of memory: Killed process 2966405 (wdavdaemon) total-vm:885004kB, anon-rss:572kB, file-rss:1664kB, shmem-rss:0kB, UID:997 pgtables:1436kB oom_score_adj:0
[Mon Mar 10 07:23:08 2025] Out of memory: Killed process 2983258 (wdavdaemon) total-vm:870028kB, anon-rss:340kB, file-rss:1792kB, shmem-rss:0kB, UID:997 pgtables:1400kB oom_score_adj:0
[Mon Mar 10 07:43:20 2025] Out of memory: Killed process 2996897 (wdavdaemon) total-vm:860448kB, anon-rss:1064kB, file-rss:4352kB, shmem-rss:0kB, UID:997 pgtables:1392kB oom_score_adj:0
[Mon Mar 10 07:44:35 2025] Out of memory: Killed process 3001195 (wdavdaemon) total-vm:824008kB, anon-rss:464kB, file-rss:1792kB, shmem-rss:0kB, UID:997 pgtables:1288kB oom_score_adj:0
[Mon Mar 10 08:20:45 2025] Out of memory: Killed process 3001555 (wdavdaemon) total-vm:852364kB, anon-rss:480kB, file-rss:2432kB, shmem-rss:0kB, UID:997 pgtables:1372kB oom_score_adj:0
[Mon Mar 10 08:24:32 2025] Out of memory: Killed process 3010467 (wdavdaemon) total-vm:835028kB, anon-rss:556kB, file-rss:2304kB, shmem-rss:0kB, UID:997 pgtables:1324kB oom_score_adj:0
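For comparison with the 45% / 90% figures above, a quick way to print the current used-RAM percentage (assuming the standard free column layout):

free -m | awk '/^Mem:/ {printf "used RAM: %.0f%%\n", 100 * $3 / $2}'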
I am going to look for a different workaround until this gets fixed…