Ubuntu 24.04 - File operations on top of a CIFS mount leak memory

ostolc · March 4, 2025, 2:00pm

Ubuntu Version: Ubuntu Pro 24.04 Server

Problem Description: We have a server that dumps data to a mounted CIFS share, then compresses it. After we upgraded the server from 22.04 to 24.04 we ran into memory issues. Over time the whole memory is used up and OOM killer is activated. The only way to free up RAM is reboot.

How to Reproduce

I found a way to easily reproduce the problem:

mount any CIFS share under /mnt/tstshr
cd /mnt/tstshr
create a small file: fallocate -l 1k d1.data
keep zipping the file in an endless loop: while true ; do zstd --force d1.data ; done
watch all memory disappear

Screenshot_2025-03-04_12-52-40

Relevant System Information: VMWare VM, Ubuntu 24.04 LTS with Ubuntu Pro livepatch kernel

# uname -a
Linux xerius 6.8.0-54-generic #56-Ubuntu SMP PREEMPT_DYNAMIC Sat Feb  8 00:37:57 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

CIFS share mounted like this

//10.10.10.10/tstshr /mnt/tstshr cifs noauto,credentials=/root/.smbcredentials,file_mode=0666,dir_mode=0777 0 0

Is this a known bug? Can anyone else reproduce it?

Thanks,
Oskar

ogra · March 4, 2025, 2:10pm

Have you considered mounting with something like directio or cache=none ?

ostolc · March 5, 2025, 3:38pm

directio is not a recognized mount.cifs option. I get mount error(22): Invalid argument when I add it.

cache=none resolves the memory leak, but slows the CIFS mount down (mongodump took 1 hour, with cache disabled it takes 3.5 hours). I cannot use this workaround.

The default cache mode is cache=strict, that’s the one leaking memory. I am going to try cache=loose and see.

cemyolcu · March 7, 2025, 10:51am

Hello!

As far as reproducing the issue is concerned, I can report that we have also suffered from this memory leak, eventually invoking oom-killer, and requiring a restart to release the memory. Tools such as free or top were not showing any alarming available/used memory figures, however “available” memory calculated as free + buffers + caches was tracking the leak, and oom-killer was invoked when this figure was close to the total RAM.

We run on Azure, and the kernel 6.8.0-1021-azure must have some Azure-customizations, and hence our suspicions were directed that way. @ostolc’s original post alerted to us that the issue is probably not particular to the “azurized” kernel.

In our normal operation, the leak may take a week or longer to cause a crash, but writing a test program that contstantly writes (and then deletes) random-content files in the fileshare was leaking a couple GB per hour.

Using this test program, we were able to pinpoint two specs, one that lead to the leak, one that didn’t:

	leak	no leak
dist	Ubuntu 24.04.1 LTS	Ubuntu 24.04.1 LTS
kernel	6.8.0-1021-azure	6.8.0-1007-azure
arch	x86_64	x86_64
cifs version	2.47	2.47
cifs srcversion	73D51B6C2121A09B28D7AEF	5278A8B7337E9F0C894C52C
cifs vermagic	6.8.0-1021-azure SMP mod_unload modversions	6.8.0-1007-azure SMP mod_unload modversions

The file system was mounted with the options nofail,vers=3.0,dir_mode=0777,file_mode=0777,serverino. We can also confirm that cache=none on the leaking system remedied the leak at the cost of performance.

I hope this post contributes to finding and handling the issue.

ostolc · March 10, 2025, 8:08am

I have an update on using cache=loose on production server, it also leaks memory, but “slower”. See this 3 weeks worth of RAM usage graphs

BTW, I keep rebooting the server every Monday to free up memory to prevent OOM activity…

ostolc · March 10, 2025, 8:37am

Second update on using cache=loose. “Slower” memory leak is just an illusion, OOM already activates at 45% used RAM, instead of 90% with cache=strict.

root@xerius :~# dmesg -T |grep "Out of memory"
[Mon Mar 10 06:28:35 2025] Out of memory: Killed process 2982889 (landscape-packa) total-vm:231340kB, anon-rss:9044kB, file-rss:13208kB, shmem-rss:0kB, UID:116 pgtables:500kB oom_score_adj:0
[Mon Mar 10 06:28:35 2025] Out of memory: Killed process 2966405 (wdavdaemon) total-vm:885004kB, anon-rss:572kB, file-rss:1664kB, shmem-rss:0kB, UID:997 pgtables:1436kB oom_score_adj:0
[Mon Mar 10 07:23:08 2025] Out of memory: Killed process 2983258 (wdavdaemon) total-vm:870028kB, anon-rss:340kB, file-rss:1792kB, shmem-rss:0kB, UID:997 pgtables:1400kB oom_score_adj:0
[Mon Mar 10 07:43:20 2025] Out of memory: Killed process 2996897 (wdavdaemon) total-vm:860448kB, anon-rss:1064kB, file-rss:4352kB, shmem-rss:0kB, UID:997 pgtables:1392kB oom_score_adj:0
[Mon Mar 10 07:44:35 2025] Out of memory: Killed process 3001195 (wdavdaemon) total-vm:824008kB, anon-rss:464kB, file-rss:1792kB, shmem-rss:0kB, UID:997 pgtables:1288kB oom_score_adj:0
[Mon Mar 10 08:20:45 2025] Out of memory: Killed process 3001555 (wdavdaemon) total-vm:852364kB, anon-rss:480kB, file-rss:2432kB, shmem-rss:0kB, UID:997 pgtables:1372kB oom_score_adj:0
[Mon Mar 10 08:24:32 2025] Out of memory: Killed process 3010467 (wdavdaemon) total-vm:835028kB, anon-rss:556kB, file-rss:2304kB, shmem-rss:0kB, UID:997 pgtables:1324kB oom_score_adj:0

I am going to look for a different workaround until this gets fixed…

ostolc · March 11, 2025, 12:45pm

@cemyolcu inspired me to test older kernels, here are my results:

6.8.0-31-generic - OK
6.8.0-32-generic - leaks
6.8.0-35-generic - leaks
6.8.0-38-generic - leaks
6.8.0-54-generic - leaks

As a workaround, I am going to downgrade to 6.8.0-31 until a fixed kernel is released

Linux xerius 6.8.0-31-generic #31-Ubuntu SMP PREEMPT_DYNAMIC Sat Apr 20 00:40:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

tscloud · March 12, 2025, 10:55am

We face the same issue on:

uname -a
Linux mon1-xxx 6.8.0-1021-azure #25-Ubuntu SMP Wed Jan 15 20:45:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

The MemAvailable metric does not show the issue.

We also reproduce the issue with the provided script:

fallocate -l 1k d1.data
while true ; do zstd --force d1.data ; done

You can watch this delta growing:

awk '/MemAvailable:|MemFree:|Buffers:|Cached:/ {a[$1]=$2} END {print a["MemAvailable:"] - (a["MemFree:"] + a["Buffers:"] + a["Cached:"])}' /proc/meminfo

Our workload is backup storage once a day.
It looks like:

zfs send -c data/mysql@2025-03-10  > /backup/mysql/db_backup_2025-03-10 .snap

But every day, we will have a big “leak”.

When “all memory is consumed”, computed as

MemTotal - (MemFree + Buffers + Cached)

We do not face OOM, but rather a 100% CPU usage from programs (mysql in this case).
The only solution found is reboot to free the ghost memory.

system · April 3, 2025, 2:01pm

This topic was automatically closed after 30 days. New replies are no longer allowed.