Problem Description:
While trying to debug an occasional kernel panic, I ran stress-ng to try to reproduce the issue. It did not reproduce for several hours. I then started running an LLM (and forgot to shut down stress-ng), and since the machine was "overloaded", it panicked...
The question: is this EXPECTED? Or should it just run slower or give some other error? Why panic when only user-mode jobs are running? Shouldn't the kernel act as it does with OOM, killing the large consumers (of CPU) as it does in the memory case?
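For context on the question: a NULL pointer dereference in kernel mode is a bug in the kernel or a driver rather than a reaction to load, and whether such an oops (or a watchdog-detected lockup) escalates to a full panic is governed by a few sysctls. A minimal sketch for checking them, assuming the standard /proc/sys/kernel entries (some exist only when the kernel was built with the matching detector); the function name show_panic_knobs is made up for illustration:

```shell
#!/bin/sh
# Print the sysctls that decide whether an oops or lockup becomes a panic.
# The optional argument (default /proc/sys/kernel) exists only so the
# function can be pointed at another directory, e.g. for testing.
show_panic_knobs() {
    dir="${1:-/proc/sys/kernel}"
    # panic_on_oops:    1 = panic as soon as an oops occurs
    # hardlockup_panic: 1 = panic when a hard lockup is detected
    # softlockup_panic: 1 = panic when a soft lockup is detected
    # panic:            seconds to wait before rebooting after a panic (0 = never)
    for knob in panic_on_oops hardlockup_panic softlockup_panic panic; do
        if [ -r "$dir/$knob" ]; then
            printf '%s = %s\n' "$knob" "$(cat "$dir/$knob")"
        else
            printf '%s = (not available)\n' "$knob"
        fi
    done
}

show_panic_knobs
```

Setups that enable kdump sometimes turn these on so that a dump gets captured, so the values on this machine may differ from the distribution defaults.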
Screenshots or Error Messages:
Sections of the kernel panic are:
[19741.362765] [ T580292] Oops: Oops: 0000 [#305841] PREEMPT SMP NOPTI
[19741.362767] [ T580292] BUG: kernel NULL pointer dereference, address: 0000000000000000
[19741.362768] [ T580292] #PF: supervisor read access in kernel mode
[19741.362769] [ T580292] #PF: error_code(0x0000) - not-present page
[19741.362769] [ T580292] PGD 0 P4D 0
[19747.214991] [ C4] watchdog: CPU4: Watchdog detected hard LOCKUP on cpu 4
[19756.155979] [ C6] watchdog: CPU6: Watchdog detected hard LOCKUP on cpu 6
[19756.296665] [ C9] watchdog: CPU9: Watchdog detected hard LOCKUP on cpu 9
[19756.678041] [ C10] watchdog: CPU10: Watchdog detected hard LOCKUP on cpu 10
[19757.901609] [ C7] watchdog: CPU7: Watchdog detected hard LOCKUP on cpu 7
[19758.208291] [ C0] watchdog: CPU0: Watchdog detected hard LOCKUP on cpu 0
[19759.876555] [ C5] watchdog: CPU5: Watchdog detected hard LOCKUP on cpu 5
[19777.176677] [ C2] watchdog: CPU2: Watchdog detected hard LOCKUP on cpu 2
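To get a quick overview of which CPUs the watchdog flagged, the log can be filtered. A small sketch, assuming the log has been saved to a text file (the file name in the usage comment is hypothetical, as is the function name lockup_cpus):

```shell
#!/bin/sh
# Read kernel log text on stdin and list the CPUs on which the watchdog
# reported a hard LOCKUP, one per line, in numeric order, without repeats.
lockup_cpus() {
    sed -n 's/.*Watchdog detected hard LOCKUP on cpu \([0-9][0-9]*\).*/\1/p' | sort -n -u
}

# Example usage against a saved log:
#   lockup_cpus < vmcore-dmesg.txt
```

For the excerpt above this would list eight CPUs (0, 2, 4, 5, 6, 7, 9 and 10). Note that the first lockup is reported about six seconds after the NULL pointer dereference, so the oops appears to come first.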
The real issue is that the system crashes when the LLM is running, after a few hours, sometimes sooner. No overheating is noticed... I managed to capture that dmesg only once so far: https://filebin.net/custom-bin/VmCore-dmesg1 and because crash is not able to read the kernel dumps, I cannot get more info from the actual crash...
Ran memtest for 8 passes in console mode, all came out fine.
Looking at that memory utilization, and given that you run a graphics-card-intensive task, did you consider giving the card more video RAM for its LLM processing? (AFAIK the Radeon 780M can handle up to 32GB of shared memory.)
I'd check the VRAM and GTT values and whether they are high enough for such a demanding task...
If I got it right, you are not using a standard kernel. I don't know if your LLM will run with the standard kernel or how performance would be affected, but did you try with the standard kernel?
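The VRAM and GTT values mentioned above can be read directly from sysfs, where the amdgpu driver reports them in bytes. A minimal sketch, assuming the GPU is card0 (it may be card1 on some systems); the function name show_gpu_mem is made up for illustration:

```shell
#!/bin/sh
# Print amdgpu VRAM and GTT usage in MiB for one DRM device directory.
# The default path assumes the GPU is card0; pass another directory if not.
show_gpu_mem() {
    dev="${1:-/sys/class/drm/card0/device}"
    for kind in vram gtt; do
        used_file="$dev/mem_info_${kind}_used"
        total_file="$dev/mem_info_${kind}_total"
        if [ -r "$used_file" ] && [ -r "$total_file" ]; then
            used=$(cat "$used_file")
            total=$(cat "$total_file")
            # The sysfs values are in bytes; convert to MiB.
            printf '%s: %d / %d MiB\n' "$kind" $((used / 1048576)) $((total / 1048576))
        else
            printf '%s: not reported under %s\n' "$kind" "$dev"
        fi
    done
}

show_gpu_mem
```

Running it once while idle and once while the model is loaded should show whether either pool is close to its limit.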
Sorry, I was not aware that VRAM needs to be set in the BIOS. I will try that. What about the amdgpu package, does that need to be installed?
I see the kernel module loaded OK:
# modinfo amdgpu
filename: /lib/modules/6.14.0-29-generic/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko.zst
license: GPL and additional rights
description: AMD GPU
author: AMD linux driver team
firmware: amdgpu/navi12_gpu_info.bin
which is part of the kernel, but do I probably need to install/pull this one too?
amdgpu/noble 1:6.4.60403-2194681.24.04 amd64
# apt install amdgpu
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
amdgpu-core amdgpu-dkms amdgpu-dkms-firmware amdgpu-lib amdgpu-multimedia autoconf automake
<snap>
0 upgraded, 29 newly installed, 0 to remove and 5 not upgraded.
Need to get 62.7 MB of archives.
After this operation, 772 MB of additional disk space will be used.
Do you want to continue? [Y/n]
Though even at full load, I never saw the VRAM more than 6% used, as shown above.
No, you do not need to apt install amdgpu for basic functionality, because the amdgpu driver ships as part of the Linux kernel and is usually installed by default on Ubuntu-based systems. You would only use the amdgpu-install command to install the ROCm software stack or a proprietary driver, not for the standard open-source graphics driver. To ensure you have the latest open-source drivers, simply keep your system updated with sudo apt update && sudo apt upgrade.
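One way to tell which amdgpu build a system would load is to look at the module path, as in the modinfo output earlier in this thread. A small sketch, assuming the usual Ubuntu layout where DKMS-built modules land under updates/dkms and kernel-shipped modules under kernel; the function name module_origin is made up for illustration:

```shell
#!/bin/sh
# Classify a kernel module path as in-tree or DKMS-built.
# Assumption: DKMS modules are installed under .../updates/dkms/, while
# modules shipped with the kernel package live under .../kernel/.
module_origin() {
    case "$1" in
        */updates/dkms/*) echo "dkms" ;;
        */kernel/*)       echo "in-tree" ;;
        *)                echo "unknown" ;;
    esac
}

# Example usage on a live system:
#   module_origin "$(modinfo -F filename amdgpu)"
```

For the path shown earlier (/lib/modules/6.14.0-29-generic/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko.zst) this reports in-tree.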
That makes sense. I rebooted and set the UMA Frame Buffer to the max of 16GB; on top it shows the same amount used:
Friends,
Running this shows it is using the CPU only:
# ollama ps
NAME ID SIZE PROCESSOR CONTEXT UNTIL
llama3.1:70b 711a9e8463af 44 GB 100% CPU 4096 2 minutes from now...
I think the 780M GPU is not fully supported by either ollama or ROCm.
According to the "official docs", installing ROCm requires 21GB (!!!) of space! Wow!! That's OVERKILL!
found this ollama issue:
which has some pointers. I did not run apt install rocm (for 21GB); instead I pulled amdgpu-install v7 from that URL (not from the Ubuntu repos, which have v6 only).
I then ran that script (amdgpu-install), which pulled about 1GB of stuff, set the ollama override env variable, and restarted the service. Now it uses both GPU and CPU:
# ollama ps
NAME ID SIZE PROCESSOR CONTEXT UNTIL
llama3.1:70b 711a9e8463af 45 GB 63%/37% CPU/GPU 4096 3 minutes from now
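When checking this repeatedly over a long test run, it can help to pull just the CPU/GPU split out of the output. A rough sketch, assuming the column layout shown in this thread (SIZE and PROCESSOR each print as two tokens); the function name processor_split is made up for illustration:

```shell
#!/bin/sh
# Read `ollama ps` output on stdin and print "<model> <split> <devices>"
# for each loaded model. Assumes the layout seen above:
#   NAME ID SIZE(two tokens, e.g. "45 GB") PROCESSOR(two tokens) ...
processor_split() {
    # Skip the header line; fields 5 and 6 hold the PROCESSOR column.
    awk 'NR > 1 && NF >= 6 { print $1, $5, $6 }'
}

# Example usage on a live system:
#   ollama ps | processor_split
```

If a different ollama version prints the columns differently, the field numbers would need adjusting.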
Responses are MUCH MUCH MUCH faster!! OK, I will give it a few hours/days of testing and report back...
Please note that we cannot support third-party software or drivers here. Specifically, for the proprietary amdgpu driver and ROCm you should be able to get support at:
It had been crashing (kernel panic, freezing) for ~2 weeks, and now for the first time it has been running for a full day!!
12:14:29 up 23:25, 3 users, load average: 0.08, 0.09, 0.09
It took me a while to come here and ask for help, set up kdump, etc. I am surprised that a USER MODE application can take down a full OS/computer... Even if it runs into a memory leak or a "CPU leak", shouldn't the kernel OOM-kill it, or otherwise protect the OS, rather than panic?? These are the crashes:
They do not reproduce now, due to the load being split with the GPU, I bet... I couldn't fully read them since the "crash" utility is failing, as noted in the other post from yesterday...
It sounds like an important issue to fix, to be able to analyze kernel dumps using crash...