Expected to get BUG: kernel NULL pointer dereference, address: 0000000000000000 on overloaded system?

Ubuntu Version:
24.04 LTS,
Linux 6.14.0-29-generic #29~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Aug 14 16:52:50 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

CPU: 8-core AMD Ryzen 7 8845HS w/ Radeon 780M Graphics (-MT MCP-) speed/min/max: 1464/400/5102 MHz
Kernel: 6.14.0-29-generic x86_64 Up: 18m Mem: 2.4/88.92 GiB (2.7%)
Storage: 953.87 GiB (13.0% used) Procs: 343 Shell: Bash inxi: 3.3.34

Problem Description:
While trying to debug an occasional kernel panic, I ran stress-ng to try to reproduce the issue. It did not reproduce for several hours. I then started running an LLM (and forgot to shut down stress-ng), and since the machine was “overloaded”, it panicked…

The question: is this EXPECTED? Or should the system just run slower or give some other error? Why panic when only user-mode jobs are running? Shouldn’t the kernel act as it does with OOM, killing large consumers (of CPU) the way it does in the memory case?
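A NULL pointer dereference in kernel mode is a fault inside kernel code, not a resource limit being hit, so the OOM killer does not come into play. Whether the resulting oops or a watchdog lockup escalates to a full panic depends on a few sysctl settings. The sketch below prints them. The function name and its optional directory argument are made up for illustration, and it is only an assumption that kdump on this machine enabled `panic_on_oops` so that a dump gets captured.

```shell
# Print the sysctl knobs that decide whether an oops or a lockup becomes a panic.
# The optional argument (added here only for testability) overrides the directory.
show_panic_knobs() {
    dir="${1:-/proc/sys/kernel}"
    for knob in panic_on_oops hardlockup_panic softlockup_panic panic; do
        if [ -r "$dir/$knob" ]; then
            printf '%s=%s\n' "$knob" "$(cat "$dir/$knob")"
        else
            # Some knobs only exist when the matching detector is built in.
            printf '%s=unavailable\n' "$knob"
        fi
    done
}

show_panic_knobs
```

A value of 1 for `panic_on_oops` means any oops is turned into a panic; `panic` is the number of seconds the kernel waits before rebooting after one (0 means it stays halted).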

Screenshots or Error Messages:

Sections of the kernel panic are:

[19741.362765] [ T580292] Oops: Oops: 0000 [#305841] PREEMPT SMP NOPTI
[19741.362767] [ T580292] BUG: kernel NULL pointer dereference, address: 0000000000000000
[19741.362768] [ T580292] #PF: supervisor read access in kernel mode
[19741.362769] [ T580292] #PF: error_code(0x0000) - not-present page
[19741.362769] [ T580292] PGD 0 P4D 0
[19747.214991] [      C4] watchdog: CPU4: Watchdog detected hard LOCKUP on cpu 4
[19756.155979] [      C6] watchdog: CPU6: Watchdog detected hard LOCKUP on cpu 6
[19756.296665] [      C9] watchdog: CPU9: Watchdog detected hard LOCKUP on cpu 9
[19756.678041] [     C10] watchdog: CPU10: Watchdog detected hard LOCKUP on cpu 10
[19757.901609] [      C7] watchdog: CPU7: Watchdog detected hard LOCKUP on cpu 7
[19758.208291] [      C0] watchdog: CPU0: Watchdog detected hard LOCKUP on cpu 0
[19759.876555] [      C5] watchdog: CPU5: Watchdog detected hard LOCKUP on cpu 5
[19777.176677] [      C2] watchdog: CPU2: Watchdog detected hard LOCKUP on cpu 2
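The number after `#` in the Oops line is the kernel's running oops counter, so `[#305841]` suggests this excerpt is far from the first oops since boot; the earliest one in the full log is usually the most informative. A minimal sketch for summarizing a saved dmesg file follows. The function names are made up for illustration, and the sample lines are copied from the excerpt above.

```shell
# Highest oops counter found in a saved dmesg file (the N in "[#N]").
max_oops_count() {
    grep -o '\[#[0-9][0-9]*' "$1" | sed 's/[^0-9]//g' | sort -n | tail -n 1
}

# Sorted, de-duplicated list of CPUs that reported a hard lockup.
locked_cpus() {
    grep -o 'hard LOCKUP on cpu [0-9][0-9]*' "$1" | awk '{print $NF}' | sort -nu | paste -sd ' ' -
}

# Demo on a few lines from the log above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[19741.362765] [ T580292] Oops: Oops: 0000 [#305841] PREEMPT SMP NOPTI
[19747.214991] [      C4] watchdog: CPU4: Watchdog detected hard LOCKUP on cpu 4
[19756.155979] [      C6] watchdog: CPU6: Watchdog detected hard LOCKUP on cpu 6
EOF
echo "highest oops counter: $(max_oops_count "$sample")"
echo "CPUs with hard lockups: $(locked_cpus "$sample")"
rm -f "$sample"
```

Pointing both functions at the full dmesg file instead of the sample gives the same summary for the whole crash.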

full dmesg crash output here:

https://filebin.net/custom-bin/dmesg.202509171504

The real issue is that the system crashes when the LLM is running, after a few hours and sometimes sooner; no overheating is noticed… I managed to capture that dmesg only once so far: https://filebin.net/custom-bin/VmCore-dmesg1 Because crash is not able to read the kernel dumps, I cannot get more info from the actual crash…

Ran memtest for 8 passes in console mode, all came out fine.

Any tips welcome.

Looking at that memory utilization, and given that you run a graphics-intensive task, did you consider giving the card more video RAM for its LLM processing? (AFAIK the Radeon 780M can handle up to 32GB of shared memory.)

I’d check the VRAM and GTT values and whether they are high enough for such a demanding task…

https://www.kernel.org/doc/html/v4.20/gpu/amdgpu.html

(there is also the radeontop utility that should reveal some info about the graphics card and memory usage)
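Besides radeontop, the amdgpu kernel driver exposes the same VRAM and GTT counters in sysfs, in bytes. A minimal sketch for reading them, assuming the card is driven by amdgpu (cards without these files are skipped); the `to_mib` helper is made up for illustration.

```shell
# Convert a byte count to whole MiB.
to_mib() {
    echo $(( $1 / 1048576 ))
}

# Print VRAM and GTT usage for every DRM card that exposes the amdgpu counters.
for dev in /sys/class/drm/card[0-9]*/device; do
    [ -r "$dev/mem_info_vram_total" ] || continue
    for pool in vram gtt; do
        used=$(cat "$dev/mem_info_${pool}_used")
        total=$(cat "$dev/mem_info_${pool}_total")
        echo "$dev $pool: $(to_mib "$used") / $(to_mib "$total") MiB"
    done
done
```

Running it before and during an LLM session shows whether the model is actually landing in GPU memory.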

If I got it right, you are not using a standard kernel. I don’t know if your LLM will run with a standard kernel or how performance will be affected, but did you try with a standard kernel?

The kernel looks like a legit hwe kernel to me …

https://launchpad.net/ubuntu/+source/linux-meta-hwe-6.14

Sorry, my fault. I was thinking that the parameters were not standard. Don’t know how I came up with that…

Thanks for the feedback. Here’s a snapshot while the system is running the LLM… memory at the OS level is used:

# inxi
CPU: 8-core AMD Ryzen 7 8845HS w/ Radeon 780M Graphics (-MT MCP-) speed/min/max: 3382/400/5102 MHz
Kernel: 6.14.0-29-generic x86_64 Up: 1h 28m Mem: 43.84/88.92 GiB (49.3%)
Storage: 953.87 GiB (13.1% used) Procs: 346 Shell: Bash inxi: 3.3.34
# free -m
               total        used        free      shared  buff/cache   available
Mem:           91053       44844        3776           3       43312       46208
Swap:           8191           0        8191

radeontop shows an error on start, and the VRAM numbers are really low:

# radeontop
Unknown Radeon card. <= R500 won't work, new cards might.
Collecting data, please wait....

                               Graphics pipe   0.00% ?
?????????????????????????????????????????????????????????????????????????????????????????????????????????
                                Event Engine   0.00% ?
                                                     ?
                 Vertex Grouper + Tesselator   0.00% ?
                                                     ?
                           Texture Addresser   0.00% ?
                               Texture Cache   0.00% ?
                                                     ?
                               Shader Export   0.00% ?
                 Sequencer Instruction Cache   0.00% ?
                         Shader Interpolator   0.00% ?
                      Shader Memory Exchange   0.00% ?
                                                     ?
                              Scan Converter   0.00% ?
                          Primitive Assembly   0.00% ?
                                                     ?
                                 Depth Block   0.00% ?
                                 Color Block   0.00% ?
                              Clip Rectangle   0.00% ?
                                                     ?
                           198M / 2995M VRAM   6.60% ?   
                            32M / 45513M GTT   0.07% ?
                  2.80G / 2.80G Memory Clock 100.00% ?                                                  
                  0.80G / 2.70G Shader Clock  29.63% ?               

Sorry, I was not aware that VRAM needs to be set in the BIOS. Will try that. What about the amdgpu package, does it need to be installed?

I see the kernel module loaded OK:

# modinfo amdgpu
filename:       /lib/modules/6.14.0-29-generic/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko.zst
license:        GPL and additional rights
description:    AMD GPU
author:         AMD linux driver team
firmware:       amdgpu/navi12_gpu_info.bin

which is part of the kernel, but do I probably need to install/pull this one too?

amdgpu/noble 1:6.4.60403-2194681.24.04 amd64
# apt install amdgpu
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  amdgpu-core amdgpu-dkms amdgpu-dkms-firmware amdgpu-lib amdgpu-multimedia autoconf automake
<snap>
0 upgraded, 29 newly installed, 0 to remove and 5 not upgraded.
Need to get 62.7 MB of archives.
After this operation, 772 MB of additional disk space will be used.
Do you want to continue? [Y/n] 

Though even at full load, I never saw the VRAM more than about 6% used, as shown above.

Will report back…

AI says:

No, you do not need to apt install amdgpu for basic functionality because the amdgpu driver is built into the Linux kernel and is usually installed by default on Ubuntu-based systems. You would only use the amdgpu-install command to install the ROCm software stack or to install a proprietary driver, not for the standard open-source graphics driver. To ensure you have the latest open-source drivers, simply keep your system updated with sudo apt update && sudo apt upgrade

which makes sense. Rebooted and set the UMA Frame Buffer to the max, 16GB; radeontop shows the same amount used:

                                         ?
              187M / 16298M VRAM   1.15% ?
                34M / 38961M GTT   0.09% ?
      2.80G / 2.80G Memory Clock 100.00% ?                                      
      0.80G / 2.70G Shader Clock  29.63% ?           

Let’s see if that helps with the crashes; will know in a few hours… any other ideas welcome.

It doesn’t seem like

ollama      6531    2459 99 17:44 ?        00:47:49 /usr/local/bin/ollama runner

is using the GPU… at least VRAM is not going up during usage… will research that on the side, still curious about the kernel panic…

Friends,
Running this, shows it is using CPU only:

# ollama ps
NAME            ID              SIZE     PROCESSOR    CONTEXT    UNTIL       
llama3.1:70b    711a9e8463af    44 GB    100% CPU     4096       2 minutes from now...    
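To track the CPU/GPU split over time without reading the table by eye, the PROCESSOR column can be pulled out with awk. This is a rough sketch that assumes the column layout shown above (NAME, ID, a two-token SIZE, then a two-token PROCESSOR); the function name is made up for illustration.

```shell
# Read `ollama ps` output on stdin and print "<model> <processor split>" per row.
# Assumes: $1 = NAME, $2 = ID, $3 $4 = SIZE (e.g. "44 GB"), $5 $6 = PROCESSOR.
processor_split() {
    awk 'NR > 1 && NF >= 6 { print $1, $5, $6 }'
}

# Only call ollama if it is installed.
if command -v ollama >/dev/null 2>&1; then
    ollama ps | processor_split
fi
```

A row reading `100% CPU` means no layers were offloaded to the GPU.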

I think the 780M GPU is not fully supported by either ollama or ROCm.

From the “official docs”, installing ROCm requires 21GB of space!!! Wow!! That’s OVERKILL!

found this ollama issue:

which has some pointers. I did not run apt install rocm (for 21GB); instead I pulled amdgpu-install v7 from that URL (not the Ubuntu repos, which have v6 only).
Then I ran that script (amdgpu-install), which pulled about 1GB of stuff, set the ollama override env variable, and restarted the service. Now it uses GPU and CPU:

# ollama ps
NAME            ID              SIZE     PROCESSOR          CONTEXT    UNTIL              
llama3.1:70b    711a9e8463af    45 GB    63%/37% CPU/GPU    4096       3 minutes from now    

Responses are MUCH MUCH MUCH faster!! OK, will give it a few hours/days of testing and report back…
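For anyone repeating this: the post does not name the override variable. Ollama's GPU documentation describes `HSA_OVERRIDE_GFX_VERSION` for AMD GPUs that ROCm does not officially list, so the sketch below assumes that is the one meant. The value `11.0.0` is also an assumption and is not taken from this thread. The helper only prints a systemd drop-in; it does not install anything.

```shell
# Print a systemd drop-in that sets one environment variable for a service.
# Arguments: variable name, value.
make_env_dropin() {
    printf '[Service]\nEnvironment="%s=%s"\n' "$1" "$2"
}

# Variable name and value here are assumptions, see the note above.
make_env_dropin HSA_OVERRIDE_GFX_VERSION 11.0.0

# To apply (as root), save the output as
#   /etc/systemd/system/ollama.service.d/override.conf
# and then run:
#   systemctl daemon-reload && systemctl restart ollama
```

After the restart, `ollama ps` should show whether part of the model is now placed on the GPU.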

Thanks for the support.

Congratulations !

Please note that we cannot support third-party software or drivers here. Specifically for the proprietary amdgpu driver and ROCm, you should be able to get support at:

https://www.amd.com/en/developer/browse-by-resource-type/support.html

Quick update… yeah, very very happy!!!

It had been crashing (kernel panic, freezing) for ~2 weeks, and now it has been running for a full day for the first time!!

 12:14:29 up 23:25,  3 users,  load average: 0.08, 0.09, 0.09

Took a while to come here and ask for help, set up kdump, etc. Surprised how a USER MODE application can take down a full OS/computer… even when running into a memory leak or “cpu leak”, shouldn’t it OOM kill, or protect the OS, and not kernel panic?? These are the crashes:

-rw-r--r-- 1 root whoopsie 306200163 Sep 16 18:21 202509161821/dump.202509161821
-rw-r--r-- 1 root whoopsie 392246883 Sep 17 15:04 202509171504/dump.202509171504
-rw-r--r-- 1 root whoopsie 369187122 Sep 17 15:42 202509171542/dump.202509171542

They do not reproduce now, due to the load being split to the GPU, I bet… I couldn’t fully read them since the ‘crash’ utility is failing, as noted in another post from yesterday…

Sounds like an important issue to fix, to be able to analyze kernel dumps using crash…
