First of all, I am using Debian Bookworm with the default Gnome DE. I’m not sure if that is alright on an Ubuntu forum but I’ve been a member here for many, many years and have always had wonderful help. I figured, since Ubuntu is based on Debian I could get help here as well.
Background. I live in an old house that has some power issues. In one room, the power goes out at random and I have to reset the breaker to get it back. That is the room my computer is in, and unfortunately it is the only option I have for a computer room. The computer has been there for a few years and I can’t count how many unsafe shutdowns it has been through. I apparently also have “dirty power,” which I discovered recently after a friend recommended I get a UPS (battery backup). The UPS has greatly helped regulate the power: the computer no longer shuts off unexpectedly, and when the power does go out I have been able to shut it down safely. I finally ran an extension cord so the UPS plugs into an outlet in another room, where the power is much more stable, so there shouldn’t be any more issues. Hopefully. But despite doing that, I’m still getting these errors. I’ve bought three hard drives in the last year. My desktop is a custom build, roughly ten years old, with these specs:
debian@-----:~$ inxi -Fxz
System:
Kernel: 6.1.0-30-amd64 arch: x86_64 bits: 64 compiler: gcc v: 12.2.0
Desktop: GNOME v: 43.9 Distro: Debian GNU/Linux 12 (bookworm)
Machine:
Type: Desktop Mobo: ASUSTeK model: MAXIMUS VIII HERO v: Rev 1.xx
serial: <superuser required> UEFI: American Megatrends v: 1902
date: 06/24/2016
CPU:
Info: quad core model: Intel Core i7-6700K bits: 64 type: MT MCP
arch: Skylake-S rev: 3 cache: L1: 256 KiB L2: 1024 KiB L3: 8 MiB
Speed (MHz): avg: 800 min/max: 800/4200 cores: 1: 800 2: 800 3: 800 4: 800
5: 800 6: 800 7: 800 8: 800 bogomips: 63999
Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Graphics:
Device-1: Intel HD Graphics 530 vendor: ASUSTeK driver: i915 v: kernel
arch: Gen-9 bus-ID: 00:02.0
Display: wayland server: X.Org v: 1.22.1.9 with: Xwayland v: 22.1.9
compositor: gnome-shell driver: dri: iris gpu: i915
resolution: 1920x1080~60Hz
API: OpenGL v: 4.6 Mesa 22.3.6 renderer: Mesa Intel HD Graphics 530 (SKL
GT2) direct-render: Yes
Audio:
Device-1: Intel 100 Series/C230 Series Family HD Audio vendor: ASUSTeK
driver: snd_hda_intel v: kernel bus-ID: 00:1f.3
API: ALSA v: k6.1.0-30-amd64 status: kernel-api
Server-1: PipeWire v: 0.3.65 status: active
Network:
Device-1: Intel Ethernet I219-V vendor: ASUSTeK driver: e1000e v: kernel
port: N/A bus-ID: 00:1f.6
IF: enp0s31f6 state: up speed: 1000 Mbps duplex: full mac: <filter>
Drives:
Local Storage: total: 3.64 TiB used: 570.96 GiB (15.3%)
ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 990 PRO with Heatsink 4TB
size: 3.64 TiB temp: 28.9 C
Partition:
ID-1: / size: 27.33 GiB used: 20.73 GiB (75.8%) fs: ext4 dev: /dev/dm-1
mapped: hoovs--vg-root
ID-2: /boot size: 455.1 MiB used: 151.5 MiB (33.3%) fs: ext2
dev: /dev/nvme0n1p2
ID-3: /boot/efi size: 511 MiB used: 5.8 MiB (1.1%) fs: vfat
dev: /dev/nvme0n1p1
ID-4: /home size: 3.55 TiB used: 550.08 GiB (15.1%) fs: ext4
dev: /dev/dm-3 mapped: hoovs--vg-home
Swap:
ID-1: swap-1 type: partition size: 976 MiB used: 0 KiB (0.0%) dev: /dev/dm-2
mapped: hoovs--vg-swap_1
Sensors:
System Temperatures: cpu: 22.0 C mobo: N/A
Fan Speeds (RPM): N/A
Info:
Processes: 277 Uptime: 1d 12h 43m Memory: 62.67 GiB used: 3.67 GiB (5.9%)
Init: systemd target: graphical (5) Compilers: gcc: 12.2.0 Packages: 1864
Shell: Bash v: 5.2.15 inxi: 3.3.26
The Issue. Over the last year I’ve continually gotten odd behavior from my system that eventually results in a hard drive failure. The first symptom to appear is that I click on icons in the GUI to open a program or the settings and nothing happens. I reboot and all is fine. For a while. Later on I get I/O errors on a black screen. I reboot and all is fine for a while. Then I get the following error, which I was able to take a picture of before rebooting.
Systemd-journald[20176]: Failed to rotate /var/log/journal/[random letters/numbers]/user-1000.journal: Read-only file system.
The last time this happened I rebooted my computer and the system simply went to the BIOS. I looked for the hard drive there and it was not listed. I bought a new hard drive and installed it. After about six months the same issue came back. I’m now on my third drive, a Samsung 990 Pro with Heatsink SSD, and the same issues are happening again, despite no power outages. So I am unsure of the cause. Perhaps it’s the motherboard? I was hoping to get some help before this happens again.
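In case it helps anyone looking into this, below is a rough sketch of how I could pull storage-related kernel messages out of the log from the boot where the error appeared. The function name `storage_errors` and the keyword list are my own guesses, not anything official. It also assumes the journal is persistent (the /var/log/journal path in the error above suggests it is) and that the messages were actually written before the file system went read-only, which may not be the case.

```shell
# Filter log text on stdin for lines that look storage-related.
# The keyword list is an assumption; adjust it as needed.
storage_errors() {
    grep -iE 'nvme|i/o error|ext4-fs error|read-only'
}

# Typical use, reading the previous boot's kernel messages:
#   sudo journalctl -k -b -1 --no-pager | storage_errors
```

If nothing shows up for the previous boot, changing `-b -1` to `-b -2` and so on looks at older boots.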
I’ve used the smartctl command to check the drive itself for errors and it reports none. But this is also my first SSD and I am unfamiliar with all of their ins and outs. Here is that output.
sudo smartctl -i -a /dev/nvme0n1p3
[sudo] password for debian:
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-30-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 990 PRO with Heatsink 4TB
Serial Number: S7DSNJ0X501917B
Firmware Version: 4B2QJXD7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 4,000,787,030,016 [4.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 2.0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 4,000,787,030,016 [4.00 TB]
Namespace 1 Utilization: 708,399,677,440 [708 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 4541402bb7
Local Time is: Sun Jan 26 09:14:38 2025 MST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055): Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x2f): S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg *Other*
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 82 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 9.39W - - 0 0 0 0 0 0
1 + 9.39W - - 1 1 1 1 0 0
2 + 9.39W - - 2 2 2 2 0 0
3 - 0.0400W - - 3 3 3 3 4200 2700
4 - 0.0050W - - 4 4 4 4 500 21800
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 29 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 2,396,866 [1.22 TB]
Data Units Written: 12,371,742 [6.33 TB]
Host Read Commands: 18,084,801
Host Write Commands: 480,395,870
Controller Busy Time: 774
Power Cycles: 28
Power On Hours: 1,613
Unsafe Shutdowns: 20
Media and Data Integrity Errors: 0
Error Information Log Entries: 17
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 29 Celsius
Temperature Sensor 2: 32 Celsius
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
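To make the counters in that output easier to compare over time, here is a small sketch that pulls a single value out of saved smartctl output. The helper name `smart_field` and the file name `smart.txt` are just placeholders I made up; it only assumes the "Name: value" layout shown above.

```shell
# Print the value of one "Name: value" line from smartctl output on stdin.
# Commas used as thousands separators are stripped so the number can be compared.
smart_field() {
    awk -F': *' -v key="$1" '$1 == key { gsub(/,/, "", $2); print $2; exit }'
}

# Typical use (smart.txt is a placeholder file name):
#   sudo smartctl -a /dev/nvme0n1 > smart.txt
#   smart_field "Unsafe Shutdowns" < smart.txt
#   smart_field "Power Cycles" < smart.txt
```

On the output above this gives 20 unsafe shutdowns out of 28 power cycles, with 17 error information log entries even though the log section itself says no errors are logged.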
Thank you for any help you can provide.