I continue to have hard drive crashes after several months of using my computer

First of all, I am using Debian Bookworm with the default GNOME desktop. I’m not sure if that is alright on an Ubuntu forum, but I’ve been a member here for many, many years and have always had wonderful help. I figured that since Ubuntu is based on Debian I could get help here as well.

Background. I live in an old house that has some power issues. In one room of the house the power goes out randomly and I need to flip the breaker to turn it back on. That is the room my computer is currently in; unfortunately it is the only option I have for a computer room. The computer has been in this room for a few years and I can’t count how many unsafe shutdowns it has been through. I apparently also have “dirty power,” which I recently discovered after a friend recommended I get a battery backup (UPS). The UPS has greatly helped regulate the power: the computer no longer shuts off unexpectedly, and when the power goes out I’ve been able to shut it down safely. I finally decided to use an extension cord to plug the UPS into an outlet in another room, where the power is much more stable, so there shouldn’t be any more issues. Hopefully. But despite doing that I’ve still been getting these errors. I’ve bought three hard drives in the last year. My desktop is a custom build and is roughly ten years old. It has these specs:

debian@-----:~$ inxi -Fxz
System:
  Kernel: 6.1.0-30-amd64 arch: x86_64 bits: 64 compiler: gcc v: 12.2.0
    Desktop: GNOME v: 43.9 Distro: Debian GNU/Linux 12 (bookworm)
Machine:
  Type: Desktop Mobo: ASUSTeK model: MAXIMUS VIII HERO v: Rev 1.xx
    serial: <superuser required> UEFI: American Megatrends v: 1902
    date: 06/24/2016
CPU:
  Info: quad core model: Intel Core i7-6700K bits: 64 type: MT MCP
    arch: Skylake-S rev: 3 cache: L1: 256 KiB L2: 1024 KiB L3: 8 MiB
  Speed (MHz): avg: 800 min/max: 800/4200 cores: 1: 800 2: 800 3: 800 4: 800
    5: 800 6: 800 7: 800 8: 800 bogomips: 63999
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Graphics:
  Device-1: Intel HD Graphics 530 vendor: ASUSTeK driver: i915 v: kernel
    arch: Gen-9 bus-ID: 00:02.0
  Display: wayland server: X.Org v: 1.22.1.9 with: Xwayland v: 22.1.9
    compositor: gnome-shell driver: dri: iris gpu: i915
    resolution: 1920x1080~60Hz
  API: OpenGL v: 4.6 Mesa 22.3.6 renderer: Mesa Intel HD Graphics 530 (SKL
    GT2) direct-render: Yes
Audio:
  Device-1: Intel 100 Series/C230 Series Family HD Audio vendor: ASUSTeK
    driver: snd_hda_intel v: kernel bus-ID: 00:1f.3
  API: ALSA v: k6.1.0-30-amd64 status: kernel-api
  Server-1: PipeWire v: 0.3.65 status: active
Network:
  Device-1: Intel Ethernet I219-V vendor: ASUSTeK driver: e1000e v: kernel
    port: N/A bus-ID: 00:1f.6
  IF: enp0s31f6 state: up speed: 1000 Mbps duplex: full mac: <filter>
Drives:
  Local Storage: total: 3.64 TiB used: 570.96 GiB (15.3%)
  ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 990 PRO with Heatsink 4TB
    size: 3.64 TiB temp: 28.9 C
Partition:
  ID-1: / size: 27.33 GiB used: 20.73 GiB (75.8%) fs: ext4 dev: /dev/dm-1
    mapped: hoovs--vg-root
  ID-2: /boot size: 455.1 MiB used: 151.5 MiB (33.3%) fs: ext2
    dev: /dev/nvme0n1p2
  ID-3: /boot/efi size: 511 MiB used: 5.8 MiB (1.1%) fs: vfat
    dev: /dev/nvme0n1p1
  ID-4: /home size: 3.55 TiB used: 550.08 GiB (15.1%) fs: ext4
    dev: /dev/dm-3 mapped: hoovs--vg-home
Swap:
  ID-1: swap-1 type: partition size: 976 MiB used: 0 KiB (0.0%) dev: /dev/dm-2
    mapped: hoovs--vg-swap_1
Sensors:
  System Temperatures: cpu: 22.0 C mobo: N/A
  Fan Speeds (RPM): N/A
Info:
  Processes: 277 Uptime: 1d 12h 43m Memory: 62.67 GiB used: 3.67 GiB (5.9%)
  Init: systemd target: graphical (5) Compilers: gcc: 12.2.0 Packages: 1864
  Shell: Bash v: 5.2.15 inxi: 3.3.26

The Issue. Over the last year I’ve continually gotten odd behavior from my system that eventually results in a hard drive failure. The first symptom to appear is that I click on icons in the GUI to open a program or open settings and nothing happens. I reboot and all is fine, for a while. Later on I get I/O errors on a black screen. I reboot and again all is fine for a while. Then I get the following error, which I was able to photograph before rebooting:

Systemd-journald[20176]: Failed to rotate /var/log/journal/[random letters/numbers]/user-1000.journal: Read-only file system.

The last time this happened I rebooted my computer and the system simply went to the BIOS. I looked for the hard drive there and could not find it. I bought a new hard drive and installed it. After about six months the same issue came back. I’m now on my third drive, a Samsung 990 Pro with Heatsink SSD, and the same issues are happening again despite no power outages. So I am unsure of the cause. Perhaps it’s the motherboard? I was hoping to get some help before this happens again.

I’ve used the smartctl command to check for errors on the drive itself and it says there are none. But this is also my first SSD and I am unfamiliar with all of their ins and outs. Here is that output.

 sudo smartctl -i -a /dev/nvme0n1p3
[sudo] password for debian: 
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-30-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 990 PRO with Heatsink 4TB
Serial Number:                      S7DSNJ0X501917B
Firmware Version:                   4B2QJXD7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 4,000,787,030,016 [4.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       2.0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4,000,787,030,016 [4.00 TB]
Namespace 1 Utilization:            708,399,677,440 [708 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 4541402bb7
Local Time is:                      Sun Jan 26 09:14:38 2025 MST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055):     Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x2f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg *Other*
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.39W       -        -    0  0  0  0        0       0
 1 +     9.39W       -        -    1  1  1  1        0       0
 2 +     9.39W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3     4200    2700
 4 -   0.0050W       -        -    4  4  4  4      500   21800

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        29 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    2,396,866 [1.22 TB]
Data Units Written:                 12,371,742 [6.33 TB]
Host Read Commands:                 18,084,801
Host Write Commands:                480,395,870
Controller Busy Time:               774
Power Cycles:                       28
Power On Hours:                     1,613
Unsafe Shutdowns:                   20
Media and Data Integrity Errors:    0
Error Information Log Entries:      17
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               29 Celsius
Temperature Sensor 2:               32 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Thank you for any help you can provide.

Replacing the circuit breaker and checking the wiring would be the first things to do. This seems like a serious electrical problem, which can lead to fires. I would also run an fsck on the root filesystem to find any problems. I don’t see much wrong in the smartctl output.
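
For anyone reading the smartctl output above, here is a small sketch that pulls out the counters most relevant to power problems. The helper name nvme_health_summary is made up for illustration (it is not part of smartmontools), and it assumes the "Name: value" layout shown in the posted output.

```shell
#!/bin/sh
# Summarize selected NVMe health counters from `smartctl -a` text on stdin.
# Hypothetical helper; assumes the "Name:   value" layout seen in the post.
nvme_health_summary() {
    awk -F: '
        /^(Power Cycles|Unsafe Shutdowns|Media and Data Integrity Errors|Error Information Log Entries)/ {
            value = $2
            gsub(/[ ,]/, "", value)   # drop padding and thousands separators
            printf "%s=%s\n", $1, value
        }'
}

# Typical use (commented out because it needs root and the real device):
# sudo smartctl -a /dev/nvme0n1 | nvme_health_summary
```

Fed the output posted above, this would report 28 power cycles, 20 unsafe shutdowns, 0 media errors and 17 error log entries, so most power cycles on this drive ended uncleanly.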

Kind of old firmware… Can you update it?

The kind of power loss you describe is a very good way to ruin computer equipment, so I would agree with the suggestion above to repair the wiring in the house if you can.

Thanks for everyone’s replies. Even after moving the computer to a different power source I’ve still been getting errors and odd behavior. One of the first symptoms is that I click an icon to open something and nothing happens. This happened just last night. I was still able to open a terminal, however, and when I tried to launch the Flatpak program with its run command I got the following message:

error: open(O_TMPFILE): Read-only file system

I also tried updating the system by running sudo update and got this error:

bash: /usr/bin/sudo: Input/output error

I ran fsck on my system more than once and it found no errors.
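
One way to confirm which filesystem has flipped to read-only while the problem is happening is to look at /proc/mounts. The sketch below is a hypothetical helper (ro_mounts is a made-up name) that lists mounts whose options include a bare ro flag. It only reads text and changes nothing.

```shell
#!/bin/sh
# List mounts whose option field contains the exact flag "ro".
# Input is expected in /proc/mounts format: device mountpoint fstype options ...
# Hypothetical helper for illustration only.
ro_mounts() {
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++) {
            if (opts[i] == "ro") {      # exact match, so "errors=remount-ro" is ignored
                print $1, $2, $3
                break
            }
        }
    }'
}

# Typical use:
# ro_mounts < /proc/mounts
```

Some pseudo-filesystems are normally mounted read-only, so the lines of interest are the ext4 ones for / and /home.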

I was looking into buying a new computer… maybe it’s time. I found one from System76 I can afford and I’ll go from there. I was just hoping I could fix what’s wrong so I could at least extend the life of this computer a little longer. I built it myself and I cherish it.

Are there any particular system logs I could look into to pinpoint the issue? Thanks.
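
On the question of logs, a minimal sketch, assuming a systemd journal as on a default Debian 12 install: the kernel messages are usually where a dropped NVMe controller or a forced read-only remount would show up. The helper name storage_errors and its search patterns are guesses at typical message wording, not an exhaustive list.

```shell
#!/bin/sh
# Keep only kernel log lines that look storage-related.
# The patterns are assumptions about common message wording.
storage_errors() {
    grep -E -i 'nvme|i/o error|ext4-fs (error|warning)|read-only'
}

# Typical use (commented out because it needs root and a real journal):
# sudo journalctl -k -b -1 --no-pager | storage_errors   # previous boot
# sudo journalctl -k -b 0 --no-pager | storage_errors    # current boot
```

If the root filesystem had already gone read-only, the last messages may never have been written to disk, so capturing the output of sudo dmesg while the problem is happening may be worth trying as well.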

Hopefully I didn’t overlook a drive…
But I noticed an NVMe drive. There is a trick to have those drives recover cells.
Obtain an NVMe to 2.5" SATA adapter, place the NVMe drive in it, and attach only the power connection to the adapter. Power the system on and leave it sitting at the BIOS screen for 1-2 hours with no data transmission, and the cells will rebuild. This works for SSDs as well.

It won’t help with corrupted data, but it will usually recover the drive to a state where data can be overwritten.