I’ve been having issues with my homelab ever since I set it up a few months ago. For some reason the server becomes unresponsive, as if it were offline. However, when I access its CLI, it continuously spews out this message.

I’ve tried entering commands directly into the CLI, but it shows an ‘input/output error’ instead. I cannot even get it to shut down through the CLI, so I have to manually pull the plug.

Here’s another screenshot of the logs in the CLI from a few moments after the error occurred.

The issue does not even get fixed by switching it off and on. Sometimes the homelab gets stuck indefinitely on the startup loading screen, fails to detect the system partition during the GRUB stage, crashes with a Linux kernel panic, or refuses to boot altogether. It is only mitigated when I leave the homelab switched off for 5 minutes or so.

The weird thing about it is that there is no way to predict when this error will come up. The server will work completely unhindered for a few weeks straight on some occasions, then break down just a few minutes after startup on others. It doesn’t depend on what type of services I am hosting, all of which are lightweight in nature.

Additionally, once it does start working again, there seems to be no record of the encountered error in the logs, apart from the unsafe shutdown count. This makes it difficult to debug or even document the matter, coupled with the fact that its occurrence is random in nature. I’ve tried running several diagnostic tools, including smartctl, but I am unable to deduce anything useful from them.
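
For reference, this is roughly the filter I can run over the SMART output — a minimal sketch, assuming the drive shows up as /dev/nvme0 (the field names are the standard NVMe health attributes smartctl prints):

```shell
# Pull out the NVMe SMART fields most relevant to a failing/dropping drive.
# /dev/nvme0 is an assumption -- check `ls /dev/nvme*` for the real name.
smart_health() {
  grep -E 'Critical Warning|Media and Data Integrity Errors|Unsafe Shutdowns|Percentage Used|Error Information Log Entries'
}

# Typical use (needs root):
#   sudo smartctl -a /dev/nvme0 | smart_health
```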

Some specs and info about the homelab are as follows:

  • Build: Pre-built Compact Mini PC
  • CPU: Intel i7-14700
  • RAM: 16GB
  • Storage: 1TB SSD
  • GPU: Integrated Intel UHD Graphics 770
  • Operating System: Ubuntu 24.04 LTS

I would really appreciate it if you could point out the cause of this issue. This experience makes the server unreliable, which is why I don’t feel comfortable hosting anything valuable or sensitive on it yet.

I can provide you additional details or logs if required.

      • frongt@lemmy.zip
        2 months ago

        It’s not really “beforehand”. It’s failing now.

        Yeah, you can predict failure with stuff like increasing bad blocks or the wearout indicator, but by the time you see behavior like this, the drive has already failed — it’s just not completely dead yet. This is your last chance to back up your important data.

  • Dave.@aussie.zone
    2 months ago

    It looks like your drive is going offline randomly, or at least, when it warms up a little. All the IO errors look like various subsystems trying to write to something that’s not there anymore, which is why there’s nothing visible in the logs when you look later.

    Could be the drive, could be the drive controller on the motherboard, could be that your nvme drive just needs to be taken out of its slot and reseated, or could be something weird in your BIOS setup that’s causing mayhem (bus timings, etc).

    Personally I’d reseat your drive in its slot first and go from there.
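
    If nothing shows up in the logs afterwards, the previous boot’s kernel messages may still be in the journal — Ubuntu keeps a persistent journal by default (if `journalctl -b -1` complains, `sudo mkdir -p /var/log/journal` and restart journald to enable it). A rough sketch; the grep pattern is just a starting point, not an exhaustive list:

    ```shell
    # Match the kernel messages you'd expect when an NVMe drive drops off the bus.
    nvme_io_errors() {
      grep -iE 'nvme|blk_update_request|i/o error'
    }

    # After the next lockup and reboot, read the PREVIOUS boot's kernel log:
    #   journalctl -k -b -1 --no-pager | nvme_io_errors
    ```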

  • empireOfLove2@lemmy.dbzer0.com
    2 months ago

    My default go-to with any stability issue is to first force a new drive self-test, then read the results back from the self-test log:

        smartctl -t long /dev/nvme0
        smartctl -l selftest /dev/nvme0

    And then I would also run a complete extended memory test (memtest86) to ensure bad ram isn’t doing something dumb like corrupting the part of the kernel that handles disk IO. The number of times I’ve had unsolvable issues that traced to an unstable stick of memory is… Surprisingly high.

    If the memtest passes, try fsck’ing nvme0. If there are corrupted blocks, yeah, it’s possible the SSD is dying but the controller isn’t reporting it.
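
    One caveat: don’t fsck a mounted filesystem — do the real check from a live USB. A sketch of the flags, rehearsed on a throwaway image file so nothing touches the real drive (the partition name nvme0n1p2 below is a guess; confirm yours with `lsblk`):

    ```shell
    # Rehearse on a scratch image standing in for a real partition.
    truncate -s 16M demo.img          # 16 MB sparse file
    mkfs.ext4 -q -F demo.img          # -F: it's a regular file, not a block device
    fsck.ext4 -f -n demo.img          # -f: force full check, -n: read-only, no repairs

    # The real thing, from a live USB with the filesystem UNMOUNTED
    # (nvme0n1p2 is a guess -- check lsblk first):
    #   sudo fsck.ext4 -f /dev/nvme0n1p2
    rm demo.img
    ```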

    • Andres@social.ridetrans.it
      2 months ago

      @empireOfLove2 @tintedundercollar12 Yeah, that absolutely looks like a hardware issue. Memtest is a good idea, but also reseat the nvme and keep an eye out for overheating (e.g., ssh’ing in and keeping the following running in a terminal:

          while (sleep 5); do sudo smartctl -a /dev/nvme0 | grep 'Temperature:'; done

      ). Components on the drive could be failing early when temperatures get high, but not high enough to trigger warning thresholds.