I’ve been having issues with my homelab ever since I set it up a few months ago. For some reason the server becomes unresponsive, as if it were offline. However, when I access its CLI, it continuously spews out this message.

I’ve tried entering commands directly into the CLI, but it shows an ‘input/output error’ instead. I cannot even get it to shut down through the CLI, so I have to manually pull the plug.

Here’s another screenshot of the logs in the CLI from a few moments after the error occurred.

The issue does not even get fixed by switching it off and on. Sometimes the homelab gets stuck indefinitely on the startup loading screen, fails to detect the system partition during the GRUB stage, crashes with a Linux kernel panic, or refuses to boot altogether. It is only mitigated when I leave the homelab switched off for 5 minutes or so.

The weird thing about it is that there is no way to predict when this error will come up. The server will work completely unhindered for a few weeks straight on some occasions, then break down just a few minutes after startup on others. It doesn’t depend on what type of services I am hosting, all of which are lightweight in nature.

Additionally, once it does start working again, there seems to be no record of the encountered error in the logs, apart from the unsafe shutdown count. This makes it difficult to debug or even document the matter, coupled with the fact that its occurrence is random in nature. I’ve tried running several diagnostic tools, including smartctl, but I am unable to deduce anything useful from them.
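
For reference, this is roughly the filter I can run over the SMART output — a minimal sketch, assuming the drive shows up as /dev/nvme0 (the field names are the standard NVMe health attributes smartctl prints):

```shell
# Pull out the NVMe SMART fields most relevant to a failing/dropping drive.
# /dev/nvme0 is an assumption -- check `ls /dev/nvme*` for the real name.
smart_health() {
  grep -E 'Critical Warning|Media and Data Integrity Errors|Unsafe Shutdowns|Percentage Used|Error Information Log Entries'
}

# Typical use (needs root):
#   sudo smartctl -a /dev/nvme0 | smart_health
```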

Some specs and info about the homelab are as follows:

  • Build: Pre-built Compact Mini PC
  • CPU: Intel i7-14700
  • RAM: 16GB
  • Storage: 1TB SSD
  • GPU: Integrated Intel UHD Graphics 770
  • Operating System: Ubuntu 24.04 LTS

I would really appreciate it if you could point out the cause of this issue. This experience makes the server unreliable, which is why I don’t feel comfortable hosting anything valuable or sensitive on it yet.

I can provide you additional details or logs if required.

      • frongt@lemmy.zip
        2 months ago

        It’s not really “beforehand”. It’s failing now.

        Yeah, you can predict failure with stuff like increasing bad blocks or the wearout indicator, but by the time you see behavior like this, the drive has already failed — it’s just not completely dead yet. This is your last chance to back up your important data.

  • Dave.@aussie.zone
    2 months ago

    It looks like your drive is going offline randomly, or at least, when it warms up a little. All the IO errors look like various subsystems trying to write to something that’s not there anymore, which is why there’s nothing visible in the logs when you look later.

    Could be the drive, could be the drive controller on the motherboard, could be that your nvme drive just needs to be taken out of its slot and reseated, or could be something weird in your BIOS setup that’s causing mayhem (bus timings, etc).

    Personally I’d reseat your drive in its slot first and go from there.
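
    If nothing shows up in the logs afterwards, the previous boot’s kernel messages may still be in the journal — Ubuntu keeps a persistent journal by default (if `journalctl -b -1` complains, `sudo mkdir -p /var/log/journal` and restart journald to enable it). A rough sketch; the grep pattern is just a starting point, not an exhaustive list:

    ```shell
    # Match the kernel messages you'd expect when an NVMe drive drops off the bus.
    nvme_io_errors() {
      grep -iE 'nvme|blk_update_request|i/o error'
    }

    # After the next lockup and reboot, read the PREVIOUS boot's kernel log:
    #   journalctl -k -b -1 --no-pager | nvme_io_errors
    ```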

  • empireOfLove2@lemmy.dbzer0.com
    2 months ago

    My default go-to with any stability issue is to first force a new drive self-test, then read the results back from the self-test log:

        smartctl -t long /dev/nvme0
        smartctl -l selftest /dev/nvme0

    And then I would also run a complete extended memory test (memtest86) to ensure bad ram isn’t doing something dumb like corrupting the part of the kernel that handles disk IO. The number of times I’ve had unsolvable issues that traced to an unstable stick of memory is… Surprisingly high.

    If the memtest passes, try fsck’ing nvme0. If there are corrupted blocks, yeah, it’s possible the SSD is dying but the controller isn’t reporting it.
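
    One caveat: don’t fsck a mounted filesystem — do the real check from a live USB. A sketch of the flags, rehearsed on a throwaway image file so nothing touches the real drive (the partition name nvme0n1p2 below is a guess; confirm yours with `lsblk`):

    ```shell
    # Rehearse on a scratch image standing in for a real partition.
    truncate -s 16M demo.img          # 16 MB sparse file
    mkfs.ext4 -q -F demo.img          # -F: it's a regular file, not a block device
    fsck.ext4 -f -n demo.img          # -f: force full check, -n: read-only, no repairs

    # The real thing, from a live USB with the filesystem UNMOUNTED
    # (nvme0n1p2 is a guess -- check lsblk first):
    #   sudo fsck.ext4 -f /dev/nvme0n1p2
    rm demo.img
    ```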

    • Andres@social.ridetrans.it
      2 months ago

      @empireOfLove2 @tintedundercollar12 Yeah, that absolutely looks like a hardware issue. Memtest is a good idea, but also reseat the nvme and keep an eye out for overheating (e.g., ssh’ing in and keeping the following running in a terminal:

          while (sleep 5); do sudo smartctl -a /dev/nvme0 | grep 'Temperature:'; done

      ). Components on the drive could be failing early when temperatures get high, but not high enough to trigger warning thresholds.