As of Linux 6.7 I’m getting hard freezes that require a power cut to reset (SysRq doesn’t work). They happen both at idle and under load, anywhere from 5 minutes to an hour in. Running journalctl --follow and dmesg -w (both as root) reveals nothing at the time of the crash. Kernel version 6.6 continues to be 100% stable.

System:

  • Distro/Kernel: Arch Linux 6.7.arch3-1
  • CPU: AMD Ryzen 5 2600X
  • GPU: AMD RX580 8GB via AMDGPU
  • RAM: Some configuration of 16GB at 2667 MT/s.
  • WM: SwayWM

I’m unsure how to go about properly reporting a bug if no errors are being generated.
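
One thing that might help gather evidence: a hard freeze can lose whatever was still sitting in buffers, so a follower that forces every kernel log line to disk as it arrives may preserve the last messages before the lockup. The sketch below is only an illustration, not a guaranteed fix: `persist_lines`, `follow_kernel_log`, and the `kernel-follow.log` file name are made up for this example, and it assumes `dmesg -w` (the same command as above) is available and run as root. If the kernel never logs anything before freezing, this will not capture anything either.

```python
import os
import subprocess


def persist_lines(lines, path):
    """Append each line to `path` and force it to disk before reading the next.

    Returns the number of lines written.
    """
    count = 0
    with open(path, "a", encoding="utf-8") as out:
        for line in lines:
            out.write(line)
            out.flush()             # push Python's buffer to the OS
            os.fsync(out.fileno())  # ask the OS to commit the data to storage
            count += 1
    return count


def follow_kernel_log(path="kernel-follow.log"):
    """Follow `dmesg -w` and persist every line (hypothetical helper; run as root)."""
    proc = subprocess.Popen(["dmesg", "-w"], stdout=subprocess.PIPE, text=True)
    try:
        return persist_lines(proc.stdout, path)
    finally:
        proc.terminate()
```

Calling `follow_kernel_log()` from a root shell before reproducing the freeze leaves a file to inspect after the power cycle. Syncing after every line is slow, but the kernel log is low volume, so the cost should be acceptable for a debugging session.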

Any advice?

I’m not alone on this apparently (warning, it’s reddit.)

  • Rockslide0482@discuss.tchncs.de
    10 months ago

    TLDR: do memtest on your RAM

    For quite some time I had an issue where my computer would occasionally just hard crash. When it first started happening I tried many of the common tests, including a memory check, but found nothing. For a while it wasn’t super common, so I just lived with it. I thought it was an OS thing, but it occurred on a different Linux distro and even on the ancient Windows 10 install I have but rarely use. I was just about to pull the trigger on replacing the mobo and maybe even the CPU and RAM. Before I did that, I followed someone’s suggestion to do a memtest. I could have sworn that I had already done that and it came back clean, but it was an easy enough test to run, so why not.

    Sure enough, it found an error. I isolated the faulty DIMM, pulled it out, and I haven’t had a crash since. Crazy, since I’m all but certain I ran both memtest from a Linux live ISO and the Windows memory checking utility.

    In short, test your RAM. Do multiple passes. Maybe even just try swapping out single DIMMs and running like that for a reasonable amount of time to see if you can isolate a culprit. It was my first thought when the issue first occurred, because it’s usually what causes stuff like that. When the tests came up clean originally, I assumed it had to be something else. I was wrong.
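
To make "multiple passes" concrete, here is a rough idea of what a pattern test does: write a known byte pattern across a buffer, read it back, and report any offset that differs. This is only a user-space sketch with made-up names (`pattern_pass`, `run_passes`), and it is not a substitute for booting a dedicated tester such as memtest86+: the OS decides which physical pages back the buffer, most of RAM is never touched, and small buffers may be served from CPU cache.

```python
def pattern_pass(buf, pattern):
    """Fill `buf` with one byte value, then return offsets that read back wrong."""
    expected = bytes([pattern]) * len(buf)
    buf[:] = expected                  # write the pattern over the whole buffer
    if buf == expected:                # fast path: everything read back intact
        return []
    return [i for i, b in enumerate(buf) if b != pattern]


def run_passes(size_bytes, passes=3, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Run several write/verify passes; return (pass, pattern, offset) mismatches."""
    buf = bytearray(size_bytes)
    errors = []
    for n in range(passes):
        for pattern in patterns:
            for offset in pattern_pass(buf, pattern):
                errors.append((n, pattern, offset))
    return errors
```

An empty list from something like `run_passes(256 * 1024 * 1024)` only means this small slice of memory behaved during this run. As the story above shows, a clean result from one tool on one day does not rule out a bad DIMM.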

    • 0x0@social.rocketsfall.netOP
      10 months ago

      This is what I’ll try next. I do think memory is the problem now that I’ve had a few more hours of research. Kernel 6.7 has issues with elevated RAM usage, so it’s absolutely doing something funky with memory that might be exposing underlying hardware issues. I also realized my stable kernel was 6.6.10, a few versions behind 6.6.13, so I’m running 6.6.13 now to see if the issue was introduced late in the 6.6 release cycle, which would be easier to bisect than 6.7.
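
For anyone unfamiliar with bisecting, the idea is a binary search between a known-good and a known-bad version: test the midpoint, then halve the range based on the result. The sketch below shows that search over a list of point releases. The function name `first_bad` and the outcome used in the usage note are invented for illustration, and the search assumes the first entry is good, the last is bad, and the problem never disappears again once introduced. `git bisect` applies the same idea to individual commits.

```python
def first_bad(versions, is_bad):
    """Return the earliest version for which `is_bad(version)` is True.

    Assumes versions[0] is known good and versions[-1] is known bad.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid               # regression is at mid or earlier
        else:
            lo = mid               # regression is after mid
    return versions[hi]
```

With `["6.6.10", "6.6.11", "6.6.12", "6.6.13"]` and a hypothetical `is_bad` that reports freezes on 6.6.12 and later, the search tests 6.6.11 and 6.6.12 and returns `"6.6.12"`. In practice `is_bad` means booting that kernel and waiting, and with freezes that can take up to an hour to appear, a "good" verdict needs a generous soak time.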