I've definitely had a bad RAM module. Starting the test produced thousands of errors within the first 30 seconds.
Since the manufacturer offered a lifetime warranty, I RMA'd the bad memory module (actually two of them, even though only one was broken, because they were both part of a kit) and the manufacturer sent a new set of modules.
I went ahead and tested both new modules individually, one at a time, for at least 12 hours each (12 passes in that case) with default settings, including running in parallel mode.
Then I reassembled the machine with all four slots filled with 16 GB modules and attempted another memtest86 run of at least 12 hours for good measure, but it never completed. Instead, the machine just froze after about 3.5 hours. I don't have logs for that, but it shouldn't really matter, more on that later.
Rebooted, restarted the test, freeze after about 30 minutes. Rinse, repeat. Freeze after about 10 minutes.
At that point I grew wary and started playing around with the CPU selection modes. System firmware often seems to be buggy when it comes to the parallel modes, so I started a single-CPU test with all four memory modules, which completed after 4 passes and more than 20 hours without errors (or freezes).
I noticed that the "sequential" mode leads to freezes very quickly, while "round-robin" takes longer (typically a few tests to trigger a freeze) and the "parallel" mode takes even longer, although eventually it also freezes.
I went ahead and started removing modules. I tested (in sequential mode, because that is the quickest one to show a freeze) with 3 modules, then 2 modules, swapped the two modules around, then went down to 1 module and moved it through each and every slot to make sure that it's not just a defective slot that is causing issues. Each of these tests ended in a system freeze after just a few seconds.
I'll attach log files for sequential, round-robin and parallel runs with 4 modules, all of which froze at the end.
I'm unsure whether this is a firmware or HW issue, though. From the log files (for rr and seq at least), it looks like the machine always froze when switching from CPU 6 to CPU 0. This sounds like a firmware issue, but I'm not sure.
Of course, I'm not just testing this stuff because I love testing memory and have too much free time on my hands. Rather, I've seen similar system freezes under Linux recently (I don't use Windows), but mostly shrugged it off as nVidia-driver-related issues. However, I'm able to trigger system freezes under Linux using a CPU-only test via stress-ng as well.
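For anyone who wants to check under Linux whether the freeze follows a switch between cores, here is a minimal sketch. It is not what stress-ng does and not what I ran above. It pins a busy loop to one core at a time and writes the core number to a log before each switch, so the last line in the log shows which switch preceded a freeze. The names `burn`, `walk_cpus` and `cpu_walk.log` are made up for illustration, and the affinity calls are only available on Linux.

```python
import os
import time


def burn(seconds: float) -> int:
    """Spin on integer arithmetic for roughly `seconds`; return the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    x = 1
    while time.monotonic() < deadline:
        x = (x * 1103515245 + 12345) % 2147483648
        iterations += 1
    return iterations


def walk_cpus(seconds_per_cpu: float = 5.0, log_path: str = "cpu_walk.log") -> list:
    """Load one core at a time, logging the core number before each switch."""
    has_affinity = hasattr(os, "sched_setaffinity")  # only present on Linux
    cpus = sorted(os.sched_getaffinity(0)) if has_affinity else [0]
    original = set(cpus)
    visited = []
    try:
        with open(log_path, "a") as log:
            for cpu in cpus:
                log.write(f"{time.strftime('%H:%M:%S')} switching to CPU {cpu}\n")
                log.flush()
                os.fsync(log.fileno())  # push the line to disk before loading the core
                if has_affinity:
                    os.sched_setaffinity(0, {cpu})  # pin this process to one core
                burn(seconds_per_cpu)
                visited.append(cpu)
    finally:
        if has_affinity:
            os.sched_setaffinity(0, original)  # restore the original CPU mask
    return visited


if __name__ == "__main__":
    # Loop over all cores repeatedly; stop with Ctrl+C.
    while True:
        walk_cpus()
```

The `fsync` before each switch is the important part: after a hard freeze, anything still sitting in a write buffer is lost, and the last logged core would be misleading.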
Another odd point is that testing the modules individually in parallel mode hasn't shown any errors or freezes in more than 12 hours of testing, which makes a hardware issue that developed or worsened during testing even more likely and a firmware issue less likely.
I'm currently leaning towards broken hardware - either CPU or mainboard. I figure you, unlike me, have seen a lot of different systems, though, and maybe also encountered such freezes before, so I'd appreciate your input.