Freezes - Firmware or HW issue?

  • Freezes - Firmware or HW issue?

    I've definitely had a bad RAM module. Starting the test produced thousands of errors within the first 30 seconds.

    Since the manufacturer offered a lifetime warranty option, I RMA'd the bad memory module (actually two of them, even though only one was broken, because they were both part of a kit) and the manufacturer sent a new set of modules.

    I went ahead and tested both modules individually, one at a time, for at least 12 hours each (12 passes in that case) with default settings, including running in parallel mode.

    Then I reassembled the device into a working state with all four slots filled with 16 GB modules and attempted to give it another memtest86 run for at least 12 hours, just for good measure, but the run never completed. Instead, the machine froze after about 3.5 hours. I don't have logs for that run, but that shouldn't really matter; more on that later.

    Rebooted, restarted the test, freeze after about 30 minutes. Rinse, repeat. Freeze after about 10 minutes.

    At that point I grew wary and started playing around with the CPU selection modes. System firmware often seems to be buggy when it comes to the multi-CPU modes, so I started a single-CPU test with all four memory modules, which finished after 4 passes and more than 20 hours without errors (or freezes).

    I noticed that the "sequential" mode leads to freezes very quickly, while "round-robin" takes longer (typically a few tests to trigger a freeze) and the "parallel" mode takes even longer, although eventually it also freezes.

    I went ahead and started removing modules. Tested (in sequential mode, because that is the quickest one to show a freeze) with 3 modules, 2 modules, swapped the two modules around, then down to 1 module and swapped this module around in each and every slot to make sure that it's not just a defective slot that is causing issues. Each of these tests quickly culminated in a system freeze after just a few seconds.

    I'll attach log files for sequential, round-robin and parallel runs with 4 modules, all of which froze at the end.

    I'm unsure whether this is a firmware or HW issue, though. From the log files (for rr and seq at least), it looks like the machine always froze when switching from CPU 6 to CPU 0. This sounds like a firmware issue, but I'm not sure.


    Of course, I'm not just testing this stuff because I love testing memory and have too much free time on my hands. Rather, I've seen similar system freezes under Linux recently (I don't use Windows), but mostly shrugged it off as nVidia-driver-related issues. However, I'm able to trigger system freezes under Linux using a CPU-only test via stress-ng as well.
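To check under Linux whether a freeze lines up with a switch to one particular core, something like the following could be tried. It is only a minimal sketch, not what stress-ng or memtest86 do internally: it assumes a Linux system (`os.sched_setaffinity` is Linux-only), the function names and the log file name are made up for illustration, and a Python spin loop is a much lighter load than a real stress test. It only mirrors the "one CPU after the other" pattern of the sequential mode and syncs each switch to disk before it happens, so the last log line survives a hard freeze.

```python
import os
import time


def burn(seconds: float) -> int:
    """Busy-loop for roughly `seconds`; returns the number of iterations done."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def sequential_core_walk(seconds_per_core: float, log_path: str, passes: int = 1) -> list:
    """Pin this process to each allowed core in turn and keep that core busy.

    Every switch is written and fsync'ed *before* it is made, so after a hard
    freeze the last line of the log names the core that was being entered.
    Returns the list of cores visited, in order.
    """
    original_mask = os.sched_getaffinity(0)  # Linux-only API
    cores = sorted(original_mask)
    visited = []
    try:
        with open(log_path, "a") as log:
            for p in range(passes):
                for core in cores:
                    log.write(f"pass {p}: switching to CPU {core}\n")
                    log.flush()
                    os.fsync(log.fileno())
                    os.sched_setaffinity(0, {core})
                    burn(seconds_per_core)
                    visited.append(core)
    finally:
        # Restore the affinity mask the process started with.
        os.sched_setaffinity(0, original_mask)
    return visited
```

Calling, for example, `sequential_core_walk(5.0, "core_walk.log", passes=3)` and reading `core_walk.log` after a freeze would show which core was being entered last. If that is always the wrap-around to CPU 0, it would at least match what the memtest86 logs suggest.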

    Another odd point is that testing the modules individually in parallel mode hasn't shown any errors or freezes in more than 12 hours of testing, which makes a hardware issue that developed or worsened during testing even more likely and a firmware issue less likely.


    I'm currently leaning towards broken hardware - either CPU or mainboard. I figure you, unlike me, have seen a lot of different systems, though, and maybe also encountered such freezes before, so I'd appreciate your input.
    Attached Files

  • #2
    Very likely it is this UEFI firmware problem:
    List of Motherboards with issues when running MemTest86 in multi-CPU selection modes

    The log lists this as the firmware: 07/27/2017
    Is there a newer version available?

    The log also lists this as an "XMG P7xxDM2" computer, which we haven't come across before. What is the marketing name for the machine?



    • #3
      Originally posted by David (PassMark)
      Maaaybe. I haven't seen any CPUs past Intel's 6th generation listed there, though. This may not mean much, since the same board was also used with 6th generation CPUs in the past.

      I still lean towards a hardware issue, though, because I've never seen freezes in parallel mode before. I've run a total of at least 6 * 12 hours of testing in parallel mode and it never froze. Maybe these runs have just been lucky, but I somehow doubt that. It's pretty likely that a freeze would have shown up during that time.
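As a rough sanity check on the "just lucky" explanation, here is a back-of-the-envelope sketch. It assumes freezes arrive at a constant rate (an exponential model, which is my simplification, not something the logs prove) and takes 3.5 hours, the longest time-to-freeze I reported above, as a generous mean time to freeze. The function name is made up for illustration.

```python
import math


def survival_probability(hours_tested: float, mean_hours_to_freeze: float) -> float:
    """P(no freeze within hours_tested) under a constant-rate (exponential) model."""
    return math.exp(-hours_tested / mean_hours_to_freeze)


# Freeze-free parallel-mode testing reported above: at least 6 runs of 12 hours.
hours_tested = 6 * 12

# Assumption: 3.5 h is used as the mean time to freeze; the later freezes
# (30 minutes, 10 minutes) would make the result even smaller.
p_lucky = survival_probability(hours_tested, 3.5)
print(f"P(no freeze in {hours_tested} h) = {p_lucky:.2e}")
```

Under these assumptions the chance of 72 freeze-free hours is on the order of one in a billion, so if the machine had been freezing at today's rate back then, the earlier runs would almost certainly have shown it.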

      I've been using this machine almost 24/7 for the past two years, often fully loaded (both CPU- and GPU-wise) for 6 hours or more at a time. The system was stable the whole time, apart from the odd GPU driver issue every now and then. It only started to freeze half a year ago and it got way worse in the past few days. Smells like (thermal) aging to me?

      Then again, since the sequential and round-robin tests freeze so reproducibly when switching from CPU 6 to 0, maybe there is both a firmware bug and HW aging at play. Unfortunately I never tested round-robin or sequential modes in the past, mostly because there was just no need to.


      Originally posted by David (PassMark)
      The log lists this as the firmware: 07/27/2017
      Is there a newer version available?
      Sadly, yes. Otherwise I would have already updated. It's very unlikely that the mainboard manufacturer will ever release another firmware version. Maybe they would if they were contacted by PassMark directly, since I guess they have an interest in making memtest86 work well. A lot of OEMs and resellers test with memtest86 after all.

      Originally posted by David (PassMark)
      Log also lists this as a "XMG P7xxDM2" computer. Which we haven't come across before. What is the marketing name for the machine?
      This is a Schenker XMG U507 laptop (with a desktop CPU); "P7xxDM2" is the name of the Clevo barebone/mainboard this system is based on. Clevo is an ODM that designs barebone laptops, which are sold to OEMs/resellers that further customize the models and sell them under their own brand.


      I'd hold off on adding it to the blacklist. I'll eventually have to turn in the machine and have it diagnosed and repaired by the seller/OEM, but this likely won't happen within the next 3 weeks. If I can reproduce the issue with the repaired laptop, I'll get back to you and have the mainboard blacklisted as buggy. I could also ask my seller/OEM to do a sequential memtest86 run on a known good machine and see if it freezes almost immediately for them. That should rule out hardware issues: if it freezes there as well, we'll know that this specific behavior is a firmware bug.



      • #4
        Originally posted by Station
        Sadly, yes. Otherwise I would have already updated. [...]
        I naturally meant "Sadly, no."
