Memtest86 crashing before finishing

  • Memtest86 crashing before finishing

    Hi all,

    I'm new here. I've built several personal PCs in the past, but this is my first time using Memtest86, and it's on a workstation-like machine that I just finished building. Everything in the machine is new. The only parts that came from eBay were the dual Epyc CPUs (reputable seller who states that the CPUs were not used) and the 2 RAM sticks of 64GB each (another reputable seller who also stated the RAM is brand new). Regardless, I thought it'd be prudent to do thorough testing after building the machine to make sure everything is ok.

    It all posted successfully on first try, and I've barely used the machine - the only thing I changed was the RAM speed setting in the BIOS from "auto" to "3200", which is the supported/stated speed of both the RAM and the CPUs/motherboard. I also made sure to have all fans set to full speed.

    I've tried to run Memtest86 twice now, and both times the test was unable to complete: the system simply rebooted before it was over (I caught it as it happened the second time). In the second attempt, it had 4 errors during pass 3 / 4 (I believe in tests 7 and 8). It was probably on test 10 of pass 3 / 4 that the whole thing stopped entirely and rebooted.

    Here's a snippet of Memtest86 after the initial errors, but before it all rebooted:
    I unfortunately don't have the errors themselves, but do remember they were in the format of:
    Test 8, CPU 15, Address: XXXXX, Expected: XXXXX, Actual: XXXXXX
    [Image: image.png]
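For error lines in the Expected/Actual format described above, XOR-ing the two values shows which bits differ. This is only a rough sketch: the function name `flipped_bits` is made up for illustration, and the sample values are hypothetical because the real ones were not captured.

```python
def flipped_bits(expected: int, actual: int) -> list[int]:
    """Return the bit positions (0 = least significant) where the values differ."""
    diff = expected ^ actual  # a 1 bit marks every position that changed
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


# Hypothetical values for illustration only; they are not from the real log.
expected = 0xFFFFFFFFFFFFFFFF
actual = 0xFFFFFFFFFFFFFFEF

bits = flipped_bits(expected, actual)
print(f"{len(bits)} bit(s) differ at position(s) {bits}")
```

If the same bit position shows up in most errors, that would hint at a specific faulty chip or data line rather than random noise, though a handful of samples is not proof either way.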

    Ok so there are some unexpected/annoying errors, but I am not sure how to deal with Memtest86 not being able to finish at all (and it's slow: it was 18-19 hours in, on pass 3 / 4, before stopping on its own. I planned on filling this machine up with 1TB of RAM eventually).

    From the IPMI, you can see when the initial errors happened on tests 7 and 8 (18:20:34), and then when the error(s) happened on likely test 10 that caused the system to reboot (20:33:56). Further below you can also see the health status for the memory sticks, though they went back to green for both sticks after the reboot.
    [Image: image.png]

    I am also attaching the log that I found on the USB flash drive after the reboot, MemTest86-20240701-022233.txt, with a modified time of 8:32pm (so presumably right before it crashed) -- but I'm not sure what to look for in it, if anything. No results were output anywhere because Memtest86 never got to the end.

    CPU temps peaked at ~53C but stayed in the mid-40s most of the time, including shortly before the reboot. The CPUs idle at around 32-35C and I have all fans set to full speed for this. I'm fairly certain CPU overheating was not a problem. I also opened the case right after the reboot and did not feel major heat anywhere in the system.

    And this might be unrelated, but after the reboot the following 2 errors showed up:
    "Entry Point Not Found - The procedure entry point GetTempPath2W could not be located in the dynamic link library C:\Windows\system32\spool\DRIVERS\x64\3\FXSTIFF.dll"
    "Entry Point Not Found - The procedure entry point __CxxFrameHandler4 could not be located in the dynamic link library C:\Windows\system32\spool\DRIVERS\x64\3\FXSUI.dll"

    Lastly, I am not sure what the messages below represent. Quite a few of them show up, but they don't increase the actual error count, so I assumed they were ok? I do have 14 empty RAM slots and have just 1 stick per CPU.
    [ECC Errors] Test: 4 Channel-Slot: 0-X


    Specs:
    Motherboard: Supermicro H12DSI-N6
    CPU 1: AMD Epyc 7532
    CPU 2: AMD Epyc 7532
    CPU Fan 1: Arctic Freezer 4U-M
    CPU Fan 2: Arctic Freezer 4U-M
    RAM: Hynix DDR4 64GB 3200 RDIMM PC4-25600 ECC Registered (2x64GB for 128GB total)
    Storage: Crucial P3 Plus 4TB SSD
    Storage: WD Ultrastar DC HC550 16TB HDD
    PSU: Corsair HX1500i 80 Plus Platinum 1500W
    OS: Windows 11 Enterprise

    I'd greatly appreciate any help or advice, if anyone has any ideas what might be happening.

  • #2
    So it seems you have ECC RAM with both correctable and uncorrectable RAM errors.
    Neither is good, with uncorrectable errors obviously being far worse than the ones that get corrected.

    Bad RAM can cause software crashes.

    Maybe it is just bad luck, or maybe those eBay sellers aren't selling new parts. It is hard to imagine how any eBay seller can offer the best price on new parts, as eBay selling fees are around 15%. So anything new on eBay should be ~15% more expensive than at regular PC parts stores. eBay makes sense for rare and second-hand items, however.



    • #3
      Originally posted by David (PassMark)
      So it seems you have ECC RAM with both correctable and uncorrectable RAM errors.
      Neither is good, with uncorrectable errors obviously being far worse than the ones that get corrected.

      Bad RAM can cause software crashes.

      Maybe it is just bad luck, or maybe those eBay sellers aren't selling new parts. It is hard to imagine how any eBay seller can offer the best price on new parts, as eBay selling fees are around 15%. So anything new on eBay should be ~15% more expensive than at regular PC parts stores. eBay makes sense for rare and second-hand items, however.

      Thanks for the quick reply - wow. I presume the correctable errors are the ones from the [ECC Errors] Test: 4 Channel-Slot: 0-X messages?

      I'm running the Windows Memory Diagnostic tool as a second check, but it definitely feels like I should return these 2 sticks and just pay up somewhere else (I bought only 2 sticks to start, for exactly this reason). I did find it hard to find server-specific RAM from brands like Hynix elsewhere, though. The only thing on Amazon was A-Tech, which I haven't really heard of.



      • #4
        Yes, there are around 500 ECC correctable errors in the log (i.e. single-bit flips).

        They look like this
        [MEM ERROR - ECC Errors] Test: 7, (Chan,Slot,Rank,Bank,Row,Col): (0,N/A,N/A,N/A,N/A,N/A), ECC Corrected: yes, Syndrome: N/A, Channel/Slot: 0-X

        Shame the memory controller doesn't identify the slot at fault.
        Not really surprising that you get the occasional 2 bit flip error (uncorrectable) as well with this many errors.
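To count these messages per test rather than scrolling through the log, something like the following could work. It is a minimal sketch: the regular expression is modelled only on the single sample line quoted above, the helper name `summarize_ecc` is made up, and it assumes uncorrectable errors would appear with "ECC Corrected: no", which the sample does not confirm.

```python
import re
from collections import Counter

# Pattern based on the one sample line in this thread; other MemTest86
# versions may format these lines differently (assumption).
ECC_LINE = re.compile(
    r"\[MEM ERROR - ECC Errors\] Test: (?P<test>\d+),.*?"
    r"ECC Corrected: (?P<corrected>yes|no),.*?"
    r"Channel/Slot: (?P<chan_slot>\S+)"
)


def summarize_ecc(lines):
    """Count ECC messages grouped by (test number, corrected flag, channel/slot)."""
    counts = Counter()
    for line in lines:
        match = ECC_LINE.search(line)
        if match:
            key = (int(match["test"]), match["corrected"], match["chan_slot"])
            counts[key] += 1
    return counts


sample = [
    "[MEM ERROR - ECC Errors] Test: 7, (Chan,Slot,Rank,Bank,Row,Col): "
    "(0,N/A,N/A,N/A,N/A,N/A), ECC Corrected: yes, Syndrome: N/A, Channel/Slot: 0-X",
    "some unrelated log line",
]

for (test, corrected, chan_slot), n in sorted(summarize_ecc(sample).items()):
    print(f"test {test}, corrected={corrected}, channel/slot {chan_slot}: {n}")
```

To run it on a real log, pass the open file object instead of `sample`; the file's text encoding may need to be specified when opening it.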



        • #5
          Originally posted by David (PassMark)
          Yes, there are around 500 ECC correctable errors in the log (i.e. single-bit flips).

          They look like this
          [MEM ERROR - ECC Errors] Test: 7, (Chan,Slot,Rank,Bank,Row,Col): (0,N/A,N/A,N/A,N/A,N/A), ECC Corrected: yes, Syndrome: N/A, Channel/Slot: 0-X

          Shame the memory controller doesn't identify the slot at fault.
          Not really surprising that you get the occasional 2 bit flip error (uncorrectable) as well with this many errors.

          Ah ok. Do you think the chance of this problem coming from the CPUs (separate eBay seller) or the motherboard (new from Amazon) is low compared to the RAM sticks?

          Lastly, do you think it's worth adding my mobo to blacklist.cfg? I noticed H12DSI-NT6 is on there, but not my H12DSI-N6. They are very similar, except that one supports 10Gb networking and the other doesn't. Though I'm not sure if this would turn the test from 20 hours to 200 hours long.

          I reckon I'll move both sticks into different slots to see if the problems follow them, but I'm very likely going to be returning them.

          FWIW, this is the memory I got: https://www.ebay.com/itm/355442478866
          And the motherboard on Amazon: https://www.amazon.com/dp/B0BZVLP32M?psc=1&ref=ppx_yo2ov_dt_b_product_details



          • #6
            According to the Supermicro web site, they only ever tested one model of 64GB RAM with this motherboard: their own module, part number MEM-DR464MC-ER32. This is of course pretty bad on the part of Supermicro, especially as their store page is down at the moment and no one else stocks it. Obviously they are trying to create lock-in. This, their (potential) spying scandal, poor customer support and other shenanigans from Supermicro make me very wary about buying anything from them.



            • #7
              Originally posted by David (PassMark)
              According to the Supermicro web site, they only ever tested one model of 64GB RAM with this motherboard: their own module, part number MEM-DR464MC-ER32. This is of course pretty bad on the part of Supermicro, especially as their store page is down at the moment and no one else stocks it. Obviously they are trying to create lock-in. This, their (potential) spying scandal, poor customer support and other shenanigans from Supermicro make me very wary about buying anything from them.

              Hi David,

              Thank you again for the replies above. I returned the two RAM sticks from eBay, and went for "proper" RAM sticks directly from Crucial/Micron: https://www.crucial.com/memory/serve...MjEzMDIzNTY0OA..

              It seems the RAM was indeed the issue, as Memtest86 completed all 4 passes without errors with the new sticks in the same mobo slots, same settings, etc.
              The only thing that showed up this time was "[UEFI Firmware Error] Could not start CPU 8" (it's possible this showed up before as well and I didn't catch it among all the ECC error messages), but that did not keep Memtest86 from finishing and passing all 4 passes. From reading other posts, it seems there is a good chance this is some kind of UEFI bug, and not something actually wrong with my CPU -- but I couldn't find a way to get a definitive answer. Do you know how I could confirm or rule this out?

              [Image: UEFI firmware error screenshot.png]

              I'm attaching the new log (compressed, as 2.3 MB exceeds the upload limit) as well as a screenshot of the finished HTML report, which was output this time.

              It did take a very long ~27 hours for 128GB of RAM to finish 4 passes. I'm scared to attempt the full 1TB that I'll install in the machine -- I presume the time to finish will scale linearly? If so, that's a whopping 9-10 days, so maybe I'll just do 1-2 passes?
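The linear estimate above can be worked through as a quick calculation. This is only a sketch under the poster's own assumption that test time scales linearly with both memory size and pass count, and it takes 1TB as 1024GB; real scaling may differ once more memory channels are populated. The function name `estimate_hours` is made up for illustration.

```python
def estimate_hours(measured_hours, measured_gb, measured_passes,
                   target_gb, target_passes):
    """Scale a measured run linearly by memory size and number of passes."""
    hours_per_gb_pass = measured_hours / (measured_gb * measured_passes)
    return hours_per_gb_pass * target_gb * target_passes


# Measured in this thread: ~27 hours for 4 passes over 128GB.
for passes in (1, 2, 4):
    hours = estimate_hours(27, 128, 4, 1024, passes)
    print(f"{passes} pass(es) over 1024GB: ~{hours:.0f} hours (~{hours / 24:.2f} days)")
```

Under these assumptions, 4 passes over 1TB come to 216 hours, or 9 days, which matches the low end of the 9-10 day guess.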

              And again, thank you for the help on this!



              • #8
                [UEFI Firmware Error] Could not start CPU 8
                Yes, this is a UEFI BIOS bug.
                We've tried to get Supermicro to fix it, but they don't care.
