ECC Errors - Which RDIMM?


  • #16
    Sure! I just downloaded 10.2 and tried it out. This time, instead of restarting the system at "Getting SPD details..." it just gets stuck there. No other output appears after that line.


    • #17
      In case it helps, here's the log from the 10.2 attempt.

      • #18
        OK thanks. We'll take a look. It will be a few days as we are going into Christmas.


        • #19
          Thanks! If there's anything I can do to help, please let me know. Unfortunately, I don't have any other open systems that take this kind of memory and am not sure how long I can hold onto the replacement DIMM.


          • #20
            UPDATE - After updating the BIOS on the new motherboard, I noticed that I could get into MemTest 10.2 but only if I let the system boot to the USB flash drive directly with no user interaction. When I tried to go through the boot menu and explicitly select the same USB flash drive to boot from, MemTest would still get stuck after "Getting SPD details...".

            However, now there's a new problem -- MemTest says ECC is *not* enabled.

            FWIW, this board/CPU/memory setup is:

            Supermicro M12SWA-TF
            Threadripper PRO 5965WX
            Samsung M393A4K40EB3-CWE 32GB ECC RAM

            There's only one BIOS setting that looks remotely related to ECC. It's called "Memory Corrected Error Enabling" and I've tried it set to "Enabled" and "Disabled" with no change in the ECC status as reported by MemTest.

            In case it helps, I'm attaching the logs plus the RAM and sys info files.

            • #21
              UPDATE -- I got a tip from Supermicro to try checking using Ubuntu Live and dmidecode. Per dmidecode, both of my systems support "Multi-bit ECC".

              So, it seems there may be a bug in MemTest86's ECC detection on this motherboard / chipset?

              Here's the dmidecode output from the M12SWA-TF system.

              [Image: dmidecode.png]
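
For anyone repeating the dmidecode check, here is a minimal Python sketch that pulls the "Error Correction Type" field out of dmidecode text. It assumes the output was captured from a Linux live session with something like `sudo dmidecode --type 16` (the Physical Memory Array record). The `SAMPLE` text and both function names are illustrative placeholders, not taken from the screenshot above. Note that this field reflects what the firmware advertises, so it may not prove that ECC reporting is actually active.

```python
import re

# Illustrative sample only: a trimmed "Physical Memory Array" (DMI type 16)
# record in the layout dmidecode prints. The values are placeholders and are
# NOT copied from the screenshot posted in this thread.
SAMPLE = (
    "Handle 0x0011, DMI type 16, 23 bytes\n"
    "Physical Memory Array\n"
    "\tLocation: System Board Or Motherboard\n"
    "\tUse: System Memory\n"
    "\tError Correction Type: Multi-bit ECC\n"
    "\tNumber Of Devices: 8\n"
)

# Matches one "Error Correction Type: <value>" line, ignoring indentation.
_ECC_LINE = re.compile(
    r"^[ \t]*Error Correction Type:[ \t]*(.+?)[ \t]*$", re.MULTILINE
)


def error_correction_types(dmidecode_text):
    """Return every 'Error Correction Type' value found in the text."""
    return _ECC_LINE.findall(dmidecode_text)


def firmware_reports_ecc(dmidecode_text):
    """True if at least one memory array advertises an ECC mode."""
    return any("ECC" in value for value in error_correction_types(dmidecode_text))


if __name__ == "__main__":
    print(error_correction_types(SAMPLE))
    print(firmware_reports_ecc(SAMPLE))
```

To use it on a real system, save the dmidecode output to a file and pass the file contents to `firmware_reports_ecc` in place of `SAMPLE`.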


              • #22
                Originally posted by lunadesign View Post
                UPDATE - After updating the BIOS on the new motherboard, I noticed that I could get into MemTest 10.2 but only if I let the system boot to the USB flash drive directly with no user interaction. When I tried to go through the boot menu and explicitly select the same USB flash drive to boot from, MemTest would still get stuck after "Getting SPD details...".
                Thanks for the update. According to the logs, it appears to freeze while attempting to obtain multiprocessor info from the UEFI firmware. This is likely a BIOS bug, though it is strange that the behaviour differs depending on whether or not MemTest86 is selected from the boot menu.

                Originally posted by lunadesign View Post
                However, now there's a new problem -- MemTest says ECC is *not* enabled.

                We're looking into adding support for this particular chipset. If you send an e-mail to us, we can provide you a build to test.


                • #23
                  Originally posted by keith View Post

                  Thanks for the update. According to the logs, it appears to freeze while attempting to obtain multiprocessor info from the UEFI firmware. This is likely a BIOS bug though it is strange that the behaviour is different depending on whether MemTest86 is selected to boot from menu or not.

                  We're looking into adding support for this particular chipset. If you send an e-mail to us, we can provide you a build to test.
                  Thanks! That sounds about right. Very bizarre.

                  I've just sent an e-mail. Please let me know what I can do to help get this chipset supported.


                  • #24
                    I obtained the test build from Keith and gave it a try on the TR Pro/WRX80 motherboard (M12SWA-TF). It took 2 or 3 reboots to get it past the suspected BIOS bug (even one where I didn't use the boot menu). But once I got in, I noticed that ECC Polling is now enabled. Yay!

                    I ran the standard memory tests with a single ECC 32GB DIMM for a full pass and saw no errors (ECC or otherwise).

                    I fully populated the same motherboard with 8 32GB DIMMs and ran the standard memory tests overnight. It's currently in the middle of pass 2. So far, no errors (ECC or otherwise).

                    Note: The DIMMs I've been testing on the TR Pro/WRX80 motherboard so far are not the same ones that I was having problems with on the EPYC motherboard at the beginning of this thread.

                    Question for PassMark: Does the test build have the same ability to detect ECC errors on the TR Pro/WRX80 motherboard as 10.0/10.2 did on the EPYC motherboard (the one at the beginning of this thread)?

                    If yes, I can declare the TR Pro/WRX80 motherboard as "known good" and use it to test the memory that was having issues on the EPYC motherboard. This will finally allow me to determine if it was the memory or the EPYC motherboard that was having issues.


                    • #25
                      Does the test build have the same ability to detect ECC errors on the TR Pro/WRX80 motherboard as 10.0/10.2 did on the EPYC motherboard
                      Unless we have messed up something, it should be at least as functional as the older patch release.


                      • #26
                        Originally posted by David (PassMark) View Post

                        Unless we have messed up something, it should be at least as functional as the older patch release.
                        Good to know. I wasn't sure if you needed to add some WRX80-specific logic to detect ECC errors like the ones flagged on my EPYC board. Thanks!


                        • #27

                          TESTING UPDATE

                          I'm going to include a full recap so you don't have to scroll back. In a few cases, I'm going to identify specific DIMMs by the last 3 digits of the serial number so you can follow the movement of DIMMs.

                          1) EPYC motherboard with 8 x 64GB ECC RDIMMs has ECC errors.
                          a) Initially, errors are coming from DIMM 4E1 in channel 3.
                          b) I swap the channel 2 and 3 DIMMs. 4E1 is now in channel 2, 482 is now in channel 3.
                          c) I re-run the tests and errors are now coming from DIMM 4E1 in channel 2.
                          d) I remove DIMM 4E1 from the system and replace it with a new replacement (5C6). So, 5C6 is now in channel 2.
                          e) I re-run the tests and errors are now coming from DIMM 5C6 in channel 2.
                          f) I swap the channel 2 and 4 DIMMs. 5C6 is now in channel 4, 73C is now in channel 2.
                          g) I re-run the tests and errors are now coming from DIMM 73C in channel 2.

                          2) TR Pro motherboard with 8 x 32GB ECC RDIMMs fully tests out with no errors.

                          3) Back on the EPYC motherboard...
                          a) I re-run the tests on the 64GB DIMMs to make sure the problem is still reproducible. Once again, I get errors from DIMM 73C in channel 2.
                          b) I remove all DIMMs, use canned air to blow out all the memory sockets, carefully reinstall all DIMMs in the exact same slots they were previously in.
                          c) I re-run the tests. Once again, I get errors from DIMM 73C in channel 2.

                          4) I move all of the 64GB DIMMs from the EPYC motherboard to the TR Pro motherboard. I keep the same DIMM-to-slot assignments as I had on the EPYC board (the two boards have the same slot-to-channel associations). I run the tests on the TR Pro and get ECC errors from DIMM 73C in channel 2.

                          5) I move all of the 32GB DIMMs that were previously in the TR Pro motherboard to the EPYC motherboard. I keep the same DIMM-to-slot assignments. I run the tests on the EPYC and get no ECC errors.

                          Since the 32GB DIMMs have worked perfectly on both boards, I think those DIMMs and both boards are fine. I think the problem is with the 64GB DIMMs.

                          The challenge in identifying the culprit is that I've seen errors reported on 3 different 64GB DIMMs. In case it helps:
                          • I have 9 64GB DIMMs with the exact same part number
                          • 2 were manufactured in late 2020
                          • 7 were manufactured in late 2021
                          • Of the 3 that have reported errors, 1 was from the 2020 batch, 2 were from the 2021 batch
                          It also seems odd that the errors are always in channels 2 or 3.
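
To make the pattern in the recap easier to see, here is a minimal Python sketch that tallies the error-producing runs by DIMM, by channel, and by board. The run list is copied from steps 1, 3, and 4 above. The names `ERROR_RUNS` and `tally` are illustrative, and the counts are numbers of test runs that reported ECC errors, not numbers of individual errors.

```python
from collections import Counter

# Runs that reported ECC errors, copied from the recap above:
# (board, DIMM serial suffix, channel).
ERROR_RUNS = [
    ("EPYC", "4E1", 3),    # step 1a
    ("EPYC", "4E1", 2),    # step 1c
    ("EPYC", "5C6", 2),    # step 1e
    ("EPYC", "73C", 2),    # step 1g
    ("EPYC", "73C", 2),    # step 3a
    ("EPYC", "73C", 2),    # step 3c
    ("TR Pro", "73C", 2),  # step 4
]


def tally(runs):
    """Count error-producing runs per DIMM, per channel, and per board."""
    by_dimm = Counter(dimm for _, dimm, _ in runs)
    by_channel = Counter(channel for _, _, channel in runs)
    by_board = Counter(board for board, _, _ in runs)
    return by_dimm, by_channel, by_board


if __name__ == "__main__":
    by_dimm, by_channel, by_board = tally(ERROR_RUNS)
    print("by DIMM:   ", dict(by_dimm))
    print("by channel:", dict(by_channel))
    print("by board:  ", dict(by_board))
```

Grouping the same runs three ways shows the errors spread across three DIMMs and two boards but concentrated in channel 2, which is why a simple "errors follow the DIMM" reading does not fit cleanly.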

                          I'm not sure how likely it is that I've got 3 different bad DIMMs from 2 batches. Is it possible that another DIMM in the set is causing these 3 to report errors? Or are the channels *completely* independent? (Both motherboards have a single slot per memory channel.)

                          I guess I could try testing each of the problematic 64GB DIMMs on its own to see if it triggers the errors?

                          Thoughts?


                          • #28
                            It would be strange to have 3 bad sticks. Unless it was a design flaw (i.e. all sticks of that RAM design are marginal in these AMD motherboards).

                            Might be time to start tweaking BIOS settings. This should totally not be required, as the vendor's experts should be doing this for you.
                            Assuming the BIOS lets you, bump up the voltages very slightly and drop the speed (e.g. turn off XMP). Basically, follow the same steps as you would if you had overclocked the RAM.

                            There are some more detailed comments here
                            https://help.corsair.com/hc/en-us/ar...locking-memory


                            • #29
                              Originally posted by David (PassMark) View Post
                              It would be strange to have 3 bad sticks. Unless it was a design flaw (i.e. all sticks of that RAM design are marginal in these AMD motherboards).

                              Might be time to start tweaking BIOS settings. This should totally not be required, as the vendor's experts should be doing this for you.
                              Assuming the BIOS lets you, bump up the voltages very slightly and drop the speed (e.g. turn off XMP). Basically, follow the same steps as you would if you had overclocked the RAM.

                              There are some more detailed comments here
                              https://help.corsair.com/hc/en-us/ar...locking-memory
                              Unfortunately, the board in question is a server-class board with very limited memory-related BIOS settings (see below). I've always run with stock memory settings so I'd hate to have to get into that game now.

                              Any thoughts as to my theory where a DIMM might be triggering the error in one or more DIMMs?

                              [Image: Memory BIOS.png]


                              • #30
                                Any thoughts as to my theory where a DIMM might be triggering the error in one or more DIMMs?
                                It is getting outside my area of expertise.
                                But if one stick (or all sticks together) were drawing too much current, this might drop voltage levels below an acceptable level. Thus the suggestion to bump the voltage up.
