ECC Errors - Which RDIMM?


  • ECC Errors - Which RDIMM?

    I'm using MemTest86 V10.0 Pro to test some new ECC RDIMMs on a Supermicro H12SSL-NT with an EPYC 7262.

    I'm nearing the end of Pass 1 and have already seen 3 ECC Errors detected (one during Test 8, two during Test 13).

    In each case, the message identifies "Channel-Slot: 3-X".

    Do the channel numbers on this platform start with 0 or 1? I'd like to know so I can figure out which RDIMM I need to re-seat and/or replace.

    I'm guessing the "X" part means "N/A" since this motherboard has only 1 DIMM slot per memory channel.

    Thanks!

    [Attached image: ECC-Errors.png]

  • #2
    Fairly sure they are zero numbered.
    These were corrected 1 bit errors by the way, so not errors that are going to cause a problem (just yet).



    • #3
      Originally posted by David (PassMark):
      > Fairly sure they are zero numbered.
      > These were corrected 1 bit errors by the way, so not errors that are going to cause a problem (just yet).

      Thanks David! Would any of the other screens within MemTest confirm the zero-based numbering?

      The fun part is that Supermicro labels the slots DIMMA1 through DIMMH1 but I'm not sure "A" is the first channel. I'm thinking this because their memory population guidelines suggest populating DIMMC1 first, then DIMMD1.

      Since this is for a file server, I'm definitely going to want to fix this one way or another.



      • #4
        The channels/slots use zero-based numbering, so it is the 4th channel that reports ECC errors.

        Word of caution, the reported ECC errors are from the perspective of the CPU; how the motherboard vendor chooses to map the physical slots on the board to the CPU memory controller is arbitrary.

        If you happen to have the logs, we can have a look and see if there are any additional details available.



        • #5
          Originally posted by keith:
          > The channels/slots use zero-based numbering, so it is the 4th channel that reports ECC errors.
          >
          > Word of caution, the reported ECC errors are from the perspective of the CPU; how the motherboard vendor chooses to map the physical slots on the board to the CPU memory controller is arbitrary.
          >
          > If you happen to have the logs, we can have a look and see if there are any additional details available.

          Good to know! Where do I find the logs? Are they automatically written to the USB stick or do I have to do something to trigger them to be written? I still have the test running so I'm hoping the data is still available.



          • #6
            Information on the debug logs can be found here:
            https://www.memtest86.com/tech_debug-logs.html



            • #7
              Thanks David! Sorry I didn't search hard enough to find the log location by myself.

              I'm attaching the log here.

              Two additional notes:

              1) The really handy "View detailed RAM (SPD) info" section of MemTest86 appears to identify the problematic DIMM. Specifically, under "DIMM #4", I see this:

              Device Locator: DIMMD1
              Bank Locator: P0_Node0_Channel3_Dimm0

              "DIMMD1" corresponds to a specific slot on the motherboard. I'm going to try re-seating that. If that doesn't work, I'll swap it with the DIMM in "DIMMC1".

              2) According to Supermicro Support, the IPMI "Health Event Log" is supposed to show ECC errors but I didn't see any. Perhaps they only log the uncorrectable ones?
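
The locator strings in note 1 can be decoded mechanically. Below is a minimal Python sketch, assuming the Bank Locator always follows the `P<n>_Node<n>_Channel<n>_Dimm<n>` shape seen in the one string quoted above; the field name `socket` for the `P` part and the helper names are illustrative assumptions, not anything MemTest86 or Supermicro defines.

```python
import re

# Pattern inferred from the single Bank Locator string quoted in this thread.
# Treating "P0" as a socket/package index is an assumption.
BANK_LOCATOR = re.compile(
    r"P(?P<socket>\d+)_Node(?P<node>\d+)_Channel(?P<channel>\d+)_Dimm(?P<dimm>\d+)"
)

def parse_bank_locator(text):
    """Return the numeric fields of a Bank Locator string, or None if it does not match."""
    match = BANK_LOCATOR.fullmatch(text.strip())
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}

def slots_for_channel(entries, channel):
    """Return the Device Locator labels whose Bank Locator reports the given zero-based channel."""
    labels = []
    for device_locator, bank_locator in entries:
        fields = parse_bank_locator(bank_locator)
        if fields is not None and fields["channel"] == channel:
            labels.append(device_locator)
    return labels

# Only this (Device Locator, Bank Locator) pair appears in the thread.
spd_entries = [("DIMMD1", "P0_Node0_Channel3_Dimm0")]

print(slots_for_channel(spd_entries, 3))  # ['DIMMD1']
```

Filling `spd_entries` with the pairs from all eight SPD screens would give the full channel-to-slot table for the board, which is the mapping the ECC message itself does not provide.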

              PS: I really love this tool!



              • #8
                Originally posted by lunadesign:
                > Device Locator: DIMMD1
                > Bank Locator: P0_Node0_Channel3_Dimm0
                >
                > "DIMMD1" corresponds to a specific slot on the motherboard. I'm going to try re-seating that. If that doesn't work, I'll swap it with the DIMM in "DIMMC1".

                That is a fair assumption. The logs indicate the ECC error count register is incrementing while the tests are running, so these appear to be real errors (as opposed to false positives).

                2022-12-13 23:20:39 - [Channel 3, Slot 0] DIMM err count=8 (prev=6)
                2022-12-13 23:20:39 - [MEM ERROR - ECC] Test: 4, (Chan,Slot,Rank,Bank,Row,Col): (3,N/A,N/A,N/A,N/A,N/A), ECC Corrected: yes, Syndrome: N/A, Channel/Slot: 3-X
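
For a longer log, the increments can be tallied per channel and slot. This is a rough Python sketch that assumes every counter line has the same shape as the first line quoted above; the regular expression and the function name are illustrative, not part of MemTest86.

```python
import re
from collections import Counter

# Shape inferred from the one "DIMM err count" line quoted in this thread.
COUNT_LINE = re.compile(
    r"\[Channel (?P<chan>\d+), Slot (?P<slot>\d+)\] "
    r"DIMM err count=(?P<count>\d+) \(prev=(?P<prev>\d+)\)"
)

def new_errors_per_slot(lines):
    """Sum the positive (count - prev) deltas for each (channel, slot) pair."""
    totals = Counter()
    for line in lines:
        match = COUNT_LINE.search(line)
        if match is None:
            continue  # not a counter line, e.g. the [MEM ERROR - ECC] summary
        delta = int(match["count"]) - int(match["prev"])
        if delta > 0:
            totals[(int(match["chan"]), int(match["slot"]))] += delta
    return totals

sample = [
    "2022-12-13 23:20:39 - [Channel 3, Slot 0] DIMM err count=8 (prev=6)",
    "2022-12-13 23:20:39 - [MEM ERROR - ECC] Test: 4, (Chan,Slot,Rank,Bank,Row,Col): "
    "(3,N/A,N/A,N/A,N/A,N/A), ECC Corrected: yes, Syndrome: N/A, Channel/Slot: 3-X",
]

print(dict(new_errors_per_slot(sample)))  # {(3, 0): 2}
```

Both numbers in the key are zero-based, matching the "Channel 3" in the log and the Bank Locator string.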



                • #9
                  Thank you very much, Keith!

                  Last night, I tried re-seating the Channel 3 DIMM (DIMMD1). I re-ran MemTest and got an ECC error on that DIMM within an hour.

                  Then, I tried swapping the Channel 3 (DIMMD1) and Channel 2 (DIMMC1) DIMMs. I re-ran MemTest overnight and got ECC errors - all on Channel 2.

                  Now that the same DIMM has caused ECC errors in 2 different slots, it seems like I've got a bad DIMM. Is there anything else I should try or should I start acquiring a replacement DIMM?

                  FYI - Not sure it matters but the DIMMs are all the same manufacturer / part number. However, the ones in Channels 2 and 3 were manufactured the 2nd week of 2020 while the rest were in weeks 34 & 39 of 2021.

                  [Attached image: ECC-Errors-3.png]



                  • #10
                    I wouldn't waste any more time on it.
                    If your vendor is happy to replace the stick, then get it replaced.



                    • #11
                      Thanks! I've ordered the replacement.

                      On the bright side, I know the motherboard's ECC mechanism works!



                      • #12
                        UPDATE - I received the replacement DIMM today and used it to swap out the problematic one. I fired up MemTest and was surprised to see ECC errors from the replacement DIMM.

                        Quick recap:
                        1) In each test, DIMMs are installed in all 8 slots
                        2) Problematic DIMM has ECC errors in slots C1 and D1
                        3) Replacement DIMM has ECC errors in slot C1
                        4) IPMI hasn't detected a single error
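
The recap can be laid out as a small cross-tabulation to see what the results do and do not isolate. This is only a sketch in Python; the labels "original" and "replacement" are names chosen here for the two DIMMs, and the pairs are taken directly from items 2 and 3.

```python
from collections import defaultdict

# (DIMM, slot) combinations that produced ECC errors, per the recap.
observations = [
    ("original", "C1"),
    ("original", "D1"),
    ("replacement", "C1"),
]

def cross_tabulate(pairs):
    """Group the failing combinations by DIMM and by slot."""
    by_dimm, by_slot = defaultdict(set), defaultdict(set)
    for dimm, slot in pairs:
        by_dimm[dimm].add(slot)
        by_slot[slot].add(dimm)
    return by_dimm, by_slot

by_dimm, by_slot = cross_tabulate(observations)
for slot, dimms in sorted(by_slot.items()):
    print(slot, sorted(dimms))
# C1 ['original', 'replacement']
# D1 ['original']
```

Slot C1 shows errors with both DIMMs and the original DIMM shows errors in both slots, so these three observations alone do not separate a DIMM fault from a slot or channel fault.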

                        What do I try next?



                        • #13
                          Do you have another system (identical or not) that you can try the suspect RAM in?



                          • #14
                            As luck would have it, I just received a new workstation board (Supermicro M12SWA-TF) that has the same memory on the manufacturer's tested list.

                            I just got the system built today but unfortunately MemTest86 won't run on it. When I try to boot from USB with a single stick of memory, I see the following:

                            Retrieving hardware info. Please wait...
                            Getting CPUID info...
                            Getting cache size...
                            Measuring CPU/cache/mem speeds...
                            Retrieving CPU MSR data...
                            Getting memory size...
                            Getting SPD details...


                            A few seconds after that last line, the system restarts itself.

                            I think I've seen something like this on some older Supermicro X9 motherboards but I chalked that up to a BIOS vs UEFI thing. This mobo shouldn't be having any of those sorts of issues.

                            I'm attaching the MemTest log here. Any ideas how I can get MemTest up and running on this system?



                            • #15

                              You are using V10.0. Can you try V10.1 or V10.2?
                              (V10.2 is being released later today)
                              https://www.memtest86.com/whats-new.html

                              If it still doesn't work we can have a deeper look.

