ECC Errors - Which RDIMM?


  • #31
    I see. That makes sense. Thanks!

    I've got two more quick questions:

    1) I'm noticing that SPD information is sometimes detected but other times not. Is that a motherboard-specific thing?

    2) Do motherboards cache memory info, specifically serial numbers? I ask because I just moved DIMM 73C from the TR Pro board to the EPYC board and put another DIMM in its place on the TR Pro board. However, MemTest on the TR Pro still thinks 73C is installed. (Unfortunately, Hynix DIMMs don't have serial numbers on their labels so I rely on MemTest to know which one is which.)

    Comment


    • #32
      With regards to the motherboard caching the memory info, that apparently *is* happening, at least on the TR Pro board. The only way I could get the board to discover a different Hynix DIMM was installed was to temporarily install a Samsung DIMM in that slot, boot it up, bring it down, reinstall the Hynix DIMM in that slot and boot it up. Very strange!

      I'm currently testing the 64GB DIMMs one-by-one on the EPYC board. I was hoping to use the TR Pro to do this too, but I can't seem to get the MemTest test build to fully come up any more on that board. It now gets stuck after "Getting SPD details..." every time. Sometimes it gets stuck indefinitely. Other times, it gets stuck for about 30-60 seconds before resetting the system. I'm not sure what's changed to cause this.

      Never a dull moment!

      Comment


      • #33
        I'm noticing that SPD information is sometimes detected but other times not. Is that a motherboard-specific thing?
        There are some motherboards where collecting SPD data is a problem. But it is normally consistent, i.e. all the data or none of the data, with the same behaviour every time.
        If you want, post two debug log files, one with the data and one without, and we can take a look.

        Same for the stuck boot.

        I've never seen a motherboard cache the RAM SPD data. But it also isn't something we have gone looking for. Caching could have a really bad outcome if the machine failed to detect a new type of RAM being inserted (with a different speed or voltage requirement).

        Comment


        • #34
          Thanks as always for the responsiveness and the guidance. I realize I'm getting into weird territory with some of the issues I'm reporting.

          With regards to the "stuck boot" case, I'm attaching 2 logs to this post. Both are from the TR Pro system with the same DIMM installed. The file ending in 073020 is from one of the cases where it got stuck. The file ending in 073517 is from one of the cases where it fully booted.

          More logs / results to come...
          Attached Files

          Comment


          • #35
            With regards to the "sometimes SPD is present, sometimes it's not" case, I'm attaching 2 logs to this post. Both were taken from the EPYC system. In each case, 1 DIMM was installed (different ones). The file ending in 073517 has no SPD data while the file ending in 000154 has SPD data.

            Correction: I just discovered one of these was on the TR Pro system (USB sticks got swapped). I'll try to get better data.
            Attached Files
            Last edited by lunadesign; Dec-31-2022, 04:58 AM.

            Comment


            • #36
              Happy New Year to those of you in Australia!

              I've been testing the 64GB DIMMs one-by-one. Some are being tested on the EPYC board, others on the TR Pro board.

              In the first round, I tested:
              • EPYC board: DIMM 73C (previously had ECC errors). This time it went through 2.5 passes with no errors. (This test somehow got started with CPU Selection set to 1 CPU so it didn't get through as many passes.)
              • TR Pro board: DIMM 601 (no ECC errors to date). This time it went through 4 passes with no errors.
              In the second round, I tested:
              • EPYC board: DIMM 5C6 (previously had ECC errors). This test is still running but has gone through 3.7 passes so far with no errors.
              • TR Pro board: DIMM 4E1 (previously had ECC errors). This one was fine until near the end of pass 4, when it triggered one ECC error during test 13. But here's where it gets weird. This DIMM is installed in channel 2 (DIMMC1 slot) but the ECC error identified channel 3 (DIMMD1 slot), which is empty. How is this possible?
              A few other issues with the 2nd round TR Pro test:
              • The "caching" of the DIMM info is very much happening on this board. MemTest and the system's IPMI config both identify DIMM 5C6 being installed in this board. However, 5C6 hasn't been installed in this system for at least a day or two. Since then, 601 and now 4E1 have been in this system in that very slot. This seems like a BIOS bug. My theory is that the BIOS only updates the DIMM info if it detects a DIMM with a different manufacturer or model number. The EPYC board doesn't seem to have this problem.
              • The DIMM Results report image identifies the slots incorrectly. The report uses A1, A2, B1, B2, C1, C2, D1, D2 but this board has A1, B1, C1, D1, E1, F1, G1, H1. This seems like a MemTest bug as I don't see any references to A2, B2, etc in the MemTest log file but I do see E1, etc.
              I'm attaching the files from the 2nd round TR Pro test here and some screenshots in the next post.
              Attached Files

              Comment


              • #37
                Here are some screenshots from the 2nd round TR Pro test.

                [Image: Screen 2.png]

                [Image: DIMMResults-20221231-230034.png]

                Comment


                • #38
                  Thanks for the logs and screenshot.

                  Originally posted by lunadesign
                  This DIMM is installed in channel 2 (DIMMC1 slot) but the ECC error identified channel 3 (DIMMD1 slot), which is empty. How is this possible?
                  According to the logs, it sees ECC-enabled memory in channel 3 from the point of view of the CPU. There is a possibility that the mapping of the physical DIMM slots to the chipset memory controller may not be in sequential order. Can you try moving the stick to different slots and grab the logs for those runs? (No need for ECC errors to be detected)


                  Originally posted by lunadesign
                  The "caching" of the DIMM info is very much happening on this board. MemTest and the system's IPMI config both identify DIMM 5C6 being installed in this board. However, 5C6 hasn't been installed in this system for at least a day or two. Since then, 601 and now 4E1 have been in this system in that very slot. This seems like a BIOS bug. My theory is that the BIOS only updates the DIMM info if it detects a DIMM with a different manufacturer or model number. The EPYC board doesn't seem to have this problem.
                  Would likely be a BIOS bug as MemTest86 only reads directly from what is stored in the SMBIOS and does no caching whatsoever.
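To make the theory quoted above concrete, here is a toy model of it in Python. It is only a sketch of the poster's hypothesis (the slot record is refreshed only when the manufacturer or part number changes), not actual BIOS behaviour. The class name `HypotheticalSlotCache`, the helper `dimm`, and the Samsung part number and serial are made up for illustration; the Hynix part number is the one shown in the log.

```python
# Toy model of the *hypothesised* behaviour: the slot record is only refreshed
# when the (manufacturer, part number) pair changes. Serial numbers are ignored.
# This is an assumption for illustration, not how any real BIOS is known to work.

HYNIX_PN = "HMAA8GR7AJR4N-XN"          # part number as shown in the MemTest log
SAMSUNG_PN = "HYPOTHETICAL-SAMSUNG-PN"  # placeholder, not a real part number


def dimm(manufacturer, part_number, serial):
    """Build a minimal DIMM record (hypothetical helper)."""
    return {"manufacturer": manufacturer, "part_number": part_number, "serial": serial}


class HypotheticalSlotCache:
    def __init__(self):
        self.cached = None  # last record reported for this slot

    def report(self, installed):
        """Return what the board would report for the installed DIMM."""
        key = (installed["manufacturer"], installed["part_number"])
        if self.cached is None:
            self.cached = dict(installed)
        elif (self.cached["manufacturer"], self.cached["part_number"]) != key:
            self.cached = dict(installed)  # different make/model: refresh
        return self.cached  # same make/model: stale record, stale serial


slot = HypotheticalSlotCache()
for d in [
    dimm("SK Hynix", HYNIX_PN, "5C6"),
    dimm("SK Hynix", HYNIX_PN, "601"),    # swapped in, same make/model
    dimm("Samsung", SAMSUNG_PN, "S-1"),   # temporary swap, as in post #32
    dimm("SK Hynix", HYNIX_PN, "4E1"),
]:
    print(d["serial"], "->", slot.report(d)["serial"])
```

Under this model, a same-model swap keeps reporting the old serial, while a temporary swap to a different manufacturer forces a refresh, which would match the workaround described in post #32.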

                  Originally posted by lunadesign
                  The DIMM Results report image identifies the slots incorrectly. The report uses A1, A2, B1, B2, C1, C2, D1, D2 but this board has A1, B1, C1, D1, E1, F1, G1, H1. This seems like a MemTest bug as I don't see any references to A2, B2, etc in the MemTest log file but I do see E1, etc.
                  The labeling of slots was initially based on the naming convention used by specific boards and chipsets that support DIMM/chip decoding. However, this was extended for non-supported platforms as well. We will fix this by using the slot name stored in the SMBIOS.
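As an illustration of how the two label sets can diverge, here is a short Python sketch. It assumes the report labels come from a two-slots-per-channel naming scheme while this board has one slot per channel. That assumption, and both function names, are mine and are not taken from MemTest86's code.

```python
def labels_two_per_channel(n_slots):
    """Labels for boards with two slots per channel: A1, A2, B1, B2, ..."""
    return [f"{chr(ord('A') + i // 2)}{i % 2 + 1}" for i in range(n_slots)]


def labels_one_per_channel(n_slots):
    """Labels for boards with one slot per channel: A1, B1, C1, ..."""
    return [f"{chr(ord('A') + i)}1" for i in range(n_slots)]


print(labels_two_per_channel(8))  # ['A1', 'A2', 'B1', 'B2', 'C1', 'C2', 'D1', 'D2']
print(labels_one_per_channel(8))  # ['A1', 'B1', 'C1', 'D1', 'E1', 'F1', 'G1', 'H1']
```

Taking the label from the SMBIOS DeviceLocator string (e.g. DIMMC1) avoids having to guess the board's scheme at all.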

                  Comment


                  • #39
                    Originally posted by keith
                    According to the logs, it sees ECC-enabled memory in channel 3 from the point of view of the CPU. There is a possibility that the mapping of the physical DIMM slots to the chipset memory controller may not be in sequential order. Can you try moving the stick to different slots and grab the logs for those runs? (No need for ECC errors to be detected)
                    Will this experiment work without obtaining an ECC error?

                    I ask because in that log the only place where I see channel 3 mentioned is in the context of the ECC error. It's possible there are other channel 3 references that I am not seeing or don't know how to decode but I wanted to make sure.

                    Earlier in the log, I see the DIMM correctly identified as being located in channel 2 (it also shows channel 3 being unoccupied).

                    2022-12-31 05:13:15 - [Slot 2] DeviceLocator: DIMMC1, BankLocator: P0_Node0_Channel2_Dimm0, Manufacturer: SK Hynix, S/N: 80AD012134955385C6, AssetTag: DIMMC1_AssetTag (date:21/34), PartNumber: HMAA8GR7AJR4N-XN

                    2022-12-31 05:13:15 - [Slot 3] DeviceLocator: DIMMD1, BankLocator: P0_Node0_Channel3_Dimm0, Manufacturer: NO DIMM, S/N: Unknown, AssetTag: NO DIMM, PartNumber: Unknown
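For anyone who wants to pull the slot-to-channel information out of a log automatically, here is a minimal Python sketch that parses the two lines quoted above. The regular expressions are based only on the format of these two lines, so treat them as an assumption about the log format in general. `parse_slot_line` is a hypothetical helper, not part of MemTest86.

```python
import re

# Matches the "[Slot N] DeviceLocator: ..., BankLocator: ..." lines quoted above.
SLOT_RE = re.compile(
    r"\[Slot (?P<slot>\d+)\] DeviceLocator: (?P<locator>[^,]+), "
    r"BankLocator: (?P<bank>[^,]+), Manufacturer: (?P<manufacturer>[^,]+), "
    r"S/N: (?P<serial>[^,]+),"
)
CHANNEL_RE = re.compile(r"Channel(\d+)")


def parse_slot_line(line):
    """Return a dict describing one slot line, or None if the line does not match."""
    m = SLOT_RE.search(line)
    if m is None:
        return None
    channel = CHANNEL_RE.search(m.group("bank"))
    return {
        "slot": int(m.group("slot")),
        "locator": m.group("locator"),
        "channel": int(channel.group(1)) if channel else None,
        "populated": m.group("manufacturer") != "NO DIMM",
        "serial": m.group("serial"),
    }


LOG_LINES = [
    "2022-12-31 05:13:15 - [Slot 2] DeviceLocator: DIMMC1, BankLocator: P0_Node0_Channel2_Dimm0, "
    "Manufacturer: SK Hynix, S/N: 80AD012134955385C6, AssetTag: DIMMC1_AssetTag (date:21/34), "
    "PartNumber: HMAA8GR7AJR4N-XN",
    "2022-12-31 05:13:15 - [Slot 3] DeviceLocator: DIMMD1, BankLocator: P0_Node0_Channel3_Dimm0, "
    "Manufacturer: NO DIMM, S/N: Unknown, AssetTag: NO DIMM, PartNumber: Unknown",
]

for line in LOG_LINES:
    print(parse_slot_line(line))
```

For these two lines it reports DIMMC1 as populated on channel 2 and DIMMD1 as empty on channel 3, which is the point being made in this post.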

                    Comment


                    • #40
                      Getting back to the "sometimes SPD is present" case, here are two logs from the EPYC system. The log ending in 000154 has the SPD info, the log ending in 025345 does not.

                      Meanwhile, I'm continuing to test one DIMM at a time.
                      Attached Files

                      Comment


                      • #41
                        Originally posted by lunadesign
                        Will this experiment work without obtaining an ECC error?
                        Yes, the logs will indicate which "channel" the DIMM was installed on. This doesn't require ECC errors to be detected.
                        Hopefully with enough data points, we can determine how these channels are mapped to the physical slots on the board.
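One way to organise such data points is sketched below in Python. Each observation is a (physical slot, reporting source, channel) triple, and the code flags slots where different sources disagree. The source names `smbios` and `ecc_error` and both function names are made up for illustration. The two sample observations are the ones already reported in this thread for the DIMM in DIMMC1.

```python
from collections import defaultdict


def channels_by_slot(observations):
    """Group the channels reported for each physical slot by reporting source."""
    table = defaultdict(dict)
    for slot_label, source, channel in observations:
        table[slot_label].setdefault(source, set()).add(channel)
    return {slot: dict(sources) for slot, sources in table.items()}


def disagreements(table):
    """Return the slots for which more than one distinct channel was reported."""
    out = {}
    for slot, sources in table.items():
        all_channels = set().union(*sources.values())
        if len(all_channels) > 1:
            out[slot] = sources
    return out


# From the thread: the slot listing places DIMMC1 on channel 2,
# while the ECC error for the same single-DIMM run named channel 3.
OBSERVATIONS = [
    ("DIMMC1", "smbios", 2),
    ("DIMMC1", "ecc_error", 3),
]

print(disagreements(channels_by_slot(OBSERVATIONS)))
```

More runs with the DIMM in other slots would simply add more triples to the list.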

                        Getting back to the "sometimes SPD is present" case, here are two logs from the EPYC system.
                        Thanks for the logs. This is quite strange so we might have to get back to you with a new build to collect additional debug info.

                        Comment


                        • #42
                          Originally posted by keith
                          Yes, the logs will indicate which "channel" the DIMM was installed on. This doesn't require ECC errors to be detected.
                          Hopefully with enough data points, we can determine how these channels are mapped to the physical slots on the board.
                          I'm not sure I explained myself very well in the previous post. In the log we're talking about where only a single DIMM was tested, are there any data elements that indicate the DIMM being in channel 3 besides the ECC error messages? If no, I'm not sure how this will work without an ECC error because all the non-ECC error elements I saw referenced the DIMM being in channel 2.

                          Originally posted by keith
                          Thanks for the logs. This is quite strange so we might have to get back to you with a new build to collect additional debug info.
                          I'd be happy to try it out. FWIW, on the board where I've seen SPD present, it happens pretty rarely. It's much more common that SPD is *not* present, even on that board.

                          Comment


                          • #43
                            I've pretty much finished testing the nine 64GB DIMMs separately. Here's what I found:
                            • DIMM 4E1 had a single ECC error. This is the one I reported earlier where the DIMM was in channel 2 but the ECC error was identified as coming from the empty channel 3. This happened on the TR Pro board.
                            • DIMM 482 had numerous ECC errors (a few per hour). Curiously, this one also was in channel 2 but the ECC errors were all identified as coming from the empty channel 3. This happened on the EPYC board.
                            Now that I've seen this "channel 2 reported as channel 3" case twice, I went back and looked at the earlier tests with 8 DIMMs installed. It turns out the really problematic DIMM (482) was installed in every single test and was either:
                            • Installed in channel 2 when channel 3 had ECC errors
                            • Installed in channel 3 when channel 2 had ECC errors
                            So, I think DIMM 482 has been the primary culprit all along.
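The pattern above can be checked against a simple hypothesis: the channel named in the ECC error is the installed channel with 2 and 3 exchanged. The Python sketch below tests the observations from this thread against a pairwise swap (0<->1, 2<->3, ...). Only the 2<->3 pair has actually been observed, so the generalisation to other pairs is an assumption, and the 8-DIMM rows assume DIMM 482 was the source of those errors.

```python
# (description, installed_channel, reported_channel), taken from the posts above.
OBSERVATIONS = [
    ("TR Pro, DIMM 4E1, single-DIMM test", 2, 3),
    ("EPYC, DIMM 482, single-DIMM test", 2, 3),
    ("8-DIMM tests, DIMM 482 in channel 2", 2, 3),
    ("8-DIMM tests, DIMM 482 in channel 3", 3, 2),
]


def swap_within_pair(channel):
    """Hypothetical mapping: exchange channels 2k and 2k+1 by flipping the low bit."""
    return channel ^ 1


def consistent_with_swap(observations):
    """True if every reported channel equals the pair-swapped installed channel."""
    return all(swap_within_pair(installed) == reported
               for _, installed, reported in observations)


print(consistent_with_swap(OBSERVATIONS))  # True
```

This does not say where the swap happens (MemTest, the BIOS, or the board's channel wiring). It only shows that the reports so far fit one consistent rule.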

                            I'm not sure about DIMM 4E1. It *was* installed in channel 3 when channel 3 had ECC errors. It was also in channel 2 the first time channel 2 had ECC errors. But, I had another four 8 DIMM tests where DIMM 4E1 was not installed and still encountered ECC errors. I'll also note that I ran two full rounds of single DIMM tests with this DIMM and it fully passed once and had a single ECC error the other time. Just for yucks, I'm re-running another round overnight.

                            I'm inclined to return DIMMs 482 and 4E1. I already have the replacement for one of them but need another to have 8 "known good" DIMMs to do a full re-test with 8 DIMMs.

                            Getting back to the "channel 2 reported as channel 3" case (and its inverse), it seems like there's a bug in either MemTest or the BIOS routine that's providing the ECC error information to MemTest. I'd love to see this investigated and fixed because it definitely sent me on a wild goose chase by seemingly identifying the wrong DIMMs on each ECC error.

                            Since I currently have a DIMM that triggers ECC errors pretty regularly, I'm open to doing some experimentation. However, it appears my motherboard has pretty specific memory population guidelines so it's not as simple as trying the single DIMM test with the problematic DIMM in each slot. I'd likely install all 8 and try moving that DIMM around to see what channels get flagged each time there's an error.

                            Comment


                            • #44
                              Originally posted by lunadesign
                              Since I currently have a DIMM that triggers ECC errors pretty regularly, I'm open to doing some experimentation. However, it appears my motherboard has pretty specific memory population guidelines so it's not as simple as trying the single DIMM test with the problematic DIMM in each slot. I'd likely install all 8 and try moving that DIMM around to see what channels get flagged each time there's an error.
                              For the purpose of debugging, a known good DIMM installed in a single slot for each of the physical slots is sufficient to determine the channel mapping; it doesn't need to be the faulty DIMM.

                              Comment


                              • #45
                                Originally posted by keith
                                For the purpose of debugging, a known good DIMM installed in a single slot for each of the physical slots is sufficient to determine the channel mapping; it doesn't need to be the faulty DIMM.
                                Unfortunately, I checked with Supermicro and both boards only support a single DIMM config by populating slot DIMMC1. I.e., I can't put a single stick in slot DIMMD1 and leave the rest of the slots empty.

                                Even if it were possible to do single DIMM configs with each available slot, I still don't understand how this would work since the MemTest logs are always showing the correct channel mappings EXCEPT when displaying an ECC error. I'm guessing you missed my attempts to clarify this earlier in this thread. I'd appreciate it if you could answer this as I want to make sure I'm not totally confused here.

                                Comment
