
ECC Polling issue & Seeking Recs. for DDR5-6400 RDIMM testing hardware


  • ECC Polling issue & Seeking Recs. for DDR5-6400 RDIMM testing hardware

    Hello, this post is split into two parts, but they are closely related.

    We have a need to test a wide range of DDR5-6400 RDIMM modules and are having trouble determining a good setup to do so (native 6400, non-overclocked).

    Part 1

    I have been experimenting with a Supermicro X14SBI-F board paired with a Xeon 6505P CPU. We have MemTest Site 11.4 running over PXE.

    This setup is very cost effective and I would love for it to work, but there are a couple of things hindering me from going all-in on it:

    1. For some reason, ECC Polling does not work on this board. I have updated the BIOS and BMC firmware to the latest versions, and WHEA is Enabled. I am unable to set ECC Polling to Enabled in MemTest. It simply gives me the message "ECC is not enabled/unsupported on this system". I've gone through every BIOS menu and can't find any explicit option for enabling/disabling ECC altogether, so as far as I can tell, ECC ought to be working (unless this particular hardware is unsupported by MemTest itself, please advise). If it is of any use as additional information, we also operate X13SRA-TF boards (DDR5-4800 RDIMM) that do allow for toggling ECC Polling.
    Side Note: On the aforementioned X13 boards, I have been running sticks that reported correctable ECC errors (via Proxmox on a production system) for 120+ hours now and still have not seen the errors appear in MemTest. I don't think they were a fluke because the error addresses were consistent across multiple instances of the error.

    2. I am not completely sold yet on the viability of Supermicro motherboards for general memory testing. They are the most cost effective option on paper, but if they can't actually facilitate memory testing via MemTest, they're kind of off the table. (see part 2 of this topic for more details).
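
One way to check the side note's observation that the correctable-error addresses were consistent, rather than a fluke, is to tally how often each address recurs across the logged events. The following is a minimal sketch only: the `events` list and its hex-string address format are hypothetical and would need adapting to whatever the host's error log actually records.

```python
from collections import Counter


def recurring_addresses(addresses, min_hits=2):
    """Return {address: count} for addresses seen at least min_hits times."""
    # Normalise the hex strings to integers so letter case does not matter.
    counts = Counter(int(a, 16) for a in addresses)
    return {addr: n for addr, n in counts.items() if n >= min_hits}


# Hypothetical sample: addresses copied by hand from correctable-error reports.
events = ["0x1f3a40c0", "0x1F3A40C0", "0x2b000040", "0x1f3a40c0"]

for addr, n in sorted(recurring_addresses(events).items()):
    print(f"{addr:#x}: {n} hits")
```

An address that keeps reappearing points at a specific weak cell or row, while addresses that each appear only once look more like transient events.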

    Part 2

    For some background, we are a memory module company. We test thousands of memory modules a day. We have been operating DDR5 RDIMM testing machines for over a year at this point and have never once had a DDR5 RDIMM module give an actual MemTest redline error while running as a single module (I've witnessed them when running multiple modules, but they could never be reproduced when run as singles).

    Before jumping to conclusions about this being poor DDR5 RDIMM support with MemTest, I suspect it could have to do with Supermicro boards being picky about memory in general. It is fairly common for modules to simply not POST at all rather than enter MemTest and receive a testing error. It seems to be a hard pass-or-fail just attempting to POST on these Supermicro boards (both X13 and X14).

    I wanted to leave this kind of open-ended to the guys over at PassMark. What are you using during development for testing DDR5 RDIMMs, and which motherboard/CPU vendors have you found to be the least finicky for DDR5 RDIMM testing? We are not opposed to AMD so long as it produces reliable and consistent results; we just need something that can be trusted to give adequate testing results. Have you experienced oddities with DDR5 RDIMM on Supermicro boards similar to what I've described (impossible to find a stick that actually POSTs and errors in MemTest)?

    I will end this off by apologizing for the sporadic nature of the post. I'm at my wits' end trying to understand this and trying to actually find a bad DDR5 RDIMM module (via MemTest).
    If you have a DDR5 RDIMM module that reliably redlines in MemTest, I will literally buy it off of you at this point.

    I am highly technical and am more than willing to provide whatever additional details are necessary to assist. Let me know.

    Thank you for reading and taking the time to consider.

  • #2
    For ECC support, can you boot from USB and collect a debug log file?
    https://www.memtest86.com/tech_debug-logs.html

    Many normal RAM errors should be corrected by the ECC function. So it's not too surprising that you are seeing very few errors with ECC RAM. But MemTest86 should also be reporting the ECC correction events (at least for supported systems).

    Running single module tests means running them as single channel. We've seen numerous cases where dual channel testing gives different results from single channel testing.

    We don't have a wide range of server machines in house. Most of our development is done on regular desktop machines. We certainly have had problems with Supermicro boards in the past, however.
    https://forums.passmark.com/memtest8...election-modes
    https://forums.passmark.com/memtest8...permicro-board

    "If you have a DDR5 RDIMM module that reliably redlines in MemTest"
    We've got custom DDR5 RAM sticks made with errors at exact known address locations. As far as I know these are unique. Very hard to make. So very expensive. We don't have any for ECC RAM.

    However, we do have this DDR4 ECC interposer to insert errors. These provoke errors in MemTest86. Eventually we'll be making a DDR5 version of this hardware (or we'll do it when we have a customer that wants a bunch of them).




    • #3
      Thank you for your prompt reply. I have a couple things to attend to but I will do this today and report back. As for the USB boot and debug log file, is there a minimum test duration you need for useful data or simply just a boot?



      • #4
        Here is that log file that you had requested. Ran with a MemTest Free version 10.5 and a single Micron 96GB 2Rx4 DDR5-6400 EC8 RDIMM.

        Thank you again for your response. It was very insightful and very cool that you have a modified DIMM that is guaranteed to throw errors.

        In the interim I have been looking at other options for testing this spec. The pickings are pretty slim for native 6400 RDIMM without paying $10K per rig (real servers).
        Since you say that Supermicros are notorious for having issues, are there any vendors you would say are on the other end of that scale? In other words, who has the most standard, least quirky stuff generally? I know it's case by case with each board/chipset/CPU, but this is a root question I've been trying to find the answer to for years.

        Looks like the only real options for this particular spec/market segment, though, are ASRock Rack, Gigabyte, Supermicro, and maybe some Tyan stuff. AMD EPYC 9005 and Intel Xeon 6 options are pretty comparable on price at the moment (AMD seems a bit more scarce, though, in the quantities we need).

        Do you have any comments you're able to share on these vendor options, on what has had the fewest quirks historically, and on whether AMD vs. Intel makes any notable difference with regard to memory testing?

        I am looking forward to your insights. Thanks a million.
        Attached Files



        • #5
          Log is actually from V11.5 (not V10.5).
          We'll take a look and get back to you.

          ECC support has been pretty awful for decades. Poor testing by the vendors, several different standards for reporting errors, different BIOS settings per vendor, secret documents only available under NDA, every different model of CPU needing different code to enable ECC.

          This last point is especially hard on us, as we have to update MemTest86 for each new CPU. And often there is no documentation as to what is required for each new model.
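
The point about each CPU model needing its own code can be pictured as a lookup from CPU family to an ECC setup routine, with anything missing from the table reported as unsupported. The sketch below is purely illustrative: the family names, the handler function, and the table are hypothetical and are not MemTest86's actual implementation.

```python
from typing import Callable, Dict


def setup_ecc_family_a() -> bool:
    # Hypothetical placeholder for model-specific register programming.
    return True


# Hypothetical table: one entry per CPU family whose ECC registers are known.
ECC_HANDLERS: Dict[str, Callable[[], bool]] = {
    "family-a": setup_ecc_family_a,
}


def ecc_polling_status(cpu_family: str) -> str:
    handler = ECC_HANDLERS.get(cpu_family)
    if handler is None:
        # A new CPU family has no handler until one is written for it.
        return "unsupported"
    return "enabled" if handler() else "not enabled"


print(ecc_polling_status("family-a"))
print(ecc_polling_status("family-b"))
```

Under this picture, a brand-new CPU shows up as unsupported even when ECC is active in hardware, simply because no handler exists for it yet.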

          The whole ECC thing is a bit of a scam. It should be enabled on all memory sticks and CPUs. Yes, it might cost 10% more per RAM stick. But it would be worth it.

          Microsoft (or someone) needs to force some standards on the market.

          We don't have any perfect solution for testing 6400 RDIMM.
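
For context on the "10% more" figure, the raw extra DRAM width that side-band ECC adds can be worked out from the data and check bit counts. This is a rough sketch covering width only, not price; the layouts listed are the common 64+8 arrangement and the DDR5 EC4/EC8 subchannel widths (EC8 being the type of the Micron module tested in this thread).

```python
def ecc_overhead(data_bits: int, check_bits: int) -> float:
    """Extra DRAM width needed for ECC, as a fraction of the data width."""
    return check_bits / data_bits


layouts = {
    "DDR4 ECC DIMM (64+8)": (64, 8),
    "DDR5 EC4 subchannel (32+4)": (32, 4),
    "DDR5 EC8 subchannel (32+8)": (32, 8),
}

for name, (data, check) in layouts.items():
    print(f"{name}: {ecc_overhead(data, check):.1%} extra width")
```

By this measure the extra width is 12.5% for the 64+8 and EC4 layouts and 25% for EC8, so the 10% estimate is in the right range for the narrower layouts.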








          • #6
            Yes, you are correct about the 11.5; that was a typo. I appreciate it and understand the headache of cutting edge/proprietary nonsense. Not that it's on the scale of MemTest, but I've built SPD reprogramming software that tries to cover everything. JEDEC doesn't really update their SPD byte documentation beyond the first couple of years of a standard, and that doesn't even scratch the surface of all the odd-ball stuff (looking at you, Intel: XMP and Optane documentation is nearly non-existent, along with various other proprietary crap that can only be found over time via thousands of physical examples).

            With that all being the case though, probably best that I stick with this Xeon 6505P since you guys have that log data now and presumably will include support for it in the next release.



            • #7
              Originally posted by benzing
              Thank you for your prompt reply. I have a couple things to attend to but I will do this today and report back. As for the USB boot and debug log file, is there a minimum test duration you need for useful data or simply just a boot?
              Thanks for the logs. We don't have access to some details of the chipset but made an initial, untested attempt based on the previous generation chipset.

              If possible, can you give the following build a try and send the logs as well?
              https://www.passmark.com/temp/memtes...-11.5.1006.zip



              • #8
                Hey Keith,

                Here is that log file from running the build you sent over.

                Additional note: The "ECC Enabled" option had a value of "N/A" on the official build, and on this test build the value was "No".

                Attached Files



                • #9
                  Thanks for the logs.

                  We were able to obtain some register details for the chipset and corrected the implementation.

                  Can you give this build a try:
                  https://www.passmark.com/temp/memtes...-11.5.1007.zip



                  • #10
                    Here are the logs from 11.5.1007. Memory installed was a single Micron 96GB DDR5-6400 RDIMM. ECC Polling still could not be enabled. I also noticed an oddity: the number of reported channels was 12 rather than 1.

                    I just received an additional board to try out (hoping it's less finicky, and faster to boot in general, than the Supermicro). It's an ASRock Rack GNRD8-2L2T, which takes the same CPU.
                    Let me know if you want me to start providing logs from both of these boards, if that will provide additional useful details. It's probably a different flavor, but I think the ASRock uses an AMI BIOS as well.

                    Thanks again!
                    Attached Files



                    • #11
                      Thanks for the logs.

                      We made some slight fixes. Can you give this build a try:
                      https://www.passmark.com/temp/memtes...-11.5.1009.zip

                      Logs from different platforms would absolutely be useful, especially once we can get this one working.



                      • #12
                        That seemed to work!!! ECC Polling is now enabled/enable-able. On both boards. See below for the 2 log files. If there's anything else that I can help with regarding this, please do not hesitate to ask.

                        Regarding a production release, when do you think our site edition could be updated with this support? Is there any chance we could get a prelim site license build to run until full release?

                        Thanks again so much, Keith and David. This was my first experience on the forums and it's been the most useful/productive I've ever had with any company's support.
                        Let's hope it won't be necessary again, but if I ever have issues, I know where to come now.

                        Attached Files



                        • #13
                          We got a bit lucky as we got the (seriously hard to get) documentation from Intel for this CPU just recently.

                          As there is no ECC standardization for reporting faults, the same situation will likely happen again next year with a new family of CPUs.

                          Any prelim release wouldn't have code signing (so wouldn't work with secure boot). Which might or might not be a problem for you? It takes Microsoft a week or more to code sign any package for us.



                          • #14
                            Good to hear the ECC capabilities appear to be detected correctly.

                            We did find some issues with detecting the memory speed in the logs, and made the appropriate fixes.

                            If possible, can you give this build a try:
                            https://www.passmark.com/temp/memtes...-11.5.1010.zip

                            Regarding a preliminary Site Edition build, we can provide it, but we'll need the license details. Can you send us an e-mail or DM?

