
ECC injection and 3990x


  • ECC injection and 3990x

    CPU: AMD Threadripper 3990x
    Motherboard: Asus ROG Zenith II Extreme Alpha
    Memory: 128GB (4x32GB) Micron ECC

    I'm using the Pro version of memtest86, just downloaded a few minutes ago.

    I enabled ECC error injection. Initially this failed with "ECC injection may be disabled for AMD ryzen 30-3f".

    I then set this option to FALSE in the BIOS:
    "CBS->DDR4 Common Options -> Common RAS -> Disabled Memory Error Injection"

    Re-ran memtest, and it no longer complains about "ECC injection may be disabled".

    However, no errors are reported.

    This BIOS has no "Platform First Error Handling" / PFEH option.
    I suspect there is no platform first error handling, as I've seen corrected ECC memory errors (WHEA log entries in Windows). I've been able to cause corrected errors in Windows if the memory is overclocked/undervolted.

    Two questions:
    1. I don't see the Ryzen 30-3f mentioned in the memtest86 ECC supported list. Is this CPU supported by memtest86 for ECC injection?
    2. Can memtest86 also test multi-bit ECC errors, or can that be added?

    I do want to test how the system behaves with multi-bit errors, as well as single bit errors.
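
    For readers unfamiliar with the single-bit versus multi-bit distinction, here is a toy sketch of a SECDED (single error correct, double error detect) code. It uses an 8-bit extended Hamming code over 4 data bits purely for illustration. It is an assumption-laden teaching example and not the ECC scheme used by the memory controller on this platform; the function names `encode` and `decode` are hypothetical.

```python
def encode(data_bits):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    code = [0] * 8  # index 0 = overall parity, indices 1..7 = Hamming(7,4)
    code[3], code[5], code[6], code[7] = data_bits
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # overall parity over positions 1..7
    return code


def decode(code):
    """Return (status, data_bits); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position  # XOR of the indices of all set bits
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity: treat as a single-bit error and flip it back.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    return status, [code[3], code[5], code[6], code[7]]


word = encode([1, 0, 1, 1])

single = list(word)
single[5] ^= 1                 # one flipped bit
print(decode(single))          # ('corrected', [1, 0, 1, 1])

double = list(word)
double[5] ^= 1
double[2] ^= 1                 # two flipped bits
print(decode(double))          # ('uncorrectable', None)
```

    A single flipped bit is silently repaired and only shows up if the hardware reports the correction, while two flipped bits can be detected but not repaired. That is why both cases are worth testing separately.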

  • #2
    AMD's ECC support on non Epyc CPUs is a mess.

    You can find the list of CPUs for which we support ECC injection here
    https://www.memtest86.com/compare.html

    Injection is disabled in most AMD retail CPUs. It is very unusual to have a motherboard that claims to enable it. We haven't done any testing with this motherboard however.

    I found a discussion on this topic here
    https://hardwarecanucks.com/forum/th...d.75041/page-6

    Response from AMD was,
    • AM4 support ECC function
    • AM4 does not support ECC error reporting function
    • So AM4 platform CPU (Ryzen 1000,2000,3000 series) can all support ECC correction, but not ECC report function

    However when you really want to be sure you are injecting errors, we have a hardware option as well. Press the little yellow buttons at either end of the interposer to inject some errors.
    It does both single bit and double bit errors.
    https://www.passmark.com/products/ecc-tester/index.php


    • #3
      Originally posted by David (PassMark)
      AMD's ECC support on non Epyc CPUs is a mess.

      You can find the list of CPUs for which we support ECC injection here
      https://www.memtest86.com/compare.html

      Injection is disabled in most AMD retail CPUs. It is very unusual to have a motherboard that claims to enable it. We haven't done any testing with this motherboard however.

      I found a discussion on this topic here
      https://hardwarecanucks.com/forum/th...d.75041/page-6

      Response from AMD was,
      • AM4 support ECC function
      • AM4 does not support ECC error reporting function
      • So AM4 platform CPU (Ryzen 1000,2000,3000 series) can all support ECC correction, but not ECC report function

      However when you really want to be sure you are injecting errors, we have a hardware option as well. Press the little yellow buttons at either end of the interposer to inject some errors.
      It does both single bit and double bit errors.
      https://www.passmark.com/products/ecc-tester/index.php


      I would rather not spend $250-1000 on something that could be done in software.

      The motherboard is a TRX40 motherboard; I'm not sure how different that may be from the AM4 response you cite.

      A later post in the thread you linked to says that ECC reporting is working with the 39--X series on certain motherboards:

      https://hardwarecanucks.com/forum/th...-7#post-909585


      Working ECC reporting is consistent with what I've seen on this motherboard, too.

      Other posts reference AGESA documents that appear to reference registers that could be checked for proper ECC support, and whether reporting is enabled. It sounds like some registers may need to be SET for reporting to work -- perhaps that isn't being done by the BIOS, and needs to be done by the OS? Is memtest86 doing this?

      Can you confirm if memtest86 supports ECC injection with this CPU, or is there any debugging or logging that can be checked?








      • #4
        Any attempted hardware injection should be in the debug log.
        https://www.memtest86.com/tech_debug-logs.html



        • #5
          A few things:
          1. The log seems to confirm that PFEH is supported on this system, but not enabled - so it's not "getting in the way" here.
          2. I ran memtest with all 64 cores (SMT disabled) for about 12 hours, without ECC injection (ECC is still enabled), and saw no detected ECC correction events. That's suspicious to me, as the same (deliberately mildly unstable) memory configuration causes detected ECC correction events in Windows. Has the memtest86 ECC detection/reporting been verified on this platform/CPU family? If not, that may explain the lack of reporting with ECC injection, too.
          3. Looking at the memtest86 log (it's named differently now relative to your tech_debug-logs link), it appears memtest86 times out on the last processor and sets the default to single core mode, but later decides to use that core anyway (when "run all cores in parallel" is selected), and there are other wait errors on that core during the memory test. It seems odd to use a core that previously failed the wait test.
          4. With SMT enabled, memtest86 again detects an error on the last logical core (#127). Is there anything more specific that could be passed along to the BIOS provider to address a possible BIOS bug here? It's strange that only the last logical core is an issue.
          5. I've attached a (trimmed) log of a single run with 64 cores, SMT disabled, with ECC injection enabled.
          Code:
          2022-12-01 13:19:09 - MP test failed. Setting default CPU mode to SINGLE
          2022-12-01 13:19:10 - CPUID[0x00000001]:EDX[31:0] = 178BFBFF (MCA=1)
          2022-12-01 13:19:10 - CPUID[0x80000007]:EBX[31:0] = 0000001B (PfehSupportPresent=1, ScalableMCA=1)
          2022-12-01 13:19:10 - PFEH_CFG=0000000000000000 (PfehEnable=0)
          ,,,
          
          2022-12-01 13:20:11 - MtSupportRunAllTests - Injecting ECC error
          2022-12-01 13:20:11 - inject_ryzen - CfgAddressCntl = 00000060 (SecBusNum=60)
          2022-12-01 13:20:11 - inject_ryzen - Setting bus 00 -> 60
          2022-12-01 13:20:11 - inject_ryzen - DramScrubBaseAddr=00000003 (DramScrubEn=1)
          2022-12-01 13:20:11 - inject_ryzen - Writing 0x00000002 to DramScrubBaseAddr
          2022-12-01 13:20:11 - inject_ryzen - RedirScrubCtrl=00000003 (RedirScrubMode=3)
          2022-12-01 13:20:11 - inject_ryzen - Writing 0x00000000 to RedirScrubCtrl
          2022-12-01 13:20:11 - inject_ryzen - UMC_MISCCFG[0] = 00000112
          
          ..
          
          2022-12-01 13:20:11 - Start memory range test (0x0 - 0x6820400000)
          2022-12-01 13:20:12 - RunMemoryRangeTest - CPU #63 completed but did not signal (test time = 14ms, event wait time = 1001ms, result = Success) (BSP test time = 14ms)
          2022-12-01 13:20:12 - WARNING - possible multiprocessing bug in BIOS
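
          The CPUID values in the log above can be cross-checked by hand. A minimal sketch, assuming the standard x86 layout where CPUID leaf 1 EDX bit 14 is the MCA flag and CPUID leaf 0x80000007 EBX bit 3 is the Scalable MCA flag. Treating EBX bit 4 as PfehSupportPresent is an assumption that matches the annotation memtest86 prints for this value; the helper name `decode_ras_flags` is hypothetical.

```python
# Bit positions (assumptions, see the note above):
MCA_BIT = 14            # CPUID[0x00000001].EDX: Machine Check Architecture
SCALABLE_MCA_BIT = 3    # CPUID[0x80000007].EBX: Scalable MCA
PFEH_SUPPORT_BIT = 4    # CPUID[0x80000007].EBX: assumed PfehSupportPresent


def bit(value, position):
    """Return the single bit of `value` at `position` (0 = least significant)."""
    return (value >> position) & 1


def decode_ras_flags(cpuid1_edx, cpuid80000007_ebx):
    """Extract the RAS-related flags that the debug log annotates."""
    return {
        "MCA": bit(cpuid1_edx, MCA_BIT),
        "ScalableMCA": bit(cpuid80000007_ebx, SCALABLE_MCA_BIT),
        "PfehSupportPresent": bit(cpuid80000007_ebx, PFEH_SUPPORT_BIT),
    }


# Values copied from the log lines above.
flags = decode_ras_flags(0x178BFBFF, 0x0000001B)
print(flags)  # {'MCA': 1, 'ScalableMCA': 1, 'PfehSupportPresent': 1}
```

          The decoded flags agree with the annotations in the log. The PFEH_CFG value is not decoded here because its bit layout is not given in this thread.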


          • #6
            On my 3960x on a Gigabyte board I have seen the same: it appears to be working everywhere, but it doesn't seem possible to test in software.

            e.g.

            inject_ryzen - UMC_MISCCFG[0] = 00000112
            inject_ryzen - UMC_ECCERRINJ_0[0] = 00010001
            inject_ryzen - writing UMC_ECCERRINJ_0[0] = 00010001
            inject_ryzen - UMC error injection cannot be enabled (MISCCFG[2] = 00000117)
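
            When comparing logs between systems, it can help to count the relevant messages rather than read the whole file. A minimal sketch that scans log lines for message fragments quoted in this thread; the pattern list is illustrative rather than a complete list of what memtest86 can log, and the function name `summarize_log` is hypothetical.

```python
import re

# Message fragments copied from the log excerpts in this thread.
PATTERNS = {
    "mp_test_failed": re.compile(r"MP test failed"),
    "injection_attempted": re.compile(r"Injecting ECC error"),
    "injection_blocked": re.compile(r"UMC error injection cannot be enabled"),
    "cpu_did_not_signal": re.compile(r"CPU #\d+ completed but did not signal"),
}


def summarize_log(lines):
    """Count how many lines match each known message fragment."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


sample = [
    "2022-12-01 13:19:09 - MP test failed. Setting default CPU mode to SINGLE",
    "2022-12-01 13:20:11 - MtSupportRunAllTests - Injecting ECC error",
    "inject_ryzen - UMC error injection cannot be enabled (MISCCFG[2] = 00000117)",
]
print(summarize_log(sample))
```

            A non-zero `injection_blocked` count points at the BIOS blocking injection, while `injection_attempted` with no reported ECC events points at the detection or reporting side.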



            • #7
              Originally posted by Toby
              For my 3960x on Gigabyte board I have seen the same it appears to be working everywhere but it doesn't seem possible to test in software.

              e.g.

              inject_ryzen - UMC error injection cannot be enabled (MISCCFG[2] = 00000117)
              Toby, it looks like your BIOS is set to block ECC injection. That's the default in my BIOS too. You shouldn't see the "cannot be enabled" line above.

              I had to set this option to FALSE in the BIOS:
              "Advanced -> AMD CBS->DDR4 Common Options -> Common RAS -> Disabled Memory Error Injection"

              You may need to enable advanced mode in the BIOS to see these options.

              What does this section look like in your log? If PFEH is enabled, you may have to look for an option to disable Platform First Error Handling too.
              Code:
              2022-12-01 13:19:10 - CPUID[0x00000001]:EDX[31:0] = 178BFBFF (MCA=1)
              2022-12-01 13:19:10 - CPUID[0x80000007]:EBX[31:0] = 0000001B (PfehSupportPresent=1, ScalableMCA=1)
              2022-12-01 13:19:10 - PFEH_CFG=0000000000000000 (PfehEnable=0)
              Please post whether you get any sign of ECC injected errors being detected/reported if you make the changes above.

              I may try injection in Linux later, as that may work there.
              Last edited by eccman; Dec-02-2022, 04:22 AM.



              • #8
                1. To David's point, my Gigabyte board doesn't have this setting. I only have the ECC symbol size, enable/disable, and UECC retry options.



                • #9
                  Toby, which model of Gigabyte board are you using that doesn't have the "Disabled Memory Error Injection" option?

                  Here's more evidence from my point of view that memtest86 ECC reporting / ECC polling isn't picking up the correction events. All the below is without using ECC injection.

                  I reduced the voltage to the memory further, to increase the error rate. Windows logged plenty of WHEA 47 corrected memory error events, a dozen or so in a few minutes of testing.

                  Booting into memtest86, the only things that showed up were a few multi-bit errors during compares. No displayed corrected or uncorrected ECC events. It's unlikely there weren't corrected events during this run. Uncorrected events should also show up as MCA/ECC events, right?

                  This is all I saw ...

                  Code:
                  2022-12-02 22:09:09 - [MEM ERROR - Data] Test: 3, CPU: 3, Address: 168974391C, Expected: 00000000, Actual: 00100010
                  2022-12-02 22:09:09 - [MEM ERROR - Data] Test: 3, CPU: 37, Address: 17135A631C, Expected: 00000000, Actual: 00100010
                  2022-12-02 22:09:09 - [MEM ERROR - Data] Test: 3, CPU: 17, Address: 16C1B4C95C, Expected: FFFFFFFF, Actual: FFEFFEFF
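
                  To see why these count as multi-bit errors, the Expected and Actual values can be XORed to find which bits differ. A minimal sketch; the function name `flipped_bits` is hypothetical, and it only looks at the 32-bit value shown on each log line, not at the full ECC word.

```python
def flipped_bits(expected_hex, actual_hex):
    """Return the bit positions (0 = least significant) that differ."""
    diff = int(expected_hex, 16) ^ int(actual_hex, 16)
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


# Expected/Actual pairs copied from the log lines above.
print(flipped_bits("00000000", "00100010"))  # [4, 20]
print(flipped_bits("FFFFFFFF", "FFEFFEFF"))  # [8, 20]
```

                  Each of the three reported errors has two flipped bits within the 32-bit value shown.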


                  Regarding the multi-processor comment I made in an earlier reply, I was also left wondering if there's an off-by-one type of issue: are enough wait events being allocated and initialized for the multi-processor support/test?
                  Last edited by eccman; Dec-03-2022, 06:46 AM.



                  • #10
                    I have a TRX40 AORUS MASTER (rev. 1.0) with the F7 BIOS. All the Gigabyte boards seem to be of a similar design. Maybe the DESIGNARE works, as it is more popular with the workstation community.



                    • #11
                      Originally posted by eccman
                      A few things:
                      Code:
                      2022-12-01 13:19:09 - MP test failed. Setting default CPU mode to SINGLE
                      2022-12-01 13:19:10 - CPUID[0x00000001]:EDX[31:0] = 178BFBFF (MCA=1)
                      2022-12-01 13:19:10 - CPUID[0x80000007]:EBX[31:0] = 0000001B (PfehSupportPresent=1, ScalableMCA=1)
                      2022-12-01 13:19:10 - PFEH_CFG=0000000000000000 (PfehEnable=0)
                      ,,,
                      
                      2022-12-01 13:20:11 - MtSupportRunAllTests - Injecting ECC error
                      2022-12-01 13:20:11 - inject_ryzen - CfgAddressCntl = 00000060 (SecBusNum=60)
                      2022-12-01 13:20:11 - inject_ryzen - Setting bus 00 -> 60
                      2022-12-01 13:20:11 - inject_ryzen - DramScrubBaseAddr=00000003 (DramScrubEn=1)
                      2022-12-01 13:20:11 - inject_ryzen - Writing 0x00000002 to DramScrubBaseAddr
                      2022-12-01 13:20:11 - inject_ryzen - RedirScrubCtrl=00000003 (RedirScrubMode=3)
                      2022-12-01 13:20:11 - inject_ryzen - Writing 0x00000000 to RedirScrubCtrl
                      2022-12-01 13:20:11 - inject_ryzen - UMC_MISCCFG[0] = 00000112
                      
                      ..
                      
                      2022-12-01 13:20:11 - Start memory range test (0x0 - 0x6820400000)
                      2022-12-01 13:20:12 - RunMemoryRangeTest - CPU #63 completed but did not signal (test time = 14ms, event wait time = 1001ms, result = Success) (BSP test time = 14ms)
                      2022-12-01 13:20:12 - WARNING - possible multiprocessing bug in BIOS
                      Thanks for the logs.

                      We are thinking it may be an issue with accessing the ECC registers for the higher memory channels. We are investigating at the moment.

                      To confirm this, would you be able to move the DIMMs to the lower memory channels (i.e. DIMM slots 0, 1, ...), if possible, and see whether you get different results?



                      • #12
                        The 3990x is in the preferred quad channel memory configuration, matching the ideal physical configuration in the motherboard instruction manual.

                        Are you suggesting running the system in dual channel mode at slower clock-rate with all 4 DIMMs populated to one side of the CPU? I wouldn't run the system that way.

                        HWINFO64 shows the populated DIMMs as:
                        P0 Channel A/DIMM 1
                        P0 Channel B/DIMM 1
                        P0 Channel C/DIMM 1
                        P0 Channel D/DIMM 1



                        • #13
                          Originally posted by eccman
                          The 3990x is in the preferred quad channel memory configuration, matching the ideal physical configuration in the motherboard instruction manual.

                          Are you suggesting running the system in dual channel mode at slower clock-rate with all 4 DIMMs populated to one side of the CPU? I wouldn't run the system that way.

                          HWINFO64 shows the populated DIMMs as:
                          P0 Channel A/DIMM 1
                          P0 Channel B/DIMM 1
                          P0 Channel C/DIMM 1
                          P0 Channel D/DIMM 1
                          We are just interested in collecting debug information for different configurations, in order to get to the bottom of the issue. We are not suggesting it should be configured this way.
                          In any case, upon further investigation it may not be related to the slot configuration.

                          Do you happen to have the logs for the MemTest86 run when the voltage was reduced to induce ECC errors?



                          • #14
                            Attached part of the log for the run referenced earlier. I imagine there were dozens (if not hundreds) of corrected ECC events by the time the first multi-bit error was picked up during comparison.

                            Can you also look at why the last processor is failing the wait test? That causes memtest to start in single core mode, and causes timeouts as noted in the log. I'm guessing not enough wait events are being allocated/initialized.

