AM5 platform, RAM stability issue, inconsistent MEMTEST results

  • AM5 platform, RAM stability issue, inconsistent MEMTEST results

    Hello,
    In October I built a new R9 7900X-based system with an ASUS TUF X670E-PLUS WIFI motherboard, and it feels like I have been fighting RAM-related issues since day 1.

    I was using a G.Skill Flare X5 5600 kit which is on the motherboard's QVL. The first sign of trouble was that the memory-training stage of POST would occasionally fail: the motherboard would reboot itself and lock the RAM at the default 4800 JEDEC speed until I manually turned EXPO off and on again. Every time this happened I ran MemTest86, and it never detected any errors. Also, with the EXPO profile enabled, the BIOS would freeze for a few seconds upon entry before behaving normally. I initially chalked this up to new-platform, early-BIOS, early-adopter issues.

    At some point, I turned on Memory Context Restore to speed up POST by letting the motherboard skip memory training when able. A few days after I did this, I encountered my first total OS corruption: a boot loop of memory-related BSODs, followed by various service failures, etc. I ran MemTest86 immediately afterwards; it still detected no errors.

    I figured the EXPO profile was probably not stable, so I decided to run the sticks at JEDEC speeds until further BIOS improvements arrived, and turned Memory Context Restore back off so the motherboard could properly train the RAM as needed. This was fine and lasted several weeks, seemingly very stable, until yesterday.

    Out of the blue, the machine threw a Memory Management BSOD and entered a loop very similar to the first one. Once again my Windows installation was not recoverable. Given that this happened at the stock JEDEC 4800 MHz, I immediately ran MemTest86 again, and this time it showed 10k+ errors three minutes into the test.
    They were all 1-bit errors during Test 6, occurring over a relatively large address range, with every CPU thread detecting errors.
    Lowest Error Address 0x121C848 (18MB)
    Highest Error Address 0x27FBA8A68 (10235MB)
    Bits in Error Mask 0000000010000000
    Bits in Error 1
    Max Contiguous Errors 1
    CPUs that detected memory errors { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }
    Test # Tests Passed Errors
    Test 0 [Address test, walking ones, 1 CPU] 1/1 (100%) 0
    Test 1 [Address test, own address, 1 CPU] 1/1 (100%) 0
    Test 2 [Address test, own address] 1/1 (100%) 0
    Test 3 [Moving inversions, ones & zeroes] 1/1 (100%) 0
    Test 4 [Moving inversions, 8-bit pattern] 1/1 (100%) 0
    Test 5 [Moving inversions, random pattern] 1/1 (100%) 0
    Test 6 [Block move, 64-byte blocks] 0/0 (0%) 10401
    Test 7 [Moving inversions, 32-bit pattern] 0/0 (0%) 0
    Test 8 [Random number sequence] 0/0 (0%) 0
    Test 9 [Modulo 20, ones & zeros] 0/0 (0%) 0
    Test 10 [Bit fade test, 2 patterns, 1 CPU] 0/0 (0%) 0
    Test 13 [Hammer test] 0/0 (0%) 0
    Last 10 Errors
    2022-12-02 16:48:00 - [Data Error] Test: 6, CPU: 5, Address: 251127048, Expected: FFFEFFFF, Actual: EFFEFFFF
    2022-12-02 16:48:00 - [Data Error] Test: 6, CPU: 4, Address: 24CC1AE48, Expected: FFFBFFFF, Actual: EFFBFFFF
    2022-12-02 16:48:00 - [Data Error] Test: 6, CPU: 6, Address: 255329148, Expected: FFFF7FFF, Actual: EFFF7FFF
    2022-12-02 16:48:00 - [Data Error] Test: 6, CPU: 7, Address: 258B35008, Expected: FF7FFFFF, Actual: EF7FFFFF
    2022-12-02 16:48:00 - [Data Error] Test: 6, CPU: 10, Address: 264765488, Expected: FFFFDFFF, Actual: EFFFDFFF
    2022-12-02 16:48:00 - [Data Error] Test: 6, CPU: 3, Address: 248A16088, Expected: FFFFFFDF, Actual: EFFFFFDF
    2022-12-02 16:48:00 - [Data Error] Test: 6, CPU: 1, Address: 24157EE08, Expected: F7FFFFFF, Actual: E7FFFFFF
    2022-12-02 16:48:00 - [Data Error] Test: 6, CPU: 10, Address: 2646E5288, Expected: BFFFFFFF, Actual: AFFFFFFF
    2022-12-02 16:48:00 - [Data Error] Test: 6, CPU: 2, Address: 24453D888, Expected: FFFFF7FF, Actual: EFFFF7FF
    2022-12-02 16:48:00 - [Data Error] Test: 6, CPU: 3, Address: 24895EA08, Expected: FFFF7FFF, Actual: EFFF7FFF
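    As a sanity check, XORing each Expected/Actual pair from the log shows that every one of these errors flips the exact same data bit (bit 28), consistent with the "Bits in Error Mask" line, which looks more like one marginal data line than scattered bad cells. A quick sketch, with the values copied from the "Last 10 Errors" list above:

    ```python
    # XOR each Expected/Actual pair from the "Last 10 Errors" list above
    # to find which bit flipped in each error.
    errors = [
        ("FFFEFFFF", "EFFEFFFF"), ("FFFBFFFF", "EFFBFFFF"),
        ("FFFF7FFF", "EFFF7FFF"), ("FF7FFFFF", "EF7FFFFF"),
        ("FFFFDFFF", "EFFFDFFF"), ("FFFFFFDF", "EFFFFFDF"),
        ("F7FFFFFF", "E7FFFFFF"), ("BFFFFFFF", "AFFFFFFF"),
        ("FFFFF7FF", "EFFFF7FF"), ("FFFF7FFF", "EFFF7FFF"),
    ]

    for expected, actual in errors:
        diff = int(expected, 16) ^ int(actual, 16)
        # diff has exactly one set bit; bit_length() - 1 is its index
        print(f"{expected} ^ {actual} = {diff:08X} (bit {diff.bit_length() - 1})")
    ```

    Every pair comes out to 0x10000000, i.e. bit 28, matching the reported mask 0000000010000000.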
    Figuring one or both sticks had completely failed, I tried to isolate which stick was still good so I could reinstall the OS and have a functional PC. To my surprise, both sticks passed individually. Then I plugged both sticks back into their original slots and, to my surprise again, both sticks passed all 4 passes this time! I have run additional tests since then but could not reproduce the error. I have had several RAM failures in the 15 or so years I've been building PCs, and they were all reproducible with relative consistency.

    But two OS corruptions later, I can no longer trust the machine. So I got myself another set of DDR5 RAM today, same brand, but rated at 6000 MHz this time, also on the motherboard's QVL.
    I tested the new sticks at their EXPO speed immediately after installing them, and they passed all 4 passes with no issues. But I wanted to be extra sure of stability, so I turned EXPO off and ran them at JEDEC speed while preparing for my OS reinstall and data restore. Before doing that, I figured I would kick off one more MemTest86 run at JEDEC speed just to be sure they were stable... annnnd... it detected a single error during Test 8 on one of the passes. The error did not reoccur in the subsequent passes, and the entire run ended with 1 error detected.
    Lowest Error Address 0x47E698DC8 (18406MB)
    Highest Error Address 0x47E698DC8 (18406MB)
    Bits in Error Mask 0000000000000010
    Bits in Error 1
    Max Contiguous Errors 1
    CPUs that detected memory errors { 0 }
    Test # Tests Passed Errors
    Test 0 [Address test, walking ones, 1 CPU] 4/4 (100%) 0
    Test 1 [Address test, own address, 1 CPU] 4/4 (100%) 0
    Test 2 [Address test, own address] 4/4 (100%) 0
    Test 3 [Moving inversions, ones & zeroes] 4/4 (100%) 0
    Test 4 [Moving inversions, 8-bit pattern] 4/4 (100%) 0
    Test 5 [Moving inversions, random pattern] 4/4 (100%) 0
    Test 6 [Block move, 64-byte blocks] 4/4 (100%) 0
    Test 7 [Moving inversions, 32-bit pattern] 4/4 (100%) 0
    Test 8 [Random number sequence] 3/4 (75%) 1
    Test 9 [Modulo 20, ones & zeros] 4/4 (100%) 0
    Test 10 [Bit fade test, 2 patterns, 1 CPU] 4/4 (100%) 0
    Test 13 [Hammer test] 4/4 (100%) 0
    Last 10 Errors
    2022-12-03 19:07:18 - [Data Error] Test: 8, CPU: 0, Address: 47E698DC8, Expected: BC64DD7C, Actual: BC64DD6C
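    XORing the expected and actual values of this lone error likewise shows a single flipped bit, this time bit 4, consistent with the error mask above. A one-liner sketch:

    ```python
    # This run's single error: Expected BC64DD7C, Actual BC64DD6C.
    diff = 0xBC64DD7C ^ 0xBC64DD6C
    print(f"{diff:08X} (bit {diff.bit_length() - 1})")  # 00000010 (bit 4)
    ```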
    This is another situation I have never run across; in my past experience, RAM either threw a large number of errors or none at all. Plus, the same sticks literally just passed at their higher EXPO speed while overclocked, but then produced an error running stock? Given that I just swapped the RAM sticks, could this really be faulty RAM, or could something else be at play here? Is this just a fluke? Or is my luck really this bad? I really don't know anymore after all this...

  • #2
    Ryzen 9 7900X in Sept with DDR5 is still fairly bleeding edge. So I wouldn't be totally surprised if it was released with a few instabilities and/or an early, sub-optimal BIOS.

    It isn't unusual for RAM to work when used as a single stick, but fail in dual channel mode (2+ sticks).

    Something (RAM or CPU) is marginal: on the edge of working / not working. Maybe temperature or EMI pushes it one way or the other.

    Do you have another machine to try the two sets of RAM in at similar speeds?



    • #3
      I recommend never skipping the memory training, particularly if you are not filling in every single memory timing value in the BIOS.

      I would first try the JEDEC timings and bump the DRAM voltage slightly. On my ASUS motherboard, these are the DRAM AB and DRAM CD voltage settings.
      See what the BIOS has picked as the default voltage values, then try setting the voltage manually and bumping it by 0.04 V to start, e.g. 1.20 becomes 1.24.

      If you had already set the voltage manually, note that the JEDEC timings generally prefer a lower voltage than XMP/EXPO profiles. In that case, set the DRAM voltage back to Auto, then follow the step above to find a good value to set manually. Too high a voltage can also lead to instability.
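      To illustrate the stepping above (the 0.04 V step and 1.20 V example are from this post; the 1.35 V ceiling is just an assumed safety cap, not a spec value):

      ```python
      # Candidate DRAM voltages to try, stepping up from the BIOS default.
      # The 1.35 V cap here is an assumed safety limit, not a spec value.
      def voltage_steps(default_v, step=0.04, cap=1.35):
          steps = []
          v = default_v
          while v <= cap:
              steps.append(round(v, 2))
              v += step
          return steps

      print(voltage_steps(1.20))  # [1.2, 1.24, 1.28, 1.32]
      ```

      Test each step for stability and stop at the first voltage that passes; there is no need to climb higher than necessary.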


      As a side note, too low a DRAM voltage can lead to a "Cache hierarchy error" on AMD platforms, at least in my testing. That's discussed in the thread below, though people generally haven't focused on memory voltage/stability as the solution.

      https://community.amd.com/t5/process...re/td-p/392750

      Bumping the DRAM voltage slightly, or loosening the latencies, can avoid that and other instability issues.

      Last edited by eccman; Dec-05-2022, 12:03 AM.



      • #4
        Originally posted by David (PassMark) View Post
        Ryzen 9 7900X in Sept with DDR5 is still fairly bleeding edge. So I wouldn't be totally surprised if it was released with a few instabilities and/or an early, sub-optimal BIOS.

        It isn't unusual for RAM to work when used as a single stick, but fail in dual channel mode (2+ sticks).

        Something (RAM or CPU) is marginal: on the edge of working / not working. Maybe temperature or EMI pushes it one way or the other.

        Do you have another machine to try the two sets of RAM in at similar speeds?
        I do not have another machine with DDR5, given how new it is. But I did already swap the initial RAM out; even though I could not reproduce the failure, I figured the sticks were probably marginal. The new sticks ran with no errors with EXPO on, but threw 1 error with EXPO off. I further tested them with HCI MemTest with EXPO off, and they threw 3 errors two passes in at JEDEC speed. This low number of errors makes me believe this is not a failing stick but rather a stability issue; please let me know if you agree...

        I retested the new sticks with EXPO on, with both MemTest86 and HCI, and they passed both this time, running for multiple hours without any issue. So it seems my sticks are not stable at stock but are stable once overclocked..? This makes no sense...
        Last edited by actionzhe; Dec-05-2022, 01:03 AM. Reason: Edit:spelling



        • #5
          Originally posted by eccman View Post
          I recommend never skipping the memory training, particularly if you are not filling in every single memory timing value in the BIOS.

          I would first try the JEDEC timings and bump the DRAM voltage slightly. On my ASUS motherboard, these are the DRAM AB and DRAM CD voltage settings.
          See what the BIOS has picked as the default voltage values, then try setting the voltage manually and bumping it by 0.04 V to start, e.g. 1.20 becomes 1.24.

          If you had already set the voltage manually, note that the JEDEC timings generally prefer a lower voltage than XMP/EXPO profiles. In that case, set the DRAM voltage back to Auto, then follow the step above to find a good value to set manually. Too high a voltage can also lead to instability.


          As a side note, too low a DRAM voltage can lead to a "Cache hierarchy error" on AMD platforms, at least in my testing. That's discussed in the thread below, though people generally haven't focused on memory voltage/stability as the solution.

          https://community.amd.com/t5/process...re/td-p/392750

          Bumping the DRAM voltage slightly, or loosening the latencies, can avoid that and other instability issues.
          I think you bring up a good point. The experience with the past two sets of RAM, both unstable at the stock JEDEC speed, does indicate a potential voltage issue.

          My new set of sticks throwing a low number of errors at stock, but being stable once overclocked, may further point to this. With the 6000 MHz EXPO profile on, the profile calls for 1.35 V, compared to the 1.1 V stock. Maybe the chips just need more than 1.1 V to be stable even at stock speeds. I don't know how they pass QC like that, on BOTH sets of sticks. Or maybe it's my motherboard/BIOS that is allowing excessive voltage drops?

          This may also explain why I was having random memory-training failures with my previous 5600 sticks. The 5600 MHz EXPO profile on that kit calls for 1.2 V, and that may be inadequate, despite that supposedly being the "tested and verified" profile. And both sets of sticks may just be straight-up unstable at 1.1 V; for the full month I was running at JEDEC speed, the RAM was probably silently erroring out in the background. I was just oblivious to this until the memory fully corrupted my OS the second time. I even had memory training on during that boot, too.

          The new set of sticks, despite showing 1 error in MemTest86 and 3 in HCI at stock speed, passed 500% coverage in HCI and all 4 passes of MemTest86 today with the EXPO 6000 MHz profile. I'm throwing some OCCT on top of that for good measure, but it does appear to be stable... for now...



          • #6
            Originally posted by actionzhe View Post

            My new set of sticks throwing a low number of errors at stock, but being stable once overclocked, may further point to this. With the 6000 MHz EXPO profile on, the profile calls for 1.35 V, compared to the 1.1 V stock. Maybe the chips just need more than 1.1 V to be stable even at stock speeds. I don't know how they pass QC like that, on BOTH sets of sticks. Or maybe it's my motherboard/BIOS that is allowing excessive voltage drops?

            This may also explain why I was having random memory-training failures with my previous 5600 sticks. The 5600 MHz EXPO profile on that kit calls for 1.2 V, and that may be inadequate, despite that supposedly being the "tested and verified" profile. And both sets of sticks may just be straight-up unstable at 1.1 V; for the full month I was running at JEDEC speed, the RAM was probably silently erroring out in the background. I was just oblivious to this until the memory fully corrupted my OS the second time. I even had memory training on during that boot, too.
            Yes, what you describe could very well be voltage related.

            I see about 0.02 V of vdroop associated with the DRAM on an ASUS Zenith II Extreme Alpha motherboard.


            If your memory kit is based on dies from e.g. Micron or another mainstream memory manufacturer, you can often find the datasheet for the memory, as well as manufacturer-published SPD memory timing values for those modules. These tend to be far more conservative than the "re-labelled" memory kits. It's hard to say how much QC is really done on a per-module basis, or how extensive the initial QC is that arrives at the SPD values programmed into those re-labelled kits.

            I'd be interested to hear whether you can overcome the instability at default JEDEC timings by adjusting the voltage. I can't imagine needing more than, say, 1.27 V for the JEDEC timings. Try starting with 1.2 V on the 1.1 V kit.
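            If the ~0.02 V of droop I mentioned is in play, the BIOS setpoint needed to actually hold a target voltage at the DIMMs can be estimated (a rough sketch; droop varies by board and load, so measure your own):

            ```python
            # Rough vdroop compensation: set the BIOS DRAM voltage above the
            # target by the observed droop (0.02 V here; measure your board).
            def bios_setpoint(target_v, droop_v=0.02):
                return round(target_v + droop_v, 3)

            print(bios_setpoint(1.20))  # 1.22
            ```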

