Rowhammer mitigation


  • Rowhammer mitigation

    My i7-5820K/GA-X99-UD4/2400MHz Crucial Ballistix DDR4 system was failing the rowhammer test (a few hundred errors per pass) until I reduced the refresh interval from the default of 7.8µs, even though DDR4 is supposed to include rowhammer mitigation (source: https://en.wikipedia.org/wiki/Row_hammer#Mitigation).

    In my board's BIOS, the two settings were tREFI (default of 9360) and tREFIX9 (default of 82).


    refresh interval (µs) = tREFI / (RAM clock (MHz) / 2)

    tREFIX9 = 8.9 * tREFI / 1024


    so, with the defaults:


    9360/(2400/2) = 7.8µs

    (Source: page 123 of http://www.intel.com/content/dam/www...-datasheet.pdf)


    The standard recommendation is to halve the refresh interval to 3.9µs, thereby doubling the refresh rate (source: http://support.lenovo.com/us/en/prod...ity/row_hammer). Doing that gave me one error per pass, at the same address both times, so I reduced the interval to 75% of 3.9µs (i.e. tREFI=3510, tREFIX9=31) and it's now error-free over 8 passes overnight.
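A minimal Python sketch of the arithmetic above, for anyone who wants to work out values for a different RAM speed. The function names are made up for illustration, and rounding tREFIX9 up is an assumption: it happens to reproduce the two values quoted in this post (82 and 31), but other boards may round differently.

```python
import math


def refresh_interval_us(trefi_cycles: int, ram_speed: float) -> float:
    """Convert a tREFI setting (in memory-clock cycles) to microseconds.

    ram_speed is the advertised speed (e.g. 2400); the real clock is half of it.
    """
    return trefi_cycles / (ram_speed / 2)


def trefi_for_interval(interval_us: float, ram_speed: float) -> int:
    """Inverse of refresh_interval_us: the tREFI setting for a target interval."""
    return round(interval_us * ram_speed / 2)


def trefix9(trefi_cycles: int) -> float:
    """Unrounded tREFIX9 value from the formula in this post."""
    return 8.9 * trefi_cycles / 1024


if __name__ == "__main__":
    speed = 2400
    # Default, the commonly recommended 3.9 us, and 75% of 3.9 us.
    for target_us in (7.8, 3.9, 3.9 * 0.75):
        trefi = trefi_for_interval(target_us, speed)
        # Assumption: the BIOS value is the formula result rounded up.
        print(trefi, math.ceil(trefix9(trefi)), refresh_interval_us(trefi, speed))
```

For 2400MHz RAM this gives tREFI values of 9360, 4680 and 3510, matching the settings described above.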


    Because the memory controller is refreshing more frequently, the RAM is available for accesses slightly less of the time, which hurts performance a little, but I don't care too much. 45GB/s is enough for me.

  • #2
    Yes, it is surprising that you had an error with DDR4. This is the first report of errors we have seen with DDR4.

    Reducing the refresh rate was always known to be a solution, but it is nice to see the details. As you say it does reduce performance and battery life (in laptops)



    • #3
      Originally posted by David (PassMark)
      Yes, it is surprising that you had an error with DDR4. This is the first report of errors we have seen with DDR4.
      Yup. The reading I've done since suggests that DDR4's rowhammer mitigations aren't automatic, but require some support, presumably from the BIOS/UEFI in how it sets up the memory controller to enable TRR. Or maybe that option is obscured by the BIOS terminology (there's an option for 'enhanced stability' that I haven't changed from 'normal'). Or maybe, because this is "crazy overclocker RAM", it doesn't do TRR (or doesn't do it properly).
      Reducing the refresh rate was always known to be a solution, but it is nice to see the details. As you say it does reduce performance and battery life (in laptops)
      My intent was that it'd save a few hours of research for anyone else trying to mitigate, whether on DDR4 or not.



      • #4
        cowbutt, thank you for the info.

        David, does PassMark display the refresh interval of the RAM? Is anyone aware of a Windows utility that displays the tREFI and tREFIX9 values? I have some Haswell Lenovo PCs that don't expose memory timing settings, and I would like to figure out what the refresh rate is.



        • #5
          RAMMon displays the minimum requested tREFI value for each RAM stick, but not the value in use.

          There is also a new field in the RAM SPD data called pTRR (Pseudo Target Row Refresh). This field allows the RAM stick to indicate the MAC (Maximum Activate Count) level that the RAM can support. A typical value might be 200,000 row activations.

          This pTRR value appears in newer DDR3 and (maybe) all DDR4 RAM. We are in the process of updating RAMMon to display this value.

          Intel Xeon CPUs, starting with Ivy Bridge, provide support for pTRR. Subsequently, Haswell and Broadwell Xeon CPUs also included support for the Joint Electron Device Engineering Council (JEDEC) Targeted Row Refresh (TRR) algorithm. TRR is apparently an improved version of the previously implemented pTRR algorithm and does not incur any performance drop or additional power usage.



          • #6
            pTRR requires the MC to do the row refreshes.
            It is just a guarantee by the DRAM vendors on a certain Tmac.
            It is used for LPDDR3.

            With TRR, the DRAM refreshes the rows itself when the MC gives it the MR command to do a TRR
            (the MC doesn't have to do the actual row refreshes). I have heard a rumor that in the future all DDR4 will be row hammer free, and that TRR will be removed from the spec.

            We will wait and see ...



            • #7
              Hello cowbutt,

              I am going to try to recreate the failure you have seen. Can you tell me whether you tried any memory other than the Crucial Ballistix DDR4 2400? Do you see the failure at any speed other than 2400? Thanks!



              • #8
                Originally posted by FuturePlus
                I am going to try to recreate the failure you have seen. Can you tell me whether you tried any memory other than the Crucial Ballistix DDR4 2400? Do you see the failure at any speed other than 2400? Thanks!
                No, I didn't test with any other type of DDR4. I didn't test at any other speed either.



                • #9
                  OK, thanks for your reply. We tested on an ASUS X99 board with that same brand of memory stick and did not see any failures. Would you mind parting with that memory stick? I would certainly be willing to purchase it from you.



                  • #10
                    I'd rather not, as it's my main system. Also, the RAM is not a single stick, but four sticks of 8GB.

                    What settings did you use when attempting to reproduce the issue? Specifically:
                    * Fast Boot
                    * XMP
                    * Memory timing mode
                    * Channel Interleaving
                    * Rank Interleaving
                    * tREFI
                    * tREFIX9
                    * Memory Multiplier
                    * BCLK



                    • #11
                      I've been running across a comparable scenario.
                      My system (~4 weeks old, i.e. new):
                      CPU i7-6700K
                      Mobo ASUS Z-170P (BIOS Version 0601 = latest version, does not yet contain a fix for the Skylake AVX instruction bug, i.e. the CPU's microcode patch level is still at 39)
                      RAM 4 x 8GB Crucial Ballistix Sport DDR4 2400 MHz
                      GPU NVIDIA GTX 980ti

                      No overclocking, all CPU and RAM-related settings were set to their respective defaults ('Auto' or 'Standard') in BIOS (also XMP is disabled).

                      Other tests:
                      I've been running multiple test tools (IPDT, Prime95 (with setting CpuSupportsFMA3=0), Windows Memory Diagnostics, Unigine Heaven Benchmark (GPU stress test), among others); all indicated no errors whatsoever, except of course the aforementioned AVX instruction bug.
                      Also, during all those tests, temperature levels of all components involved never exceeded allowed/recommended levels; in fact, they didn't even come close.

                      Tests with Memtest86 (v6.3.0 Free Edition):
                      Initial repeated tests resulted in ~20 Errors in test 13 (Rowhammer) only.
                      I then gradually increased the RAM's refresh rate by lowering tREFI (called 'DRAM Refresh Interval' in my board's BIOS, default of 9364) and tREFIX9 (called 'tREFIX9' in my board's BIOS, default of 79) according to the formula posted by cowbutt above.
                      As was to be expected, small changes to the refresh rate (< 20%) didn't yield any change regarding the memory test results.
                      At tREFI = 7250 / tREFIX9 = 63, test 13 errors went away, yet the warning ([Note] RAM may be vulnerable to high frequency row hammer bit flips) remained, but suddenly a small amount of test 7 errors (< 10 during a 9.5 hrs test) started to pop up.
                      After more tests with various settings for tREFI / tREFIX9, I ended up with the following ones that yield no Memtest86 errors whatsoever: tREFI = 7130 / tREFIX9 = 62. Only thing that remains now is the aforementioned warning, but I can live with that (my goal here is to increase the refresh rate as little as possible, in order to stay as close as possible to the standard settings, and to avoid possible performance penalties as much as possible).
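As a rough illustration of how far these settings are from the default, the sketch below expresses a tREFI change as a relative change in refresh interval and refresh rate. The function names are made up for illustration; only the tREFI values come from this post, and no assumption about the actual RAM clock is needed because only ratios are used.

```python
def interval_reduction_pct(default_trefi: int, new_trefi: int) -> float:
    """Percent by which the refresh interval shrinks when tREFI is lowered."""
    return (1 - new_trefi / default_trefi) * 100


def refresh_rate_increase_pct(default_trefi: int, new_trefi: int) -> float:
    """Percent by which the refresh rate rises (rate is 1 / interval)."""
    return (default_trefi / new_trefi - 1) * 100


if __name__ == "__main__":
    default_trefi = 9364  # board default quoted in this post
    for trefi in (7250, 7130):
        print(
            trefi,
            round(interval_reduction_pct(default_trefi, trefi), 1),
            round(refresh_rate_increase_pct(default_trefi, trefi), 1),
        )
```

For tREFI = 7130 this works out to an interval roughly 24% shorter than the default, i.e. a refresh rate roughly 31% higher, which is still well short of the doubling that the 3.9µs recommendation implies.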

                      So lessons learned here seem to be
                      - It looks as if certain combinations of mobos and RAM may display this kind of problem (judging by the various problems reported by Skylake / Z-170 chipset users (freezing problems etc., though some of those might be related to the Skylake AVX instruction bug), that combination might be especially sensitive at this point).
                      - Increasing the RAM refresh rate might mitigate the test 13 error problem, but the required increase differs for each combination of board and RAM, and thus has to be determined by trial and error.
                      - Changes to the RAM refresh rate might induce other errors in Memtest86, so you have to look out for those.
                      - It's not out of the question that a (future) BIOS update provided by the board's manufacturer might fix this kind of problem.
                      Overall, to me this looks more like a hardware compatibility and/or a chipset problem, than a problem due to faulty hardware (of course can't totally rule out the latter - nothing's perfect in this world).

                      Hope this helps anyone who experiences similar problems.



                      • #12
                        Would like to obtain the same exact DIMM.

                        Originally posted by YseGuy
                        RAM 4 x 8GB Crucial Ballistix Sport DDR4 2400 MHz

                        So lessons learned here seem to be
                        - It looks as if certain combinations of mobos and RAM may display this kind of problem (judging by the various problems reported by Skylake / Z-170 chipset users (freezing problems etc., though some of those might be related to the Skylake AVX instruction bug), that combination might be especially sensitive at this point).

                        Hope this helps anyone who experiences similar problems.
                        Would it be possible for you to provide a photo of the sticker on your Crucial Ballistix Sport DDR4 2400 or of the packaging? I would like to try to duplicate this, as I have the same ASUS MB and i7. I believe it is a combination of parts, so I'd like to get exactly what you have in your system.

                        I believe my BIOS is 603, so that is an interesting factor as well.
                        Thanks



                        • #13
                          No photo since I'd have to dismount the modules to take one, and since it's a pre-built machine I don't have the packaging either.
                          But I can provide the module's part number: BLS8G4D240FSA.16FA

                          Your BIOS version seems to be more recent than the latest one available for download on ASUS's site, which is indeed interesting (we're talking about the Z-170P here, right? Not the Pro or any other of the various Z-170 chipset models ASUS is offering).



                          • #14
                            I also wonder about BIOS in this situation.

                            Thank you for the part number. That is the same part number as on my Ballistix Sport DDR4 DIMMs.

                            About the BIOS: my ASUS MB is the Z170-A, so no, we don't have exactly the same motherboard. I can kind of see why the BIOSes would be different for these two boards. That is interesting also.
                            Last edited by RowHammer; Feb-23-2016, 09:08 PM.



                            • #15
                              It is a bit confusing for me that you mix "reduce refresh interval" with "reduce refresh rate".

                              The mitigation is to reduce the refresh interval, that is, increase the refresh rate, right?

