Access methodology of Row Hammer Test (Test 13)

  • Access methodology of Row Hammer Test (Test 13)

    Hi,
    I read the following paragraph in the MemTest86 User Manual. As it describes, we did see the warning message during testing.
    "Starting from MemTest86 v6.2, potentially two passes of row hammer testing are performed. On the first pass, address pairs are hammered at the highest possible rate. If errors are detected on the first pass, errors are not immediately reported and a second pass is started. In this pass, address pairs are hammered at a lower rate deemed as the worst case scenario by memory vendors (200K accesses per 64ms). If errors are also detected in this pass, the errors are reported to the user as normal. However, if only the first pass produces an error, a warning message is instead displayed to the user."

    I have a few questions and hope an expert can help.
    Q1: How are the 200K accesses distributed over the 64ms in the second pass?
    Q2: What is the actual "highest possible rate" in the first pass? Are its accesses distributed the same way as in the second pass?

    I would very much appreciate your help.

    Frank

  • #2
    Q1: Hammering is done on 2 addresses corresponding to adjacent rows. There is a somewhat complex process of working out what we think the pairs of addresses are. This process isn't 100% accurate (we are aware of methods to improve this, but it would be significant technical effort to 'improve' it). But improving it might also not be a good thing, as failing 100% of RAM sticks makes the software useless as a memory tester.

    Q2 is easy. The answer is: as fast as the CPU and RAM allow it to run.

    A single pair is used for a short period (up to 320ms) then the software moves on to a new pair.
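
As a rough back-of-the-envelope check, the 320ms dwell time mentioned here can be combined with the 200K accesses per 64ms quoted from the manual to bound the number of accesses a single pair receives in the second pass. This is only a sketch: it assumes the 320ms limit also applies to the rate-limited pass, which the post does not state, and the variable names are made up for illustration.

```python
# Figures taken from the thread; the combination of them is an assumption.
PAIR_DWELL_MS = 320          # "up to 320ms" per address pair
REFRESH_WINDOW_MS = 64       # refresh window used in the manual's rate
SECOND_PASS_RATE = 200_000   # accesses per 64ms in the second pass

# Number of whole refresh windows that fit into one pair's dwell time.
windows_per_pair = PAIR_DWELL_MS // REFRESH_WINDOW_MS

# Upper bound on second-pass accesses to one pair before moving on.
max_accesses_per_pair = windows_per_pair * SECOND_PASS_RATE

print(windows_per_pair, max_accesses_per_pair)
```

Under those assumptions a pair would span five refresh windows and see at most one million rate-limited accesses.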



    • #3
      Hi David,
      Thank you. Let me elaborate to make my question clearer.
      Q1. For a chip with 8K rows, tREFI would be 7.8us (64ms/8K). That means a refresh command would need to be issued every 7.8us.
      The statement says: 200K accesses per 64ms.
      Assume all 200K accesses go to the same address. If they are spread evenly over time, that is about 25 (=200K/8K) accesses per 7.8us period, and those 25 accesses are in turn spread evenly within the 7.8us. In that case the hammering seems very light to me. If the 200K accesses are instead not spread evenly but happen in a short burst (perhaps across several 7.8us periods), the hammering would be stronger and would probably cause an error message.

      Because our test showed the warning message, it FAILED the first pass and PASSED the second pass. So we would like to know how the second pass proceeds.
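
The arithmetic in this post can be checked with a short Python sketch. It assumes "8K" means 8192 rows and that the 200K accesses are spread perfectly evenly over the 64ms window, which the manual does not actually state; the variable names are only for illustration.

```python
# Refresh-window arithmetic from the post; all input figures come from the thread.
REFRESH_WINDOW_S = 64e-3       # every row is refreshed once per 64 ms
ROWS = 8 * 1024                # "8K rows" chip, assumed to mean 8192
ACCESSES_PER_WINDOW = 200_000  # second-pass rate quoted from the manual

# Average interval between refresh commands.
t_refi_s = REFRESH_WINDOW_S / ROWS

# Accesses that fall inside one tREFI if they are evenly spread.
accesses_per_trefi = ACCESSES_PER_WINDOW / ROWS

# Spacing between consecutive accesses under the same assumption.
gap_between_accesses_s = REFRESH_WINDOW_S / ACCESSES_PER_WINDOW

print(f"tREFI                = {t_refi_s * 1e6:.2f} us")
print(f"accesses per tREFI   = {accesses_per_trefi:.1f}")
print(f"gap between accesses = {gap_between_accesses_s * 1e9:.0f} ns")
```

With evenly spaced accesses this works out to roughly 24 accesses per tREFI, one about every 320 ns. A bursty schedule with the same total would look very different, which is exactly the open question here.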


      • #4
        The warning message means the RAM failed the high speed hammer test. (i.e. bits were flipped).
        Passing the 2nd phase means the RAM passes the slower speed hammer test. So in theory you are unlikely to see this bad behaviour very often in real life applications, unless you get unlucky.

        I think it would take a bit of time to work out the exact and precise description for all possible different hardware and document it. And of course we are working on other projects / future releases most of the time.

        This is more of a consulting job.


        • #5
          Thank you. Looking forward to further information.


          • #6
            Looking forward to further information
            Email us for current paid consulting hourly rates if you would like to proceed.


            • #7
              Originally posted by Frank_WEC
              Thank you. Looking forward to further information.
              Passmark won't help you further (for free), but I've been researching this recently and have some tips.

              There are multiple layers of confusion that must be penetrated to do a good Rowhammer test.

              First is translating the linear addresses visible to software into rank, bank, row and column of a particular DIMM. This isn't well documented. The only solid thing you can assume is that the lowest three bits select a specific byte within the 64-bit units DIMMs deal in, as not mapping those units one-to-one with cache lines would be insane.

              On modern systems, the "memory controller" is integrated into the CPU chip itself, and the motherboard simply provides an electrical connection to the DIMMs. This means once you figure out one CPU model, you've figured out every motherboard that includes it.

              Data sheets for your CPU can help to a point. For instance, a PC's memory map requires space for devices, which have to be at low addresses for backward compatibility. The data sheet for my Haswell says that this is handled with a hidden separate linear address space that maps all the DIMMs and only the DIMMs. The visible address space maps directly to the hidden address space most of the time. There is a space in the visible address space beyond the maximum address in the hidden space that is remapped (by simple subtraction) into some of the memory hidden by the largest hole (from an address configured by the BIOS to 4GB). Not all of that memory is revealed, some of it is "stolen" for other uses and kept hidden from the OS. It implies the only way to see memory "behind" the smaller holes (640K-1M and 15M-16M) is to turn them off. (The 15M-16M hole will usually be off anyway, since only certain ancient ISA cards benefit from it and most Haswell motherboards have no ISA slots.)
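
The remapping described above can be sketched in Python. This is a simplified model, not the actual Haswell logic: the parameter names `tolud`, `remap_base` and `remap_limit` are hypothetical stand-ins for whatever the BIOS configures, and the sketch ignores "stolen" memory and the small legacy holes.

```python
from typing import Optional

GiB = 1 << 30

def visible_to_dram(addr: int, tolud: int, remap_base: int,
                    remap_limit: int) -> Optional[int]:
    """Map a CPU-visible physical address to the hidden DRAM-only space.

    Returns None when no DRAM backs the address in this simplified model.
    """
    if addr < tolud:
        return addr                        # below the hole: one-to-one
    if addr < 4 * GiB:
        return None                        # inside the device hole
    if addr < remap_base:
        return addr                        # above 4 GiB: one-to-one
    if addr <= remap_limit:
        return addr - remap_base + tolud   # remapped by simple subtraction
    return None                            # beyond the remap window

# Made-up example: 8 GiB of DRAM, hole from 3 GiB to 4 GiB,
# and the 1 GiB behind the hole remapped to visible 8 GiB..9 GiB.
cfg = dict(tolud=3 * GiB, remap_base=8 * GiB, remap_limit=9 * GiB - 1)
print(visible_to_dram(8 * GiB, **cfg) == 3 * GiB)
```

In this made-up configuration the visible address 8 GiB lands on the DRAM at 3 GiB, i.e. the first byte hidden by the hole.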

              After that, documentation is vague. But you can make some educated guesses.

              First, if you have two DIMMs, the fourth bit probably selects between them. This is called interleave, and it boosts performance. Some actual systems interleave three DIMMs; that could get crazy because the actual hardware logic likely uses an unintuitive algorithm that is less painful to implement with logic gates than the obvious "divide by 3 with remainder".

              Except for the interleave, you can expect the columns of each row to be contiguous, and you know from SPD how many columns there are.
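
A minimal sketch of that two-DIMM guess, in Python. The interleave bit is left as a parameter because the real position and granularity are hardware-specific and may well differ from the guess in this post (for example, interleaving on a coarser boundary or via a hash of several bits); the function name and defaults are illustrative only.

```python
def split_address(addr: int, interleave_bit: int = 3, byte_bits: int = 3):
    """Split an address into (dimm, address within that DIMM, byte lane).

    byte_bits low bits pick a byte within a 64-bit unit; one bit at
    interleave_bit picks the DIMM and is squeezed out of the remainder.
    """
    byte_lane = addr & ((1 << byte_bits) - 1)
    dimm = (addr >> interleave_bit) & 1
    low = addr & ((1 << interleave_bit) - 1)     # bits below the select bit
    high = addr >> (interleave_bit + 1)          # bits above the select bit
    dimm_local = (high << interleave_bit) | low  # contiguous per-DIMM address
    return dimm, dimm_local, byte_lane

print(split_address(0x08))  # consecutive 64-bit units alternate between DIMMs
print(split_address(0x18))
```

With the default guess, consecutive 64-bit units alternate between the two DIMMs, and the per-DIMM address is what would then be split further into column, bank, rank and row.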

              Next is pulling apart rank, bank, and row. Here the processor may deliberately confuse things in order to reduce accidental "bank thrashing". Row hammering only works by causing bank thrashing, so you need to have both rows in the same bank and rank to be effective. But this gives the determined tester an angle, because if your hammer pattern doesn't actually cause a thrash, it will complete much faster than an effective hammer.
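
The timing angle can be sketched as a small post-processing step in Python. This does not measure anything itself: it assumes you already have average access latencies per address pair from some low-level timing loop, and it simply splits them at the largest gap. The function name and the sample numbers are made up for illustration.

```python
def split_by_latency(timings):
    """Split address pairs into (fast, slow) at the largest latency gap.

    timings maps (addr_a, addr_b) -> measured latency. Slow pairs are
    candidates for "same bank, different row" (i.e. they thrash).
    """
    ordered = sorted(timings.items(), key=lambda kv: kv[1])
    if len(ordered) < 2:
        return [pair for pair, _ in ordered], []
    # Gap between each latency and the next one in sorted order.
    gaps = [ordered[i + 1][1] - ordered[i][1] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    fast = [pair for pair, _ in ordered[:cut]]
    slow = [pair for pair, _ in ordered[cut:]]
    return fast, slow

# Made-up latencies in nanoseconds, purely illustrative.
sample = {
    (0x1000, 0x2000): 62.0,
    (0x1000, 0x4000): 64.5,
    (0x1000, 0x8000): 118.0,
    (0x1000, 0x10000): 121.5,
}
fast, slow = split_by_latency(sample)
print(slow)
```

Only the slow group would be worth hammering, since the fast pairs apparently never force a row to be closed and reopened.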

              Finally, you need to figure out which rows are close to one another. This is a thorny problem, because nothing forces DIMM designers to number their rows in order. An unscrupulous DIMM manufacturer could even deliberately try to anticipate the patterns MemTest86 will attempt and try to hide their weakest row combinations from them.

              Here your best hope is to deliberately overclock the DIMM (i.e., set tREFI too high) so that rowhammer errors become common, and carefully study the results to produce a map you can use to torture-test the DIMM under its rated tREFI or lower.

              I'm sure PassMark knows more than I do in general, although they obviously aren't using my approaches, since there is no provision to remember how to better stress a particular DIMM on a particular CPU.

              (A final note; all of this ignores TRR, a feature in newer RAM that recognizes a hammer in progress and tries to defeat it. If TRR always worked, we could at least not worry about it, but it often defends RAM that is so vulnerable that subjecting multiple rows at once to weaker hammering defeats it.)


              • #8
                Yes, it is a complex mess. The limited documentation is hard to get. It is hard to code, and very hard to test the code for accuracy.

                We did a lot of work on address decode to locate memory errors to a specific chip over the last couple of years. We didn't decode to the row & column level however. And it only works for some CPUs and some RAM configurations. Too many permutations to deal with them all.

                This same decode information could be used as the starting point of an 'improved' row hammer test (it isn't at the moment, as of Mar 2024). But it might be too effective and fail all RAM sticks tested. Which isn't great for a RAM testing product.


                • #9
                  Originally posted by David (PassMark)
                  This same decode information could be used as the starting point of an 'improved' row hammer test (it isn't at the moment, as of Mar 2024). But it might be too effective and fail all RAM sticks tested. Which isn't great for a RAM testing product.
                  That's if you think the only options on discovering a row hammer flaw are either to decide to live with it or throw away the DIMM. It's only like that if your BIOS doesn't provide access to memory underclocking options.

                  Before TRR, row hammer-fallible DIMMs aren't qualitatively different from bulletproof ones. They merely have their tREFI oversold. (With TRR, it's become a mess.)

                  A merciless row hammer test is useful in deciding what is the true, rather than advertised, tREFI for your RAM.


                  • #10
                    Most of our users / customers don't have a super deep knowledge of RAM timings nor row hammer techniques nor underclocking. For 90% of them it is a black and white issue. Any errors of any sort means the stick is bad and must be returned to the vendor. People would stop trusting MemTest86 if all (or most) RAM sticks fail, even though the same sticks work fine in the real world.


                    • #11
                      Hi Michael,
                      Thank you for the detailed explanation. I know that DRAM physical address decoding is very difficult because CPU vendors release very little information about it.
                      So it looks like the memory test tool should first identify the CPU in order to know the address mapping (decoding), and only then can it know the "effective address" of the DRAM to be tested. Right?


                      • #12
                        Originally posted by David (PassMark)
                        Most of our users / customers don't have a super deep knowledge of RAM timings nor row hammer techniques nor underclocking. For 90% of them it is a black and white issue. Any errors of any sort means the stick is bad and must be returned to the vendor. People would stop trusting MemTest86 if all (or most) RAM sticks fail, even though the same sticks work fine in the real world.
                        Indeed, this is a dilemma. Since the tool is deliberately not too strict, a failure in the Row Hammer test suggests the DRAM chip really may not be robust enough.


                        • #13
                          Interesting discussion here for sure, but when your RAM only fails the hammer test -- what does this actually mean? I'm just a typical customer. My crappy AliExpress unit with simple megatrends BIOS offers no mechanism to adjust RAM timings. I don't even know the manufacturer of the MOBO. So when my DDR5 fails the Hammer test, am I supposed to believe the RAM is bad, or am I supposed to get another RAM stick and test it or do another option? I read the explanation within the documentation regarding failed hammer test, I can't say it left me with any clear path forward.


                          • #14
                            More details are here
                            https://www.memtest86.com/troubleshooting.htm#hammer

                            Quote,
                            "Starting from MemTest86 v6.2, the user may see a warning indicating that the RAM may be vulnerable to high frequency row hammer bit flips. This warning appears when errors are detected during the first pass (maximum hammer rate) but no errors are detected during the second pass (lower hammer rate). See MemTest86 Test Algorithms for a description of the two passes that are performed during the Hammer Test (Test 13). When performing the second pass, address pairs are hammered only at the rate deemed as the maximum allowable by memory vendors (200K accesses per 64ms). Once this rate is exceeded, the integrity of memory contents may no longer be guaranteed. If errors are detected in both passes, errors are reported as normal."

                            Warning note looks like this.
                            [Screenshot: image.png]
                            So getting the "warning" note isn't good. But it probably won't make your PC too unstable.
                            Getting the "error" isn't good at all. There is a much higher probability that you'll see problems in real life. But it is random in the end and depends on usage.
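
The reporting rule quoted from the documentation can be summarised in a few lines of Python. This is only a sketch of the documented behaviour, not MemTest86's actual code; the enum and function names are invented for illustration.

```python
from enum import Enum

class HammerResult(Enum):
    PASS = "pass"
    WARNING = "warning"
    ERROR = "error"

def hammer_outcome(first_pass_errors: int,
                   second_pass_errors: int = 0) -> HammerResult:
    """Combine the two Test 13 passes into what the user is shown."""
    if first_pass_errors == 0:
        # No errors at the maximum rate, so no second pass is needed.
        return HammerResult.PASS
    if second_pass_errors > 0:
        # Errors even at the vendor worst-case rate: reported as normal.
        return HammerResult.ERROR
    # Errors only at the maximum hammer rate: warning instead of error.
    return HammerResult.WARNING

print(hammer_outcome(12, 0).value)  # warning
```

So a run that reports hammer errors with no warning corresponds to the case where both passes saw bit flips.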

                            It is very hard for a non expert end user to assign blame. Could be BIOS settings or the RAM vendor at fault. But if the system isn't used for anything critical and it is stable in real life usage, you can probably ignore the issue. Old saying, don't fix it unless it is broken. However if it was a medical device or something else critical you need to take some action.

                            crappy AliExpress
                            We've seen several people complain about low quality AliExpress parts recently. I don't know why people do it.


                            • #15
                              David (PassMark) Hmm, weird -- I'm not getting any warning regarding high frequency vulnerability when running MemTest86 v10.7 Free. From your explanation, I infer that the absence of this notification means I'm likely getting errors in both passes. I'm ordering a replacement RAM SODIMM module to test. It could be the BIOS, but the BIOS is so limited in this "Topton Unit" that there isn't any way to adjust any RAM timings at all -- it's truly really, really basic.

                              In terms of your advice about critical use and stability in real life usage -- that's a tough one. I usually test all my RAM before using the device in real life, since I'm trying to head off any foreseeable problems. I've found a lot of errors on many RAM sticks using this program (whether these are true errors or not remains to be seen), however I just keep replacing modules until I get a "good" module. I'm aware this might not be the best way to do things, however I'm not sure how you would ever verify or test RAM without a program such as this or a similar one (i.e. stress-ng or equivalent). In terms of AliExpress parts being low quality -- yes, you're likely right -- however so many American distributors use the same parts as these units -- rebranded units -- and sell them as their own. Since the RAM and disks are the only things you can supply yourself, you're kind of limited in how you can improve the quality.
                              [Attached image: IMG_6132.jpg]
