Rowhammer problem, not found by memtest86 6.3.0

  • Rowhammer problem, not found by memtest86 6.3.0

    Hello,

    We recently bought two computers with DDR3 RAM (G-Skill, with Jan 2016 on the DIMM labels), and on one of them we experienced corruption of database contents. On checking, I found that 7 bits had been flipped within a 14 MB blob.

    I tested with Google's userspace rowhammer_test and it hit a bitflip at iteration 10 (after around 10 seconds). Running the test repeatedly, it always failed before iteration 300.

    I tested the RAM with memtest86 6.3.0 (which should include a rowhammer test), and it found nothing. I tested the RAM with memtest86 4.3.7 (without rowhammer) and it found nothing.

    Nevertheless, memtest86+ 5.01 with the rowhammer test (https://github.com/CMU-SAFARI/rowhammer) always found addresses with bitflips, on both computers.

    So, to sum up:
    1) We have some evidence that there is a RAM problem: database corruption with bitflips.
    2) The Google rowhammer_test Linux program finds bitflips very quickly on one computer and less quickly on the other, but it does find them.
    3) memtest86+ 5.01 with the rowhammer test finds the bitflips in every run, on both computers.
    4) memtest86 6.3.0 doesn't find any problem on those computers, having run the set of tests two or three times on each.
    5) After replacing the RAM with good modules, memtest86+ 5.01 and the Google rowhammer_test do not find any problem.

    That definitely makes me distrust memtest86 6.3.0. The evidence comes from two computers, spanning 4 DIMMs. Do you have any idea what could be going on? Is there a bug in memtest86 6.3.0?

  • #2
    Can you upload or e-mail us a copy of the MemTest86.log file from EFI\BOOT\ on the USB drive?

    The row hammer test algorithm was changed in v6.2 so that up to 2 passes may be executed. So it is possible that errors were detected in the first pass but not in the second pass. This appears as a warning rather than a test failure.
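
    The two-pass behaviour described here can be sketched roughly as follows. This is a minimal illustration in Python, not MemTest86's actual code: the names `Verdict` and `row_hammer_verdict` are hypothetical, each pass is modelled as a callable returning an error count, and skipping the second pass after a clean first pass is an assumption read from "up to 2 passes".

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    PASS = "pass"
    WARNING = "warning"
    FAIL = "fail"


def row_hammer_verdict(first_pass: Callable[[], int],
                       second_pass: Callable[[], int]) -> Verdict:
    """Classify a row hammer run from the error counts of up to two passes.

    Both callables are hypothetical stand-ins: each runs one pass and
    returns the number of bit errors that pass detected.
    """
    if first_pass() == 0:
        # Assumption: a clean first pass means the second pass is not needed.
        return Verdict.PASS
    if second_pass() == 0:
        # Errors in the first pass only are shown as a warning.
        return Verdict.WARNING
    # Errors in both passes count as a test failure.
    return Verdict.FAIL
```

    Under this reading, a module that flips bits only in the first pass ends up with a warning, which is easy to miss if one is only looking for failures.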


    • #3
      Hello Laq,

      When Passmark changed their Rowhammer test in 9/2015, I brought this to their attention. At this point, it was pretty obvious that they sold out to the memory vendors and redesigned their Rowhammer test to allow failing systems to pass. So, they get a pat on the head from the memory and system vendors, but end users, such as yourself, get left with data corrupting systems. The quality of Memtest86 has diminished greatly since Passmark took over. But, their worst offense has been bending over to hide memory problems. A diagnostic is supposed to show failures. Passmark wants to make something that "passes" even when there are failures.

      And now, we're starting to see DDR4 failures in this test, even after it was made weaker. You did the right thing by choosing other diagnostics. I don't trust Passmark software and I stopped using Memtest. My original thread is here:

      http://www.passmark.com/forum/showth...ed-8-Sept-2015


      • #4
        A few points:

        1) Before we took the project over, there was no row hammer test. We implemented it in MemTest86 (well before Google did it). We also implemented native 64-bit support, UEFI support, DDR4 support, XMP 2.0 support, ECC support, logging to disk, mouse support and many other features.

        2) Google was doing a double hammer, which is a different algorithm. While it probably produces more errors (we haven't done extensive testing), the algorithm is much more artificial. To date I believe no one has produced any evidence that RAM access patterns like the double hammer appear in real applications. If you discover an error that never occurs, does it matter?

        3) Google reported errors in 52% of the non-ECC machines they tested. Are 52% of all computers in the world faulty and unstable? Clearly not. But this is what the Google test might have you believe. The only conclusion is that Google is over-reporting the errors, i.e. reporting minor faults as something very serious. Should we tell half the people on the planet to return their computers as faulty? Think of the economic impact of this false reporting.

        4) Not all database problems are RAM errors. A disk failure, software bug, CPU fault or virus is just as likely to have produced DB corruption as a RAM error. It is far from conclusive. Also, if you are doing any serious DB work, then you should be using ECC RAM. You can expect to get soft RAM errors in any system that runs over a long period of time. With ECC RAM, if it was a RAM error that caused the DB corruption, it would have either been auto-corrected or flagged to the user.

        5) Nothing was hidden in V6.3. What changed was the wording of the text on the screen. Some minor errors are now displayed as warnings. There is also an extra testing pass to categorize errors into minor or serious errors. The test was not weakened; a categorization step was added to it (so it is also slightly longer than it was).

        6) You can revert the warnings to errors via a change in the configuration file. So you can have the old behaviour if you want it.

        7) It doesn't make sense to claim we are hiding the problem, but then also report that you see MemTest86 reporting the problem in DDR4, which was in fact designed to ameliorate the problem. How weak can it be if you still get errors in RAM that was designed to fix the errors?

        To claim that the older version of Memtest86, from before we took over the project, was better in any way is just uninformed trolling.


        • #5
          Hello,

          Well, I do think that the problem is the RAM. I might suspect the disk if I had ONE bitflip, but I had seven. Here is the diff of the xxd dumps:

          $ diff objectok.txt objectbroken.txt
          561659c561659
          < 00891fa0: f91d a2a6 facb 46bb af1f 4705 213d bcf6 ......F...G.!=..
          ---
          > 00891fa0: e91d a2a6 facb 46bb af1f 4705 213d bcf6 ......F...G.!=..
          561665c561665
          < 00892000: 160c dc30 18c0 9d16 ac2c ad5c 91ae e181 ...0.....,.\....
          ---
          > 00892000: 160c dc30 18c0 9d16 ec2c ad5c 91ae e181 ...0.....,.\....
          561736c561736
          < 00892470: a510 020a d423 feae 347d 32eb 759f 9935 .....#..4}2.u..5
          ---
          > 00892470: a510 020a d423 feae 3c7d 32eb 759f 9935 .....#..<}2.u..5
          561962c561962
          < 00893290: 0bc3 aefb 8beb 89df 0d8d 1156 1945 e397 ...........V.E..
          ---
          > 00893290: 0bc3 aefb 8beb 89df 0d8d 1156 9945 e397 ...........V.E..
          562071c562071
          < 00893960: 6f76 0b60 8a24 d0de 4f79 4aa8 d48b 5043 ov.`.$..OyJ...PC
          ---
          > 00893960: 6f76 0b60 8a24 d0de 4f79 4aa8 d58b 5043 ov.`.$..OyJ...PC
          562293c562293
          < 00894740: 9132 7845 66b4 0369 acbe c47a d9a1 01a1 .2xEf..i...z....
          ---
          > 00894740: 9132 7845 46b4 0369 acbe c47a d9a1 01a1 .2xEF..i...z....
          562301c562301
          < 008947c0: 483b fa06 b581 f3c3 eaad a528 0181 bbe6 H;.........(....
          ---
          > 008947c0: 483b fa06 b781 f3c3 eaad a528 0181 bbe6 H;.........(....

          The probability of one bitflip on a (spinning) disk is very low, let alone seven.
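
          To see exactly which bits changed, the hex columns of the diff above can be XORed pair by pair. The following is a minimal Python sketch; the helper names `parse_xxd` and `bit_flips` are made up for illustration, and the ASCII column of each xxd line has been dropped by hand.

```python
# (good, corrupted) xxd lines copied from the diff above, hex columns only.
PAIRS = [
    ("00891fa0: f91d a2a6 facb 46bb af1f 4705 213d bcf6",
     "00891fa0: e91d a2a6 facb 46bb af1f 4705 213d bcf6"),
    ("00892000: 160c dc30 18c0 9d16 ac2c ad5c 91ae e181",
     "00892000: 160c dc30 18c0 9d16 ec2c ad5c 91ae e181"),
    ("00892470: a510 020a d423 feae 347d 32eb 759f 9935",
     "00892470: a510 020a d423 feae 3c7d 32eb 759f 9935"),
    ("00893290: 0bc3 aefb 8beb 89df 0d8d 1156 1945 e397",
     "00893290: 0bc3 aefb 8beb 89df 0d8d 1156 9945 e397"),
    ("00893960: 6f76 0b60 8a24 d0de 4f79 4aa8 d48b 5043",
     "00893960: 6f76 0b60 8a24 d0de 4f79 4aa8 d58b 5043"),
    ("00894740: 9132 7845 66b4 0369 acbe c47a d9a1 01a1",
     "00894740: 9132 7845 46b4 0369 acbe c47a d9a1 01a1"),
    ("008947c0: 483b fa06 b581 f3c3 eaad a528 0181 bbe6",
     "008947c0: 483b fa06 b781 f3c3 eaad a528 0181 bbe6"),
]


def parse_xxd(line):
    """Split 'ADDR: hex hex ...' into (base offset, bytes)."""
    addr, _, rest = line.partition(":")
    return int(addr, 16), bytes.fromhex("".join(rest.split()))


def bit_flips(good_line, bad_line):
    """Return (file offset, bit index, direction) for every differing bit."""
    base, good = parse_xxd(good_line)
    _, bad = parse_xxd(bad_line)
    flips = []
    for i, (g, b) in enumerate(zip(good, bad)):
        changed = g ^ b
        for bit in range(8):
            if changed & (1 << bit):
                direction = "1->0" if g & (1 << bit) else "0->1"
                flips.append((base + i, bit, direction))
    return flips


ALL_FLIPS = [flip for good, bad in PAIRS for flip in bit_flips(good, bad)]

for offset, bit, direction in ALL_FLIPS:
    print(f"offset 0x{offset:08x}  bit {bit}  {direction}")
print("span in bytes:", ALL_FLIPS[-1][0] - ALL_FLIPS[0][0])
```

          Each of the seven differing lines turns out to contain exactly one flipped bit, five going 0->1 and two going 1->0, and all seven fall within about 10 KiB of each other in the file.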

          Then, as for the Google test: they provide two implementations, rowhammer_test and double_rowhammer_test (which reads two neighbouring rows to cause bitflips in the row between them). I was talking all the time about rowhammer_test, not the double one.

          This is a very recently installed GNU/Linux system, without any binaries from outside the standard distribution. I don't believe there can be any "virus".

          This was not "serious DB work". It's a simple sqlite database for our code DVCS, which for its own safety runs sha1 checks on all objects at every commit; the check is cheap, so the software does it.
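
          As an illustration of why such a check catches this kind of corruption, here is a minimal Python sketch using the standard `hashlib` module. It is not the DVCS's own code; `sha1_hex`, `flip_bit` and `is_intact` are hypothetical helper names, and the "object" is just a stand-in byte string.

```python
import hashlib


def sha1_hex(blob: bytes) -> str:
    """SHA-1 digest of a blob as a hex string."""
    return hashlib.sha1(blob).hexdigest()


def flip_bit(blob: bytes, offset: int, bit: int) -> bytes:
    """Return a copy of blob with a single bit inverted."""
    corrupted = bytearray(blob)
    corrupted[offset] ^= 1 << bit
    return bytes(corrupted)


def is_intact(blob: bytes, expected_sha1: str) -> bool:
    """True if the blob still matches the hash recorded for it."""
    return sha1_hex(blob) == expected_sha1


original = bytes(range(256)) * 64       # stand-in for a stored object
recorded = sha1_hex(original)           # hash recorded when it was stored
corrupted = flip_bit(original, offset=1000, bit=4)

print(is_intact(original, recorded))    # True
print(is_intact(corrupted, recorded))   # False
```

          A single flipped bit is enough to change the digest, which is how seven flipped bits in a 14 MB object get noticed at the next check.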

          Memtest86 6.3.0 did not report any warning at all, at least on screen. I didn't even check whether a log file existed.

          Additionally, after replacing the RAM all tests passed in all cases, and I could no longer reproduce any failure with Google's rowhammer_test.

          I no longer have the RAM or any log file; I returned the RAM for replacement under warranty. The only USB stick on which I now have memtest86 6.3.0 doesn't contain a log file; maybe I ran the test from a stick that I have since overwritten while trying out different memtest86/memtest86+ versions.

          Do I believe that 52% of computers are faulty? Well, I don't think Google would lie about that. And in our case, we could get a bitflip in 10 seconds from the userspace test program, which is different from getting a bitflip in one hour.

          As an additional note, I also tested memtest86+ 4.2.0 (very old), which doesn't have a rowhammer test, and it reported errors. But the error bitmask had many random bits, different every time, so I guess they were related to a bad memory map or something similar.

          Sure, memtest86 being a non-free project will always leave the company managing it under suspicion of pressure from RAM and computer manufacturers. I did not have that suspicion at the start... it just happened that I first tested the memtest86+ flavours and the Google tests, and memtest86 6.3.0 last. I even thought that 6.3.0 lacked a rowhammer test, but then I saw that it had one, and decided to report the case here.


          • #6
            A link to a large statistical analysis of bitflips: https://www.usenix.org/legacy/events...tml/index.html


            • #7
              Note that I know nothing about the figure you gave of 52% faulty computers, presented as a Google claim. I guess there is a difference between getting a bitflip in hours and getting a bitflip in seconds.

              About the implementation and who was first: https://github.com/CMU-SAFARI/rowhammer/ also seems to predate the Google userspace test. But the Google test was more about reproducing the problem on a running system as an unprivileged user, and then showing the possibility of an exploit (replacing virtual memory page tables and all that). As for me, I just wanted to test the DIMMs because I had the bitflips in the database.


              • #8
                Originally posted by laq
                A link to a large statistical analysis of bitflips:
                This was an article about data corruption on hard drives; it doesn't in fact mention bit flips. I am not sure how relevant it is, except to show that disk errors happen (0.3% to 3.5% of all nearline disks get corrupt, at least in 2007 when the data was collected). It doesn't discuss the type of corruption (single-bit vs multi-bit).

                It would have been interesting to run a few additional tests before the RAM was returned.

                - Multithreading load vs single threading. We have noticed that Multithreading is better at finding some errors. We would like to make this the default behavior, but there are some buggy motherboards that freeze.

                - Multi-channel testing vs single channel. Access patterns vary depending on the number of RAM sticks installed. This can vary the row hammer result a lot.

                - UEFI based testing vs traditional BIOS (the memory mapping & reserved memory is different)

                and also to get the logs. So it is unlikely we'll get to the bottom of the issue now.


                • #9
                  Yes, it's a pity. I decided to write about this only after I had returned the RAM. Who knows, maybe we will get back the same RAM, or similar, and this will happen again. We have no news from the shop yet. If I can, I will try to provide more information.

                  I posted the link on hard disk data corruption because you mentioned it as a possible cause of the bitflips.

                  There were two 4 GB RAM sticks installed. The memtest86+ runs used the defaults (no SMP).

                  Thank you.
                  Last edited by laq; Mar-16-2016, 10:36 AM.


                  • #10
                    Another possible (even likely) cause for the difference in behaviour between the various testing programs is that, when coding a row hammer test, the hard part isn't the hammering. The hard part is working out which memory addresses represent which physical rows in the RAM chips, i.e. working out where to hammer.

                    The mapping algorithms are more or less kept secret by Intel, so some guesswork is involved in working out the optimal addresses, and we know we aren't always getting it right.
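
                    To make the dependence on the mapping concrete, here is a toy Python sketch. The bit layout below (13 column bits, 3 bank bits, the rest row bits) is entirely hypothetical and does not describe any real memory controller; `decode`, `encode` and `double_sided_aggressors` are made-up names. It only shows how a test would pick the two neighbouring rows for a double hammer once it has guessed a mapping.

```python
from typing import NamedTuple, Tuple


class DramLocation(NamedTuple):
    bank: int
    row: int
    column: int


# HYPOTHETICAL layout, for illustration only: real controllers differ
# and often hash several address bits together.
COLUMN_BITS = 13
BANK_BITS = 3


def decode(phys_addr: int) -> DramLocation:
    """Split a physical address into (bank, row, column) under the guess."""
    column = phys_addr & ((1 << COLUMN_BITS) - 1)
    bank = (phys_addr >> COLUMN_BITS) & ((1 << BANK_BITS) - 1)
    row = phys_addr >> (COLUMN_BITS + BANK_BITS)
    return DramLocation(bank, row, column)


def encode(loc: DramLocation) -> int:
    """Inverse of decode() under the same guessed layout."""
    return ((loc.row << (COLUMN_BITS + BANK_BITS))
            | (loc.bank << COLUMN_BITS)
            | loc.column)


def double_sided_aggressors(victim_addr: int) -> Tuple[int, int]:
    """Addresses in the rows just below and above the victim, same bank."""
    victim = decode(victim_addr)
    if victim.row == 0:
        raise ValueError("row 0 has no lower neighbour")
    below = victim._replace(row=victim.row - 1)
    above = victim._replace(row=victim.row + 1)
    return encode(below), encode(above)


victim = encode(DramLocation(bank=2, row=100, column=0))
low, high = double_sided_aggressors(victim)
print(decode(low), decode(high))
```

                    If the guessed layout is wrong, the two addresses returned here are not physically adjacent to the victim row at all, and the test hammers the wrong place without reporting anything.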


                    • #11
                      Once the mapping is known, any address is going to suffer from hammering? Or it is going to affect only a few rows?

                      Are there many combinations possible for that hammering?

                      The memtest86+ with rowhammer certainly tests only a subset of addresses, as the documentation explains. Nevertheless, the tests take little more than 5 minutes; I wonder whether it could be coded to run for longer than 5 minutes and do a more exhaustive test, if the user is willing to wait.


                      • #12
                        any address is going to suffer from hammering? Or it is going to affect only a few rows?
                        There will be some rows that are more susceptible to hammering than others. It could be the case that the threshold for bitflips is very low in some RAM sticks, so most rows will be affected. In other cases it might be that the design of the RAM is such that it is sitting right on the threshold and only a few rows might be affected.

                        However, I would think that in most cases you would have either the situation where many rows are affected or the one where no rows are affected.


                        • #13
                          We got the 4 DIMMs replaced under RMA. The replacements pass all tests in memtest86+ 5.01 with rowhammer.

                          So, we can't get any relevant log for memtest86 6.3.0, and the only remaining trace is my own memory: I did not notice any warning or error on screen.


                          • #14
                            Originally posted by dsullaustin
                            Hello Laq,

                            When Passmark changed their Rowhammer test in 9/2015, I brought this to their attention. At this point, it was pretty obvious that they sold out to the memory vendors and redesigned their Rowhammer test to allow failing systems to pass. So, they get a pat on the head from the memory and system vendors, but end users, such as yourself, get left with data corrupting systems
                            I totally agree with your point! PassMark MemTest86 should report the first-pass errors as what they are: ERRORS, not a tricky and shady NOTE that leaves me wondering, since it only states that my memory MAY be affected by bit flip attacks. Well, it not only MAY be affected by such attacks, it also WILL BE PRONE to corrupted data because of these failures.

                            My MemTest86 results show a NOTE on each pass but no errors, yet with the ROWHAMMER test from GitHub I'm able to produce errors in every single test I run.

                            So, David (PassMark), I suggest showing the actual first-pass errors by default. I will start spreading the word.
                            Last edited by traktorkontrol; Aug-07-2016, 07:59 PM.


                            • #15
                              We don't have exact numbers, but let's pretend that 30% of all computers fail the row hammer test. So maybe a billion machines. Should we be responsible for claiming all these machines are bad, when only maybe 1% of them are bad enough to actually cause a problem in normal use? We would be scaring and confusing a huge number of people unnecessarily, most of whom have zero understanding of what row hammer is.

                              There needs to be some type of balance, given that many of the MemTest86 users aren't very technical.
                              If you are an advanced user, then just treat the warning as an error. It is your choice.
