How to relate to errors in Hammer Test 13 ?

  • How to relate to errors in Hammer Test 13 ?

    Just rebuilt most parts of my system. Always test with the latest MemTest86 (6.0 beta3), which unfortunately now includes Hammer Test 13. Get around 40 bit errors on a single run of the hammer test, consistently. No errors on other tests. Will double-check with a multi-hour run tonight, excluding hammer.

    [Attached image: 20150205_01_memtest86_photo.png]

    Relevant hardware specifications:
    - MB: ASUS Z97-WS (hw-rev: 1.02) (BIOS: 2013)
    - CPU: i7-4790K (@stock 4.00GHz, stepping C0)
    - RAM: 2x8GB Corsair Vengeance Pro DDR3-1600MHz Ver8.22 (9-9-9-24/1.5V)
    - RAM: Manufacture Date: Week 15 / 2013

    Also tried with DDR3-1333MHz CL9 & CL11 timings. No difference on hammer test, about 40 bit errors each time.

    My question: Any guesstimate/recommendation on how we should relate to this type of memory error going forward?

    Have tried to read up on the research paper mentioned, and other posts. I do understand that these errors look highly unlikely in normal operation. But having lived through the consequences of 'bad memory' a few times, I always try to avoid it at all costs. My stance is that if CPU/MB/RAM are within specifications, and memory reads/writes are within specification, any bit errors should be miles apart, independent of how contrived the scenario is. Not blaming you guys though, just fed up with QA control on RAM and SSDs :/

    Already ordered a set of 2x8GB Kingston HyperX Savage DDR3-1866 HX318C9SRK2/16 to test. Not to run it at DDR3-1866, but because they are spec'ed with profiles for JEDEC DDR3-1600 CL11 and XMP DDR3-1600 CL9 (also 1.5V at DDR3-1866). Hoping the binning for DDR3-1866 gives a better chance of quality/stability at DDR3-1600. Just a theory, may be way off. I never overclock CPU/RAM; stability is pri-1-2-3, pri-4 is ok'ish speed.

    Wild theory: You guys mention that hammer test 13 can touch UEFI BIOS subsystem/bits in some cases. Is it at all possible that UEFI BIOS logic can change subsystem/bits at the same time as MemTest86 is running ? Just looking for all possible angles, if there are fringe cases of false positives.

    PS: If any value. Can give you several MemTest86 logs with hammer test bit errors.

  • #2
    I think they are real errors. Your wild theory doesn't explain why so many different memory addresses are affected, nor why the errors are all single bit errors.

    As to the question of what to do about it. It is impossible to predict with any accuracy if these errors will occur in real life applications. To predict that one would need a complete list of all applications that you might run on the machine and then do a forensic analysis of each application to study how it makes use of the RAM while it executes. And even then you still couldn't be sure you checked all possible execution paths and all the different permutations of the software running at the same time. Different inputs into the software might lead to different behaviour.

    Even when an error does occur, in many cases it might go unnoticed. The error might produce a wrong result in an Excel spreadsheet, a loop that finishes slightly too early, an apparent spelling mistake on a web page, or a pixel being the wrong colour in an image. It's impossible to predict.

    My own PC throws a single error in the row hammer test, but it has been rock solid in normal office / development use for 12 months.

    If this was a computer running some bit of medical equipment, flying an aircraft or running in a bank, you'd replace the RAM without question, and probably also switch to ECC RAM. A 1 bit difference in someone's bank balance is pretty catastrophic for a bank. But for home use you can probably live with the errors until it manifests itself as a real problem.


    • #3
      Thanks for taking the time to answer,

      As long as they are real errors I need to find out how to relate to them. Ran a new 8+ hour test, excluding hammer: 5 passes with 0 errors. By the 'old' way of testing, my RAM is ok.

      I know this is impossible to answer for now. But I really would like, at some point, basic rules to go by. I.e., as long as there are 0 errors on other tests, and <10 (or something) on hammer, probably ok'ish RAM. Because I am guessing that after v6 is out of beta, you will get an influx of people asking questions. Will of course depend on the % of real-life RAM affected by this problem (vs the % in the paper).

      I'll get the new RAM in a week. See how many errors I get on that RAM, if any. Probably test a few other variables too. Personally I want 0, obviously. Tired of equipment failing me, where it should not. First series of OCZ Vertex SSD's, Intel 320 SSD 8MB fiasco, Intel 530 SSD BIOS boot incompatibility, Samsung 840 TLC decay-read-slow. Aaaaargh, I just want a stable machine.

      I'll update the thread with statistics of new RAM when arrived.

      PS1: I could have gone for ECC. Always done it before (Intel C2xx chipsets). Though this time I wanted as fast as possible single-threading, without overclocking (i.e. i7-4790K). Shame that Intel gimps their mainstream CPU/chipsets of ECC support. Although, paper mentions that simple ECC cannot prevent all hammer errors (sigh).

      PS2: Just for reference. Did two runs of all tests on my laptop. Got 1 hammer error on second round. Low address, multiple bits (Address: 3C, Expected: FFFFFFFF, Actual: F000F065). For now I'll just ignore it, most stable laptop I've ever had.

      [Attached image: 20150206_02_memtest86_photo.png]

      Relevant hardware specifications:
      - MB: ThinkPad X230 (BIOS: 2.62)
      - CPU: i7-3520M (@stock 2.90GHz)
      - RAM: 2x8GB Kingston KTL-TP3C/8G DDR3-1600 (11-11-11-28/1.5V)
      - RAM: Manufacture Date: Week 24 / 2012


      • #4
        "But I really would like, at some point, basic rules to go by."
        Clearly the more errors you have during testing, the higher the probability is of the errors manifesting themselves in real life applications.

        It is like defects in your DNA leading to cancer. You might get lucky and the defects aren't anywhere important, or you might be unlucky and get cancer.

        Without some major scientific study of 1000s of computers and their usage patterns, nobody knows to what degree the low level row hammer error rate translates into real failures.

        As you point out, your machine (and my desktop) are both very stable despite a row hammer error.

        In our testing most of the errors have been single bit. So ECC must help.

        If you really want to be worried read this article on Cosmic Rays causing RAM errors.
        http://en.wikipedia.org/wiki/Soft_error
        "IBM estimated in 1996 that one error per month per 256 MiB of ram was expected for a desktop computer"
        So with 32GB of RAM, this is several RAM errors per day from Cosmic Rays at ground level.

        They then say,
        "Computers operated on top of mountains experience an order of magnitude higher rate of soft errors compared to sea level. The rate of upsets in aircraft may be more than 300 times the sea level upset rate."

        It is surprising that anything works really.

        Update: Further research shows those numbers from IBM look too high & other studies report a lesser rate.
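
For reference, the "several RAM errors per day" figure is just the quoted 1996 estimate scaled up linearly. A minimal sketch of that arithmetic follows; it takes the IBM number at face value (which, per the update above, is probably too high), treats "32GB" as 32 GiB, and assumes a 30-day month. The function name is made up for illustration.

```python
# Back-of-envelope soft-error rate, taking the quoted 1996 IBM figure at face value.
ERRORS_PER_MONTH_PER_256_MIB = 1.0   # the estimate quoted in the post
DAYS_PER_MONTH = 30                  # assumption: a 30-day month


def soft_errors_per_day(ram_gib: float) -> float:
    """Expected soft errors per day for ram_gib GiB of RAM under the quoted rate."""
    blocks_of_256_mib = ram_gib * 1024 / 256
    return blocks_of_256_mib * ERRORS_PER_MONTH_PER_256_MIB / DAYS_PER_MONTH


if __name__ == "__main__":
    for gib in (16, 32):
        print(f"{gib} GiB: {soft_errors_per_day(gib):.1f} errors/day")
```

With 32 GiB this gives 128 blocks of 256 MiB, so 128 errors a month, or a little over 4 a day. Later studies reporting lower rates would scale the result down proportionally.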


        • #5
          Thanks again,

          I do get that there are no real answers yet, and that there may never be. I am certainly NOT demanding it. Not paid a dime for MemTest86. Although, very satisfied with the purchase and usage of PerformanceTest and BurnInTest Pro. Always respected the non-bloat design of them. Keep that up.

          Agree that our imperfect world cannot give predictable 0 or 1s all the time. We humans arrange these 0 and 1s, and mess the logic up. In addition the containers with 1 and 0s are always pushed to the limit, sometimes giving the opposite answer of what we put in them.

          Under these imperfect conditions I try to simplify and be pragmatic. Unstable machine: Test with MemTest86, if error, swap RAM. In majority of cases, have solved problem. With this experience I also run MemTest86 on new machines. Good track record of 0 errors giving stable machines, and vice versa. It is an invaluable tool. Hence my current dilemma: How to evaluate MemTest86 with only error(s) in Hammer Test 13.

          Again, I am not asking PassMark to solve this. You have already given us a new RAM test, without specialized hardware. Which is impressive.

          But I would like to constructively compare notes and experience with others through this forum. Getting a sample size of 1000 is not that hard on the internet. Although, defining / getting good data / organizing it, that is difficult.

          If your forum is not intended for this purpose, say so. I'll show myself out the door and find another soapbox to stand on. It is your forum, your utility, your rules. I have no problems with that.

          PS: If we are drawing parallels to DNA. One consequence of this new Hammer Test 13 might be like those new personal DNA sequencing tests. You get a lot of data. But in real life, not much to do about it, except worry. In the MemTest86 case, I'm that idiot.


          • #6
            Finally got my new Kingston RAM. Relevant hardware specifications:
            - MB: ASUS Z97-WS (hw-rev: 1.02) (BIOS: 2013)
            - CPU: i7-4790K (@stock 4.00GHz, stepping C0)
            - RAM: 2x8GB Kingston HyperX Savage DDR3-1866MHz (@1600MHz, 9-9-9-27-2T/1.5V)
            - RAM: Manufacture Date: Week 03 / 2015

            A few quick tests with only Hammer Test 13 (MemTest86 6.0.0, release). Then an 8+ hour run with all tests, about 4 passes. 0 errors all around.

            Think that does it for me. Happy having 0 on hammer. Would I have gotten problems with the Corsair RAM in daily use? I'm guessing not. All other tests, except hammer, gave 0 errors. I also installed Win 8.1, lots of programs/drivers, pushed a few TB with verify, all on those Corsair sticks. Not a single OS/application/driver crash, no failed verify. The new Kingston sticks give me peace of mind though.

            For reference, did a few extra tests while at it. Test machine I use for various stuff:
            - MB: ASUS B85-PLUS (BIOS: 2201)
            - CPU: Intel G3220 (@stock 3.00GHz)
            Following two 4GB RAM sticks:
            - RAM: 4GB Kingston DDR3-1333MHz KVR1333D3N9H/4G (@1333MHz, 9-9-9-24-1T/1.5V)
            - RAM: Manufacture Date: Week 40 / 2013
            - RAM: 4GB Kingston DDR3-1600MHz KVR16N11S8/4 (@1333MHz, 9-9-9-24-1T/1.5V)
            - RAM: Manufacture Date: Week 44 / 2014

            Both 4GB sticks gave 0 errors on hammer, 4 passes. I did try the Corsair sticks too (@1333MHz, 9-9-9-24-1T/1.5V). They gave consistently about 20 bit errors each time (B85-PLUS), vs. 40 on my main system (Z97-WS). Separate sticks about 10 on each.
            Last edited by sveinan; Feb-16-2015, 08:44 PM. Reason: corrected to B85-PLUS


            • #7
              Just wanted to add another data point -- I bought some new RAM (2x8GB Kingston HyperX Fury DDR3-1866) and ran MemTest as soon as I got them in there. The sticks failed test 13 only, with around 211 errors on test 13 each pass.

              After seeing this thread I put in an order for the Kingston HyperX Savage, although I got the 2x8GB DDR3-2400 instead of DDR3-1866 (I'm sure they use the same chips but with some binning). Fired up MemTest, 0 errors on all tests including 13.

              So it definitely sounds like Kingston HyperX Savage is a good bet on passing the hammer test.

              My theory: these are quite new on the market, last 2-3 months at most. It's possible some revisions have been made to make these sticks non-susceptible. The Kingston HyperX Fury, on the other hand, have been around for at least 8 months, meaning logically they were designed prior to the publication of this susceptibility. I wouldn't be surprised if most sticks in the last year or two (especially very high density ones) fail test 13, and maybe the manufacturers are now coming out (quietly) with revisions to fix it.
              Last edited by klmd; Feb-20-2015, 06:12 AM.


              • #8
                Here's the report from the HyperX Fury 1866 sticks. Z77 board (Gigabyte G1.Sniper M3)

                System Information

                EFI Specifications 2.31
                CPU Type Intel Core i7-3770K @ 3.50GHz
                CPU Clock 3504 MHz [Turbo: 4104.1 MHz]
                # Logical Processors 8
                L1 Cache 64K (115714 MB/s)
                L2 Cache 256K (62475 MB/s)
                L3 Cache 8192K (38081 MB/s)
                Memory 16439M (22742 MB/s)
                DIMM Slot #0 8GB DDR3 PC3-14200
                Kingston / KHX1866C10D3/8G
                10-11-10-29 / 888 MHz / 1.5V
                DIMM Slot #1 8GB DDR3 PC3-14200
                Kingston / KHX1866C10D3/8G
                10-11-10-29 / 888 MHz / 1.5V

                Result summary

                Test Start Time 2015-02-17 16:16:19
                Elapsed Time 2:53:21
                Memory Range Tested 0x0 - 41F000000 (16880MB)
                CPU Selection Mode Single: CPU # 0
                # Tests Passed 20/21 (95%)
                Lowest Error Address 0x15145358 (337MB)
                Highest Error Address 0x41EBFC994 (16875MB)
                Bits in Error Mask 00000000FFFDFFD7
                Bits in Error 29
                Max Contiguous Errors 1
                Test # Tests Passed Errors
                Test 0 [Address test, walking ones, 1 CPU] 2/2 (100%) 0
                Test 1 [Address test, own address, 1 CPU] 2/2 (100%) 0
                Test 2 [Address test, own address] 2/2 (100%) 0
                Test 3 [Moving inversions, ones & zeroes] 2/2 (100%) 0
                Test 4 [Moving inversions, 8-bit pattern] 2/2 (100%) 0
                Test 5 [Moving inversions, random pattern] 2/2 (100%) 0
                Test 6 [Block move, 64-byte blocks] 2/2 (100%) 0
                Test 7 [Moving inversions, 32-bit pattern] 2/2 (100%) 0
                Test 8 [Random number sequence] 2/2 (100%) 0
                Test 9 [Modulo 20, ones & zeros] 1/1 (100%) 0
                Test 10 [Bit fade test, 2 patterns, 1 CPU] 1/1 (100%) 0
                Test 13 [Hammer test] 0/1 (0%) 228
                Last 10 Errors
                [Data Error] Test: 13, CPU: 0, Address: 41D844130, Expected: FFFFFFFF, Actual: FFFFFEFF
                [Data Error] Test: 13, CPU: 0, Address: 413847674, Expected: FFFFFFFF, Actual: FFFFEFFF
                [Data Error] Test: 13, CPU: 0, Address: 40E3FC6FC, Expected: FFFFFFFF, Actual: DFFFFFFF
                [Data Error] Test: 13, CPU: 0, Address: 40C846850, Expected: FFFFFFFF, Actual: FBFFFFFF
                [Data Error] Test: 13, CPU: 0, Address: 401046B20, Expected: FFFFFFFF, Actual: FEFFFFFF
                [Data Error] Test: 13, CPU: 0, Address: 3FD447BB0, Expected: FFFFFFFF, Actual: FBFFFFFF
                [Data Error] Test: 13, CPU: 0, Address: 3F97FD8D4, Expected: FFFFFFFF, Actual: FFFFF7FF
                [Data Error] Test: 13, CPU: 0, Address: 3F7C45B14, Expected: FFFFFFFF, Actual: FFFFEFFF
                [Data Error] Test: 13, CPU: 0, Address: 3F0FFF70C, Expected: FFFFFFFF, Actual: FF7FFFFF
                [Data Error] Test: 13, CPU: 0, Address: 3EC3FDFB4, Expected: FFFFFFFF, Actual: FFFFEFFF
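
One way to check whether the errors in a report like this are all single-bit flips is to XOR each "Expected" word with its "Actual" word and count the set bits. The sketch below does that for a few lines copied from the report above. The regular expression and helper names are assumptions made for illustration, not part of MemTest86.

```python
import re

# Sample lines copied from the "Last 10 Errors" section of the report above.
LOG = """\
[Data Error] Test: 13, CPU: 0, Address: 41D844130, Expected: FFFFFFFF, Actual: FFFFFEFF
[Data Error] Test: 13, CPU: 0, Address: 413847674, Expected: FFFFFFFF, Actual: FFFFEFFF
[Data Error] Test: 13, CPU: 0, Address: 40E3FC6FC, Expected: FFFFFFFF, Actual: DFFFFFFF
"""

# Assumed line layout, based only on the report text shown in this thread.
LINE_RE = re.compile(
    r"Address: ([0-9A-F]+), Expected: ([0-9A-F]+), Actual: ([0-9A-F]+)"
)


def flipped_bits(expected: int, actual: int) -> list[int]:
    """Return the bit positions (0 = least significant) where the words differ."""
    diff = expected ^ actual
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


def parse_errors(text: str):
    """Yield (address, flipped bit positions) for each data-error line."""
    for match in LINE_RE.finditer(text):
        addr, expected, actual = (int(group, 16) for group in match.groups())
        yield addr, flipped_bits(expected, actual)


def popcount(mask: int) -> int:
    """Number of set bits in mask."""
    return bin(mask).count("1")


if __name__ == "__main__":
    for addr, bits in parse_errors(LOG):
        kind = "single-bit" if len(bits) == 1 else "multi-bit"
        print(f"0x{addr:X}: {kind}, flipped bit(s) {bits}")
    print("bits set in error mask:", popcount(0x00000000FFFDFFD7))
```

Each of the sampled lines differs from the expected word in exactly one bit, and the popcount of the reported mask 00000000FFFDFFD7 is 29, which agrees with the "Bits in Error 29" line. The laptop error quoted earlier in the thread (Expected FFFFFFFF, Actual F000F065) comes out as a multi-bit error under the same check.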


                • #9
                  ...and maybe the manufacturers are now coming out (quietly) with revisions to fix it.
                  Yes maybe.

                  Nice to see the problem go away after the RAM was swapped. Also interesting is the spread of errors across the bits and across the address space. Both are more spread out than you would see with typical RAM errors. That is to say, it is random. But all are single bit flips, so ECC would be effective at fixing it in your case.
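
To illustrate why single bit flips are the easy case for ECC, here is a toy single-error-correcting, double-error-detecting (SECDED) code: Hamming(7,4) plus one overall parity bit. This is only a sketch under stated assumptions. ECC DIMMs typically protect 64 data bits with 8 check bits rather than 4 data bits with 4, and nothing here reflects how any particular memory controller is implemented.

```python
def encode(nibble: int) -> list[int]:
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                       # index 0: overall parity, 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total parity even
    return code


def decode(code: list[int]):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = code[:]                       # work on a copy
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions of all set bits
    parity_odd = sum(code) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd total parity: treat as one flip; the syndrome is its position
        # (0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped, cannot repair.
        return "uncorrectable", None
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    word[6] ^= 1                         # simulate one hammer-style bit flip
    print(decode(word))
    word[2] ^= 1                         # a second flip in the same word
    print(decode(word))
```

A single flip anywhere in the codeword is repaired, while two flips in the same word are only detected. That matches the caveat raised earlier in the thread that simple ECC cannot prevent all hammer errors: a multi-bit error within one protected word is beyond what this kind of code can correct.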

                  We had an anonymous contact offering to act as a go-between for us and unnamed memory companies, with a view to paying us not to release the new version of MemTest86. Who knows how serious the offer was.

                  Needless to say we didn't take up that option, and just released the software anyway.

                  But the issue is a BIG issue. The lack of publicity up to now is somewhat surprising considering the implications. Many computers are fundamentally (slightly) unreliable in random ways. Maybe this doesn't matter for home use, but for medical devices, banking systems, flight control systems, etc. it is a big deal.

                  I'll be using ECC RAM on our next server. At least you'll eventually get a warning if things go bad.

                  Equally worrying is that our algorithm for provoking the problem is probably non-optimal. Meaning that with perfect knowledge of the addressing scheme on each CPU, the channels in use, the RAM timings, etc., we could probably force even more errors. The current algorithm is fairly general and not targeted at any particular RAM setup or CPU.


                  • #10
                    Originally posted by David (PassMark)
                    Needless to say ...
                    Not needless. Too much scheming and too many shortcuts taken in the tech business. So thank you for your integrity.

                    Originally posted by David (PassMark)
                    But the issue is a BIG issue. The lack of publicity up to now is somewhat surprising considering the implications. ...
                    It is depressing. But I am guessing it'll blow up slowly. Like the Samsung 840 EVO, which just went into round two. Everyone can measure the impact on the 840 EVO. Still, Samsung are dragging their feet, trying to create firmware/software solutions. With Hammer 13 I could guess a myriad of PR speak: Within 'normalized' specifications... Negligible impact with normal usage... and so on :/ But as you rightly pointed out, there are systems where a single unintended bit flip can have a major impact. And you can bet many of them are using normal RAM where ECC would be sensible (cost).

                    Originally posted by David (PassMark)
                    Equally worrying is that our algorithm for provoking the problem is probably non optimal ...
                    Not sure if I'm looking forward to or dreading any possible enhancements.


                    • #11
                      Originally posted by sveinan
                      - RAM: 2x8GB Kingston HyperX Savage DDR3-1866MHz (@1600MHz, 9-9-9-27-2T/1.5V)
                      - RAM: Manufacture Date: Week 03 / 2015
                      Do you mind giving the model # and retailer? I just bought some G.Skill Ripjaws to replace G.Skill Snipers and they both fail miserably. I think I'll be sending the Ripjaws back for credit and purchasing what you bought. I have a similar setup (Asus mobo w/ 4790K).


                      • #12
                        Also FYI:
                        After we started this thread some researchers at Google came out with some demo code that used Row hammer faults to take over a computer (in specific circumstances). So it is a security and reliability issue now.
                        http://googleprojectzero.blogspot.co...g-to-gain.html


                        • #13
                          OK, I took a chance and ordered the HyperX Savage 16GB 2400 off Amazon. I ordered these ones and am keeping my fingers crossed that they will be OK.

                          http://amzn.com/B00N9PVZ3O


                          • #14
                            Originally posted by Armchair
                            Do you mind giving the model # and retailer? ...
                            Originally posted by Armchair
                            OK, I took a chance and ordered the HyperX Savage 16GB 2400 off Amazon. I ordered these ones and keeping my fingers crossed that they will be OK. http://amzn.com/B00N9PVZ3O
                            A little late. But what you ordered is the same series as mine, just binned even better probably. You wouldn't want my retailer/price here in Norway anyway. For reference, the series I bought from was Kingston HyperX Savage. Basic idea: Newly introduced 'high-end' series, newly produced, newer chips, better tested for row hammer (maybe). I went for the DDR3-1866 HX318C9SRK2/16 2x8GB (PDF spec). I just want to run as close to 'standard' specs as possible (stability/quality, then performance). One of the XMP profiles on the DDR3-1866 sticks is DDR3-1600 9-9-9-27-2T/1.5V, using that.

                            As you probably have observed you have bought the same DDR3-2400 HX324C11SRK2/16 2x8GB (PDF spec) as user klmd a few posts back (he got 0 errors on them). Not many datapoints to brag about though. But I hope and think you'll be ok. Hopefully this series will prove to be tested/resistant to row hammer going forward. For those that care.

                            Originally posted by David (PassMark)
                            Also FYI: After we started this thread some researchers at Google came out with some demo code that used Row hammer faults to take over a computer (in specific circumstances). So it is a security and reliability issue now.
                            http://googleprojectzero.blogspot.co...g-to-gain.html
                            Yeah, I saw that. Very creative usage :/ Since last time, someone has also created a good Row hammer Wikipedia page (MemTest86 6.0.0 mentioned at bottom of it). A little disappointed not more about it in tech press.


                            • #15
                              Originally posted by sveinan
                              A little late. But what you ordered is same series as mine, just binned even better probably. You wouldn't want my retailer/price here in Norway anyway For reference, the series I bought from was Kingston HyperX Savage. Basic idea: Newly introduced 'high-end' series, newly produced, newer chips, better tested for row hammer (maybe). I went for the DDR3-1866 HX318C9SRK2/16 2x8GB (PDF spec). I just want to run as close to 'standard' specs as possible (stability/quality then performance). One of the XMP profiles on the DDR3-1866 sticks is DDR3-1600 9-9-9-27-2T/1.5V, using that.

                              As you probably have observed you have bought the same DDR3-2400 HX324C11SRK2/16 2x8GB (PDF spec) as user klmd a few posts back (he got 0 errors on them). Not many datapoints to brag about though. But I hope and think you'll be ok. Hopefully this series will prove to be tested/resistant to row hammer going forward. For those that care.


                              Yeah, I saw that. Very creative usage :/ Since last time, someone has also created a good Row hammer Wikipedia page (MemTest86 6.0.0 mentioned at bottom of it). A little disappointed not more about it in tech press.

                              Yeah, Norway would have taken a bit of shipping time. I wanted to get the order in before the weekend. I noticed klmd had ordered the DDR3-2400 as well and it was cheaper than the 1866 at Amazon so I thought why not.
