Weird bit fade errors

  • Weird bit fade errors

    Hi! I could really do with some practical advice...

    I ran MemTest86 5.1.0 on a machine that has been crashing fairly regularly - once every two to three weeks, roughly. It reported multiple bit fade errors on its first two runs (1, 2) with all four DIMMs plugged in.

    I then tested each pair of DIMMs separately, and then all four DIMMs back again in their original sockets, testing 10 full passes each time - but all tests ran perfectly and without errors. Those bit fade errors seem to have vanished.

    Here's the trouble: The machine is deployed as part of a voluntary / community project - it needs to run reliably, 24/7, with no technical support on hand. (And yes, I'm aware that a server class machine with ECC memory would be much more appropriate!) There's literally no budget to replace the memory or entire machine. The memory has a lifetime warranty, but given that I can't narrow down which stick has the intermittent fault, I'm not going to have any luck claiming warranty replacement for all four DIMMs on the basis of failed tests that can no longer be replicated.

    Can anyone offer any practical advice on what I should do now? How many successful tests would legitimise simply ignoring the former bit fade errors? Is there anything else I can do to try to narrow down which DIMM is faulty, or even if it's the motherboard, etc?

    Any hints would be greatly appreciated.

    Cheers!

  • #2
    The Bit Fade Test verifies whether memory contents are retained after several minutes, so any errors it reports are caused by changes in memory contents within that time period. This could be due to any number of factors, such as external interference, a defect in a DIMM or issues with the firmware. The error bits can sometimes give you a clue to the possible cause (e.g. a few 1-bit errors could point to a hardware-related fault). Your best bet may be to swap in different modules (even cheap, second-hand ones) to see if the behaviour changes.
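To make the write / hold / verify idea concrete, here is a rough Python sketch. It is only an illustration of the principle, not how MemTest86 itself is implemented: a user-space script cannot choose physical addresses or bypass caches, and the function names, pattern value and hold time below are all made up for the example.

```python
import time


def flipped_bits(expected: bytes, actual: bytes):
    """Return (offset, xor_mask) for every byte whose value changed."""
    return [(i, e ^ a) for i, (e, a) in enumerate(zip(expected, actual)) if e != a]


def count_bits(errors):
    """Total number of individual bits that differ across all reported bytes."""
    return sum(bin(mask).count("1") for _, mask in errors)


def bit_fade_pass(size: int, pattern: int, hold_seconds: float):
    """Fill a buffer with one byte pattern, leave it alone, then re-check it.

    Hypothetical helper: the pattern and hold time are assumptions chosen
    for illustration, not values taken from any real memory tester.
    """
    expected = bytes([pattern]) * size
    buf = bytearray(expected)      # write phase
    time.sleep(hold_seconds)       # hold phase: contents are not touched
    return flipped_bits(expected, bytes(buf))  # verify phase


if __name__ == "__main__":
    errors = bit_fade_pass(size=1 << 20, pattern=0xFF, hold_seconds=1.0)
    print(f"{len(errors)} bytes changed, {count_bits(errors)} bits flipped")
```

The XOR mask for each changed byte shows exactly which bits flipped, which is the kind of "error bits" clue mentioned above: a handful of isolated single-bit flips looks different from whole bytes or words going wrong.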



    • #3
      Just wanted to follow up on my original post with a note in praise of the new hammer test in the 6.0 beta version.

      It turns out that the confusing intermittent fault I was seeing before, which was exposed occasionally in the 5.1.0 bit fade test, is exposed reliably - every single pass - in the new hammer test. So now I've been able to narrow down definitively which DIMM is causing the trouble, and I'll be able to pursue warranty support accordingly.

      So: Thumbs up for both 6.0 beta and the hammer test - they've proved to be a major help. Great job - thank you!



      • #4
        Nice to know it helped.

        We are a bit worried that the hammer test might expose too many errors, i.e. errors that are real but happen extremely rarely in real life. For example, one of our test machines fails the hammer test some of the time, but is pretty stable in day-to-day usage.



        • #5
          Hi David -

          I can see what you mean, but ultimately an error is an error is an error!

          Especially since the industry has worked so hard over the years to gradually and artificially restrict ECC-capable hardware to expensive server-class equipment, even a theoretical error is important.

          In my case it was a machine that runs 24/7 and crashed every two to three weeks. But increasing numbers of users suspend their machines instead of shutting down every day, so critical data structures might be quietly sitting right on a fault and not moving around the map from one week to the next. Who's to know how many people are affected? (Out of curiosity, I left the test running on four other machines last night, and they all passed without fault.)

          Having seen enough desperately shoddy hardware and software design over the years, I find it hard to feel sorry for vendors. Even if the chance of being affected by a fault is mostly theoretical, and even if the actual rate of affected modules is nowhere near what's described in the Kim paper, I sincerely hope everyone runs this test on their machines and makes use of their warranties where possible.

          It's the vendors' responsibility to fix it. As users it's our responsibility to make 'em wear it!
