Cosmic ray incidents, addresses, and testing strategy

  • Cosmic ray incidents, addresses, and testing strategy

    I bought some RAM and ran MemTest86 on it thinking, “I’ll just do a set of 4 passes for peace of mind since if no errors come up, it’s unlikely that more will come up afterwards.”

    During the test a single error came up in test 7, so then I ran 3 more tests of 4 passes each and all passed.

    So that got me thinking - since random errors are possible because of cosmic rays and such, how many more tests should I run to rule out the possibility of the RAM being defective? 24 hours of tests? More?

    Is there a length of time between errors where you’d say, “This is too frequent to be a coincidence”?
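One way to frame the "too frequent to be a coincidence" question is to treat random soft errors as a Poisson process and ask how likely the observed error count would be under a background rate. The sketch below does that. The background rate in it is purely a hypothetical placeholder, not a measured figure for any RAM, and `prob_at_least` is an illustrative helper name.

```python
import math

def prob_at_least(k: int, rate_per_hour: float, hours: float) -> float:
    """P(at least k events) for a Poisson process with the given rate."""
    mean = rate_per_hour * hours
    # Sum the probabilities of seeing 0 .. k-1 events, then take the complement.
    below = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1.0 - below

# ASSUMPTION: hypothetical background rate of one random error per 1000 test hours.
ASSUMED_RATE = 1 / 1000

for hours in (3, 12, 24):
    p1 = prob_at_least(1, ASSUMED_RATE, hours)
    p2 = prob_at_least(2, ASSUMED_RATE, hours)
    print(f"{hours:>2} h: P(>=1 error) = {p1:.4f}, P(>=2 errors) = {p2:.6f}")
```

If the probability of the observed count is tiny under any background rate you consider plausible, coincidence becomes a poor explanation. The conclusion is only as good as the assumed rate, so it is worth trying a few values.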

    A part of me is tempted to let it slide because the error could have come up on the 4th test instead of the first. If it did, I never would have seen it because if I stuck to my original plan, I might have run a second test overnight and maaaaybe run a third a bit later, but almost definitely wouldn’t have run a 4th test if the first 3 tests passed.

    I’m a bit reluctant to return it because I’ve already bought and returned 2 other sets of RAM because of errors - one with very obvious, frequent errors on one stick, and another with infrequent errors (about 1 per 3 hour test). So I’m left wondering how common this is and if I’ll end up spending weeks testing and exchanging sets before I find a good one. I also don’t want to exchange it unless I’m sure something is wrong.

    As an aside, I’ve noticed that test 7 has revealed more errors than any other for the sets of RAM I’ve checked.

    Some other questions.

    When there is an error and an address is shown, does that address change if you:
    • Reboot the computer
    • Switch the 2 sticks of RAM in the same sockets
    • Switch the 2 sticks of RAM again in the same sockets (back to original position)
    • Put the RAM in different dual channel sockets
    • Put the RAM back into the same dual channel sockets as before (back to original position)
    • Give the RAM to someone else to test on a different machine (different motherboard and CPU)
    If I know the address and test it failed on (eg. test 7), would it make sense to:
    • Only run test 7 multiple times on the range containing that address
    • Run all tests multiple times on the range containing that address
    • Run test 7 multiple times on all addresses
    Or is it better to just run tests normally?

    If I can't get errors to show up predictably within a sane time frame (preferably an error in less than 3 hours), it makes it impractical to troubleshoot it (by raising the voltage, etc.).

  • #2
    >When there is an error and an address is shown, does that address change if you...

    It depends on the type of error. If it is in fact cosmic rays, temperature sensitivity, or EMI then one would expect it to be purely random or somewhat random.

    If it's a flaw in manufacturing, then it should be a lot more consistent. Totally dead bytes (e.g. bits stuck high, or stuck low) are typically picked up in manufacturing. So the errors that get into the field are often the more subtle ones, e.g. a manufacturing defect that is only apparent under certain timings, patterns, speeds or temperatures.

    Rebooting should not change the memory address, but the other steps probably will. I can't be sure, though: for example, if the different machine is the same model, then the addresses might be the same.


    >As an aside, I’ve noticed that test 7 has revealed more errors than any other for the sets of RAM I’ve checked.

    From what people have reported, tests 6 and 7 are where most errors are detected.

    >If I know the address and test it failed on (eg. test 7), would it make sense to...

    If it's failing on just test 7, it might make sense to run just test 7 multiple times and on all addresses.



    • #3
      Thanks for the response Simon!

      I did some more tests after my post and another error came up. This one was very close to the first in address:
      • First error - 19F57D7790 (6645 MB) on Test 7 (on first set of 4 passes)
      • Second error - 19F785790 (6647 MB) on Test 8 (on 6th set of 4 passes)

      Two errors that close together in address go beyond cosmic coincidence, so a subtle defect seems most likely.
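To put a rough number on "beyond cosmic coincidence", the sketch below converts an address to its megabyte offset and estimates the chance that two independent, uniformly random errors land within 2 MB of each other. The 16 GiB total is an assumption (the kit size isn't stated in the thread), and the helper names are illustrative. Note that the first address as posted has one more hex digit than the second and does not convert to 6645 MB, so it may contain a typo. The sketch therefore relies on the MB figures instead.

```python
MIB = 1 << 20  # bytes per mebibyte

def address_to_mib(address: int) -> int:
    """Megabyte offset of a physical address (assuming the tool reports MiB)."""
    return address // MIB

def prob_within(distance_mib: float, total_mib: float) -> float:
    """Chance that two independent uniform addresses fall within distance_mib."""
    x = distance_mib / total_mib
    # For two uniform points on [0, 1]: P(|X - Y| <= x) = 1 - (1 - x)^2
    return 2 * x - x * x

second_error = 0x19F785790
print(address_to_mib(second_error))  # 6647, matching the reported offset

# ASSUMPTION: 16 GiB of RAM under test; the thread does not state the size.
ASSUMED_TOTAL_MIB = 16 * 1024
gap_mib = 6647 - 6645
print(f"P(two random errors within {gap_mib} MB) = "
      f"{prob_within(gap_mib, ASSUMED_TOTAL_MIB):.6f}")
```

Under those assumptions the chance is a small fraction of a percent, which supports the reading that the two errors share a cause rather than being independent random events.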

      I've moved the 2 sticks to the other pair of dual channel slots and am testing them there as well, but I suspect I'm wasting my time.

      In general, is Test 10 important to keep in the lineup if it has been passed a number of times previously? It adds 40 mins to each set of 4 passes, so leaving it out would let you get more passes in.

      Even if RAM doesn't have a defect that test 10 would detect, does giving RAM a break help to reveal errors in other tests, or is it healthier for the RAM?

      I'm starting to wonder if subtle defects like this are very common with RAM, like dead pixels in monitors. Most people will buy RAM and won't test it unless they get frequent crashes and someone tells them to check for it. If someone else bought this RAM, I doubt they would ever bother to check because it works properly nearly all the time.

      Have I just been unlucky by getting 3 kits of RAM consecutively with defects or is it unrealistic to try to get RAM that doesn't fail a test with 24 hours of testing?



      • #4
        Here is Linus Torvalds (creator of Linux) having a rant about memory being unreliable and the need for ECC RAM.

        But it would be extremely unusual to get 3 bad sets of RAM. Were they all the same model from the same vendor?

        Maybe your BIOS has an issue. e.g. running the voltage too low, timing too tight, or clock speed too high. I've certainly seen cases where RAM fails at the advertised max clock speed but works fine just under the max with particular motherboards.



        • #5
          That post by Linus was really interesting!

          I had no idea Intel suppressed the adoption of ECC in consumer RAM.


          The 3 kits of RAM have had very different types of errors. All were tested using the XMP profiles.

          One gave errors very easy to spot, with it failing multiple tests very quickly on the first pass. I was able to isolate the stick giving trouble with this one. This one was on my motherboard's QVL.

          Another had a single error about every 3 hours. With 2 of the errors, the same address failed on 2 different tests (7 and 8 on 2 separate groups of tests done hours apart). This kit wasn't on my motherboard's QVL but it was producing errors when the RAM was below its advertised XMP speeds (but with lower error frequency when tested at slower speeds). It was also giving errors when tested on the other pair of dual channel slots.

          I'm still testing the kit I mentioned in my original post (this kit is on the QVL - the same module number as the one I mentioned giving multiple errors quickly). I switched it to the other pair of dual channel slots and have been running tests on it. So far it has passed all tests (total 40 passes, 8 including test 10, the rest excluding it) over 26h 15m of testing.

          If it stays good after another 24 to 48h of tests, I'll suspect the first pair of dual channel slots.
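A clean run only bounds the error rate; it doesn't prove it is zero. As a rough sketch, assuming errors arrive as a Poisson process, the code below computes the highest rate still consistent with zero errors over 26h 15m at 95% confidence, and how many clean hours would be needed to rule out a defect as frequent as the second kit's (about 1 error per 3 hours). The function names are illustrative, and the Poisson model itself is an assumption.

```python
import math

def rate_upper_bound(clean_hours: float, confidence: float = 0.95) -> float:
    """Highest Poisson rate (errors/hour) consistent with zero errors seen."""
    # Solve exp(-rate * hours) = 1 - confidence for rate.
    return -math.log(1 - confidence) / clean_hours

def hours_to_exclude(rate_per_hour: float, confidence: float = 0.95) -> float:
    """Error-free hours needed to rule out the given rate at this confidence."""
    return -math.log(1 - confidence) / rate_per_hour

clean_hours = 26 + 15 / 60  # 26h 15m without an error
bound = rate_upper_bound(clean_hours)
print(f"Rate could still be as high as {bound:.3f} errors/hour "
      f"(about one every {1 / bound:.1f} hours)")

needed = hours_to_exclude(1 / 3)  # a defect like the second kit's
print(f"Clean hours needed to exclude 1 error per 3 h: {needed:.1f}")
```

In other words, the run so far comfortably excludes a defect as frequent as the second kit's, but a much rarer fault would need a proportionally longer clean run to rule out.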



          • #6
            I heard back from the vendor I returned the second RAM kit to (error every 3 hrs) - they confirmed a defect in the RAM, but I don't know if they did a scan for errors. I heard from them less than 2 hours after it arrived, so either it didn't start at the advertised speed on their machine as well, or they did a scan with a powerful desktop and it turned up errors very quickly.

            I'm positive the first one is defective, so that's a definite 2 out of 3.
