[ECC Errors detected] - problem, or problem prevented?

  • [ECC Errors detected] - problem, or problem prevented?

    I'm running 7.0.0 Beta 1 on a system with 64GB of ECC RAM, and getting a lot of these:

    [ECC Errors detected] Test: 13 Addr: DB9A9C040

    Does this mean "errors occurred but ECC corrected them" which would mean I don't have anything to worry about, or "errors occurred in spite of ECC" which would mean I do have something to worry about?
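When there are "a lot" of these lines, it can help to count them per test number and per address before deciding what they mean. The following is only a rough sketch: it assumes the log lines look exactly like the one quoted above, and the function name `tally_ecc_errors` is made up for illustration, not part of any MemTest86 tooling.

```python
import re
from collections import Counter

# Pattern based solely on the single log line quoted in this thread;
# other MemTest86 versions may format the message differently.
LINE_RE = re.compile(
    r"\[ECC Errors detected\]\s+Test:\s*(\d+)\s+Addr:\s*([0-9A-Fa-f]+)"
)


def tally_ecc_errors(lines):
    """Count ECC error reports per test number and per address (hypothetical helper)."""
    per_test = Counter()
    per_addr = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not an ECC error line
        per_test[int(match.group(1))] += 1
        per_addr[int(match.group(2), 16)] += 1
    return per_test, per_addr


if __name__ == "__main__":
    sample = [
        "[ECC Errors detected] Test: 13 Addr: DB9A9C040",
        "[ECC Errors detected] Test: 13 Addr: DB9A9C040",
        "some unrelated line",
    ]
    tests, addrs = tally_ecc_errors(sample)
    print(dict(tests))
    print({hex(a): n for a, n in addrs.items()})
```

If nearly all reports land in one test, or cluster around a few addresses, that is more informative than the raw count alone.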

  • #2
    Yes, the errors have been corrected by ECC so from the CPU's point of view, the data integrity is preserved. However, the fact that there were actual memory errors that were corrected should serve as a warning of faulty memory hardware. In your case, the corrected ECC errors seem to occur during the row hammer test (Test 13) which means that your RAM may be susceptible to row hammer errors.

    See this page for details:
    http://www.memtest86.com/troubleshooting.htm#hammer


    • #3
      Okay, but how serious is this?

      Originally posted by keith:
      Yes, the errors have been corrected by ECC so from the CPU's point of view, the data integrity is preserved. However, the fact that there were actual memory errors that were corrected should serve as a warning of faulty memory hardware. In your case, the corrected ECC errors seem to occur during the row hammer test (Test 13) which means that your RAM may be susceptible to row hammer errors.
      Okay, but if ECC is able to detect & correct the errors, how serious a problem is this in the real world? Having no errors to correct is obviously better, but if the errors ARE getting corrected, should I still be wary of trusting the system?


      • #4
        From time to time you might get a two-bit error, which ECC can typically detect but won't correct.
        But two-bit errors are rarer than single-bit errors.
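To make the single-bit versus two-bit distinction concrete, here is a toy single-error-correct, double-error-detect (SECDED) code. It is a minimal sketch, assuming an extended Hamming(8,4) layout that protects 4 data bits. Real ECC DIMMs commonly protect 64 data bits with 8 check bits, and the exact code a given memory controller uses is not stated in this thread. The names `encode` and `decode` are illustrative only.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1..7 form a Hamming(7,4) word
    with check bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1-bits even
    return code


def decode(code):
    """Return (status, data): status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    odd_parity = sum(code) % 2  # 1 means an odd number of bits flipped

    if syndrome == 0 and odd_parity == 0:
        status = "ok"
    elif odd_parity == 1:
        # Treated as a single-bit error; the syndrome is the position of the
        # flipped bit (0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity but non-zero syndrome: two bits flipped. Detected only.
        return "uncorrectable", None

    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, data


if __name__ == "__main__":
    word = encode(0b1011)

    one_flip = list(word)
    one_flip[5] ^= 1
    print(decode(one_flip))  # single-bit error

    two_flips = list(word)
    two_flips[5] ^= 1
    two_flips[6] ^= 1
    print(decode(two_flips))  # two-bit error
```

In this sketch a single flipped bit is repaired silently, while two flipped bits in the same word are flagged but the data cannot be recovered, which is the case described above.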


        • #5
          Hello JAustin,

          I work a lot with ECC systems and I consider ECC errors to be the canary in the coal mine. If you're seeing them frequently, I would be concerned. My belief is that ECC errors warn us that it's time to replace hardware before it fails. Since you say you saw "a lot" of these errors, I would err on the side of caution. It's just a matter of time before multiple errors occur.

          Out of curiosity, what is the system and what type of memory is it?
