ECC Error(s)

  • ECC Error(s)

    Testing some memory for a ZFS file server and wondering if I should be overly concerned about this.

    [Image: memtest86_error1.JPG]

    It has been 126 Hrs. The ECC Errors weren't logged in the IPMI/BMC console for the server either, not 100% sure why.

    Can I conclude which memory module caused these ECC errors based on this output? I had already run this same test (8 passes) on 64G of RAM without error, and had run a day-long test on the other 64G before starting this 128G test. I suspect the errors come from the second half of the RAM, but considering how long they took to show up, I can't say that with 100% certainty.
    Last edited by JayG30; Jul-03-2017, 08:49 PM.

  • #2
    If the motherboard's UEFI BIOS is compatible with threading then you could run all 32 cores for faster testing.

    125 hours is a long test. If you believe the ECC report, then the bad RAM is in Channel 2 slot 0. But it isn't always obvious which one is zero on channel 2 by looking at the motherboard.

    The errors got corrected and there were only 2 of them, so it isn't a critical issue. In the end, however, the level of importance depends on you, not us. If this machine were running a nuclear reactor or flying an aircraft, it would obviously be more serious than for a machine serving up web pages.



    • #3
      So, after letting the tests run to completion I had a few more ECC errors. Seemed to mostly occur at the end. Interestingly only 1 of them was logged in the IPMI/BMC console. I don't get how Memtest can report an ECC Error correction but Intel's event log doesn't.

      Memtest86 Errors
      Last 10 Errors
      [ECC Error] Test: 13, (Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, Channel/Slot: 2/0
      [ECC Error] Test: 13, (Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, Channel/Slot: 2/0
      [ECC Error] Test: 13, (Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, Channel/Slot: 2/0
      [ECC Error] Test: 13, (Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, Channel/Slot: 2/0
      [ECC Error] Test: 13, (Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, Channel/Slot: 2/0
      [ECC Error] Test: 13, (Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, Channel/Slot: 0/0
      [ECC Error] Test: 13, (Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, Channel/Slot: 0/0
      [ECC Error] Test: 7, (Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, Channel/Slot: 2/0
      [ECC Error] Test: 7, (Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, Channel/Slot: 2/0
      IPMI/BMC Event Log
      700 07/05/2017 22:45:28 Mmry ECC Sensor Memory Correctable ECC. CPU: 1, DIMM: C1. - Asserted
      The error in IPMI (DIMM C1) was either Channel/Slot 0/0 or 2/0 during the RowHammer test (#13).
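To see at a glance how the corrected errors are distributed, the "Last 10 Errors" lines above can be tallied per Channel/Slot and per test. The following is only a minimal sketch: the regular expression is an assumption based on the line format as it appears in this post, not a documented MemTest86 log format, and `tally` is a hypothetical helper name.

```python
import re
from collections import Counter

# Lines copied verbatim from the "Last 10 Errors" list in this post.
LOG = """\
[ECC Error] Test: 13, (Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, Channel/Slot: 2/0
[ECC Error] Test: 13, (Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, Channel/Slot: 2/0
[ECC Error] Test: 13, (Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, Channel/Slot: 2/0
[ECC Error] Test: 13, (Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, Channel/Slot: 2/0
[ECC Error] Test: 13, (Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, Channel/Slot: 2/0
[ECC Error] Test: 13, (Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, Channel/Slot: 0/0
[ECC Error] Test: 13, (Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, Channel/Slot: 0/0
[ECC Error] Test: 7, (Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, Channel/Slot: 2/0
[ECC Error] Test: 7, (Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: Yes, Syndrome: N/A, Channel/Slot: 2/0
"""

# Assumed pattern, derived only from the lines above.
PATTERN = re.compile(
    r"\[ECC Error\] Test: (\d+),.*?ECC Corrected: (\w+),.*?Channel/Slot: (\d+)/(\d+)"
)


def tally(log_text):
    """Count ECC error lines per (channel, slot) and per test number."""
    by_slot = Counter()
    by_test = Counter()
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match is None:
            continue  # skip anything that is not an ECC error line
        test, _corrected, channel, slot = match.groups()
        by_slot[(int(channel), int(slot))] += 1
        by_test[int(test)] += 1
    return by_slot, by_test


if __name__ == "__main__":
    by_slot, by_test = tally(LOG)
    for (channel, slot), count in by_slot.most_common():
        print(f"Channel {channel} / Slot {slot}: {count} corrected error(s)")
    for test, count in sorted(by_test.items()):
        print(f"Test {test}: {count} error(s)")
```

On the nine lines shown, this counts seven errors on Channel/Slot 2/0 and two on 0/0, with seven of the nine occurring in test 13.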

      Unfortunately Memtest86 tells me that threading probably doesn't work on this board (I posted about that previously, actually), so I can't even try to run it that way. It also can't gather certain information about my machine's RAM, and I believe it told me to save a log file. I'm pretty sure that is what I'm attaching to this post (the file was too big to upload, so I had to delete some of it). Interestingly, when I first ran this exact same test, I got hundreds of ECC errors on test #10 right away (you can see that in the log file). It was very strange.
      Attached Files
      Last edited by JayG30; Jul-07-2017, 03:30 PM.



      • #4
        Most of those errors appear to be in test #13, which is a special case.
        See,
        http://www.memtest86.com/troubleshooting.htm





        • #5
          Thanks for the info.

          Here is more information on testing that might be interesting. I swapped out 2 sticks for a pair I had previously tested as good.

          Remember I mentioned that when I initially ran memtest86 on the 128GB of RAM I was getting a TON of errors for Test #10 (instantly). You can see this in the log I attached above. Here is a screenshot of that run:

          [Image: test1_128GB.PNG]

          When I ran the SAME exact thing again later I didn't get any of those errors and only those ECC ones I posted (same memory modules, tests, etc).

          So here I am, testing with only 2 different memory modules installed that I've tested as good before, and I'm getting those same errors. Currently I have 670 errors being reported. NOT ECC errors, but Errors! I'm not sure what is going on, but I feel like something isn't right here. Perhaps it is testing addresses not actually assigned to RAM?

          [Image: errors.JPG]

          It seems REALLY unlikely that this is accurate.



          • #6
            I initially ran memtest86 on the 128GB of RAM I was getting a TON of errors for Test #10 (instantly)
            Likely there is more than one issue.

            The errors in Test #10 are likely due to a UEFI BIOS bug where the memory map is not correct. So some other device / software / firmware on the system is using a block of RAM, but that block of RAM is never being marked as in use. MemTest86 can dump out the memory map if you want to take a look at it.

            The result is that both MemTest86 and some other device are writing to the same RAM at the same time. This wouldn't provoke an ECC error, since from the RAM's point of view both writes are valid. But from MemTest86's point of view it is an error: something external is changing values in the RAM while the test is running.
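As a toy illustration of that last point, the sketch below writes a pattern to a buffer, lets a second writer overwrite a few bytes, and then verifies. It is not how MemTest86 is implemented and it does not model ECC at all; `pattern_test`, `rogue_device`, and the addresses 0x40 to 0x43 are made up for the example. The point is only that a read-back mismatch can come from a second valid writer, with no bit ever flipping in the RAM itself.

```python
def pattern_test(memory, start, end, pattern, interferer=None):
    """Write `pattern` to memory[start:end], then read back and return
    the list of addresses whose content no longer matches."""
    # Write phase.
    for addr in range(start, end):
        memory[addr] = pattern

    # Between write and verify, something else may touch the same memory.
    if interferer is not None:
        interferer(memory)

    # Verify phase.
    return [addr for addr in range(start, end) if memory[addr] != pattern]


def rogue_device(memory):
    """Hypothetical device/firmware using a block that the memory map
    failed to mark as in use (addresses are invented for the example)."""
    for addr in range(0x40, 0x44):
        memory[addr] = 0x00


if __name__ == "__main__":
    memory = bytearray(256)

    clean = pattern_test(memory, 0, 256, 0xAA)
    print("mismatches without interference:", clean)

    bad = pattern_test(memory, 0, 256, 0xAA, interferer=rogue_device)
    print("mismatches with interference:", [hex(a) for a in bad])
```

Every write in this example is a perfectly valid store, so an error-correcting layer would have nothing to correct; only the test, which knows what it expected to read back, sees a failure.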





            • #7
              Originally posted by David (PassMark)

              Likely there is more than one issue.

              The errors in Test #10 are likely due to a UEFI BIOS bug where the memory map is not correct. So some other device / software / firmware on the system is using a block of RAM, but that block of RAM is never being marked as in use. MemTest86 can dump out the memory map if you want to take a look at it.

              The result is that both MemTest86 and some other device are writing to the same RAM at the same time. This wouldn't provoke an ECC error, since from the RAM's point of view both writes are valid. But from MemTest86's point of view it is an error: something external is changing values in the RAM while the test is running.


              Can you point me to how to dump the memory map? Thanks.

              If this is happening, do I have any recourse so I can test the RAM without issue?



              • #8
                Can you point me to how to dump the memory map?
                It is in the debug log, and you can also display it on the screen by selecting "View Memory Usage" in the System Information window.

                You can also limit the address range being tested to avoid the bad section if you wanted.

                If the memory map is wrong, however, it is likely to cause an issue for the operating system as well (two things using the same memory at the same time, overwriting each other's data).
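One way to act on this is to compare the failing addresses from the test with the dumped memory map, then pick a page-aligned range to exclude. The sketch below is only an illustration: the map entries and failing addresses are invented placeholders, not values from this thread, and `Region`, `region_of`, and `bounding_range` are hypothetical helpers, not part of MemTest86.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    start: int  # inclusive
    end: int    # exclusive
    kind: str   # e.g. "usable" or "reserved"


# Hypothetical memory map; substitute the entries from your own dump.
MEMORY_MAP = [
    Region(0x0000_0000, 0x0009_F000, "usable"),
    Region(0x0009_F000, 0x0010_0000, "reserved"),
    Region(0x0010_0000, 0x8000_0000, "usable"),
]

# Hypothetical failing addresses; substitute the ones your test reports.
FAILING = [0x7FF0_1000, 0x7FF0_2040, 0x7FF3_0000]


def region_of(addr: int, memory_map) -> Optional[Region]:
    """Return the map entry that contains `addr`, or None if unmapped."""
    for region in memory_map:
        if region.start <= addr < region.end:
            return region
    return None


def bounding_range(addrs, align=0x1000):
    """Smallest `align`-aligned [lo, hi) range covering all addresses."""
    lo = min(addrs) // align * align
    hi = (max(addrs) // align + 1) * align
    return lo, hi


if __name__ == "__main__":
    for addr in FAILING:
        region = region_of(addr, MEMORY_MAP)
        kind = region.kind if region else "unmapped"
        print(f"{addr:#x}: {kind}")
    lo, hi = bounding_range(FAILING)
    print(f"candidate range to exclude: {lo:#x} - {hi:#x}")
```

If the failures cluster inside a region the map calls usable, that would be consistent with the memory-map explanation, and the bounding range gives a starting point for the address limits to set in the tester.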


