MemTest86 7.1 [ECC Errors detected], but Errors=0

  • MemTest86 7.1 [ECC Errors detected], but Errors=0

    MemTest86 7.1
    The message [ECC Errors detected] appears, but the summary shows Errors=0.
    All the information is in the screenshot. What does it mean?

  • #2
    Hello Dames,

    Can you post a brief system configuration, such as make/model/processors, etc.? And most importantly, the details of the memory modules. This is a DDR4 ECC-protected system and it's showing vulnerability to a Rowhammer attack. ECC errors are very real errors and can indicate a system prone to lower performance, halts, reboots, and in some cases application data corruption. PassMark appears not to count these as errors because they were correctable, but it's my opinion that they should be regarded as errors. We pay a premium for ECC, so anomalous behaviors should be noted.


    • #3
      CPU: Intel Xeon 2xE5-2620v4
      MB: Supermicro MBD-X10DRL-i
      RAM: 4x16GB DDR4 ECC REG Micron 36ASF2G72PZ-2G1B1


      • #4
        Yes, "ECC errors detected" means there were errors, but they were corrected by the hardware. So from an application's point of view there was no error.

        This page has more details
        https://www.memtest86.com/troubleshooting.html
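
To illustrate why a run can report [ECC Errors detected] while Errors stays at 0, here is a minimal sketch of single-bit error correction. It uses a toy Hamming(7,4) code, which is an assumption made for brevity: real ECC DIMMs protect 64 data bits with 8 check bits, and the exact code used by the memory controller is not described in this thread. The function names `encode` and `read_with_ecc` are hypothetical.

```python
def encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    code = [0] * 8  # index 0 is unused so list indices match bit positions
    code[3], code[5], code[6], code[7] = data_bits
    # Each parity bit covers the positions whose index has that bit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def read_with_ecc(codeword):
    """Return (data_bits, corrected_position); position 0 means no error seen."""
    # XOR of the positions of all set bits is 0 for a valid codeword,
    # and equals the position of the flipped bit after a single-bit error.
    syndrome = 0
    for position, bit in enumerate(codeword, start=1):
        if bit:
            syndrome ^= position
    fixed = list(codeword)
    if syndrome:
        fixed[syndrome - 1] ^= 1  # flip the faulty bit back
    data_bits = [fixed[2], fixed[4], fixed[5], fixed[6]]
    return data_bits, syndrome


original = [1, 0, 1, 1]
stored = encode(original)
stored[4] ^= 1  # simulate one flipped memory cell at position 5

data, corrected_position = read_with_ecc(stored)
ecc_events = 1 if corrected_position else 0   # what "ECC errors detected" reflects
data_errors = 0 if data == original else 1    # what the "Errors" count reflects
print(ecc_events, data_errors)  # prints "1 0"
```

The hardware noticed and repaired a flipped bit, so an ECC event is logged, but the data handed back to the test matches what was written, so no data error is counted.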


        • #5
          Hi, I just saw this thread which is very similar to my issue and didn't want to make a new one.

          I have 2x "Samsung DDR4-2133 32GB/4Gx72 ECC/REG CL15 Server Memory M393A4K40BB0-CPB". One is showing a lot of correctable ECC errors, mostly on test 13, plus a couple of errors on tests 0, 1, 2 and 3 after 1 or 2 passes, and the other one doesn't show any errors at all. Does that mean what I think it means, that I have one bad memory stick?


          • #6
            If you see "ECC errors detected" but the overall error count is zero, then there were no uncorrected errors.

            This page has more details (especially about test 13)
            https://www.memtest86.com/troubleshooting.html

            So you could decide to live with the errors, knowing that they are being corrected, or you could take the hard line and declare the stick faulty. It depends a bit on how critical the machine is. If this were in a machine being sent to Mars by NASA, I would replace the stick. If this were a web server running a WordPress blog, I'd be tempted to leave it running as long as the machine was stable. It also depends on whether the memory vendor will replace it for free under warranty.


            • #7
              The thing is, I wish I knew what causes errors in this one stick in particular, whether they are correctable or not... After 3 full passes I have got 700+ "correctable errors"... I mean, the other one is completely error-free, so what gives? That is what's bugging me.
              I will try to replace it soon if I can... My system is completely new, by the way.
              ...Yes, and I forgot to mention that the reason I began testing is that my OS (FreeNAS, latest version) keeps reporting a memory error every once in a while (the same exact error, over and over), something like this:
              • MCA: Bank 5, Status 0xd40000c000900090
              • MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
              • MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
              • MCA: CPU 0 COR OVER RD channel 0 memory error
              • MCA: Address 0x12ef39498


              • #8
                MCA = Machine Check Architecture
                COR probably means "correctable".
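
As a rough cross-check of that reading, the Status value from the log above can be decoded by hand. The sketch below assumes the architectural IA32_MCi_STATUS layout from Intel's Software Developer's Manual (valid, overflow and uncorrected flags in bits 63 to 61, the MCA error code in bits 15:0, and memory controller errors encoded as `0000 0000 1MMM CCCC`). The function name `decode_mci_status` is hypothetical, and model-specific fields are left undecoded.

```python
# Memory transaction types for the MMM field of a memory controller error code.
MEMORY_OPS = {
    0b000: "generic undefined request",
    0b001: "memory read",
    0b010: "memory write",
    0b011: "address/command",
    0b100: "memory scrubbing",
}


def decode_mci_status(status):
    """Decode the architectural flags of a 64-bit MCA bank status value."""
    mca_code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),      # "OVER": an earlier error was overwritten
        "uncorrected": bool((status >> 61) & 1),   # False corresponds to "COR"
        "enabled": bool((status >> 60) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "context_corrupt": bool((status >> 57) & 1),
        "mca_error_code": mca_code,
    }
    # Memory controller errors match the pattern 0000 0000 1MMM CCCC.
    if (mca_code & 0xFF80) == 0x0080:
        fields["memory_op"] = MEMORY_OPS.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        fields["channel"] = None if channel == 0xF else channel  # 0xF: not specified
    return fields


decoded = decode_mci_status(0xD40000C000900090)
for name, value in decoded.items():
    print(name, value)
```

For the value in the log this gives a valid, corrected (not uncorrected) memory read error on channel 0 with the overflow flag set, which lines up with the "COR OVER RD channel 0 memory error" line printed by the OS.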

                "I wish I could know what causes errors in this one stick"
                Very likely it is a bad bit in one of the chips, e.g. a manufacturing defect. Sometimes there are compatibility issues with certain motherboards, or the RAM is marginal at certain voltages and speeds. The motherboard vendors typically publish a compatibility list of RAM they have tested.


                • #9
                  One thing for sure, there is definitely something wrong with it. The RAM is on the motherboard's compatibility list, so no issues there.
                  Thanks for the help.


                  • #10
                    Hi,

                    Please advise.

                    DDR3 Board: Tyan S7053
                    CPU: Intel XEON E5-2650 @ 2.0 GHz
                    Memory: 16GB 1600speed SOUDIMM from SMART Modular

                    I am running the row hammer test in V8.3 and get this error:

                    "ECC errors detected Test : 13 Channel/Slot: 0/0"
                    In the past, on an ASUS motherboard, a Test 13 error provided the failing address and failing data

                    in this format:
                    Addr:1250042EC, Expected:04612B33,Actual:04212B33.
                    Why am I not getting the ECC failing address and data for the Tyan S7053?
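
When the Expected/Actual pair is available, as in the ASUS example quoted above, XOR-ing the two values shows exactly which bits differ. This is a small sketch for reading such a line by hand; the helper name `flipped_bits` is hypothetical and not part of MemTest86.

```python
def flipped_bits(expected, actual):
    """Return the bit positions (0 = least significant) where the values differ."""
    diff = expected ^ actual
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


expected = 0x04612B33
actual = 0x04212B33

bits = flipped_bits(expected, actual)
print(bits)  # prints [22]
for bit in bits:
    # Show the direction of each flip, e.g. a 1 that was read back as 0.
    print(bit, (expected >> bit) & 1, "->", (actual >> bit) & 1)
```

For the quoted values only bit 22 differs, and it changed from 1 to 0, so that report describes a single-bit flip.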


                    • #11
                      Either the memory controller didn't provide the details or MemTest86 didn't find (or didn't decode) the details.
                      Or maybe the ECC error was really in Channel & Slot 0.


                      • #12
                        Thanks David.
                        We have a license. You indicated "MemTest86 didn't find (or didn't decode) the details". If MemTest86 can't decode or display the data, can it diagnose further and confirm whether this is a motherboard issue or a MemTest86 issue?


                        • #13
                          It won't be a motherboard issue. The motherboard is just wires that connect the CPU to the RAM.
                          ECC details are kept secret by Intel, so they are hard to support without the correct documentation for the particular CPU and an example of that hardware.
                          This is a pretty old CPU. It is possible that it never supplied valid channel and slot IDs when an error occurred.
