Different results between 9.1 and 9.4 for ECC errors

  • Different results between 9.1 and 9.4 for ECC errors

    In preparation for a new computer, I downloaded memtest86 9.4 pro. In order to check that the USB was bootable and everything seemed to be working as expected, I booted my desktop from the 9.4 disk. Almost immediately I started seeing ECC errors. Well Nuts! That wasn't on my schedule.

    Before I went too far down that path, I decided to boot with the older version of memtest86 I had used when first assembling my desktop. So, I ran memtest86 9.1 pro. No errors! Hurray?

    Suggestions? Which version is correct?

    Main System specs:

    Code:
    AMD Ryzen 9 5900X 12-Core Processor
    ASUS B550-Plus Prime AMD AM4 ATX Motherboard
    32GB Kit 2x16GB DDR4-3200 PC4-25600 ECC Unbuffered 2Rx8 Memory for Server/Workstation by NEMIX RAM
    I use the system every day without issue, and a cron job runs edac-util every day at 17:17. It's never reported any errors.

  • #2
    With each new release of MemTest86 we add support for detecting ECC errors on a wider range of memory controllers. See What's new in MemTest86.

    Ryzen ECC support was added in V9.2 & V9.3.

    So they are probably real errors.

    I use the system every day without issue,
    ECC RAM corrects single-bit errors. Its purpose in life is to fix small errors without you noticing.
    (Maybe your edac-util version also needs updating for Ryzen?)


    • #3
      Thanks very much for your reply. I ran the usual 4-pass test overnight and 242 corrected errors were noted, all at the same location. Hmmm. Now I get to decide whether to leave well enough alone or start replacing memory. Given that the system has probably always had this "bad" memory cell, I'll probably leave well enough alone for now.


      • #4
        You were 100% correct about edac-util needing an update. It turns out that it's been deprecated entirely and replaced by rasdaemon and ras-mc-ctl.

        Whether through user error (yours truly) or bugs, ras-mc-ctl isn't as useful as I'd like.

        Code:
        sudo ras-mc-ctl --error-count
        Label CE UE
        mc#0csrow#3channel#0 0 0
        mc#0csrow#3channel#1 0 0
        mc#0csrow#2channel#0 0 0
        mc#0csrow#2channel#1 0 0
        and

        Code:
        sudo ras-mc-ctl --summary
        No Memory errors.
        BUT

        Code:
        sudo ras-mc-ctl --errors | grep CECC | head -5
        1 2021-08-22 16:34:44 -0400 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci CECC, memory_channel=0,csrow=3, mcgcap=0x0000011c, status=0x9c2041000000011b, addr=0x3fdc1fcc0, misc=0xd01a000101000000, walltime=0x6122b4e4, cpuid=0x00a20f10, bank=0x00000011
        2 2021-08-22 16:39:58 -0400 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci CECC, memory_channel=0,csrow=3, mcgcap=0x0000011c, status=0x9c2041000000011b, addr=0x3fdc1fcc0, misc=0xd01a000201000000, walltime=0x6122b61e, cpuid=0x00a20f10, bank=0x00000011
        3 2021-08-22 16:45:12 -0400 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci CECC, memory_channel=0,csrow=3, mcgcap=0x0000011c, status=0x9c2041000000011b, addr=0x3fdc1fcc0, misc=0xd01a000301000000, walltime=0x6122b758, cpuid=0x00a20f10, bank=0x00000011
        4 2021-08-22 16:50:26 -0400 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci CECC, memory_channel=0,csrow=3, mcgcap=0x0000011c, status=0x9c2041000000011b, addr=0x3fdc1fcc0, misc=0xd01a000401000000, walltime=0x6122b892, cpuid=0x00a20f10, bank=0x00000011
        5 2021-08-22 16:55:40 -0400 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci Error_overflow CECC, memory_channel=0,csrow=3, mcgcap=0x0000011c, status=0xdc2040000000011b, addr=0x3fdc1fcc0, misc=0xd01a001101000000, walltime=0x6122b9cc, cpuid=0x00a20f10, bank=0x00000011
        About 1 corrected error every 5 minutes, AND the address matches the location reported by memtest86! Good job memtest86 folks!
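
Since --summary and --error-count are no help here, a small script can tally the --errors records directly. The following is only a rough sketch: the function name summarize_cecc is made up for illustration, it assumes every record is a single line shaped like the ones above (with addr= and walltime= fields), and it assumes walltime is Unix epoch seconds in hex, which is consistent with the printed timestamps but is an inference from this output rather than something taken from the rasdaemon documentation.

```python
import re
from collections import defaultdict

# Field patterns taken from the `ras-mc-ctl --errors` lines shown above.
ADDR_RE = re.compile(r"\baddr=(0x[0-9a-fA-F]+)")
WALLTIME_RE = re.compile(r"\bwalltime=(0x[0-9a-fA-F]+)")


def summarize_cecc(lines):
    """Group corrected-ECC (CECC) records by reported address.

    Returns {addr: {"count": n, "intervals": [...]}} where intervals are the
    gaps in seconds between consecutive records at that address.
    Assumption: walltime is Unix epoch seconds written in hex.
    """
    stamps_by_addr = defaultdict(list)
    for line in lines:
        if "CECC" not in line:
            continue  # skip anything that is not a corrected-ECC record
        addr = ADDR_RE.search(line)
        wall = WALLTIME_RE.search(line)
        if not (addr and wall):
            continue  # line does not have the expected fields
        stamps_by_addr[addr.group(1)].append(int(wall.group(1), 16))

    summary = {}
    for addr, stamps in stamps_by_addr.items():
        stamps.sort()
        intervals = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
        summary[addr] = {"count": len(stamps), "intervals": intervals}
    return summary
```

One way to use it is to save the output of sudo ras-mc-ctl --errors to a file (say errors.txt, a hypothetical name) and call summarize_cecc(open("errors.txt")). For the five records listed above it reports a count of 5 at addr 0x3fdc1fcc0 with 314 seconds between each pair, which is the "about every 5 minutes" pattern.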







        • #5
          sudo ras-mc-ctl --summary
          Funny that the summary says everything is fine when clearly it is somewhat less than fine.
