Correctable ECC Errors with different RAM Modules


  • Correctable ECC Errors with different RAM Modules

    Hello Everyone,

    I have a question regarding my new server build at home:

    When I did my burn-in-testing with Memtest86 it found correctable ECC errors.
    Since those were pretty much always reproducible, I returned the modules and bought new ones.

    So with the new modules I get "Correctable Errors" again.
    You could think that's just bad luck, but it's exactly 20 correctable errors every run.
    Can that really be a coincidence?

    Attached screenshots from the summary, I could upload detailed logs if needed.

    Thanks and Regards
    Chris
    Attached Files

  • #2
    Yes, strange to get the same number of errors from two different sets of RAM.
    In both cases the errors appeared very early in the test run as well.

    Was almost like the memory controller had a bunch of queued up ECC errors to report before MemTest86 had even started the tests.
    On the other hand the channel / slot details in the ECC error report seem kind of believable.

    Wonder if it might be some type of BIOS bug where the ECC stuff isn't being initialised correctly via BIOS. So machine starts up with a bunch of fake ECC reports, but only once per cold boot.

    Can you try,
    1) Cold boot the machine & run tests for a few minutes (at which point I assume you get some ECC errors).
    2) Quit tests in MemTest86 and restart them (without a cold boot). Do you get errors on the 2nd run?



    • #3
      Thanks for the reply.
      Will do so starting Sunday and report back.



      • #4
        Hi David,

        your hunch was right:
        1) First run after boot -> Errors (exactly 20 again in first 2-3 minutes)
        2) Second run after soft reboot - No Errors after ~1 hour. I don't believe any will occur.

        What does this mean for me?
        All fine and can be ignored, or else ...?



        • #5
          Likely a BIOS bug in the Fujitsu BIOS.
          You could try and explain it to them, but I doubt they would care.

          We'll also have a look in MemTest86. Maybe we can clear existing queued errors on a cold boot (without reporting them as real ECC errors).



          • #6
            Originally posted by SofaKingBoring
            Hello Everyone,

            Attached screenshots from the summary, I could upload detailed logs if needed.

            Thanks for the screenshots. If possible, can you send a copy of the logs under EFI\BOOT\ on the USB flash drive?



            • #7
              Have attached it here.

              Best Regards
              Chris
              Attached Files



              • #8
                Thanks for the logs.

                The ECC errors look like legitimate errors reported by the chipset, though they appear to be triggered not by faulty RAM but on the memory controller side.
                This is suggested by the ECC status registers showing no errors before the start of the test, while error bits are set once the test starts.

                Also, it appears that not all test runs result in ECC errors at the start of the test. What the runs with ECC errors have in common is that the CPU temperatures are relatively low (~40C). Even after clearing the status registers, the same ECC errors are detected until they disappear after a certain amount of time (around test 2).
                Wonder if the ECC errors still occur on a cold boot after the machine has been powered on for a while.

                It might also be worth checking the BIOS memory settings that may be triggering the errors (e.g. low voltage settings).



                • #9
                  Following what you said, I ran a small series of tests today, immediately one after another.
                  I let the system run for a while, although not under heavy load.

                  Test1: Cold boot => same 20 ECC errors as before, abort.
                  Test2: Soft boot => No errors, abort.
                  Test3: Cold boot => Errors, let it run for 20 Minutes, to raise temp.
                  Test4: Cold boot => Errors (51°C starting CPU temp). Continue for 30 Minutes
                  Test5: Cold boot => Errors (50°C starting CPU temp)

                  So far it's 100% repeatable that a test after cold boot will produce errors, no matter if the system was running before or not.
                  Soft reboots never produce errors.

                  If logs are needed, I have saved them.



                  • #10
                    I suppose you guys won't be able to help me any further, but I'll try anyway:

                    On top of the above, I get a lot of errors like this in the Linux console every few seconds:

                    Jun 12 13:35:41 <hostname> kernel: EDAC MC0: 1 UE ie31200 UE on unknown memory (csrow:3 channel:1 page:0x0 offset:0x0 grain:1)

                    As I understand it, this would be unrecoverable errors, correct? But not once in all the memory tests did I come across a non-recoverable error.
                    Hundreds of UE errors should affect stability as well, so I assume this is some kind of false reporting?

                    Any tip what to do from here?
                    I wouldn't like to trash the whole server.
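
One way to check whether the UE count really keeps climbing is to read the EDAC counters directly rather than watching the console. The following is a minimal sketch, assuming the usual Linux EDAC sysfs layout (`/sys/devices/system/edac/mc/mcN/ce_count` and `ue_count`); the helper name `read_edac_counts` is hypothetical, and whether the ie31200 driver populates these files on this board is not confirmed in this thread.

```python
from pathlib import Path

def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {"mc0": {"ce_count": N, "ue_count": M}, ...} for each controller found.

    The default root is an assumption about the sysfs layout; an empty dict
    is returned when the directory does not exist (e.g. no EDAC driver loaded).
    """
    base = Path(root)
    if not base.is_dir():
        return {}
    counts = {}
    for mc_dir in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc_dir / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc_dir.name] = entry
    return counts

# Print the current counters; run it twice a few minutes apart and compare.
for mc, entry in read_edac_counts().items():
    print(mc, entry)
```

If `ue_count` keeps rising while the machine stays stable and MemTest86 reports nothing uncorrectable, that would be consistent with the false-reporting suspicion above, though it does not prove it.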




                    • #11
                      EDAC = Error Detection and Correction event
                      MC0 = Memory controller
                      UE = Uncorrectable Errors
                      csrow = Chip-Select Row
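
For anyone collecting these messages, the fields in the kernel line quoted earlier can be pulled apart with a short Python sketch. The regular expression is an assumption based on that single log line, not on a documented EDAC message format, and `parse_edac_line` is a hypothetical helper name.

```python
import re

# Pattern inferred from the one log line in this thread (an assumption).
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) (?P<msg>.*?) "
    r"\((?P<fields>[^)]*)\)"
)

def parse_edac_line(line):
    """Return a dict of fields from an EDAC kernel message, or None if no match."""
    m = EDAC_RE.search(line)
    if m is None:
        return None
    event = {
        "mc": int(m["mc"]),        # memory controller index
        "count": int(m["count"]),  # number of errors in this report
        "kind": m["kind"],         # CE (correctable) or UE (uncorrectable)
        "message": m["msg"],
    }
    # Key:value pairs inside the parentheses, e.g. csrow:3 channel:1 page:0x0
    for pair in m["fields"].split():
        key, sep, value = pair.partition(":")
        if not sep:
            continue
        try:
            event[key] = int(value, 0)  # base 0 accepts both "3" and "0x0"
        except ValueError:
            event[key] = value
    return event

sample = ("Jun 12 13:35:41 <hostname> kernel: EDAC MC0: 1 UE ie31200 UE on "
          "unknown memory (csrow:3 channel:1 page:0x0 offset:0x0 grain:1)")
print(parse_edac_line(sample))
```

Grouping the parsed events by `csrow` and `channel` would show whether the reports are spread across all of memory or always point at the same location.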

                      Unfortunately we aren't really experts in the Linux EDAC module.
                      We agree it doesn't seem to be normal to get these errors however.

                      In case anyone else is searching for the same issue, this was with a Fujitsu D3644-B motherboard with a Xeon E-2144G CPU with 2 sticks of DDR4 RAM.
                      (I also had a look for the specs for this motherboard, Fujitsu don't seem to mention it on their web site. Doesn't seem to exist, which is also strange, or really bad customer support).



                      • #12
                        Regarding the last part about the mainboard, which is not entirely correct:

                        Fujitsu gave up parts (or all?) of their mainboard business in 2020 and it was partly taken over by Kontron.
                        They still sell those boards, and the mentioned D3644-B is still in "extended lifecycle support":
                        https://www.kontron.com/en/products/...b-uatx/p157722


                        From there you'll get FW updates and all the documents.
                        Said motherboard was still developed as a Fujitsu board. So all documents are branded that way.
                        Last edited by SofaKingBoring; Jun-13-2023, 06:38 AM.



                        • #13
                          Kontron thing was interesting. Allowed me to do some more Googling and maybe find the explanation.

                          These two threads
                          https://serverfault.com/questions/52...internal-error
                          https://gathering.tweakers.net/forum...97650#65997650

                          Here is the key quote:
                          EDAC complaining about most (all?) memory banks while Memtest shows no errors at all most likely means that your ECC RAM is OK, but was not initialized properly by the BIOS on boot. In order to initialize the ECC bits, memory has to be written before it can be used. Usually this is done by the BIOS, but with some motherboards (ASUS P5B for example) this step is skipped if "Quick Boot" is enabled. So, on every access of uninitialized cells you will get EDAC errors with the server working without problems at the same time. Try disabling Quick Boot in the BIOS and see if it helps.
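
The mechanism described in that quote can be illustrated with a toy model. This is only a sketch under loud assumptions: a single parity bit per byte stands in for real ECC check bits, power-on contents are modelled as random, and `ToyEccMemory` is a made-up class, not how any real memory controller works.

```python
import random

def parity(value):
    """Even-parity bit over the bits of an integer."""
    return bin(value).count("1") & 1

class ToyEccMemory:
    """Toy memory: each cell has 8 data bits plus 1 check bit (parity, not real ECC)."""

    def __init__(self, size, seed=0):
        rng = random.Random(seed)
        # Power-on state: data and check bits are unrelated random values.
        self.data = [rng.getrandbits(8) for _ in range(size)]
        self.check = [rng.getrandbits(1) for _ in range(size)]
        self.errors = 0

    def write(self, addr, value):
        # A write stores the data and regenerates the matching check bit.
        self.data[addr] = value & 0xFF
        self.check[addr] = parity(self.data[addr])

    def read(self, addr):
        # A read verifies the stored check bit against the data.
        if parity(self.data[addr]) != self.check[addr]:
            self.errors += 1
        return self.data[addr]

    def initialise(self):
        """What firmware is expected to do at boot: write every cell once."""
        for addr in range(len(self.data)):
            self.write(addr, 0)

def count_read_errors(mem):
    """Read every cell once and return how many check-bit mismatches were seen."""
    mem.errors = 0
    for addr in range(len(mem.data)):
        mem.read(addr)
    return mem.errors

skipped = ToyEccMemory(1024)       # boot path that skips initialisation
initialised = ToyEccMemory(1024)
initialised.initialise()           # boot path that writes all memory first

print("errors without init:", count_read_errors(skipped))
print("errors with init:   ", count_read_errors(initialised))
```

In this model the uninitialised memory reports errors on first read even though no cell is faulty, while the initialised memory reports none, which matches the pattern of errors after a cold boot only.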

                          So a BIOS bug with quick boot option and ECC RAM?



                          • #14
                            Thanks a lot for all your effort David!

                            I had found that reference to an EDAC/BIOS quick boot bug on a German website before: https://www.thomas-krenn.com/de/wiki...Linux_Systemen
                            Unfortunately Kontron's AMI BIOS has no such quick boot option.
                            So I have already tried to disable it, but there is simply no way to do so, unless I have missed something.



                            • #15
                              I think you are going to need to talk to Kontron about it. I would be surprised if they care enough to fix it however.
