ECC Test Escapes

  • ECC Test Escapes

    We test storage-server platforms in manufacturing using an automated Linux-based test server that PXE-boots the DUT, which has an AMD EPYC processor, DDR4 ECC DIMMs (typically 8 or 16 populated DIMM slots), and a number of PCIe-connected devices and drives. We get RMAs for our server products with reported memory ECC errors, at a rate of 1-2 corrections per hour. We retest with BurnInTest for over 24 hours, and these units are typically no-fault-found (NFF). We can run an older memory benchmark tool like STREAM and cause and detect ECC errors. All of the ECC errors are correctable.

    Does anyone have a preferred BurnInTest configuration that would favor stressing the memory and provoking ECC errors? We run the default Cyclic Test today.
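
    One way to read the NFF result, as a rough sketch: if corrections arrived independently at the field rate (a Poisson model, which is an assumption and not something measured here), a clean 24-hour retest would be extremely unlikely to happen by chance. The helper name `prob_zero_events` is made up for illustration.

```python
import math


def prob_zero_events(rate_per_hour: float, hours: float) -> float:
    """Probability of seeing no events in `hours`, assuming a Poisson process."""
    return math.exp(-rate_per_hour * hours)


# Field rate quoted in the post: 1-2 corrections per hour.
# Retest length quoted in the post: 24 hours.
for rate in (1.0, 2.0):
    p = prob_zero_events(rate, 24)
    print(f"rate = {rate}/h -> P(no ECC corrections in 24 h) = {p:.2e}")
```

    Under that assumption both probabilities come out far below one in a billion, which suggests the retest is not exercising whatever triggers the errors in the field, rather than simply being unlucky.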


  • #2
    Some questions:
    What OS are you using for the testing?
    What tool are you using to detect the ECC errors?
    Are you testing the memory using the same OS as the client?
    What BurnInTest version are you using? Are you using just the RAM test? What cycle rate are you using?
    You can try using MemTest86 to test for ECC errors: https://www.memtest86.com/


    • #3
      Some questions:
      What OS are you using for the testing?

      - CentOS 7
      - Linux version 5.5.13-1.el7.elrepo.x86_64 (mockbuild@Build64R7) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)) #1 SMP Wed Mar 25 12:42:37 EDT 2020
      - Linux version 4.11.3-1.el7.elrepo.x86_64 (mockbuild@Build64R7) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri May 26 09:19:46 EDT 2017

      What tool are you using to detect the ECC errors?
      - mcelog
      - ras

      Are you testing the memory using the same OS as the client?
      - No (assuming you mean our end users reporting the ECC failures). Our customers use a variety of OSes (generally Linux based).


      What BurnInTest version are you using? Are you using just the RAM test? What cycle rate are you using?
      - BurnInTest v3.3 (1002) Linux 64-bit: CPU (65 cycle) and Memory (100 cycle)

      You can try using MemTest86 to test for ECC errors: https://www.memtest86.com/
      - Not in our production test, but in some of our RMA testing we have compared MemTest86 with Stream. Stream tends to create/force ECC errors that we are not seeing with MemTest86.



      • #4
        This all seems rather strange.
        What version of Stream are you using? Did you compile your own version? By default the Stream benchmark only uses ~40MB memory. The rest of the memory (99.9% of it) isn't being touched by Stream, so even in theory it can't generate ECC errors for the RAM it didn't use. Maybe this was a rare fluke?
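
        The footprint argument can be checked with simple arithmetic. A minimal sketch, assuming the standard STREAM layout of three arrays of 8-byte doubles; the array length (2,000,000 elements) and the 64 GiB of installed RAM are illustrative guesses, since neither the supplier's build settings nor the DUT's memory size are stated in this thread.

```python
def stream_footprint_bytes(array_len: int, elem_bytes: int = 8, n_arrays: int = 3) -> int:
    """Memory touched by STREAM's working arrays (a, b, c)."""
    return array_len * elem_bytes * n_arrays


def touched_fraction(footprint_bytes: int, installed_bytes: int) -> float:
    """Share of installed RAM that the benchmark arrays occupy."""
    return footprint_bytes / installed_bytes


ARRAY_LEN = 2_000_000          # assumption: compile-time array size of the binary
INSTALLED_RAM = 64 * 2**30     # assumption: 64 GiB populated in the DUT

footprint = stream_footprint_bytes(ARRAY_LEN)
share = touched_fraction(footprint, INSTALLED_RAM)
print(f"STREAM arrays: {footprint / 1e6:.0f} MB, {share:.3%} of installed RAM")
```

        A build with a much larger array size would change the picture, so it is worth confirming the size the supplied binary was compiled with.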

        You should also turn off the CPU test in BurnInTest if your focus is testing the RAM.

        Plus you are using a version of BurnInTest that is 5 years old.

        We are working on improvements to the memory test in Linux to make it more multi-threaded, which should add to the load a bit. But MemTest86 already does this (except in the case where your BIOS doesn't support multi-threading). Is it possible to try the Windows release?

        What version of MemTest86 are you using? Are you using the Multi-threading option?






        • #5
          This all seems rather strange.
          What version of Stream are you using? Ans: Don't know, it does not report the version.
          Did you compile your own version? Ans: No. It was provided to us by one of our suppliers.

          By default the Stream benchmark only uses ~40MB memory. The rest of the memory (99.9% of it) isn't being touched by Stream, so even in theory it can't generate ECC errors for the RAM it didn't use. Maybe this was a rare fluke?
          Ans: We've had several different systems report this same thing (ECC with Stream, no ECC with BI/Memtest).



          You should also turn off the CPU test in BurnInTest if your focus is testing the RAM.
          Ans: Thanks for the recommendation. We use BI as a system stress tool, but we can re-evaluate the balance between the CPU and memory tests.

          Plus you are using a version of BurnInTest that is 5 years old.
          Ans: Great product! Actually, we are looking at upgrading. Do you think that is related to this issue?


          We are working on improvements to the memory test in Linux to make it more multi-threaded, which should add to the load a bit. But MemTest86 already does this (except in the case where your BIOS doesn't support multi-threading). Is it possible to try the Windows release?

          What version of MemTest86 are you using? Are you using the Multi-threading option?
          Ans: We are using the standalone bootable USB version, MemTest86 Pro Version 9.0, with the default settings, which we think is multi-threading.


          • #6
            This still doesn't make sense. The source code for the version of Stream we looked at only uses ~0.1% of the RAM. If the RAM isn't used, it can't generate ECC errors. (There is a small chance of row-hammer-style errors affecting RAM addresses that aren't used, but this is fairly uncommon in DDR4 RAM.)

            So I think what you are seeing in the log is ECC errors from other past activity, or it was just random luck that Stream generated an error on one (or two?) occasions. What do the log entries look like in your distro? Are they time stamped?
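
            If the entries are time stamped, they can be lined up against the test schedule. A rough sketch, assuming the error timestamps have already been extracted from the log (the log format is not shown in this thread, so no parsing is attempted) and that the start and end time of each test run is known. All names and sample times below are hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Tuple

# (test name, start time, end time); end time is exclusive.
Window = Tuple[str, datetime, datetime]


def attribute_events(events: List[datetime], windows: List[Window]) -> Dict[str, int]:
    """Count ECC events per test window; events in no window are counted separately."""
    counts = {name: 0 for name, _, _ in windows}
    counts["outside any test"] = 0
    for ts in events:
        for name, start, end in windows:
            if start <= ts < end:
                counts[name] += 1
                break
        else:
            # No window contained this timestamp.
            counts["outside any test"] += 1
    return counts


# Hypothetical schedule and error times, for illustration only.
windows = [
    ("Stream", datetime(2020, 4, 1, 10, 0), datetime(2020, 4, 1, 11, 0)),
    ("BurnInTest", datetime(2020, 4, 1, 11, 0), datetime(2020, 4, 1, 12, 0)),
]
events = [
    datetime(2020, 4, 1, 9, 30),
    datetime(2020, 4, 1, 10, 15),
    datetime(2020, 4, 1, 10, 45),
]
print(attribute_events(events, windows))
```

            Errors that fall outside every window would point to past activity rather than to the benchmark itself.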

            [MemTest86] Using default - which we think is multi-threading
            The default is MT, but if the motherboard doesn't support it, then it falls back to ST. So we can't be sure without looking at the screen or the log.
