ECC errors - not happening all times

  • ECC errors - not happening all times

    Hello,

    Today I had a power outage that lasted a few minutes and was covered by the UPS. The system performed an emergency shutdown. After booting it again, I noticed a number of these errors in the syslog:

    Code:
    Jun 6 08:11:20 prxsrv kernel: [ 0.386924] EDAC MC: Ver: 3.0.0
    Jun 6 08:11:20 prxsrv kernel: [ 12.168935] EDAC MC0: Giving out device to module ie31200_edac controller IE31200: DEV 0000:00:00.0 (POLLED)
    Jun 6 08:11:22 prxsrv kernel: [ 15.218862] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x0 offset:0x0 grain:
    Jun 6 08:11:22 prxsrv kernel: [ 15.218864] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x0 offset:0x0 grain:
    Jun 6 08:11:26 prxsrv kernel: [ 19.314923] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#0channel#0 (csrow:0 channel:0 page:0x0 offset:0x0 grain:
    Jun 6 08:11:33 prxsrv kernel: [ 25.462965] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x0 offset:0x0 grain:
    Jun 6 08:11:52 prxsrv kernel: [ 44.904811] EDAC MC0: 1 UE UE overwrote CE on any memory ( page:0x0 offset:0x0 grain:
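A minimal sketch for tallying the UE lines above per csrow and channel, which makes it easier to see whether the reports concentrate on one row. The regular expression is derived only from the excerpt in this post, so it is an assumption about the ie31200 driver's log format; `tally_ue` and `SAMPLE` are hypothetical names, and the sample lines are copied as they appear here, truncation included.

```python
import re
from collections import Counter

# Matches the per-csrow UE lines as they appear in the excerpt above, e.g.
# "EDAC MC0: 1 UE ie31200 UE on mc#0csrow#2channel#0 (...".
# Assumption: other kernels or drivers may format these lines differently.
UE_LINE = re.compile(r"EDAC MC(\d+): (\d+) UE .*?csrow#(\d+)channel#(\d+)")


def tally_ue(lines):
    """Count uncorrectable-error reports per (mc, csrow, channel)."""
    counts = Counter()
    for line in lines:
        match = UE_LINE.search(line)
        if match:
            mc, n, csrow, channel = (int(g) for g in match.groups())
            counts[(mc, csrow, channel)] += n
    return counts


# Lines copied from the syslog excerpt in this post (truncated at "grain:").
SAMPLE = """\
Jun 6 08:11:20 prxsrv kernel: [ 0.386924] EDAC MC: Ver: 3.0.0
Jun 6 08:11:22 prxsrv kernel: [ 15.218862] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x0 offset:0x0 grain:
Jun 6 08:11:22 prxsrv kernel: [ 15.218864] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x0 offset:0x0 grain:
Jun 6 08:11:26 prxsrv kernel: [ 19.314923] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#0channel#0 (csrow:0 channel:0 page:0x0 offset:0x0 grain:
Jun 6 08:11:33 prxsrv kernel: [ 25.462965] EDAC MC0: 1 UE ie31200 UE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x0 offset:0x0 grain:
Jun 6 08:11:52 prxsrv kernel: [ 44.904811] EDAC MC0: 1 UE UE overwrote CE on any memory ( page:0x0 offset:0x0 grain:
"""

if __name__ == "__main__":
    for (mc, csrow, channel), n in sorted(tally_ue(SAMPLE.splitlines()).items()):
        print(f"mc{mc} csrow{csrow} channel{channel}: {n} UE")
```

The "UE overwrote CE on any memory" line carries no csrow or channel, so the sketch deliberately skips it. To run it on a full log, pass an open file instead of `SAMPLE.splitlines()`.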
    The system:
    Code:
     
    EFI Specifications 2.40
    System
    Manufacturer Supermicro
    Product Name Super Server
    Version 0123456789
    Serial Number 0123456789
    BIOS
    Vendor American Megatrends Inc.
    Version 2.5
    Release Date 11/26/2020
    Baseboard
    Manufacturer Supermicro
    Product Name X11SSH-F
    Version 1.01
    Serial Number ZM17AS029357
    CPU Type Intel Xeon E3-1245 v6 @ 3.70GHz
    CPU Clock 3697 MHz [Turbo: 3776.6 MHz]
    # Logical Processors 8 (4 enabled for testing)
    L1 Cache 4 x 64K (50607 MB/s)
    L2 Cache 4 x 256K (22413 MB/s)
    L3 Cache 8192K (13610 MB/s)
    Memory 65356M (8160 MB/s)
    Number of RAM SPDs detected 4
    SPD #0 16GB DDR4 ECC PC4-21300
    Kingston / 9965684-034.A00G / 03C41ABB
    19-19-19-43 / 2666 MHz / 1.2V
    SPD #1 16GB DDR4 ECC PC4-21300
    Kingston / 9965684-034.A00G / F6841ABA
    19-19-19-43 / 2666 MHz / 1.2V
    SPD #2 16GB DDR4 ECC PC4-21300
    Kingston / 9965684-034.A00G / F6841088
    19-19-19-43 / 2666 MHz / 1.2V
    SPD #3 16GB DDR4 ECC PC4-21300
    Kingston / 9965684-034.A00G / EB84198F
    19-19-19-43 / 2666 MHz / 1.2V
    Number of RAM slots 4
    Number of RAM modules 4
    DIMM Slot #0 16GB DDR4 ECC PC4-21300
    Kingston / 9965684-034.A00G / 03C41ABB
    2667 MHz
    DIMM Slot #1 16GB DDR4 ECC PC4-21300
    Kingston / 9965684-034.A00G / F6841ABA
    2667 MHz
    DIMM Slot #2 16GB DDR4 ECC PC4-21300
    Kingston / 9965684-034.A00G / F6841088
    2667 MHz
    DIMM Slot #3 16GB DDR4 ECC PC4-21300
    Kingston / 9965684-034.A00G / EB84198F
    2667 MHz
    So I did a few memtest runs. ECC errors only occur during Test #0 and Test #1.
    The tests I did:
    • all 4 RAM modules
    • only 2 RAM modules in dual-channel configuration (first A, then B)
    • only one RAM module, tried in every RAM slot
    Example results:
    All modules:

    Code:
    Result summary
    
    
    Test Start Time 2021-06-06 13:13:16
    Elapsed Time 2:46:42
    Memory Range Tested 0x0 - 1075800000 (67416MB)
    CPU Selection Mode Parallel (All CPUs)
    CPU Temperature Min/Max/Ave 31C/36C/34C
    RAM Temperature Min/Max/Ave 52C/62C/57C
    ECC Polling Enabled
    # Tests Passed 11/11 (100%)
    ECC Correctable Errors 66
    ECC Uncorrectable Errors 0
    Test # Tests Passed Errors
    Test 0 [Address test, walking ones, 1 CPU] 1/1 (100%) 0
    Test 1 [Address test, own address, 1 CPU] 1/1 (100%) 0
    Test 2 [Address test, own address] 1/1 (100%) 0
    Test 3 [Moving inversions, ones & zeroes] 1/1 (100%) 0
    Test 4 [Moving inversions, 8-bit pattern] 1/1 (100%) 0
    Test 5 [Moving inversions, random pattern] 1/1 (100%) 0
    Test 6 [Block move, 64-byte blocks] 1/1 (100%) 0
    Test 7 [Moving inversions, 32-bit pattern] 1/1 (100%) 0
    Test 8 [Random number sequence] 1/1 (100%) 0
    Test 9 [Modulo 20, ones & zeros] 1/1 (100%) 0
    Test 10 [Bit fade test, 2 patterns, 1 CPU] 1/1 (100%) 0
    Test 13 [Hammer test] 0/0 (0%) 0
    Last 10 Errors
    [ECC Error] Test: 1, (Rank,Bank,Row,Col): (0,0,1FC00,0), ECC Corrected: Yes, Syndrome: 00FF, Channel/Slot: 1/0
    [ECC Error] Test: 1, (Rank,Bank,Row,Col): (0,0,1FC00,8), ECC Corrected: Yes, Syndrome: 0077, Channel/Slot: 0/0
    [ECC Error] Test: 1, (Rank,Bank,Row,Col): (0,0,1F280,8), ECC Corrected: Yes, Syndrome: 00AC, Channel/Slot: 1/0
    [ECC Error] Test: 1, (Rank,Bank,Row,Col): (0,0,1F280,C), ECC Corrected: Yes, Syndrome: 00DB, Channel/Slot: 0/0
    [ECC Error] Test: 1, (Rank,Bank,Row,Col): (0,0,1E900,8), ECC Corrected: Yes, Syndrome: 00D5, Channel/Slot: 1/0
    [ECC Error] Test: 1, (Rank,Bank,Row,Col): (0,0,1E900,8), ECC Corrected: Yes, Syndrome: 0012, Channel/Slot: 0/0
    [ECC Error] Test: 1, (Rank,Bank,Row,Col): (0,0,1DF80,0), ECC Corrected: Yes, Syndrome: 00E2, Channel/Slot: 1/0
    [ECC Error] Test: 1, (Rank,Bank,Row,Col): (0,0,1DF80,8), ECC Corrected: Yes, Syndrome: 00CF, Channel/Slot: 0/0
    [ECC Error] Test: 1, (Rank,Bank,Row,Col): (0,0,1D600,8), ECC Corrected: Yes, Syndrome: 0041, Channel/Slot: 1/0
    [ECC Error] Test: 1, (Rank,Bank,Row,Col): (0,0,1D600,C), ECC Corrected: Yes, Syndrome: 00C5, Channel/Slot: 0/0
    Single Module:
    Code:
    Result summary
    
    
    Test Start Time 2021-06-06 11:19:31
    Elapsed Time 0:01:01
    Memory Range Tested 0x0 - 475800000 (18264MB)
    CPU Selection Mode Parallel (All CPUs)
    CPU Temperature Min/Max/Ave 30C/30C/30C
    RAM Temperature Min/Max/Ave 50C/50C/50C
    ECC Polling Enabled
    # Tests Passed 4/4 (100%)
    ECC Correctable Errors 10
    ECC Uncorrectable Errors 0
    Test # Tests Passed Errors
    Test 0 [Address test, walking ones, 1 CPU] 1/1 (100%) 0
    Test 1 [Address test, own address, 1 CPU] 1/1 (100%) 0
    Test 2 [Address test, own address] 1/1 (100%) 0
    Test 3 [Moving inversions, ones & zeroes] 1/1 (100%) 0
    Last 10 Errors
    [ECC Error] Test: 1, (Rank,Bank,Row,Col): (2,0,1F200,18), ECC Corrected: Yes, Syndrome: 0063, Channel/Slot: 1/0
    [ECC Error] Test: 1, (Rank,Bank,Row,Col): (2,0,1CC00,8), ECC Corrected: Yes, Syndrome: 00DD, Channel/Slot: 1/0
    [ECC Error] Test: 1, (Rank,Bank,Row,Col): (2,0,1A600,10), ECC Corrected: Yes, Syndrome: 00FF, Channel/Slot: 1/0
    [ECC Error] Test: 1, (Rank,Bank,Row,Col): (2,0,18000,8), ECC Corrected: Yes, Syndrome: 00F9, Channel/Slot: 1/0
    [ECC Error] Test: 1, (Rank,Bank,Row,Col): (2,0,15A00,8), ECC Corrected: Yes, Syndrome: 007F, Channel/Slot: 1/0
    [ECC Error] Test: 1, (Rank,Bank,Row,Col): (2,0,13400,10), ECC Corrected: Yes, Syndrome: 009E, Channel/Slot: 1/0
    [ECC Error] Test: 1, (Rank,Bank,Row,Col): (2,0,10E00,8), ECC Corrected: Yes, Syndrome: 00E5, Channel/Slot: 1/0
    [ECC Error] Test: 1, (Rank,Bank,Row,Col): (2,0,10000,8), ECC Corrected: Yes, Syndrome: 003C, Channel/Slot: 1/0
    [ECC Error] Test: 1, (Rank,Bank,Row,Col): (2,0,1BA00,0), ECC Corrected: Yes, Syndrome: 0050, Channel/Slot: 1/0
    [ECC Error] Test: 0, (Rank,Bank,Row,Col): (2,0,10000,0), ECC Corrected: Yes, Syndrome: 00CE, Channel/Slot: 1/0
    The errors above occur with every module, regardless of the slot it is seated in. I don't think the RAM is the problem; maybe the CPU or the mainboard is fried.
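To compare runs across modules and slots, the "[ECC Error]" lines from the reports can be counted per test and per channel/slot. This is only a sketch: the line format is taken from the report excerpts in this post and has not been checked against the memtest documentation, and `summarize` and `SAMPLE` are hypothetical names.

```python
import re
from collections import Counter

# Line format assumed from the "Last 10 Errors" excerpts in this post.
ECC_LINE = re.compile(
    r"\[ECC Error\] Test: (?P<test>\d+), "
    r"\(Rank,Bank,Row,Col\): "
    r"\((?P<rank>\w+),(?P<bank>\w+),(?P<row>\w+),(?P<col>\w+)\), "
    r"ECC Corrected: (?P<corrected>\w+), "
    r"Syndrome: (?P<syndrome>\w+), "
    r"Channel/Slot: (?P<channel>\d+)/(?P<slot>\d+)"
)


def summarize(report_text):
    """Tally ECC error lines by test number and by (channel, slot)."""
    by_test = Counter()
    by_channel_slot = Counter()
    uncorrected = 0
    for m in ECC_LINE.finditer(report_text):
        by_test[int(m["test"])] += 1
        by_channel_slot[(int(m["channel"]), int(m["slot"]))] += 1
        if m["corrected"] != "Yes":
            uncorrected += 1
    return {
        "by_test": by_test,
        "by_channel_slot": by_channel_slot,
        "uncorrected": uncorrected,
    }


# Three lines copied from the single-module report above.
SAMPLE = """\
[ECC Error] Test: 1, (Rank,Bank,Row,Col): (2,0,10000,8), ECC Corrected: Yes, Syndrome: 003C, Channel/Slot: 1/0
[ECC Error] Test: 1, (Rank,Bank,Row,Col): (2,0,1BA00,0), ECC Corrected: Yes, Syndrome: 0050, Channel/Slot: 1/0
[ECC Error] Test: 0, (Rank,Bank,Row,Col): (2,0,10000,0), ECC Corrected: Yes, Syndrome: 00CE, Channel/Slot: 1/0
"""

if __name__ == "__main__":
    result = summarize(SAMPLE)
    print("per test:", dict(result["by_test"]))
    print("per channel/slot:", dict(result["by_channel_slot"]))
    print("uncorrected:", result["uncorrected"])
```

If the counts stay similar no matter which module is installed, that would be consistent with the suspicion that the modules themselves are not at fault, although a tally like this cannot prove it.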

    Any ideas?

  • #2
    There is a known BIOS bug with i3200 chipsets:
    https://bugzilla.redhat.com/show_bug.cgi?id=564274
    The final comment was: "Some i3210 BIOSes have problems enabling the hardware checks at the MCU. On those hardware, customers should try to disable Quickboot and / or "Memory Remap Feature" or to disable EDAC drivers."

    This isn't the exact model you have, but the behaviour sounds similar.

    • #3
      Hello David,

      Thanks for your quick reply. The 4x run without Test #13 (Hammer test) finished without errors. After that, I disabled Fastboot and Memory Remap in the BIOS and tried a short run of Test #0 and #1; the same errors still occur. So I am quite uncertain how relevant they are for the stable operation of the system. The behaviour sounds similar to the bug report, but the server is now around three years old and I never noticed these errors before, which is a bit strange.
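One way to see whether the running system keeps accumulating errors outside of memtest is to read the counters that the Linux EDAC subsystem exposes in sysfs. The sketch below assumes an EDAC driver is loaded (ie31200_edac on this board, per the syslog) and that the usual `/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count` files are present; `read_edac_counts` is a hypothetical helper name.

```python
from pathlib import Path

# Standard sysfs location of the EDAC memory-controller counters.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {"mc0": {"ce": int or None, "ue": int or None}, ...}.

    A value of None means the counter file was not found. An empty dict
    means no mc* directory exists, e.g. because no EDAC driver is loaded.
    """
    result = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        counts = {}
        for key, filename in (("ce", "ce_count"), ("ue", "ue_count")):
            counter = mc / filename
            counts[key] = int(counter.read_text().strip()) if counter.is_file() else None
        result[mc.name] = counts
    return result


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found")
    for name, c in counts.items():
        print(f"{name}: correctable={c['ce']} uncorrectable={c['ue']}")
```

Comparing the output shortly after boot and again after some hours of normal load would show whether the counters grow during regular operation or only at startup.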

      • #4
        Sorry for double-posting, but I can't edit the previous post anymore.

        I replaced the PSU and the CMOS battery on the failed system, then booted it with all drives disconnected. The ECC errors still appear. My workstation (Ryzen 5950X) supports ECC RAM, so I swapped the RAM between my workstation and the server. On the workstation, the "failed" RAM from the server has shown no errors so far, and the non-ECC (workstation) RAM in the server is currently testing fine, too. So I am only more confused. Possibly the CPU or the memory controller in the server is faulty. Or there is an incompatibility between the server's RAM and BIOS, but that would be strange after three years of operation without problems...
