ECC error for Transcend/Innodisk DDR4 ECC 32GB

  • ECC error for Transcend/Innodisk DDR4 ECC 32GB

    Dear Experts,

    We are using an industrial board (Intel Xeon W-11865MRE) to test DDR4 ECC 32GB memory modules, and we get the following ECC error messages, as shown in the attached log files.
    We are not sure whether this is "normal" or "abnormal", because the "cumulative error count" is always shown as "0".
    If it is "normal", we need your help explaining it, because our customer is waiting for our answer.
    If it is "abnormal", the error message appears only once and the rest of the tests all pass, Tests 4/7/9 especially. That seems strange.


    A memory module: Transcend TS4GSH72V2E-I DDR4-3200 ECC 32GB (SEC K4AAG085WA BCWE) x2
    2024-11-16 21:53:53 - [MEM ERROR - ECC] Test: 7, (Chan,Slot,Rank,Bank,Row,Col): (1,0,3,3,BF56,38, ECC Corrected: yes, Syndrome: 0068, Channel/Slot: 1-0
    ...
    2024-11-16 22:13:03 - MtSupportRunAllTests - Test execution time: 1693.065s (Test 7 cumulative error count: 0, buffer full count: 0)


    B memory module: Innodisk M4D0-BGS2Q5EM-26 DDR4-3200 ECC 32GB (SEC K4AAG085WA BCWE) x2
    2024-11-17 08:15:09 - [MEM ERROR - ECC] Test: 9, (Chan,Slot,Rank,Bank,Row,Col): (1,0,3,3,1270,3C, ECC Corrected: yes, Syndrome: 0019, Channel/Slot: 1-0
    ...
    2024-11-17 08:15:39 - MtSupportRunAllTests - Test execution time: 1523.486s (Test 9 cumulative error count: 0, buffer full count: 0)


    C memory module: Transcend TS4GSH72V2E-I DDR4 (downgrade speed from 3200 to 2667) ECC 32GB (SEC K4AAG085WA BCWE) x2
    2024-11-16 18:58:53 - [MEM ERROR - ECC] Test: 4, (Chan,Slot,Rank,Bank,Row,Col): (1,0,3,1,5E01,250), ECC Corrected: yes, Syndrome: 00A8, Channel/Slot: 1-0
    ...
    2024-11-16 19:05:59 - MtSupportRunAllTests - Test execution time: 488.191s (Test 4 cumulative error count: 0, buffer full count: 0)

  • #2
    "cumulative error count" is always shown "0"
    If the error was successfully corrected (which is what ECC RAM does for a 1-bit error), it is no longer treated as an error and is not counted in the total. The event is still logged, however.

    If two or more bits are in error, the error can't be corrected and is reported as an error, which is a lot worse.
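
    To illustrate the difference between a corrected 1-bit error and an uncorrectable 2-bit error, here is a minimal sketch of a single-error-correcting, double-error-detecting (SEC-DED) code. It uses a toy 8-bit extended Hamming code protecting 4 data bits. This is an assumption chosen for readability: real DDR4 ECC modules protect 64 data bits with 8 check bits, and the exact code used by a given memory controller is not stated in this thread. The function names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SEC-DED) codeword.

    Index 0 holds the overall parity bit; indices 1, 2 and 4 hold the
    Hamming parity bits; indices 3, 5, 6 and 7 hold the data bits.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    for pos in range(1, 8):                 # overall parity over positions 1..7
        code[0] ^= code[pos]
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    overall = 0
    for pos, bit in enumerate(code):
        if bit:
            syndrome ^= pos   # XOR of the positions of all set bits
            overall ^= 1      # parity of the whole 8-bit word
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flipped bits: assume exactly one and flip it back.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flipped bits with a non-zero syndrome:
        # detected, but there is no way to tell which bits are wrong.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)

one_bit = list(word)
one_bit[5] ^= 1                 # single-bit error
print(decode(one_bit))          # status 'corrected', data recovered

two_bit = list(word)
two_bit[5] ^= 1
two_bit[6] ^= 1                 # double-bit error
print(decode(two_bit))          # status 'uncorrectable'
```

    In the single-bit case the data comes back intact, which matches the "ECC Corrected: yes" events in the logs above. In the double-bit case the decoder can only report that something is wrong.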

    Getting RAM errors is not normal (even if they get corrected).

    While we have no statistics on the matter, many people believe that getting a corrected 1-bit error makes it more likely that an uncorrected 2-bit error will also occur, i.e. it is a canary in the coal mine.
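
    Because corrected events are logged but not added to the cumulative error count, one way to keep an eye on them is to tally the "[MEM ERROR - ECC]" lines from the log yourself. The sketch below is a rough example that assumes the line format quoted in the first post. The regular expression, the function name, and the assumption that an uncorrected event would show a value other than "yes" are mine, not part of any official log specification.

```python
import re
from collections import Counter

# Pattern modelled on the log lines quoted in the first post. The closing
# parenthesis after the address tuple is optional because two of the quoted
# lines do not have it.
ECC_LINE = re.compile(
    r"\[MEM ERROR - ECC\] Test: (?P<test>\d+), "
    r"\(Chan,Slot,Rank,Bank,Row,Col\): "
    r"\((?P<chan>\w+),(?P<slot>\w+),(?P<rank>\w+),"
    r"(?P<bank>\w+),(?P<row>\w+),(?P<col>\w+)\)?,\s*"
    r"ECC Corrected: (?P<corrected>\w+), Syndrome: (?P<syndrome>\w+)"
)


def tally_ecc_events(lines):
    """Count logged ECC events and group them by (channel, slot, rank)."""
    corrected = 0
    uncorrected = 0
    by_location = Counter()
    for line in lines:
        match = ECC_LINE.search(line)
        if match is None:
            continue  # not an ECC event line
        if match["corrected"] == "yes":
            corrected += 1
        else:
            # Assumption: anything other than "yes" means not corrected.
            uncorrected += 1
        by_location[(match["chan"], match["slot"], match["rank"])] += 1
    return corrected, uncorrected, by_location


# The lines quoted in the first post, copied as-is.
SAMPLE_LOG = [
    "2024-11-16 21:53:53 - [MEM ERROR - ECC] Test: 7, (Chan,Slot,Rank,Bank,Row,Col): (1,0,3,3,BF56,38, ECC Corrected: yes, Syndrome: 0068, Channel/Slot: 1-0",
    "2024-11-16 22:13:03 - MtSupportRunAllTests - Test execution time: 1693.065s (Test 7 cumulative error count: 0, buffer full count: 0)",
    "2024-11-17 08:15:09 - [MEM ERROR - ECC] Test: 9, (Chan,Slot,Rank,Bank,Row,Col): (1,0,3,3,1270,3C, ECC Corrected: yes, Syndrome: 0019, Channel/Slot: 1-0",
    "2024-11-16 18:58:53 - [MEM ERROR - ECC] Test: 4, (Chan,Slot,Rank,Bank,Row,Col): (1,0,3,1,5E01,250), ECC Corrected: yes, Syndrome: 00A8, Channel/Slot: 1-0",
]

corrected, uncorrected, by_location = tally_ecc_events(SAMPLE_LOG)
print("corrected:", corrected, "uncorrected:", uncorrected)
for (chan, slot, rank), count in by_location.items():
    print(f"channel {chan}, slot {slot}, rank {rank}: {count} event(s)")
```

    Grouping by location makes it easier to see whether repeated corrected events keep landing on the same channel, slot, and rank across different test runs.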
