Need HELP in understanding/interpreting results and logs (especially ECC-Errors!!)

    I have some vintage hardware here, mainly of the Intel X79/C602 (Ivy Bridge-E/-EP) generation.
    The system that I actually bought the MemTest Pro License for is a Dual-Xeon E5-2690v2 machine on an ASUS Z9PE-D8 WS board.
    Currently using MemTest Pro v9.3.

    I have attached text and log files from 2 complete test suite runs with differing DIMM/slot distribution.
    It seems I can't upload the HTML report file or a complete log text file due to size limitations.
    So I cropped the log files, leaving some of the boot log and the first complete test run.
    I combined the contents of two RAM-info files into one text file to stay within the limits for file attachments.

    I'm getting "Corrected ECC-Errors" reported after some of the test runs.
    Some test runs finish with several dozen of those corrected errors reported, while others finish with no reported errors of any kind.
    Whether errors appear depends on the module combination and its distribution across the DIMM slots.
    But in every case, complete test runs are reported as: Passed, 0 Errors.

    OK, ECC memory is supposed to self-correct some errors.
    But is it acceptable that ECC errors are reported at all?
    I asked a local IT vendor about this. He replied that if he were the admin of a server that started reporting ECC errors, the corresponding modules would certainly be marked for replacement during the next scheduled maintenance window.
    In his view, yes, those modules could be expected to fail seriously, with uncorrectable errors, in the future.

    What's your opinion, team MemTest?
    The vendor grants warranty for the modules, so I can replace faulty ones.
    Are ECC errors on purchased modules reason enough to call for replacement? Or are those self-corrected errors within an "acceptable" range?

    I can provide more information about the system if needed, but roughly it is as follows:
    • ASUS Z9PE-D8 WS: dual-CPU, quad-channel memory board with Intel C602 chipset, 4 DIMM slots per CPU.
    • Intel Xeon E5-2690v2: 2 pcs., each 10 cores/20 threads, 3.00 GHz.
    • Samsung M386B4G70DM0-CMA4: 8 pcs. of 32 GB DDR3-1866 CL13 Reg-ECC DIMMs,
      divided into:
      - 4 DIMMs initially bought as 1 set of 4, manufacturing date: week 40, 2014
      - 4 DIMMs added recently as 2 sets of 2, manufacturing date: week 47, 2014

    The system ran fairly stably with the first set of 4 modules (week 40, 2014).
    Test setups: single-, dual-, and quad-channel configuration (quad-channel using 1 CPU only).
    All reported "0 Errors".

    Test runs in 2-CPU, single- and dual-channel config, mixing DIMMs of 40/2014 and first 2 of 47/2014, were all successful with 0 Errors of any kind.
    Test runs in 2-CPU, single-channel config, mixing first 2 DIMMs of 47/2014 and second 2 of 47/2014 were all successful with 0 Errors of any kind.
    But:
    Test runs in 2-CPU, dual-channel config, mixing first 2 DIMMs of 47/2014 and second 2 of 47/2014 reported dozens of "Corrected" ECC errors in SOME, not all, configs.

    That means that some of the DIMMs passing test runs with 0 ECC Errors in single-channel config (1 DIMM per CPU) cause ECC Errors in dual-channel config (2 DIMMs per CPU).
    (I haven't tested quad-channel across all 8 DIMMs yet.)
    What might be the reason for that, and does it foreshadow a major failure of one or more DIMMs in the future?

    The reported errors look like this:

    2022-01-16 00:59:40 - [Channel 4, Slot 0] DIMM err count=3 (prev=0)
    2022-01-16 00:59:40 - [MEM ERROR - ECC] Test: 5, (Chan,Slot,Rank,Bank,Row,Col): (4,0,N/A,N/A,N/A,N/A), ECC Corrected: yes, Syndrome: N/A, Channel/Slot: 4/0
    The reported ECC errors are always limited to one and the same "channel" within one complete run of 4 passes over all 13 tests.
    The channel number changes after swapping DIMMs across the slots.
    So does this mean that a single module is failing?
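
A minimal sketch for tallying such log lines per channel/slot, which may make it easier to compare runs before and after swapping DIMMs. The regular expressions are derived only from the two sample lines quoted above, so other log variants may need adjusting. The function names are hypothetical, and reading `count - prev` as "errors added since the last report" is an assumption, not documented behaviour.

```python
import re
from collections import Counter

# Patterns inferred from the two sample lines in this post (assumption:
# other MemTest versions or boards may format these lines differently).
ECC_LINE = re.compile(
    r"\[MEM ERROR - ECC\].*?ECC Corrected:\s*(?P<corrected>yes|no)"
    r".*?Channel/Slot:\s*(?P<channel>\d+)/(?P<slot>\d+)"
)
COUNT_LINE = re.compile(
    r"\[Channel (?P<channel>\d+), Slot (?P<slot>\d+)\] "
    r"DIMM err count=(?P<count>\d+) \(prev=(?P<prev>\d+)\)"
)


def tally_ecc_events(lines):
    """Count '[MEM ERROR - ECC]' lines per (channel, slot).

    Returns two Counters: corrected events and uncorrected events.
    """
    corrected, uncorrected = Counter(), Counter()
    for line in lines:
        m = ECC_LINE.search(line)
        if m is None:
            continue
        key = (int(m["channel"]), int(m["slot"]))
        if m["corrected"] == "yes":
            corrected[key] += 1
        else:
            uncorrected[key] += 1
    return corrected, uncorrected


def new_errors_per_dimm(lines):
    """Sum (count - prev) from the 'DIMM err count' lines per (channel, slot).

    Assumption: 'count' is a running counter and 'prev' its previous value.
    """
    totals = Counter()
    for line in lines:
        m = COUNT_LINE.search(line)
        if m is not None:
            key = (int(m["channel"]), int(m["slot"]))
            totals[key] += int(m["count"]) - int(m["prev"])
    return totals


# The two sample lines from the post.
SAMPLE = [
    "2022-01-16 00:59:40 - [Channel 4, Slot 0] DIMM err count=3 (prev=0)",
    "2022-01-16 00:59:40 - [MEM ERROR - ECC] Test: 5, "
    "(Chan,Slot,Rank,Bank,Row,Col): (4,0,N/A,N/A,N/A,N/A), "
    "ECC Corrected: yes, Syndrome: N/A, Channel/Slot: 4/0",
]

if __name__ == "__main__":
    corrected, uncorrected = tally_ecc_events(SAMPLE)
    print("corrected events:", dict(corrected))
    print("uncorrected events:", dict(uncorrected))
    print("new errors per DIMM:", dict(new_errors_per_dimm(SAMPLE)))
```

Running this over the full log of each run, while noting which module sat in which slot, shows whether the errors follow one module or stay with one slot.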

  • #2
    But is it acceptable that even ECC errors are reported at all?
    It depends on what the server is used for. If it were a web server for a hobby site, I would ignore it, as the consequences of corruption are low.
    If it were an air traffic control system, I would definitely replace the modules.

    That means that some of the DIMMs passing test runs with 0 ECC Errors in single-channel config (1 DIMM per CPU) cause ECC Errors in dual-channel config.
    It isn't that uncommon to see errors only in dual- or quad-channel mode.
    That mode changes the access pattern, the electrical load, and the slots in use, so behaviour is sometimes different.


    • #3
      Thanks for your reply and explanation, David (PassMark).
      The server is for personal use only.
      The seller grants a warranty even for these used DIMMs, so I wanted to know whether I should ask for a replacement.
      I didn't know if these dozens of ECC errors might indicate some sort of growing error or electrical fault in the future that could turn into uncorrectable errors.


      • #4
        I didn't know if these dozens of ECC errors might indicate some sort of growing error or electrical fault in the future that could turn into uncorrectable errors.
        We don't really know either.
        We don't have any good data on it. However, common sense would suggest that a module producing lots of correctable errors could, from time to time, produce an uncorrectable one.
