ECC errors - unclear which DIMM

  • ECC errors - unclear which DIMM

    I find the ECC error reporting ambiguous about which DIMM has failed. This is on a system with dual Haswell processors.

    Here is an example of how an ECC error is reported:

    2017-02-15 20:34:46 - [MEM ERROR - ECC] Test: 6, (Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: yes, Syndrome: N/A, Channel/Slot: 5/0
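    For anyone post-processing the log, here is a minimal sketch of pulling the Channel/Slot pair out of a line like the one above. The regular expression is inferred from this single sample line only, the function name `parse_ecc_line` is hypothetical, and the `no` value for "ECC Corrected" is an assumption (only `yes` appears in the sample).

```python
import re

# The sample line quoted in this post.
LINE = (
    "2017-02-15 20:34:46 - [MEM ERROR - ECC] Test: 6, "
    "(Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: yes, "
    "Syndrome: N/A, Channel/Slot: 5/0"
)

# Pattern inferred from the one sample line; other log variants may differ.
# The "no" alternative is an assumption, since only "yes" is shown above.
ECC_RE = re.compile(
    r"\[MEM ERROR - ECC\] Test: (?P<test>\d+),.*?"
    r"ECC Corrected: (?P<corrected>yes|no),.*?"
    r"Channel/Slot: (?P<channel>\d+)/(?P<slot>\d+)"
)


def parse_ecc_line(line):
    """Return the test number, corrected flag, channel and slot, or None."""
    match = ECC_RE.search(line)
    if match is None:
        return None
    return {
        "test": int(match.group("test")),
        "corrected": match.group("corrected") == "yes",
        "channel": int(match.group("channel")),
        "slot": int(match.group("slot")),
    }


print(parse_ecc_line(LINE))
```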

    Clearly you are reporting only Channel/Slot instead of the DIMM number. The problem is that there is no clear mapping between the Channel/Slot and the DIMMs. In fact, the channel number seems inconsistent with the DIMM organization, so there may be a bug in your DIMM location reporting.

    There are 16 DIMMs (which your code numbers 0 to 15) in this server system, eight on each processor. We are doing this testing with a known faulty DIMM that is plugged into CPU 1, HA 1, Channel 1, Slot 0. So how does that get reported as Channel 5? And is there any way to figure out how this mapping is done? Why not just report the DIMM number?

    My memtest86.log exceeds the maximum upload size, so I've trimmed off one of the boots and included that.

    Thank you

    p.s. It would also be nice if you included a count of corrected ECC errors as well as the count of uncorrected errors which you currently report.

  • #2
    The reported channel/slot number is from the perspective of the physical CPU, which is most likely different from the DIMM number assignment from the system point-of-view or the DIMM labeling on your motherboard. So at the moment, you may need to perform your own experimentation and/or consult your motherboard vendor to determine this mapping.

    "p.s. It would also be nice if you included a count of corrected ECC errors as well as the count of uncorrected errors which you currently report."
    Not sure what you mean by this. We do show the number of ECC corrected and uncorrected errors in the test summary and report file.


    • #3
      Sorry, I don't understand. There are two CPUs, and there are four channels on each CPU: two on each of the two HAs. This is determined by the Intel CPU, not by the motherboard. So which Intel-designated CPU/HA/Channel does Channel 5 map to?

      Thanks; I stopped before getting to the Summary or Report file (just looked at the running screen and logfile); I'll check it out when I have the chance to do a longer run to completion.



      • #4
        It would be in sequential order (i.e., CPU0, HA0, Ch0 is channel 0; CPU0, HA0, Ch1 is channel 1; etc.)
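        The sequential numbering described here can be sketched as plain index arithmetic. This is only an illustration, not MemTest86's actual code: the layout of two HAs per CPU and two channels per HA is taken from post #3, and the names `decode_channel` and `encode_channel` are hypothetical.

```python
# Layout assumed from post #3: two HAs per CPU, two channels per HA.
HAS_PER_CPU = 2
CHANNELS_PER_HA = 2
CHANNELS_PER_CPU = HAS_PER_CPU * CHANNELS_PER_HA


def decode_channel(flat_channel):
    """Split a sequential channel number into (cpu, ha, channel)."""
    if flat_channel < 0:
        raise ValueError("channel number must be non-negative")
    cpu, rest = divmod(flat_channel, CHANNELS_PER_CPU)
    ha, channel = divmod(rest, CHANNELS_PER_HA)
    return cpu, ha, channel


def encode_channel(cpu, ha, channel):
    """Inverse of decode_channel: (cpu, ha, channel) to a sequential number."""
    if not (0 <= ha < HAS_PER_CPU and 0 <= channel < CHANNELS_PER_HA):
        raise ValueError("ha or channel out of range for the assumed layout")
    return cpu * CHANNELS_PER_CPU + ha * CHANNELS_PER_HA + channel


# Print the whole table for a dual-CPU system.
for flat in range(2 * CHANNELS_PER_CPU):
    cpu, ha, ch = decode_channel(flat)
    print(f"channel {flat}: CPU{cpu}, HA{ha}, Ch{ch}")
```

        Under these assumptions, channel 5 decodes to CPU1, HA0, Ch1, while the CPU 1, HA 1, Channel 1 position mentioned in the first post would encode to channel 7.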
