I find the ECC error reporting ambiguous about which DIMM has failed. This is on a system with dual Haswell processors.
Here is an example of how an ECC error is reported:
2017-02-15 20:34:46 - [MEM ERROR - ECC] Test: 6, (Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: yes, Syndrome: N/A, Channel/Slot: 5/0
Clearly you are reporting only Channel/Slot instead of the DIMM number. The problem is that there is no clear mapping between Channel/Slot and the DIMMs. In fact, the channel number seems inconsistent with the DIMM organization, so there may be a bug in your DIMM location reporting.
There are 16 DIMMs (which your code numbers 0 to 15) in this server system, eight on each processor. We are doing this testing with a known faulty DIMM that is plugged into CPU 1, HA 1, Channel 1, slot 0. So how does that get reported as Channel 5? Is there any way to figure out how this mapping is done? Why not just report the DIMM number?
My memtest86.log exceeds the maximum upload size, so I've trimmed off one of the boots and included that.
Thank you
P.S. It would also be nice if you included a count of corrected ECC errors alongside the count of uncorrected errors that you currently report.
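For anyone wanting such a count before it appears in the report, a tally can be pulled from the log with a short script. The following is only a minimal sketch: it assumes every ECC line follows the format shown in the example above, and it assumes uncorrected errors appear as `ECC Corrected: no`, which I have not confirmed. The function name `tally_ecc_errors` is made up for illustration.

```python
import re
from collections import Counter

# Pattern based on the single log line quoted in this post.
# Assumption: uncorrected errors are logged as "ECC Corrected: no".
ECC_LINE = re.compile(
    r"\[MEM ERROR - ECC\].*?ECC Corrected:\s*(?P<corrected>yes|no)"
    r".*?Channel/Slot:\s*(?P<channel>\d+)/(?P<slot>\d+)"
)


def tally_ecc_errors(lines):
    """Count ECC errors per (channel, slot), split into corrected and uncorrected."""
    corrected = Counter()
    uncorrected = Counter()
    for line in lines:
        match = ECC_LINE.search(line)
        if match is None:
            continue  # not an ECC error line in the expected format
        location = (int(match["channel"]), int(match["slot"]))
        if match["corrected"] == "yes":
            corrected[location] += 1
        else:
            uncorrected[location] += 1
    return corrected, uncorrected


# The example line from this post, used as sample input.
sample = [
    "2017-02-15 20:34:46 - [MEM ERROR - ECC] Test: 6, "
    "(Col,Row,Rank,Bank): (N/A,N/A,N/A,N/A), ECC Corrected: yes, "
    "Syndrome: N/A, Channel/Slot: 5/0",
]
corrected, uncorrected = tally_ecc_errors(sample)
print("corrected:", dict(corrected))
print("uncorrected:", dict(uncorrected))
```

To run it against a real log, pass the open file object (for example the handle for memtest86.log) instead of `sample`. The counts are keyed by the reported Channel/Slot pair, so this does not resolve the question of which physical DIMM that pair refers to.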