History
Back in 2002, support for ECC RAM (Error-Correcting Code RAM) was added to MemTest86 V3. A couple of years later, V3.1 added support for additional memory controllers. The options menu of MemTest86 V3 had an 'ECC mode' that could be activated.
After 2004 the code that reads ECC errors was no longer maintained. As new memory controllers arrived and memory controllers moved into the CPUs, the code worked on fewer and fewer machines. In 2011, for V4.0, Chris Brady (the original MemTest86 author) decided to drop all support for ECC, along with the related code to identify chipsets. Part of his reasoning was that reporting incorrect information is worse than reporting no information.
The lack of ECC support remained in place for all V4.x releases.
Present
During 2013 we have been working to bring back ECC error reporting in V5, at least for the popular current platforms and possibly for some of the older platforms as well. This has turned out not to be a trivial exercise. Different code is required for different chipsets, and the mechanisms for reporting errors in UEFI BIOS are poorly documented, with some of the documents not even available to the public. (Note that V5 of MemTest86 will only support UEFI-based hardware.)
Testing of ECC RAM error reporting
Even once code to detect ECC errors is written, testing it is very difficult. The testing problems revolve around:
- Getting enough ECC-capable hardware, in enough variety, to test with. ECC hardware is typically far more expensive than normal consumer hardware, so it is prohibitively expensive to purchase all the CPUs, RAM and motherboards required.
- Generating errors on demand in a repeatable manner. We tried using heat guns and strong electromagnetic interference to force ECC errors, but the process was too random. For example, heating the RAM stick to 120 °C would often result in the entire machine crashing before an error could be reported.
- Generating a RAM error where exactly one bit in a byte is wrong. One-bit errors are important as this is the type of error that ECC RAM should detect and correct.
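To illustrate why the one-bit case matters, here is a minimal sketch of a single-error-correcting, double-error-detecting (SECDED) code in Python. It is a toy extended Hamming(8,4) code protecting 4 data bits. It is not the code used by any particular memory controller or by MemTest86, and the function names `encode` and `decode` are made up for this example. Real ECC modules protect much wider words, but the principle of correcting one flipped bit and merely detecting two is the same.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit, indices 1, 2 and 4 hold the
    Hamming parity bits, and indices 3, 5, 6 and 7 hold the data bits.
    """
    code = [0] * 8
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the total parity even
    return code


def decode(code: List[int]) -> Tuple[Optional[int], str]:
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    # XOR of the positions of all set bits is 0 for a valid codeword,
    # and equals the position of the flipped bit after a single error.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_odd = sum(code) % 2 == 1

    fixed = list(code)
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd number of flips: treat as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return None, "uncorrectable"

    data = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return data, status


word = encode(0b1011)
print(decode(word))        # (11, 'ok')

one_bit = list(word)
one_bit[5] ^= 1            # simulate a single flipped bit
print(decode(one_bit))     # (11, 'corrected')

two_bit = list(one_bit)
two_bit[2] ^= 1            # a second flipped bit in the same word
print(decode(two_bit))     # (None, 'uncorrectable')
```

The single flip is silently repaired, while the double flip can only be flagged. This is why a test that reliably produces exactly one wrong bit is needed to exercise the "corrected error" reporting path.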
Custom ECC test hardware
We were lucky enough to be contacted by "Team Group Inc", a company that distributes ECC RAM. They offered to supply us with some customised ECC RAM that had a button affixed to the PCB that could generate 1-bit ECC errors on demand. This is what it looked like:
It isn't a perfect solution: being a DDR3 module, it won't help with supporting DDR2 RAM, and it won't help to check the behaviour of multi-bit errors. Nonetheless, it is a big step up over having nothing at all.
ECC Test results
Using a combination of the customised RAM stick and the ability of some memory controllers to simulate ECC faults, we have been able to get MemTest86 V5 to report ECC errors. Detected errors typically look like this:
Conclusion
ECC error reporting should be available again, for a significant number of current chipsets, in V5 of MemTest86.