I am looking for some help on interpreting Memtest86 V7.5 (free) results, particularly Test #10 (bitfade).
Motherboard is an Asrock J3455B-ITX with 2x8GB Kingston KVR16LS11/8 DDR3L, which is on the QVL. UEFI is 1.30, which is the latest. All clocking is at stock speeds, and the 300W Seasonic PSU is very lightly loaded. I am hoping to use the mobo 24/7 in a home server to replace a number of separate ARM-based NAS and other server machines, which have proven very stable, so I have been performing extensive memory testing to check for suitability.
To date I have been running memtest86 V7.5 for about 120 hours in total and have seen errors reported on three occasions. I have made small changes in this time, but experimentation is slow when it takes between 8 and 50 hours for an error to show up. The changes I have tried include swapping the two SODIMMs around, increasing the DDR3L voltage to 1.5V, and disallowing CPU sleep states (C-states).
Results so far:
1. After several successful test passes, memtest86 reported "Expected: 00000000 Actual: 20000000" at 0x4299DB9D0 on Test #10 (bit fade), and did so on several subsequent passes. The failing SODIMM was reported as Kingston 9905428-155.A00G. That's one bit in error out of roughly 10^11 (2 x 8 GiB is about 1.4 x 10^11 bits).
2. After the SODIMM positions were interchanged, a similar report appeared about 80 test hours later, but this time it read "Expected: 00000000 Actual: 01000000" at 0x4299DA9D0. The failing module was reported as Kingston 99U5428-101.A00L.
3. My longest test run (49:21:59 hrs) stopped after failing Test #7 spectacularly. The failure mode was that the memory read back was consistently the inverse of the expected walking-0 pattern, e.g. "Expected: FFFFFEFF Actual: 00000100". Memtest86 stopped testing after accumulating 375957 errors and failed to restart testing, complaining of a UEFI firmware error. This result has been seen only once.
4. Zero test failures in overnight testing with addresses above 0x400000000 excluded.
5. Zero failures over 16 passes of Tests #7 and #10 when testing only addresses between 0x420000000 and 0x430000000!
I doubt that I could RMA either the memory or the mobo based on the above results. I intend to use the memory and mobo, but to instruct the linux kernel to exclude the RAM region around the reported failure addresses (a sketch of the kernel configuration I have in mind is at the end of this post). It will be difficult to test the memory integrity from within linux, so I am hoping to get some advice about the wisdom of this approach. Specifically:
Question 1:
a) What is Test #10 really doing? My understanding from the memtest86 documentation is that it writes a pattern to a portion of memory, waits 300 seconds, then reads it back looking for bit flips. Presumably the memory is being refreshed in the background by the CPU's memory controller every 64ms, so memtest is examining the memory after roughly 4700 refresh cycles (300 s / 64 ms) with no intervening writes. (A rough sketch of my mental model follows this question.)
b) Is this primarily testing the memory or the DRAM controller or both?
c) I cannot find mention of this test in the wider literature on memory testing, so what is its importance and validity? i.e. what operational problem might a bit fade failure cause?
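For reference, here is roughly what I have in mind in (a), written out as a small C sketch. This is purely my own illustration, not memtest86's actual code: memtest86 runs bare-metal against physical addresses, whereas this just fills an ordinary buffer; the 300-second wait and the all-zeros/all-ones patterns follow the memtest86 documentation, and everything else (region size, reporting) is a placeholder.

    /* Illustration only: fill a region with a pattern, wait with no writes so
     * that (in the bare-metal case) the only DRAM activity is the controller's
     * periodic refresh, then re-read and count any bits that flipped.
     * memtest86's bit fade test repeats this with all zeros and all ones. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define WAIT_SECONDS 300                     /* documented bit fade delay */

    static size_t check_pattern(const uint32_t *buf, size_t words, uint32_t pattern)
    {
        size_t flipped = 0;
        for (size_t i = 0; i < words; i++) {
            if (buf[i] != pattern) {
                printf("Expected: %08X Actual: %08X at %p\n",
                       pattern, buf[i], (const void *)&buf[i]);
                flipped += (size_t)__builtin_popcount(buf[i] ^ pattern);
            }
        }
        return flipped;
    }

    int main(void)
    {
        size_t bytes = 256u * 1024 * 1024;       /* placeholder region size */
        uint32_t *buf = malloc(bytes);
        if (!buf) return 1;

        const uint32_t patterns[2] = { 0x00000000u, 0xFFFFFFFFu };
        for (int p = 0; p < 2; p++) {
            memset(buf, (int)(patterns[p] & 0xFF), bytes);  /* write pattern */
            sleep(WAIT_SECONDS);                 /* 300 s at a 64 ms refresh  */
                                                 /* interval = ~4700 refreshes */
            size_t errs = check_pattern(buf, bytes / sizeof(uint32_t), patterns[p]);
            printf("pattern %08X: %zu flipped bits\n", patterns[p], errs);
        }

        free(buf);
        return 0;
    }

The real test obviously differs in how memory is addressed and decoded to a DIMM, but this is the behaviour I am assuming when asking (b) and (c).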
Question 2: The Test #10 failures are reported as being on different SODIMMs, but this followed the swapping of the modules. The reported addresses of the two failures are 0x4299DB9D0 and 0x4299DA9D0, which are exactly 1000(hex) or 4096(dec) apart, and on the same physical SODIMM slot (since the problem followed the slot and not the memory module). (A small arithmetic check of these numbers follows this question.)
a) What is going on?
b) What UEFI services is memtest86 using? Does memtest86 use UEFI boot services during runtime or does it issue an ExitBootServices call as an OS would?
c) If my motherboard has a UEFI bug that is affecting memtest86, is it likely to be affecting real-world operation under linux? i.e. will the linux kernel be reliant on all the same services as memtest86? Note: in rather less rigorous testing with memtest86 V5.x under BIOS boot, no problems have been seen with this mobo + memory combination.
d) What can I deduce from the one time Test #7 failure and the memtest86 report?
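For completeness, a quick arithmetic check of the numbers quoted above (nothing here beyond the reported values):

    /* Address delta and failing bit positions from the two Test #10 reports. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t addr1 = 0x4299DB9D0ULL;    /* result 1: Actual 20000000 */
        uint64_t addr2 = 0x4299DA9D0ULL;    /* result 2: Actual 01000000 */

        printf("delta = 0x%llX (%llu bytes)\n",
               (unsigned long long)(addr1 - addr2),
               (unsigned long long)(addr1 - addr2));      /* 0x1000 = 4096 */
        printf("failing bits: %d and %d\n",
               __builtin_ctz(0x20000000u),                /* 29 */
               __builtin_ctz(0x01000000u));               /* 24 */
        return 0;
    }

So the two failures sit exactly one 4 KiB page apart, with different failing bit positions within the 32-bit word (bit 29 vs bit 24).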
Result #5 is frustrating because the problem disappeared when I put the suspect region under heavier testing. Assuming the RAM or mobo hasn't suddenly repaired itself, is there any plausible explanation for why the outcome might be different under focussed testing?
Any insight into any of the above will be much appreciated.
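Finally, to make the plan in the paragraph before Question 1 concrete, this is the sort of kernel/GRUB configuration I have in mind. The addresses come from the two reported failures; the 64 KiB region size, the escaping, and the GRUB_BADRAM alternative are my own untested assumptions for this board, so please treat it as a sketch rather than a recipe.

    # /etc/default/grub excerpt: reserve 64 KiB covering both reported
    # addresses (0x4299DA9D0 and 0x4299DB9D0) so the kernel never uses it.
    # '...' stands for whatever options are already there. The '$' in
    # memmap=size$start needs escaping so it survives both the shell and
    # GRUB; verify the final result in /proc/cmdline after rebooting.
    GRUB_CMDLINE_LINUX_DEFAULT="... memmap=64K\\\$0x4299DA000"

    # Alternative: GRUB's BadRAM filter takes address,mask pairs (the format
    # memtest86 can emit); this masks a 64 KiB region covering the same
    # addresses.
    #GRUB_BADRAM="0x4299D0000,0xFFFFFFFFFFFF0000"

If there is a better-established way to do this, or a reason it won't help given the Test #7 result, that is exactly the sort of advice I am after.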