Unexpected Test #10 failures under MemTest86 V7.5 (free)

  • Unexpected Test #10 failures under MemTest86 V7.5 (free)

    I am looking for some help on interpreting Memtest86 V7.5 (free) results, particularly Test #10 (bit fade).

    Motherboard is an ASRock J3455B-ITX with 2x 8GB Kingston KVR16LS11/8 DDR3L SODIMMs, which are on the QVL. UEFI is 1.30, which is the latest. All clocking is at stock speeds and the 300W Seasonic PSU is very lightly loaded. I am hoping to use the mobo 24/7 in a home server, replacing a number of separate ARM-based NAS and other server machines that have proven very stable, so I have been performing extensive memory testing to check its suitability.

    To date I have been running memtest86 V7.5 for about 120 hours in total and have seen errors reported on three occasions. I have made small changes in this time, but experimentation is slow when it takes between 8 and 50 hours for an error to show up. The changes I have tried include swapping the two SODIMMs around, increasing the DDR3L voltage to 1.5V and disallowing CPU sleep states (C-states).

    Results so far:
    1. After several clean test passes, memtest86 reported "Expected: 00000000 Actual: 20000000" at 0x4299DB9D0 on Test #10 (bit fade), and the same error appeared on several subsequent passes. The failing SODIMM module was reported as Kingston 9905428-155.A00G. That's one bit in error in approx 10^11.
    2. After the SODIMM positions were interchanged, a similar report appeared about 80 test hours later, but this time "Expected: 00000000 Actual: 01000000" at 0x4299DA9D0. The failing module was reported as Kingston 99U5428-101.A00L.
    3. My longest test run (49:21:59 hrs) stopped after failing Test #7 spectacularly. The failure mode was that the memory output was consistently the inverse of the expected walking-0 pattern, e.g. "Expected: FFFFFEFF Actual: 00000100". Memtest86 stopped testing after accumulating 375957 errors and failed to restart testing, complaining of a UEFI firmware error. This result has been seen only once.
    4. Zero test failures in overnight testing with addresses above 0x400000000 excluded.
    5. Zero failures over 16 passes of Tests #7 and #10 testing only addresses between 0x420000000 and 0x430000000!

    I doubt that I could RMA either the memory or the mobo based on the above results. I intend to use the memory and mobo, but to instruct the Linux kernel to exclude the RAM area around the reported failure addresses. It will be difficult to test the memory integrity from within Linux, so I am hoping to get some advice about the wisdom of this approach. Specifically:

    Question 1:
    a) What is Test #10 really doing? My understanding from the memtest86 documentation is that it clears a portion of memory, waits 300 seconds and looks for bit flips. Presumably the memory is being refreshed in the background by the CPU's memory controller every 64ms, so memtest is examining the memory after some 5000 refreshes without any writes. (I have put a rough sketch of my understanding in code just after this question.)
    b) Is this primarily testing the memory or the DRAM controller or both?
    c) I cannot find mention of this test in the wider literature on memory testing, so what is its importance and validity, i.e. what operational problem might it cause?
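
    For clarity, here is roughly what I understand Test #10 to be doing, written out as a standalone C sketch. This is not MemTest86's actual code: only the 300-second wait comes from the documentation, and the chunk size, the single all-zeros pattern and the reporting are my own assumptions.

        /* Rough sketch of a "bit fade" style test -- my understanding only,
           not MemTest86's implementation. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>

        #define FADE_SECONDS 300            /* documented wait between write and check */

        int main(void)
        {
            size_t bytes = 256UL * 1024 * 1024;     /* assumed chunk size: 256 MB */
            unsigned char *buf = malloc(bytes);
            if (!buf)
                return 1;

            memset(buf, 0x00, bytes);       /* fill with the all-zeros pattern */
            sleep(FADE_SECONDS);            /* no writes; the DRAM only gets refreshed */

            volatile unsigned char *p = buf;        /* volatile so the re-read isn't optimised away */
            for (size_t i = 0; i < bytes; i++)      /* any set bit has "faded" */
                if (p[i] != 0x00)
                    printf("fade at offset 0x%zx: expected 00 actual %02x\n", i, p[i]);

            /* a second round with the all-ones pattern (0xFF) would catch 1 -> 0 fades */
            free(buf);
            return 0;
        }

    Obviously a user-space program like this only sees virtual addresses and the OS keeps touching memory, so it is illustrative only; memtest86 owns the whole machine and can test physical addresses directly.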

    Question 2: The Test #10 failures are reported as being on different SODIMMs, but this followed the swapping of the modules. The reported addresses of the two failures are 0x4299DB9D0 and 0x4299DA9D0, which are exactly 1000 (hex) or 4096 (dec) apart and in the same physical SODIMM slot (since the problem followed the slot and not the memory module). The small sanity check after this question just makes those numbers concrete.
    a) What is going on?
    b) What UEFI services is memtest86 using? Does memtest86 use UEFI boot services during runtime or does it issue an ExitBootServices call as an OS would?
    c) If my motherboard has a UEFI bug that is affecting memtest86, is it likely to be affecting real-world operation under Linux? i.e. will the Linux kernel be reliant on all the same services as memtest86? Note: in rather less rigorous testing with memtest86 V5.x under BIOS boot, no problems have been seen with this mobo + memory combination.
    d) What can I deduce from the one time Test #7 failure and the memtest86 report?
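
    (To make the numbers in Question 2 concrete, here is the trivial arithmetic, with the addresses and "Actual" masks copied from the reports above; bit positions via the gcc/clang builtin.)

        #include <stdio.h>

        int main(void)
        {
            unsigned long long a1 = 0x4299DB9D0ULL;     /* first Test #10 failure  */
            unsigned long long a2 = 0x4299DA9D0ULL;     /* second Test #10 failure */
            unsigned int m1 = 0x20000000;               /* run 1 "Actual" -> bit 29 */
            unsigned int m2 = 0x01000000;               /* run 2 "Actual" -> bit 24 */

            printf("address delta: 0x%llx (%llu bytes)\n", a1 - a2, a1 - a2);
            printf("flipped bit positions: %d and %d\n",
                   __builtin_ctz(m1), __builtin_ctz(m2));
            return 0;
        }

        /* prints:  address delta: 0x1000 (4096 bytes)
                    flipped bit positions: 29 and 24 */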

    Result #5 is frustrating because the problem disappeared when I put the suspect region under heavier testing. Assuming the RAM or mobo hasn't suddenly repaired itself, is there any plausible explanation for why the outcome might be different under focused testing?

    Any insight into any of the above will be much appreciated.


  • #2
    Q1.
    a) Yes, that is pretty much it.
    b) A bit of both. Mainly the RAM.
    c) Seems common sense. If the RAM can't store a value over a short period of a few minutes, then all sorts of chaos can ensue.

    Q2.
    a) b) c) I don't have time to write up an in-depth answer that covers all the permutations and possibilities, so here is the short, incomplete answer. You can't look at a memory address and know which slot it is in. The mapping from addresses to slots is super complex. If you are doing very long periods of testing then soft errors are to be expected unless you are using ECC RAM.


    • #3
      Thanks for the quick response. I have spent several days trying to work out what could be going on and wanted to make sure I wasn't completely off base. I agree Test #10 looks like common sense; I was just a bit surprised that, amongst all the standard walking-bit and hammer tests in the literature, nowhere other than here could I find a discussion of a simple persistence test.

      I appreciate some of the complexity of memory mapping. I spent four days trying to get to grips with some of it before posting my questions and did realise that identifying the memory-address-to-slot mapping wasn't trivial. Point taken about soft errors and ECC memory, but ECC is not practical in this instance. I realise that if I test long enough some event (cosmic ray, alpha particle or other) will be observed. I am trying to virtualize and consolidate a number of small server apps with an energy-efficient solution in a home environment. These are currently running on a variety of small ARM-based platforms including a router, a QNAP NAS and a Raspberry Pi. The SDRAM in these is untested from my perspective, but they run 24/7 without appearing to cause major problems.

      Memtest86 has given me the confidence to move ahead. I will exclude the suspect region of memory (along the lines sketched below) and audit regularly with memtest86 to look for potential trouble.
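
      For the record, the exclusion I have in mind is the kernel's memmap=nn$ss boot parameter, which marks the region from ss to ss+nn as reserved. Something along these lines, where the 1 MB size and the alignment are just my choice of a comfortable margin around the two reported addresses (and note that under GRUB2 the '$' needs escaping, otherwise it gets eaten as a variable):

          # /etc/default/grub -- reserve 0x429900000 .. 0x4299FFFFF, which covers
          # both 0x4299DA9D0 and 0x4299DB9D0 (keep any existing options as well)
          GRUB_CMDLINE_LINUX_DEFAULT="memmap=1M\$0x429900000"

          # regenerate the config (update-grub on Debian/Ubuntu, grub2-mkconfig elsewhere),
          # reboot, then check that the hole shows up as reserved:
          grep -i reserved /proc/iomem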

      Again thanks for the timely support.
