Memtest86 passed even though I have known-bad RAM

  • Memtest86 passed even though I have known-bad RAM

    I'm using MemTest86 V9.4 Pro Build: 1000 (64-bit) on a Lenovo nx360 server, booted via EFI/grub.

    The server has 16 DIMMs, two of which are flagged by the BIOS as bad (stuck bit). I do have reason to believe that two of the DIMMs really are bad, although I am not certain that the BIOS identified the correct ones.

    Running Memtest86 (3 passes) results in PASSED for all 16 DIMMs. How is that possible? This is reproducible.


    Where do I go from here? Can I trust Memtest86?


    The system event log on the server reports:
    91 05/16/2024 03:11:33 Memory Device 11 (Memory - DIMM 11): Assertion: Memory Scrub Failed (stuck bit) 16 (address). IPMB device LUN 0. Channel 0. 1
    92 05/16/2024 03:11:35 Memory Device 12 (Memory - DIMM 12): Assertion: Memory Scrub Failed (stuck bit) 16 (address). IPMB device LUN 0. Channel 0. 1






    Memtest reports (truncated for brevity)

    Note that the reporting for SPD #12 looks a bit funky; the product ID is truncated.

    SPD #11 32GB DDR4 ECC PC4-17000
    Samsung / M386A4G40DM0-CPB / 31E792CA / Channel: 1 Slot: 1
    15-15-15-36 / 2134 MHz / 1.2V
    SPD #12 32GB DDR4 ECC PC4-17000
    Samsung / Ml / 31E788CA / Channel: 0 Slot: 0
    15-15-15-36 / 2134 MHz / 1.2V




    Result summary
    Test Start Time 2024-05-16 03:38:24
    Elapsed Time 91:05:44
    Memory Range Tested 0x0 - 7080000000 (460800MB)
    CPU Selection Mode Parallel (All CPUs)
    CPU Temperature Min/Max/Ave 38C/66C/50C
    RAM Temperature Min/Max/Ave -/-/-
    ECC Polling Enabled
    # Tests Passed 42/42 (100%)
    Test # Tests Passed Errors
    Test 0 [Address test, walking ones, 1 CPU] 3/3 (100%) 0
    Test 1 [Address test, own address, 1 CPU] 3/3 (100%) 0
    Test 2 [Address test, own address] 3/3 (100%) 0
    Test 3 [Moving inversions, ones & zeroes] 3/3 (100%) 0
    Test 4 [Moving inversions, 8-bit pattern] 3/3 (100%) 0
    Test 5 [Moving inversions, random pattern] 3/3 (100%) 0
    Test 6 [Block move, 64-byte blocks] 3/3 (100%) 0
    Test 7 [Moving inversions, 32-bit pattern] 3/3 (100%) 0
    Test 8 [Random number sequence] 3/3 (100%) 0
    Test 9 [Modulo 20, ones & zeros] 3/3 (100%) 0
    Test 10 [Bit fade test, 2 patterns, 1 CPU] 3/3 (100%) 0
    Test 11 [Random number sequence, 64-bit] 3/3 (100%) 0
    Test 12 [Random number sequence, 128-bit] 3/3 (100%) 0
    Test 13 [Hammer test] 3/3 (100%) 0

  • #2
    There was a known issue similar to this reported by Lenovo:
    https://support.lenovo.com/il/en/solutions/HT112272
    They claim it only happens with Micron RAM with specific firmware, however, so it might not be related.

    Also, if this is ECC RAM, a single-bit error should have been corrected. So it could be normal that no errors are reported if they get corrected. You should see ECC warnings in that case, however.

    Some memory errors are also intermittent and don't show up on every run.
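To illustrate the ECC point, here is a toy sketch of how a single stuck bit can be invisible to a pattern test while still being countable as a corrected error. It uses a Hamming(7,4) code purely for illustration; that is an assumption made for brevity, since ECC DIMMs typically protect 64-bit words with 8 check bits, and none of the function names below come from MemTest86 or the server firmware.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 is unused so list indices match Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code


def hamming74_decode(code):
    """Return (nibble, syndrome); a nonzero syndrome is the position that was corrected."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1  # flip the single bad bit back
    d = [code[3], code[5], code[6], code[7]]
    nibble = sum(bit << i for i, bit in enumerate(d))
    return nibble, syndrome


def read_with_stuck_bit(code, position, stuck_value):
    """Simulate a memory cell whose bit at `position` always reads `stuck_value`."""
    faulty = list(code)
    faulty[position] = stuck_value
    return faulty


def run_demo(position=6, stuck_value=1):
    """Write every 4-bit pattern through the faulty cell; count mismatches and corrections."""
    mismatches = 0
    corrections = 0
    for value in range(16):
        stored = read_with_stuck_bit(hamming74_encode(value), position, stuck_value)
        decoded, syndrome = hamming74_decode(stored)
        mismatches += decoded != value
        corrections += syndrome != 0
    return mismatches, corrections


if __name__ == "__main__":
    mismatches, corrections = run_demo()
    print(f"data mismatches seen by the test: {mismatches}")
    print(f"errors silently corrected by ECC: {corrections}")
```

In this sketch the data always reads back correctly, so a pattern test alone would pass; only the correction counter reveals the fault. That is why the ECC warnings matter more than the pass/fail result here.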



    • #3
      Thank you for that quick response. I'll definitely check out the firmware update. I strongly suspect that the memory itself is bad, though. It was originally installed in another server, and that server, too, reported that two DIMMs were bad. We just were not sure which ones.

      The behavior is actually quite consistent. On bootup, the firmware always identifies and disables two bad DIMMs, and I ran memtest86 twice, with the same result.

      I suppose it's time to pore over the detailed memtest86 log file.

      Good point about the ECC RAM! I suppose with that in mind, I'll leave well enough alone and accept what the firmware reports.

      Thanks again!



      • #4
        If the BIOS disables the sticks, then maybe MemTest86 isn't even testing them. I guess it depends on the mechanism used to disable them.
        Modern versions of MemTest86 use the UEFI memory map, however, so if the sticks aren't mapped by the BIOS into the address space, they won't be tested.
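One rough way to check this against the report in the first post is to compare the "Memory Range Tested" upper bound with the installed capacity. The sketch below assumes all 16 DIMMs are 32 GB like SPD #11 and #12 (the post only shows those two), and it treats the top address as an upper limit on what was tested, since the range may contain holes. The function name is hypothetical.

```python
GIB = 1 << 30
MIB = 1 << 20


def range_summary(top_address, dimm_count, dimm_size_gib):
    """Compare the top of the tested address range with the installed capacity."""
    dimm_bytes = dimm_size_gib * GIB
    return {
        "tested_mib": top_address // MIB,
        "tested_gib": top_address // GIB,
        "installed_gib": dimm_count * dimm_size_gib,
        # How many whole DIMMs could fit below the top of the tested range.
        "max_dimms_covered": top_address // dimm_bytes,
    }


if __name__ == "__main__":
    summary = range_summary(
        top_address=0x7080000000,  # from "Memory Range Tested 0x0 - 7080000000"
        dimm_count=16,             # from the first post
        dimm_size_gib=32,          # ASSUMPTION: every DIMM matches SPD #11 and #12
    )
    for key, value in summary.items():
        print(f"{key}: {value}")
```

Under that assumption, the tested range tops out well below the installed capacity and leaves room for at most 14 of the 16 DIMMs, which would be consistent with the two disabled sticks never being tested. The detailed log or the UEFI memory map would be needed to confirm it.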
