Bit Fade Test blues

  • Bit Fade Test blues

    Hi, we have a couple of Intel servers that, even after lots of component swaps, generally fail MemTest86 v6.2.0 (UEFI boot), but only on the Bit Fade Test. All other tests pass fine and we have never seen any other errors on them.

    We originally tested all servers with MemTest86+, and these same 2 servers would intermittently produce a strange error just as MemTest86+ started, which could cause the serverboard management to ramp up the fans and treat it as a serious error. So we changed to MemTest86, as it has more recent updates, and then observed the Bit Fade Test errors. We believe these 2 error types are related and have the same cause.

    The servers appear to work just fine otherwise, e.g. Windows servers running Prime95 Torture stress testing for hours show no problems. We are tempted to return them to production, but we have depended on MemTest86/+ for so many years that it would not feel right!

    We read that back in 2013 the Bit Fade Test could produce erroneous results due to a software bug, but surely that is not possible now?

    As mentioned, these servers have now had many (!) components swapped around and around as we try to figure out the issue: new RAM, even new serverboards, different FW versions and multiple FW flashes. After some changes there can be success, but after more fault-finding changes, the Bit Fade Test errors just come back.

    I read in a previous post...
    This could be due to any number of factors, such as external interference, defects in DIMM or issues with the firmware
    and I suspected a rogue PSU could be doing this, but swapping PSUs (both dual) between servers was not really conclusive. What fault could cause only Bit Fade Test errors to appear and then persist?

    The only items not replaced with brand-new parts are (a) the 2 chassis and (b) the CPUs (3 of them), but no test or combination shows them to be either definitely faulty or definitely good.

    I would be grateful for any insights anyone has. After spending a pile of frustrating hours on this problem, my brain is empty of ideas! We are very familiar with building Intel servers, but have never seen anything like this.
    Thank you.

  • #2
    Another possible cause is faulty BIOS firmware. So I would check to see if there is an updated BIOS available.

    We also have an updated version of MemTest86 available (v6.3.0) so I would give that a try as well. If you are still getting errors, please upload or e-mail us a copy of the MemTest86.log file under EFI\BOOT\ of the USB drive.


    • #3
      If these are servers, then are you using ECC RAM?
      As any ECC errors would be a definitive sign of things going wrong.


      • #4
        Thanks for the suggestions.
        We have tried re-flashing all firmware (current/latest version), using a new download in case what we had was corrupted, but saw no change. I don't think we have tried older FW versions; that may be worth a try.

        We will try the new MemTest86 version and upload the results file.
        FYI, we have run our next "go-to" memory test, Prime95's "Torture" test (configured to use all free memory), for several days with no issues.
        Thanks.


        • #5
          Originally posted by David (PassMark)
          If these are servers, then are you using ECC RAM?
          As any ECC errors would be a definitive sign of things going wrong.
          Yes, it's all ECC RAM, and there have been no actual ECC errors with these tests as yet. We have seen such errors in the past, though (e.g. with DIMM slots being disabled when the errors were uncorrectable).
          Thanks.


          • #6
            Quick update: it turned out there was a very recent server firmware release, but it didn't resolve the issue, nor did using MemTest86 v6.3.0. Log file emailed.
            Thanks
