Hi, we have a couple of Intel servers that, after lots of component swaps, are generally failing MemTest86 v6.2.0 (UEFI boot), but only on the Bit Fade Test. All other tests pass fine, and we have never seen any other errors on them.
We originally tested all servers with MemTest86+, and these same 2 servers would intermittently produce a strange error just as MemTest86+ started, which could cause the serverboard management to ramp up the fans and believe there was a serious error. So we changed to MemTest86, as it has more recent updates, and then observed the Bit Fade Test errors. We believe these 2 error types are related and have the same cause.
The servers appear to work just fine otherwise, e.g. the Windows servers run Prime95 Torture stress testing for hours without problems. So we are tempted to return them to production, but we have depended on MemTest86/+ for so many years that it would not feel right!
We read that, back in 2013, the Bit Fade Test could produce erroneous results due to a software bug, but surely that is not possible now?
As mentioned, these servers have now had many (!) components swapped around and around as we try to figure out the issue. New RAM, even new serverboards, different versions & multiple FW flashes. After some changes there can be success, but after more fault-finding changes, the Bit Fade Test errors just come back.
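I read in a previous post:
"This could be due to any number of factors, such as external interference, defects in DIMM or issues with the firmware"
and I suspected a rogue PSU could be doing this, but swapping PSUs (both dual) between servers wasn't really conclusive. What fault could cause only Bit Fade Test errors to appear and then persist?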
The only items not replaced with brand-new parts are (a) the 2 chassis and (b) the CPUs (3 of them), but no test or combination shows them to be definitely faulty or definitely good.
I would be grateful for any insights anyone has. After spending a pile of frustrating hours on this problem, my brain is empty of ideas! We are very familiar with building Intel servers, but have never seen anything like this.
Thank you.