Sure! I just downloaded 10.2 and tried it out. This time, instead of restarting the system at "Getting SPD details..." it just gets stuck there. No other output appears after that line.
ECC Errors - Which RDIMM?
UPDATE - After updating the BIOS on the new motherboard, I noticed that I could get into MemTest 10.2 but only if I let the system boot to the USB flash drive directly with no user interaction. When I tried to go through the boot menu and explicitly select the same USB flash drive to boot from, MemTest would still get stuck after "Getting SPD details...".
However, now there's a new problem -- MemTest says ECC is *not* enabled.
FWIW, this board/CPU/memory setup is:
Supermicro M12SWA-TF
Threadripper PRO 5965WX
Samsung M393A4K40EB3-CWE 32GB ECC RAM
There's only one BIOS setting that looks remotely related to ECC. It's called "Memory Corrected Error Enabling" and I've tried it set to "Enabled" and "Disabled" with no change in the ECC status as reported by MemTest.
In case it helps, I'm attaching the logs plus the RAM and sys info files.
Comment
-
UPDATE -- I got a tip from Supermicro to try checking using Ubuntu Live and dmidecode. Per dmidecode, both of my systems support "Multi-bit ECC".
So, it seems there may be a bug in MemTest86's ECC detection on this motherboard / chipset?
Here's the dmidecode output from the M12SWA-TF system.
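For anyone repeating this check, here is a minimal Python sketch that pulls the "Error Correction Type" field out of saved dmidecode text (for example, from `sudo dmidecode --type memory > mem.txt` in Ubuntu Live). The function names and the sample text are made up for illustration; the sample is not the output attached above. Keep in mind that dmidecode only reports what the firmware's SMBIOS table claims, so this does not by itself confirm that ECC is active at runtime.

```python
import re

# Illustrative sample only (hypothetical values, not the attached output).
SAMPLE = """\
Physical Memory Array
\tLocation: System Board Or Motherboard
\tUse: System Memory
\tError Correction Type: Multi-bit ECC
\tNumber Of Devices: 8
"""

def ecc_types(dmidecode_text):
    """Return every 'Error Correction Type' value found in the text."""
    pattern = re.compile(r"^\s*Error Correction Type:\s*(.+?)\s*$", re.MULTILINE)
    return pattern.findall(dmidecode_text)

def reports_ecc(dmidecode_text):
    """True if at least one memory array reports an actual ECC mode."""
    non_ecc = {"none", "unknown", "other"}
    return any(t.lower() not in non_ecc for t in ecc_types(dmidecode_text))

if __name__ == "__main__":
    print(ecc_types(SAMPLE))
    print(reports_ecc(SAMPLE))
```

To run it against a real capture, read the saved file and pass its contents to `reports_ecc`.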
Comment
-
Originally posted by lunadesign View Post
UPDATE - After updating the BIOS on the new motherboard, I noticed that I could get into MemTest 10.2 but only if I let the system boot to the USB flash drive directly with no user interaction. When I tried to go through the boot menu and explicitly select the same USB flash drive to boot from, MemTest would still get stuck after "Getting SPD details...".
Originally posted by lunadesign View Post
However, now there's a new problem -- MemTest says ECC is *not* enabled.
We're looking into adding support for this particular chipset. If you send an e-mail to us, we can provide you a build to test.
Comment
-
Originally posted by keith View Post
Thanks for the update. According to the logs, it appears to freeze while attempting to obtain multiprocessor info from the UEFI firmware. This is likely a BIOS bug though it is strange that the behaviour is different depending on whether MemTest86 is selected to boot from menu or not.
We're looking into adding support for this particular chipset. If you send an e-mail to us, we can provide you a build to test.
I've just sent an e-mail. Please let me know what I can do to help get this chipset supported.
Comment
-
I obtained the test build from Keith and gave it a try on the TR Pro/WRX80 motherboard (M12SWA-TF). It took 2 or 3 reboots to get it past the suspected BIOS bug (even one where I didn't use the boot menu). But once I got in, I noticed that ECC Polling is now enabled. Yay!
I ran the standard memory tests with a single ECC 32GB DIMM for a full pass and saw no errors (ECC or otherwise).
I fully populated the same motherboard with 8 32GB DIMMs and ran the standard memory tests overnight. It's currently in the middle of pass 2. So far, no errors (ECC or otherwise).
Note: The DIMMs I've been testing on the TR Pro/WRX80 motherboard so far are not the same ones that I was having problems with on the EPYC motherboard at the beginning of this thread.
Question for PassMark: Does the test build have the same ability to detect ECC errors on the TR Pro/WRX80 motherboard as 10.0/10.2 did on the EPYC motherboard (the one at the beginning of this thread)?
If yes, I can declare the TR Pro/WRX80 motherboard as "known good" and use it to test the memory that was having issues on the EPYC motherboard. This will finally allow me to determine if it was the memory or the EPYC motherboard that was having issues.
Comment
-
Originally posted by David (PassMark) View Post
Unless we have messed up something, it should be at least as functional as the older patch release.
Comment
-
TESTING UPDATE
I'm going to include a full recap so you don't have to scroll back. In a few cases, I'm going to identify specific DIMMs by the last 3 digits of the serial number so you can follow the movement of DIMMs.
1) EPYC motherboard with 8 x 64GB ECC RDIMMs has ECC errors.
a) Initially, errors are coming from DIMM 4E1 in channel 3.
b) I swap the channel 2 and 3 DIMMs. 4E1 is now in channel 2, 482 is now in channel 3.
c) I re-run the tests and errors are now coming from DIMM 4E1 in channel 2.
d) I remove DIMM 4E1 from the system and replace it with a new replacement (5C6). So, 5C6 is now in channel 2.
e) I re-run the tests and errors are now coming from DIMM 5C6 in channel 2.
f) I swap the channel 2 and 4 DIMMs. 5C6 is now in channel 4, 73C is now in channel 2.
g) I re-run the tests and errors are now coming from DIMM 73C in channel 2.
2) TR Pro motherboard with 8 x 32GB ECC RDIMMs fully tests out with no errors.
3) Back on the EPYC motherboard...
a) I re-run the tests on the 64GB DIMMs to make sure the problem is still reproducible. Once again, I get errors from DIMM 73C in channel 2.
b) I remove all DIMMs, use canned air to blow out all the memory sockets, carefully reinstall all DIMMs in the exact same slots they were previously in.
c) I re-run the tests. Once again, I get errors from DIMM 73C in channel 2.
4) I move all of the 64GB DIMMs from the EPYC motherboard to the TR Pro motherboard. I keep the same DIMM-to-slot assignments as I had on the EPYC board (the two boards have the same slot-to-channel associations). I run the tests on the TR Pro and get ECC errors from DIMM 73C in channel 2.
5) I move all of the 32GB DIMMs that were previously in the TR Pro motherboard to the EPYC motherboard. I keep the same DIMM-to-slot assignments. I run the tests on the EPYC and get no ECC errors.
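The runs above can be tallied to see whether the errors track a DIMM, a channel, or a board. This is only a bookkeeping sketch in Python: the tuples are transcribed from the steps in this recap, the names `ERROR_RUNS` and `tally` are made up for illustration, and a count like this shows a pattern without proving a cause.

```python
from collections import Counter

# (board, DIMM serial suffix, channel) for each run in the recap
# that reported ECC errors. Runs with no errors are not listed.
ERROR_RUNS = [
    ("EPYC", "4E1", 3),    # step 1a
    ("EPYC", "4E1", 2),    # step 1c
    ("EPYC", "5C6", 2),    # step 1e
    ("EPYC", "73C", 2),    # step 1g
    ("EPYC", "73C", 2),    # step 3a
    ("EPYC", "73C", 2),    # step 3c
    ("TR Pro", "73C", 2),  # step 4
]

def tally(runs):
    """Count error runs per DIMM, per channel, and per board."""
    by_dimm = Counter(dimm for _, dimm, _ in runs)
    by_channel = Counter(channel for _, _, channel in runs)
    by_board = Counter(board for board, _, _ in runs)
    return by_dimm, by_channel, by_board

if __name__ == "__main__":
    by_dimm, by_channel, by_board = tally(ERROR_RUNS)
    print("per DIMM:   ", dict(by_dimm))
    print("per channel:", dict(by_channel))
    print("per board:  ", dict(by_board))
```

Counted this way, six of the seven error runs land in channel 2, spread over three different DIMMs and both boards, while only 4E1 showed errors in more than one channel.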
Since the 32GB DIMMs have worked perfectly on both boards, I think those DIMMs and both boards are fine. I think the problem is with the 64GB DIMMs.
The challenge in identifying the culprit is that I've seen errors reported on 3 different 64GB DIMMs. In case it helps:
- I have 9 64GB DIMMs with the exact same part number
- 2 were manufactured in late 2020
- 7 were manufactured in late 2021
- Of the 3 that have reported errors, 1 was from the 2020 batch, 2 were from the 2021 batch
I'm not sure how likely it is that I've got 3 different bad DIMMs from 2 batches. Is it possible that another DIMM in the set is causing these 3 to report errors? Or are the channels *completely* independent? (Both motherboards have a single slot per memory channel.)
I guess I could try testing each of the problematic 64GB DIMMs on its own to see if it triggers the errors?
Thoughts?
Comment
-
It would be strange to have 3 bad sticks, unless it was a design flaw (i.e. all sticks of that RAM design are marginal in these AMD motherboards).
Might be time to start tweaking BIOS settings. This should totally not be required, as the vendor's experts should be doing this for you.
Assuming the BIOS lets you, bump up the voltages very slightly and drop the speed (e.g. turn off XMP). Basically, follow the same steps as you would if you had overclocked the RAM.
There are some more detailed comments here
https://help.corsair.com/hc/en-us/ar...locking-memory
Comment
-
Originally posted by David (PassMark) View Post
It would be strange to have 3 bad sticks, unless it was a design flaw (i.e. all sticks of that RAM design are marginal in these AMD motherboards).
Might be time to start tweaking BIOS settings. This should totally not be required, as the vendor's experts should be doing this for you.
Assuming the BIOS lets you, bump up the voltages very slightly and drop the speed (e.g. turn off XMP). Basically, follow the same steps as you would if you had overclocked the RAM.
There are some more detailed comments here
https://help.corsair.com/hc/en-us/ar...locking-memory
Any thoughts on my theory that one DIMM might be triggering the errors reported on one or more other DIMMs?
Comment
-
Any thoughts on my theory that one DIMM might be triggering the errors reported on one or more other DIMMs?
But if one stick (or all sticks together) were drawing too much current, this might drop voltage levels below acceptable level. Thus the suggestion to bump voltage up.
Comment