Interpreting "[UEFI Firmware Error] Could not start CPU X"

  • Interpreting "[UEFI Firmware Error] Could not start CPU X"

    Hello,

    I just purchased a new server motherboard, CPU, and memory combo so I am running Memtest prior to deploying an OS on it. I believe there is an issue with the RAM that I purchased because I saw some ECC errors within the first 10 minutes of testing. I pulled these RAM modules out and decided to put in a set of known good RAM from another system just to confirm that everything is all good with the CPU and motherboard. These RAM modules survived 4 passes of Memtest in the system they live in full time.

    During test #8 on the known good RAM, I saw an error that I never saw before. "[UEFI Firmware Error] Could not start CPU X". This error was repeated twice for CPUs 8 and 9. I let the tests run up until test #13 and then stopped them. I saved the logs and copied them over to my desktop. The HTML log is provided below, and the .log file is attached to this post. Looking through the .log file, it seems that there were many related errors that were thrown for other CPU cores during this test that did not appear in the UI.

    After reading through the available documentation and doing some googling, I believe I understand the gist of the issue. There is a bug in the UEFI firmware that is causing issues with this test. With this being a brand new system (to me, motherboard is brand new but CPU is used and the RAM that was in the system when these errors occurred is known good as I said), these results are still a bit worrying. There are admittedly very few results when searching for others who have experienced this problem, so I have a few questions:

    1. Would this UEFI firmware bug cause system instability when running a traditional OS?

    2. Could memtest actually be finding a hardware failure related issue here that is being misinterpreted as a UEFI bug?

    3. I am now running Memtest again on 4 of the 8 new RAM modules as I try to identify the faulty one (or more). This first pass just made it past test #8 without any firmware bug issues appearing in the UI. Does it make sense that this bug is more likely to arise when all 8 RAM slots are populated vs just half of them?

    4. Is there any additional information I could provide or tests that I could run that would help us all understand this issue better? I am happy to help however else I can.

    I will be running more tests with the RAM I purchased as I try to identify the faulty modules, and will follow up with additional results of further runs on this system.
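The per-CPU picture described above (many start failures in the .log file that never reached the UI) can be tallied with a short script. This is only a minimal sketch: it assumes the log contains the message text exactly as quoted in this post, followed by a numeric CPU index, and it ignores whatever timestamp or prefix each line carries. The sample lines are hypothetical and are not copied from the attached log.

```python
import re
from collections import Counter

# Message text as quoted in the post. The rest of the log line layout
# (timestamps, prefixes) is an assumption and is simply ignored here.
CPU_START_FAILURE = re.compile(r"Could not start CPU (\d+)")


def count_cpu_start_failures(log_text: str) -> Counter:
    """Return how many start-failure messages mention each CPU index."""
    return Counter(int(m.group(1)) for m in CPU_START_FAILURE.finditer(log_text))


# Hypothetical log excerpt for illustration only.
sample_log = "\n".join([
    "2023-12-03 16:40:01 - [UEFI Firmware Error] Could not start CPU 8",
    "2023-12-03 16:40:02 - [UEFI Firmware Error] Could not start CPU 9",
    "2023-12-03 16:41:10 - [UEFI Firmware Error] Could not start CPU 8",
    "2023-12-03 16:41:30 - Test 8 finished",
])

for cpu, hits in sorted(count_cpu_start_failures(sample_log).items()):
    print(f"CPU {cpu}: {hits} start failure(s)")
```

To use it on a real file, read the log with `open(path, errors="replace").read()` and pass the text to `count_cpu_start_failures`. If the real log words the message differently, the pattern would need adjusting.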
    Summary

    Report Date 2023-12-03 18:12:00
    Generated by MemTest86 V10.6 Free (64-bit)
    Visit MemTest86.com to Upgrade to Pro
    Result INCOMPLETE PASS
    System Information

    EFI Specifications 2.70
    System
    Manufacturer To Be Filled By O.E.M.
    Product Name ROMED8U-2T
    Version To Be Filled By O.E.M.
    Serial Number To Be Filled By O.E.M.
    BIOS
    Vendor American Megatrends Inc.
    Version P3.40
    Release Date 09/26/2022
    Baseboard
    Manufacturer ASRockRack
    Product Name ROMED8U-2T
    Version
    Serial Number BR8PFB000800047
    CPU Type AMD EPYC 7282 16-Core
    CPU Clock 2800 MHz [Turbo: 3200.3 MHz]
    # Logical Processors 32 (16 enabled for testing)
    L1 Cache 32 x 64K (151591 MB/s)
    L2 Cache 32 x 512K (61580 MB/s)
    L3 Cache 1 x 65536K (13465 MB/s)
    Memory 262034M (14190 MB/s)
    RAM Configuration DDR4 ECC 3200MT/s / x16 Channel / 24-22-22-52 / 1.200V
    Number of RAM SPDs detected 0
    Number of RAM slots 8
    Number of RAM modules 8
    DIMM A1 32GB DDR4 2Rx8 ECC PC4-25600
    Vendor Part Info Micron Technology / 18ASF4G72PDZ-3G2F1 / 3614D2C9
    SMBIOS Profile 3200MT/s 1.2V
    DIMM B1 32GB DDR4 2Rx8 ECC PC4-25600
    Vendor Part Info Micron Technology / 18ASF4G72PDZ-3G2E1 / 27B8D90B
    SMBIOS Profile 3200MT/s 1.2V
    DIMM C1 32GB DDR4 2Rx8 ECC PC4-25600
    Vendor Part Info Micron Technology / 18ASF4G72PDZ-3G2E1 / 27B8D926
    SMBIOS Profile 3200MT/s 1.2V
    DIMM D1 32GB DDR4 2Rx8 ECC PC4-25600
    Vendor Part Info Micron Technology / 18ASF4G72PDZ-3G2F1 / 3614CF94
    SMBIOS Profile 3200MT/s 1.2V
    DIMM E1 32GB DDR4 2Rx8 ECC PC4-25600
    Vendor Part Info Micron Technology / 18ASF4G72PDZ-3G2F1 / 3614CB66
    SMBIOS Profile 3200MT/s 1.2V
    DIMM F1 32GB DDR4 2Rx8 ECC PC4-25600
    Vendor Part Info Micron Technology / 18ASF4G72PDZ-3G2F1 / 3614D326
    SMBIOS Profile 3200MT/s 1.2V
    DIMM G1 32GB DDR4 2Rx8 ECC PC4-25600
    Vendor Part Info Micron Technology / 18ASF4G72PDZ-3G2E1 / 27B8D924
    SMBIOS Profile 3200MT/s 1.2V
    DIMM H1 32GB DDR4 2Rx8 ECC PC4-25600
    Vendor Part Info Micron Technology / 18ASF4G72PDZ-3G2E1 / 27B8D920
    SMBIOS Profile 3200MT/s 1.2V
    Result summary

    Test Start Time 2023-12-03 15:57:58
    Elapsed Time 2:13:49
    Memory Range Tested 0x0 - 7FC04000000 (8372288MB)
    CPU Selection Mode Parallel (All CPUs)
    CPU Temperature Min/Max/Ave 32C/48C/41C
    ECC Polling Enabled
    # Tests Completed 11/48 (22%)
    # Tests Passed 11/11 (100%)
    Test                                        # Tests Passed   Errors
    Test 0 [Address test, walking ones, 1 CPU]  1/1 (100%)       0
    Test 1 [Address test, own address, 1 CPU]   1/1 (100%)       0
    Test 2 [Address test, own address]          1/1 (100%)       0
    Test 3 [Moving inversions, ones & zeroes]   1/1 (100%)       0
    Test 4 [Moving inversions, 8-bit pattern]   1/1 (100%)       0
    Test 5 [Moving inversions, random pattern]  1/1 (100%)       0
    Test 6 [Block move, 64-byte blocks]         1/1 (100%)       0
    Test 7 [Moving inversions, 32-bit pattern]  1/1 (100%)       0
    Test 8 [Random number sequence]             1/1 (100%)       0
    Test 9 [Modulo 20, ones & zeros]            1/1 (100%)       0
    Test 10 [Bit fade test, 2 patterns, 1 CPU]  1/1 (100%)       0
    Test 13 [Hammer test]                       0/0 (0%)         0

  • #2
    1. Would this UEFI firmware bug cause system instability when running a traditional OS?
    Not as far as we are aware. It is only when running multiple threads in UEFI.

    2. Could memtest actually be finding a hardware failure related issue here that is being misinterpreted as a UEFI bug?
    If you are seeing ECC errors, then they are likely real errors (that got corrected by ECC). Your log didn't seem to have any errors in it however.
    Having said that, there is yet another UEFI bug on a couple of motherboards that throws fake ECC errors.

    3. I am now running Memtest again on 4 of the 8 new RAM modules as I try to identify the faulty one (or more). This first pass just made it past test #8 without any firmware bug issues appearing in the UI. Does it make sense that this bug is more likely to arise when all 8 RAM slots are populated vs just half of them?
    Obviously the amount of RAM to be tested is now half of what it was. So this means there would be some timing changes.
    We'll have a closer look at the log.


    • #3
      Thanks David, I appreciate you!

      The Memtest logs that I posted above were from a test while running known good RAM. The possible hardware fault I was worried about was with the CPU or motherboard, or perhaps another issue such as the CPU not being seated properly. AMD Epyc is particularly finicky with requiring the CPU to be seated correctly, AMD even has a torque spec that they require you to tighten the screws down with when installing the CPU. I was thinking perhaps an issue with an improperly seated CPU could lead to issues that Memtest is interpreting as UEFI firmware bugs.

      I left the tests running overnight with just 4 RAM modules and came back to many more of these errors. There were no ECC/RAM errors, only UEFI firmware errors, but a lot of them. I attached the logs from that run to this post. It seems that the errors all occurred between tests 6 and 13.

      Today I pulled the system out, completely disassembled it, reseated the CPU, put it back together again, and started another Memtest run. I set it to run just tests 8 and 9, since they seem to be the ones that most easily recreate the bug. Once again I got the same UEFI firmware bug errors appearing almost immediately. At this point I have reseated the CPU 3 different times and am still seeing the same bugs coming up, so I am less inclined to believe that there may be an issue with how the CPU was seated.

      I am now trying another run with the CPU selection set to round robin rather than parallel.

      • #4
        Thanks for the logs.

        Can you run MemTest86 v8.4 as well and upload a copy of the logs:
        https://www.memtest86.com/downloads/...86-8.4-usb.zip

        We want to confirm if this is indeed a UEFI firmware issue, as we've had reports where the errors don't appear (in the logs) when running MemTest86 v8.4.
        You likely won't see the same UEFI firmware errors on the screen, as that on-screen reporting was added in a later version of MemTest86.


        • #5
          Thanks Keith. I must admit that I am now incredibly confused by the results I am seeing.

          MemTest86-20231204-041148-round robin (known good).log - in this log file I was testing on 8 known good ram modules with the version 10.6. I initially ran in parallel mode and got more UEFI firmware error messages. I aborted that test and started another on round robin. I forgot which selection I made after starting it (brain fart), so I aborted again and then restarted again in round robin. This test completed a full pass and 4 tests of the 2nd pass without any errors before I aborted it.

          MemTest86-20231205-020347-sequential (unknown quality).log - in this log I was testing on 4 of the new sticks of ram whose quality is yet to be determined, again with version 10.6. I ran this test in sequential mode, and it completed a full pass without any errors.

          MemTest86 - version 8.4.log - in this I was testing on the same 4 sticks of new RAM as I did on the sequential test. I immediately noticed two things: the ECC isn't being detected, and when it runs in parallel it uses all 32 threads instead of just the 16 cores that version 10.6 uses. Odd, but I attributed these to being an older version of the software. By the time it hit test 5 (parallel mode), it started reporting hundreds of memory errors. This is what the log shows. I was very confused by this, so I aborted the test, got a copy of the logs off the flash drive, and ran another test on 10.6 in parallel mode. It recognized the ECC again in this version, and it made it to test 7 without any errors being reported. I aborted, booted back into version 8.4, saw that ECC wasn't recognized again and saw more memory errors coming up by test 4.

          It's possible that something is wrong with the new RAM, but why would errors show up so quickly in version 8.4 in parallel mode, while never showing up during a full pass in 10.6 in sequential mode or during a much longer run in parallel mode? Is it possible that the UEFI bug that is caught in the later versions is leading to memory errors in this earlier version?
          Last edited by AbsolutelyFree; Dec-05-2023, 07:11 AM.


          • #6
            I am going to try splitting up the 4 modules I currently have installed into smaller groups and figure out if one or more of these are bad using v8.4 since it seems to show issues quickly. Hopefully there is just a bad module(s) in this group of 4 that is causing these issues.

            I guess I will have to disassemble my other server again and run the 8.4 tests on the 8 known good modules...
            Last edited by AbsolutelyFree; Dec-05-2023, 07:12 AM.


            • #7
              I got more memory errors on v8.4 when just running 2 of the 4 modules that caused errors previously, but both modules completed a pass each without errors when run individually. Very confusing.

              I just disassembled my other, known good server again. I took the 8 known good modules and put them in the new server that has been having issues, and am running a test on 8.4 in parallel.

              I also took the 4 modules that caused errors in the new server and put them in my known good server and am running a test on 10.6 in parallel.
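Since a pair of modules failed while each module passed on its own, it may help to walk through the groupings systematically rather than ad hoc. The sketch below just enumerates single modules and then pairs. The module labels are hypothetical placeholders, and it does not model which slot or channel each module sits in, which may well matter on this board.

```python
from itertools import combinations


def isolation_plan(modules, max_group_size=2):
    """List module groups to test, smallest groups first."""
    plan = []
    for size in range(1, max_group_size + 1):
        plan.extend(combinations(modules, size))
    return plan


# Hypothetical labels for the 4 new modules; in practice serial numbers work well.
new_modules = ["M1", "M2", "M3", "M4"]

for step, group in enumerate(isolation_plan(new_modules), start=1):
    print(f"Run {step}: install {', '.join(group)}")
```

For 4 modules this gives 4 single-module runs followed by 6 pair runs. Recording the result of each run next to its grouping makes it easier to tell a bad module apart from a bad combination.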


              • #8
                the ECC isn't being detected, and when it runs in parallel it uses all 32 threads instead of just the 16 cores that version 10.6 uses
                This is normal. Old V8 software won't detect some features from new CPUs & RAM. We also turned off hyper-threading in later releases as it was making testing slower.
                Not detecting ECC shouldn't affect the actual memory test, but it will affect the reporting of ECC-corrected errors.

                We'll take a closer look at the logs.


                • #9
                  Thanks David, I figured it was just differences in Memtest versions but I appreciate the confirmation.

                  Some updates:

                  I took the same 4 new RAM modules that were failing quickly in Memtest v8.4 on the new CPU and new mobo (Epyc 7282 & Asrock Rack ROMED8U-2T), put them in my other known good CPU/mobo (Epyc 7252 & Supermicro H12SSL-i), and ran Memtest v10.6. This combo completed all tests for all 4 passes with no RAM or ECC errors. I attached the logs for this test to this post (as well as the html file since this is completely different hardware, remove the .log from the end of the filename). I am not sure why these modules would fail so quickly in the new CPU/mobo but can last for 4 full passes in the old CPU/mobo. Perhaps the new Asrock Rack mobo just doesn't like these SK Hynix modules but the Supermicro mobo does? The new SK Hynix modules are Rx4 while the old Micron modules are Rx8, maybe that makes a difference? Fewer cores maybe don't push the modules to a problematic level? Regardless, I can just use the SK Hynix modules in the Supermicro motherboard since they seem to work fine there, they're the same size, and the differences between Rx4 and Rx8 are negligible for my use cases. I have another 4 matching modules coming in the mail.

                  I have the 8 known good Micron modules testing with Memtest v8.4 in the new CPU/mobo (Epyc 7282 & Asrock Rack ROMED8U-2T). Currently this combo is halfway through pass 3 without a single error coming up so far (RAM, ECC, or UEFI firmware). This seems to confirm that the UEFI issue that is appearing in Memtest v10.6 is indeed a UEFI bug and not system instability, but I will be doing further tests. I will provide logs from this run once it completes. I will also try running some mprime with in-place large FFTs to see if I can find any CPU/memory instability inside of an operating system.
                  Last edited by AbsolutelyFree; Dec-06-2023, 04:55 AM.


                  • #10
                    Here are the logs from the tests with version 8.4 on the new mobo and CPU with the known good RAM modules. All 4 passes were completed without any errors appearing in the UI. Looking through the logs, it looks like more of the "CPU #X timed out" errors appeared throughout the tests as was expected.

                    This new server seems to run fine with the Micron modules so I will leave them in there, and my other server is running fine with the SK Hynix modules. Not sure why the new server hates the SK Hynix modules so much, but oh well.

                    I am now running 24 hours of mprime on this hardware to confirm stability in an OS. 2 hours in and no errors so far, but I will update again tomorrow.

                    • #11
                      Thanks for the logs.

                      Yes, it appears the v8.4 logs do confirm it is likely a UEFI BIOS issue (and not an introduced bug in later versions of MemTest86).

                      We're revisiting how to report such UEFI firmware errors in a way that better represents their severity (e.g. display the error on screen just once, and not multiple times).


                      • #12
                        I ran 29 total hours of mprime blended tests to confirm that there was no instability in an OS and did not encounter a single error. At this point I feel confident that this system is fine with this RAM and that the issue is purely with a multithreading bug in the UEFI.

                        Big thanks to David (PassMark) and keith, I really appreciate you two following up with me and confirming what I am seeing. Let me know if you would like someone to test on a known buggy UEFI, this system will be living in a Proxmox cluster so it isn't too much trouble to run tests again.


                        Originally posted by keith
                        We're revisiting how to report such UEFI firmware errors in a way where it better represents its severity (eg. display error on screen just once, and not multiple times)​
                        I agree with this idea. Perhaps it could appear in the UI as a note, like when Memtest is run on RAM that is vulnerable to high frequency row hammer bit flips? The note could say something to the effect of "Detected a UEFI Multithreading bug, may impact test results" along with an explanation that appears in the Memtest documentation to better explain the situation. It's unfortunate but really the end user that's running Memtest is generally powerless to do anything about it unless the manufacturer fixes the issue in a later BIOS revision, or the bug doesn't exist in an earlier revision that can be rolled back to without detriment. At least in my case it didn't seem to actually cause any memory errors with RAM that this motherboard likes, which may not be universal but a note explaining the UEFI issue is present could help explain anomalous results if they do come up.
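As a rough illustration of the "show it once as a note" idea, repeated start-failure messages could be collapsed into a single line with a count. This is a sketch only and not how MemTest86 reports anything: the function name is hypothetical, the note wording is the suggestion from this post, and the message format is assumed to match what was shown in the UI.

```python
import re

# Message text as quoted earlier in this thread; the exact format is an assumption.
FIRMWARE_ERROR = re.compile(r"\[UEFI Firmware Error\] Could not start CPU (\d+)")


def summarize_firmware_errors(messages):
    """Collapse repeated CPU start failures into one note, or return None if there were none."""
    cpu_hits = [
        int(match.group(1))
        for message in messages
        for match in FIRMWARE_ERROR.finditer(message)
    ]
    if not cpu_hits:
        return None
    affected = ", ".join(str(cpu) for cpu in sorted(set(cpu_hits)))
    return (
        "Note: Detected a UEFI multithreading bug, may impact test results "
        f"({len(cpu_hits)} occurrence(s); CPUs affected: {affected})"
    )


# Hypothetical messages for illustration only.
events = [
    "[UEFI Firmware Error] Could not start CPU 9",
    "[UEFI Firmware Error] Could not start CPU 8",
    "[UEFI Firmware Error] Could not start CPU 9",
]
print(summarize_firmware_errors(events))
```

The full list of individual events could still go to the log file, so nothing is lost for anyone who wants the detail.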
