System hang when running MemTest86 on Supermicro board

  • System hang when running MemTest86 on Supermicro board

    I am encountering hangs after about 20 minutes of testing when I set Boot Mode to EFI in BIOS Setup.
    The problem occurs in both Single-CPU and Multi-CPU mode.
    I don't see the problem if I set Boot Mode to Dual or disable the EFI LAN OPROM.

    Motherboard: Supermicro X11DPU Rev 1.10 (BIOS version 3.3)
    CPU: Intel Xeon Gold 5215M
    RAM: Micron MTA36ASF4G72PZ-2G9E2VG 32GB DDR4-2933

  • #2
    Unfortunately, Supermicro boards have a history of BIOS bugs that cause lockups.


    Here are some of them
    https://www.passmark.com/forum/memte...supermicro-x10
    https://www.passmark.com/forum/memte...election-modes
    https://www.passmark.com/forum/memte...t-in-uefi-mode
    https://www.passmark.com/forum/memte...rucial-ecc-ram
    https://www.passmark.com/forum/memte...ots-with-v-7-4

    As a result, it is hard to know whether you are seeing a hardware fault or a BIOS bug.
    (Probably another BIOS bug, if disabling EFI LAN support fixes the problem.)




    • #3
      Hi David,

      I ran some tests on the X11DPU.
      I dumped the system memory map in the EFI shell before running MemTest86, then ran MemTest86 directly to reproduce the hang.
      The system eventually hung in Test 13 [Hammer test], in the memory range 0x360000000 - 0x380000000.

      In the system memory map, the range is Available:
      Available 0000000100000000-000000047FFFFFFF 0000000000380000 000000000000000F
      In the MemTest86 log, the range is Free Memory:
      2020-08-12 16:35:33 - 0x000100000000 - 0x00047FFFFFFF (14336MB) {Free Memory}

      The range is consistent between the BIOS memory map and MemTest86, yet the system still hangs.
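The consistency claim above can be checked arithmetically. The following is a minimal sketch, assuming the shell row has the layout `<type> <start>-<end> <pages> <attributes>` with all numbers in hex (as the quoted output suggests) and that the page count refers to 4 KiB pages, which is the unit UEFI memory descriptors use. The function names are made up for this illustration.

```python
import re

PAGE_SIZE = 4096  # UEFI memory descriptors count 4 KiB pages


def parse_shell_row(line):
    """Parse one row of the EFI shell memory map as quoted in the post.
    Assumed layout: <type> <start>-<end> <pages> <attributes>, all hex."""
    kind, span, pages, attrs = line.split()
    start, end = (int(x, 16) for x in span.split("-"))
    return kind, start, end, int(pages, 16), int(attrs, 16)


def parse_memtest_row(line):
    """Parse one memory-map row of the MemTest86 log as quoted in the post."""
    m = re.search(r"(0x[0-9A-Fa-f]+) - (0x[0-9A-Fa-f]+) \((\d+)MB\) \{(.+)\}", line)
    return m.group(4), int(m.group(1), 16), int(m.group(2), 16), int(m.group(3))


shell_row = "Available 0000000100000000-000000047FFFFFFF 0000000000380000 000000000000000F"
log_row = "2020-08-12 16:35:33 - 0x000100000000 - 0x00047FFFFFFF (14336MB) {Free Memory}"

kind, start, end, pages, attrs = parse_shell_row(shell_row)
log_kind, log_start, log_end, log_mb = parse_memtest_row(log_row)

size_bytes = end - start + 1          # end address is inclusive
size_mib = size_bytes // 2**20

# The range where the hang was reported (Test 13).
hang_lo, hang_hi = 0x360000000, 0x380000000
hang_inside = start <= hang_lo and hang_hi - 1 <= end

print(f"pages * 4 KiB = {pages * PAGE_SIZE:#x}, range size = {size_bytes:#x}")
print(f"size = {size_mib} MiB, MemTest86 log says {log_mb}MB")
print(f"hang range inside the free region: {hang_inside}")
```

Both views describe the same 14336 MiB region, and the reported hang range lies entirely inside it, so the hang cannot be explained by MemTest86 testing a range that the firmware had marked as in use.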



      I compared the system memory map with the MemTest86 log and found that some regions differ.
      For example, in the BIOS memory map:
      Available 000000004A281000-000000005D70BFFF 000000000001348B 000000000000000F
      LoaderCode 000000005D70C000-000000005D7FDFFF 00000000000000F2 000000000000000F

      Available 000000005DC2C000-0000000062519FFF 00000000000048EE 000000000000000F
      BS_Data 000000006251A000-0000000069341FFF 0000000000006E28 000000000000000F

      In the MemTest86 log:
      2020-08-12 16:35:33 - 0x00004A281000 - 0x00005D5D9FFF (307MB) {Free Memory}
      2020-08-12 16:35:33 - 0x00005D5DA000 - 0x00005D7FDFFF (2MB) {Loader Code}

      2020-08-12 16:35:33 - 0x00005DC2C000 - 0x000061B88FFF (63MB) {Free Memory}
      2020-08-12 16:35:33 - 0x000061B89000 - 0x000061DC8FFF (2MB) {Boot Services Data}
      2020-08-12 16:35:33 - 0x000061DC9000 - 0x0000623BAFFF (5MB) {Free Memory}
      2020-08-12 16:35:33 - 0x0000623BB000 - 0x000069341FFF (111MB) {Boot Services Data}
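To see exactly where the two maps disagree, the quoted rows can be compared range by range. This is a rough sketch using only the rows quoted above. The type names are normalized by hand (`free`, `loader_code`, `bs_data`), and `type_changes` is a helper invented for this illustration, not part of any tool.

```python
def type_changes(before, after):
    """Return (start, end, old_type, new_type) for every address range whose
    type differs between two maps. Each map is a list of (start, end, type)
    with inclusive end addresses; addresses missing from either map are ignored."""
    changes = []
    for b_start, b_end, b_type in before:
        for a_start, a_end, a_type in after:
            lo, hi = max(b_start, a_start), min(b_end, a_end)
            if lo <= hi and b_type != a_type:
                changes.append((lo, hi, b_type, a_type))
    return sorted(changes)


# Rows from the EFI shell memory map (taken before MemTest86 was started).
shell_map = [
    (0x4A281000, 0x5D70BFFF, "free"),         # Available
    (0x5D70C000, 0x5D7FDFFF, "loader_code"),  # LoaderCode
    (0x5DC2C000, 0x62519FFF, "free"),         # Available
    (0x6251A000, 0x69341FFF, "bs_data"),      # BS_Data
]

# Rows from the MemTest86 log.
memtest_map = [
    (0x4A281000, 0x5D5D9FFF, "free"),         # Free Memory
    (0x5D5DA000, 0x5D7FDFFF, "loader_code"),  # Loader Code
    (0x5DC2C000, 0x61B88FFF, "free"),         # Free Memory
    (0x61B89000, 0x61DC8FFF, "bs_data"),      # Boot Services Data
    (0x61DC9000, 0x623BAFFF, "free"),         # Free Memory
    (0x623BB000, 0x69341FFF, "bs_data"),      # Boot Services Data
]

changes = type_changes(shell_map, memtest_map)
for lo, hi, old, new in changes:
    print(f"{lo:#011x} - {hi:#011x}  {old} -> {new}  ({(hi - lo + 1) // 1024} KiB)")
```

In these rows every difference is memory that was free in the shell dump and is allocated (loader code or boot services data) in the MemTest86 log, and nothing moves from allocated back to free. The comparison only shows what changed between the two snapshots; it cannot say which software made the allocations.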

      Does MemTest86 use the system memory map table to decide which regions to test?
      Why does the system hang when the region 0x360000000 - 0x380000000 is free in the system memory map?

      Thanks


      • #4
        "system hang on Test 13 [Hammer test] and memory range is 0x360000000 - 0x380000000."
        That doesn't match your initial screenshot, which showed a hang in Test #7 in the range
        0x100000000 - 0x880000000

        Maybe the hang is just random?

        "Does memtest86 get system memory map table to decide testing regions ?"
        Yes.



        • #5
          Hi David,

          Something is wrong: that screenshot was not provided by me.
          Please refer to my attached MemTest86 log and system memory map table.

          Thanks



          • #6
            Hi David,

            The issue happens 100% of the time when the system loads the LAN EFI driver, but which test hits it is random.
            I have seen failures in Test 3, Test 7 and Test 13.

            In my experience, running drivers can continue to allocate memory for their own use.
            Does MemTest86 skip a region if a running device allocates memory there?
            Does MemTest86 change the memory map after it starts? I can't confirm this because the system reboots after exiting MemTest86.

            Thanks



            • #7
              OK, yes, sorry. I just noticed you added your problem onto the end of someone else's problem.

              MemTest86 always allocates the RAM to reserve it before testing. So if there is another process using RAM at the same time, there should be no conflict. This assumes that the other process also follows the correct procedure in allocating RAM as well. But it could be just a bug in the LAN EFI driver and it is writing to a random memory location.



              • #8
                Hi David,

                If MemTest86 gets its memory map from the system,
                why is the MemTest86 memory table different from the system memory map?

                Thanks



                • #9
                  Could be a timing difference.
                  Or it could be different software (e.g. drivers) are loaded into memory in each case.



                  • #10
                    Does PassMark know about Redfish?
                    1. This is not a workaround, it is an architectural change for checking the D.O.R.A. process of the SMCI Redfish Host Interface.
                    2. & 4. Anyone can easily make MemTest86 V8.4 Free version hang by writing an EFI application that, when executed under the EFI shell, allocates memory in the EFI_BOOT_SERVICES memory map and modifies its contents in the background.
                    The timer interrupt (INT 0) is the only interrupt vector in the EFI shell environment.
                    MemTest86 V8.4 Free version manipulates memory regions without disabling the timer interrupt in the EFI shell environment.
                    As a result, the background timer event that checks the D.O.R.A. process for the SMCI Redfish Host Interface has a chance to allocate memory
                    and modify memory contents in the EFI_BOOT_SERVICES memory map.



                    • #11
                      We have no idea what "D.O.R.A. process of SMCI Redfish" is.
                      We did a search on Google for this and there were zero results.
                      Maybe you need to explain in more detail.

                      Even if there were a background interrupt-driven process, it should still follow the standard process for allocating RAM, i.e. it should not allocate RAM that is already allocated and in use. It should only use free RAM.



                      • #12
                        Well, the BIOS has its memory regions and the Redfish interface has its memory regions.
                        Why then don't you make MemTest86 aware of which memory regions are free, so that it doesn't write into regions that are in use?
                        Redfish is very important for all data center users.



                        • #13
                          Memtest86 only uses memory regions allocated through UEFI.
                          It never makes use of regions that are already in use.

                          So I still don't see your point about Redfish (which I assume is some software from Supermicro), unless your point is that Redfish is doing the wrong thing and making use of memory that it shouldn't be.



                          • #14
                            I want to check whether this statement is correct, or whether we still don't know:
                            the issue is not related to whether GetMemoryMap(), provided by boot services, reports correctly.

                            The evidence we can present is an EFI application that creates a 1 ms periodic timer event at the TPL_CALLBACK level to allocate and modify an EfiBootServicesData memory pool.
                            This tool can crash EFI MemTest86 V8.4 Free Version, because EFI MemTest86 V8.4 Free Version does not run at TPL_HIGH_LEVEL,
                            so it can be interrupted by timer events created by other EFI applications.
                            I think EFI MemTest86 V8.4 is a very sensitive tool for examining memory; it needs an exclusive operating region, but it doesn't have one.
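The preemption argument can be illustrated with a toy model. This is only a sketch in Python, not UEFI code: the four TPL constants carry the values defined in the UEFI specification, and the dispatch rule modeled is that an event's notification function runs only while the current TPL is below the event's notify TPL. Everything else (`run_test`, the buffer, the `callback_is_buggy` flag) is hypothetical and exists purely for illustration.

```python
# Task priority levels with the values defined in the UEFI specification.
TPL_APPLICATION, TPL_CALLBACK, TPL_NOTIFY, TPL_HIGH_LEVEL = 4, 8, 16, 31


def run_test(test_tpl, event_tpl, callback_is_buggy, steps=5):
    """Toy memory test: write 0x55 into each cell while a periodic timer
    event tries to fire after every write.

    Returns (times_the_event_fired, test_memory_corrupted)."""
    buffer = [0xAA] * steps
    fired = 0
    for i in range(steps):
        buffer[i] = 0x55  # the test writes its pattern
        # The event's notification runs only if it outranks the running code.
        if event_tpl > test_tpl:
            fired += 1
            if callback_is_buggy:
                # A misbehaving callback writes into memory it does not own.
                buffer[i] = 0x00
    corrupted = any(cell != 0x55 for cell in buffer)
    return fired, corrupted


# Test at TPL_APPLICATION: the TPL_CALLBACK event preempts it on every step.
print(run_test(TPL_APPLICATION, TPL_CALLBACK, callback_is_buggy=True))
# Test at TPL_HIGH_LEVEL: the event never gets to run.
print(run_test(TPL_HIGH_LEVEL, TPL_CALLBACK, callback_is_buggy=True))
# Test at TPL_APPLICATION with a callback that only touches its own memory.
print(run_test(TPL_APPLICATION, TPL_CALLBACK, callback_is_buggy=False))
```

In this model, preemption by itself does not damage the test. Corruption appears only when the callback also writes into memory that the test has allocated. Raising the test to TPL_HIGH_LEVEL hides such a callback rather than fixing it.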



                            • #15
                              The only way we can imagine Redfish causing a problem is if it is buggy.

                              To block the timer interrupt we would need to run MemTest86 at a higher priority level. But doing that is documented to lead to unpredictable behaviour.

                              Quote from the UEFI BIOS writers guide,
                              "Good coding practice dictates that all code should execute at its lowest possible TPL level, and the use of TPL levels above TPL_APPLICATION must be minimized. Executing at TPL levels above TPL_APPLICATION for extended periods of time may also result in unpredictable behavior."

                              So if Redfish really is causing a problem (and we have no proof of that at all), then they should fix their bugs in it.
