MemTest86 Pro v8.1/8.2 crashing in Test12 128b random number test

  • MemTest86 Pro v8.1/8.2 crashing in Test12 128b random number test

    Have been running the free version, just upgraded to Pro v8.1, then v8.2.

    When running on my systems, as soon as the test sequence gets to Test12 128b Random test, MemTest crashes and causes a reboot. No error messages are shown; the crash is immediate.

    Running all tests except Test12 overnight completed two full passes just fine. Test11 64b Random test works OK.

    Have tried on two different systems, exact same result. Configuration:

    (1) Lenovo S30 w/ Xeon E5-1620v2 4core/8thread, (4) 16G RDIMM DDR3-1866, 64G total, one per channel. Runs as DDR3-1866 per bios.

    (2) Lenovo S30 w/ Xeon E5-1680v2 8core/16thread, (8) 16G RDIMM DDR3-1866, 128G total, two per channel. Runs as DDR3-1600 per bios.

    I have also run the latest PassMark BurnInTest on both systems for half an hour or so (one or two memory passes) with no CPU or memory errors flagged.

    So how can the crashing problem for Memtest86 be resolved?

  • #2
    It's strange that there is no error message. Test 12 uses SIMD instructions, so really old CPUs will not support it, but these Xeons aren't that old.

    It could also be a power or temperature issue. SIMD instructions (AVX especially) draw a lot more current in the CPU and produce more heat.

    Can you send us a log file?



    • #3
      I see no thermal issues. Max CPU temp is reported as 83°C, which is high but not unreasonable. I saw no CPU failures running BurnInTest for a while, and Windows 7 64-bit runs fine.

      Here are the last lines of the log file:
      Code:
      2019-06-04 16:10:46 - Running test #12 (Test 12 [Random number sequence, 128-bit])
      2019-06-04 16:10:46 - MtSupportRunAllTests - Setting random seed to 0x50415353
      2019-06-04 16:10:46 - MtSupportRunAllTests - Start time: 141633 ms
      2019-06-04 16:10:46 - ReadMemoryRanges - Available Pages = 16703795
      2019-06-04 16:10:46 - MtSupportRunAllTests - Enabling memory cache for test
      2019-06-04 16:10:46 - MtSupportRunAllTests - Enabling memory cache complete
      2019-06-04 16:10:46 - Start memory range test (0x0 - 0x1080000000)
      2019-06-04 16:10:46 - Pre-allocating memory ranges >=16MB first...
      I will send the full log file separately.
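The sizes in this log excerpt can be cross-checked with a little arithmetic. A minimal Python sketch, assuming "Available Pages" counts 4 KiB UEFI pages and that the range end 0x1080000000 is an exclusive upper bound. On this 64G system it gives roughly 63.7 GiB available and a 66 GiB address range; the range probably exceeds installed RAM because firmware remaps some memory above 4 GiB to make room for device address space.

```python
# Figures copied from the log excerpt above.
AVAILABLE_PAGES = 16703795
RANGE_START = 0x0
RANGE_END = 0x1080000000  # assumed to be an exclusive upper bound

PAGE_SIZE = 4096  # assumed UEFI page size of 4 KiB
GIB = 1024 ** 3


def available_gib(pages: int, page_size: int = PAGE_SIZE) -> float:
    """Convert a page count to GiB."""
    return pages * page_size / GIB


def range_gib(start: int, end: int) -> float:
    """Size of the address range [start, end) in GiB."""
    return (end - start) / GIB


if __name__ == "__main__":
    print(f"Available memory: {available_gib(AVAILABLE_PAGES):.2f} GiB")
    print(f"Address range tested: {range_gib(RANGE_START, RANGE_END):.2f} GiB")
```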



      • #4
        It might be a UEFI firmware bug with multi-threading (we have seen a lot of these). Can you try forcing single-threaded mode?



        • #5
          Ok, switched to running with just one CPU (not 4 or 8 which the systems are capable of) and then Test 12 128b Random Data runs just fine.
          So it looks like it is a parallel CPU issue.
          If I set Single CPU, or Round Robin, or Sequential CPU, all work fine. Parallel CPU causes Test 12 (only) to die and cause a reboot.
          So is there a way via the config file to have Parallel CPU mode by default, but force Single CPU mode for just Test 12?
          I guess I could make up two config files, one with tests 0-11,13 with Parallel CPU; and a second with Single CPU and just test 12.



          • #6
            We have seen a similar UEFI firmware bug before. You could contact Lenovo and ask for better firmware, but I am sure they won't care about the bug.

            There is a blacklist file called blacklist.cfg.

            Each blacklisted baseboard is stored on a separate line with the following format:
            <baseboard>,<BIOS version>,<exact|partial match>,<restriction flags>

            For example,
            "Mac-F42C88C8",ALL,EXACT,RESTRICT_STARTUP
            "Z97MX-Gaming 5",ALL,EXACT,RESTRICT_MP


            There are several different types of blacklisting. The <restriction flags> value can be set to TEST12_ONECPU, which is what you want for your case.

            See page 45 of the User's Guide for more details. The User's Guide, MemTest86_User_Guide_UEFI.pdf, is included on the USB drive image.
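Because the baseboard name is double-quoted and may contain spaces, the line format above can be read with an ordinary CSV parser. A minimal Python sketch, assuming one entry per line with exactly four comma-separated fields. It illustrates the documented format and is not MemTest86's own parser; `BlacklistEntry` and `parse_blacklist_line` are hypothetical names.

```python
import csv
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlacklistEntry:
    baseboard: str     # SMBIOS baseboard name, with the quotes removed
    bios_version: str  # "ALL" or a specific BIOS version string
    match: str         # "EXACT" or "PARTIAL"
    flags: str         # restriction flag, e.g. "RESTRICT_MP"


def parse_blacklist_line(line: str) -> Optional[BlacklistEntry]:
    """Parse one <baseboard>,<BIOS version>,<match>,<flags> line.

    Returns None for blank lines or lines without exactly four fields.
    """
    line = line.strip()
    if not line:
        return None
    # csv.reader removes the double quotes around the baseboard name.
    fields = next(csv.reader([line]))
    if len(fields) != 4:
        return None
    baseboard, bios_version, match, flags = (f.strip() for f in fields)
    return BlacklistEntry(baseboard, bios_version, match.upper(), flags.upper())


if __name__ == "__main__":
    print(parse_blacklist_line('"Z97MX-Gaming 5",ALL,EXACT,RESTRICT_MP'))
```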



            • #7
              Updating the BIOS to fix this is most likely a non-starter with Lenovo, as this system (ThinkStation S30) is no longer an active product. Not gonna happen.

              So, I will try to apply the mentioned flag to my configuration. Looking at the user guide and my log file, it is not exactly clear what the "baseboard" parameter is. In my log file I see:
              Code:
              2019-06-04 16:06:50 - SMBIOS BIOS INFO Vendor: "LENOVO", Version: "A2KT65AUS", Release Date: "07/04/2018"
              2019-06-04 16:06:50 - SMBIOS SYSTEM INFO Manufacturer: "LENOVO", Product: "4352G9U", Version: "Lenovo Product", S/N: "MJ37RCB", SKU: "", Family: ""
              2019-06-04 16:06:50 - SMBIOS: Found SMBIOS BaseboardInformation (pbLinAddr=0x7C5B00B3, FormattedLen=15, iTotalLen=112)
              2019-06-04 16:06:50 - SMBIOS BASEBOARD INFO Manufacturer: "LENOVO", Product: "LENOVO", Version: "0B98417 WIN", S/N: "NONE                ", AssetTag: "                         ", LocationInChassis: "To be filled by O.E.M."
              So would this translate to:
              "0B98417 WIN",ALL,EXACT,TEST12_ONECPU

              using the BASEBOARD INFO Version string for the <baseboard> parameter?
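The quoted fields in these SMBIOS log lines can be pulled out mechanically, which makes it easier to compare the Manufacturer, Product and Version strings side by side. A minimal Python sketch, assuming every field is written as Key: "value" as in the lines above. It is a reading aid only, and `parse_smbios_fields` is a hypothetical helper, not part of MemTest86.

```python
import re

# Matches Key: "value" pairs; [\w/] also allows keys such as "S/N".
FIELD_RE = re.compile(r'([\w/]+): "([^"]*)"')


def parse_smbios_fields(log_line: str) -> dict:
    """Return the quoted key/value pairs from one SMBIOS log line, values trimmed."""
    return {key: value.strip() for key, value in FIELD_RE.findall(log_line)}


if __name__ == "__main__":
    line = ('SMBIOS BASEBOARD INFO Manufacturer: "LENOVO", Product: "LENOVO", '
            'Version: "0B98417 WIN", S/N: "NONE                "')
    fields = parse_smbios_fields(line)
    for key in ("Manufacturer", "Product", "Version"):
        print(f"{key}: {fields[key]}")
```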



              • #8
                I think Lenovo have done the wrong thing in their BIOS setup. It doesn't make sense to have their product called LENOVO, as that data should be (and already is) in the Manufacturer field.

                So try:
                "LENOVO",ALL,EXACT,TEST12_ONECPU

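Whether an entry applies depends on how its <baseboard> string is compared with the board's SMBIOS Product string. A minimal Python sketch, assuming EXACT means identical strings and PARTIAL means the entry appears somewhere in the Product string. The User's Guide is not quoted here, so treat both semantics as assumptions; `baseboard_matches` is a hypothetical name.

```python
def baseboard_matches(entry_name: str, board_product: str, mode: str) -> bool:
    """Return True if a blacklist baseboard name matches the SMBIOS Product string.

    Assumed semantics (not confirmed by the User's Guide):
      EXACT   - the two strings are identical
      PARTIAL - the entry name appears somewhere in the Product string
    """
    mode = mode.upper()
    if mode == "EXACT":
        return entry_name == board_product
    if mode == "PARTIAL":
        return entry_name in board_product
    raise ValueError(f"unknown match mode: {mode!r}")


if __name__ == "__main__":
    # The ThinkStation S30 reports Product "LENOVO" in its BASEBOARD INFO.
    print(baseboard_matches("LENOVO", "LENOVO", "EXACT"))
    print(baseboard_matches("0B98417 WIN", "LENOVO", "EXACT"))
```

Under these assumptions, the Version string "0B98417 WIN" would never match, because the comparison is made against the Product field.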


                • #9
                  Ok, so I tried adding this line to the blacklist.cfg file:
                  "LENOVO",ALL,EXACT,TEST12_ONECPU
                  but when I boot and run the test it is not matched, has no effect. In the log file I don't even see the line echoed as being read from the blacklist.cfg file. Strange.
                  So I changed it to this:
                  "LENOVO",ALL,EXACT,DISABLE_MP
                  and then when I reboot and run Memtest86 the line is recognized in the log file as being read from the blacklist.cfg file. And during the test only 1 cpu is used, as expected.

                  So it appears to me that if the TEST12_ONECPU keyword is used, the line is ignored in the blacklist.cfg file. It is just discarded like a comment, with no error or warning issued.
                  I notice no other entries in the distributed blacklist.cfg file use this entry. So has it been tested and validated to work? Just askin'.



                  • #10
                    I just noticed in the "Release Notes" for v8.2 it refers to the flag "TEST12_SINGLECPU" whereas in the pdf documentation (and above) you referred to "TEST12_ONECPU".
                    So could that be the issue?

                    Update: found this line in the .cfg file:
                    "X9DRi-LN4+/X9DR3-LN4+",ALL,EXACT,TEST12_SINGLECPU

                    so that appears to be the issue!
                    Last edited by donorth; Jun-06-2019, 08:17 PM.
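The underlying problem in #9 and #10 was that a line with an unrecognised flag was dropped without any message. Warning on unknown flags would have surfaced the TEST12_ONECPU / TEST12_SINGLECPU mismatch immediately. A minimal Python sketch; `KNOWN_FLAGS` contains only the flags mentioned in this thread and is an assumption, not MemTest86's full list. `check_flag` is a hypothetical helper.

```python
import warnings

# Only the flags that appear in this thread; the real list may be longer.
KNOWN_FLAGS = {"RESTRICT_STARTUP", "RESTRICT_MP", "DISABLE_MP", "TEST12_SINGLECPU"}


def check_flag(flag: str, line_no: int) -> bool:
    """Return True if the flag is recognised; otherwise warn instead of silently ignoring it."""
    if flag.upper() in KNOWN_FLAGS:
        return True
    warnings.warn(f"blacklist.cfg line {line_no}: unknown restriction flag {flag!r}")
    return False


if __name__ == "__main__":
    check_flag("TEST12_SINGLECPU", 1)  # accepted silently
    check_flag("TEST12_ONECPU", 2)     # emits a UserWarning
```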



                    • #11
                      Final update: I ran the tests with the blacklist file entry changed to TEST12_SINGLECPU and it worked as expected.
                      Test12 runs on one CPU only and does not crash, while the other tests use multiple CPUs.
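The behaviour described here, single-CPU for Test 12 and parallel CPUs for the rest, can be summarised as a per-test mode choice. A minimal Python sketch, assuming TEST12_SINGLECPU affects only test 12 and DISABLE_MP forces single-CPU mode for every test (as observed in #9). The mode names and `cpu_mode_for_test` are hypothetical, not MemTest86 identifiers.

```python
def cpu_mode_for_test(test_no: int, flags: set, default_mode: str = "PARALLEL") -> str:
    """Pick the CPU mode for one test, given the restriction flags matched for this board.

    Assumed behaviour:
      DISABLE_MP       - every test runs on a single CPU
      TEST12_SINGLECPU - only test 12 runs on a single CPU
    """
    if "DISABLE_MP" in flags:
        return "SINGLE"
    if test_no == 12 and "TEST12_SINGLECPU" in flags:
        return "SINGLE"
    return default_mode


if __name__ == "__main__":
    flags = {"TEST12_SINGLECPU"}
    for test_no in range(14):  # tests 0-13
        print(test_no, cpu_mode_for_test(test_no, flags))
```

This is the same split as the two-config-file workaround suggested in #5, but expressed as one rule applied per test.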



                      • #12
                        Good to hear it is working.

                        We have updated the documentation to use the correct flag name, TEST12_SINGLECPU.



                        • #13
                          A final update. Since 2019 I have been running my Lenovo ThinkStation S30 with the above fix, which sets Test12 to single-CPU mode.
                          My system has BIOS A2KT68A [v68], and I needed the fix for MT86 v8.1 through the latest v9.4 to get MemTest86 Test12 to run without freezing or crashing.

                          However, I recently decided to update to the BIOS A2KT70A [v70] release from Lenovo (released in late 2020, I am slow...).
                          I tried MT86 WITHOUT the TEST12_SINGLECPU fix, and lo and behold it works!
                          Test 12 Random 128b now runs in multiple CPU mode with the updated BIOS v70 without the Test12 SingleCPU workaround.

                          So (finally) case closed.



                          • #14
                            Thanks for the update. Good to know some of these UEFI BIOS bugs eventually get fixed.



                            • #15
                              Thanks for letting us know. Glad to hear this was fixed by Lenovo.

                              Can you change the line in blacklist.cfg to:

                              Code:
                              "LENOVO",A2KT70A,EXACT,TEST12_SINGLECPU
                              This should apply the blacklist workaround only for BIOSes earlier than 'A2KT70A'.
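The "only for BIOSes earlier than" rule implies a version comparison. A minimal Python sketch, assuming "ALL" matches every BIOS and any other value acts as a threshold compared as a plain string. How MemTest86 actually compares versions is not stated here, so treat this as an assumption; `entry_applies` is a hypothetical name.

```python
def entry_applies(entry_bios: str, board_bios: str) -> bool:
    """Decide whether a blacklist entry applies to the running BIOS version.

    Assumptions (not confirmed by the User's Guide):
      - "ALL" applies to every BIOS version.
      - Any other value is a threshold: the entry applies only to BIOS
        versions that sort before it as plain strings.
    """
    if entry_bios.upper() == "ALL":
        return True
    return board_bios < entry_bios


if __name__ == "__main__":
    # Version strings as they appear in this thread and in the SMBIOS log.
    for bios in ("A2KT65AUS", "A2KT68A", "A2KT70A", "A2KT70AUS"):
        print(bios, entry_applies("A2KT70A", bios))
```

Plain string ordering only works while the version strings keep a fixed-width format. For example, a hypothetical "A2KT100A" would sort before "A2KT70A".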

