Stuck in PXE running after 39 hours testing


  • Stuck in PXE running after 39 hours testing

    Hi Admin,

    Do you know the maximum duration of a MemTest86 PXE test? How long have PXE tests run in your lab? I ask because my PXE test gets stuck at about 39 hours, so I'm not sure whether the PXE test has a time restriction. I tried several long PXE runs, and they always got stuck at about 39 hours. Could you please help check? Thank you so much.

    My PXE test scenario:
    1. One computer as the PXE server, with the MemTest86 site version and DHCP software.
    2. One 48-port Ethernet switch.
    3. The DUTs are Surface Laptop and Surface Pro devices.


  • #2
    There is nothing in Memtest86 that would cause it to stop at 39 hours.
    You can set the number of passes to perform, which controls the test duration. But it shouldn't be "stuck" once those passes are complete.
    How many passes did you set, and how many did it do before it got stuck?

    Did you try the same test with a USB boot instead of a PXE boot?




    • #3
      I set 150 passes for 8GB of memory, which should run for more than 72 hours. But it got stuck at 39 hours, after completing 82 passes. Please have a look at the screenshot below.

      I ran the same test from a USB boot instead of PXE, and it completed all 150 passes. So it only fails over PXE. Is there any setting that makes the PXE run write a log for debugging? If you have any clue, please let me know. Thank you so much.
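      As a rough check on the figures in this post, here is a minimal sketch that extrapolates the total run time from 82 passes in roughly 39 hours with 150 passes configured. It assumes every pass takes about the same time, which is a simplification, and the names are illustrative only.

```python
# Figures quoted in the post; the freeze time is approximate ("about 39 hours").
PASSES_CONFIGURED = 150
PASSES_DONE = 82
HOURS_AT_FREEZE = 39.0


def projected_total_hours(passes_done, hours_elapsed, passes_configured):
    """Linear extrapolation, assuming each pass takes roughly the same time."""
    hours_per_pass = hours_elapsed / passes_done
    return hours_per_pass * passes_configured


minutes_per_pass = HOURS_AT_FREEZE / PASSES_DONE * 60
total_hours = projected_total_hours(PASSES_DONE, HOURS_AT_FREEZE, PASSES_CONFIGURED)

print(f"{minutes_per_pass:.1f} min per pass")
print(f"{total_hours:.1f} h projected for {PASSES_CONFIGURED} passes")
```

      Under that assumption the projection comes out a little over 71 hours, close to the expected 72 hours, so the display stops slightly past the halfway point of the run.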

      [Attached screenshot: image.png]


      • #4
        We suspect it may be related to XML status reporting via TFTP.

        Can you try adding the following line to the mt86.cfg file on the PXE server:

        Code:
        PMPDISABLE=1
        Using this option will prevent the Management Console from showing any test progress, but to narrow down the problem it is a good test.


        • #5
          Thanks, Keith. I will try "PMPDISABLE=1" in the cfg file. Let's see if all tests can pass.


          • #6
            Hi David & Keith:

            The PXE test is still failing. It always gets stuck at about 39 hours. Do you have any other ideas about the PXE test duration? May I ask whether you have verified long PXE runs in your lab? So far, all my verification has been on Surface products only. If you have run long PXE tests before, maybe I can try another brand of laptop that you have tested. Thanks a lot.

            Test scenarios:
            1. Server + Switch + Ethernet cables + Client PC (DUT): PXE test stuck at about 39 hours. (Failed)
            2. Server + Switch + Ethernet cables + Client PC (DUT): Adding "PMPDISABLE=1" made no difference; PXE test stuck at about 39 hours. (Failed)
            3. Server + Ethernet cable + Client PC (DUT): Server and Client connected directly by Ethernet cable. PXE test stuck at about 39 hours. (Failed)
            4. USB key + Client PC (DUT): Used the USB key version on the Client PC (DUT); the test ran for more than 72 hours and passed. (Pass)


            • #7
              Do you have screenshots from the other times this has happened? Does the lockup occur at the same time, to the second, in each run? I am wondering if it is some type of network timeout. Another possibility is that there is some type of resource leak related to PXE / the network device driver in the UEFI BIOS, and eventually the driver fails.

              How many different machines have you tried this on? For example, on a machine with twice the RAM, does it take longer to lock up? That would indicate that the trigger might be MemTest86 itself reaching a certain position in the testing (e.g. 82 test cycles).

              No, we haven't done 39+ hour test runs from a PXE boot (not in one go anyway).

              Also strange: the timestamp value in your screenshot is in a messed-up format (39:36:47:52).

              Can you try significantly reducing the amount of RAM being tested via the config file (ADDRLIMLO and ADDRLIMHI)? This should help if it is a resource leak.
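              To give a sense of scale for such a restriction, the sketch below computes the size of the window described by the commented-out sample values in the stock configuration file (ADDRLIMLO=0x10000000, ADDRLIMHI=0x20000000) against the 8GB mentioned earlier in the thread. It assumes both limits are byte addresses and treats 8GB as 8 GiB; neither is confirmed here, so check the MemTest86 help file before relying on it.

```python
# Sample values from the stock mt86.cfg (commented out there by default).
# Assumption: both limits are byte addresses.
ADDRLIMLO = 0x10000000
ADDRLIMHI = 0x20000000

# 8GB of installed memory, as reported earlier in the thread (taken as 8 GiB).
TOTAL_RAM_BYTES = 8 * 1024**3

window_bytes = ADDRLIMHI - ADDRLIMLO
window_mib = window_bytes // 1024**2
fraction_of_ram = window_bytes / TOTAL_RAM_BYTES

print(f"Test window: {window_mib} MiB ({fraction_of_ram:.1%} of installed RAM)")
```

              With a window this small each pass should finish far sooner, so many more passes would fit into 39 hours. Whether the lockup then follows the pass count or the elapsed time helps separate the two explanations.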


              • #8
                Hi David,

                I don't have any more screenshots on hand. I'm starting a new round of PXE tests and will send them once the tests are done.

                The lockup time differed by some minutes from round to round, but so far it has always been at about 39 hours. For example, the first time it was stuck at 39:12:20, the second round might be stuck at 39:30:40, and the third round at 39:30:20. It also got stuck in a random test case each time.
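                To put a number on how close these times are, here is a small sketch that converts the three example timestamps above to seconds and reports the spread. The values are the illustrative ones quoted in this post, not logged measurements, and the helper name is made up for the example.

```python
def to_seconds(hms):
    """Convert an elapsed time written as H:MM:SS into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Example lockup times quoted in the post (approximate, per the poster).
lockup_times = ["39:12:20", "39:30:40", "39:30:20"]

seconds = [to_seconds(t) for t in lockup_times]
spread_seconds = max(seconds) - min(seconds)
spread_min, spread_sec = divmod(spread_seconds, 60)

print(f"Earliest: {min(seconds)} s, latest: {max(seconds)} s")
print(f"Spread: {spread_min} min {spread_sec} s")
```

                With these example values the spread is about 18 minutes, so the lockup does not look tied to an exact second, although a time-related cause within a narrow window is not ruled out.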

                So far I have tested four machines: one Surface Laptop and three Surface Pros. The MemTest86 config file is shown below.

                Let me try your idea of reducing the amount of RAM. Thanks.

                #
                # MemTest86 configuration file
                #
                # Please see the help file for a list of available parameters
                #
                # IMPORTANT: Lines that start with a hash character (#) will be ignored.
                # Please remove the hash character to have the setting take effect


                TSTLIST=0,1,2,3,4,5,6,7,8,9,10,11,12,13
                # TESTCFGFILE=customtests.cfg
                NUMPASS=120
                # ADDRLIMLO=0x10000000
                # ADDRLIMHI=0x20000000
                # MEMREMMB=16
                # MINMEMRANGEMB=16
                # CPUSEL=PARALLEL
                # CPUNUM=1
                # CPULIST=2,3
                # MAXCPUS=32
                # DISABLEMP=1
                # ENABLEHT=1
                # ECCPOLL=0
                # ECCINJECT=0
                # MEMCACHE=0
                # PASS1FULL=0
                # ADDR2CHBITS=12,9,7
                # ADDR2SLBITS=3,4
                # ADDR2CSBITS=8
                # LANG=ja-JP
                # REPORTNUMERRS=10
                # REPORTNUMWARN=0
                REPORTPREFIX=SYSINFOSN
                AUTOMODE=1
                # AUTOREPORT=0
                # AUTOREPORTFMT=HTML
                # AUTOPROMPTFAIL=1
                # SKIPSPLASH=1
                EXITMODE=2
                # MINSPDS=0
                # EXACTSPDS=0
                # EXACTSPDSIZE=8192
                # CHECKMEMSPDSIZE=1
                # SPDMANUF=Kingston
                # SPDPARTNO=9905402
                # SPDMATCH=1
                # SAMESPDPARTNO=1
                # BGCOLOR=BLUE
                # HAMMERPAT=0x10101010
                # HAMMERMODE=SINGLE
                # HAMMERSTEP=0x10000
                # CONSOLEMODE=1
                # CONSOLEONLY=0
                # BITFADESECS=300
                # MAXERRCOUNT=10000
                # TFTPSERVERIP=192.168.1.1
                # TFTPSTATUSSECS=60
                # PMPDISABLE=0
                # RTCSYNC=1
                # TRIGGERONERR=1
                # VERBOSITY=1
                # TPL=HIGH_LEVEL


                • #9
                  Originally posted by Ellick
                  4. USB key + Client PC (DUT): Used the USB key version on the Client PC (DUT); the test ran for more than 72 hours and passed. (Pass)
                  If possible, can you send a copy of the logs for this run?

                  We may be able to find some clues, even if it didn't freeze.


                  • #10
                    Hi Keith,

                    I have attached the USB log from MemTest86. Please take a look when you are available. I also found something interesting: the PXE test could complete, and only the GUI was stuck at 39 hours. It means that what I have seen might be only a GUI hang.

                    This time I ran 60 passes of the full test set over PXE. The test got stuck at 39 hours, the same as before. I kept the test running, and the test report was uploaded to the server automatically at about 61 hours. But the GUI was still stuck at 39 hours, and I couldn't do anything in it; keyboard and mouse input had no effect. I have checked the report, and it's good.

                    I'm continuing to debug to find more clues, and I will share with you and David if I find something new.

                    Thank you so much.



                    • #11
                      Thanks for the logs and additional details.

                      I kept the test running, and the test report was uploaded to the server automatically at about 61 hours. But the GUI was still stuck at 39 hours, and I couldn't do anything in it; keyboard and mouse input had no effect. I have checked the report, and it's good.
                      This is interesting. It seems that the system didn't actually freeze and only the screen is stuck (which implies a UEFI BIOS bug). I assume MemTest86 was configured to run fully automatically and to restart/shut down once the tests have completed? Just wondering if the screen is restored once the system reboots.

                      Also, if you have a different platform (i.e. different UEFI firmware) you can test, it might be good to see if the same issue occurs.


                      • #12

                        Hi Keith,

                        Yes, you are right. I configured it to run fully automatically and to restart the computer once the tests have completed. My config was the same as the example above. And I tried a different DUT with a different UEFI. It had the same issue: the GUI was stuck, and the test completed automatically.

                        So is it caused by the UEFI? Perhaps the firmware has some issue with PXE communication. If you have any clue, please kindly share it with me. Thank you so much.


                        • #13
                          And I tried a different DUT with a different UEFI
                          How different?
                          From a different vendor?


                          • #14
                            Same vendor, just a different model, UEFI, and SAM.
                            One is a Surface Pro, the other is a Surface Laptop.


                            • #15
                              Could you try with a different device (from a different vendor)?
                              It seems likely that two devices from the same vendor share the same UEFI BIOS and thus the same bugs (if it is a bug).
