BurnInTest Linux Disk test v4 no errors, v5 errors after a random amount of time

  • BurnInTest Linux Disk test v4 no errors, v5 errors after a random amount of time

    We're running the Disk test in BurnInTest V4 (1004) for 12 hours on systems just out of production and we get no errors.
    When we run the same test on the same machine, not even rebooted, using BurnInTest V5 (1010), we get errors after a random amount of time. So far they start to show up after around 2-4 hours and keep accumulating for as long as the test is running.

    The latest test started at around 08:00 local time and the first errors started showing around 12:06.
    The Event Log reports "Disk, /dev/nvme0n1p1: Test file could not be created...".

    Since we had around 10 newly built production units, we tested them all: they all passed the test using V4 but failed it using V5.

    Since they are all brand-new units, I doubt it is a hardware error. Has anything changed in the way the tests are executed in the last couple of V5 updates? If I recall correctly we did not see these errors with build 1006, but on the other hand we did not test the exact same systems, so take that with a grain of salt.

  • #2
    There have been some changes between V4 and V5:
    https://www.passmark.com/products/bu...ux_history.php

    Can you run BITLinux in debug mode (with the '-d' parameter), reproduce the error, and send us the logs?
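
    For example, something like this (assuming the GUI binary bit_gui_x is started directly; adjust the path and whether you need sudo for your setup):
    Code:
    sudo ./bit_gui_x -d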

    • #3
      I'm sorry for the late reply. I ran some tests in debug mode, but since I ran the CPU, RAM and graphics tests as well as the disk test, the log file got a bit too large. I'm running a disk-test-only session at the moment so I can give you a complete debug log. I've attached a file where I filtered out all lines that contain the phrase "disk:", but as I said, I will get you a complete debug log as soon as the test has failed.

      The first error occurred on May 16 at 12:02:29.

      • #4
        That is an unusual error:
        Error 24, Too many open files

        It seems that in some Linux setups there is a low default limit on the number of files a process can have open (e.g. 256), but there is an easy configuration option to increase the limit.

        This page covers it.
        https://stackoverflow.com/questions/...-opening-files

        For a modern system I would suggest the limit be at least 10,000 files.
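
        For example, something along these lines (the exact file and scope vary a little between distributions, so treat this as a sketch):
        Code:
        # Check the current soft and hard limits in this shell
        ulimit -Sn
        ulimit -Hn

        # Raise the soft limit for the current session only
        ulimit -n 10000

        # To make it stick across logins, add lines like these to /etc/security/limits.conf:
        #   *    soft    nofile    10000
        #   *    hard    nofile    10000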

        • #5
          On that system 'ulimit -n' says the limit is set to 1024, the same as on my laptop running Ubuntu at the moment. I'll raise it and see if it makes any difference.
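
          As a sanity check, the limit that applies to the already running BurnInTest process can also be read from /proc (fill in the <pid> placeholder):
          Code:
          cat /proc/<pid of burnintest>/limits | grep "Max open files"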

          • #6
            So it is a bit strange that you seem to have hit that limit of 1024. I wonder if there is any other 3rd-party software running on the same machine that is leaking file handles.
            Let us know how it goes.
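
            One rough way to spot a leaker (just a sketch; lsof overcounts a little because of threads and memory-mapped files) is to count open-file entries per command name:
            Code:
            sudo lsof | awk '{print $1}' | sort | uniq -c | sort -rn | head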

            • #7
              I might have just discovered something. If I just open up BurnInTest without starting any tests and run
              Code:
              sudo ls /proc/<pid of burnintest>/fd/ | wc -l
              to count the open files, I can see that the number is slowly rising without my doing anything. The line that keeps multiplying (in the lsof output) is
              Code:
              bit_gui_x <pid> root 91r FIFO 0,12 0t0 220206 pipe
              (where the FD 91r and the node number 220206 change for every line)
              I cannot say if this affects anything because when running
              Code:
              cat /proc/sys/fs/file-nr
              to view the system-wide count, it doesn't seem to change that much (staying around 864-960).

              I'll keep digging and see what more I can find out. I'm still running a machine with the original limit of 1024 to see whether it hits the limit or not, and then I may be able to figure out which process has the most open files.
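
              If it comes to that, something like this should do for ranking processes by open file descriptors (a rough sketch; it needs root to read every /proc/<pid>/fd):
              Code:
              sudo bash -c 'for p in /proc/[0-9]*; do echo "$(ls $p/fd 2>/dev/null | wc -l) $(cat $p/comm 2>/dev/null)"; done | sort -rn | head'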

              • #8
                Here are some updates:

                Sometimes, when just starting BurnInTest V5 (1010), with or without starting any tests, it will start accumulating zombie processes that 'ps' shows as [smartctl]. These zombies correspond to the rising number of
                Code:
                bit_gui_x <pid> root 91r FIFO 0,12 0t0 220206 pipe
                lines that show up in the output of
                Code:
                sudo lsof -p <pid of burnintest>
                If a disk test is running when the process reaches the default soft limit of 1024 simultaneously open file descriptors, the disk test starts reporting occasional open/write errors because it is no longer allowed to open any more files.
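
                For reference, a quick way to list the zombies together with their parent (a rough sketch):
                Code:
                # List zombie (defunct) processes and their parent PID
                ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
                # If the PPID column matches the bit_gui_x PID, the smartctl children are spawned by BurnInTest and never reaped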

                Raising the soft limit to 10,000 (ulimit -n 10000) is a temporary workaround, but all it really does is increase the time until failure. The current session, with the soft limit set to 10,000, has been running for ~23 hours and the BurnInTest process is currently at 8625 open files, as reported by
                Code:
                sudo ls /proc/<pid of burnintest>/fd/ | wc -l
                However, this behavior is a bit intermittent. The machine currently under test was not behaving this way after the previous reboot: back then BurnInTest was using just above 20 open files and the number did not rise during a 12-hour test period. I then rebooted and now I have the situation described above. The OS is loaded from a read-only live USB, so neither the hardware nor the software has changed between the reboots.

                EDIT:
                I just realized that at the same time as I updated to BurnInTest build 1010, I also added the smartmontools package. That package contains two applications: smartd and smartctl. My guess is that during startup BurnInTest tries to gather information about the devices and in some way uses smartctl. I'm not sure whether the cause of the issue is BurnInTest or smartctl, but maybe you have a clue about what happens during application startup and can point me in a direction for troubleshooting smartctl. The quick fix would be for me to just remove smartmontools, but that would cause other issues since we use those tools for diagnosing certain systems.
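
                In the meantime, a simple way to watch whether new smartctl processes keep appearing while BurnInTest is idle (a rough sketch):
                Code:
                # Refresh every 2 seconds; the [s] trick keeps grep from matching itself
                watch -n 2 "ps -eo pid,ppid,stat,args | grep [s]martctl"
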
                Last edited by Linus; May-24-2024, 07:51 AM.

                • #9
                  Does BITLinux display anything in the Temperature tab?

                  Can you try running this build in debug mode - we made a change that may address the issue:
                  https://www.passmark.com/downloads/t...0240527.tar.gz
                  And send us the log.

                  • #10
                    The Temperature tab shows CPU and disk temperatures, but I didn't check it during the test so I can't give you any numbers, I'm afraid. I don't think the disk is overheating, though.

                    I'm currently running the linked debug build with the debug flag, running the CPU, RAM, 2D, 3D and Disk tests at 50%, and the number of open FDs used by BurnInTest hovers around ~19 and does not increase. I've also done a couple of reboots, since the issue is a bit intermittent and doesn't appear on every boot, but so far it stays around 18-20. I'll keep testing and see if it starts increasing.

                    • #11
                      So far it seems like you fixed the issue. BIT now stays around 16-20 open files and no longer multiplies them. At the moment I only have access to one system for testing, but as soon as I get some more systems I'll test them as well. I feel positive that the issue is resolved. I'll report back with a debug log if the problem shows up when I test some other systems.
