Network failures not detected.

  • Network failures not detected.

    Hello, we have found that if we disconnect the cable 11 seconds into a 12 hour burnintest, the test will STILL pass. We are using the standard network test and have the "high bad packet ratio" checkbox checked. Ratio is set to .2% and timeout set to 2000ms. Since your program uses UDP packets, setting "every bad packet generates an error" is not possible. (I have yet to see a system pass a 12 hour network test with that box checked)

    In fact, I ran a 15 minute test, and after the first 15 seconds used device manager to disable the NIC. The test ran to completion, and had no failures.

    I know that there are problems with dual NICs as I have posted about that issue before, but this is a single NIC system! Passing a system that for 11 hours, 59 minutes, and 49 seconds of a 12 hour burnin that was disconnected from the network is pretty alarming, I should think.

    Here is the test log, thanks for your help.

    PassMark BurnInTest Log file - http://www.passmark.com
    ========================================================

    BurnInTest V5.1 Pro 1014
    Logging detail level: Result summary

    ******************
    SYSTEM INFORMATION
    ******************

    Machine type: 51-ASI06-001
    Machine serial #:
    Network Name: C80002466

    Date: 04/04/07 13:24:19
    Operating system: Windows XP Professional build 2600
    Number of CPUs: 1
    CPU manufacturer: GenuineIntel
    CPU type: Intel(R) Pentium(R) 4 CPU 3.00GHz
    CPU features: MMX SSE SSE2 SSE3 PAE
    CPU1 speed: 2992.6 MHz
    CPU L2 Cache: 1 MB (L3 Cache: 0 KB)
    RAM: 1007 MB
    Video card: Intel(R) 82865G Graphics Controller (Resolution: 1024x768x32)
    Disk drive: Model TOSHIBA MK4032GAX (Size: 37.3GB)


    **************
    RESULT SUMMARY
    **************
    Test Start time: Wed Apr 04 13:24:19 2007
    Test Stop time: Thu Apr 05 01:24:24 2007
    Test Duration: 012h 00m 05s

    Test Name        Cycles   Operations       Result   Errors   Last Error
    CPU - Maths      9315     1.519 Trillion   PASS     0        No errors
    CPU - SIMD       7505     2.140 Trillion   PASS     0        No errors
    Memory (RAM)     144      213 Billion      PASS     0        No errors
    Disk (C:)        28       23.166 Billion   PASS     0        No errors
    Network 1        0        3360             PASS     0        No errors
    Parallel Port    320      96.154 Million   PASS     0        No errors
    Video Playback   8268     36934            PASS     0        No errors
    Serial Port 1    999      57.560 Million   PASS     0        No errors
    TEST RUN PASSED

    Notes:
    Tester __________ Date __________

    -----------------------------------------------------------------------------------------------------

    ****************************
    SCRIPT SERIOUS ERROR SUMMARY
    ****************************
    SCRIPTED TESTS PASSED

    =====================================================================================================
    Last edited by Comark Corp; Apr-05-2007, 03:34 PM.
    Jay W.
    Diagnostic Engineer
    Comark Corporation
    93 West St.
    Medfield, MA 02052
    http://www.comarkcorp.com

  • #2
    I think there might be an explanation based on your settings.

    Your cycles count & operations count of 1 / 3360 for the network test is very low. It indicates that fewer than 200 packets were sent during the entire test.

    The generation of an error based on a ratio of bad packets requires a certain number of packets (good or bad) to be sent.

    For example, if you have 1 bad packet after sending 3 packets, we can't really know whether this met your ratio of 0.2%, because we don't have enough samples yet.

    To measure an accurate ratio in the 1% range, > 100 samples are needed.

    To measure an accurate ratio in the 0.2% range, > 500 samples are needed. And you didn't get to that level.

    So you need to look at why not many packets were sent. This might be because your duty cycle is low for the network test, or because the timeout is too high, or some other reason we are not aware of.
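
The sample counts quoted above are consistent with a simple rule: after N packets, one bad packet cannot be resolved more finely than 1/N, so a threshold of r% needs roughly 100/r packets before it can be evaluated. A minimal Python sketch under that assumption (`min_samples` is a hypothetical helper for illustration, not BurnInTest code):

```python
import math


def min_samples(ratio_percent: float) -> int:
    """Packets needed before a single bad packet equals ratio_percent.

    Assumption: the minimum is simply 100 / ratio_percent. The real
    product may use a different rule.
    """
    if ratio_percent <= 0:
        raise ValueError("ratio_percent must be positive")
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(100.0 / ratio_percent, 9))


for pct in (1.0, 0.5, 0.2):
    print(f"{pct}% threshold: at least {min_samples(pct)} packets")
```

With the 0.2% setting from the original post, a run whose packet count stalls well below that figure never reaches the point where the ratio is evaluated.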


    • #3
      Erm... The reason that not many packets were sent was the LAN cable being disconnected 12 seconds into the test, or the NIC chip being disabled! I find it hard to believe that the test should pass this system with this type of total failure! The network part of the test is, for all intents and purposes, useless. I could pry the network transformer on the motherboard off with a screwdriver 10 seconds into a 12 hour test and it would STILL pass!!! To me this is a bug, and a major one at that. Please, isn't there some way this flaw can be fixed? We have purchased over 40 licenses so far, so switching to another diagnostic package would be a bit painful.
      Jay W.
      Diagnostic Engineer
      Comark Corporation
      93 West St.
      Medfield, MA 02052
      http://www.comarkcorp.com


      • #4
        I don't see any evidence that it is a bug. I think it is just a consequence of your test settings and the test scenario.

        Removal of the cable should not prevent packets being sent. Nor should disabling the network connection in windows. I don't know how you 'disabled the NIC chip' so I can't comment on that.

        So the packet sent count should continue to rise in these error cases, until you have enough data for a check to be made against your required ratio. An error is then generated. I just tested that here, and confirmed this behaviour. With a required ratio of 0.5% it took ~3min to generate an error, after 200 packets were sent with the cable disconnected. Which is the correct behaviour.

        There are also three different scenarios based on what network address you are using to test against.

        If you use the local internal loopback address, 127.0.0.1, then you will miss some error cases because nothing ever gets transmitted on the cable. We don't recommend this.

        If you can use an IP address on a local LAN segment, this is usually the best option. A crossover cable is good for this. Using an IP address of a machine on the Internet is not a good idea. It will be too unreliable.

        But you can also use a domain name instead of an IP address. This forces an extra step of a DNS lookup (which will fail if the cable is disconnected, even if no packets have yet been sent).

        So you need to look at why not many packets were sent. This might be because your duty cycle is low for the network test, or because the timeout is too high, or some other reason we are not aware of.
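
The DNS step mentioned above can be checked on its own. A minimal Python sketch using the standard `socket.gethostbyname` call; `resolve` is a hypothetical wrapper, and how long a failed lookup takes depends on the resolver configuration, which is not shown here:

```python
import socket
from typing import Optional


def resolve(host: str) -> Optional[str]:
    """Return the IPv4 address for host, or None if the lookup fails."""
    try:
        return socket.gethostbyname(host)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


# A dotted-quad address is returned as-is, with no DNS query involved.
print(resolve("127.0.0.1"))
```

A host name goes through the resolver instead, so with the cable unplugged (and nothing cached) the call would be expected to return `None` once the lookup gives up.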


        • #5
          Just as background to why it works this way. Imagine that you only wanted an error after a ratio of 0.1% (1 packet in 1000 bad).

          But due to bad luck packet #2 fails, but the next 5000 packets are OK.

          So the overall ratio after packet #5000 is 1 in 5000. Which means no error.

          But the ratio after packet #2 is 50%. If we did as you suggest, we would generate an error for the machine at this point, but the machine doesn't really have an error. Overall it would be a pass, by the end of the test. So we think it is better not to report the error.

          An improvement in the behaviour might be to include an additional check, right at the end of the test, so that we generate a ratio error despite not having enough data to really know the true ratio.
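
The deferred check described above, plus the proposed end-of-test check, might look like the following Python sketch. The class and method names are hypothetical, and the enough-samples rule (sent count times threshold percentage at least 100) is an assumption, not the actual BurnInTest logic:

```python
from dataclasses import dataclass


@dataclass
class RatioCheck:
    max_bad_percent: float  # e.g. 0.1 means 1 bad packet in 1000
    sent: int = 0
    bad: int = 0

    def record(self, ok: bool) -> None:
        self.sent += 1
        if not ok:
            self.bad += 1

    def bad_percent(self) -> float:
        return 100.0 * self.bad / self.sent if self.sent else 0.0

    def enough_samples(self) -> bool:
        # Assumed rule: need at least 100 / max_bad_percent packets.
        return self.sent * self.max_bad_percent >= 100.0

    def failed_now(self) -> bool:
        """Mid-test check: only raise an error once the ratio is meaningful."""
        return self.enough_samples() and self.bad_percent() > self.max_bad_percent

    def failed_at_end(self) -> bool:
        """Final check: apply the ratio even if too few packets were sent."""
        return self.sent == 0 or self.bad_percent() > self.max_bad_percent


# Scenario from the post: 0.1% threshold, packet #2 is bad, the rest are fine.
lucky = RatioCheck(max_bad_percent=0.1)
lucky.record(True)
lucky.record(False)
early = lucky.failed_now()  # evaluated after only 2 packets
for _ in range(4998):
    lucky.record(True)

# Scenario reported later in the thread: 30 sent, 27 received, then a stall.
stalled = RatioCheck(max_bad_percent=0.2)
for i in range(30):
    stalled.record(i < 27)

print(early, lucky.failed_now(), lucky.failed_at_end())
print(stalled.failed_now(), stalled.failed_at_end())
```

The design point is that `failed_now` protects against an unlucky early packet, while `failed_at_end` catches a run that never collected enough samples.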


          • #6
            Originally posted by passmark
            > I don't see any evidence that it is a bug. I think it is just a consequence of your test settings and the test scenario.
            >
            > Removal of the cable should not prevent packets being sent. Nor should disabling the network connection in windows. I don't know how you 'disabled the NIC chip' so I can't comment on that.

            OK. I disabled the NIC using Windows Device Manager (right-click, Disable).

            Originally posted by passmark
            > So the packet sent count should continue to rise in these error cases, until you have enough data for a check to be made against your required ratio. An error is then generated. I just tested that here, and confirmed this behaviour. With a required ratio of 0.5% it took ~3min to generate an error, after 200 packets were sent with the cable disconnected. Which is the correct behaviour.

            I don't get the same results. For the sake of shorter test cycles, I specified 1 minute for the total runtime.

            1. I start the test.
            2. 10 seconds into the test, I disconnect the network cable.
            3. The packets sent count increments slightly and stops at 30; the packets received count stays at 27.
            4. At the end of the minute, the Error: column is at 106%.

            BurnInTest reports the test as a pass.

            Does this help clarify what I'm saying?

            Originally posted by passmark
            > There are also three different scenarios based on what network address you are using to test against.
            >
            > If you use the local internal loopback address, 127.0.0.1, then you will miss some error cases because nothing ever gets transmitted on the cable. We don't recommend this.

            No. We are using a centrally located server and a host name (this would require calling gethostbyname) for a remote server, not the internal loopback address.

            Originally posted by passmark
            > If you can use an IP address on a local LAN segment, this is usually the best option. A crossover cable is good for this. Using an IP address of a machine on the Internet is not a good idea. It will be too unreliable.

            We are not using crossover cables; we are using 16 and 24 port 10/100 switches and DHCP.

            Originally posted by passmark
            > But you can also use a domain name instead of an IP address. This forces an extra step of a DNS lookup (which will fail if the cable is disconnected, even if no packets have yet been sent).

            See above.

            Originally posted by passmark
            > So you need to look at why not many packets were sent.

            I think this is due to the air gap between the NIC port and the cable.

            Originally posted by passmark
            > This might be because your duty cycle is low for the network test, or because the timeout is too high, or some other reason we are not aware of.

            The duty cycle is set to 50%.

            Thanks again.
            Jay W.
            Diagnostic Engineer
            Comark Corporation
            93 West St.
            Medfield, MA 02052
            http://www.comarkcorp.com


            • #7
              Maybe you have a long DNS lookup timeout, so the timeout failure only happens after the end of your 1 min test.

              Can you repeat the test with an IP address instead of a domain name and a duty cycle of 100%?

              And I reiterate, air gap or not, removal of the cable should not prevent the attempt to send packets to the NIC.
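
Whether a send attempt is still counted after the cable is pulled may depend on what the network stack reports back to the application. The Python sketch below, which is not BurnInTest code, counts a packet as sent only if `sendto` succeeds and as received only if the echo arrives within the timeout. It runs against a local echo thread on 127.0.0.1 purely so that it is self-contained; the function names, the packet count, and the 2 second timeout (mirroring the 2000ms setting in the original post) are assumptions for illustration.

```python
import socket
import threading


def echo_server(sock: socket.socket, stop: threading.Event) -> None:
    """Echo every datagram back to its sender until stop is set."""
    while not stop.is_set():
        try:
            data, addr = sock.recvfrom(2048)
        except socket.timeout:
            continue  # wake up periodically to check the stop flag
        except OSError:
            break
        sock.sendto(data, addr)


def run_udp_test(addr, count: int, timeout_s: float):
    """Send count datagrams to addr and return (sent, received)."""
    sent = received = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout_s)
        for i in range(count):
            payload = i.to_bytes(4, "big")
            try:
                s.sendto(payload, addr)
                sent += 1  # counted only if the stack accepted the packet
                data, _ = s.recvfrom(2048)
                if data == payload:
                    received += 1
            except OSError:  # socket.timeout is a subclass of OSError
                continue
    return sent, received


server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
server.settimeout(0.2)
stop = threading.Event()
worker = threading.Thread(target=echo_server, args=(server, stop), daemon=True)
worker.start()

sent, received = run_udp_test(server.getsockname(), 20, 2.0)

stop.set()
worker.join()
server.close()
print(sent, received)
```

If `sendto` itself raises an error, for example because the interface is down, `sent` stops rising; if it succeeds and only the reply is missing, `sent` keeps rising while `received` does not. Which of the two happens on a given machine is exactly the point in dispute here.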


              • #8
                Originally posted by passmark
                > Maybe you have a long DNS lookup timeout, so the timeout failure only happens after the end of your 1 min test.
                >
                > Can you repeat the test with an IP address instead of a domain name and a duty cycle of 100%?
                >
                > And I reiterate, air gap or not, removal of the cable should not prevent the attempt to send packets to the NIC.

                OK, using the same test conditions I mentioned above, I set the server's IP address instead of using DNS, and increased the duty cycle to 100%. At the end of the test, 51,350 packets were sent, 51,347 were received, and the error % was .132. The test passed.

                I ran the test again, but pulled the NIC cable after 5 seconds instead of 10. Sent packets stopped at 23,708 and received packets stopped at 23,705; the error % was .202. The test failed.

                So, as in the test I described in an earlier post, where I stated:
                " 1. I start the test.
                2. 10 seconds into the test, I disconnect the network cable.
                3. The packets sent increments slightly and stop at 30; the packets received stay at 27.
                4. At the end of the minute, The Error: column is at 106%."

                Since the error column is at 106% (leaving aside how it's even numerically possible to have 106% of anything!), why didn't the test fail?

                Other questions and concerns:
                1. If we have 40 systems running simultaneously, would any server be able to keep up? (especially at Gb rates)
                2. I would worry about using 100% duty cycle in that other tests would not receive the attention they need, most notably disk testing - we like to see good coverage for that, as we've had many disk failures.
                3. You say above that "removal of the cable should not prevent the attempt to send packets to the NIC." If this is true, why is it that every time I disconnect the cable, the packets increment only slightly (2 or 3) for a few seconds then stop altogether - long before all the testing is completed? Maybe that's what's going on here?

                Thanks again, I appreciate your patience in this.
                Jay W.
                Diagnostic Engineer
                Comark Corporation
                93 West St.
                Medfield, MA 02052
                http://www.comarkcorp.com


                • #9
                  <bump>

                  So, any ideas? Please and thanks.
                  Jay W.
                  Diagnostic Engineer
                  Comark Corporation
                  93 West St.
                  Medfield, MA 02052
                  http://www.comarkcorp.com


                  • #10
                    I think the key problem is that your hardware (or device driver) behaves differently from our test systems. Our assumption, based on our testing, was that you could attempt to send data from an application even if there was a physical connection problem. This assumption seems to be wrong, as the behaviour appears to be hardware dependent.

                    So you could switch the settings to throw an error after any failure, but the real solution is for us to make a code change and add a new warning message at the end of a test. Something like, "Warning: Not enough packets sent to verify your requested error ratio". Or maybe a mechanism that requires a certain number of packets per minute to be sent.

                    In either case it might be a couple of weeks before we have a new version available.
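
The two changes proposed above could be sketched in Python as follows. The function names, the 60-packets-per-minute figure, and the 100/ratio sample rule are made up for illustration; only the warning text is taken from the post above, and the eventual fix may work differently.

```python
from typing import Optional


def end_of_test_warning(sent: int, max_bad_percent: float) -> Optional[str]:
    """Return a warning if too few packets were sent to evaluate the ratio."""
    needed = 100.0 / max_bad_percent  # assumed rule, not the product's own
    if sent < needed:
        return "Warning: Not enough packets sent to verify your requested error ratio"
    return None


def send_rate_ok(sent_before: int, sent_now: int,
                 elapsed_seconds: float, min_per_minute: float) -> bool:
    """True if packets were sent at or above the required rate."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    per_minute = (sent_now - sent_before) * 60.0 / elapsed_seconds
    return per_minute >= min_per_minute


# Counters that stall at 30 packets for a full minute, with a 0.2% threshold.
print(end_of_test_warning(sent=30, max_bad_percent=0.2))
print(send_rate_ok(sent_before=30, sent_now=30,
                   elapsed_seconds=60.0, min_per_minute=60.0))
```

Either check would flag the stalled-counter runs described earlier in the thread, where the existing ratio check never had enough samples to trigger.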


                    • #11
                      This post has been answered via email. For the information of others, this problem was corrected in BurnInTest V5.3.1005.

                      Regards,
                      Ian (PassMark)
