BIT causes system reboot

  • BIT causes system reboot

    Hi,

    20 out of 30 systems pass BIT.
    BIT ver. 5.3, test OS 2003 R2
    Motherboard: Supermicro X7SBi
    http://www.supermicro.com/products/motherboard/Xeon3000/3210/X7SBi.cfm
    CPU: Core 2 Quad.

    Among other tests, the RAM test is set to the last mode (Address...), 100% cycle. CPU load is about 37-45%.
    After less than one hour, the system reboots. It does not look like an application crash but more like a loss of power, since setting the crash options in the registry has no effect.
    I set the trace log to level 2 but could not find anything useful at the end of the log.
    RAM tests with memtest and Prime95 found no failures.

    I changed 2 motherboards and ran BIT again; one passed and the other failed.
    On the four failed systems, I set the memory test to level 4 torture (10% x 9). CPU utilization is at 100%, but it ran for 5 hours without rebooting.
    What is a possible cause?
    If a run with the new BIT setting passes after 12 hours, what should I conclude from the RAM test failing at "address extensions" and passing at "torture test"?

    Overall, what should I do? Replace the motherboards of all failed systems and retest?

    We used to run the same BIT test on a hundred systems of the same configuration without any issue.

    Thanks,
    T

  • #2
    First thing you should do is upgrade to V6 of BurnInTest.
    The version you are using is probably around 3 years old.

    If you can reproduce the problem in V6 then I would start trying to debug it. With sudden machine restarts there is often no quick way to find the cause, except swapping out components one at a time.

    If you have 10 bad machines, then I would think it is a design flaw in a device driver or in the hardware, and you should get the manufacturer involved once you narrow the problem down (after checking it isn't a power supply fault, overheating, etc.).

    Do you get a blue screen of death? If yes, what are the details?


    • #3
      Originally posted by passmark View Post
      First thing you should do is upgrade to V6 of BurnInTest.
      The version you are using is probably around 3 years old.

      If you can reproduce the problem in V6 then I would start trying to debug it. With sudden machine restarts there is often no quick way to find the cause, except swapping out components one at a time.
      I changed to BIT version 6.1028 and kept the failed configuration to see if any reboot occurs.

      Originally posted by passmark View Post
      If you have 10 bad machines, then I would think it is a design flaw in a device driver or in the hardware, and you should get the manufacturer involved once you narrow the problem down (after checking it isn't a power supply fault, overheating, etc.).
      Thanks.

      Originally posted by passmark View Post
      Do you get a blue screen of death? If yes, what are the details?
      No, not even a Windows log entry.

      Why did 10 systems fail the BIT test at RAM test mode 3 but not at mode 2?

      Thanks,


      • #4
        Latest BIT: no problem

        I changed to BIT version 6.1028 and kept the failed configuration.
        It works.

        Why?

        Thanks


        • #5
          Glad to hear the problem was fixed or at least went away.
          I am not aware of any issue in the old software that would have caused a sudden reboot. But we don't really have enough details to comment with any authority. (We don't know how much RAM you had, whether you are using 32-bit or 64-bit, how likely it is that this was just a coincidence such as power spikes not being present today, etc.)


          • #6
            Issue again on a different order

            I got 2 failed systems out of 10.
            BIT version 5.3 passed 6 out of 10.
            BIT version 6.0.1028 passed 2 out of the remaining 4.

            The 2 failed systems rebooted during the BIT test (CPU 5%, RAM 100% with the last option, NIC 1% ping to a real IP, HD 100%).
            Motherboard X7SBi, BIOS 1.3a (latest version), 8GB memory.
            Test OS is 2003 R2 x86.

            How can I catch the cause in BIT 6.0?

            Thanks,
            T
            Last edited by thanh; Feb-22-2011, 03:49 PM.


            • #7
              I assume you are saying that you started with 10 machines.
              4 then failed with BurnInTest V5.3.
              You then took those 4 and tested them with BurnInTest V6.0, and 2 of those 4 failed for a 2nd time. Is this correct?

              You refer to "last option" but I don't really know what the last option is.

              How long did the failure take to occur?

              Did you collect any logs?

              Did you attempt to narrow down the problem by running for example, just the HDD test, or just the RAM test?

              Did you check the Windows event logs for any errors around the time of the reboot?

              Are you sure the problem is not purely random? For example, do the same machines always fail, or might an external event be causing the problem, e.g. random power spikes, EMI, etc.?
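
One way to answer "how long did the failure take" when a reboot leaves no log entry is a small heartbeat logger running alongside the test. The sketch below is not part of BurnInTest. It assumes a Python 3 interpreter is available on the machine under test, and the function names and the 10 second interval are made up for illustration. Each timestamp is flushed and synced to disk so that the last complete line in the file should be close to the moment the machine went down.

```python
import os
import time
from datetime import datetime, timezone


def write_heartbeat(path, now=None):
    """Append one UTC timestamp line and force it out to disk."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    with open(path, "a", encoding="utf-8") as f:
        f.write(stamp + "\n")
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to commit it to the disk
    return stamp


def last_heartbeat(path):
    """Return the last complete timestamp line, or None if there is none."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            # A line cut off by the reboot has no trailing newline; skip it.
            lines = [ln.strip() for ln in f if ln.endswith("\n")]
    except FileNotFoundError:
        return None
    lines = [ln for ln in lines if ln]
    return lines[-1] if lines else None


def run(path, interval_s=10.0, beats=None):
    """Write heartbeats until interrupted (or for `beats` iterations)."""
    n = 0
    while beats is None or n < beats:
        write_heartbeat(path)
        n += 1
        if beats is None or n < beats:
            time.sleep(interval_s)
```

Starting `run("heartbeat.log")` just before the burn-in and calling `last_heartbeat("heartbeat.log")` after the machine comes back up gives an approximate reboot time to compare against the test logs and any external events.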


              • #8
                Originally posted by passmark View Post
                I assume you are saying that you started with 10 machines.
                4 then failed with BurnInTest V5.3.
                You then took those 4 and tested them with BurnInTest V6.0, and 2 of those 4 failed for a 2nd time. Is this correct?
                yes.
                You refer to "last option" but I don't really know what the last option is.
                RAM test set to "Address Windowing Extensions"
                How long did the failure take to occur?
                around 2-3 hours
                Did you collect any logs?
                I re-ran with the trace log set to level 2 and tried to collect the BurnInTest log, the trace log and the dump file, if any, but I haven't had a chance to collect them since the failed systems were already sitting on an RMA rack when I came to work today.

                Did you attempt to narrow down the problem by running for example, just the HDD test, or just the RAM test?
                Yes, I tested RAM only and it is OK.

                Did you check the Windows event logs for any errors around the time of the reboot?
                Yes, and I found nothing.

                Are you sure the problem is not purely random? For example, do the same machines always fail, or might an external event be causing the problem, e.g. random power spikes, EMI, etc.?
                It is not purely random.
                Under the same test conditions, only 2 out of 10 failed, and they fail again if retested with the same settings.

                Changing the motherboard fixes the issue.

                The motherboards will be RMAed and I sent the manufacturer the BIT link, so they may want to try BIT to find out what went wrong.


                Thank you so much,
                T
                Last edited by thanh; Feb-23-2011, 07:29 PM.


                • #9
                  If you,
                  1) repeated the tests, and the same couple of machines failed all the time
                  and
                  2) the other machines in the batch never failed
                  and
                  3) the machines were all identical
                  and
                  4) you switched the motherboard with an identical board and the failure disappeared.

                  Then I think it is safe to say the motherboard was at fault. It might still be a complex interaction with some component (e.g. the RAM) that is just on the edge of being out of specification, or a specific incompatibility with some hardware, so there is a small chance the MB might be OK if used with different components.
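
As a rough illustration of why point 1) carries so much weight, the sketch below estimates how likely it would be for the same machines to fail again if failures were purely random. It is a simplified model, not a claim about any real failure mechanism: it assumes that each run knocks out exactly the same number of machines and that every possible set of machines is equally likely. The function name is made up for illustration.

```python
from math import comb


def prob_same_set_fails(n_machines, n_failed, n_repeats):
    """Chance that one specific set of n_failed machines is again the
    failing set in each of n_repeats further runs, assuming every set
    of that size is equally likely in every run (purely random failures).
    """
    n_possible_sets = comb(n_machines, n_failed)
    return (1 / n_possible_sets) ** n_repeats


# 2 failing machines out of 10: there are comb(10, 2) = 45 possible pairs,
# so one repeat hitting the same pair by chance has probability 1/45.
print(prob_same_set_fails(10, 2, 1))
print(prob_same_set_fails(10, 2, 2))
```

Under these assumptions a single retest that hits the same 2 of 10 machines has about a 2% chance of being a coincidence, and each further repeat makes a random external cause less plausible.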


                  • #10
                    Originally posted by passmark View Post
                    If you,
                    1) repeated the tests, and the same couple of machines failed all the time
                    and
                    2) the other machines in the batch never failed
                    and
                    3) the machines were all identical
                    and
                    4) you switched the motherboard with an identical board and the failure disappeared.

                    Then I think it is safe to say the motherboard was at fault. It might still be a complex interaction with some component (e.g. the RAM) that is just on the edge of being out of specification, or a specific incompatibility with some hardware, so there is a small chance the MB might be OK if used with different components.
                    When the system reboots, there is still a temporary folder on the C drive that BIT created.
                    I will follow your advice.
                    Many thanks,
                    T
