
Testing CPU Power Features


  • Testing CPU Power Features


    I am currently having an extremely difficult time tracking down a hardware issue on one of our servers.
    So far I have been able to pin it down to the 'CPU Power Management' feature, which according to the documentation is responsible for changing the CPU frequency.

    The bad thing is that the only reliable way to reproduce this faulty behavior is to run an SSL connection test.
    All other stability tests I have run so far have shown no issue.
    We also had problems with Linux and Windows updates on the system because of this, but there is no good way to reproduce the issue that way.

    Is there any chance BurnInTest might help with this? I will give the trial a go tomorrow.

    These power features can be really nasty. In the past I had a system that showed stability problems with Turbo Boost enabled, but that is not the issue on this one.

  • #2
    You didn't actually detail what the problem is. So it is hard to comment on if our software (or any software) might help.

    I think it is pretty unlikely that an SSL connection test would expose a hardware problem that all other testing methods miss.


    • #3
      Oh, you are right about that. Somehow missed the most important information!

      The Linux updates and the openssl test show the same behaviour.
      In some cases padding errors occur, which result in a bad signature and therefore prevent the update from being installed.
      Servers with the same hardware perform without trouble, no matter which CPU power management setting is selected.
      Already tried another slot in the chassis (it's a blade server).
      Motherboard and memory have also been replaced already.

      The issue looks like this:
      rsa routines:RSA_padding_check_PKCS1_type_1:block type is not 01:../deps/openssl/openssl/crypto/rsa/rsa_pk1.c:100:
      rsa routines:RSA_EAY_PUBLIC_DECRYPT:padding check failed:../deps/openssl/openssl/crypto/rsa/rsa_eay.c:721:
      SSL routines:SSL3_GET_KEY_EXCHANGE:bad signature:../deps/openssl/openssl/ssl/s3_clnt.c:1831:
      All the information I can find on this refers to configuration issues, but in those cases the error occurs on every connection.
      In my case only a certain fraction of handshakes is affected.

      With Power Management enabled, the error occurs in about 1 of 15 connection attempts.
      (Is it just a coincidence that this is about the same as the core count?)
      With Power Management disabled there seems to be no issue at all, even after several hundred attempts.
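
      A failure rate like this can be measured with a small loop that repeats the handshake and counts the errors. The following is only a minimal sketch: the function names `tls_handshake` and `count_failures` and the host `example.com` are placeholders made up for illustration, not the test that was actually run here.

```python
import socket
import ssl
from typing import Callable


def tls_handshake(host: str, port: int = 443, timeout: float = 5.0) -> None:
    """Open one TLS connection; raises ssl.SSLError if the handshake fails."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # The handshake is performed inside wrap_socket().
        with context.wrap_socket(sock, server_hostname=host):
            pass


def count_failures(attempt: Callable[[], None], attempts: int) -> int:
    """Call `attempt` repeatedly and count how many calls raise ssl.SSLError."""
    failures = 0
    for _ in range(attempts):
        try:
            attempt()
        except ssl.SSLError:
            failures += 1
    return failures


# Hypothetical usage (needs network access, host name is a placeholder):
#   failed = count_failures(lambda: tls_handshake("example.com"), 300)
#   print(f"{failed} of 300 handshakes failed")
```

      Running the same loop once with power management enabled and once with it disabled gives two failure counts that can be compared directly.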

      Never seen anything like that...
      If I don't get a better idea of what is going on, my next step will be to test the two CPUs independently.
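
      Before swapping hardware, a software-only approximation of that test is to pin a deterministic calculation to one core at a time and compare the results. This is a rough sketch under several assumptions: it is Linux-only (`os.sched_setaffinity`), needs Python 3, the helper names and the modular-exponentiation workload are invented for illustration, and it may well miss a fault that only shows up during frequency changes.

```python
import os
from typing import Dict


def modexp_checksum(rounds: int = 500) -> int:
    """Deterministic chain of modular exponentiations, loosely RSA-like."""
    modulus = (1 << 521) - 1  # a Mersenne prime, arbitrary choice
    exponent = 65537
    value = 0x123456789ABCDEF
    for _ in range(rounds):
        value = pow(value, exponent, modulus)
    return value


def find_mismatching_cores(repeats: int = 5, rounds: int = 500) -> Dict[int, int]:
    """Return {core: mismatch_count} for cores whose result differs.

    Caveat: the reference value is computed on whatever core the process
    happens to run on first, so it could itself be wrong on faulty hardware.
    """
    original = os.sched_getaffinity(0)
    reference = modexp_checksum(rounds)
    mismatches: Dict[int, int] = {}
    try:
        for core in sorted(original):
            os.sched_setaffinity(0, {core})  # pin this process to one core
            bad = sum(
                1 for _ in range(repeats) if modexp_checksum(rounds) != reference
            )
            if bad:
                mismatches[core] = bad
    finally:
        os.sched_setaffinity(0, original)  # restore the original affinity
    return mismatches
```

      An empty result only means this particular workload did not trip the fault. It does not prove that the CPU is good.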

      At one point the system was virtualized (currently it is a native installation, with RHEL 5.9 and Server 2012 R2 for testing).
      When I moved the VM to a different server the issue was gone as well.
      Last edited by orioon; Nov-25-2014, 09:11 PM.


      • #4
        The ticket you linked to just seems to be a software issue. (i.e. a software bug / software configuration issue). There isn't any hint of it being hardware related as best I can tell.

        So I still think it is unlikely that hardware testing packages are going to detect / fix / diagnose your SSL key exchange issue.

        Any effect that you observe from changing the CPU power management setting might be either:
        A) Random (i.e. there would be no difference in behaviour if you repeated the test 100s of times).
        B) Due to timing, with the software bug only being provoked when the CPU is very slightly slower or faster than in other instances. The 1 in 15 error rate also points to a timing issue.
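
        Option A) can be checked with a quick calculation. The sketch below assumes that handshakes fail independently at the reported rate of 1 in 15, and it uses 300 attempts as a stand-in for "several hundred". Both numbers are assumptions, not measurements.

```python
def prob_no_failures(failure_rate: float, attempts: int) -> float:
    """Probability of seeing zero failures in `attempts` independent tries."""
    return (1.0 - failure_rate) ** attempts


# Chance of a clean run of 300 handshakes if the true rate were still 1 in 15.
p = prob_no_failures(1 / 15, 300)
print(f"{p:.2e}")
```

        The result is on the order of 1e-9, so a clean run of that length would be very unlikely to be pure chance if the underlying failure rate had not changed.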

        But in any case I think this is out of scope of the testing that BurnInTest is doing.


        • #5
          I verified today that it was indeed a bad CPU.

          Sadly, BurnInTest was not helpful either, but I had expected that.
          No tool whatsoever was able to detect this error.

          I removed the CPU from the second socket and the problem persisted. I then replaced the CPU in Socket1 with the one that had been in Socket2, and everything has been working well since.

          These power settings have already caused me trouble in the past.
          I hope a tool will be released one day that focuses on detecting issues triggered by frequency changes and similar power state transitions, because they can be very troublesome.
          Currently you are pretty much on your own if you encounter such an issue.
          Last edited by orioon; Dec-12-2014, 04:56 PM.