CPU Test Scaling with multiple CPUs

  • CPU Test Scaling with multiple CPUs

    We had a query recently about why 4x Opteron 6272 processors (32 cores total) were not scoring much better than 2x Opteron 6272 processors (16 cores total).

    While most of the tests scored 50 to 100% better, 3 tests in particular (Physics, Prime Numbers & Single Threaded) were holding the 4x system back from achieving a higher score.

    The single threaded test shouldn't improve with more cores, but in this case the score actually seemed to get worse. We speculate that there may be some inefficiency in the task scheduling of systems with a high number of cores. The prime number and physics tests were a bit stranger, showing no improvement at all when they should scale nearly linearly with more cores.

    We did some queries on our database to see if this was a common problem. Systems with multiple CPUs are somewhat of a minority to begin with, and multi-CPU systems with such high core counts are rarer still, so we didn't have enough data to come to any solid conclusion. In fact, 4x Opteron 6272 was the only 32-core system we had in the DB.

    CPU              Cores  Prime Numbers  Physics   Single Threaded  CPU Rating
    1x Opteron 6234  6      28.21879768    523.6332  827.8508301      6971.315
    2x Opteron 6234  12     26.18594551    508.7367  1034.158569      10504.77
    2x Opteron 6272  16     20.93978214    453.0519  937.5236816      10202.96
    4x Opteron 6272  32     21.23987579    494.5319  765.5333252      11028.8
    1x Xeon E5-2640  6      36.29001617    638.4076  1540.534302      9844.799
    2x Xeon E5-2640  12     49.94214797    816.2755  1523.405045      14692.37
    1x Xeon E5-2650  8      40.21228447    697.6804  1265.354785      9857.421
    2x Xeon E5-2650  16     53.15676792    856.5676  1355.614652      14301.72
    The above is a sample of what we pulled from the DB. Scores are averages across all data for each CPU type.
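To make the scaling easier to read off, here is a small sketch that divides each multi-CPU score by the score of the matching smaller configuration. The numbers are copied from the table above; the function and variable names are only illustrative. A ratio of 2.0 would mean the score doubled when the CPU count doubled.

```python
# Average scores from the table above:
# (cores, prime numbers, physics, single threaded, CPU rating)
scores = {
    "1x Opteron 6234": (6, 28.21879768, 523.6332, 827.8508301, 6971.315),
    "2x Opteron 6234": (12, 26.18594551, 508.7367, 1034.158569, 10504.77),
    "2x Opteron 6272": (16, 20.93978214, 453.0519, 937.5236816, 10202.96),
    "4x Opteron 6272": (32, 21.23987579, 494.5319, 765.5333252, 11028.8),
    "1x Xeon E5-2640": (6, 36.29001617, 638.4076, 1540.534302, 9844.799),
    "2x Xeon E5-2640": (12, 49.94214797, 816.2755, 1523.405045, 14692.37),
    "1x Xeon E5-2650": (8, 40.21228447, 697.6804, 1265.354785, 9857.421),
    "2x Xeon E5-2650": (16, 53.15676792, 856.5676, 1355.614652, 14301.72),
}

TESTS = ("Prime Numbers", "Physics", "Single Threaded", "CPU Rating")

# Each pair is (smaller configuration, configuration with twice the CPUs).
PAIRS = [
    ("1x Opteron 6234", "2x Opteron 6234"),
    ("2x Opteron 6272", "4x Opteron 6272"),
    ("1x Xeon E5-2640", "2x Xeon E5-2640"),
    ("1x Xeon E5-2650", "2x Xeon E5-2650"),
]


def scaling(base, doubled):
    """Return doubled/base score ratios per test (2.0 = perfect doubling)."""
    base_scores = scores[base][1:]      # drop the core count
    doubled_scores = scores[doubled][1:]
    return {
        name: d / b for name, b, d in zip(TESTS, base_scores, doubled_scores)
    }


for base, doubled in PAIRS:
    ratios = scaling(base, doubled)
    summary = ", ".join(f"{name} x{value:.2f}" for name, value in ratios.items())
    print(f"{base} -> {doubled}: {summary}")
```

Ratios near or below 1.0 for Prime Numbers and Physics on the Opterons, against roughly 1.2 to 1.4 on the Xeons, are the pattern discussed in the rest of this post.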

    As can be seen, both AMD examples seemed to exhibit the same problem with the physics and prime number tests, although the 6234 didn't seem to exhibit the single threaded issue that the 6272 did. This may mean the single threaded issue has something to do with 4x CPU configurations. These were the only AMD CPUs where we had data for multiple CPU count configurations, and even in these cases we only had a small number of results.

    For Intel we had quite a bit more data to choose from, however the two CPU types above are fairly representative of the rest. In Intel's case the physics and prime number tests showed a more reasonable improvement from increasing the number of CPUs. It's still not the doubling that might be expected, so it's possible that there is a bottleneck elsewhere. The physics test in particular is heavy on memory use and may be saturating the memory bandwidth. Intel CPUs also didn't show any issue with single threaded performance, although we don't have another 4x configuration for comparison.

    As for the overall CPU mark, although most of the other tests scored nearly double in all these cases, the algorithm for calculating the overall score is designed to punish low scores in individual tests and prevent a single high scoring test from giving the CPU a really high overall score. A CPU must perform well across the board in order to increase its score.
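The actual formula behind the overall CPU mark is not given in this thread. Purely as an illustration of how an average can "punish" low individual scores, the sketch below compares the ordinary arithmetic mean with the harmonic mean, which is one well-known average with that property. The harmonic mean is an assumption chosen for the example, not PassMark's algorithm, and the score lists are made-up normalised values.

```python
from statistics import fmean, harmonic_mean

# Hypothetical normalised per-test scores (not real benchmark data).
balanced = [100, 100, 100, 100]   # performs evenly across the board
lopsided = [175, 175, 25, 25]     # two strong tests, two weak tests

for name, results in (("balanced", balanced), ("lopsided", lopsided)):
    print(
        f"{name}: arithmetic mean = {fmean(results):.2f}, "
        f"harmonic mean = {harmonic_mean(results):.2f}"
    )
```

Both systems have the same arithmetic mean, but the harmonic mean of the lopsided system is far lower because the two weak tests dominate it. An overall score built this way only rises when every test improves.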
    Last edited by Michael (Passmark); Mar-06-2013, 05:47 AM.

  • #2
    okay i did a search and ran across this.

    i've got exactly this issue with the cpu marks running on my SuperMicro H8QGi-F quad opteron motherboard populated with two Opteron 6180SE (12 cores each so 24 cores total).

    i've run the test multiple times... while experimenting with different hardware. i have run across configurations where those benchmark tests that you specified were what might be expected... other times not. If you do a search on Opteron 6180SE in the DB those are my submissions... you can see some are in the 11700 range, others 8800. Same two processors and motherboard.

    It has to be a BIOS setting of some sort. i'll do some testing and post back when i find the answer to this.


    • #3
      Yes, if you have been adjusting BIOS settings and swapping video cards & drive controllers, then you probably need to work through the permutations methodically to work out what causes the difference in performance.


      • #4
        hi Dave,

        The strange thing is that for one test (i think it is prime #s) the performance score scales upwards linearly until i hit 12 threads (which happens to be the # of cores in each Opteron). After that there is virtually no gain. Whether i set things (via preferences) to 24, 64 or 256... it makes no difference... virtually no gain.

        It's almost like a full CPU goes missing with that test when you go past 12 cores.

        i look at the task manager performance tab and it does load up all 24 cores though.

        i wish i had the code to experiment with why it's doing this.

        i did just add another 8GB to the system... didn't make any substantive difference.

        i wonder if it might have something to do with the NUMA memory subsystem on these Opterons... however i have tried both node interleaving off and on... still the same.

        PS: this system does run wPrime32 in just under 4 seconds flat... while loading up all the cores...

        Originally posted by David (PassMark)
        Yes, if you have been adjusting BIOS settings and swapping video cards & drive controllers, then you probably need to work through the permutations methodically to work out what causes the difference in performance.
        Last edited by rvborgh; Sep-19-2014, 06:57 PM.


        • #5
          If you have 24 cores in the machine, then one would normally expect a performance gain going from 12 to 24 processes. But adding on more processes (64, 256) should not further increase the performance as you can't load up the machine beyond 100%.

          There are some possible reasons for you seeing no gain going from 12 to 24 processes. It might be the CPUs getting hot and throttling, it might be limited by memory bandwidth, or it might be our algorithm not scaling in a linear fashion beyond 12 cores. (Finding prime numbers isn't something that is intrinsically linear due to the fact that each new prime number is spaced further and further apart as the numbers get bigger).
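As a rough illustration of the point about prime spacing, the sketch below counts the primes in equal-sized ranges of numbers. It uses a plain Sieve of Eratosthenes, which is an assumption made for the example and not a description of how the benchmark finds primes. The range sizes are arbitrary.

```python
from math import isqrt


def sieve(limit):
    """Sieve of Eratosthenes: flags[n] == 1 exactly when n is prime (0 <= n <= limit)."""
    flags = bytearray([1]) * (limit + 1)
    flags[0:2] = b"\x00\x00"  # 0 and 1 are not prime
    for n in range(2, isqrt(limit) + 1):
        if flags[n]:
            # Clear every multiple of n starting at n*n.
            flags[n * n :: n] = bytearray(len(range(n * n, limit + 1, n)))
    return flags


def primes_per_chunk(limit, chunk):
    """Count the primes in [0, chunk), [chunk, 2*chunk), ... up to limit."""
    flags = sieve(limit - 1)
    return [sum(flags[start : start + chunk]) for start in range(0, limit, chunk)]


counts = primes_per_chunk(400_000, 100_000)
for index, count in enumerate(counts):
    low = index * 100_000
    print(f"[{low}, {low + 100_000}): {count} primes")
```

Each successive range holds fewer primes than the one before it, so a test that is scored on primes found does not automatically get the same return from every additional block of numbers it searches.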

          I'll see if I can find some similar numbers for the Intel CPUs.


          • #6
            Hi Dave,

            If it helps any... i'm running Noctua UD9 coolers... the processors never get much over 50C during these tests...

            It would definitely be interesting to see how the Xeons fare... i have a friend with a similar SuperMicro system, but with 64 core Xeon E5 4650s that i could ask to run the tests...


            • #7
              Here are some results from the Xeon E5-2697 v2, 12 cores (+ hyperthreading). Single CPU and Dual CPUs.


              CPU                 Cores (threads)  Prime Numbers  Physics  Single Threaded  CPU Rating
              1x Xeon E5-2697 v2  12 (24)          74.2           1,390    1,666.5          17,510
              2x Xeon E5-2697 v2  24 (48)          96.9           1,659    1,697.5          24,066

              So going from 24 to 48 processes can increase the performance. For tests like the integer maths and floating point tests, scaling is almost linear. But for prime numbers and physics the incremental gains are much smaller.

              Note that the new Intel chips typically have much higher memory bandwidths than the older AMD chips.

              I think this is OK, as this is more or less what happens in real life. As you add cores you eventually hit a bottleneck somewhere that results in less than linear performance increases or no increase.
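One common way to put numbers on this effect is Amdahl's law, which is not mentioned in the thread and is used here only as a simple model. It assumes a fixed fraction of the work cannot be spread across cores (for example because it waits on a shared resource such as memory). The fractions below are illustrative, not measurements of any benchmark test.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Speedup over one worker when only parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# How much is gained by going from 12 to 24 workers, for a few
# hypothetical parallel fractions?
for parallel_fraction in (0.99, 0.90, 0.50):
    at_12 = amdahl_speedup(parallel_fraction, 12)
    at_24 = amdahl_speedup(parallel_fraction, 24)
    print(
        f"parallel fraction {parallel_fraction:.2f}: "
        f"12 workers x{at_12:.2f}, 24 workers x{at_24:.2f}, "
        f"gain from doubling x{at_24 / at_12:.2f}"
    )
```

With 99% of the work parallel, doubling the workers still gives close to double the speed. With only half of it parallel, the same doubling gains a few percent, which looks a lot like "no increase" in a benchmark score.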


              • #8
                i can see what you are saying... on the other hand... i am seeing no gains...

                Now the interesting thing is... that about a month back i submitted a run, and the results were what might be expected. The system scored 11698 i think... not 9100 as it does now due to those specific results being lower. So there is something strange going on.

                Is the benchmark written in a NUMA aware way?

                An interesting article on NUMA effects.

                http://www.cs.uchicago.edu/files/tr_...TR-2011-02.pdf


                • #9
                  As far as bandwidth goes... these 61xx Magny Cours systems run Stream at over 50 GB/s...

                  https://www.youtube.com/watch?v=x9gE5jARsGw


                  • #10
                    i posted the fix for this here:

                    http://www.passmark.com/forum/showth...Opteron-6180SE
