strange performance with dual Opteron 6180SE

  • strange performance with dual Opteron 6180SE

    Hi Folks,

    I've been testing out different configurations on my home PC. It's a SuperMicro H8QGi-F quad-Opteron motherboard populated with two Opteron 6180SEs, i.e. 24 K10 cores running at 2.5 GHz stock.

    My specific issue is with the CPU Mark. Sometimes it scores low, in the 8800 range, while other times it scores what might be expected, in the 11700 range or thereabouts.

    My other machine is a Tyan S2927E running dual Opteron 8439SEs (6 cores each, so 12 K10 cores at 2.8 GHz). This machine scores in the 8200 range.

    As far as I know I'm the only one who has submitted 6180SE results to the database, but here are the test #s:

    #271022, #266017, #290583

    In #271022 you can see that the CPU Physics, Prime Numbers, and Encryption results are drastically different. The hardware is pretty much the same, although at different times I have fooled around with different graphics cards and hard drive controllers.

    On the low runs, PassMark is showing that a 24-core machine running 4 Istanbul dies (two dual-die Magny-Cours packages) is only about 10% faster than a 12-core machine running 2 Istanbul chips, and it seems to be due to these skewed results on certain tests.
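    To put rough numbers on that (a quick sanity check using the scores quoted above and naive cores × clock scaling, which of course ignores memory bandwidth and IPC differences):

    ```python
    # CPU Mark scores and configs quoted in this thread
    cores_a, ghz_a = 24, 2.5        # dual Opteron 6180SE (H8QGi-F)
    cores_b, ghz_b = 12, 2.8        # dual Opteron 8439SE (S2927E)

    # Naive throughput ratio from cores x clock alone
    theoretical = (cores_a * ghz_a) / (cores_b * ghz_b)
    observed_low = 8800 / 8200      # the low-scoring runs
    observed_good = 11700 / 8200    # the runs that score as expected

    print(f"naive cores x clock ratio: {theoretical:.2f}x")   # ~1.79x
    print(f"observed on low runs:      {observed_low:.2f}x")  # ~1.07x
    print(f"observed on good runs:     {observed_good:.2f}x") # ~1.43x
    ```

    So even the "good" runs fall well short of naive scaling, but the low runs barely beat the 12-core box at all.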

    There has got to be a setting that is affecting this; I just do not know what it is. The machine has consistently run 11700 at times, so perhaps it's down to a BIOS setting? Any help would be great.

  • #2
    I see from a subsequent post that you came across some analysis of similar Opteron behaviour that we did last year.

    We don't know the exact reason, but if we have anything new to add we will post it in the existing thread linked above.



    • #3
      Hi David,

      After much methodical testing, I think I have figured out the issue, or at least made some progress and gained a somewhat better understanding.

      In the SuperMicro BIOS there are four settings for how memory is handled: Bank Interleaving, Node Interleaving, Channel Interleaving, and another relevant one called Bank Swizzle.

      In order for the PassMark benchmarks to come out properly on these multi-processor rigs, i.e. to scale as one might reasonably expect on the Physics, Prime Numbers, and Encryption tests, the BIOS must be set as follows:

      Bank Interleaving needs to be Auto (ie enabled)
      Node Interleaving needs to be Auto (ie enabled)
      Channel Interleaving needs to be Auto (ie enabled)
      Bank Swizzle needs to be Disabled.

      When I do this the benchmarks come out properly, and as a result, on my dual Opteron 6180SE, the CPU Mark score jumps from 8900 to 11700.

      Now, right before I did this, I installed the following Microsoft hotfix that deals with core parking:

      http://support.microsoft.com/kb/2534356

      Link to the hotfix (this is for Server 2008 R2):

      http://hotfixv4.microsoft.com/Window...tl_x64_zip.exe

      I had installed the hotfix before going through my test matrix, i.e. before finding the fix, and was still getting the same bad results; but when I switched Node Interleaving to Auto (enabled), the benches came out proper. I do not know whether the hotfix had any effect; I just mention my steps for anyone who might need to duplicate this in the future.

      As I understand it, enabling Node Interleaving gets rid of the NUMA functionality: the OS no longer sees an SRAT telling it which portions of memory are physically attached to each processor node.

      The downside to getting rid of NUMA is that memory latency jumps sharply, from 67 ns up to around 113 ns or so. There is a 34% penalty on the Memory Mark because of this, versus having Node Interleaving disabled (NUMA on).
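      For what it's worth, the latency figures quoted above work out to roughly a 69% increase rather than a full doubling (simple arithmetic, nothing more):

      ```python
      numa_latency_ns = 67.0     # Node Interleaving disabled (NUMA on)
      uma_latency_ns = 113.0     # Node Interleaving enabled (NUMA off)

      penalty = uma_latency_ns / numa_latency_ns - 1.0
      print(f"latency penalty: {penalty:.0%}")   # ~69% higher latency
      ```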

      At any rate, I'll keep on testing, but it would appear that having the OS NUMA-aware is completely screwing up those three benchmarks on multi-processor Opteron rigs.

      Perhaps the benchmark's prep code allocates its data on a core that happens to be in one NUMA node, while the threads that work on that data execute on cores in a different NUMA node, or some such. I know that on STREAM, for me to get good results (45 GB/s) I have to set thread affinity; otherwise bandwidth drops down to 5 GB/s.
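      As a sketch of that affinity workaround: on Linux one can pin the process before touching the benchmark data, so that first-touch allocation and the worker threads land on the same node. The thread above is on Windows Server, where `start /affinity` or `SetProcessAffinityMask` would be the rough equivalent; the CPU set below is hypothetical (assumed to be the cores of node 0):

      ```python
      import os

      # Linux-only sketch: pin this process to a fixed CPU set so that
      # first-touch memory allocation and the threads using it stay on
      # the same NUMA node.
      node0_cpus = {0, 1, 2, 3}             # hypothetical: cores of node 0

      if hasattr(os, "sched_setaffinity"):  # not available on Windows
          allowed = os.sched_getaffinity(0) # CPUs we may currently run on
          os.sched_setaffinity(0, node0_cpus & allowed or allowed)
          print(sorted(os.sched_getaffinity(0)))
      ```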

      Hope this helps someone.
      Last edited by rvborgh; Sep-20-2014, 06:37 AM.
