scaling with high core counts


  • scaling with high core counts

    Hello, I have a question about how the PassMark benchmark works with some of these very high core count AMD processors.

    First of all, I want to preface by saying that I've searched the forums and read some of the older threads on this issue, which were helpful, but I still have some further clarification questions. I also see how often you guys get berated by fanboys and I have no interest in doing that here. My question is really AMD vs AMD anyway, so we can leave Intel out of the discussion. This question is also not meant to criticize or complain. I'm simply looking for some information to clear up my understanding about how some of this works.

    As many others do, I use the PassMark benchmark to at least get a rough starting point of where to go with a particular system build. I know it will not be a 100% match to my specific use case, but unless I can take advantage of things like the new Xeon accelerators, etc., I should be able to get in the ballpark so I know which hardware to start with, or narrow it down to a smaller number of options, since I can't buy them all.

    In the process of analyzing hardware options for several new servers, I got into looking at the difference between the dual EPYC 9654, the single EPYC 9654, and the Ryzen 9 7950X. I might not go with the absolute top end EPYC, but this comparison is useful to express my question. And yes, I'm comparing EPYC to Ryzen, as there's actually a SuperMicro server chassis coming out soon that uses blades with regular Ryzen chips, which seemed interesting.

    I know that some of the EPYCs have a very low number of samples reported, and that adding cores has diminishing returns, and other bottlenecks can arise due to inter-socket communication, NUMA nodes, memory latency/throughput, etc. I'm also aware that in the case of EPYC vs Ryzen, the clock speed is not the same, and the architecture is very similar, both should be Zen 4, but obviously not identical. However, despite that, I'm still surprised by the respective benchmarks between these chips. Again -- I'm intrigued by this and asking for clarification, not criticizing or claiming that the PassMark result is wrong.

    My use case is lots of identical little VMs or LXC containers on a Proxmox host. This use case should therefore scale very nicely with cores, and with the older hardware I have already, it does in fact scale nicely. I would assume the PassMark benchmark scales nicely too, as from what I read, the benchmark spins up a bunch of individual threads that don't communicate with each other. Though I did see one comment from a couple years ago that PassMark scaled well to 64 cores, which is obviously a problem for a dual EPYC 9654. But that could be an outdated limitation.

    Dual EPYC 9654 (192 cores, 384 threads, 3.55GHz all core turbo) -- 149,727
    Single EPYC 9654 (96 cores, 192 threads, 3.55GHz all core turbo) -- 126,045
    Ryzen 9 7950X (16 cores, 32 threads, 5.1GHz all core turbo) -- 63,625

    So what I see is that the 16 core Ryzen gets to 42% of the performance of a dual EPYC 9654 with 192 cores, and the dual EPYC only gets 19% better performance than the single EPYC. From what I can fathom, this is either true due to the basic nature of scaling across cores and the architecture of the chips themselves, or it's not true and is an artifact of the way PassMark itself scales across cores. But I thought PassMark scaled well across cores, which is why I'm confused. That, or I'm just interpreting these scores wrong and I need help understanding what I need to be doing differently.

    To break down the Ryzen comparison further: it should have roughly the same IPC as the EPYC, as they're both Zen 4 based. Maybe not exactly, as I'm sure cache sizes are different, etc., but I'm going to assume it's in the same ballpark. The EPYC has a 30% lower all core turbo speed than the Ryzen, so let's reduce the Ryzen result by that amount. That puts the modified Ryzen score at about 44,500. Multiplying by 6 (96 cores / 16 cores) gives 267,000, and doubling that for the dual EPYC gives 534,000. I get that this is oversimplified, but the difference is immense (I expected 3.6x what I actually got), and it is very relevant for helping me decide which direction to go. Is core scaling really that bad? If so, then two Ryzen blades give me nearly as much as a dual EPYC 9654, and that's an awesome way to go! But I don't want to trick myself into that. I've seen other benchmarks where the performance at least seemed to scale pretty linearly with cores on those EPYCs, so I'm just trying to figure out what I might be missing. How should I be interpreting these results?
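The arithmetic in the post above can be reproduced with a short sketch. It uses only the scores and clock speeds quoted in the post; the function name `naive_estimate` is made up for illustration, and the model (identical IPC, perfectly linear scaling with cores and clock) is the poster's simplification, not how PassMark computes anything. Because it uses the unrounded clock ratio instead of a flat 30%, the estimates come out slightly below the rounded 267,000 and 534,000.

```python
# Scores and all core turbo clocks as quoted in the post.
DUAL_EPYC_SCORE = 149_727
SINGLE_EPYC_SCORE = 126_045
RYZEN_SCORE = 63_625
EPYC_CLOCK_GHZ = 3.55
RYZEN_CLOCK_GHZ = 5.1


def naive_estimate(ref_score, ref_cores, ref_clock, cores, clock):
    """Scale a reference score linearly with core count and clock speed.

    Assumes identical IPC and perfect scaling, which is exactly the
    simplification being questioned in the post.
    """
    return ref_score * (cores / ref_cores) * (clock / ref_clock)


single_est = naive_estimate(RYZEN_SCORE, 16, RYZEN_CLOCK_GHZ, 96, EPYC_CLOCK_GHZ)
dual_est = naive_estimate(RYZEN_SCORE, 16, RYZEN_CLOCK_GHZ, 192, EPYC_CLOCK_GHZ)

print(f"Ryzen as a share of dual EPYC: {RYZEN_SCORE / DUAL_EPYC_SCORE:.1%}")
print(f"Dual EPYC gain over single:    {DUAL_EPYC_SCORE / SINGLE_EPYC_SCORE - 1:.1%}")
print(f"Naive single EPYC estimate:    {single_est:,.0f}")
print(f"Naive dual EPYC estimate:      {dual_est:,.0f}")
print(f"Naive estimate / actual dual:  {dual_est / DUAL_EPYC_SCORE:.2f}x")
```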

  • #2
    There is no benchmark that will exactly match your use case. So the best benchmark is your normal use case (i.e. you run 100 VMs on each system and measure the performance). Of course this is not an easy task when you don't have the hardware and a lot of time.

    A few points that you didn't already cover,
    - EPYC CPUs are more likely to be using slower registered (buffered) ECC RAM.
    - Our CPUMark figure does contain a small single threaded component. By design. Very few real life applications thread perfectly.
    - Ryzen 9 7950X is way quicker in low thread count tasks. Turbo is up to 5.7 GHz.
    - The EPYC CPUs might not be hitting their all core turbo speeds. Each CPU might be using 300+ watts in this condition (600W+ for dual CPUs). Hard to dump this much heat.
    - Adding CPU cores doesn't add memory bandwidth (and might not add cache either, depending on model). Generally 3 or 4 CPU cores are enough to max out the memory bandwidth. So depending on the application it is all downhill after that (or at least diminishing returns).
    - Windows has a thing called "processor groups" for high core count systems. Almost no Windows software takes advantage of processor groups. Same for NUMA.

    In short, these super high core count systems only make sense for very specific software. In many cases you would be better off with multiple systems, each with a lower core count.
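To put the processor group point in numbers: a Windows processor group holds at most 64 logical processors, so the thread counts quoted in this thread imply the minimum group counts below. This is a rough sketch only; the function name is made up for illustration, and the real group layout depends on the NUMA topology and Windows version, so treat the result as a lower bound.

```python
import math

# A Windows processor group contains at most 64 logical processors.
MAX_LOGICAL_PER_GROUP = 64


def min_processor_groups(logical_processors):
    """Smallest number of processor groups that can hold this many threads."""
    return math.ceil(logical_processors / MAX_LOGICAL_PER_GROUP)


# Thread counts as quoted in the first post.
systems = {
    "Dual EPYC 9654": 384,
    "Single EPYC 9654": 192,
    "Ryzen 9 7950X": 32,
}

for name, threads in systems.items():
    print(f"{name}: {threads} threads, at least {min_processor_groups(threads)} group(s)")
```

Software that is not group-aware is typically confined to a single group, which is one way a benchmark or application can end up using only a fraction of the threads on the largest systems.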



    • #3
      Some links with additional info, for others who find this post.

      NUMA performance hit (between 20% and 60% performance lost with NUMA)
      https://forums.passmark.com/performa...d-threadripper

      The Apparent Uselessness of a Second AMD EPYC 7742
      https://forums.passmark.com/pc-hardw...-amd-epyc-7742










      • #4
        And I'll add this graph as it nicely illustrates the memory bandwidth and cache congestion / thrashing problem when a task needs to use a lot of RAM from many threads. This scaling graph is from the Advanced CPU Test in PerformanceTest.

        [Image: scaling graph from the Advanced CPU Test in PerformanceTest]



        • #5
          In reply to David (PassMark), post #2:
          Thank you, this is very helpful.

          For one thing, I had not considered the typically slower RAM speeds with Epyc, that's an excellent point. As for RAM bandwidth, that is also a good point, but these Epycs are 12 channel, after all, compared to the Ryzen's 2 channel RAM. 6x the cores, 6x the memory channels.
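The "6x the cores, 6x the memory channels" observation can be checked with a back-of-the-envelope calculation of theoretical peak bandwidth per core. The memory speeds below are assumptions, not figures from this thread: DDR5-4800 for the EPYC 9654 and DDR5-5200 for the Ryzen 9 7950X, each channel moving 8 bytes per transfer. Real systems will achieve less than this peak, and registered ECC latency is not modelled at all.

```python
BYTES_PER_TRANSFER = 8  # 64 data bits per memory channel


def peak_bandwidth_gbs(channels, mega_transfers_per_s):
    """Theoretical peak memory bandwidth in GB/s."""
    return channels * mega_transfers_per_s * BYTES_PER_TRANSFER / 1000


# (channels, assumed MT/s, cores); the memory speeds are assumptions.
platforms = {
    "EPYC 9654": (12, 4800, 96),
    "Ryzen 9 7950X": (2, 5200, 16),
}

for name, (channels, speed, cores) in platforms.items():
    total = peak_bandwidth_gbs(channels, speed)
    print(f"{name}: {total:.1f} GB/s peak, {total / cores:.2f} GB/s per core")
```

Under these assumptions the per-core share is similar on both parts, so channel count alone would not explain the gap; it also means neither chip has much bandwidth per core to spare once every core is busy with a memory-heavy task.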

          Interestingly, your chart on primes/sec peaks at 32 threads, which is exactly what the Ryzen 9 7950X has. Probably not a surprise then that the 7950X seems to be at a sweet spot in PassMark scores.

          Whether to go high core count, or lower core count with multiple systems, is the exact tradeoff I'm trying to decide on. I had assumed that PassMark would actually scale right up with cores, as it's a synthetic, multithreaded benchmark. I'm sure some benchmarks on these 96 core Epycs do in fact scale up the way I am expecting. But maybe my workload won't.

          I had previously assumed that most of the performance loss in a multicore system is due to Amdahl's law: a single application either may not be optimized past a certain number of threads or, worse, has a substantial part of its logic that simply cannot be run in parallel. In a case where the system already runs many copies of the same workload in multiple VMs, I was expecting performance to scale more or less with cores x frequency x IPC. Again, I never expected this to be purely one-for-one, but my naive expectations are nearly *four times* that of reality, which is a far larger gap than I was prepared for, and which would make a world of difference in deciding whether to go high or low core count.
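For reference, Amdahl's law gives the best-case speedup on n cores when a fraction p of the work can run in parallel: 1 / ((1 - p) + p / n). A minimal sketch follows; the parallel fractions in the loop are arbitrary illustrative values, not measurements of PassMark or of any real workload.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Best-case speedup over one core under Amdahl's law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


# Illustrative parallel fractions only; core counts are those in this thread.
for p in (1.0, 0.99, 0.95):
    row = ", ".join(f"{n} cores: {amdahl_speedup(p, n):.1f}x" for n in (16, 96, 192))
    print(f"p = {p:.2f} -> {row}")
```

Even a small serial fraction caps the gain from very high core counts, which is consistent with the earlier point that CPUMark deliberately includes a single threaded component. Amdahl's law says nothing about shared hardware limits such as memory bandwidth, which apply even to fully independent VMs.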

          Since my use case is a ton of VMs (i.e. one VM per core), is that the type of situation where you might expect that I would see good scaling with these high core count systems? My VMs are Linux, and my virtualization platform is Linux. For example, if I ran PassMark in a bunch of VMs and then added together their multicore scores, would I expect a better total score? Or does the type of load PassMark presents to the hardware run into limits that wouldn't be resolved by running a bunch of copies of PassMark rather than one big run of PassMark?

          Thanks again for your explanations.

          -BJ



          • #6

            Different tasks (algorithms) scale very differently. Most algorithms make use of disk, RAM, networking or locking semaphores (e.g. databases). All of which quickly become bottlenecks for CPUs with a large core count. You might be adding 300 CPU cores, but are you also adding 300 PCIe lanes with 300 SSDs connected?

            Here is another example of an algorithm that uses RAM (i.e. can't be entirely kept in the CPU's cache).

            [Image: image.png, scaling table by thread count]

            Scaling from 1 thread to 2 is nearly perfect (98%). Scaling from 2 to 3 is pretty good (41%). Scaling above 3 doesn't add much more throughput. Eventually, once you get into the hyperthreaded virtual cores, performance actually goes backwards.

            Table above was done on Ryzen 5 5600X with 32GB DDR4 3600MHz RAM, dual channel (16-19-19-39 timings). The Physics test uses a lot of RAM, and the memory controller and the RAM modules themselves are maxed out and quickly become a bottleneck. The scaling of the integer test looks a lot better however.
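The two percentages quoted above can be turned into a cumulative speedup. This sketch assumes each percentage is the throughput gain from adding that thread (98% going from 1 to 2, 41% going from 2 to 3); values beyond 3 threads are not quoted in the text, so they are left out instead of guessed.

```python
# Gain from adding thread k, relative to throughput with k - 1 threads.
incremental_gain = {2: 0.98, 3: 0.41}


def cumulative_speedup(gains):
    """Chain per-thread gains into a speedup relative to one thread."""
    speedup = {1: 1.0}
    for k in sorted(gains):
        speedup[k] = speedup[k - 1] * (1.0 + gains[k])
    return speedup


speedups = cumulative_speedup(incremental_gain)
for threads, s in speedups.items():
    ideal_gain = 1.0 / (threads - 1) if threads > 1 else 0.0
    print(
        f"{threads} thread(s): {s:.2f}x speedup, "
        f"{s / threads:.0%} efficiency, ideal gain for this step {ideal_gain:.0%}"
    )
```

Note that perfect scaling from 2 to 3 threads would be a 50% gain, so 41% is already a visible shortfall only three threads in.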

            Yes, you could run a bunch of VMs with an instance of PerformanceTest in each one. I don't see how this will avoid memory bandwidth limits however as it is a hardware bottleneck.

            Again, these super high core count systems only make sense for very specific software.

