**David (PassMark)** · May-23-2023, 12:34 AM

There is no benchmark that will exactly match your use case. So the best benchmark is your normal use case (i.e. you run 100 VMs on each system and measure the performance). Of course this is not an easy task when you don't have the hardware and a lot of time.

A few points that you didn't already cover:
- EPYC CPUs are more likely to be using slower registered (buffered) ECC RAM.
- Our CPUMark figure does contain a small single-threaded component. This is by design, as very few real-life applications thread perfectly.
- Ryzen 9 7950X is way quicker in low thread count tasks. Turbo is up to 5.7 GHz.
- The EPYC CPUs might not be hitting their all-core turbo speeds. Each CPU might be using 300+ watts in this condition (600 W+ for dual CPUs). It is hard to dissipate this much heat.
- Adding CPU cores doesn't add memory bandwidth (and might not add cache either, depending on model). Generally 3 or 4 CPU cores are enough to max out the memory bandwidth. So depending on the application it is all downhill after that (or at least diminishing returns).
- Windows has a thing called "processor groups" for high core count systems. Nearly no Windows software takes advantage of processor groups. Same for NUMA.

In short, these super high core count systems only make sense for very specific software. In many cases you would be better off with multiple systems, each with a lower core count.
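The memory bandwidth point can be illustrated with a toy model. This is only a sketch under stated assumptions, not how PerformanceTest computes anything: the function name and every number in it (per-core demand of 15 GB/s, a 50 GB/s memory system) are hypothetical, picked so that the cap is hit between 3 and 4 cores.

```python
def memory_bound_throughput(cores, per_core_rate, per_core_bandwidth, total_bandwidth):
    """Toy estimate of throughput for a task that streams data from RAM.

    Each core would like per_core_bandwidth (GB/s). Once the combined demand
    exceeds total_bandwidth, every core is slowed down proportionally.
    """
    demanded = cores * per_core_bandwidth
    served_fraction = min(1.0, total_bandwidth / demanded)
    return cores * per_core_rate * served_fraction


# Hypothetical figures, for illustration only.
PER_CORE_RATE = 1.0        # work units per second for one unconstrained core
PER_CORE_BANDWIDTH = 15.0  # GB/s one core wants
TOTAL_BANDWIDTH = 50.0     # GB/s the memory system can deliver

for n in (1, 2, 3, 4, 8, 16, 32):
    rate = memory_bound_throughput(n, PER_CORE_RATE, PER_CORE_BANDWIDTH, TOTAL_BANDWIDTH)
    print(f"{n:2d} cores -> {rate:.2f} work units/s")
```

In this model, throughput grows linearly until demand reaches the cap and is flat afterwards. Real systems can do worse than flat, because extra threads also compete for cache.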

**David (PassMark)** · May-23-2023, 12:40 AM

For others who find this post, here are some links with additional info.

NUMA performance hit (between 20% and 60% performance lost with NUMA)
https://forums.passmark.com/performa...d-threadripper

The Apparent Uselessness of a Second AMD EPYC 7742
https://forums.passmark.com/pc-hardw...-amd-epyc-7742

**David (PassMark)** · May-23-2023, 12:43 AM

And I'll add this graph as it nicely illustrates the memory bandwidth and cache congestion / thrashing problem when a task needs to use a lot of RAM from many threads. This scaling graph is from the Advanced CPU Test in PerformanceTest.

**bjquinn** · May-23-2023, 11:43 PM

Thank you, this is very helpful.

For one thing, I had not considered the typically slower RAM speeds with Epyc; that's an excellent point. As for RAM bandwidth, that is also a good point, but these Epycs are 12-channel, after all, compared to the Ryzen's 2-channel RAM. 6x the cores, 6x the memory channels.
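As a rough check on the "6x the cores, 6x the channels" arithmetic, here is a minimal sketch of theoretical peak bandwidth per core. The memory speeds are assumptions, not figures from this thread: DDR5-4800 for the EPYC and DDR5-5200 for the Ryzen, with 8 bytes per transfer per channel. Real sustained bandwidth is lower than these peaks.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s: channels x MT/s x bytes per transfer."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


# (cores, memory channels, assumed MT/s); the speeds are assumptions.
systems = {
    "EPYC, 96 cores, 12 channels": (96, 12, 4800),
    "Ryzen 9 7950X, 16 cores, 2 channels": (16, 2, 5200),
}

for name, (cores, channels, speed) in systems.items():
    total = peak_bandwidth_gbs(channels, speed)
    print(f"{name}: {total:.1f} GB/s total, {total / cores:.2f} GB/s per core")
```

Under these assumptions the two systems come out at roughly the same peak bandwidth per core, so the extra channels keep pace with the extra cores but do not give each core more bandwidth than it has on the Ryzen.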

Interestingly, your chart on primes/sec peaks at 32 threads, which is exactly what the Ryzen 9 7950X has. Probably not a surprise then that the 7950X seems to be at a sweet spot in PassMark scores.

Whether to go high core count, or lower core count with multiple systems, is the exact tradeoff I'm trying to decide on. I had assumed that PassMark would actually scale right up with cores, as it's a synthetic, multithreaded benchmark. I'm sure some benchmarks on these 96 core Epycs do in fact scale up the way I am expecting. But maybe my workload won't.

I had previously assumed that most of the performance loss in a multicore system is due to Amdahl's law: a single application either may not be optimized past a certain number of threads or, worse, has a substantial part of its logic that simply cannot be run in parallel. In a case where the system already runs many copies of the workload in multiple VMs, I was expecting performance to scale more or less with cores x frequency x IPC. Again, I never expected this to be purely one-for-one, but my naive expectations are nearly *four times* that of reality, which is a way larger gap than I was prepared for, and which would make a world of difference in deciding whether to go high or low core count.
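For reference, Amdahl's law itself is easy to sketch. The parallel fractions below are hypothetical, chosen only to show how quickly a small serial portion limits a 96-core system; they are not measurements of any workload in this thread.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup over one core when only part of the work is parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical parallel fractions, compared at 16 and 96 cores.
for p in (0.90, 0.95, 0.99, 1.00):
    s16 = amdahl_speedup(p, 16)
    s96 = amdahl_speedup(p, 96)
    print(f"parallel fraction {p:.2f}: {s16:.1f}x on 16 cores, {s96:.1f}x on 96 cores")
```

Note that Amdahl's law only models the serial portion of the software. It says nothing about shared hardware limits such as memory bandwidth, which can cap scaling even when the parallel fraction is 1.0.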

Since my use case is a ton of VMs (i.e. one VM per core), is that the type of situation where you might expect good scaling with these high core count systems? My VMs are Linux, and my virtualization platform is Linux. For example, if I ran PassMark in a bunch of VMs and then added together their multicore scores, would I expect a better total score? Or is the load PassMark puts on the hardware running into limits that wouldn't be resolved by running a bunch of copies of PassMark rather than one big run?

Thanks again for your explanations.

-BJ

**David (PassMark)** · May-24-2023, 01:29 AM

Different tasks (algorithms) scale very differently. Most algorithms make use of disk, RAM, networking or locking semaphores (e.g. databases). All of which quickly become bottlenecks for CPUs with a large core count. You might be adding 300 CPU cores, but are you also adding 300 PCIe lanes with 300 SSDs connected?

Here is another example of an algorithm that uses RAM (i.e. can't be entirely kept in the CPU's cache).

Scaling from 1 thread to 2 is nearly perfect (98%). Scaling from 2 to 3 is pretty good (41%). Scaling above 3 doesn't add much more throughput. Eventually, once you get into the hyperthreaded virtual cores, performance actually goes backwards.

The table above was produced on a Ryzen 5 5600X with 32 GB of dual-channel DDR4 3600 MHz RAM (16-19-19-39 timings). The Physics test uses a lot of RAM, so the memory controller and the RAM modules themselves are maxed out and quickly become the bottleneck. The scaling of the integer test looks a lot better, however.
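Per-thread scaling percentages like the ones quoted above can be derived from a list of throughput results, one per thread count. The sketch below is an illustration only: the throughput values are made up (the original table is not reproduced here), and it assumes each percentage is the gain from one extra thread expressed relative to the single-thread result, which the post does not state explicitly.

```python
def marginal_gains(throughputs):
    """Gain from each added thread, as a fraction of the single-thread result.

    throughputs[i] is the measured result with i + 1 threads.
    """
    base = throughputs[0]
    return [(after - before) / base for before, after in zip(throughputs, throughputs[1:])]


# Hypothetical results for 1..5 threads, not taken from the table in this thread.
results = [100.0, 198.0, 239.0, 245.0, 240.0]

for threads, gain in enumerate(marginal_gains(results), start=2):
    print(f"{threads - 1} -> {threads} threads: {gain:+.0%}")
```

A negative value in the output corresponds to the "performance goes backwards" case, where adding a thread lowers total throughput.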

Yes, you could run a bunch of VMs with an instance of PerformanceTest in each one. However, I don't see how this would avoid the memory bandwidth limit, as it is a hardware bottleneck.

Again, these super high core count systems only make sense for very specific software.

scaling with high core counts
