The Apparent Uselessness of a Second AMD EPYC 7742

  • The Apparent Uselessness of a Second AMD EPYC 7742

    I am trying to wrap my head around how multi-CPU systems score on CPUmark, and under what conditions CPUmark is a useful metric for the kinds of computations I do. As has been discussed before (https://www.passmark.com/forum/pc-ha...ame-cpu-scores), the speedup from having two CPUs is often in the neighborhood of ~1.3 times the mark for a single CPU. We would of course expect something less than 2X, but the scores are still way below what I would intuitively expect. Nowhere is this more evident than with the EPYC 7742, which has a single-CPU score of 48062 and a dual score of 55574, a measly 15% increase from adding a whole 64-core flagship CPU.

    It feels like adding cores to a single processor scales much more linearly. Take for example the Xeon Gold 6140 (18 cores) and 5218 (16 cores): adding cores here looks much closer to linear, since the per-core scores are nearly equal ((23964/18) / (20622/16) ≈ 1.03).

    So I guess my question is: where is the bottleneck, and how does it relate to adding cores/CPUs? Reading the methodology, it appears that tests are run in fully independent processes that don't share any information. No network or hard drives are hit by the benchmarks. The only shared resource would be RAM, correct? Intuitively, it seems like multiple CPUs should scale even better than adding cores, due to the additional cache and heat dissipation.
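For what it's worth, process-level scaling of an embarrassingly parallel, CPU-bound workload can be measured directly. This is only an illustrative sketch (the workload and task sizes are made up, and this is not PassMark's actual harness), but it shows the kind of measurement I mean:

```python
# Hypothetical sketch: measure how an embarrassingly parallel, CPU-bound task
# scales as independent worker processes are added. Numbers are illustrative.
import time
from multiprocessing import Pool

def burn(n: int) -> int:
    # Small, cache-friendly integer workload; no shared state between workers.
    total = 0
    for i in range(n):
        total += i * i
    return total

def throughput(workers: int, work=2_000_000, tasks_per_worker=4) -> float:
    tasks = [work] * (workers * tasks_per_worker)
    start = time.perf_counter()
    with Pool(processes=workers) as pool:
        pool.map(burn, tasks)
    return len(tasks) / (time.perf_counter() - start)  # tasks per second

if __name__ == "__main__":
    base = throughput(1)
    for w in (2, 4, 8):
        print(f"{w} workers: {throughput(w) / base:.2f}x speedup")
```

Comparing the curve this produces against the ideal N-times line would show where scaling falls off on a given machine.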

  • #2
    You need very special software to make use of 128 or 256 cores. There are lots of issues. Too many to explain them all here.

    Some of the main ones are:

    - Writing NUMA aware code. Nearly no Windows software is NUMA aware. This is a really big issue when there are multiple memory buses.

    - Writing code that can use processor groups. Nearly no Windows software takes advantage of processor groups.

    - RAM (or some other resource) being a bottleneck. For our CPU test RAM can become a bottleneck. But in real life software it might be disk IO, network IO, database row locking, GPU performance, configuration limitations (e.g. max number of network connections for a web server), or semaphores in the operating system.

    - It is extremely hard to write all your code in such a way that it can use multiple cores all the time. Typically there are large sections of single threaded code. Or there are a fixed number of threads, each with their assigned tasks. (e.g. for a game, there might be a sound thread, AI thread, pathfinding thread, physics thread & main thread. So the game will never use more than 5 cores).

    - Some versions of Windows have max core & RAM limits. e.g. Win10 home is max 64 cores. Even Win10 Pro is only 128 cores. Which isn't enough for dual 7742s.
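To put rough numbers on the single-threaded-sections point: Amdahl's law says speedup = 1 / (serial + (1 - serial) / N). A quick illustrative calculation (the serial fractions below are made-up examples, not measurements of any real program):

```python
# Amdahl's law: speedup(N) = 1 / (serial + (1 - serial) / N).
# Illustrative only: the serial fractions are hypothetical, not measured.
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for serial in (0.01, 0.05, 0.20):
    s64 = amdahl_speedup(serial, 64)
    s128 = amdahl_speedup(serial, 128)
    print(f"{serial:.0%} serial: 64 cores -> {s64:.1f}x, "
          f"128 cores -> {s128:.1f}x ({s128 / s64:.2f}x from doubling)")
```

Even at only 5% serial code, doubling 64 cores to 128 buys roughly 13% more throughput, which is in the same ballpark as the dual-7742 result discussed above.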

    Pretty much all of the above was covered in my 2015 reply, and nothing has changed today, except that more people are going to hit these limits than before.

    Bottom line: If you have one of these very, very special applications that is NUMA & processor group aware, is fully threaded and has no dependencies on any other resources, then your scaling might be better than our numbers suggest (but it could be far worse as well). The only way to know for sure is to test your specific application.



    • #3
      Thank you for the response David. So would you say that RAM is the CPUmark bottleneck in the 7742 case, or is it the OS core limit? As you say, real world applications have a variety of potential bottlenecks, including the way that the application is written. I want to understand CPUmark and what it represents. The tests in the description (https://www.cpubenchmark.net/cpu_test_info.html) are embarrassingly parallel with generally small memory footprints. Except for the single thread test, I would expect these to scale extremely well with the addition of cores/cpus. This intuition is borne out when looking at how the scores change as the core counts increase, but not when the CPU count is increased. Even older cpus (e.g. 2690 v2) show only a 30-40% increase with dual cpus. In your 2015 reply you spoke of general areas that might be explanations, and I was wondering if you could speak to specifically where the bottleneck is for passmark that is keeping the cpu idling in both older (2690v2) and newer (7742) dual systems. This would help me understand whether the bottleneck is relevant for my use-case or not.

      When you kick these off in parallel, is it done via separate threads or separate processes? Would that even make a difference? Is Passmark NUMA aware or not?

      Just to be clear, I am not criticizing the CPUmark metric. No single number is going to sum up a system's performance characteristics. I am just trying to understand what it represents and whether it is relevant to my computational use-case.



      • #4
        I haven't studied the baseline submissions that make up the 7742 average result to know exactly what has gone on. Maybe someone was dumb enough to use Win10 home on a $15K PC. Don't know.

        Systems with dual 7742 CPUs will likely have a huge amount of RAM e.g. 128GB. So yes, the CPU tests won't come close to using this much RAM, but all of them use some RAM. Especially the physics test and Prime number tests.

        There is also another factor. Many CPUs have virtual CPU cores (hyper-threading). These aren't full CPU cores and don't perform as such. So some algorithms run really well on hyper-threading and give great scaling. Some run really badly and give no benefit, and some are negative scaling. It's also really hard to know in advance by looking at the code what will scale and what doesn't on which CPUs. Generally it requires testing.

        Here are a few graphs showing scaling of some of the CPU tests on a Ryzen Threadripper 3970X CPU (32 physical cores, 64 virtual cores).

        [Graphs: Prime.png, Integer.png and Physics.png, showing scaling of the Prime, Integer and Physics tests vs thread count]

        Note how for the Physics test it maxed out at 24 threads, which implies a RAM bottleneck, as scaling stopped before the CPU had all its physical cores fully loaded. After pushing beyond 24 threads, things really went pear shaped. Likely the NUMA and caching issues started to bite. (Internally the Threadripper 3970X is 4 separate Ryzen 8-core CPUs stuck into a single package, so it is like a quad-CPU system.)



        • #5
          ...Lorem, this comment of yours is the one that made me rejoin this forum after many years.


          Let me just quote some of your comment and see if that conveys the problem with your perspective.
          Which may or may not help you and others immensely, we will see.


          "....under what conditions CPUmark is a useful metric for the kinds of computations I do. ...We would of course expect something less than 2X, but the scores are still way below what I would intuitively expect. ...a measly 15% increase from adding a whole 64 core flagship CPU. ....It feels like adding cores to a single processor scales much more like linear. ...

          So I guess my question is: Where is the bottleneck and how does it relate to adding cores/cpus? Reading the methodology, it appears that tests are run in fully independent processes that don't share any information.... ...The only shared resource would be RAM, correct? Intuitively, it seems like multi-cpus should scale even better than adding cores due to additional cache and heat dissipation."

          Now, in a friendly way, I have to suggest that your assumptions, which are many, may just be flat-out wrong, and beyond that, you may not even know why, or how, to find the right answers. If you care to find them instead of just demanding satisfaction.

          But no one can rationally talk about what is "a useful metric" for "the kinds of computations that you do" without seeing some of the code & data that you use when you "do computations".
          Of which you provide, um, nothing, in your OP. Except an indication that you have an interest in multi-threaded and CPU-intensive "computation".

          Hopefully beyond that in the past 10 months you've read around enough to realize how ridiculously complicated it is to design and build benchmark software to an arbitrary performance standard that "most users" are "happy" with. Not to mention every user. Not to mention you. I don't think that your comment really deserves more of a reply than that. Try writing some of your own benchmark software and see what you learn that way.

          Now if you are building your own apps and you want to optimize them for a certain set of hardware? Then possibly take the lessons that you've learned in testing and optimization and apply them to future hardware choices and further optimization? That's a completely different story.

          The third option is that you have an app or set of apps for which you want to find the optimum hardware configuration to run them on.
          You've got enough clues here. You need to learn how to put them together without assuming that you know how they should fit together or how they do fit together when you clearly do not know.
          It's not a matter of fixing the world to your assumptions. It's a matter of your assumptions just being wrong. And you need to find out what the right ones are. And THEN leap to conclusions.

          Leaping to conclusions from faulty assumptions is a very, very painful way to go through life.
          And being wrong when you're sure that you're right will only lead to a series of spectacular failures.



          • #6
            "There is also another factor. Many CPUs have virtual CPU cores (hyper-threading). These aren't full CPU cores and don't perform as such. So some algorithms run really well on hyper-threading and give great scaling. Some run really badly and give no benefit, and some are negative scaling. It's also really hard to know in advance by looking at the code what will scale and what doesn't on which CPUs. Generally it requires testing."

            ...it's been a while since I've done this, but a chip mfg usually puts out a "user manual" for a microprocessor. In there, even for Intel CPUs, there will be a whole bunch of notes on how it works and how well it works. For hyperthreading, this is a question of what instructions will actually work in these virtual cores and what the performance hit is likely to be on those HT cores vs the real cores.

            Again, IIRC, HT cores generally support only integer operations, and as such I expect to see decent results in the integer tests. The other two, prime and physics, will depend on what instructions are actually executed in those tests, as much as they will depend on where the instructions and data are and what the ratio of cache hits to misses is, along with other relevant factors. But when you say "physics", I expect that some high-order math is involved: squares and square roots, polynomials, perhaps some trig, and a fair amount of SP and maybe even DP, as opposed to straight integer math or integer approximations thereof. Prime is perhaps a bit of looping and logic; how much, percentage-wise, depends on the exact algorithm used.

            So there are three ways to deal with this: look at the chip guide, don't look at the chip guide and go by experience with "similar" chips, or don't look at the chip guide and go by testing. To design code properly you'd look at the chip guide and the compiler guide, find some algorithms of interest, code them, then test and adjust the code to run well given the algorithm, compiler and hardware. That is not going to be optimum for all hardware! Ideally you'd write a "general benchmark" to run "ok" on two platforms that each run one half of it well and the other half of it poorly. Then test different apps on each system and note which apps run well and which run poorly on which system.

            This gets us back to the basic problem for individual users: do you want to optimize the code to run well on a given platform, or optimize the platform to run a given set of code well? The one problem with private benchmarks (as opposed to open-source benchmarks) is that you really have very little idea how they were developed, and as you can probably guess there are a lot of options in developing code. Even today CentOS (at least CentOS 8.0) doesn't run "right" on AMD CPUs. To really develop the optimum general-purpose benchmark one has to start small and work one's way up to complicated code, making and noting the code AND compiler and OS and BIOS configuration choices made along the way.

            It is never a case that two cpus are worthless compared to one.

            It is more the case that a 2nd CPU is worthless on a given set of code when installed in parallel with the 1st CPU (as opposed to within its own box) or used in the same OS instance (as opposed to within a VM).

            And in that case, it is likely to be useless to have HT enabled, because the code does not scale well with the number of cores. It may scale for 1, 2 or even 4 cores depending on exactly what the code does, but generally not more than 2, and even then only if the OS is set up properly.

            So that is one thing, the other thing is that your "computing environment" may be a high-level language or even a "computing tool" (like Python, SQL, or MathCAD) that masks the hardware and needs to be tweaked to exploit the hardware more fully.

            Case in point: Matlab R2019.
            Just running the bench(N) benchmark tells me that the performance order for the CPUs in this comparison is as follows.

            https://www.cpubenchmark.net/compare...88vs1304vs2054

            I have the Xeon E3-1280 v6 in a Dell R230 right next to me.

                         LU      FFT     ODE     Sparse  2D      3D      Relative Speed
            E5-2650 v2   0.1185  0.1282  0.0296  0.1484  0.4108  0.4211  60
            Ry7 1700     0.1852  0.1453  0.0156  0.2015  0.3245  0.3281  59
            E3-1280 v6   0.1275  0.0707  0.0125  0.0801  0.9803  1.7823  55
            X5650        0.2240  0.1150  0.0298  0.1584  0.5941  1.6248  39
            i5-4300U     0.2300  0.2040  0.0256  0.1525  1.2245  0.7240  39

            These are means of 100 runs, with variances down in the 4th decimal place. However, for the E3-1280 v6 the max for the first test is 0.1513 and the min is 0.0831, a difference of almost 50%; in Task Manager the CPU utilization never goes over 23%, so I know it's running on 1 of the 4 cores.


            I have an i5-7300u laptop with 16GB (2x8GB) of ram in it, so I have to change the list to compare my laptop to the others:
            https://www.cpubenchmark.net/compare...88vs1304vs2955

            So without even looking at the scores closely, what I would care about would be

            1) Do they all have "more than enough" RAM to run the benchmark without the amount of RAM affecting the score, and likewise are they all set up dual-bank (so that bandwidth doesn't affect the score)? There's no way that you know this for sure without doing the tests yourself, and likewise you need to be able to switch it back and forth to see the effect.

            2) Can HT be disabled in the OS? Likewise, confirm by running PassMark or any other suitable code in single-threaded configuration, then multithreaded (1, 2, 4 & 8 threads) without HT, and then with HT enabled, while monitoring the CPU utilization. (This can be something as simple as a DOS or Bash shell script that does some arithmetic or juggles strings. Seriously, try writing your own simple benchmarks and you'll begin to see what's involved!)

            3) do you have available something with a crippled cpu like a celeron or an old Pentium 2 or even Pentium, or a K5 chip, which should have markedly better or worse performance on at least one type of code (whether it's integer, fp, MMX, SSE or whatever) so you can identify that as a factor...this is where compiler & driver options become important and where you'll learn to tweak the settings to get the same overall execution time on vastly-different systems.

            4) switch the video cards, driver versions & driver-settings. So you can isolate the tests that are significantly card, driver & setting dependent.
            The same with disk i/o and the amount of ram set aside for a disk cache and the "priority" setting (foreground vs background processing).
            Two SSD drives in hardware RAID 0 do a great job of masking disk I/O issues. Boot & run off a single USB thumb drive and see what happens. Try running two thumb drives in software RAID 0 and 3 in RAID 5 and see what happens. Take a DIMM out and see what happens. Enable & disable the antivirus... experiment!
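Point 2 can be scripted. On Linux, a process can restrict itself to a chosen set of cores with os.sched_setaffinity and then time a fixed parallel workload under each restriction; on Windows the rough equivalent is SetProcessAffinityMask or start /affinity. A hypothetical, Linux-only sketch:

```python
# Linux-only sketch: pin the benchmark process to N cores (simulating "fewer
# cores / no HT") and time a fixed parallel workload under each restriction.
import os
import time
from concurrent.futures import ProcessPoolExecutor

def spin(n):
    s = 0
    for i in range(n):
        s += i & 7
    return s

def timed_run(cores, tasks=8, work=1_000_000):
    os.sched_setaffinity(0, set(range(cores)))  # pin to cores 0..cores-1
    start = time.perf_counter()
    # Workers forked after the pin inherit the restricted affinity mask.
    with ProcessPoolExecutor(max_workers=cores) as ex:
        list(ex.map(spin, [work] * tasks))
    return time.perf_counter() - start

if __name__ == "__main__":
    available = len(os.sched_getaffinity(0))
    for cores in (1, 2, 4, 8):
        if cores > available:
            break
        print(f"{cores} core(s): {timed_run(cores):.2f} s")
```

Watching CPU utilization while this sweep runs (as suggested above) shows whether the workload actually spreads across the allowed cores.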

            There are some simple, reliable changes in configuration that you can make to isolate the hardware and OS configuration from the cpu performance.

            Without doing that, you don't know what the bottleneck is (as in "prime numbers" or "physics" above). You're reduced to making guesses.
            The awful thing about guesses is that your guesses all make perfect sense until they are proven wrong. As in "flat wrong". And you go "...oh". And actually learn something.

            Something which you did not know before. When you thought that you knew everything even remotely relevant.
            And you did not bother to try to change something that you were sure wasn't a problem.

            And remember one other thing. If 5 minutes is fast enough? Then 1 minute is probably going to be fast enough, too. That's a 400% performance improvement.
            There are some very simple tricks that can be done in Matlab to make the code scale well on multiple CPUs. Assuming that it CAN scale well with core-count.

            But is dropping a pair of 16-core cpus into it going to make the performance scale by a factor of 32? That's a completely different question.
            Is a factor of 32 improvement in performance (in the best case) "good enough to make you happy"? Likewise a very different question.

            And by now you should understand why. Or at least know where to go to look for an answer.

            In summary it is almost never the case that your hardware is fully exploited. The question is much more whether you can exploit it fully and whether it is worth your time to do so.
            To that, the answer to part A is usually no and the answer to part B depends on how quickly you can replace it with hardware that is fast enough for the purpose when set up with the same software and overall configuration as the hardware and software that you're using right now...if you really want to get all "techie" about it.

            Otherwise you'll just wait for it to finish and do something useful in the meantime.
            It's funny how in the time that you sit there waiting for your computer to finish a set of code, you often can think of a far faster way to do the calculation.
            Maybe you don't even really need to do it. Maybe you can do it in a 10th of the time just by changing your approach.
            To me what really counts is how long it takes to write, debug, test and validate the code in the first place.
            Because no matter what the code does I can always change the performance by changing the resolution and by extension the workload.
            The hardware is only the problem when I don't have time or ability to tweak the code or resolution to change and reduce the problem that it has to compute.

            So the lesson to remember with a closed-source benchmark is that the performance on a benchmark only matters if the benchmark performance really matters a lot to you.
            And that depends on whether it is "just a benchmark" or the thing which is going to keep you employed next week.

            If it's just for bragging-rights on an Internet forum? Seriously, get a life. Try writing and running code for a useful purpose. Perhaps even something that you get paid to do.
            Then you're likely to find an i5-4300U laptop to be just fine and if not then at least you have a legitimate reason to worry about all this stuff and to do something to improve performance..

            along with a set of hardware, an algorithm, the software development tools, and a real-world performance goal. Without which, who cares?

            I'd like to see a closed benchmark have a webpage which asks "do you use your system mostly for gaming?" and if you click yes it blocks you from its site.
            I remember there used to be an old DOOM-based benchmark which would run a DOOM scene and give you the frame-rate and that was more than good enough for most people.
            Sadly it faded away when people began to hit 120fps consistently on it, i.e. when the cards were maxed-out at all resolutions and settings. I wonder if Anand and Tom still have those pages. Then they had to change games and add settings to bring the FPS down to 15 so they could get back to bragging about one card or CPU more than the others.
            Back in those days we were happy to have Pentium computers instead of 486 computers. Now a $500 laptop runs rings around the last PC that I actually built.

            Now you can buy a $500 server from Dell that will utterly destroy the last $2500 PC that I ordered.
            You people talk about Passmark benchmark scores all day long and I'm like "Why, WTF is your problem?"
            (I'm sure that you have a reason, please don't bother to tell me)

            But no. If you stick a second cpu in your computer and don't see a major performance improvement, then it is not simply the benchmark that is the problem. No.
            At the very least, try running multiple instances of the benchmark simultaneously
            Now running 4 copies of bench(100) in 4 separate instances of Matlab R2019 on the R230: CPU utilization varies from 33% to 92% with background processing enabled and the desktop response is still fine. Total memory utilization is 7GB. All four instances completed in less than 10 min and still score between the Ryzen 7 1700 and the X5650. That means if I set up a Matlab parallel cluster in one instance of Matlab I should see decent scaling across all the available cores.
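The "run multiple instances simultaneously" experiment is easy to script for any command-line workload. The command below is a placeholder (substitute the real benchmark or MATLAB invocation); the point is only the measurement pattern:

```python
# Sketch: launch N copies of a command concurrently and compare total wall
# time against a single run. If N copies take about as long as one, the
# instances scaled well; if they take N times as long, they didn't.
import subprocess
import sys
import time

# Placeholder CPU-bound workload; any CLI benchmark command would go here.
CMD = [sys.executable, "-c", "sum(i * i for i in range(3_000_000))"]

def run_instances(n: int) -> float:
    start = time.perf_counter()
    procs = [subprocess.Popen(CMD) for _ in range(n)]
    for p in procs:
        p.wait()
    return time.perf_counter() - start

if __name__ == "__main__":
    t1 = run_instances(1)
    t4 = run_instances(4)
    print(f"1 instance: {t1:.2f} s, 4 instances: {t4:.2f} s")
```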
            Last edited by touristguy87; 01-15-2021, 02:42 AM.



            • #7
              I am not sure what the point of the long post was, except to say that scaling is complex and depends on many factors. But that was already pointed out earlier.

              "It is never a case that two cpus are worthless compared to one."
              I disagree. It is often the case that the 2nd CPU is worthless. It is the reason no high-end gaming platform or laptop comes with a 2nd CPU: more heat, more noise, more power usage, and often marginal performance improvements.



              • #8
                "I am not sure what the point of the long post was, except to say that scaling is complex and depends on many factors. But that was already pointed out earlier."

                That's why you shouldn't try to restate what someone else has written. Let it stand on its own.
                As far as what you got out of it, that's a different issue.

                "It is never a case that two cpus are worthless compared to one.
                I disagree. It is often the case that the 2nd CPU is worthless. It is reason no high end gaming platform nor laptop comes with a 2nd CPU. More heat, more noise, more power usage, and often marginal performance improvements."

                You're an admin of a site that promotes a benchmark tool that exploits multiple CPUs in a single x86/x64-compatible computer, which has been a common configuration since the Pentium was widely adopted. And you really believe this? You seem to confuse the concept of having multiple CPUs with having multiple CPU PACKAGES. A multicore package contains multiple CPUs in a single package. Aside from that, simply having multiple cores allows the entire computer to use available resources more optimally compared to having just one core. Hairs can be split over the exact contents of a package, but the basic concept remains. If any "gaming platform" or laptop comes with multicore packages, it is a high-end platform.

                Aside from the side benefits of less heat, less noise and lower power usage, the performance improvements are the crux of the issue here.

                So aside from the other two side points, this is the main issue that owners, consumers and users have to consider. Because "performance issues" are part of a cycle. First, you have to have the hardware: if you don't have any hardware on which to run an application (at least to baseline), then the entire discussion is just talk. You have to start with some real data running real software on a real system, or else you're working with theoretical calculations at best, and how often does real-world performance match theoretical performance? Second, someone has to design and enhance an application or software platform that is compatible with that hardware in order to provide "significant performance improvements". And third, someone has to design, build and test the hardware to be compatible with that software, and there's quite a bit of back and forth right there.

                My "point" is that users come here to find out what hardware offers what benefits on your test software, and just as importantly, what the correlation is between the test results running PassMark software on marketed hardware and running "real-world apps" on that same hardware, in order to get a realistic and useful price-performance curve. The problem is that it is very difficult to optimize a 4-variable problem involving 3 distinct components, and generally speaking this can only be solved by running an experiment that involves many different users, owners, developers and testers. If, as the OP said, he wants to see such results for ONE user, owner, developer and tester, then he needs to put a lot more detail into the discussion, or get a lot better at developing and testing, and then he can find the answers to his own questions to his own satisfaction.

                Basically he can start to write his own benchmarks.
                He can look at OS benchmarks.
                He can run commercial benchmarks.

                He can compare the results.

                And if he's decently bright and well-educated on the subject, he can probably figure out all the answers to his own questions.

                One big problem here is the plethora of individuals who are not bright or well-educated enough to do this who want someone to give them "the right answer". Which, often, is to just go have lunch, to "multitask" in general, and wait for the code to finish. They get a good answer and want it explained to them...to their satisfaction. Then they still argue with it. Then they realize that what they really want they can't afford. If enough people feel that way, the market will respond. If not they will be forced to find something more cost-effective to do with their time. Such is life.

                We are talking about a market for users who are so far down the technological scale that they are little more than one step above the console market. Of course there are going to be hundreds of thousands of disgruntled users and owners who question the relevance of Passmark if it doesn't say that their $1500 rig competes well with a $10k rig. And likewise for the owners of the $10k rigs who are mad that they aren't cost-effective.

                I saw a YT video this week about a guy who claimed to have the fastest bitcoin mining rig on YT or whatever.

                "Today i show you the Mining hashrates of a $100,000 Server from Amazon with 8X NVidia NVlink Tesla V100 GPU's on board. The result is the worlds fastest miner ever seen on a single system."

                https://www.youtube.com/watch?v=t_0uHQKOqeU

                It was a PC with 8 V100s in it. His mining performance was about 15 kH/s. That's like 1 billionth of the performance of a decent Ant mining rig, around 75-100 TH/s.

                https://en.bitcoinwiki.org/wiki/Hashrate

                https://www.amazon.com/Bitmain-Antmi.../dp/B08QVQRJ9C

                And the guy spent 5x as much for the 8x V100 rig. Why? Just to have "the fastest mining PC".
                Emphasis on PC.

                Apparently this doesn't count as a "PC".
                https://www.youtube.com/watch?v=r8Uu-Ww02RY
                https://www.youtube.com/watch?v=g8cT1oaVFkc

                What do you expect from this mentality?

                I wonder if there's a medical school anywhere, maybe sponsored by RJ Reynolds, which teaches people how to smoke without getting stained teeth, emphysema and cancer. Look, you want real-world frame-rates under realistic image-quality settings? Stop playing computer-games. If you insist on buying a PC to play computer-games then deal with the frame-rates that you get from the hardware that you can afford to buy. That is where PC benchmarks like PassMark play a valid role. If you're not happy with PassMark, either use another benchmark or build your own, but the odds are that you are going to run it on hardware that you own already.

                Once they get past that, they can be introduced to the 90-10 rule. Or a slight variant of it, the 10-15-50-15-10 rule. That is to say, all good projects have idea, concept, prototype, production, upgrade and repurpose stages, each dependent on the previous stages, with the cost spread out and the benefits coming between the middle and end stages. In terms of financing, companies raise investment money against these latter-stage benefits. Since the platform hasn't been developed yet, all of the benefits are speculative, based on a combination of theory, prevailing market conditions and gullibility. Kinda like the shoe market.

                We are talking about productive uses for PCs here, when we have had PCs for 40 years already. Crypto has shown us that even when we reduce the problem to its essential elements, literally making money with our computers, it is virtually impossible for individual users to buy and own competitive hardware in an environment where corporations can throw their financial muscle at the problem.

                The question becomes how to be competitive as individuals without corporate backing, in which case all of the benefits belong to the corporation and the individual is simply paid to do a job. i.e. Play the Game On The Gear That We Provide For You And Stop Bitching. If they want someone to tell them what gear to buy, they will ask someone who knows what they are talking about. Hopefully.

                Not someone who asks you, "why doesn't your benchmark scale well on my quad Xeon?"

                Obvious Answer #1: it wasn't designed to scale well on a quad Xeon.

                Or "why is your code faster on a 32-core Threadripper than on an 80-core quad Xeon?"

                Obvious Answer #1 : "Because the 32-core Threadripper is a better platform for our code than an 80-core quad Xeon".

                Simple answers often go a long way.
                Last edited by touristguy87; 05-25-2021, 10:28 PM.

