
Apple M4 chips


  • Simon (PassMark)
    replied
    As an update, we tested with our own M4 Mac Mini and found there was generally no difference between running the CMD and GUI versions of PT.

    There was a difference, though, when running the Single Thread test on its own (the score was higher) compared with running all the CPU tests. So there seem to be thermal limits / throttling occurring by the time PT reaches the Single Thread test. The cooling solution used by Apple isn't optimal, and the machine would perform better and more consistently if a larger case had been used.


  • David (PassMark)
    replied
    Please recompile both macos versions with latest xcode version
    In order to keep the results comparable between patch releases we try not to update the compiler, except for major new releases of the benchmark. We do the same on Windows and Linux as well.

    Details for PerformanceTest 11 are,
    • Windows: Visual Studio 2022 17.6.5
    • Linux arm32/arm64: gcc-linaro-4.9.4-2017.01 toolchain
    • Linux x86_64: gcc 4.9
    • Mac x86_64/arm64: XCode 12.5 (CLANG 12.0.5)
    We've started design work on PerformanceTest V12 for a release later this year, and compiler updates will happen then.

    Internally however we did a compile with XCode16.2 (SDK15.2). It didn't change much. Maybe a few percent gain on NEON code, but even that might be margin of error differences.
    (more significant difference might happen if we update some of the libraries we are using for Physics, compression and encryption).

    Here are the numbers showing how different Xcode versions impacted the results (single run only, so some margin of error).
    [Image attachment: image.png]

    C/C++ compilers have been around for a very long time now. Compiler updates don't often bring huge changes nowadays. The low hanging fruit has been picked. There is also a limit to how much optimization the compiler can do if the same binary code needs to run on M1, M2, M3 and M4.
    This is even worse in x86 land, as code needs to run on really old x86 CPUs.


  • andrewelick
    replied
    Cool.
    Please recompile both macos versions with latest xcode version and let me know then available.
    I'm interested to run new versions and see the results.
    Thanks.


  • David (PassMark)
    replied
    We modified the CMD version to allow running of just the ST test without all the other tests. We also matched the test duration.

    Gap disappeared for the M2.

    Data is below. 5 runs for each setup.

    So the mystery is likely solved. CPU cooling is rather poor and the CPU throttles after a period of high load.

    We'll go back and test on M1 and M4 as well. (M4 is on order and hasn't been delivered as yet)

    [Image attachment: image.png]


  • David (PassMark)
    replied
    Other things we looked at today.

    We tested with an M1 machine, which also showed a performance gap between GUI and CMD, larger than the M2's in fact. See graph below. Note that the Y scale exaggerates the difference, as the axis isn't zero-based.

    Adjusting the process scheduling priority ("Nice" value). This didn't close the CMD vs GUI gap.
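
For anyone who wants to repeat the priority experiment, here is a minimal sketch of launching a process with a changed nice value. It is not the method used in the test above. The helper name `run_with_niceness` is made up for illustration, and the child command is a stand-in that only reports its own nice value; substitute whatever benchmark command you actually run.

```python
import os
import subprocess
import sys


def run_with_niceness(cmd, increment):
    """Run cmd in a child process whose nice value is changed by `increment`.

    A positive increment lowers the child's priority. A negative increment
    raises it and normally requires root. (Hypothetical helper, Unix only.)
    """
    return subprocess.run(
        cmd,
        preexec_fn=lambda: os.nice(increment),  # runs in the child before exec
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Stand-in workload: the child prints its own nice value.
    child = [sys.executable, "-c", "import os; print(os.nice(0))"]
    result = run_with_niceness(child, 5)
    print("parent nice:", os.nice(0), "child nice:", result.stdout.strip())
```

Comparing scores at two or three nice values, with everything else held constant, is enough to see whether priority matters for a given gap.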

    Forcing execution on P-Core (or E-Core) for ST didn't close the gap. E-Core is way slower BTW (85% loss of performance, as the E-Cores seem really weak). This doesn't mean it never switches to the E-Cores. But if it does switch it is only for a very short period. Both versions only run on P-Cores as far as we can tell, and this is without explicit affinity settings.

    We looked at changing the test duration.
    Longer tests didn't close the gap, but they did show thermal throttling of up to 16% in the single threaded test. This is because the ST test runs after the longer multi-core tests and the CPU is already overheating pretty seriously before the start of the ST test. This also implies you are going to see rather different benchmark results depending on the duration of the benchmark, the ambient air temperature and what surface the laptop is resting on. No real surprise, as Apple prioritized form over function and didn't include a fan in the M2 MacBook Air.

    We compared running all the tests, medium duration, GUI, (with the ST test running last) vs just running the ST test by itself in the GUI after the machine was idle for a while. Performance difference on the M2 was 1.2%. So this very nearly matches the 1.6% gap.

    So test results depend on the order the tests are run in (due to the CPU overheating). Single thread is faster if the CPU isn't already hot. Maybe this explains everything?

    Shorter duration tests closed the gap slightly. It brought the M2 gap down to 0.7%. Which fits with the CPU heat theory.

    There are also two functional differences between the command line version and the GUI version. 1) At the moment you can't run one test at a time in the command line version, but you can in the GUI. So this might also have contributed to the difference: the CPU likely won't already be hot if you just run the ST test from the GUI, but it will always be hot in the CMD version. We are going to explore this more. 2) The test time can be configured in the CMD line version, but not in the GUI version. So unless careful attention is paid to the setup, you can end up with a different test duration between the GUI and CMD versions.

    [Image attachment: image.png]


  • David (PassMark)
    replied
    Some results from in-house testing with PerformanceTest V11.0 build 1003
    M2 MacBook Air - Single threaded
    Command line - 4161
    GUI - 4228
    Each result is an average over 5 runs. So there is a 1.6% difference between the two versions. There is some variation between runs of the same version, but this is a clean machine without much background activity, so the difference seems real and not just in the margin of error.
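
The 1.6% figure can be reproduced from the two averages quoted above. This is only a worked calculation; the function name `percent_gap` is chosen here for illustration.

```python
def percent_gap(baseline, other):
    """Return how much `other` exceeds `baseline`, as a percentage of `baseline`."""
    return (other - baseline) / baseline * 100.0


cmd_score = 4161  # command line, average of 5 runs (from the post)
gui_score = 4228  # GUI, average of 5 runs (from the post)

gap = percent_gap(cmd_score, gui_score)
print(f"GUI is {gap:.1f}% faster than command line")
```

The gap is taken relative to the command line score. Using the GUI score as the baseline instead changes the result only in the second decimal place.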

    We also tested on an older x86 machine and that was very slightly the opposite if anything (command line was very marginally faster).

    We'll keep digging into it. Some possibilities.
    1. Different MacOS task scheduling between command line and GUI tasks. For example, switching some load to efficiency cores for command line tasks. Or starting the command line task on E-Cores and taking a while to realize how CPU intensive it is before moving the task to a P-Core.
      This could also be influenced by the MacOS release as well.
    2. Different compiler code optimization switches accidentally used by us, even though the source code & dev environment is the same.
    3. Some extra overhead on the command line (e.g. due to NCURSES library). But this seems unlikely as x86 code should be also impacted.
    We are guessing it is 1), as Apple claims this (paraphrased):
    "High-priority tasks (e.g., user-facing applications) are more likely to run on P-cores, while low-priority tasks (e.g., background updates or maintenance) are assigned to E-cores."

    Info from Apple hints at this.
    https://developer.apple.com/news/?id=vk3m204o


  • David (PassMark)
    replied
    Why M4 Max slower then M3 Max in your graphs. This is simply can't be
    Don't really know for sure. We assume it is because the M4 has only been released in small form factor machines so far. Or maybe Apple went cheap on the RAM to increase profits.
    M4 Max is faster in multi-threading, than M3, but you need to ignore the older M2 Ultra, which is Apple's fastest chip.
    We do know it is the same code that runs on the M3 and M4 & that Apple declined to compare the M4 and M3 CPU performance head to head.

    M4 doubles the RAM, but that doesn't help with our CPU benchmark as 8GB was enough. This might hurt in fact as RAM on chip generates more heat. This can help a huge amount however for some memory hungry apps. But our CPU test doesn't even use 4GB. M4 claims to also have faster RAM.
    M4 has more cores, but that doesn't help with a ST CPU benchmark. This can often hurt in fact as it generally lowers clock speeds.
    M4 claims to have a better neural engine. But no software uses that yet. So that doesn't help.
    Both were built on 3nm silicon.
    So no real logical reason to think M4 should be faster, except faster RAM.
    But over all still slightly strange.

    Also M4 chips uses new ARMv9.2-A instruction set. Probably need to recompile clients.
    We can't target just the M4 instruction set and have the code not run on all earlier CPUs.
    (the exception for this is the extended instruction test, where we hand code different code paths in assembly for each CPU family, so maybe there is something that can be done there)

    PC benchmark MacOS ST benchmark takes only a few secs
    We'll have a look this week, most of our engineers were on Christmas leave last week.


  • andrewelick
    replied
    Why M4 Max slower then M3 Max in your graphs. This is simply can't be.
    I'll be glad to help you. My 285k, M3 Max and M4 Max ready to do tests if needed.
    I noticed that, compared with your PC benchmark, the MacOS ST benchmark takes only a few secs. I think that is just not enough runtime to benchmark properly in terms of averaging.
    Also M4 chips uses new ARMv9.2-A instruction set. Probably need to recompile clients.
    Thanks.


  • David (PassMark)
    replied
    Single threaded benchmarks from 104 different M4 Max - 16 Core MacOS systems are below, plotted as distribution.
    (and a M3 graph as well)

    M4 results are roughly a bell curve, with a tail on the low end. This is as expected and is common with real world systems. It would be nice to have more data, but it is early days for the M4 at the moment.

    Your assertion that there are two clusters of results for M4 and M3 isn't supported by the data, but we can have a closer look next week.

    I should note that both your M4 results are in the top ~2% of real world machines, with the 5038 result being higher than all other systems tested. Fastest M4 Max in the world.

    [Image attachment: image.png]

    [Image attachment: image.png]


  • andrewelick
    replied
    Please stop making fool out of me. I'm pro software developer for almost 35 years.
    I tried several times. Same unreliable different results between console version and AppStore version.
    Also tried on M3 Max. Same shit.
    Now I understand why M4 Max is so low in your chart...
    "Something else" is your compilations I think.


  • David (PassMark)
    replied
    The code is identical. So any performance difference is due to something else. For example background tasks running on the machine, thermal limits / throttling, battery saving power limits, scheduling differences for the GUI process, etc... Maybe there is some additional overhead from the store, but we aren't aware of anything in particular.

    It would be more accurate to run each version multiple times (e.g. 10 times) and take the average or the max.
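
A rough sketch of that procedure: collect several scores, then report the mean, the max and the run-to-run spread. `repeat` and `summarize` are hypothetical helper names, and `run_once` stands for whatever you use to launch one benchmark run and return its score. The demo at the bottom uses a stand-in that returns fixed placeholder numbers, not real measurements.

```python
import statistics


def repeat(run_once, times=10):
    """Call run_once() `times` times and collect the returned scores."""
    return [run_once() for _ in range(times)]


def summarize(scores):
    """Return run count, mean, max and sample standard deviation of scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate run-to-run spread")
    return {
        "runs": len(scores),
        "mean": statistics.mean(scores),
        "max": max(scores),
        "stdev": statistics.stdev(scores),
    }


if __name__ == "__main__":
    # Placeholder values only, standing in for real benchmark scores.
    placeholder = iter([100, 102, 98, 101, 99])
    print(summarize(repeat(lambda: next(placeholder), times=5)))
```

If the difference between the two versions is no larger than the spread within one version, it is probably noise rather than a real gap.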


  • andrewelick
    replied
    PerformanceTest M4 Max AppStore MacOS client ST result: 5038
    PerformanceTest M4 Max command line MacOS client ST result: 4740

    Why results not identical?


  • andrewelick
    replied
    Please don't forget we're talking single core performance here. Cooling has nothing to do with it.


  • David (PassMark)
    replied
    My M4 Max R23 single core result is 2303
    Which is slower than an average 285K.

    M4 Max even worse then M3 Max
    They are roughly the same within the margin of error. Might explain why Apple compared it to the M1 when they did the M4 release.
    Likely there is a thermal limit or RAM bandwidth bottleneck. Apple has put the priority on making the computer cases smaller and smaller (or thinner). This makes cooling much harder, which in turn leads to thermal throttling. Literal form over function.

    USB transfer speeds from a SSD aren't normally CPU limited. So it isn't a good CPU benchmark.

    I predict the M4 will look better when they put it into a full size Mac Pro case with proper cooling.


  • andrewelick
    replied
    My M4 Max R23 single core result is 2303.
    In your Single Thread Performance chart M4 Max even worse then M3 Max that simply can't be...

    Cinebench 2024 single thread score:
    285k - 150
    M4 Max - 182

    Blender compressed big file open time from same ssd attached to same port speed:
    285k - 1 minute 3 seconds
    M4 Max - 0 minute 47 seconds

    I can continue if you need
    Last edited by andrewelick; Jan-02-2025, 12:01 AM.
