As an update, we tested with our own M4 Mac Mini and found there was generally no difference between running the CMD and GUI versions of PT.
There was, however, a difference when running the Single Thread test on its own (the score was higher) compared to running all the CPU tests. So there seems to be some thermal limiting / throttling occurring by the time PT reaches the Single Thread test. The cooling solution used by Apple isn't optimal, and the machine would perform better, and more consistently, if a larger case had been used.
Apple M4 chips
-
Please recompile both macos versions with latest xcode version
Details for PerformanceTest 11 are:
- Windows: Visual Studio 2022 17.6.5
- Linux arm32/arm64: gcc-linaro-4.9.4-2017.01 toolchain
- Linux x86_64: gcc 4.9
- Mac x86_64/arm64: XCode 12.5 (CLANG 12.0.5)
Internally, however, we did a compile with Xcode 16.2 (SDK 15.2). It didn't change much. Maybe a few percent gain on NEON code, but even that might be within the margin of error.
(A more significant difference might appear if we update some of the libraries we use for physics, compression and encryption.)
Here are the numbers showing how different Xcode versions impacted the results (single run only, so there is some margin of error).
C/C++ compilers have been around for a very long time now. Compiler updates don't often bring huge changes nowadays. The low hanging fruit has been picked. There is also a limit to how much optimization the compiler can do if the same binary code needs to run on M1, M2, M3 and M4.
This is even worse in x86 land, as code needs to run on really old x86 CPUs.
-
Cool.
Please recompile both macOS versions with the latest Xcode version and let me know when they are available.
I'm interested to run new versions and see the results.
Thanks.
-
We modified the CMD version to allow running of just the ST test without all the other tests. We also matched the test duration.
Gap disappeared for the M2.
Data is below. 5 runs for each setup.
So the mystery is likely solved. CPU cooling is rather poor and the CPU throttles after a period of high load.
We'll go back and test on M1 and M4 as well. (The M4 is on order and hasn't been delivered yet.)
-
Other things we looked at today.
We tested with an M1 machine, which also showed a performance gap between GUI and CMD. Larger than the M2 in fact. See graph below. Note that the Y scale exaggerates the difference as it isn't a zero-based axis.
We adjusted the process scheduling priority ("nice" value). This didn't close the CMD vs GUI gap.
Forcing execution on a P-Core (or E-Core) for ST didn't close the gap. The E-Cores are way slower BTW (85% loss of performance, as the E-Cores seem really weak). This doesn't mean it never switches to the E-Cores, but if it does switch it is only for a very short period. Both versions only run on P-Cores as far as we can tell; this is without explicit affinity settings.
We looked at changing the test duration.
Longer tests didn't close the gap, but did show thermal throttling of up to 16% in the single threaded test. This is because the ST test runs after the longer multi-core tests and the CPU is already overheating pretty seriously before the start of the ST test. This also implies you are going to see rather different benchmark results depending on the duration of the benchmark, the ambient air temperature and what surface the laptop is resting on. No real surprise, as Apple prioritized form over function and didn't include a fan in the M2 MacBook Air.
We compared running all the tests, medium duration, GUI, (with the ST test running last) vs just running the ST test by itself in the GUI after the machine was idle for a while. Performance difference on the M2 was 1.2%. So this very nearly matches the 1.6% gap.
So test results depend on the order the tests are run in (due to the CPU overheating). Single thread is faster if the CPU isn't already hot. Maybe this explains everything?
Shorter duration tests closed the gap slightly. It brought the M2 gap down to 0.7%. Which fits with the CPU heat theory.
There are also two functional differences between the command line version and the GUI version. 1) At the moment you can't run one test at a time in the command line version, but you can in the GUI. So this might have also contributed to the difference, meaning the CPU likely won't already be hot if you just run the ST test from the GUI, but the CPU will always be hot in the CMD version. We are going to explore this more. 2) The test time can be configured in the CMD version, but not in the GUI version. So unless careful attention is paid to the setup, you can end up with a different test duration between the GUI and CMD versions.
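The heat theory in this post (single thread is faster when the CPU isn't already hot) can be checked independently of PerformanceTest. Below is a minimal Python sketch, not the method PT uses: it runs a fixed single-threaded workload for several consecutive intervals and reports the throughput of each one. The function names, the workload and the interval lengths are all made up for illustration.

```python
import time


def work_unit(n=200_000):
    """A fixed, CPU-bound chunk of single-threaded integer work (arbitrary)."""
    total = 0
    for i in range(n):
        total += (i * i) % 7
    return total


def throughput_over_time(intervals=5, interval_seconds=1.0):
    """Return work units completed per second for each consecutive interval."""
    rates = []
    for _ in range(intervals):
        start = time.perf_counter()
        done = 0
        while time.perf_counter() - start < interval_seconds:
            work_unit()
            done += 1
        elapsed = time.perf_counter() - start
        rates.append(done / elapsed)
    return rates


def drop_percent(rates):
    """Percentage drop in throughput from the first interval to the last."""
    return (rates[0] - rates[-1]) / rates[0] * 100.0
```

Calling `throughput_over_time(intervals=60, interval_seconds=5.0)` on an idle machine, and again straight after a heavy multi-core load, would give two series to compare. A steady decline within a series, or a lower starting point after the heavy load, would be consistent with throttling, though background activity can produce similar dips.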
-
Some results from in-house testing with PerformanceTest V11.0 build 1003
M2 MacBook Air - Single threaded
Command line - 4161
GUI - 4228
Each result is an average over 5 runs. So a 1.6% difference between the different versions. There is some variation between runs of the same version, but this is a clean machine without much background activity, so the difference seems real and not just in the margin of error.
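For reference, the 1.6% figure follows directly from the two averages above. A small Python sketch of the arithmetic (the helper name `percent_gap` is just for illustration):

```python
def percent_gap(baseline, other):
    """How much higher `other` is than `baseline`, in percent."""
    return (other - baseline) / baseline * 100.0


cmd_score = 4161  # command line, average of 5 runs
gui_score = 4228  # GUI, average of 5 runs

gap = percent_gap(cmd_score, gui_score)
print(f"GUI is {gap:.1f}% higher than command line")  # GUI is 1.6% higher than command line
```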
We also tested on an older x86 machine and that was very slightly the opposite if anything (command line was very marginally faster).
We'll keep digging into it. Some possibilities:
- Different MacOS task scheduling between command line and GUI tasks. For example, switching some load to efficiency cores for command line tasks. Or starting the command line task on E-Cores and taking a while to realize how CPU intensive it is before moving the task to a P-Core. This could also be influenced by the MacOS release.
- Different compiler code optimization switches accidentally used by us, even though the source code & dev environment are the same.
- Some extra overhead on the command line (e.g. due to the NCURSES library). But this seems unlikely, as the x86 code should also be impacted.
"High-priority tasks (e.g., user-facing applications) are more likely to run on P-cores, while low-priority tasks (e.g., background updates or maintenance) are assigned to E-cores."
Info from Apple hints at this.
https://developer.apple.com/news/?id=vk3m204o
-
Why is the M4 Max slower than the M3 Max in your graphs? This simply can't be.
The M4 Max is faster than the M3 in multi-threading, but you need to ignore the older M2 Ultra, which is Apple's fastest chip.
We do know it is the same code that runs on the M3 and M4 & that Apple declined to compare the M4 and M3 CPU performance head to head.
M4 doubles the RAM, but that doesn't help with our CPU benchmark as 8GB was enough. This might hurt in fact as RAM on chip generates more heat. This can help a huge amount however for some memory hungry apps. But our CPU test doesn't even use 4GB. M4 claims to also have faster RAM.
M4 has more cores, but that doesn't help with a ST CPU benchmark. This can often hurt in fact as it generally lowers clock speeds.
M4 claims to have a better neural engine. But no software uses that yet. So that doesn't help.
Both were built on 3nm silicon.
So no real logical reason to think M4 should be faster, except faster RAM.
But overall it is still slightly strange.
Also M4 chips use the new ARMv9.2-A instruction set. Probably need to recompile clients.
(The exception to this is the extended instruction test, where we hand code different code paths in assembly for each CPU family, so maybe there is something that can be done there.)
PC benchmark MacOS ST benchmark takes only a few secs
-
Why is the M4 Max slower than the M3 Max in your graphs? This simply can't be.
I'll be glad to help you. My 285k, M3 Max and M4 Max are ready for tests if needed.
I noticed that, in comparison to your PC benchmark, the MacOS ST benchmark takes only a few secs. I think that is just not enough runtime to benchmark properly in terms of averaging.
Also, M4 chips use the new ARMv9.2-A instruction set. You probably need to recompile the clients.
Thanks.
-
Single threaded benchmarks from 104 different M4 Max - 16 Core MacOS systems are below, plotted as a distribution.
(and an M3 graph as well)
M4 results are roughly a bell curve, with a tail on the low end. This is as expected and is common with real world systems. It would be nice to have more data, but it is early days for the M4 at the moment.
Your assertion that there are two clusters of results for M4 and M3 isn't supported by the data, but we can have a closer look next week.
I should note that both your M4 results are in the top ~2% of real world machines, with the 5038 result being higher than all other systems tested. Fastest M4 Max in the world.
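As a rough illustration of what "top ~2%" means for a sample of 104 systems (about two machines), here is a small Python sketch that ranks one score against a list of submitted scores. The function names are hypothetical and no real baseline data is included; the `scores` list would have to come from the actual submissions.

```python
def percent_at_or_above(scores, value):
    """Share of submitted scores that are at or above `value`, in percent."""
    if not scores:
        raise ValueError("scores must not be empty")
    at_or_above = sum(1 for s in scores if s >= value)
    return at_or_above / len(scores) * 100.0


def is_highest(scores, value):
    """True if no submitted score exceeds `value`."""
    return all(value >= s for s in scores)
```

With 104 systems, a result that only two machines match or beat (itself included) gives 2 / 104, or about 1.9%.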
-
Please stop making a fool out of me. I've been a pro software developer for almost 35 years.
I tried several times. Same unreliable, different results between the console version and the AppStore version.
Also tried on the M3 Max. Same shit.
Now I understand why the M4 Max is so low in your chart...
"Something else" is your compilation, I think.
-
The code is identical. So any performance difference is due to something else. For example background tasks running on the machine, thermal limits / throttling, battery saving power limits, scheduling differences for the GUI process, etc... Maybe there is some additional overhead from the store, but we aren't aware of anything in particular.
It would be more accurate to run each version multiple times (e.g. 10 times) and take the average or the max.
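A minimal Python sketch of that suggestion: collect the score from each run of a version by hand, then summarize them. The function names are made up for illustration, and the comparison rule is a crude heuristic (the gap between the means must exceed the combined run-to-run spread), not a proper statistical test.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Summarize repeated benchmark scores from one version."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    avg = mean(scores)
    spread = stdev(scores)  # sample standard deviation
    return {
        "runs": len(scores),
        "mean": avg,
        "max": max(scores),
        "stdev": spread,
        "cv_percent": spread / avg * 100.0,
    }


def gap_exceeds_noise(scores_a, scores_b):
    """Crude check: is the gap in means larger than the combined spread?"""
    a = summarize_runs(scores_a)
    b = summarize_runs(scores_b)
    return abs(a["mean"] - b["mean"]) > a["stdev"] + b["stdev"]
```

Passing the ten GUI scores and the ten command line scores to `gap_exceeds_noise` gives a quick sense of whether a difference between the two is larger than the run-to-run variation.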
-
PerformanceTest M4 Max AppStore MacOS client ST result: 5038
PerformanceTest M4 Max command line MacOS client ST result: 4740
Why are the results not identical?
-
Please don't forget we're talking single core performance here. Cooling has nothing to do with it.
-
My M4 Max R23 single core result is 2303
M4 Max even worse than M3 Max
Likely there is a thermal limit or RAM bandwidth bottleneck. Apple has put the priority on making the computer cases smaller and smaller (or thinner). This makes cooling much harder, which in turn leads to thermal throttling. Literal form over function.
USB transfer speeds from an SSD aren't normally CPU limited. So it isn't a good CPU benchmark.
I predict the M4 will look better when they put it into a full size Mac Pro case with proper cooling.
-
My M4 Max R23 single core result is 2303.
In your Single Thread Performance chart the M4 Max is even worse than the M3 Max. That simply can't be...
Cinebench 2024 single thread score:
285k - 150
M4 Max - 182
Blender open time for a big compressed file, from the same SSD attached at the same port speed:
285k - 1 minute 3 seconds
M4 Max - 0 minutes 47 seconds
I can continue if you need.
Last edited by andrewelick; Jan-02-2025, 12:01 AM.