Apple M4 chips


  • #16
    Why is the M4 Max slower than the M3 Max in your graphs? This simply can't be right.
    I'd be glad to help. My 285K, M3 Max and M4 Max are ready for tests if needed.
    I noticed that, compared to your PC benchmark, the macOS ST benchmark takes only a few seconds. I think that isn't enough runtime to benchmark properly in terms of averaging.
    Also, M4 chips use the new ARMv9.2-A instruction set. You probably need to recompile the clients.
    Thanks.



    • #17
      Why is the M4 Max slower than the M3 Max in your graphs? This simply can't be right.
      We don't really know for sure. We assume it is because the M4 has only been released in small form factor machines so far. Or maybe Apple went cheap on the RAM to increase profits.
      The M4 Max is faster than the M3 in multi-threading, although the older M2 Ultra is still Apple's fastest chip.
      We do know that the same code runs on the M3 and M4, and that Apple declined to compare M4 and M3 CPU performance head to head.

      The M4 doubles the RAM, but that doesn't help with our CPU benchmark, as 8GB was already enough; our CPU test doesn't even use 4GB. It might in fact hurt, as on-chip RAM generates more heat. More RAM can help a huge amount for some memory-hungry apps, however. The M4 also claims to have faster RAM.
      The M4 has more cores, but that doesn't help with an ST CPU benchmark. It can often hurt, in fact, as it generally lowers clock speeds.
      The M4 claims to have a better neural engine, but no software uses that yet, so that doesn't help.
      Both were built on 3nm silicon.
      So there is no real logical reason to think the M4 should be faster, except for the faster RAM.
      But overall it is still slightly strange.

      Also, M4 chips use the new ARMv9.2-A instruction set. You probably need to recompile the clients.
      We can't target only the M4 instruction set, because the code would then not run on earlier CPUs.
      (The exception is the extended instruction test, where we hand-code different code paths in assembly for each CPU family, so maybe something can be done there.)

      PC benchmark MacOS ST benchmark takes only a few secs
      We'll have a look this week; most of our engineers were on Christmas leave last week.



      • #18
        Some results from in-house testing with PerformanceTest V11.0 build 1003
        M2 MacBook Air - Single threaded
        Command line - 4161
        GUI - 4228
        Each result is an average over 5 runs, so there is a 1.6% difference between the two versions. There is some variation between runs of the same version, but this is a clean machine without much background activity, so the difference seems real and not just within the margin of error.
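The 1.6% figure can be reproduced from the two averages quoted above. This is only a minimal sketch; the function name `percent_gap` is made up for illustration and is not part of PerformanceTest.

```python
def percent_gap(baseline: float, other: float) -> float:
    """Relative difference of `other` versus `baseline`, in percent."""
    return (other - baseline) / baseline * 100.0


cmd_score = 4161  # command line average quoted in the post
gui_score = 4228  # GUI average quoted in the post

print(f"GUI vs command line: {percent_gap(cmd_score, gui_score):.1f}%")
# prints "GUI vs command line: 1.6%"
```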

        We also tested on an older x86 machine and that was very slightly the opposite if anything (command line was very marginally faster).

        We'll keep digging into it. Some possibilities:
        1. Different macOS task scheduling between command line and GUI tasks. For example, switching some load to efficiency cores for command line tasks. Or starting the command line task on E-Cores and taking a while to realize how CPU intensive it is before moving the task to a P-Core.
          This could also be influenced by the macOS release.
        2. Different compiler code optimization switches accidentally used by us, even though the source code & dev environment are the same.
        3. Some extra overhead on the command line (e.g. due to the NCURSES library). But this seems unlikely, as x86 code should also be impacted.
        We are guessing that it is 1), as Apple claims this (paraphrased):
        "High-priority tasks (e.g., user-facing applications) are more likely to run on P-cores, while low-priority tasks (e.g., background updates or maintenance) are assigned to E-cores."

        Info from Apple hints at this.
        https://developer.apple.com/news/?id=vk3m204o



        • #19
          Other things we looked at today.

          We tested with an M1 machine, which also showed a performance gap between GUI and CMD, larger than on the M2 in fact. See graph below. Note that the Y scale exaggerates the difference, as the axis isn't zero-based.

          We adjusted the process scheduling priority (the "nice" value). This didn't close the CMD vs GUI gap.
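For anyone who wants to try a similar nice-value experiment themselves, here is a rough sketch. It is not how PerformanceTest was tested internally; the workload string and the function name `time_at_niceness` are hypothetical stand-ins, and the timing includes interpreter startup. It relies on `os.nice` and `preexec_fn`, so it only works on Unix-like systems such as macOS and Linux.

```python
import os
import subprocess
import sys
import time

# Hypothetical stand-in workload; this is NOT the PerformanceTest ST test.
WORKLOAD = "sum(i * i for i in range(200_000))"


def time_at_niceness(increment: int) -> float:
    """Time WORKLOAD in a child process whose nice value is raised by `increment`.

    Only non-negative increments work without root privileges.
    """
    start = time.perf_counter()
    subprocess.run(
        [sys.executable, "-c", WORKLOAD],
        preexec_fn=lambda: os.nice(increment),  # runs in the child before exec
        check=True,
    )
    return time.perf_counter() - start


if __name__ == "__main__":
    for increment in (0, 10, 19):
        print(f"nice +{increment}: {time_at_niceness(increment):.3f} s")
```

On an otherwise idle machine the timings should be close to each other, since a lower priority only matters when other processes compete for the CPU.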

          Forcing execution on a P-Core (or E-Core) for ST didn't close the gap. The E-Core is way slower BTW (85% loss of performance, as the E-Cores seem really weak). This doesn't mean it never switches to the E-Cores, but if it does switch, it is only for a very short period. Both versions only run on P-Cores as far as we can tell, and this is without explicit affinity settings.

          We looked at changing the test duration.
          Longer tests didn't close the gap, but did show thermal throttling of up to 16% in the single threaded test. This is because the ST test runs after the longer multi-core tests, and the CPU is already overheating pretty seriously before the start of the ST test. This also implies you are going to see rather different benchmark results depending on the duration of the benchmark, the ambient air temperature and what surface the laptop is resting on. No real surprise, as Apple prioritized form over function and didn't include a fan in the M2 MacBook Air.

          We compared running all the tests, medium duration, GUI, (with the ST test running last) vs just running the ST test by itself in the GUI after the machine was idle for a while. Performance difference on the M2 was 1.2%. So this very nearly matches the 1.6% gap.

          So test results depend on the order the tests are run in (due to the CPU overheating). Single thread is faster if the CPU isn't already hot. Maybe this explains everything?

          Shorter duration tests closed the gap slightly. It brought the M2 gap down to 0.7%. Which fits with the CPU heat theory.
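The heat theory can be checked on any machine by sampling throughput over a sustained load and comparing the start of the run with the end. The sketch below is an illustration under assumptions, not the PerformanceTest code: the busy loop is an arbitrary workload, and `sample_throughput` and `throttle_percent` are hypothetical names.

```python
import time


def sample_throughput(duration_s: float, chunk: int = 200_000) -> list[float]:
    """Run a fixed chunk of integer work repeatedly; return iterations/second per chunk."""
    samples = []
    deadline = time.perf_counter() + duration_s
    while time.perf_counter() < deadline:
        start = time.perf_counter()
        total = 0
        for i in range(chunk):
            total += i * i
        samples.append(chunk / (time.perf_counter() - start))
    return samples


def throttle_percent(samples: list[float], window: int = 5) -> float:
    """Percent drop from the mean of the first `window` samples to the last `window`."""
    if len(samples) < 2 * window:
        raise ValueError("need at least 2 * window samples")
    first = sum(samples[:window]) / window
    last = sum(samples[-window:]) / window
    return (first - last) / first * 100.0


if __name__ == "__main__":
    scores = sample_throughput(duration_s=60.0)
    print(f"throughput drop over the run: {throttle_percent(scores):.1f}%")
```

A clearly positive drop on a long run that shrinks on a short run would be consistent with throttling, although background activity can produce the same pattern.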

          There are also two functional differences between the command line version and the GUI version.
          1. At the moment you can't run one test at a time in the command line version, but you can in the GUI. So the CPU likely won't already be hot if you just run the ST test from the GUI, but it will always be hot in the CMD version. This might also have contributed to the difference. We are going to explore this more.
          2. The test time can be configured in the CMD version, but not in the GUI version. So unless careful attention is paid to the setup, you can end up with a different test duration between the GUI and CMD versions.

          [Graph: image.png]



          • #20
            We modified the CMD version to allow running of just the ST test without all the other tests. We also matched the test duration.

            Gap disappeared for the M2.

            Data is below. 5 runs for each setup.

            So the mystery is likely solved. CPU cooling is rather poor, and the CPU throttles after a period of high load.
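A simple way to judge whether a gap between two setups has really disappeared is to compare it with the run-to-run spread. The sketch below is one rough heuristic, not the method used for the data above; the function name `compare_runs` is hypothetical, and the per-run scores have to be filled in by whoever runs it.

```python
from statistics import mean, stdev


def compare_runs(cmd_runs, gui_runs):
    """Return (gap_percent, noise_percent) for two lists of benchmark scores.

    gap_percent is the difference between the two means, relative to the CMD mean.
    noise_percent is the larger of the two sample standard deviations,
    relative to the same CMD mean.
    """
    base = mean(cmd_runs)
    gap = (mean(gui_runs) - base) / base * 100.0
    noise = max(stdev(cmd_runs), stdev(gui_runs)) / base * 100.0
    return gap, noise
```

If the absolute gap is no larger than the noise figure, the two setups are hard to tell apart with only 5 runs each.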

            We'll go back and test on the M1 and M4 as well. (The M4 is on order and hasn't been delivered yet.)

            [Image: image.png]



            • #21
              Cool.
              Please recompile both macOS versions with the latest Xcode version and let me know when they are available.
              I'm interested in running the new versions and seeing the results.
              Thanks.



              • #22
                Please recompile both macos versions with latest xcode version
                In order to keep the results comparable between patch releases we try not to update the compiler, except for major new releases of the benchmark. We do the same on Windows and Linux as well.

                Details for PerformanceTest 11 are:
                • Windows: Visual Studio 2022 17.6.5
                • Linux arm32/arm64: gcc-linaro-4.9.4-2017.01 toolchain
                • Linux x86_64: gcc 4.9
                • Mac x86_64/arm64: Xcode 12.5 (Clang 12.0.5)
                We've started design work on PerformanceTest V12 for a release later this year, and compiler updates will happen then.

                Internally, however, we did a compile with Xcode 16.2 (SDK 15.2). It didn't change much. Maybe a few percent gain on NEON code, but even that might be a margin of error difference.
                (A more significant difference might happen if we update some of the libraries we are using for physics, compression and encryption.)

                Here are the numbers showing how different Xcode versions impacted the results (single run only, so some margin of error).
                [Image: image.png]

                C/C++ compilers have been around for a very long time now. Compiler updates don't often bring huge changes nowadays. The low-hanging fruit has been picked. There is also a limit to how much optimization the compiler can do if the same binary code needs to run on M1, M2, M3 and M4.
                This is even worse in x86 land, as code needs to run on really old x86 CPUs.

