64bit vs 32bit benchmarks & integer maths & PT8

    Six years ago we took a look at 64bit benchmarking and provided some examples of why 64bit can give better performance than 32bit.

    What we found at the time was that a 64bit CPU, running a 64bit O/S, executing 64bit code could in some cases be twice as quick as 32bit code.

    We are now at the point where we are doing research into PerformanceTest V8, and since the initial study many new CPUs have been released and the difference between 32bit/64bit performance has grown. In PerformanceTest V7 some CPUs now get up to 4 - 6 times the performance in 64bit integer maths, compared to 32bit.

    Six times the performance is an enormous difference. So we decided to dig a bit deeper to see what was going on.

    In PerformanceTest V7 the integer maths test is made up of 8 individual mathematical operations performed in equal numbers. These are:
    1. Addition of two 32bit numbers
    2. Subtraction of two 32bit numbers
    3. Multiplication of two 32bit numbers
    4. Division of two 32bit numbers
    5. Addition of two 64bit numbers
    6. Subtraction of two 64bit numbers
    7. Multiplication of two 64bit numbers
    8. Division of two 64bit numbers
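
    As a rough illustration (this is not PassMark's actual benchmark code, and the function name is made up), an equal-weight mix of these eight operations could be sketched in C like this:

```c
#include <stdint.h>

/* Hypothetical sketch of a V7-style equal-weight integer mix: each of the
   eight operations above runs once per call. On a 32bit build the final
   64bit divide dominates, since it compiles to a slow runtime helper call. */
uint64_t mix_v7(uint32_t a32, uint32_t b32, uint64_t a64, uint64_t b64) {
    uint64_t sum = 0;
    sum += a32 + b32;   /* 1. 32bit addition */
    sum += a32 - b32;   /* 2. 32bit subtraction */
    sum += a32 * b32;   /* 3. 32bit multiplication */
    sum += a32 / b32;   /* 4. 32bit division */
    sum += a64 + b64;   /* 5. 64bit addition */
    sum += a64 - b64;   /* 6. 64bit subtraction */
    sum += a64 * b64;   /* 7. 64bit multiplication */
    sum += a64 / b64;   /* 8. 64bit division */
    return sum;
}
```

    A real benchmark would run millions of these operations per second and report the rate; the structure here only illustrates the equal weighting.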

    The first four operations are, unsurprisingly, executed in the same way and at the same speed on a 32bit machine and a 64bit machine.

    The second four 64bit operations are executed much more quickly on a 64bit machine. See the above referenced post for details.

    What we have found in more recent testing however is also interesting.

    The first interesting point is that division of 32bit numbers is pretty much always around four times slower than Add, Subtract or Multiply. This isn't news, as it is well known that division is a harder operation to do.

    What was more interesting was that 64bit division was far slower than 32bit division. And doing 64bit division on a 32bit system was extremely expensive, showing a fourteen-fold performance drop going to 64bit numbers. This, more than anything else, accounts for why the PerformanceTest V7 integer maths test does so well on 64bit compared to 32bit.

    The second interesting point is that some of the newest CPUs have become significantly faster at 64bit division, for example the AMD A8-3850 & A6-3650. This has really lifted their results in this test.

    The lessons in this (for us) are that the V7 integer maths test places too much weight on the speed at which 64bit division can be performed. The same is also true, to a lesser extent, for 32bit division and multiplication. More weight should be given to the other operations: Add, Subtract, etc. This would moderate the differences between CPU types and also reduce the large differences between 32bit and 64bit.

    Because the division operation is so slow compared to the other operations, and the weighting of each operation was equal, the V7 integer maths test has become largely a test of how fast the CPU can do division, making it a rather narrow and unrealistic test of a CPU.

    So in V8 we plan to reduce the number of division operations performed and also introduce some additional variety into the test in the form of logic operations like bit shifting & increment instructions.

    Here are the actual V7 numbers from an Intel X9650 CPU, running both 32bit code and 64bit code. Higher numbers are better.

    You might be wondering just why doing 64bit division on a 32bit system is so slow. The reasons are that (a) there is no native machine code instruction for dealing with 64bit numbers on a 32bit system, and (b) the calculations to perform a 64bit division are rather complex on a 32bit system. Each division takes the CPU dozens of steps to complete.
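
    To see why, here is a minimal sketch of the kind of shift-and-subtract long division a 32bit compiler's runtime helper (such as GCC's `__udivdi3`) has to carry out in software. The function name and loop structure are illustrative only, not the actual library code:

```c
#include <stdint.h>

/* Illustrative restoring (shift-and-subtract) long division: up to 64
   iterations of shifting, comparing and subtracting, instead of the single
   divide instruction a 64bit CPU would use. */
uint64_t div64_schoolbook(uint64_t n, uint64_t d) {
    uint64_t q = 0, r = 0;
    for (int i = 63; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);  /* bring down the next bit of n */
        if (r >= d) {                   /* does the divisor fit? */
            r -= d;
            q |= 1ULL << i;             /* set this quotient bit */
        }
    }
    return q;                           /* remainder is left in r */
}
```

    Even this simplified version needs dozens of loop iterations per divide, which matches the fourteen-fold slowdown described above.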

    Update: Here is a link to the PT8 development thread.

  • #2
    How much 64 bit Division is done in the real world? If it's close to equaling the amount of 64 bit Addition, Subtraction and Multiplication, should 64 bit Division be weighted down?
    Main Box*AMD Ryzen 7 5800X*ASUS ROG STRIX B550-F GAMING*G.SKILL 32GB 2X16 D4 3600 TRZ RGB*Geforce GTX 1070Ti*Samsung 980 Pro 1 TB*Samsung 860 EVO 1 TB*Samsung 860 EVO 2 TB*Asus DRW-24B3LT*LG HL-DT-ST BD-RE WH14NS40*Windows 10 Pro 21H2


    • #3
      We don't have hard stats, and it would vary from one application to the next. But we are thinking that multiply and add are significantly more common.

      Some research on Google turned up the "Gibson mix", which was based on research done by Jack C. Gibson in 1959 on an IBM 704 system running scientific applications. (Yes, from 1959!!!)

      Instruction type, Percentage of use.
      Load and store, 31.2%
      Indexing, 18.0%
      Branches, 16.6%
      Floating Add and Subtract, 6.9%
      Fixed-Point Add and Subtract, 6.1%
      Instructions not using registers, 5.3%
      Shifting, 4.4%
      Compares, 3.8%
      Floating Multiply, 3.8%
      Logical, And, Or, 1.6%
      Floating Divide, 1.5%
      Fixed-Point Multiply, 0.6%
      Fixed-Point Divide, 0.2%

      See also, [Jain91] R. Jain, "The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling", Wiley- Interscience, New York, NY, April 1991.

      Besides being very old, the work load mix is only half the story. There are different addressing modes, different caching scenarios, differences in the data being used (e.g. all zeros), different options in the CPU for floating point precision, where the data is in RAM, if it is accessed sequentially, if the data is aligned and many other factors.

      More research turned up this work from the 60's and 70's, known as the "ADP Mix". It was produced by the UK Treasury's Technical Support Unit (TSU).

      Instruction type, Percentage of use.
      Fixed Point Add/Subtract 31%
      Fixed Point Multiply 1.3%
      Fixed Point Divide 0.6%
      Branch 35%
      Compare 6.2%
      Transfer 8 characters 20.5%
      Logical 5.4%

      I am guessing there is no floating point % mentioned because many years ago, all the floating point work was done in a separate FPU chip and not in the CPU.

      One would think there must be something more recent. But I haven't found it yet.


      • #4
        Would it be possible to write a program to monitor the type of math used over time and save the results?


        • #5
          Probably nearly impossible to do in real time. Modern CPUs might execute 12,000,000,000 instructions per second, so no program is going to be able to accumulate this much data in real time. Sampling or static analysis is probably a better option, but either way it is a lot of work. It would also be highly task dependent: an algorithm to search for prime numbers might use a lot of division, but sorting strings might use none at all.


          • #6
            Now that PerformanceTest V8 has been launched we have some stats about how the rebalancing of results has impacted the integer maths score.

            PerformanceTest V7
            AMD A10-5800K: 4,050 million integer maths operations per second (MOps)
            AMD Phenom II x4 965: 680 MOps
            Intel i7-2600: 2,780 MOps

            PerformanceTest V8
            AMD A10-5800K: 11,750 MOps.
            AMD Phenom II x4 965: 6,350 MOps
            Intel i7-2600: 16,670 MOps

            % Difference between V7 and V8
            AMD A10-5800K: 190% increase
            AMD Phenom II x4 965: 833% increase
            Intel i7-2600: 500% increase

            All CPUs did more operations per second in the new PT8 test. This makes sense, as the instruction mix is now weighted towards faster instructions. Bitwise operations are especially quick compared to division. Some CPUs will benefit a lot from this change, others only slightly. This will move the relative rankings around in the PT8 charts compared to the PT7 charts. The Phenom should move up a bit, the A10-5800K down a bit.

            The old PT7 prime number test also used the square root function very heavily. Internally we think the square root function did a lot of division as well. In PT8 this function has been replaced by the Sieve of Atkin, which is a lot more efficient and uses a broader range of CPU instructions. This has caused further reshuffling of the CPUs' relative rankings.

            What's in the new integer maths test
            The new test uses a lot less division and a broader set of CPU instructions. The test is a mix of 32bit and 64bit instructions and performs the following operations: Addition, Subtraction, Multiplication, Division, Bitwise Shift, Bitwise boolean AND, Bitwise boolean OR, Bitwise boolean XOR, Bitwise boolean NOT. Weightings have also changed so that division only makes up ~1.5% of the new test.
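
            As an illustration only (the real test's weightings and code are PassMark's own, and this helper name is made up), the flavour of the new mix might look like:

```c
#include <stdint.h>

/* Hypothetical sketch of a V8-style mix: cheap arithmetic and bitwise
   operations dominate, with division reduced to a rare occurrence
   (~1.5% of operations in the real test, omitted here entirely). */
uint64_t mix_v8(uint64_t a, uint64_t b) {
    uint64_t s = 0;
    s += a + b;    /* addition */
    s += a - b;    /* subtraction */
    s += a * b;    /* multiplication */
    s += a << 3;   /* bitwise shift */
    s += a & b;    /* bitwise AND */
    s += a | b;    /* bitwise OR */
    s += a ^ b;    /* bitwise XOR */
    s += ~a;       /* bitwise NOT (wraps modulo 2^64, as unsigned maths does) */
    return s;
}
```

            Every one of these operations is a single fast instruction on modern CPUs, which is why all the V8 MOps figures above are higher than their V7 counterparts.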


            • #7
              Hi Dave,

              Isn't the V8 relative ranking quite weird when compared to V7?

              The A10-5800 is not down a bit, it has gone from 45% more MOps versus i7-2600 to 30% less. On the overall CPU score, it has gone from 7500 to 5200, a 30% reduction.

              In comparison, the i5-3550 score has gone from 7470 to 7080, a 5% reduction.

              Bottom line: A10-5800 was equivalent to i5-3550, now it is equivalent to i3-3225.

              Is this supposed to happen? V7 seems to be more related to real world performance than V8.

              Best regards,


              • #8
                We can compare the V8 results to other 3rd party benchmarks.

                Here is the comparison over at Anandtech between the A10-5800K and the i3-3220 with ~20 different applications.
                Broadly speaking these two CPUs are similar in most of the selected applications.

                Here is the comparison between the A10-5800K and the i5-2500K.
                The i5 is clearly a better chip in this selection of applications.

                Unfortunately Anandtech has only benchmarked a small selection of the available CPUs, so I couldn't use the exact model numbers you quoted. You could also make a valid argument that the selection of applications used at Anandtech is too GPU dependent for a pure CPU test. But I think the point still stands.
                • Our old V7 tests ended up being too weighted toward doing division (which the A10/A8 excelled at doing, for reasons we don't fully understand).
                • Our old V7 tests were all fully threaded. So single threaded performance wasn't factored in at all. This has now changed in V8. The A10-5800K doesn't do all that well in single threaded applications. It gets thrashed by the i5-3550, for example.
                • The AMD chips gave inconsistent performance due to a CPU bug. Further messing up our results.
                • The A10 needed to drop a bit in our V7 charts, and this has happened in the new charts.


                • #9
                  Do the 32- and 64-bit versions of the CPU test weigh the results of 32- and 64-bit operations differently? Because theoretically, a 32-bit version of a program would process more 32-bit numbers in places where a 64-bit version of the same program would be using 64-bit numbers (e.g. integers, memory addresses, etc.).

                  If not, wouldn't the 64-bit results be somewhat skewed from real-world CPU performance, because the faster processing of those 64-bit numbers won't increase the speed of program execution vs. its 32-bit counterpart?


                  • #10
                    The weighting is the same.

                    Both 32bit and 64bit applications process 32bit and 64bit variables.
                    So 64bit applications commonly use 32bit values, and 32bit applications commonly use 64bit values.

                    For program variables like integers, characters, floats, etc., the number of bits used is not determined by the CPU or the operating system, but by the programmer. So (in C/C++ on mainstream compilers) an int is 32 bits, even in a 64bit application. By contrast an int64 is always 64 bits, even in a 32bit application.
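
                    A small sketch of this, assuming a mainstream compiler where `int` is 32 bits (the helper names here are made up for illustration):

```c
#include <stdint.h>
#include <stddef.h>

/* The width of a C/C++ variable is fixed by its declared type,
   not by whether the build is 32bit or 64bit. */
size_t bits_of_int(void)     { return sizeof(int) * 8; }      /* 32 on mainstream compilers, in both builds */
size_t bits_of_int64(void)   { return sizeof(int64_t) * 8; }  /* always 64, even in a 32bit build */
size_t bits_of_pointer(void) { return sizeof(void *) * 8; }   /* 32 or 64: this one follows the build */
```

                    Pointers are the notable exception: their width does follow the build, which is why 64bit programs can address more memory.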

                    So if the programmer needs to use a 64bit variable, it will result in a 2 to 5 fold speed penalty if the code is run on a 32bit system. This is what happens in real life and in our benchmark. The only difference will be that some real applications use a lot of 64bit variables and others not so many. So the effect will vary from one application to the next. There is more detailed examination of this 64bit speed difference in this old post.


                    • #11
                      Yeah, you're right that it's highly program dependent. A program that needs to use 64-bit variables (even in a 32-bit environment) might see more benefit from a 64-bit version of the same program.

                      But then that brings up another wrinkle: haven't 32-bit processors been able to operate on 64-bit floating point numbers ever since the Pentium 1 (64-bit integers are of more limited usefulness)? What I'm getting at is, since a real-world 32-bit program that makes use of 64-bit floats is likely already benefiting from 64-bit processing, how big a performance increase do you really get by running a 64-bit version of the same program? Does Passmark's benchmark have the CPU calculate 64-bit floats the same way a commercial app would (e.g. by using x87, MMX, SSE, etc.)?

                      Does Passmark keep track of separate results for 32-bit and 64-bit versions of the benchmark? It would be interesting to see how much benefit (other than more memory) a 64-bit environment will get you.


                      • #12
                        Even the Intel 8087 (from 33 years ago) offered 80 bit floating point operations. It also did 16, 32 and 64bit depending on how you used it. Didn't do it very fast however and it didn't have a 64bit address or data bus.

                        64bit integers are now common. I don't think you can make a blanket statement that they are of limited use.

                        On a 32bit system you won't be using a 64bit address and data bus. Going to a 64bit data bus will speed things up.

                        I think you are confused on the MMX point regarding floats. MMX was exclusively for integers. In fact it could slow down floating point, as the CPU registers were shared between MMX instructions and floating point, and swapping between the two functions was expensive.

                        For SSE there is a separate CPU benchmark test called "Extended instructions (SSE)". So there is no SSE/MMX in the integer nor floating point test.

                        The whole area is now somewhat academic. The vast majority of new systems are 64bit. For all the new CPUs we are seeing, 95%+ of the baseline submissions are 64bit. So we aren't tracking 32bit separately. It will be as dead as 16bit and 8bit soon.


                        • #13
                          You know, the more I think about it, the more I realize I'm both right and wrong here. For instance, a 64-bit CPU processing 64-bit floats in a 64-bit OS shouldn't be any different than the same CPU doing the same calculations in 32-bit mode; any half-decent compiler should be able to utilize the fully-64-bit floating point registers, and there's no reason to think the registers would work any slower in 32-bit (legacy) mode than in x64 mode; the calculations are essentially identical either way.

                          You're right about MMX being for integers and not floats; they do make use of the floating point registers, but as you pointed out, the registers are essentially dual-mode - they act as floating point registers for x87 instructions, but SIMD integer registers for MMX code, hence my confusion. But I believe you're mistaken in suggesting that only 64-bit CPU's work with a 64-bit data bus (wasn't the data bus on the Pentium 1 64-bit?).

                          And I still question the usefulness of 64-bit integers. A 32-bit integer can store one of ~4 billion values, compared to about 18 quintillion for a 64-bit int. When's the last time you saw a data set that needed to keep track of more than 4 billion different elements?

                          But here's the rub I think we're both forgetting about: the x64 instruction set has twice as many registers to work with. So even if the calculations themselves are being handled the same way, the simple fact that there are more registers means the CPU can handle that much more of a workload at one time. I suspect the extra registers have more to do with performance differences than anything else.

                          Please don't take any of this the wrong way. I'm not trying to be an ass, I just want to better understand how and why these things work. Many people still use 32-bit platforms (e.g. Windows XP), often with 64-bit capable CPU's (e.g. Athlon 64), and I'm constantly curious how much benefit would come from upgrading to a 64-bit OS (such as Windows 7). I really do appreciate you taking the time to respond to my questions and concerns.

                          I do think 32-bit is gonna stay around for a while, though. As long as Microsoft keeps pumping out Windows Vista (aka "we-didn't-learn-a-thing-from-Windows-ME" edition), Windows 7 (aka "spend-6-months-relearning-things-you-already-know" edition), and Windows 8 (aka "pretend-your-PC-is-a-tablet" edition), the more people will want to stay with 32-bit XP, the last great Windows OS.

                          Thanks again for your help, and happy new year


                          • #14
                            Don't really have time to address all the points.

                            But as a brief comment.

                            Even if the bus is 64bit, data will often be loaded in 32bit chunks when running 32bit code. Again, see this old post for details. Why do the compilers do this? I assume it is to maintain backwards compatibility with old x86 CPUs. Although there are settings in compilers to break compatibility, they are rarely used.

                            > When's the last time you saw a data set that
                            > needed to keep track of more than 4 billion different elements?

                            Happens all the time. Here are few examples.
                            - Encryption keys (128bit would be even better for this)
                            - Hash values
                            - Time measurement (think nanoseconds)
                            - US budget deficit (although I bet they wish it was only 32bit)
                            - Number of people on the planet
                            - Byte offsets on a hard drive
                            - Large memory mapped files
                            - etc..
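
                            For instance, the hard-drive byte-offset case can be shown in a few lines of C (the helper names are illustrative):

```c
#include <stdint.h>

/* Byte offsets past 4 GiB don't fit in 32 bits. Multiplying a sector
   number by the sector size in 32bit arithmetic silently wraps mod 2^32. */
uint32_t offset_32bit(uint32_t sector, uint32_t bytes_per_sector) {
    return sector * bytes_per_sector;            /* wraps on large drives */
}

uint64_t offset_64bit(uint32_t sector, uint32_t bytes_per_sector) {
    return (uint64_t)sector * bytes_per_sector;  /* correct full-width result */
}
```

                            With sector 10,000,000 and 512-byte sectors the true offset is 5,120,000,000 bytes, which is beyond what the 32bit version can represent, so it silently returns a wrong (wrapped) value.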

                            > But here's the rub I think we're both forgetting about:
                            > the x64 instruction set has twice as many registers to work with .

                            Again, read my post from 8 years ago.
                            The differences in registers was covered.

                            > people will want to stay with 32-bit XP, the last great Windows OS

                            There will always be some uninformed nostalgic nutters. The only real reason to use 32bit XP on any PC made in the last 5 years would be that it is too much effort to install something better. We see thousands of benchmark results come in each week, so we see the trends in what people are testing. XP and 32bit are as good as dead, and Win8 is getting more popular.


                            • #15
                              Thanks for all the info. Very educational, and very much appreciated. Now if you'll excuse me, I think I've got some upgrading to do