CPU benchmarks huge changes?

This topic is closed.

  • David (PassMark)
    replied
    - show your changes side by side
    Screen space is a problem for mobile devices. We prefer to leave the old data on a separate page.
    Plus for 99% of users having two very similar sets of numbers on the same page just adds to confusion and doesn't help them.

    as well as your algorithm change
    List of changes is here
    https://www.passmark.com/products/pe...st/history.php

    Description of new tests can be found here
    https://www.cpubenchmark.net/cpu_test_info.html
    (and in the included User's Guide for PerformanceTest as well)

    We've only got around 20 test machines here. So we don't know the full impact of a new release until after it is released.

    alert/notify in advance a change is coming
    The alpha / beta release was public on our web site months before the release.

    I've been tweeting about it since Oct 2019.
    https://twitter.com/PassMarkInc/stat...348339712?s=20

    Unfortunately I've only got 600 followers so it probably didn't reach a huge audience, but it isn't from lack of trying.

    We've emailed a bunch of reviewer / press people. Some replied. It is in their queue to look at (we are told).

    let folks make their own decision based on information from both algorithms.
    All the data is still available. Nothing has been removed. Folks can decide what data they want to use. But we would think that for the majority of users the above discussion is just techno-babble and more (slightly conflicting) data isn't going to help.



  • David (PassMark)
    replied
    i3-3220 (3.3 Mhz, 2 cores & 4 threads) - rating 2278 in the new version
    E3-1220 v2 (3.1 Mhz, 4 cores & 4 threads) - rating 4667 in the new version.
    E5-1620 v2 (3.7 Mhz, 4 cores & 8 threads) - rating 6179 in the new version.

    That is, according to the new test (at almost the same clock frequencies and exactly the same "Ivy Bridge" CPU manufacturing technology):

    1) The hyper trading mode is useless, judging by the difference between (1) and (2).

    2) Doubling the number of threads from 4 to 8 changes performance by only +11%?
    (The clock frequency of the E5-1620 v2 is 20% higher than that of the E3-1220 v2.)
    Honestly, this comes across as more of a rant than a sensible argument. There are only a couple of us working on this, but we have 3M+ users, so the time available to respond to each user is pretty limited. I'll just focus on the first post, but the comments apply generally to all CPUs.

    Super briefly.
    1. "hyper trading mode" isn't even a thing. I guess you are talking about hyper-threading. But making up terminology (or getting it wrong) doesn't help your credibility.
    2. You've ignored turbo speeds. The i3-3220 doesn't turbo at all.
    3. CPUs don't run at MHz frequencies. They run at GHz. Typo or ignorance? Either way, it doesn't help your credibility.
    4. 3.1 GHz isn't "almost the same" as 3.7 GHz. It's a 20% difference.
    5. The benefit of hyper-threading (virtual cores) is very variable. In some situations it hurts performance, so 4 physical cores are significantly better than 2 physical + 2 virtual. See my screenshot below. It is also very hard to predict in advance when it will help and when it won't.
    6. You've completely ignored the massive difference in cache. The i3-3220 has just 3 MB, while the E5-1620 v2 has 10 MB.
    7. You've ignored the difference in the number of memory channels (2 vs 4). Probably a good chunk of those i3-3220 machines were running single channel.
    8. You've ignored the difference in max memory speed (DDR3-1600 vs DDR3-1866).
    9. You've ignored instruction set differences. For example, support for accelerated encryption (AES) in the Xeons.
    10. As already stated scores are going to move around a bit until new averages are found.


    Example of virtual cores not helping performance (in a machine with 32 physical cores & 64 virtual).
    [Image: AdvancedCPUTest.png]
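
    To make point 5 above concrete, here is a minimal, purely illustrative sketch (not PassMark's benchmark code) that times the same fixed amount of integer work with 1 thread, with one thread per physical core, and with one thread per logical core. The physical core count is a hard-coded assumption for the example; on many machines the step from physical to logical (SMT) threads adds little extra throughput, and sometimes none.

// Illustrative sketch only (not PassMark's code): time a fixed amount of
// integer work at different thread counts to see how much the virtual
// (SMT / hyper-threaded) cores really add on a given machine.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

// Compute-bound integer work (xorshift-style mixing).
static std::uint64_t spin(std::uint64_t iterations) {
    std::uint64_t x = 0x9E3779B97F4A7C15ULL;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
    }
    return x;
}

// Split the same total work across 'threads' workers; return elapsed seconds.
static double run(unsigned threads, std::uint64_t totalIterations, std::uint64_t& checksum) {
    std::vector<std::thread> pool;
    std::vector<std::uint64_t> sink(threads);
    auto start = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&sink, t, totalIterations, threads] {
            sink[t] = spin(totalIterations / threads);
        });
    for (auto& th : pool)
        th.join();
    auto stop = std::chrono::steady_clock::now();
    checksum = 0;
    for (auto v : sink)
        checksum ^= v;                      // keep the work observable
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    const std::uint64_t work = 2'000'000'000ULL;   // fixed total amount of work
    const unsigned physical = 2;                   // assumption: e.g. an i3-3220
    const unsigned logical = std::thread::hardware_concurrency();  // 4 with HT on

    for (unsigned n : {1u, physical, logical}) {
        std::uint64_t checksum = 0;
        double secs = run(n, work, checksum);
        std::cout << n << " thread(s): " << secs << " s (checksum " << checksum << ")\n";
    }
    // Typical outcome: close to a 2x speedup going from 1 thread to the physical
    // core count, but far less than another 2x going from physical to logical,
    // because the SMT siblings share the same execution units.
    return 0;
}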



  • Ivan Ivanov
    replied
    Unfortunately, PassMark is no longer credible...

    [Image: 1PM5.png]

    [Image: 1PM6.png]



  • affinityhb
    replied
    Originally posted by David (PassMark):

    There were a few posts about it in the forum. But in the end the major impact was on those algorithms that did a lot of kernel context switches. That means software that made a lot of short requests to the Windows O/S to do various tasks suffered the most. The CPU result didn't move very much, as nearly no operations were performed in the Windows kernel. But the 2D and disk operations did suffer a bit. 2D was probably the worst, as it was almost entirely reliant on the O/S to perform operations (e.g. draw some text, bitblt an image).



    If you were happy with the hardware last week there is no reason not to be happy today. The algorithms did need to change; some of them really were based on 20-year-old code. We'd surely get more criticism over the next 5 years if we didn't change. People would be asking why we were not using AVX-512 instructions, why we were using a 10-year-old physics library, why we were using encryption algorithms that no one uses anymore, why some of the code wasn't using the available compiler optimisation flags, etc.
    Various users and the CPU vendors were already asking these questions years ago, in fact.



    Some CPUs will look better as a result of these changes. Some won't. It is just a fact of life that CPUs have different features and architectures, and different code benefits different CPUs. The actual number (the CPUMark) isn't really important. It doesn't have a meaningful unit of measurement (like MB/sec or frames per second), so the value of the number doesn't matter, positive or negative. All that matters is relative performance. So a lower CPUMark isn't bad if all the CPUs you compare it to are also lower. As per the table in my earlier post, we tried to avoid big swings for most new CPUs. Despite this being nearly mission impossible, we succeeded to a large degree.

    There isn't one "right" benchmark number. You can't summarise the entire performance of a CPU with one number. What's right depends on how you use your computer. For some people, "right" means matching gaming performance. For others it will mean matching the performance of scientific applications. You can't do this with one number. So some people are always going to get upset and claim the number is wrong.

    We totally get that change brings disruption. It is an enormous amount of work and pain for us as well. But we think it is necessary to remain relevant for the next 10 years.

    If you want to look at the old V9 results to compare, you can find them here:
    https://www.cpubenchmark.net/pt9_cpu_list.php

    You can also still download V9 of PerformanceTest to look at all the individual baseline submissions (all 1,200,000 of them).
    ...And this is why you are no longer credible...
    It's not whether I am happy or not that is insulting, but the fact that you think explaining it away is acceptable...
    You should:
    - show your changes side by side
    - as well as your algorithm change
    - alert/notify in advance a change is coming
    - let folks make their own decision based on information from both algorithms.

    ...nice picture







  • Ivan Ivanov
    replied
    To the moderator:
    please delete my two "Unapproved" posts from today
    https://www.passmark.com/forum/pc-ha...6841#post46841
    https://www.passmark.com/forum/pc-ha...6844#post46844



  • Ivan Ivanov
    replied
    I understand that people don't like seeing their old CPUs made to look suddenly worse.
    But this is what happens if new code is written that makes use of features that old CPUs don't have.
    It makes no difference whether you measure length in inches or centimeters, as long as the results are repeatable, correct, and can be used for comparison.

    Unfortunately, the new PassMark V10 results are not connected to the actual hardware structure of the processor,
    and are therefore (since March 2020) almost useless for comparison.

    An example

    [Image: 1PM3.png]

    i3-3220 (3.3 Mhz, 2 cores & 4 threads) - rating 2278 in the new version
    E3-1220 v2 (3.1 Mhz, 4 cores & 4 threads) - rating 4667 in the new version.
    E5-1620 v2 (3.7 Mhz, 4 cores & 8 threads) - rating 6179 in the new version.

    That is, according to the new test (at almost the same clock frequencies and exactly the same "Ivy Bridge" CPU manufacturing technology):

    1) The hyper trading mode is useless, judging by the difference between (1) and (2).

    2) Doubling the number of threads from 4 to 8 changes performance by only +11%?
    (The clock frequency of the E5-1620 v2 is 20% higher than that of the E3-1220 v2.)

    [Image: 1PM4.png]

    The hyper trading mode is useless, judging by the 3.78% difference between the G2140 and the i3-3220
    (with exactly the same frequency & Ivy Bridge tech.).

    The new V10 test results have nothing to do with the real hardware
    (or with actual CPU speed and performance in typical user tasks).

    Therefore, the new results are completely devoid of physical and practical meaning in terms of comparison.

    It's a fine thing to come up with something new.

    But please then name the results something new - PassMark 2020 / PassMark V10 and so on - and don't use the "classic" PassMark name!!!

    And please show the V9 test results for "All threads" and "Single-core" in CPU compare mode (in the compare table)!


    Agreed:
    If the ruler of a familiar brand that you have used for 8 years turns out (since March 2020) to be rubber instead of metal, so that the first meter now has 80 centimeters and the second has 60 (or the first foot has 10 inches and the second has 7), then it is a completely different measuring instrument, not the same one.
    Last edited by Ivan Ivanov; Mar-12-2020, 11:12 AM.



  • David (PassMark)
    replied
    new Single-Tr. rating looks like bullshit. For E3-1650 change in 50%
    No it isn't.
    The single thread score was 1948. It is now 1756. So that is a difference of 9.8%.
    Even if you look at the multi-threaded results, the difference isn't 50% (it's closer to 30%).
    But the E5-1650 is an old CPU now. As pointed out in the posts & table above, old CPUs are going to suffer a bit with the new algorithms.
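
    For anyone who wants to double-check the arithmetic, a quick sketch using only the numbers quoted in this thread (the 1948 vs 1756 single-thread scores, and the 3.1 GHz vs 3.7 GHz base clocks mentioned earlier):

// Quick check of the percentage figures quoted in this thread.
#include <iostream>

int main() {
    // E5-1650 single-thread score, old vs new (values from the post above).
    double oldScore = 1948.0, newScore = 1756.0;
    std::cout << "Single-thread drop: "
              << (oldScore - newScore) / oldScore * 100.0 << "%\n";   // ~9.9%, nowhere near 50%

    // Base clock gap between the E3-1220 v2 (3.1 GHz) and E5-1620 v2 (3.7 GHz).
    std::cout << "Clock difference:   "
              << (3.7 - 3.1) / 3.1 * 100.0 << "%\n";                  // ~19.4%, i.e. roughly 20%
    return 0;
}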

    I understand that people don't like seeing their old CPUs made to look suddenly worse. But this is what happens if new code is written that makes use of features that old CPUs don't have.

    You also don't need to use the Wayback Machine & take screenshots. As pointed out above a couple of times, the V9 results have all been saved here for your viewing pleasure:
    https://www.cpubenchmark.net/pt9_cpu_list.php

    I agree the single thread difference between the i3-3220 & E3-1220 v2 is a bit strange. We'll have a look at this.



  • Ivan Ivanov
    replied
    About the 5955 and 1756 in the new version.



  • Ivan Ivanov
    replied
    For the E3-1650 the change is 50%.

    [Image: 1PM.png]



  • Ivan Ivanov
    replied
    Unfortunately, the new Single-Tr. rating looks like bullshit.

    For example:
    1571 for the i3-3220 @ 3.3 MHz
    1836 for the E3-1220 v2 @ 3.1 MHz (3.4 MHz in turbo mode).

    These CPUs use exactly the same Ivy Bridge technology, so the real difference is less than 5%.

    IMHO, without the old database results (from the years 2012-2019) the passmark.com site is useless!



  • David (PassMark)
    replied
    I don't remember any adjustments of any significance for Meltdown or Spectre patching.
    There were a few posts about it in the forum. But in the end the major impact was on those algorithms that did a lot of kernel context switches. That means software that made a lot of short requests to the Windows O/S to do various tasks suffered the most. The CPU result didn't move very much, as nearly no operations were performed in the Windows kernel. But the 2D and disk operations did suffer a bit. 2D was probably the worst, as it was almost entirely reliant on the O/S to perform operations (e.g. draw some text, bitblt an image).
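
    As a rough, hypothetical illustration of that point (this is not PassMark's test code), compare a loop that stays entirely in user mode with one that makes a tiny OS request on every iteration; the second is the kind of workload that pays the Meltdown/Spectre mitigation overhead on every kernel entry/exit. The temp file name is just a placeholder for the example.

// Illustrative sketch only (not PassMark's code): a pure user-mode compute
// loop vs. a loop that enters the kernel on every iteration. Kernel-heavy
// loops like the second one were hit hardest by the Meltdown/Spectre
// mitigations, because every syscall pays extra transition overhead.
#include <chrono>
#include <cstdio>
#include <iostream>

int main() {
    constexpr int N = 100000;
    using ms = std::chrono::duration<double, std::milli>;

    // 1) Pure computation: never leaves user mode.
    auto t0 = std::chrono::steady_clock::now();
    volatile double x = 1.0;
    for (int i = 0; i < N; ++i)
        x = x * 1.0000001 + 0.5;
    auto t1 = std::chrono::steady_clock::now();

    // 2) Kernel-heavy: one small unbuffered file write per iteration
    //    (each write is a system call, i.e. a user -> kernel -> user transition).
    std::FILE* f = std::fopen("syscall_test.tmp", "wb");   // placeholder temp file name
    if (!f) return 1;
    std::setvbuf(f, nullptr, _IONBF, 0);                   // unbuffered: every fwrite hits the OS
    auto t2 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)
        std::fwrite(&i, sizeof i, 1, f);
    auto t3 = std::chrono::steady_clock::now();
    std::fclose(f);
    std::remove("syscall_test.tmp");

    std::cout << "compute-only loop:  " << ms(t1 - t0).count() << " ms\n";
    std::cout << "syscall-heavy loop: " << ms(t3 - t2).count() << " ms\n";
    // The second figure is dominated by kernel entry/exit cost, so it moves when
    // the OS adds mitigations; the first barely changes. 2D drawing calls behave
    // more like the second loop, which is why the 2D results suffered most.
    return 0;
}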

    To have such dramatic swings in results especially in the negative it almost voids some of my purchases
    If you were happy with the hardware last week there is no reason not to be happy today. The algorithms did need to change; some of them really were based on 20-year-old code. We'd surely get more criticism over the next 5 years if we didn't change. People would be asking why we were not using AVX-512 instructions, why we were using a 10-year-old physics library, why we were using encryption algorithms that no one uses anymore, why some of the code wasn't using the available compiler optimisation flags, etc.
    Various users and the CPU vendors were already asking these questions years ago, in fact.

    To have such dramatic swings in results especially in the negative
    Some CPUs will look better as a result of these changes. Some won't. It is just a fact of life that CPUs have different features and architectures, and different code benefits different CPUs. The actual number (the CPUMark) isn't really important. It doesn't have a meaningful unit of measurement (like MB/sec or frames per second), so the value of the number doesn't matter, positive or negative. All that matters is relative performance. So a lower CPUMark isn't bad if all the CPUs you compare it to are also lower. As per the table in my earlier post, we tried to avoid big swings for most new CPUs. Despite this being nearly mission impossible, we succeeded to a large degree.

    There isn't one "right" benchmark number. You can't summarise the entire performance of a CPU with one number. What's right depends on how you use your computer. For some people, "right" means matching gaming performance. For others it will mean matching the performance of scientific applications. You can't do this with one number. So some people are always going to get upset and claim the number is wrong.

    We totally get that change brings disruption. It is an enormous amount of work and pain for us as well. But we think it is necessary to remain relevant for the next 10 years.

    If you want to look at the old V9 results to compare, you can find them here:
    https://www.cpubenchmark.net/pt9_cpu_list.php

    You can also still download V9 of PerformanceTest to look at all the individual baseline submissions (all 1,200,000 of them).



  • affinityhb
    replied
    [Moderator] Deleted duplicate earlier version of this post.

    Editing my last post...

    I have been using your site for quite some time. It was reported to be a comprehensive, legit, and unbiased site.
    I have based a lot of purchases on it and even directed some of my customers to include these benchmark ratings/reports
    in their purchases. I also run a few media servers, and this site was used to help gauge a server build or upgrade.
    With that said, I don't remember any adjustments of any significance for Meltdown or Spectre patching... at least no reply to my inquiries.

    To have such dramatic swings in results, especially in the negative, almost voids some of my purchases and my consulting on purchases. (45-60 since Nov, not much but a lot to me.)
    Not to mention making the score suspect...
    This site has now become nice charts and graphs... but no longer accurate, reliable, or even an authority...
    Very disappointed: no notice, no mention of the changes on the site. Such significant changes should be highlighted. But it has all been false reporting and bad information...

    Good luck everyone.. enjoy the pictures!!

    Here's a novel idea... add a V10 score column so that folks can see the disparity in scores and criteria (use a hover on the column title to describe the major contributing factors in V10 [dated]).

    Respectfully,



  • David (PassMark)
    replied
    There is a discussion here about single thread performance.

    New CPUs do a much, much better job of out-of-order execution, and more of them have this feature than in the past.

    From Wikipedia,
    "The high logical complexity of the out-of-order technique is the reason that it did not reach mainstream machines until the mid-1990s. Many low-end processors meant for cost-sensitive markets still do not use this paradigm due to the large silicon area required for its implementation"

    But my statement was more of a general statement meant to encompass improvements in pipelines, queues, branch prediction and execution units. So, for example, the integer benchmark test previously didn't allow much out-of-order execution. This wasn't deliberate; it was just an artifact of the nearly 20-year-old code we had for the integer test. So the integer benchmark result was largely dependent on clock speed. At the same clock speed a Pentium and an i9 CPU could get roughly the same result (not exactly, but hopefully you get the idea). By restructuring the code, some out-of-order execution is now possible, if the CPU supports it and there are enough integer execution units in the CPU. So the benchmark result is now influenced by both the clock speed and the CPU's architectural improvements (i.e. a CPU with OOO & two or more integer execution units will do better than before). Meaning newer CPUs will tend to score better.
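
    To illustrate what "restructuring the code so some out-of-order execution is possible" can mean in practice, here is a generic sketch (not the actual PerformanceTest integer test) that runs the same amount of integer work first as one long dependency chain and then as four independent chains. A CPU with out-of-order execution and multiple integer execution units can overlap the independent chains, so the second version typically runs noticeably faster, while an in-order CPU sees little difference.

// Generic illustration (not the actual PerformanceTest integer test):
// the same integer work written as one long dependency chain, and as four
// independent chains that an out-of-order CPU with several integer units
// can execute in parallel.
#include <chrono>
#include <cstdint>
#include <iostream>

static const std::uint64_t MUL = 6364136223846793005ULL;   // LCG-style constants
static const std::uint64_t ADD = 1442695040888963407ULL;

// Every step depends on the previous one: no instruction-level parallelism.
static std::uint64_t one_chain(std::uint64_t n) {
    std::uint64_t a = 1;
    for (std::uint64_t i = 0; i < n; ++i)
        a = a * MUL + ADD;
    return a;
}

// Same total work split over four independent accumulators: an OOO core
// can keep several of these multiply-adds in flight at once.
static std::uint64_t four_chains(std::uint64_t n) {
    std::uint64_t a = 1, b = 2, c = 3, d = 4;
    for (std::uint64_t i = 0; i < n; i += 4) {
        a = a * MUL + ADD;
        b = b * MUL + ADD;
        c = c * MUL + ADD;
        d = d * MUL + ADD;
    }
    return a ^ b ^ c ^ d;
}

template <typename F>
static void time_it(const char* label, F f, std::uint64_t n) {
    auto start = std::chrono::steady_clock::now();
    std::uint64_t result = f(n);
    auto stop = std::chrono::steady_clock::now();
    std::cout << label << ": "
              << std::chrono::duration<double>(stop - start).count()
              << " s (checksum " << result << ")\n";
}

int main() {
    const std::uint64_t n = 400'000'000ULL;
    time_it("one dependency chain   ", one_chain, n);
    time_it("four independent chains", four_chains, n);
    return 0;
}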

    As an example, AMD Ryzen has four 128-bit execution units for floating point and vector operations.

    I think the newest version will not be accurate for a long time
    Using the data we have already got, we extrapolated it out today to fill in some of the holes for the rare and old CPUs. This should give them a more reasonable starting point for their new averages. So I think it is already looking better today than it was yesterday. While there will surely be a few quirks (and I am sure people will let us know about them), I am fairly confident it will be looking a lot better in the coming weeks.
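
    Purely as a guess at what that kind of extrapolation could look like (PassMark has not published the actual procedure, so both the approach and the numbers below are made up for illustration), one simple option is to scale a rare CPU's old V9 average by the typical V9-to-V10 ratio seen among related CPUs that already have enough V10 samples:

// Hypothetical sketch only: PassMark has not published how the starting
// values for rare/old CPUs were extrapolated. One plausible approach is to
// scale a CPU's V9 score by the average V9 -> V10 ratio of related CPUs
// that already have plenty of V10 submissions.
#include <iostream>
#include <vector>

struct KnownCpu {
    double v9Score;   // established average under the V9 tests
    double v10Score;  // new average under the V10 tests
};

// Estimate a V10 starting score for a CPU with a V9 average but few or no V10 samples.
double estimateV10(double v9Score, const std::vector<KnownCpu>& relatedCpus) {
    double ratioSum = 0.0;
    for (const auto& cpu : relatedCpus)
        ratioSum += cpu.v10Score / cpu.v9Score;
    const double avgRatio = ratioSum / static_cast<double>(relatedCpus.size());
    return v9Score * avgRatio;
}

int main() {
    // Made-up placeholder numbers, for illustration only.
    std::vector<KnownCpu> related = { {4200, 3900}, {6100, 5600}, {8900, 8300} };
    std::cout << "Estimated V10 starting score: " << estimateV10(5000.0, related) << "\n";
    return 0;
}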



  • BenchmarkManiac
    replied
    Originally posted by David (PassMark):
    - Have better support for out of order execution, which is a feature of newer CPUs.
    Out-of-order execution has literally been around for decades on desktops, starting with the first Pentiums in the '90s. It is absolutely not a new feature.

    The Single Thread performance benchmark is very suspicious. The Ryzen 3800X used to be in the top 10, and now it has lost 500 points and is placed 69th.



  • DAWlife
    replied
    Hello, I had the exact same burning question as the OP. Thanks for all the information.
    I've been comparing some older laptops to desktops over the past year, which I use before selling.
    I am shocked at the huge differences in 3rd-generation and 4th-generation CPU scores now, e.g. the 4910MQ and 3740QM, especially after I did some practical tests to compare them with a 3570K in a desktop PC I have. The CPU usage, if I recall correctly, was 24% (3740), 23% (3570), 20% (4900).
    Anyway, I will stay tuned to see how this progresses, as I rely on these scores as a key indicator for my next purchase to use with my music projects.

