CPU benchmarks huge changes?

This topic is closed.

  • David (PassMark)
    replied
    There were a few other posts that were just fact free rants, misinformation and repetition which I have deleted.

    Happy to have a discussion on the facts, but it is a waste of time to just keep feeding the trolls.

    Feel free to contact us directly if you have a legitimate issue with the site or software.

    Also: today we did a new patch release that very slightly changes the single-threaded score, so that should resolve some of the concerns about the single-threaded score. Details of the change are here.


    Topic is closed.



  • David (PassMark)
    replied
    Originally posted by EnterpriseNL View Post
    Some time ago, User benchmark did the same thing: they updated their program to support AVX-512, which isn't supported by modern consumer CPUs. So why did everything radically change? Well, good to know that PassMark also isn't trustworthy anymore.
    I've no idea what User benchmark did. Not sure how it is relevant. We don't talk to them. Don't even know who the people behind it are.

    Unfortunately, as per some of the previous posts, this is poorly researched and just wrong.
    AVX-512 is now in consumer CPUs. The entire Ice Lake series supports it, plus others. These are laptops from HP, Lenovo, Acer, etc. How are these not consumer CPUs?

    Allowing AVX-512 to influence the result by a couple of percent isn't a radical change. We've been adding new instruction support in each software release for the last 21 years, but always with care to ensure it doesn't dominate the results. It is amazing that people are complaining about this without actually looking at the numbers to see what the influence really is. It's very hard to take people seriously when they are just parroting what some random 11-year-old kid said anonymously. It really isn't a big deal, and anyone who actually looked at the numbers would see that.







  • David (PassMark)
    replied
    Originally posted by macros View Post
    Dear David,
    I have been using Passmark as my go to benchmark for years to compare processors. Thank you and your team for creating such a good software and simple interface.
    Thanks for the positive feedback.

    Originally posted by macros View Post
    The new algorithm makes this difficult. Of course I see the need to benchmark the newest features in processors.
    I don't see why it is difficult. It is no harder than it was last week.

    Originally posted by macros View Post
    However, not much real software out there makes use of them.
    We disagree. Visual Studio 2019 is in wide use. Out-of-order code execution occurs commonly. ECC encryption is more common than Twofish.

    Originally posted by macros View Post
    See for example here for AVX512 https://en.wikipedia.org/wiki/Advanc...sions#Software
    Only 8 in the list use AVX512. It surely is not complete, but it gives an idea.
    There are more than 20 packages in that surely incomplete list. Plus some of those items are libraries and compilers. Which are used to build a large collection of other software that isn't on that list. I don't know where you got the 8 packages from.

    Originally posted by macros View Post
    It will take a long time until everyday software utilizes these features.
    It's already happened.

    But it is important to note that the AVX-512 part of the benchmark is pretty small. It is a portion of the Extended Instructions Test, which is one of eight tests. If AVX-512 isn't available, FMA instructions are still used.

    For those of you who haven't hand-coded SSE, AVX, FMA and AVX-512 instructions, this is the kind of result to expect (matrix multiplications per second, measured on an i9-7900X):

        SSE: 8,977    AVX: 16,540    FMA: 30,713    AVX-512: 41,181

    We average these numbers in the benchmark, by the way; we don't just take the best number.

    So AVX-512 isn't a huge gain over FMA. And AVX-512 accounts for only a small part of the overall CPUMark result, as it is combined with so many other test results; maybe just a few percent in the final calculation. There is of course no "right" level for inclusion of new code. Different people use different software, so one benchmark can't reflect the needs of everyone.
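    As a rough illustration of how much a single subtest can move a combined score (my own sketch; the equal weighting and geometric-mean combination are assumptions for illustration, not PassMark's published formula), a 30% gain in one of eight subtests shifts the combined score by only about 3%:

    ```python
    from statistics import geometric_mean

    # Hypothetical: 8 equally weighted subtests combined with a geometric mean.
    # The weighting is assumed for illustration only, not PassMark's actual formula.
    base    = [1000.0] * 8              # all subtests at a nominal score
    boosted = [1300.0] + [1000.0] * 7   # one subtest gains 30% (e.g. from AVX-512)

    gain = geometric_mean(boosted) / geometric_mean(base) - 1
    print(f"combined score changes by {gain * 100:.1f}%")  # ~3.3%
    ```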

    Important: AVX512 plays NO PART AT ALL in the single threaded test.

    And this also explains why AMD chips are the current leaders in the charts, despite not having AVX512.
    AVX512 isn't that important, but we have to build something to last for 10 years, so we felt the need to include it to some degree.

    Originally posted by macros View Post
    So it would be awesome to have two entries if possible.
    One showing the capabilities of the CPU,
    your new algorithm seems to be suited for that as far as I as an amateur can evaluate that from your descriptions.
    And one showing how current usual demanding tasks perform,
    the previous algorithm seems to be good for that,
    Way too confusing. You can't really show the capabilities of a CPU with a single number. If you want the old data, it is still available, as posted multiple times above.
    You can't imagine the pain of trying to explain this to people 20 times a day (most of whom struggle to understand even the concept of single-thread / multi-thread).



  • EnterpriseNL
    replied
    Some time ago, User benchmark did the same thing: they updated their program to support AVX-512, which isn't supported by modern consumer CPUs. So why did everything radically change? Well, good to know that PassMark also isn't trustworthy anymore.



  • Ivan Ivanov
    replied
    So far, so good!
    Murphy's Law (still) continues to apply:

    Vyshkovsky's theorem:

    Regardless of the units used by the supplier or buyer,
    the manufacturer will use its own units converted to the supplier or buyer units using weird and unnatural conversion factors.

    [Image: 1fish21.png]


    Approaching stage 5 - "acceptance" (an agreement with an inevitable fate).
    Last edited by Ivan Ivanov; 03-14-2020, 07:34 PM.



  • macros
    replied
    Dear David,

    I have been using Passmark as my go to benchmark for years to compare processors. Thank you and your team for creating such a good software and simple interface.

    The new algorithm makes this difficult. Of course I see the need to benchmark the newest features in processors.
    However, not much real software out there makes use of them. See for example here for AVX512: https://en.wikipedia.org/wiki/Advanc...sions#Software
    Only 8 in the list use AVX512. It surely is not complete, but it gives an idea.
    It will take a long time until everyday software utilizes these features.

    So it would be awesome to have two entries if possible.
    One showing the capabilities of the CPU,
    your new algorithm seems to be suited for that as far as I as an amateur can evaluate that from your descriptions.
    And one showing how current usual demanding tasks perform,
    the previous algorithm seems to be good for that,
    at least I could confirm its rankings with several personal tests with software like h264, 7zip, Video cutting software (kdenlive), image managing software (imagemagick, digikam).

    Maybe you can also take this idea further in future versions.




  • David (PassMark)
    replied
    Originally posted by WizardOfBoz View Post
    I think its important to understand that when you put a very useful database of stuff online, it's going to be a big hit to a lot of people if you arbitrarily change it
    There were months of warning that new software and new benchmarks were coming. Years, actually. But the software was available to the general public for a couple of months.
    We couldn't tell people how it would affect every CPU model in advance, as we didn't know. We don't have 2,000 different CPU models to test on. We have about 20 models.

    Originally posted by WizardOfBoz View Post
    Passmark has released a new version that is superior in evaluating some of the newer chips which are insanely powerful, but the new version gives different numbers than previous versions
    More or less. The CPUMark number for some of the newer CPUs didn't actually change that much as we re-scaled the new number to match the old one. See my earlier post.

    One thought would be to use a multiplier to scale all the data from previous versions.
    See my first post which covered this.

    Originally posted by WizardOfBoz View Post
    I would suggest having two columns (for now) on your spreadsheet. One for the old DATA, and another for the new test
    There is no spreadsheet. But this was covered in my earlier post.

    Originally posted by WizardOfBoz View Post
    The loss of the data was a big hit for me.
    There is no loss. Nothing was deleted. We just moved the old results to a different web page.

    Originally posted by WizardOfBoz View Post
    before the differences between the 3820qm, 3840qm, 3920xm, and 3940xm were pretty clear (like 5 or 10 percent improvement for each jump). The current listings show these all having about the same CPUBenchmark!
    I should quote Alberto Brandolini at this point.

    Anyway, I spent some time looking into this (why, I don't know), and the numbers don't back your assertion.

    I don't know what you imagine the numbers are, but here are the real numbers.

    PT9 results (CPUMark, Single threaded result)
    Intel Core i7-3820QM @ 2.70GHz 8,397, 1,844
    Intel Core i7-3840QM @ 2.80GHz 8,759, 1,914
    Intel Core i7-3920XM @ 2.90GHz 8,983, 1,963
    Intel Core i7-3940XM @ 3.00GHz 9,133, 1,982


    PT10 results as of 14/Mar/2020, 3pm (CPUMark, Single threaded result)
    Intel Core i7-3820QM @ 2.70GHz 5,701, 1878
    Intel Core i7-3840QM @ 2.80GHz 5,694, 1914
    Intel Core i7-3920XM @ 2.90GHz 5,839, 1963
    Intel Core i7-3940XM @ 3.00GHz 5,937, 1982

    PT9 CPUMark differences are,
    4.3%, 2.6% & 1.7% between the 4 CPUs. None of them hit the 10% you claimed. None of them even hit 5%.

    PT10 CPUMark differences are,
    0.1%, 2.5% & 1.7% between the 4 CPUs. Clearly they are not all identical scores as you claimed. But there is one anomaly of 4%.
    This 4% anomaly is due to lack of PT10 samples at this point in time for the 3 faster & rarer CPUs in this bunch.

    PT9 SingleThread differences are,
    3.8%, 2.6%, 1.0%. Again, nothing like the 5% to 10% you claim.

    PT10 SingleThread differences are,
    1.9%, 2.6%, 1.0%. Clearly they are not all the same as you claimed. But to be fair they are super close. But they were always super close.
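    For anyone who wants to recheck these deltas, they fall out of a few lines of arithmetic on the scores listed above (a quick sketch; the helper name is mine):

    ```python
    def pct_diffs(scores):
        """Percent difference between each consecutive pair of scores."""
        return [round((b - a) / a * 100, 1) for a, b in zip(scores, scores[1:])]

    # PT9 figures for the i7-3820QM, 3840QM, 3920XM and 3940XM, as listed above
    print(pct_diffs([8397, 8759, 8983, 9133]))  # CPUMark:      [4.3, 2.6, 1.7]
    print(pct_diffs([1844, 1914, 1963, 1982]))  # SingleThread: [3.8, 2.6, 1.0]
    ```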

    It is also worth noting that,
    A) In real life these tiny differences are meaningless. No one is going to notice a 4% performance difference in the CPU. Other factors are way more important, like disk speed, battery life, etc.
    B) These differences are really below the margin of error for the benchmark. Especially for laptops, which have a wide distribution of results due to power-saving measures, thermal throttling and bad RAM setups. Hundreds and hundreds of samples are needed before anyone could claim accuracy to around the 1% level.
    C) The numbers are going to bounce around a bit for a few weeks until new averages are found, like the 4% anomaly above. So there is every chance the numbers will be slightly different tomorrow, and different again the day after that.
    D) For those people who want nothing to change: the whole point of releasing a new benchmark is that the new numbers aren't the same as the old ones. If they were all the same, there would have been no point in releasing new software. The software needs to keep up with modern hardware to remain relevant. If that means a few weeks of minor inconvenience to get to a better place for the next 10 years, then we are prepared to wear that.


    NOTE: I'm happy to address any new issues or major discrepancies, but if new posts are just re-asking the same questions, or are just deliberate (or lazy) misrepresentations I'm going to close the topic.



  • David (PassMark)
    replied
    Originally posted by WizardOfBoz View Post
    Another thought would be to post the old version database (which has enormous amounts of information distilled in it) separately, so folks could access it if they want.
    This has been answered in my first post and then several times more in following posts.

    All the old data is here,
    https://www.cpubenchmark.net/pt9_cpu_list.php

    All the individual V9 baselines that make up these averages (more than a million of them) are also still available from within the PerformanceTest software.

    If we get this same question again because people didn't even bother to read the first post, it is going to be deleted.



  • Ivan Ivanov
    replied

    Originally posted by WizardOfBoz View Post

    1) Not removing or changing posted data that has been used for years by people (put back the old numbers!)

    2) Add a column for the new benchmarks. Either populate with scaled version 9 data (with a notation stating so) or leave blank. As new data pours in, add the new data in the new column.

    Thanks
    +1 !

    If you are redirecting passenger flow from a reliable and verified aircraft (v9) to a spaceship under construction (v10),
    why do you continue to display the old "reliable aircraft" sign at the entrance? That isn't the truth!

    Now the situation looks similar:
    1) highly qualified programmers have made a super ingenious new testing system,
    2) then a couple of teenagers (from Pakistan and China, working and earning money after studying for food) “scaled” the previous results (v9) ...

    This causes uncertainty and distrust among users.

    So - Please, Add a NEW column for the new benchmarks!
    Last edited by Ivan Ivanov; 03-13-2020, 09:51 PM.



  • dune.kb
    replied
    Originally posted by WizardOfBoz View Post
    I think it's important to understand that when you put a very useful database of stuff online, it's going to be a big hit to a lot of people if you arbitrarily change it. Your "I think we made a mistake" comment reflects an extraordinarily mature outlook, though. My understanding is that
    1) Passmark values had been based upon aggregated hard test results from previous versions. This is DATA (important!)
    2) Passmark has released a new version that is superior in evaluating some of the newer chips which are insanely powerful, but the new version gives different numbers than previous versions
    3) One thought would be to use a multiplier to scale all the data from previous versions. This changes it from raw data directly related to tests, to an inference. Probably less useful.

    I would suggest having two columns (for now) on your spreadsheet. One for the old DATA, and another for the new test. Initially you could use a multiplier to get an "Inferred version 10 result". Once results start coming in from the new version, the "Inferred version 10 result" would be replaced by "Actual version 10 test results". In this way, the old data would be available but the new version data would have a place to go.

    In sum, suggest
    1) Not removing or changing posted data that has been used for years by people (put back the old numbers!)
    2) Add a column for the new benchmarks. Either populate with scaled version 9 data (with a notation stating so) or leave blank. As new data pours in, add the new data in the new column.

    Thanks
    very good post - thank you - PassMark team - read it carefully, please!

    It's really not a good decision to change almost everything (what has been accumulated for years) just overnight.

    Somebody mentioned that it would take a year for PT10 results to reach full strength; that's OK. My third suggestion (in addition to these two):
    3) starting e.g. in 2021 (not right now), only the PT10 database would be maintained/visible, once PT10 is more or less stable/debugged/finalised.
    I.e., the old (historical) results would remain at least for 2020.

    Please, please, reconsider all the pros and cons of such a dramatic shift (so many negative signs/opinions from our friends should not be ignored, should they?).

    Thank you.



  • David (PassMark)
    replied
    In the new Passmark v10, this ratio = +14%
    Therefore, at first glance, the new Passmark v10 results suggest: hyper-threading is almost useless.
    I don't have time to check all your numbers. But,
    1) You can't just look at a single CPU model and assume all the other CPUs behave in the same way. It's lazy cherry picking.
    2) Even if we assume it really is +14% for all CPUs that have Hyper-Threading, how is a +14% speed improvement useless?
    3) Moving from 28% to 14% (if that was even correct) isn't "reduced by at least 2 (two) times". It's half.

    Here's a proper study of Hyper-threading from Dell
    http://ftp.dell.com/app/4q02-Len.pdf
    And a quote from it. "Hyper-Threading can improve the performance of some MPI applications, but not all. Depending on the cluster configuration and, most importantly, the nature of the application running on the cluster, performance gains can vary or even be negative"

    That study found performance gains of around +10% (at best). So by your logic we would be showing 40% higher gains than this real-life scenario. But the truth is that it is highly dependent on the code.

    Somewhat unrelated, but interesting, Intel also suggested turning off Hyperthreading as a result of their Spectre security bugs.



  • WizardOfBoz
    replied
    Another thought would be to post the old version database (which has enormous amounts of information distilled in it) separately, so folks could access it if they want.



  • WizardOfBoz
    replied
    I think it's important to understand that when you put a very useful database of stuff online, it's going to be a big hit to a lot of people if you arbitrarily change it. Your "I think we made a mistake" comment reflects an extraordinarily mature outlook, though. My understanding is that
    1) Passmark values had been based upon aggregated hard test results from previous versions. This is DATA (important!)
    2) Passmark has released a new version that is superior in evaluating some of the newer chips which are insanely powerful, but the new version gives different numbers than previous versions
    3) One thought would be to use a multiplier to scale all the data from previous versions. This changes it from raw data directly related to tests, to an inference. Probably less useful.

    I would suggest having two columns (for now) on your spreadsheet. One for the old DATA, and another for the new test. Initially you could use a multiplier to get an "Inferred version 10 result". Once results start coming in from the new version, the "Inferred version 10 result" would be replaced by "Actual version 10 test results". In this way, the old data would be available but the new version data would have a place to go.

    The loss of the data was a big hit for me. I'm trying to figure out whether to upgrade a CPU, and before the differences between the 3820qm, 3840qm, 3920xm, and 3940xm were pretty clear (like 5 or 10 percent improvement for each jump). The current listings show these all having about the same CPUBenchmark! Where did the approximately 20% difference from the 3820qm to the 3940xm go? I'm not even sure what the new numbers represent.

    In sum, suggest
    1) Not removing or changing posted data that has been used for years by people (put back the old numbers!)
    2) Add a column for the new benchmarks. Either populate with scaled version 9 data (with a notation stating so) or leave blank. As new data pours in, add the new data in the new column.

    Thanks



  • Ivan Ivanov
    replied
    My personal test result in Passmark v10 for the E3-1220 v2 is a bit bigger than indicated on the site: 4923 vs 4667 (+5%)

    [Image: E3-1220_v2_PM_v09.png]
    Last edited by Ivan Ivanov; 03-13-2020, 01:50 PM.



  • Ivan Ivanov
    replied
    Thank you, David!

    English is not my native language, so thank you for the corrections
    (although I love to think and write in English)

    Of course, the frequency in GHz... and "hyper threading" is right too ...

    According to the Passmark v10 benchmarks, the Xeon E5-1620 v2 is 30% faster than the Xeon E3-1220 v2 (6179/4667).

    Given that the clock frequency of the Xeon E5-1620 v2 is 20% higher (3700 MHz, or 3.7 GHz)
    compared to the Xeon E3-1220 v2 (3100 MHz, or 3.1 GHz), the performance-to-frequency ratio is higher by 11%.

    1) the E5-1620 v2 has 8 threads compared to 4 for the E3-1220 v2;
    2) the E5-1620 v2 has 1866 MHz RAM speed compared to 1600 MHz for the E3-1220 v2;
    3) the E5-1620 v2 has 4 channels of access to RAM compared to 2 channels for the E3-1220 v2.

    According to the Passmark v9 results (9504/6760), adjusted for the difference in frequencies, the performance-to-frequency ratio of the E5-1620 v2 is higher by 17.6%.

    It seems that the new Passmark v10 "does not like" hyper-threading
    and reduces its influence on the final result by at least 1.6 times compared to Passmark v9 in this case.

    For the pairs Intel i3 3220/3230/3240 vs Pentium 2120/2130/2140, the average ratio in Passmark v9 = +28% (due to hyper-threading).
    In the new Passmark v10, this ratio = +14%.
    The effect of hyper-threading on the final result is reduced by at least 2 (two) times compared to Passmark v9 in this case.

    Therefore, at first glance, the new Passmark v10 results suggest: hyper-threading is almost useless.

    I do not think that processor manufacturers will be happy to agree with this ...
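    For what it's worth, the per-clock arithmetic in the post above can be reproduced with a short script (my own sketch, using the scores and clock speeds quoted here; small differences from the 11% and 17.6% figures are just rounding):

    ```python
    def per_clock_gain(score_a, ghz_a, score_b, ghz_b):
        """How much higher CPU A's score-per-GHz is than CPU B's, in percent."""
        return round(((score_a / ghz_a) / (score_b / ghz_b) - 1) * 100, 1)

    # E5-1620 v2 (3.7 GHz) vs E3-1220 v2 (3.1 GHz)
    print(per_clock_gain(6179, 3.7, 4667, 3.1))  # v10: 10.9
    print(per_clock_gain(9504, 3.7, 6760, 3.1))  # v9:  17.8
    ```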


    And, please:
    Show (additionally) the v9 test results for "All threads" and "Single-core" in compare-CPU mode (in the table)!
    Last edited by Ivan Ivanov; 03-13-2020, 07:49 AM.

