  • Inconsistencies between two baselines, both version 10.2, and on identical hardware.

    The baselines in question are #1950637 (20-11-23) and #1959825 (01-12-23). These are appended as PDF attachments.

    Hardware is an AMD Ryzen 9 5900X, an MSI MPG Carbon WiFi Max X570, 4 x Patriot Viper Steel 3600 CL18 DDR4, a Radeon RX 6950 XT, and a Seagate FireCuda 530 4 TB. Cooling is not a problem in any way, nor is power delivery, and the motherboard is more than capable of handling what is asked of it.

    The machine originally had an RX 6600 fitted, but the 6950 XT was fitted prior to the first test, so I added the 200-watt difference between the two graphics cards after the first test. The PSU was 850 watts for the first test and 1050 watts for the second; calculated peak power draw is 690 W / 920 W respectively, and those are pessimistic numbers. I made changes to PBO between tests which basically increased TDC (21 A over stock at the socket) and EDC (5 A over stock) while leaving PPT at around the 156-160 W mark.

    This is, apparently, the sweet spot for my ticket in the silicon lottery, and rather hilariously it renders the much-vaunted Curve Optimizer utterly useless!

    My problem is this: the two baselines just don't make sense. As I didn't foresee the need to take screenshots at the time, here is a simple table of the two so everyone can follow.
    Baseline   1950637   1959825
    Passmark     11736     11144
    CPU          43373     43840
    2D            1635      1461
    3D           37318     38993
    Memory        3734      3707
    Disk         48268     49435
    My understanding of the algorithm used to combine results is rudimentary, but my gut feeling as a system builder (and an 8088/8086/IA32/IA64 assembler programmer) of thirty years tells me that the Passmark numbers should be broadly the same; if anything, the second one should be a bit higher. I searched in vain for a simple spreadsheet for calculating a Passmark rating from the bare category results, so maybe I'll end up building one.

    The basic inference is that 2D carries more weight than 3D, or else memory is over-weighted by quite a bit. But that ridiculous 3D result is above the Passmark average for an RTX 4090, for god's sake!

    My back-of-the-envelope calculation is that the memory and 2D differences cost me around 420 marks, while the processor, disk and 3D gains are worth around 560-580. So I would expect to see a result of circa 11930 for the second test (12000 would have been nice!).


    Can you confirm (or deny) my suspicions as noted, please? Note I did not witness anything abnormal at all during the second test but I always test from bottom to top, and idly note the Passmark number as categories are completed. It seemed low coming up to the CPU tests at the end.

    Several points became obvious during examination of individual test results.

    Certain 2D tests do not respond at all well to increasing EDC, namely image rendering, image filters, Windows interface and PDF rendering. In more general terms, reducing TDC by 6 A and EDC by 13 A (PPT held constant at 156 W) is clearly beneficial across the board. (One should note that when the RX 6600 was in the same rig, it recorded a 2D score of 1470 and a 3D score of 20700, both 30-35% above average.)

    The opposite is true for the 3D tests, however. First I pushed the CPU until it was 11.5% over the Passmark 5900X average, then the RX 6950 XT had its GPU and memory frequencies ramped up and a 20% power boost added (it's now drawing 363 W at peak, against 330 W stock). So the GPU peaks at 2770 MHz (+400), the GDDR6 at 2370 MHz (+120), and it's all as stable as can be! (Most remarkably, it's a cheap Chinese card (XFX of all people) which cost me a princely £499.99; brand names in the UK charge around £770-900 for the same card.) No matter how much juice you throw at it, this CPU/GPU combination always finds a use for it, and not as heat, either. Running HWiNFO, it transpired that your test suite didn't push the CPU above 67C, nor the GPU hot spot over 80C. Yes, I have a huge Phanteks tower case, but everything is still air-cooled. The CPU idles at 29-31C, the GPU nearer 27-28C.

    As for the memory, it's CL18, so you can't push it very far, and overvolting simply slows it down. But Patriot Viper Steel is more than reliable enough, which is why I use it. Still, upping the processor draw slowed the memory controller; maybe 2x32 is better than 4x16 on MSI boards.

    The version 10 disk tests made me laugh. The Seagate runs 7300/6900 read/write in CrystalDiskMark and 7000/6800 in Passmark V11; in Passmark V10.2 you report 4200 read (both tests) and better write results on the second test (4700 vs 5000), which is what more CPU power should give. I assume that V10 was written before Phison E-18 controllers existed in the wild? On V11 the Seagate's read mark is 7062, currently the highest recorded, so reported 4200 is clearly not correct!

    All in all it's been a very useful exercise; I've come to appreciate AMD's Ryzen design as never before (and how their documentation has improved down the years). The Intel/Nvidia strategy of designing in high power consumption in order to avoid having to think about solving problems stands in stark contrast to viewing the GPU/CPU pair as partners in a joint venture. Integrating the pair on one chip will never work, alas; the Ryzen "G" chips are limited because they generate too much heat. As specified they are perfectly satisfactory office machines, and a friend uses one as such, but you'd never want to game on one.

    The best bit? This isn't even a gaming rig, but a private FTP server, with 8 x 8 TB Toshiba N300 NAS drives in two four-spindle 32 TB RAID 0 arrays; array-to-array backups (the purpose of the second array) exceed 55 GB a minute. So why fit an RX 6950? Because I love Flight Sim, and it lets me run with every graphics option turned on!
    Attached Files

  • #2
    The formula for the PassMark rating for the various releases is here.
    It is mostly a weighted harmonic mean. The consequence is that all components need to perform well to get a good overall result; one weak area can pull the overall result down more than it would with a straight average.
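
    As a rough illustration only, here is a minimal Python sketch of a weighted harmonic mean of the suite scores. The weights are made-up placeholders, not PassMark's published values, and the real formula also scales each suite before combining, so the outputs will not match your actual ratings; it just shows why the lowest-scoring suites dominate the combined number.

        # Minimal sketch of a weighted harmonic mean of suite scores.
        # The weights are illustrative placeholders, NOT PassMark's published
        # values; the real formula also scales each suite before combining.
        def weighted_harmonic_mean(scores, weights):
            total_weight = sum(weights.values())
            return total_weight / sum(weights[k] / scores[k] for k in scores)

        baseline_a = {"CPU": 43373, "2D": 1635, "3D": 37318, "Memory": 3734, "Disk": 48268}
        baseline_b = {"CPU": 43840, "2D": 1461, "3D": 38993, "Memory": 3707, "Disk": 49435}
        weights    = {"CPU": 1.0, "2D": 1.0, "3D": 1.0, "Memory": 1.0, "Disk": 1.0}  # placeholders

        for name, scores in (("1950637", baseline_a), ("1959825", baseline_b)):
            print(name, round(weighted_harmonic_mean(scores, weights)))

    Because 2D is by far the smallest term, the drop from 1635 to 1461 drags the combined number down by more than the CPU, 3D and disk gains push it up.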

    that the Passmark numbers should be broadly the same
    The results were 5% different. I think this is broadly the same, given the margin for error in a single measurement.
    To measure small differences of just a few percent, you need to run the tests maybe 10 times, and take an average, or take the max. There is too much background stuff happening on Windows to get perfectly consistent results from one run to the next. And there is also the possibility of a different test environment, as your tests were run more than a week apart.

    I assume that V10 was written before Phison E-18 controllers existed in the wild?
    On V11 the Seagate's read mark is 7062, currently the highest recorded,
    so reported 4200 is clearly not correct!
    4200 MB/s is in fact correct. People are being scammed, in a sense, by the marketing departments of the drive vendors. In most usage conditions these drives won't run as fast as advertised; 4200 is a more real-life usage number, while 7000 MB/s is more of a cherry-picked scenario. But we understand people love big numbers. The sad truth is that these drives often run at more like 100 MB/sec (with small files and a single thread). This is why games still take 20 seconds to load. If the drive could really do 7000 MB/sec then game load times would be under 1 second.
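
    If you want to see that effect yourself, here is a rough sketch (the file path and chunk sizes are placeholders) that reads the same large file once with big sequential chunks and once with small 4 KB chunks. Run it against a freshly written file, or drop the OS file cache between runs, otherwise the second pass just measures RAM.

        # Rough sketch: throughput of one large sequential read pattern vs many
        # small reads of the same file. Path and chunk sizes are placeholders.
        import os, time

        PATH = r"D:\testfile.bin"            # placeholder: any large file
        size = os.path.getsize(PATH)

        def read_all(chunk_bytes):
            start = time.perf_counter()
            with open(PATH, "rb", buffering=0) as f:   # raw, unbuffered reads
                while f.read(chunk_bytes):
                    pass
            return size / (time.perf_counter() - start) / 1e6   # MB/s

        print("8 MB sequential chunks:", round(read_all(8 * 1024 * 1024)), "MB/s")
        print("4 KB chunks:           ", round(read_all(4 * 1024)), "MB/s")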

    Also we don't support V10 anymore.



    • #3
      Hi,

      Thank you for that; I'm a mathematician by qualification, so it's perfectly clear, and also easy enough to replicate. Now I can bin V10 and focus on V11; I bought the update key a month back and V10 will disappear today.

      However, the link you posted provides no separate formula for V11. Does this mean that the formula remains unchanged, or is there another link? I note major changes to the DX10 and DX11 tests, and seemingly to the 2D ones also; the GPU compute tests are down by about 20% as well.

      Your comments re disk speeds are quite correct; the latest version of Flight Sim takes 25 seconds to load even on my rig, from the 4 TB Seagate. That said, apart from my 500 GB Seagate system disk, which has an average file size of 135 KB (thank you, WinSxS), the average file size across the RAID arrays is 163.5 MB, mainly video and a huge music library.

      But there is a way to speed things up on NTFS-based systems, assuming you have disk space to spare, and it doesn't require much of it at all: stop using the default 4K cluster allocation size and up it to 256K, or even 512K. I just saw that you've increased the block size on disk read/write in V11 fourfold, which performs some of that task. Do it within NTFS and it significantly cuts the number of API calls and interrupts, which do waste time. This shows some gain even on a system disk, but strangely enough not for the Windows Explorer copy command.
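
      (For anyone wanting to check what allocation unit a volume currently uses before reformatting, here is a minimal Windows-only sketch using the Win32 GetDiskFreeSpaceW call; the drive letter is a placeholder.)

          # Minimal sketch (Windows only): report a volume's cluster size via
          # the Win32 GetDiskFreeSpaceW call. The drive letter is a placeholder.
          import ctypes

          def cluster_size(root="C:\\"):
              sectors_per_cluster = ctypes.c_ulong()
              bytes_per_sector = ctypes.c_ulong()
              free_clusters = ctypes.c_ulong()
              total_clusters = ctypes.c_ulong()
              ok = ctypes.windll.kernel32.GetDiskFreeSpaceW(
                  ctypes.c_wchar_p(root),
                  ctypes.byref(sectors_per_cluster),
                  ctypes.byref(bytes_per_sector),
                  ctypes.byref(free_clusters),
                  ctypes.byref(total_clusters),
              )
              if not ok:
                  raise ctypes.WinError()
              return sectors_per_cluster.value * bytes_per_sector.value

          print("Cluster size:", cluster_size("C:\\"), "bytes")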

      Even on an SSD this speeds things up no end, because there is no access penalty of any consequence on an SSD other than the time required by Windows (read: NTFS) to access the MFT. On a system with 32 GB of RAM the entire MFT will be memory-resident pretty much all the time, even with shared graphics RAM actively being used (helps also if you run with no swap file and hibernate switched off, which I do).

      On HDD storage the gains are huge, RAIDed or otherwise. My RAID speeds rose from 790-810 MB/sec to 900-930 MB/sec on four spindles after making that change, indicating that all four drives now run at full speed (230 MB/sec as per the manufacturer's spec). I also use command-shell XCOPY for serious backups, which allows unbuffered I/O and minimizes the deleterious effects of Microsoft's memory management (it has always been poor). An incremental file manager such as FreeFileSync runs the same task at around 650 MB/sec, so XCOPY clearly gains around 40%. Setting the RAID cache tag size to the allocation size also helps.

      I test using a completely separate system, which is stripped to a minimal OS, necessary drivers and utilities, and the latest Windows update, and then use AOMEI Backupper to restore complete system disks as required, which allows one to remove things like all the Metro apps, and Comodo AV and Firewall, completely. My test system takes all of 16 GB fully updated, the live one around 40 GB including full Office Pro and a partial cut of Visual Studio 2019. Then each test gets ten short iterations and four long ones, which seems fair enough. I noticed a post on scripting the tests, so I'll explore that facility next. BTW, the CPU scores are higher when you switch off the heating for a few hours; it's nearly 0C at 09:00 here in London and they now run around 2-3% faster than back in late September!

      Have fun,

      Robert
      Last edited by RobertGP; Dec-02-2023, 10:13 AM.



      • #4
        No matter how much juice you throw at it, this CPU/GPU combination always finds a use for it, and not as heat, either
        All the power used by a CPU and GPU gets turned into heat, with the exception of the small amount used by LEDs on the GPU, or by the fans turning energy into moving air. Maybe a tiny amount gets turned into radio waves. The computations themselves, however, convert all the energy into heat.

        However, the link you posted provides no separate formula for V11
        Yes it does. On the 2nd page, post #22.

        Stop using the default 4K cluster allocation size and up it to 256k ... it significantly cuts the number of API calls and interrupts
        This would not affect the number of API calls.
        When writing code to read a file, you specify the number of bytes to read. See
        https://learn.microsoft.com/en-us/wi...leapi-readfile
        So if you ask for 100 bytes, then you always get 100 bytes. Changing the cluster size doesn't change the API calls.
        Maybe there are some edge-case scenarios where this is slightly quicker. But I can imagine cases where it is slower as well, plus it is very inefficient from a disk space usage point of view.
        Reading large block sizes is faster, however. So if, for whatever reason, the software read block sizes that matched cluster sizes, it would get faster, but only for large files. Most software doesn't work like this. Most software reads either
        • 1 byte at a time
        • 1 line at a time
        • The entire file in one hit (buffer size matches the file size)
        • Multiple fixed-size buffers in a loop (e.g. 1 KB, 1 MB, etc.) in a linear fashion
        • Random small buffers (e.g. 256 bytes), as when reading database records
        None of these depend on cluster size.
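
        For example, here is a minimal sketch of the fixed-buffer pattern (the file name is a placeholder): the buffer size is chosen entirely by the application, and nothing in the call exposes or depends on the volume's cluster size.

            # Minimal sketch of the "fixed size buffers in a loop" pattern.
            # The application picks the buffer size; nothing here knows or cares
            # what the volume's cluster size is. File name is a placeholder.
            BUFFER_SIZE = 1024 * 1024            # 1 MB, chosen by the application
            total = 0
            with open("data.bin", "rb") as f:    # placeholder file
                while True:
                    chunk = f.read(BUFFER_SIZE)
                    if not chunk:
                        break
                    total += len(chunk)          # stand-in for real per-chunk work
            print(total, "bytes read")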

        helps also if you run with no swap file and hibernate switched off, which I do
        Neither of these things has any effect on disk speed. Maybe there is a tiny impact on the size of the MFT, but that doesn't result in better performance.
        No swap file just means the machine crashes in high resource usage situations, instead of just running slowly.









        • #5
          Sorry, David, I'm getting quite sloppy in my old age; unnecessary heat would have been a better choice of words. There is useful heat (the end product of instructions executed which do not trigger any thermal checks within a Ryzen processor) and unwanted heat (generated because the processor is actually executing microcode in order to limit the production of yet more heat, which we call thermal throttling). The same effect is noticeable in Radeon GPUs, though it is almost impossible to measure precisely. Heat is the enemy, always.

          Over-clockers seem to think more juice is always a good thing, but cutting TDC (amps at the processor socket) while letting EDC balloon somewhat often produces better results from the memory controller, particularly with very high-speed DDR4 RAM on a daisy-chained motherboard, which most are nowadays. I push TDC by no more than 20% over stock, never allow EDC to exceed stock, and usually set it 4-6 amps lower. After that it's a simple task to find the optimal setting for PPT, and at that point the processor is at its real-life limit (not necessarily its notional one). Beyond that, things simply get slower.

          As for API calls, sure, you specify the number of bytes to be read, but the driver call made from within the API does look at disk characteristics; that won't issue a disk read or write for more than the allocation unit, though internally the disk firmware may read or write a full track to the media. Being pedantic, reading 32K on a 4K cluster size results in one Windows API call and eight separate physical reads from the driver; reading 32K on a 32K cluster size is just one API call and one driver call. Interrupts cost time; it wasn't until I looked into kernel-mode disk drivers in great detail that the penny finally dropped. At some point the disk geometry takes over. As a wise man once said, there is life below and beyond C.

          The storage inefficiency is also less than you might think: you lose (on average) only half the size of the last cluster in the file, so it only significantly affects small files. On a 500 MB video file at a 512K allocation unit it's just a 0.05% loss. Bear in mind also that NTFS stores small files (<900 bytes) within the MFT, so there is no allocated space lost there at all. RAID does add another layer, but it's just a replacement, and effectively faster, disk driver. The 250k files on my C: drive, if they were all 4K, would waste 127 GB of a 500 GB system drive. But those 250k files actually occupy just over 80 GB (real data size around 40 GB), around one third of the worst-case wastage. Then again, I have a 20 TB RAID array full of 500 MB video files, and the wastage there is a mere 10 GB! Given the ridiculously cheap price of storage per GB (for a large NAS HDD, 1 TB = £20 or less), it seems a very fair trade-off, though I freely admit my usage is optimal for large cluster sizes. Also, the gain is far more pronounced on HDDs.
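
          A rough sketch of that slack-space arithmetic (the file counts and sizes are placeholders, and it ignores the small files NTFS keeps resident in the MFT): expected waste is about half a cluster per file.

              # Rough sketch of the slack-space arithmetic: on average each file
              # wastes about half of its final cluster. Counts and sizes are
              # placeholders, and MFT-resident small files are ignored.
              def expected_waste_gb(file_count, cluster_bytes):
                  return file_count * (cluster_bytes / 2) / 1e9

              # ~250,000 files at a 512 KB allocation unit
              print(round(expected_waste_gb(250_000, 512 * 1024), 1), "GB expected slack")

              # waste on a single 500 MB video file at 512 KB, as a percentage
              print(round((512 * 1024 / 2) / (500 * 1e6) * 100, 3), "% of the file")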

          As for how programs read data, that is all well and true, and yes, my files are almost all big MP4s and FLAC, very large in other words. But we were discussing how to speed up a sequential read and write benchmark, and I get around a 2% gain, maybe a tad more, from using 512K cluster sizes against 4K; your sequential disk tests use 128K blocks, I believe, so anything above that is beneficial to some degree. And, unsurprisingly, my Seagate SSDs benchmark the V11 IOPS 32QD20 test about 1.5% quicker at 512K than 32K.

          I'm still playing with V11 (sorry, I omitted to read the second page!), but I do like what you've done with the 3D tests; they're a much better all-round test of a GPU, and actually put 10C on my GPU operating temperature.

          Regarding running without a swap file: all a swap file does is replace fast RAM with slow disk (in this context even SSDs can be considered slow). However, you will need to run 32 GB or more (64 GB is preferable) for safety. Windows doesn't really use that much memory on its own account, though its handle and thread counts can become rather excessive. I ran a 1 TB copy via the Explorer copy command and it never used more than 9-10 GB; it just slowed up as it wouldn't allocate any more buffers than its maximum allocation count.

          Another consideration is that Microsoft have been insurance salesmen from the beginning, very often paging things that do not need to be paged; it is not the world's best memory manager, not at all. You will also reduce interrupts and storage media degradation; a heavily paged system SSD really eats into flash memory burnout times, and the smaller the drive (most people use 250/500 GB SSDs for system drives), the lower that lifespan is to begin with. As for applications that crash without checking for such things, why on earth would one wish to use them? Sure, any server OS, or IIS running within Win 10 Pro, needs a swap file because demand will very often exceed supply, but in the past ten years the only swap-file crash problems I've had are with certain games, written by people who have no concern for the environment they work within.

          Lastly, if running without a swap file is such a bad idea, then why did MS specifically give us the option to do so? After all, they are not noted for offering their users a choice if they can possibly help it.



          • #6
            I don't have time for a full reply.

            But some quick points.

            the API does look at disk characteristics; that won't issue a disk read or write for more than the allocation unit
            The NVMe protocol allows both a starting LBA (logical block address) and a number of logical blocks to be read or written. The number of blocks is a 16-bit field, so up to 65,536 blocks per Read command. Clusters are just a file-system unit of allocation; they don't determine the I/O size. The spec is here:
            https://nvmexpress.org/wp-content/up...0-Ratified.pdf
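
            As a quick back-of-the-envelope on that 16-bit field (assuming 512-byte logical blocks; 4K-LBA formats scale accordingly, and the controller's MDTS may cap real transfers lower):

                # Quick arithmetic: maximum data covered by one NVMe Read command's
                # 16-bit "number of logical blocks" field. Assumes 512-byte logical
                # blocks; the controller's MDTS may cap real transfers lower.
                MAX_BLOCKS = 2 ** 16                  # 65,536 logical blocks
                LBA_BYTES = 512                       # assumed logical block size

                max_bytes = MAX_BLOCKS * LBA_BYTES
                print(max_bytes // (1024 * 1024), "MiB per Read command")      # 32 MiB
                print(max_bytes // 4096, "x a 4 KB NTFS cluster")              # 8192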

            all a swap file does is replace fast RAM with slow disk
            No. It extends the physical memory. It doesn't replace anything.

            very often paging things that do not need to be paged
            There was some truth to this decades ago. It doesn't really happen anymore.

            a heavily paged system SSD really eats into flash memory burnout
            If you have a heavily paged system, then the system wouldn't run at all without a paging file. It would just crash.

            I guess it is true that if your operating system has crashed, then the disk won't wear out.

            then why did MS specifically give us the option to do so
            You can turn off the paging file to save disk space. It isn't a magic button to make the computer faster.
