Intermittent Freeze and Hard Reset of Windows VM - Proxmox

  • Intermittent Freeze and Hard Reset of Windows VM - Proxmox

    I've been having an issue with PerformanceTest V11 (and V10) for quite a while, and I decided to finally try posting on the forum about it. I'm only a Trial user, but I struggle to imagine I'm the only person experiencing this issue.

    Quick background - I'm running a Windows VM (10 or 11, same settings) with GPU passthrough (AMD Radeon RX 6700 XT), and also passthrough of USB and Audio devices (basically a stealth Gaming VM at my desk with the "baremetal" OS, Proxmox, just running some services in the background). Pretty much the only issue I get with the VM is with PerformanceTest.

    What happens is, basically, PerformanceTest seems to have a random chance of just taking down the whole VM, locking it up for a few seconds, leading to an automatic VM reset, rebooting normally.
    • It doesn't happen every single time.
    • Typically, if PerformanceTest doesn't trigger the freeze after a few minutes, it doesn't seem to happen again until I reboot the VM.
    • It doesn't seem to happen during the initial data collection phase, usually an indeterminate amount of time after.
    • It usually seems to happen while navigating menus, but has happened during or after running tests.
    As it happens, I can hear through my speakers a Windows "device disconnected" sound, but I can't figure out for the life of me how to pin down exactly what happens in those moments, since the whole system appears to freeze, and the VM resets itself shortly after. I am no Windows expert, but Event Viewer doesn't seem to record enough information or provide a means to live-log any and all device connects/disconnects - Google results hyperfixate on "USB" connections which is not what I want.

    Further details:
    • Unaffected by the SAFEMODE option
      • As an aside, in recent PerformanceTest, SAFEMODE seems to have started to disable Extended Instructions (SSE) benchmarks from running. I am quite sure this used to work not long ago. Is this intended?
    • Unaffected by the other flags:
      • /DontGatherGraphics
      • /DontGatherUSB
      • /DontGatherDisk
      • /DontGatherSMART
      • /DontGatherMemory
      • /DontGatherMemorySPD
      • /DontGatherWMI
      • /DontGatherSMBIOS
      • /DontGatherTemperature​
    • Using DEBUGMODE I was able to get what might be confirmation that it is the GPU itself that's unhappy (which is consistent with the fact that even though the video output freezes, the VM is still "running" enough to make the "device disconnect" sounds I get to hear). For example:
      • 110.937s - Got WM_EXITSIZEMOVE message
      • 110.937s - WM_EXITSIZEMOVE: BackBufferWidth 640 : BackBufferHeight 480 (previous: 2450 : 1334)
      • 110.953s - DEBUG 3D: Failed Reset (2289436776)
      • 110.953s - DEBUG 3D : D3DERR_DEVICELOST on reset, trying to wait for valid status
      • 110.953s - DEBUG 3D: TestCooperativeLevel returned D3DERR_DEVICELOST
      • 111.203s - Got WM_EXITSIZEMOVE message​
    • Unaffected by disabling AMD overlay and "toast" notifications
    • Same behaviour in a clean Win11 VM that I spun up recently just for this purpose.
    I'm always using the most recent AMD Adrenalin drivers with pretty much no "custom" settings - certainly no fiddling with voltages or power limits.

    This issue has been driving me up the wall, since other than this, PerformanceTest is by far my favourite "general purpose" benchmarking tool, and has been my preferred resource for benchmark results for a few years. My familiarity with the scores has allowed me to identify other VM configuration issues affecting my performance, but it would have been easier to do so if the tool didn't keep causing these lockups and reboots.

    If anyone can point me in the right direction for what to test next, or what specific information I might provide that could bring us closer to a solution, I would of course be more than willing. Thanks for your attention so far.

  • #2
    might be confirmation that it is the GPU itself that's unhappy​
    Yes, looks like the video card device driver crashed. So driver bug or an underlying hardware fault.

    Do you have a spare / different video card to try?



    • #3
      Originally posted by David (PassMark)
      Do you have a spare / different video card to try?
      I'm afraid not. The most I can do in that vein for the foreseeable future would be to either access the VM via remote desktop, or disable GPU passthrough altogether and access it through the Proxmox web UI. If you think either of those is worth trying, I can do so at some point this week.

      I was kind of hoping there would be some way to set things up so that I can get more useful information about what exactly fails and why. As I mentioned, I tried to pore over Event Viewer logs from the minute or two leading up to the problem occurring, but there really isn't anything useful - it doesn't even seem to log which device gets disconnected. Do you happen to have any ideas in that regard?

      By the way, I neglected to mention that I am currently running version 1009, but I did already say that I've had the issue for a while, so I'm reasonably sure that isn't too important.

      Also; should I make a separate thread regarding my "Extended Instructions when using SAFEMODE" query?



      • #4
        You can find a list of DirectX error codes here
        https://learn.microsoft.com/en-us/wi...rect3d9/d3derr

        D3DERR_DEVICELOST can happen in a bunch of different ways. Crashes

        Device disconnect sound might be the video card disappearing.
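
The number in the "Failed Reset (2289436776)" log line can be checked against that error list by splitting it into its HRESULT fields. The following is a minimal sketch, assuming the value is a standard Direct3D 9 HRESULT; the function name `decode_hresult` and the small lookup table are illustrative, and the table covers only a few codes from d3d9.h rather than the full list.

```python
# Decode a Direct3D 9 HRESULT such as the 2289436776 from the "Failed Reset" line.
# Assumption: the logged number is a plain HRESULT printed as an unsigned decimal.

D3D_FACILITY = 0x876  # _FACD3D in d3d9.h

# A few MAKE_D3DHRESULT codes from d3d9.h; deliberately not exhaustive.
KNOWN_D3D_CODES = {
    2087: "D3DERR_DRIVERINTERNALERROR",
    2152: "D3DERR_DEVICELOST",
    2153: "D3DERR_DEVICENOTRESET",
    2156: "D3DERR_INVALIDCALL",
}


def decode_hresult(value):
    """Split a 32-bit HRESULT into severity, facility and code fields."""
    value &= 0xFFFFFFFF
    facility = (value >> 16) & 0x1FFF  # same mask as the HRESULT_FACILITY macro
    code = value & 0xFFFF
    name = KNOWN_D3D_CODES.get(code) if facility == D3D_FACILITY else None
    return {
        "hex": f"0x{value:08X}",
        "failed": bool(value >> 31),  # severity bit set means failure
        "facility": facility,
        "code": code,
        "name": name,
    }


print(decode_hresult(2289436776))
```

The logged value works out to 0x88760868, which is the D3D facility with code 2152, i.e. D3DERR_DEVICELOST. So the "Failed Reset" line and the two lines after it in that log are reporting the same condition rather than separate errors.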

        As an aside, in recent PerformanceTest, SAFEMODE seems to have started to disable Extended Instructions (SSE) benchmarks from running. I am quite sure this used to work not long ago. Is this intended?
        My guess is that with safe mode on, we don't detect which instruction sets the CPU is capable of using. Attempting to use a CPU instruction that isn't available will result in a crash, so we don't do it. So this is probably normal.

        Can you try using just the /NO3D command line option? This disables the DirectX9 3D interface that runs at start up. Maybe that's what triggers the error. The interface doesn't play any part in the actual benchmarks.



        • #5
          Originally posted by David (PassMark)
          Can you try using just the /NO3D command line option (still with DEBUGMODE). This disables the DirectX9 3D interface that runs at start up. Maybe that's what triggers the error. And the interface doesn't play any part in the actual benchmarks.
          I added that option to my PerformanceTest shortcut, did a clean boot of the VM, then ran PerformanceTest again. I did a quick CPU benchmark then started a 2D benchmark, and then the issue occurred again - "device disconnect" sound, full apparent lockup, VM reboots itself. This time, though, there's nothing conspicuous in the log:
          Code:
          97.625s - DEBUG: Running Test - Simple Vectors
          104.000s - DEBUG PERF: CPU mark raw 0.000566
          104.000s - GetCommonApplicationDataFolder: C:\ProgramData
          104.015s - DEBUG PERF: calc mark num3DTestRun = 0
          104.015s - DEBUG PERF: calc mark everything
          104.015s - DEBUG PERF: numtest < 5
          104.015s - DEBUG PERF: num3DTestRun 0
          104.171s - TEMP DEBUG csum 2 ok: 1
          104.171s - GetCommonApplicationDataFolder: C:\ProgramData
          104.171s - GetChartDataFromCache - C:\ProgramData\PassMark\PerformanceTest11\Chart Data\passmarkRating\all.xml
          104.171s - GetChartDataFromCache - llSecsDiff > ALL_CHART_DATA_UPDATE_INTERVAL_SECS - failed
          104.171s - Calc %-tile [passmarkRating] - bin size: 100, num bins: 100
          104.171s - Calc %-tile [passmarkRating] - Cumulative freq: 182261, Score Bin: 64
          104.171s - Calc %-tile [passmarkRating] - Percentile: 79, Score: 6491.400391
          104.171s - GetCommonApplicationDataFolder: C:\ProgramData
          104.171s - GetChartDataFromCache - C:\ProgramData\PassMark\PerformanceTest11\Chart Data\cpuRating\all.xml
          104.171s - GetChartDataFromCache - llSecsDiff > ALL_CHART_DATA_UPDATE_INTERVAL_SECS - failed
          104.171s - Calc %-tile [cpuRating] - bin size: 460, num bins: 100
          104.171s - Calc %-tile [cpuRating] - Cumulative freq: 213275, Score Bin: 56
          104.171s - Calc %-tile [cpuRating] - Percentile: 92, Score: 26227.878906
          104.171s - TEMP DEBUG csum 3 ok: 1
          104.171s - GetCommonApplicationDataFolder: C:\ProgramData
          104.171s - GetChartDataFromCache - C:\ProgramData\PassMark\PerformanceTest11\Chart Data\g2dRating\all.xml
          104.171s - GetChartDataFromCache - llSecsDiff > ALL_CHART_DATA_UPDATE_INTERVAL_SECS - failed
          104.218s - GetCommonApplicationDataFolder: C:\ProgramData
          104.218s - GetChartDataFromCache - C:\ProgramData\PassMark\PerformanceTest11\Chart Data\G2D_SIMPLE\all.xml
          104.218s - GetChartDataFromCache - llSecsDiff > ALL_CHART_DATA_UPDATE_INTERVAL_SECS - failed
          104.234s - GetCommonApplicationDataFolder: C:\ProgramData
          104.234s - GetChartDataFromCache - C:\ProgramData\PassMark\PerformanceTest11\Chart Data\G2D_SIMPLE\all.xml
          104.234s - GetChartDataFromCache - llSecsDiff > ALL_CHART_DATA_UPDATE_INTERVAL_SECS - failed
          104.234s - GetCommonApplicationDataFolder: C:\ProgramData
          104.234s - GetCommonApplicationDataFolder: C:\ProgramData
          104.234s - GetChartDataFromCache - C:\ProgramData\PassMark\PerformanceTest11\Chart Data\g2dRating\all.xml
          104.234s - GetChartDataFromCache - llSecsDiff > ALL_CHART_DATA_UPDATE_INTERVAL_SECS - failed
          104.250s - GetCommonApplicationDataFolder: C:\ProgramData
          104.250s - GetChartDataFromCache - C:\ProgramData\PassMark\PerformanceTest11\Chart Data\g2dRating\all.xml
          104.250s - GetChartDataFromCache - llSecsDiff > ALL_CHART_DATA_UPDATE_INTERVAL_SECS - failed
          104.250s - GetCommonApplicationDataFolder: C:\ProgramData
          104.281s - GetCommonApplicationDataFolder: C:\ProgramData
          104.281s - GetChartDataFromCache - C:\ProgramData\PassMark\PerformanceTest11\Chart Data\G2D_SIMPLE\all.xml
          104.281s - GetChartDataFromCache - llSecsDiff > ALL_CHART_DATA_UPDATE_INTERVAL_SECS - failed
          104.296s - GetCommonApplicationDataFolder: C:\ProgramData
          104.296s - GetChartDataFromCache - C:\ProgramData\PassMark\PerformanceTest11\Chart Data\G2D_SIMPLE\all.xml
          104.296s - GetChartDataFromCache - llSecsDiff > ALL_CHART_DATA_UPDATE_INTERVAL_SECS - failed
          104.296s - GetCommonApplicationDataFolder: C:\ProgramData​
          To be clear - I do not believe there is any significance to whichever benchmark I happen to have most recently run before the freeze occurs. Sometimes it does it without attempting to start any. I would imagine the issue is with PerformanceTest's underlying system service and some interaction with something else in my system setup.
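
Comparing DEBUGMODE logs across several crashes by eye is tedious, so a small script can pull out the last entries before the freeze and flag any Direct3D-related lines. This is a rough sketch, assuming every log line has the "seconds - message" shape seen in the excerpts in this thread; the keyword list and the function names are illustrative choices, not anything PerformanceTest defines.

```python
import re

# Matches lines such as "110.953s - DEBUG 3D: Failed Reset (2289436776)".
LINE_RE = re.compile(r"^\s*(\d+\.\d+)s - (.*)$")

# Illustrative keywords taken from the log excerpts in this thread.
KEYWORDS = ("DEVICELOST", "Failed Reset", "TestCooperativeLevel")


def parse_log(lines):
    """Return a list of (timestamp_seconds, message); skip lines that do not match."""
    entries = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            entries.append((float(match.group(1)), match.group(2).strip()))
    return entries


def summarize(entries, tail=5):
    """Report the final timestamp, the last few entries and any flagged entries."""
    flagged = [(t, msg) for t, msg in entries if any(k in msg for k in KEYWORDS)]
    return {
        "last_timestamp": entries[-1][0] if entries else None,
        "tail": entries[-tail:],
        "flagged": flagged,
    }
```

`parse_log` accepts any iterable of lines, so an open file handle for a saved log works. Running `summarize` over the logs from several crashes makes it easier to see whether the final entries have anything in common, or whether the device-lost lines appear in only some of them.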

          Originally posted by David (PassMark)
          Yes, looks like the video card device driver crashed. So driver bug or an underlying hardware fault.
          If it were truly GPU hardware related, I would expect to experience this kind of lockup in other contexts. The only thing vaguely similar I've had was a temporary issue due to a VM misconfiguration that went away after fixing the setting. Certainly there could still be some sort of hidden issue with the way I have the VM set up, but the fact remains that presently the only problematic user-facing software is PerformanceTest - which I can assure you is very frustrating even for me!
          Last edited by aphirst; Feb-27-2024, 09:49 AM. Reason: further clarification of possible cause



          • #6
            The 2D test also uses the video card and DirectX.

            Maybe try running a few 3D / DirectX apps in your VM.
            e.g. Older DirectX9 Games, DirectX12 games, apps with GPU acceleration like some parts of Photoshop & Premiere, video encoders, other 3D benchmarks, etc...
            There is also a list of apps here.

            Again, it would be good to try another video card (or another machine entirely).



            • #7
              Originally posted by David (PassMark)
              Again, it would be good to try another video card (or another machine entirely).
              Trying another video card simply isn't on the "cards" presently due to limitations of space and time (and money).
              My ThinkPad X395 (Ryzen PRO 3500U, Vega) doesn't seem to have the issue, and I've never noticed it on other older Intel-based ThinkPads either. I don't have regular access to any other machines.

              Originally posted by David (PassMark)
              Maybe try running a few 3D / DirectX apps in your VM.
              e.g. Older DirectX9 Games, DirectX12 games, apps with GPU acceleration like some parts of Photoshop & Premiere, video encoders, other 3D benchmarks, etc...
              I did already say:
              Originally posted by aphirst
              Quick background - I'm running a Windows VM (10 or 11, same settings) with GPU passthrough (AMD Radeon RX 6700 XT), and also passthrough of USB and Audio devices (basically a stealth Gaming VM at my desk with the "baremetal" OS, Proxmox, just running some services in the background). Pretty much the only issue I get with the VM is with PerformanceTest.

              What happens is, basically, PerformanceTest seems to have a random chance of just taking down the whole VM, locking it up for a few seconds, leading to an automatic VM reset, rebooting normally.
              , but perhaps we would benefit from my being more specific. I've been daily driving this as a gaming VM for over a year, with emulators (PS1 through GC/Wii and Switch) as well as games as far back as Rise of Nations, Freelancer and WarCraft III, through Telltale Games of the DX9 era, 3DMark, and countless more modern titles using DX10-12 and Vulkan from all across my Steam backlog.

              There are four and exactly four circumstances under which the VM has ever hung and/or hard-rebooted on its own.
              1. Detaching a PCIe device or VM disk from the VM (from the hypervisor host) while it's still running. (neither surprising nor relevant)
              2. Using inappropriate qemu CPU flags, especially omitting those pertaining to invtsc, similar to this Proxmox forum post. (also not surprising or relevant)
              3. Attempting a GPU overclock or undervolt that is too aggressive, though the usual symptom was merely a driver reset, from which the VM would recover at runtime. (not relevant since I have that all disabled)
              4. PassMark PerformanceTest - exhibiting the behaviour irrespective of which benchmark is running, has run, or even whether any have yet been started.


              I think it bears repeating that PerformanceTest is the only program that causes any issue, and that, even though it's what we suspect, strictly speaking we don't actually know that it's the GPU that's dropping off the system:
              • the suspicious "device lost" error present in one instance of the log is not present in all instances when the bug occurs
              • at present I have no other means of ascertaining which device(s) are disconnecting or why, other than the clear observable that it only happens when PerformanceTest is open
              Certainly after work this evening I can try running some more benchmarking software or older DX titles, but I fully expect to not be able to reproduce any issues similar to what I've been describing here.

              Is there any way at all to interrogate what it is exactly that the Performance Test background service (which is started when the tool opens, and stopped when the tool closes - presumably the reason the tool needs administrator privileges) is actually doing behind the scenes? Insofar as my opinion does count for something, I would imagine the tool is trying to do something that has ill-defined behaviour under a VM, perhaps exposing some sort of VM-related issue or timing issue or something that doesn't affect the operation of regular software. However, since it's a black box there isn't exactly anything I can investigate on my own.

              Do you know of any other means to gather reliable logs of devices being disconnected along with useful context for how or why it may have occurred?
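
One possible way to capture which device disappears is to poll the guest's device list and write every change to disk straight away, so that the record survives the hard reset. The following is only a sketch under stated assumptions: it assumes PowerShell's `Get-PnpDevice` cmdlet is available in the Windows guest, and the function names and polling interval are illustrative. If the guest hangs completely before the next poll, nothing will be recorded, so this may or may not catch the event.

```python
import json
import os
import subprocess
import time
from datetime import datetime

# Assumption: PowerShell's Get-PnpDevice cmdlet is available in the Windows guest.
PS_COMMAND = (
    "Get-PnpDevice -PresentOnly | "
    "Select-Object InstanceId, FriendlyName, Status | "
    "ConvertTo-Json -Compress"
)


def snapshot_devices():
    """Return {InstanceId: (FriendlyName, Status)} for devices currently present."""
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command", PS_COMMAND],
        capture_output=True, text=True, check=True,
    )
    data = json.loads(result.stdout) if result.stdout.strip() else []
    if isinstance(data, dict):  # a single device is serialized as one object
        data = [data]
    return {d["InstanceId"]: (d.get("FriendlyName"), d.get("Status")) for d in data}


def diff_snapshots(old, new):
    """List (kind, instance_id, detail) events describing how new differs from old."""
    events = []
    for dev_id in old.keys() - new.keys():
        events.append(("removed", dev_id, old[dev_id]))
    for dev_id in new.keys() - old.keys():
        events.append(("added", dev_id, new[dev_id]))
    for dev_id in old.keys() & new.keys():
        if old[dev_id] != new[dev_id]:
            events.append(("changed", dev_id, new[dev_id]))
    return sorted(events, key=lambda e: (e[0], e[1]))


def watch(log_path, interval_s=1.0):
    """Poll forever, appending device changes to log_path and forcing them to disk."""
    previous = snapshot_devices()
    with open(log_path, "a", encoding="utf-8") as log:
        while True:
            time.sleep(interval_s)
            current = snapshot_devices()
            for kind, dev_id, detail in diff_snapshots(previous, current):
                log.write(f"{datetime.now().isoformat()} {kind} {dev_id} {detail}\n")
            log.flush()
            os.fsync(log.fileno())  # push the data to disk before a possible reset
            previous = current
```

Calling `watch` with a log file path of your choosing before opening PerformanceTest would leave a timestamped record of removals and status changes. Each poll starts a new PowerShell process, so the effective interval will be longer than the value passed in.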

              Something else I am going to try when I have the time to get around to it is, as I think I already mentioned, running the VM without the GPU passed-through, and then with no devices passed-through. Whether or not the issue can be triggered there too would be very useful data points.
              Last edited by aphirst; Feb-28-2024, 12:38 AM. Reason: mentioned additional further diagnosis possibilities



              • #8
                Performance Test background service
                To be technically precise, there is no background service. (A service being a separate executable file that runs without a user interface).
                However, there is a background thread that runs to collect temperatures from the CPU, GPU and Disk. There is also a device driver that is used just at startup to collect system information, such as RAM model numbers, CPU clock speeds, etc..

                Using the /DontGatherTemperature command line parameter can turn off temperature collection however.

