Actium Posted April 20 (edited)

TL;DR: I ran a dedicated server benchmark on Windows 10 both natively and virtualized (Linux QEMU+KVM). The VM reproducibly performs worse than the native server by more than 3 orders of magnitude: it yields median frame times >1000 times those of native execution on the same hardware. This makes me doubt the validity of these results, so I'd appreciate it if someone else could independently verify these findings.

I wrote Benchmark.lua to get some performance figures when running the DCS dedicated server on Linux. It repeatedly runs a simple benchmark mission for a configurable duration and logs simulation frame times.

I finally got around to running a benchmark on my local rig (Ryzen 9 5900X, 64G DDR4-3600) both natively on my regular Windows 10 install and on a fresh Windows 10 install running inside a VM (QEMU 10.0 + KVM on Linux 6.12.22). Of course, the benchmark mission and server configuration are identical. I've tried my best to give the VM a fighting chance: reserved an entire CPU CCD for exclusive use by the VM, used a raw partition on the same physical NVMe SSD (not just a disk image on a Linux file system), and pre-allocated the VM memory on 1G hugepages (to avoid page faults). These optimizations do make a difference (more on that in a future post). Here's my QEMU command line for reference and peer review:

qemu-system-x86_64 -enable-kvm -machine q35 -smp 6 -bios /usr/share/ovmf/OVMF.fd \
  -mem-prealloc -mem-path /dev/hugepages/qemu -m 24576 \
  -vga virtio -device qemu-xhci -device usb-tablet -device usb-kbd \
  -nic user,model=virtio-net-pci \
  -drive file=/dev/nvme0n1p6,aio=io_uring,index=0,media=disk,format=raw

The results are catastrophic (note the logarithmic scaling). The first (left) plot shows the time series of the measured frame times (you'll notice the 10 benchmark iterations). The second (right) plot shows the frame time distribution as exclusive percentiles (read: xx% of the benchmark duration had frame times worse than yy).

I'd indeed expect the VM to perform worse than running the server natively. However, the VM repeatedly exhibits frame times in excess of dozens of seconds, even exceeding the configured benchmark duration (5 min per round) as a consequence. Looking at CPU utilization, the native server simply shrugged off the benchmark (busiest core averaged ~30% load), whereas the VM constantly maxed out a single core.

As an auxiliary performance metric, I measured the system and user CPU time consumed by DCS_server.exe (see benchmark.ps1):

System                        System CPU Time (s)    User CPU Time (s)
Windows 10 (native)           63.9375                1135.78125
Windows 10 (Linux QEMU+KVM)   2674.03125             1995.234375

Running virtualized, the system CPU time increased to roughly 42 times the native value. I have no reasonable explanation. The benchmark runs an isolated dedicated server without clients, which should require a minimal number of syscalls (no clients -> no network). This assumption holds true when running DCS_server.exe natively, but not on the VM. The only educated guess I have right now is that the multi-threaded dedicated server might be subject to significantly higher thread synchronization overhead on VMs.

I'd appreciate feedback from anyone with experience running the dedicated server on VMs:

- Can you reproduce a significant performance degradation between native and virtualized execution of the dedicated server on the same hardware?
- Do you see any systematic or other possible sources of error within these benchmarks?
- Can multi-threading be forcibly disabled in the dedicated server (other than just pinning DCS_server.exe to a single CPU core)?

I've attached the dcs.log and benchmark.log files if anyone wants to poke at the data. You can find the Python script that generated the plots here.

Benchmark-2.9.15.9408-W10-20250419-085425Z.dcs.log.gz
Benchmark-2.9.15.9408-W10-20250419-085425Z.log.gz
Benchmark-2.9.15.9408-W10_KVM+hugepage+isol-20250419-200019Z.dcs.log.gz
Benchmark-2.9.15.9408-W10_KVM+hugepage+isol-20250419-200019Z.log.gz

Edited April 20 by Actium
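For anyone trying to reproduce the host-side preparation described above (1G hugepages and a reserved CCD), here is a rough sketch of what that can look like on the Linux host. The hugepage count and the CPU list are examples only and have to be adapted to the VM size and to your own core/CCD layout:

# reserve 24 x 1G hugepages for the 24 GiB guest (more reliably done at boot via
# default_hugepagesz=1G hugepagesz=1G hugepages=24 on the kernel command line)
echo 24 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

# mount a hugetlbfs instance that QEMU can use via -mem-path
mkdir -p /dev/hugepages/qemu
mount -t hugetlbfs -o pagesize=1G none /dev/hugepages/qemu

# keep one CCD free of host tasks (e.g. isolcpus=... on the kernel command line) and
# start QEMU confined to those CPUs; the CPU list below is an example, not the 5900X's actual layout
taskset -c 6-11,18-23 qemu-system-x86_64 ...   # remaining options as in the command line above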
Actium (Author) Posted April 20

Did a quick sanity check on the fundamental performance of the VM vs. the native Windows 10 install via PowerShell:

(Measure-Command { for ($i = 0; $i -lt 10000000; $i++) {} }).TotalSeconds

This just runs 10 million empty loop iterations and measures the total duration. On native Win 10 it takes about 6.0 seconds, on the VM about 7.0 seconds. That's well within the performance margin I'd expect, so nothing is outright wrong with the VM setup itself that would explain the huge discrepancy in dedicated server performance.
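For a rough reference point on the Linux host side, a comparable (though not identical, since shell loop overhead differs from PowerShell's) sanity check might be:

# time 10 million empty loop iterations in bash on the Linux host
time bash -c 'for ((i = 0; i < 10000000; i++)); do :; done'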
Actium (Author) Posted April 20

As promised, a comparison of QEMU+KVM with and without optimization (1G hugepages and CPU isolation). While the server performance is unusable in either configuration, the optimizations cut the peak frame time by up to half.

Unoptimized QEMU command line:

qemu-system-x86_64 -enable-kvm -machine q35 -smp 6 -bios /usr/share/ovmf/OVMF.fd \
  -m 24576 -vga virtio -device qemu-xhci -device usb-tablet -device usb-kbd \
  -nic user,model=virtio-net-pci \
  -drive file=/dev/nvme0n1p6,aio=io_uring,index=0,media=disk,format=raw

Benchmark-2.9.15.9408-W10_KVM+hugepage+isol-20250419-200019Z.log.gz
Benchmark-2.9.15.9408-W10_KVM-20250418-143902Z.log.gz
Special K Posted April 25 (edited)

a) Are you sure that DCS.getModelTime() has a proper resolution to do frame counts? If that is using os.clock() somewhere in the backend, please keep in mind that the os.clock() implementation is significantly different on Windows and Linux systems. (EDIT: just checked with the developers - it is not using os.clock().)

b) QEMU setup

It is very likely that your significantly higher time to process privileged instructions is a result of your QEMU configuration. It is at least nothing that DCS can affect in any way. I would suggest the following:

- add prealloc=on to your memory allocation for better performance
- add -mem-path /dev/hugepages if you have that configured
- add cache=none,io=native to your IO setup to enable direct access; add discard=unmap if your storage system supports TRIM
- your -smp 6 does not take the CPU architecture into consideration. If you end up on the wrong cores, you will have significantly worse performance. Add something like sockets=1,cores=3,threads=2 (needs to be adjusted to your CPU architecture) to avoid that (see the topology check sketched below this post).
- try adding -cpu host,topoext,kvm=on to help the system pass through more CPU features

I know a lot of groups that run their servers on Proxmox, for instance, without any significant performance issue. In all fairness, the DCS server is usually even a bit faster on Linux-based systems than on plain Windows.

Edited April 25 by Special K
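To pick sockets/cores/threads values that actually match the host, the topology can be checked first. A sketch; the -smp example assumes 6 vCPUs on an SMT-capable host and is not a recommendation for any specific CPU:

# show the host's socket/core/thread layout to base the guest topology on
lscpu | grep -E '^(Socket|Core|Thread)'
lscpu -e=CPU,CORE,SOCKET,NODE

# example guest topology: 3 physical cores with 2 threads each
# qemu-system-x86_64 ... -cpu host,topoext,kvm=on -smp 6,sockets=1,cores=3,threads=2 ...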
c0ff (ED Team) Posted April 25

4 hours ago, Special K said: a) Are you sure that DCS.getModelTime() has a proper resolution to do frame counts?

The proper API is Sim.getRealTime() or DCS.getRealTime() (Sim. is the new API namespace, DCS. is its alias).
Special K Posted April 25

6 hours ago, c0ff said: The proper API is Sim.getRealTime() or DCS.getRealTime() (Sim. is the new API namespace, DCS. is its alias).

Yeah, it still only has a resolution of a second. I am not sure I would rely on that for any benchmarking.
okopanja Posted April 25

10 hours ago, Special K said: It is very likely that your significantly higher time to process privileged instructions is a result of your QEMU configuration. [...]

On some hypervisors there is a known issue that even if the VM gets the initial time from the host, it drifts over time. So the first thing would be to make sure that there is no drift between host and VM.
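A quick host-side check in that direction (a sketch, assuming a Linux host): verify that the host runs on a stable TSC clocksource, since an unstable clocksource is a common cause of both guest time drift and extra timekeeping overhead.

# show the clocksource the host currently uses (ideally "tsc")
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
# look for kernel warnings about an unstable TSC
dmesg | grep -i tsc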
Special K Posted April 26

10 hours ago, okopanja said: On some hypervisors there is a known issue that even if the VM gets the initial time from the host, it drifts over time. So the first thing would be to make sure that there is no drift between host and VM.

They are doing the measurements on the VM itself, so I doubt any time drift between the host and the VM would be of relevance. But the VM itself just has too much overhead (here), which causes the longer kernel times.
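To put numbers on that overhead, one host-side option (a sketch; it assumes perf with KVM support is installed and that only one QEMU process is running) is to count and time the VM exits while the benchmark runs:

# record VM exit events of the running QEMU process (stop with Ctrl-C after a while)
perf kvm stat record -p "$(pgrep -o qemu-system-x86)"
# summarize which exit reasons dominate (MSR access, HLT, EPT violations, ...)
perf kvm stat report

If a few exit reasons clearly dominate, that narrows down where the extra kernel time is going.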
Actium (Author) Posted April 26

23 hours ago, c0ff said: The proper API is Sim.getRealTime() or DCS.getRealTime() (Sim. is the new API namespace, DCS. is its alias).

I'll adapt my benchmarking code to use .getRealTime(). "Real time" sounded too much like wall-clock time and therefore not monotonic. Is any documentation available on the new API namespace? I read about DCS.getModelTime() in DCS_Client/API/DCS_ControlAPI.html, which does not mention the Sim namespace. Neither does the Singletons page.

17 hours ago, Special K said: Yeah, it still only has a resolution of a second. I am not sure I would rely on that for any benchmarking.

How did you arrive at that conclusion? Both getRealTime() and getModelTime() definitely have sub-second resolution: DCS.getModelTime() returns 3 non-zero decimal places and DCS.getRealTime() returns 6 non-zero decimal places. The attached log files contain the raw difference of getModelTime() return values, see Benchmark.lua:

function benchmark.onSimulationFrame()
    benchmark.logfile:write(string.format("%.0f\n", (now - benchmark.prev_frame) * 1e3))
    benchmark.prev_frame = now
end

The plots in my first post (native Windows performance) clearly show that the values have millisecond resolution and no higher level of quantization. See the steps in the distribution function.

On 4/25/2025 at 12:32 PM, Special K said: It is very likely that your significantly higher time to process privileged instructions is a result of your QEMU configuration.

Thank you for the suggestions. I already had prealloc (on 1G hugepages), KVM, and asynchronous IO enabled (see first post). I'm using io_uring instead of native AIO, as their performance is comparable and I have experience with io_uring. I added the other options you suggested. Here are both for comparison:

# old QEMU options (r2)
qemu-system-x86_64 -bios /usr/share/ovmf/OVMF.fd \
  -enable-kvm -machine q35 -smp 6 \
  -mem-prealloc -mem-path /dev/hugepages/qemu -m 24576 \
  -vga virtio -device qemu-xhci -device usb-tablet -device usb-kbd \
  -nic user,model=virtio-net-pci \
  -drive file=/dev/nvme0n1p6,aio=io_uring,index=0,media=disk,format=raw

# new QEMU options (r3)
qemu-system-x86_64 -bios /usr/share/ovmf/OVMF.fd \
  -enable-kvm -machine q35,accel=kvm -cpu host,topoext,kvm=on -smp 6,sockets=1,cores=3,threads=2 \
  -mem-prealloc -mem-path /dev/hugepages/qemu -m 24576 \
  -vga virtio -device qemu-xhci -device usb-tablet -device usb-kbd \
  -nic user,model=virtio-net-pci \
  -drive file=/dev/nvme0n1p6,aio=io_uring,cache=none,index=0,media=disk,format=raw

I reran the benchmark (with the same DCS version, obviously); results attached. Presumably, the most important change was passing through the host CPU. Thanks again, I missed that entirely. Unfortunately, the performance improved only slightly. The CPU usage reported within the VM went down noticeably, but the total CPU time consumed did not change significantly, particularly the system CPU time:

System                  System CPU Time (s)    User CPU Time (s)
Windows 10 (native)     63.9375                1135.78125
Windows 10 (KVM, r2)    2674.03125             1995.234375
Windows 10 (KVM, r3)    2524.5625              1625.78125

I found this deep dive on QEMU+KVM resource isolation. Will do some more digging once I find the time.

On 4/25/2025 at 12:32 PM, Special K said: I know a lot of groups that run their servers on Proxmox, for instance, without any significant performance issue. In all fairness, the DCS server is usually even a bit faster on Linux-based systems than on plain Windows.

Do you have any additional info on that, preferably comparable benchmark results?
I believe my QEMU config should be fairly comparable to an optimized Proxmox setup, so I'd expect similar results. Even when avoiding the virtualization overhead entirely and running the DCS dedicated server on Linux via Wine, the performance is a little worse than on native Windows 10 as of upstream Wine 10.0 (see benchmark results).

Benchmark-2.9.15.9408-W10_KVM_r3-20250426-125439Z.log.gz
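One concrete step from that resource-isolation direction is pinning the individual vCPU threads of the running QEMU process to dedicated host cores. A sketch; the thread IDs and core numbers below are placeholders, and a single running QEMU process is assumed:

# list the vCPU threads of the QEMU process; KVM names them "CPU 0/KVM", "CPU 1/KVM", ...
ps -T -p "$(pgrep -o qemu-system-x86)" -o tid,comm | grep '/KVM'
# pin each vCPU thread to its own (ideally isolated) host core - example TIDs and cores
taskset -pc 6 12345   # TID of "CPU 0/KVM"
taskset -pc 7 12346   # TID of "CPU 1/KVM"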
Special K Posted April 27

No, I do not know of anybody that did such a deep dive into their configuration, mainly because they had no major issues. One of our testers did some performance comparisons Linux vs. Windows and came to very good results. I can ask them what exactly they did or ask them to join in here. But I would rather suggest moving this thread from Multiplayer Bugs to somewhere else, as this is a setup or VM issue, not a bug in DCS. If kernel times are high, then the API that DCS uses is itself already slow, so DCS cannot magically speed that up.
c0ff (ED Team) Posted April 28

On 4/26/2025 at 7:48 PM, Actium said: I'll adapt my benchmarking code to use .getRealTime(). "Real time" sounded too much like wall-clock time and therefore not monotonic.

It is monotonic. It gives the (fractional) number of real-time seconds passed since process start. Not affected by OS clock adjustments.

On 4/26/2025 at 7:48 PM, Actium said: Is any documentation available on the new API namespace?

The documentation needs a correction, indeed. Thanks for pointing this out.

On 4/26/2025 at 1:18 AM, Special K said: Yeah, it still only has a resolution of a second. I am not sure I would rely on that for any benchmarking.

The resolution is at least 1 us (1 microsecond, 0.000001 sec).
_UnknownCheater_ Posted April 28 (edited)

Do not use QEMU/KVM with a Windows VM if you are using CPU passthrough (CPU type = host): the VM is missing some CPU flags, so performance is degraded. KVM with a Linux VM does not have this issue. I am using VMware + Win10 LTSC to host my own server on my HomeLab without any issues.

Edited April 28 by _UnknownCheater_
Rocketeer88 Posted May 1 (edited)

I currently have a Windows 10 VM through Proxmox. As with Actium, one CPU pegs to 100% and the DCS server eventually crashes or freezes (with only me connected and spectating). CPU type is host, with 12 vCPUs and 32 GiB of RAM.

Edited May 1 by Rocketeer88
Actium (Author) Posted May 4

Just to clarify, the dedicated server does work for me on a Linux QEMU+KVM VM. Simple missions with a few dozen units are no problem whatsoever. However, once I throw too many units at each other, the server performs far worse on the VM than natively on Windows. After noticing that, I started digging, i.e., benchmarking, and figured that a performance degradation by more than two orders of magnitude is far beyond the overhead one would reasonably expect from a VM. Because another KVM user (through Proxmox) also has performance issues, this may be KVM-related, like @_UnknownCheater_ said. Unfortunately, I don't have the time right now to set up objectively comparable benchmarks with other hypervisors.
HellSpawn Posted May 8 (edited)

Hi! We have a Proxmox server (8.4.1) with an AMD Ryzen 5 5600G and 128GB RAM. Barely any load, except DCS. The DCS VM is Win10 with 6 cores (host mode), 32GB RAM, and an 800GB SSD.

What I read here 100% applies to our server. The performance is terrible. An empty map runs fine. If you spawn single jets you can see the CPU load rise about 5-10% (on one core) every time. SAMs are much worse: if you spawn 2-3 long-range SAM sites (e.g. SA-10), the server becomes unresponsive. Removing all mods/reinstalling has no effect.

Thanks Actium for your effort!

Edited May 8 by HellSpawn
_UnknownCheater_ Posted May 9

20 hours ago, HellSpawn said: What I read here 100% applies to our server. The performance is terrible. [...]

Change to Hyper-V or try VMware.
HellSpawn Posted May 11

I can only use Proxmox... We changed the machine type from i440fx (default) to q35. That helped with the performance, but it's still not perfect.
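For reference, on Proxmox those settings can also be applied from the CLI. A sketch, assuming VM ID 100 (adjust to your VM), mirroring the q35 machine type and host CPU type discussed in this thread:

# switch the VM to the q35 machine type
qm set 100 --machine q35
# pass the host CPU model through to the guest
qm set 100 --cpu host
# optionally back guest RAM with 1G hugepages (needs NUMA enabled for the VM and hugepages reserved on the host)
qm set 100 --numa 1
qm set 100 --hugepages 1024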