memory reserved by spawning units is never deallocated leading to server crashes

ruprecht · April 11

TL;DR: when a mission spawns or clones a unit, memory is allocated. This memory is never deallocated, even on group/unit destroy, explode or removeJunk. Thus over time any mission that creates units will crash either from memory starvation or from hitting the RegMapStorage cap of 4094 groups.

This investigation started after we noticed the excellent Pretense mission was dying after a few hours on our hosted server. By profiling the memory usage of the DCS_Server process using Process Lasso logging, we noticed that the VM (private bytes memory) continued to increase regardless of unit destruction, client join/leave or any other factor.

The control was performed with a basic ME mission with a single client helo and no AI. VM usage was flat.
The logical next step was then to write a script that spawned new units at a constant rate, and destroyed them on a cadence. If VM was deallocated, we would expect to see a sawtooth pattern. If it wasn't, we would expect to see a relatively straight incline up.
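One way to make the sawtooth-versus-incline distinction less of an eyeball judgment is to measure how much of the VM that was gained is later given back. The following is a minimal Python sketch, not the script used in this test; the function name `released_fraction` and the sample readings are made up for illustration.

```python
def released_fraction(samples):
    """Return the share of all VM growth that was later given back.

    `samples` is a list of VM readings (e.g. private bytes) in time order.
    A value near 0 means a steady incline (nothing freed); a value near 1
    means a sawtooth (almost everything allocated is released again).
    """
    gained = 0.0
    released = 0.0
    for previous, current in zip(samples, samples[1:]):
        delta = current - previous
        if delta > 0:
            gained += delta
        else:
            released += -delta
    if gained == 0:
        return 0.0  # flat or only-falling trace: no growth to release
    return released / gained


# Hypothetical traces in MB, purely for illustration (not measured data).
sawtooth = [100, 140, 180, 105, 145, 185, 110]
incline = [100, 140, 180, 220, 260, 300, 340]

print(round(released_fraction(sawtooth), 2))
print(round(released_fraction(incline), 2))
```

Fed with the Process Lasso samples, a result close to zero would correspond to the "straight incline" case described above.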

The mission starts with 2 minutes of no spawning to establish a baseline with all scripts loaded; VM is virtually flat during this period. Once spawning starts (1 group of 5 vehicles every 3 seconds), VM rises steadily and does not drop when destroy() is called on all ground objects every 60 seconds. When the spawn rate was increased to 1 group of 10 units every second, the rate of VM accrual increased as expected, and VM was still never deallocated. A mission (not server) restart performed no effective deallocation either: after the restart, VM started out higher than it had been when the mission was restarted.

The next step was to also removeJunk around the spawn Zone to clean up as much as is permitted in the scripting environment. The same spawn parameters were used, with an additional step of calling removeJunk on a sphere 2x the size of the spawn zone every 5 minutes.

In this test, VM continued the trend of accruing with no reduction due to destroy() or removeJunk(). It continued to accrue VM until the server crashed with the following error:

2024-04-11 03:15:28.743 WARNING EDOBJECTS (Main): RegMapStorage start cycle to find empty space in <viColumn>
2024-04-11 03:15:28.743 ERROR EDOBJECTS (Main): RegMapStorage has no more IDs (4094 max) in <viColumn>
2024-04-11 03:15:28.743 ERROR EDOBJECTS (Main): Failed assert `false` at Projects\edObjects\Source\Registry\RegMapStorage.cpp:124
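For anyone wanting to check their own dcs.log for the same condition, here is a rough Python sketch that scans log lines for the "no more IDs" message and reports which table ran out and at what cap. The regular expression is based only on the line format shown in the excerpt above, and the function name `find_exhausted_tables` is hypothetical; other DCS versions may word the message differently.

```python
import re

# Pattern for the "no more IDs" line as it appears in the log excerpt above.
NO_MORE_IDS = re.compile(
    r"RegMapStorage has no more IDs \((\d+) max\) in <(\w+)>"
)


def find_exhausted_tables(log_lines):
    """Return a dict mapping table name -> ID cap for every exhausted table."""
    exhausted = {}
    for line in log_lines:
        match = NO_MORE_IDS.search(line)
        if match:
            cap, table = int(match.group(1)), match.group(2)
            exhausted[table] = cap
    return exhausted


log = [
    "2024-04-11 03:15:28.743 WARNING EDOBJECTS (Main): RegMapStorage start cycle to find empty space in <viColumn>",
    "2024-04-11 03:15:28.743 ERROR EDOBJECTS (Main): RegMapStorage has no more IDs (4094 max) in <viColumn>",
]
print(find_exhausted_tables(log))
```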

This effectively caps the life of any mission that creates units on the fly based on the server RAM and the RegMapStorage cap of 4094. At 1 group per second being created, we'd expect to hit this cap at ~4094 seconds, or about 1.1 hours. This is basically exactly what we saw, despite all units in those groups being both destroyed and junk removed.
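The arithmetic behind that estimate can be written out as a small Python sketch. It assumes, as the test results suggest, that a group ID is never freed once used; the function name `seconds_until_cap` is only illustrative.

```python
REGMAP_GROUP_CAP = 4094  # cap reported in the log excerpt above


def seconds_until_cap(groups_per_second, cap=REGMAP_GROUP_CAP):
    """Seconds of spawning until `cap` group IDs are used, assuming none are reused."""
    if groups_per_second <= 0:
        raise ValueError("groups_per_second must be positive")
    return cap / groups_per_second


# Spawn rates taken from the tests in this post: 1 group per 3 s, 1 per s, 7 per s.
for rate in (1 / 3, 1, 7):
    seconds = seconds_until_cap(rate)
    print(f"{rate:.2f} groups/s -> {seconds:.0f} s ({seconds / 3600:.2f} h)")
```

At 1 group per second this gives 4094 s, a little over 1.1 hours; at 7 groups per second it gives roughly 585 s.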

A working theory is that this behaviour was introduced with the Apache FCR in order to allow destroyed vehicles to still be seen by the FCR. That said, others have said anecdotally that this predates the Apache. The effect is that dynamic missions require a regular server (not mission) restart to prevent this VM saturation, on a cadence that depends on the rate at which new units are spawned by the mission, and the total server RAM.

Request that ED advise or investigate a solution that allows mission editors to purge the VM allocation of "dead but still there objects" in areas that are no longer relevant to the mission to prolong the life of the mission/server. While this may have other effects e.g. rendering them invisible to FCR, the alternative (a server crash or restart) is surely not better.

Log data and charts:
https://docs.google.com/spreadsheets/d/1p0tKoeipHJOaChhKnzNKjLD30maU3xKCvJH5o4quBY0/edit?usp=sharing

(edit: if anyone wants to help me replace the MOOSE functions with stock to eliminate that potential source of error, please do!)

(edit2: another data point using trigger.action.explosion rather than destroy(). Same issue.)

RnR_Memtest_Syria.miz

update: tested a different miz with no MOOSE (thanks cfrag), which creates 7 groups per second and destroys the group rather than the individual vehicles. Same VM accrual, and right on cue at ~590 sec (4094 / 7) it starts spamming RegMapStorage full.

Edited April 11 by ruprecht

cfrag · April 11

Thank you for this very interesting and detailed analysis - it confirms what I have speculated about, and is the reason why I tend to have missions save and then re-start (I'm hoping that a mission re-start frees the memory occupied by units). I'm looking forward to ED handling this quite serious bug (a massive memory leak).

2 hours ago, ruprecht said:

This effectively caps the life of any mission that creates units on the fly

For the life of me I can't figure out why you filed this important bug as a Mission Editor bug - it doesn't affect ME but, much more importantly, the game core engine itself.

Great work, thank you, and hopefully this is going to be tackled soon.

Edited April 11 by cfrag

ruprecht · April 11

8 minutes ago, cfrag said:

For the life of me I can't figure out why you filed this important bug as a Mission Editor bug

In my head, it's in a category at the top of the page for visibility but it won't get buried as quickly as it might in some other performance/general thread.

*shrug*

ruprecht · April 11

29 minutes ago, cfrag said:

I'm hoping that a mission re-start frees the memory occupied by units)

Unfortunately a mission restart doesn't seem to. You can see it halfway through the second chart. Only a server restart cures it.

Maverick87Shaka · April 11

2 hours ago, ruprecht said:

The working theory is that this behaviour was introduced with the Apache FCR in order to allow destroyed vehicles to still be seen by the FCR. The effect is that dynamic missions require a regular server (not mission) restart to prevent this VM saturation, on a cadence that depends on the rate at which new units are spawned by the mission, and the total server RAM.

This has been the behavior for as long as DCS has existed, or at least as far back as I can remember from when I started running servers in 2017; it is not related to the Apache FCR. That's why we routinely perform a complete stop/start of our DCS server instances every 6 hours.

Below you can see memory trend of our DCS server back in 2022:

It's a good analysis and something interesting for the ED team ( @c0ff ) to deep dive into, but I guess it will not be something that gets solved in a short time.

cfrag · April 11

Here's a mission that, once per SECOND (pls ignore the silly misleading name) destroys (if they exist) and then allocates 7 groups of 17 ground vehicles.

This will exhaust the storage after a while (some 580 seconds until RegMapStorage runs out, which can only store 4094 entries). After a while, another table (with 65534 limit) runs out.

ERROR   EDOBJECTS (Main): RegMapStorage has no more IDs (4094 max) in <viColumn>

and

EDOBJECTS (Main): RegMapStorage has no more IDs (65534 max) in <viWorldHeavyObject>

Here's the miz that reliably re-creates the issue on local server:

clone once a minute.miz

Edited April 11 by cfrag

ruprecht · April 11

6 hours ago, Maverick87Shaka said:

This has been the behavior for as long as DCS has existed, or at least as far back as I can remember from when I started running servers in 2017; it is not related to the Apache FCR. That's why we routinely perform a complete stop/start of our DCS server instances every 6 hours.

Maybe, though it definitely seems more extreme lately. Just shrugging and accepting a 6, or 4, or 2 hourly restart isn't something everyone is relaxed about.

In any case, there's hard data here that the problem is directly affecting customers of commercial hosting providers so I'd suspect there is some incentive to want this finally hunted down and shacked. Big persistent dynamic sandbox missions like these are the raison d'etre for commercial hosting.

If it's a leak because some dev in 2008 missed a pointer delete somewhere, it's about time it was fixed. If it's an intentional architectural decision to persist these groups, the impact needs to be reconsidered and alternative designs explored.

6 hours ago, Maverick87Shaka said:

Below you can see memory trend of our DCS server back in 2022

It's interesting, but without knowing any details about the mission running and how it is spawning and destroying units, it's hard to draw any conclusions from it.

Edited April 11 by ruprecht

SpeedDemonAdam · April 23

My group has had similar trouble with our regular servers and our event server crashing. Here is a snip of the log data from a liberation that was run a few days ago.

This should probably be moved to https://forum.dcs.world/forum/486-multiplayer-bugs/ @BIGNEWY

Moezilla · May 23

On 4/11/2024 at 1:04 PM, cfrag said:
Here's a mission that, once per SECOND (pls ignore the silly misleading name) destroys (if they exist) and then allocates 7 groups of 17 ground vehicles.

This will exhaust the storage after a while (some 580 seconds until RegMapStorage runs out, which can only store 4094 entries). After a while, another table (with 65534 limit) runs out.
ERROR   EDOBJECTS (Main): RegMapStorage has no more IDs (4094 max) in <viColumn>
and
EDOBJECTS (Main): RegMapStorage has no more IDs (65534 max) in <viWorldHeavyObject>
Here's the miz that reliably re-creates the issue on local server:

clone once a minute.miz

@Flappie I think you were not around when this one was posted but it would be great if you can try to reproduce the issue with cfrag's miz.
