There's been much talk of a "memory leak" type of bug in squaremap. This issue intends to prove that such a bug exists, explain its exact cause, and suggest a few ways to address it. I'm going to be as thorough as possible so there is no doubt that this is the real cause.
The proof
On a buddy's server running squaremap, some information redacted for privacy reasons:
Before unloading squaremap:
After unloading squaremap:
Heap dump (here, Image objects account for 5GB of memory; the details are explained below. This is from a totally different server than the one above, but caused by the exact same thing):
The heap dump and Pterodactyl screenshots were both taken on machines with high-end NVMe SSDs, so there should be no doubt on this point: the problem is not I/O related.
What's actually going on?
The allocation of the object we are interested in happens on Line 61 of BackgroundRenderer. This will be the focus of much of the bug report.
See also the allocation triggered here, which calls this, which then allocates the large 2D int array, before sticking the whole thing back in the queue, where it may wait for several hours.
So what's actually going on here? Prior to the job being submitted to the queue, an Image object is allocated. This object is cheap, but only temporarily. After the first chunk is rendered, it allocates a massive 2D int array to store the raw image data. Immediately following this, it sticks the whole thing BACK AT THE END OF THE QUEUE, where it waits in line all over again for the next chunk to be ready.
This becomes problematic as the background render queue grows in size. In testing, this is not an issue, as the render queue can easily keep up with anything a small number of players can do. However, on a large server, there are far more actions being taken on chunks in the world, which means many more background render submissions are required. If these submissions are made faster than the tiles can be rendered, then the queue grows in size. This is expected behavior. The problem here is that the 2D int array is allocated in its entirety PRIOR TO all of the data actually being ready, and it then sits in memory while the job re-enters the end of the queue. This process repeats for hours on large servers with lots of activity.
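To make the pattern concrete, here is a minimal sketch of it. This is not squaremap's actual code: `RequeueSketch`, `TileJob`, `renderNextChunk`, and `peakLiveBuffers` are names I made up for illustration, and the tile size is kept tiny so the example runs anywhere. It models a single-threaded FIFO queue where each job allocates its buffer on the first chunk and then goes back to the end of the line.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical model of the re-queue pattern; not squaremap's real classes.
final class RequeueSketch {
    static final class TileJob {
        int[][] pixels;       // allocated lazily, on the first rendered chunk
        int chunksRemaining;

        TileJob(int chunks) {
            this.chunksRemaining = chunks;
        }

        // Renders one chunk; returns true if this call allocated the buffer.
        boolean renderNextChunk(int tileSize) {
            boolean allocated = false;
            if (pixels == null) {
                pixels = new int[tileSize][tileSize];
                allocated = true;
            }
            chunksRemaining--;
            return allocated;
        }
    }

    // Returns the largest number of pixel buffers that were alive at once.
    static int peakLiveBuffers(int jobs, int chunksPerJob, int tileSize) {
        Queue<TileJob> queue = new ArrayDeque<>();
        for (int i = 0; i < jobs; i++) {
            queue.add(new TileJob(chunksPerJob));
        }
        int live = 0;
        int peak = 0;
        while (!queue.isEmpty()) {
            TileJob job = queue.poll();
            if (job.renderNextChunk(tileSize)) {
                live++;
            }
            peak = Math.max(peak, live);
            if (job.chunksRemaining > 0) {
                queue.add(job);    // back to the END of the queue, buffer still held
            } else {
                job.pixels = null; // tile finished; the buffer can be collected
                live--;
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        // 100 queued tiles, 4 chunks each: every tile holds a buffer at once.
        System.out.println(peakLiveBuffers(100, 4, 16)); // prints 100
    }
}
```

In this model the peak number of live buffers equals the queue length whenever a tile needs more than one pass, which is the growth described above. If a tile finishes in a single pass, the peak stays at 1.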
But this wasn't an issue on Pl3xMap!
You're right, that's because I patched it in Pl3xMap just over a year ago. The way this patch works is by blocking the thread calling the render function until the render has actually finished. By blocking the thread calling the render function, no new allocations will be made as the render function won't be entered again until the previous invocation has fully completed. Additionally, back then, this render function rendered the entire tile, all in one go, rather than re-submitting it to the queue to render it one piece at a time. Admittedly this wasn't the best patch, but given that the software was not being well-maintained and I wanted a patch that was as simple to implement and as low-touch as possible, this was the method of patching I decided upon.
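The idea behind that patch can be sketched roughly like this. It is only an illustration under my own assumptions, not the actual Pl3xMap patch: `BlockingRenderSketch`, `renderWholeTile`, and `peakWithBlockingCaller` are hypothetical names, and the "render" is a placeholder. The point is that the caller waits on each tile before submitting the next one, and each tile is rendered in one go.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of "block the caller until the render is done".
final class BlockingRenderSketch {
    static final AtomicInteger live = new AtomicInteger();
    static final AtomicInteger peak = new AtomicInteger();

    // Stand-in for rendering an entire tile in one go: allocate, draw, release.
    static void renderWholeTile(int tileSize) {
        int now = live.incrementAndGet();
        peak.accumulateAndGet(now, Math::max);
        int[][] pixels = new int[tileSize][tileSize];
        pixels[0][0] = 1; // placeholder for the actual rendering work
        live.decrementAndGet();
    }

    // Returns the largest number of tiles that were being rendered at once.
    static int peakWithBlockingCaller(int tiles, int tileSize) {
        live.set(0);
        peak.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            for (int i = 0; i < tiles; i++) {
                // join() blocks this thread until the tile is completely rendered,
                // so the next submission cannot start while a buffer is still held.
                CompletableFuture.runAsync(() -> renderWholeTile(tileSize), pool).join();
            }
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println(peakWithBlockingCaller(50, 16)); // prints 1
    }
}
```

Because the submitting thread never gets ahead of the renderer, only one buffer exists at a time no matter how many tiles are waiting. This only holds as long as there is a single caller being blocked.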
On a side note: this is actually a little funny because my original patch did not wait for I/O to complete. If the problem were that the I/O queue were backed up instead of the render queue, then this patch wouldn't have worked.
Squaremap has code that looks very similar to the code in the patch. So why doesn't the patch work anymore? Well, at some point in the development of squaremap, the threading model behind the BackgroundRenderer changed. Blocking the render function is no longer effective in preventing reentry into the function, which means that new Image objects can still be allocated while earlier renders are unfinished.
So how can this be fixed?
There are several ways that this issue can be resolved in squaremap. Any of the below should do the trick:
- Revert the threading changes involved in calling the BackgroundRenderer. This will allow the original patch, which still exists in squaremap, to function correctly.
- Re-prioritize the queueing somehow so that tiles mid-render don't have to get through the entire queue again.
There are probably several more ways to fix this, but those are just a few that I came up with.
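As a rough sketch of the re-prioritization idea, here is one way it could look. This is an assumption on my part, not a tested change to squaremap: `PriorityRequeueSketch`, `Job`, and `MID_RENDER_FIRST` are hypothetical names, and the model is single-threaded. A priority queue orders jobs so that tiles which already hold a buffer are served before tiles that have not started.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical model: mid-render tiles jump ahead of tiles that have not started.
final class PriorityRequeueSketch {
    static final class Job {
        final long sequence;  // submission order, used as a tie-breaker
        int[][] pixels;       // allocated lazily, on the first rendered chunk
        int chunksRemaining;

        Job(long sequence, int chunks) {
            this.sequence = sequence;
            this.chunksRemaining = chunks;
        }

        boolean started() {
            return pixels != null;
        }
    }

    // Started jobs sort first; within each group, the oldest submission wins.
    static final Comparator<Job> MID_RENDER_FIRST = (a, b) -> {
        if (a.started() != b.started()) {
            return a.started() ? -1 : 1;
        }
        return Long.compare(a.sequence, b.sequence);
    };

    // Returns the largest number of pixel buffers that were alive at once.
    static int peakLiveBuffers(int jobs, int chunksPerJob, int tileSize) {
        PriorityQueue<Job> queue = new PriorityQueue<>(MID_RENDER_FIRST);
        for (int i = 0; i < jobs; i++) {
            queue.add(new Job(i, chunksPerJob));
        }
        int live = 0;
        int peak = 0;
        while (!queue.isEmpty()) {
            // The job is removed before it is mutated, so its sort key never
            // changes while it sits inside the heap.
            Job job = queue.poll();
            if (job.pixels == null) {
                job.pixels = new int[tileSize][tileSize];
                live++;
            }
            job.chunksRemaining--;
            peak = Math.max(peak, live);
            if (job.chunksRemaining > 0) {
                queue.add(job);    // re-enters ahead of every job that has not started
            } else {
                job.pixels = null; // tile finished; the buffer can be collected
                live--;
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        System.out.println(peakLiveBuffers(100, 4, 16)); // prints 1
    }
}
```

In this model each tile is finished before the next one allocates, so the peak stays at 1 regardless of queue length. With several render threads I'd expect the peak to be bounded by the thread count instead, though that isn't something this sketch demonstrates.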