Grid Server Arbiter Backlogging System. #10
mfdlabs-ops announced in Announcements
Notice
This has been pushed into the tickets backlog and will be revisited at a later date, so please be patient.
Notice No. 2: There were previous security vulnerabilities discovered by ops, so ;execute may be disabled at times.
Description
Within the next week (from the date 12/03/2022), we will begin implementing a backlog system in the bot's grid server arbiter.
What this entails is the following.
Reasoning.
The main reasoning behind this is hardware availability: the more grid servers that are opened, the sooner the machine starts to run out of memory or runs out of space in the paging file. This causes major drops in usage and major QoS issues where the bot will simply crash; we want to avoid this by limiting the number of instances open at once.
On start we normally set up an instance pool for better HA (High Availability) capabilities; with this pool, an instance can be cherry-picked at random and used. After it is done executing (only for script execution), the arbiter locks the instance and prevents its reuse by other users, or even the original user. It will also open another instance, regardless of whether we already have enough in the pool, after any of their leases expire.
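For illustration, here is a rough sketch of that pooling behaviour; the class names, pool size, and lease length below are placeholders assumed for the example, not the actual arbiter code:

```python
# Hypothetical sketch of the pooling behaviour described above: instances are
# pre-opened, one is cherry-picked at random per request, locked after a script
# execution finishes, and a replacement is opened once its lease expires.
import random
import time


class GridInstance:
    def __init__(self, instance_id, lease_seconds=300):
        self.instance_id = instance_id
        self.locked = False
        self.lease_expires_at = time.monotonic() + lease_seconds

    @property
    def lease_expired(self):
        return time.monotonic() >= self.lease_expires_at


class ArbiterPool:
    def __init__(self, pool_size=5):
        self.instances = [GridInstance(i) for i in range(pool_size)]
        self._next_id = pool_size

    def acquire(self):
        # Cherry-pick a random unlocked instance from the pool.
        available = [inst for inst in self.instances if not inst.locked]
        return random.choice(available) if available else None

    def release_after_execution(self, instance):
        # Script execution finished: lock the instance so nobody reuses it.
        instance.locked = True

    def reap_expired(self):
        # Once a lease expires, drop the instance and open a fresh one,
        # even if the pool is already at its target size.
        for inst in list(self.instances):
            if inst.lease_expired:
                self.instances.remove(inst)
                self.instances.append(GridInstance(self._next_id))
                self._next_id += 1
```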
How does this affect you as a user?
This will affect users in the following ways:
Implementation
We will try to implement one of the following solutions; this pull request just generalizes the idea, and it won't necessarily be just a backlogging system.
1. Arbiter Queue Backlogging.
Arbiter Queue backlogging will involve creating a separate queue of instances waiting either to be opened or to do work. This queue will only be backlogged if a specific condition is met, such as the total memory usage of all currently open instances passing a specific number, or the number of open instances reaching a specific number.
The plan for this is to limit the amount open with a threshold, but also to have a percentage invoker determine whether a random instance will be opened or queued. This would improve HA (High Availability) capabilities because it will either use an already-opened instance or the new instance; there would also have to be some checks that determine whether it is actually worth opening one.
The reason this seems appealing is that it limits memory usage on the machine, but it comes with the downside of potentially long wait times before script executions begin.
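A minimal sketch of that backlog gate, assuming made-up thresholds and a simple probability roll for the percentage invoker (none of the names or numbers below come from the real implementation):

```python
# Hypothetical sketch of the backlog gate: new work is queued when either the
# total memory of open instances or the open-instance count crosses a
# threshold; otherwise a percentage roll decides between reusing an
# already-open instance and opening a new one.
import random
from collections import deque

MAX_OPEN_INSTANCES = 20          # assumed threshold
MAX_TOTAL_MEMORY_MB = 8 * 1024   # assumed threshold
REUSE_PROBABILITY = 0.5          # the "percentage invoker"

backlog = deque()


def total_memory_mb(open_instances):
    return sum(inst["memory_mb"] for inst in open_instances)


def dispatch(job, open_instances, open_new_instance, reuse_instance):
    """Either run the job now or push it onto the backlog queue."""
    over_capacity = (
        len(open_instances) >= MAX_OPEN_INSTANCES
        or total_memory_mb(open_instances) >= MAX_TOTAL_MEMORY_MB
    )
    if over_capacity:
        backlog.append(job)          # wait until capacity frees up
        return "queued"
    if open_instances and random.random() < REUSE_PROBABILITY:
        reuse_instance(job)          # use an already-opened instance
    else:
        open_new_instance(job)       # spin up a fresh instance
    return "dispatched"
```

Tuning the reuse probability is what trades HA against memory here: the higher it is, the fewer new instances get opened.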
2. Shared usage of instances.
Right now, every time you use a script execution command, it will try to get an instance that hasn't been used at all, so you have a fresh instance to use, and crashing the instance or causing timeouts doesn't affect other users.
Sharing instances can bring down the instance count, because there is less reason to open more instances if users can bounce between already-pooled instances; paired with backlogging, this can push more users onto less-used instances with some math tricks and A/B experiments.
The downsides to this are obviously crash and timeout exploits: there is no way of recovering an instance if it crashes or a user times it out, which may affect others who have data in that instance's output. Another downside is output flooding: because instances are shared, people can log to and flood the output of other people's instances. While it's random, it can affect users who executed something and want to check the output of their code.
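A rough sketch of how shared routing could work, assuming a hypothetical per-instance user cap; the real arbiter may weigh instances differently:

```python
# Hypothetical sketch of shared instances: instead of always opening a fresh
# instance per script execution, route the user to the least-loaded instance
# that is already open, and only open a new one above a per-instance user cap.

MAX_USERS_PER_INSTANCE = 4  # assumed cap


def pick_shared_instance(open_instances, open_new_instance):
    """Return an instance for the next execution, preferring shared reuse."""
    candidates = [
        inst for inst in open_instances
        if len(inst["users"]) < MAX_USERS_PER_INSTANCE
    ]
    if candidates:
        # Push the user onto the least-used instance to keep the count down.
        return min(candidates, key=lambda inst: len(inst["users"]))
    return open_new_instance()
```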
3. Reusage of the same instance, renewing leases.
The final solution could be to reuse instances and lock them to specific users. This will also bring down the instance count and allow users to reuse the outputs of previously executed commands, and it will improve the QoS for users, as random people cannot cause crash exploits.
We can also do something where another instance is randomly allocated, and the user will bounce between each of their instances.
If the user gets blacklisted it will attempt to purge all of the data related to their instances.
The downsides of this are that more instances will be open at once, and that more code and factories will need to be implemented to track ownership of instances, which means a larger code base.
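A rough sketch of per-user ownership with lease renewal and a blacklist purge, using hypothetical names and a made-up lease length:

```python
# Hypothetical sketch of option 3: each user owns their own instance(s),
# leases are renewed on every execution, and a blacklist purges everything
# tied to that user.
import time

LEASE_SECONDS = 600  # assumed lease length


class OwnedInstanceRegistry:
    def __init__(self):
        self._by_user = {}  # user_id -> list of {"instance": ..., "expires": ...}

    def acquire(self, user_id, open_instance):
        entries = self._by_user.setdefault(user_id, [])
        now = time.monotonic()
        for entry in entries:
            if entry["expires"] > now:
                # Reuse the user's own instance and renew its lease.
                entry["expires"] = now + LEASE_SECONDS
                return entry["instance"]
        # No live lease: open a fresh instance locked to this user.
        instance = open_instance()
        entries.append({"instance": instance, "expires": now + LEASE_SECONDS})
        return instance

    def purge_user(self, user_id, close_instance):
        # Blacklisted user: tear down and forget all of their instances.
        for entry in self._by_user.pop(user_id, []):
            close_instance(entry["instance"])
```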
Closing Notes.
While we've given 3 options here and explained them in detail, we will probably implement all 3 and periodically swap between them to monitor how it affects traffic overall.
If you can, please reply to this issue with one of the following numbers, or react to this issue with the reactions supplied: thumbs up if you agree with any of the options, or thumbs down if you don't agree with any of them.
1 -> If you want arbiter backlogging.
2 -> If you want shared instances.
3 -> If you want the reusage of instances.
You may combine these numbers, like "1, 2 because blah.. blah..", and if you would like 2 or more, you may also supply a priority for each. If you reply with a vote, please also supply a brief explanation of why you would like it implemented.
Questions
If you have any questions, please ask below.
Images
As development continues, images will be attached here to show our progress.