
Add support for module to defrag incremental #372


Open
wants to merge 31 commits into base: active-defrag

Conversation

@sundb sundb (Owner) commented Feb 13, 2025

based on #371

yveslb and others added 4 commits February 13, 2025 08:42
…ext (redis#13794)

Help text modified for -i, --count, --pattern.
…cond (redis#13793)

```
if (server.aof_fsync == AOF_FSYNC_EVERYSEC &&
    server.aof_last_incr_fsync_offset != server.aof_last_incr_size &&
    server.mstime - server.aof_last_fsync >= 1000 &&
    !(sync_in_progress = aofFsyncInProgress())) {
    goto try_fsync;
```
In redis#12622, when appendfsync=everysec, if Redis has written some data to
the AOF but has not yet called `fsync`, and less than 1 second has passed
since the last `fsync`, Redis won't fsync the AOF, but it will still update
`fsynced_reploff_pending`, causing `WAITAOF` to return prematurely.

This bug was introduced in redis#12622, starting from 7.4.

The bug fix,
redis@1bd6688,
is as follows:
```diff
diff --git a/src/aof.c b/src/aof.c
index 8ccd8d8..521b304 100644
--- a/src/aof.c
+++ b/src/aof.c
@@ -1096,8 +1096,11 @@ void flushAppendOnlyFile(int force) {
              * in which case master_repl_offset will increase but fsynced_reploff_pending won't be updated
              * (because there's no reason, from the AOF POV, to call fsync) and then WAITAOF may wait on
              * the higher offset (which contains data that was only propagated to replicas, and not to AOF) */
-            if (!sync_in_progress && server.aof_fsync != AOF_FSYNC_NO)
+            if (server.aof_last_incr_fsync_offset == server.aof_last_incr_size &&
+                !(sync_in_progress = aofFsyncInProgress()))
+            {
                 atomicSet(server.fsynced_reploff_pending, server.master_repl_offset);
+            }
             return;
```
Additionally, we slightly refactored the AOF fsync code to make it simpler, as in
redis@584f008.
The reason the master sends PING is to keep the connection with the replica
active, so the master need not send PING to replicas if it has already sent
replication stream within the past `repl_ping_slave_period`.
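The rule above can be sketched as a small predicate. This is a hedged, illustrative sketch; the names (`should_send_ping`, the parameters) are not the actual server fields:

```c
#include <assert.h>

/* Illustrative sketch: the master only needs to PING a replica when no
 * replication stream traffic has been sent within the last
 * repl_ping_slave_period seconds. Names are hypothetical. */
static int should_send_ping(long long now_ms,
                            long long last_stream_sent_ms,
                            int repl_ping_slave_period_s) {
    return (now_ms - last_stream_sent_ms) >= repl_ping_slave_period_s * 1000LL;
}
```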

Now the master only sends PINGs and increases `master_repl_offset` when there
is no traffic, so this PR also reduces the impact of the issue in
redis#13773, though of course it does not resolve
it completely.
> Fix ping causing inconsistency between AOF size and replication offset
in the future PR. Because we increment the replication offset when
sending PING/REPLCONF to the replica but do not write data to the AOF
file, this might cause the starting offset of the AOF file plus its size
to be inconsistent with the actual replication offset.

codecov bot commented Feb 13, 2025

Codecov Report

Attention: Patch coverage is 20.00000% with 28 lines in your changes missing coverage. Please review.

Project coverage is 68.97%. Comparing base (3e68663) to head (051705f).

Files with missing lines Patch % Lines
src/defrag.c 20.00% 24 Missing ⚠️
src/module.c 20.00% 4 Missing ⚠️
Additional details and impacted files
@@                Coverage Diff                @@
##           active-defrag     #372      +/-   ##
=================================================
- Coverage          70.04%   68.97%   -1.07%     
=================================================
  Files                118      118              
  Lines              73362    68046    -5316     
=================================================
- Hits               51383    46938    -4445     
+ Misses             21979    21108     -871     
Files with missing lines Coverage Δ
src/server.h 100.00% <ø> (ø)
src/module.c 9.24% <20.00%> (-0.06%) ⬇️
src/defrag.c 86.69% <20.00%> (-2.81%) ⬇️

... and 18 files with indirect coverage changes

ShooterIT and others added 2 commits February 13, 2025 17:31
### Background 
AOF is often used as an effective data-recovery method, but if we have two
AOFs from different nodes, it is hard to tell which one has the latest data.
Generally, we determine whose data is more up to date by reading the latest
modification time of the AOF file, but because of replication delay, even if
both master and replica write to the AOF at the same time, the data on the
master is more up to date (there are commands that haven't arrived at the
replica yet, or a large number of commands have accumulated on the replica
side), so we may make the wrong decision.

### Solution
The replication offset always increments when AOF is enabled, even if there
are no replicas, so we think the replication offset is a better way to
determine which AOF has more up-to-date data: whichever has the larger offset
has the newer data. Therefore we add the start replication offset info to the
AOF manifest, as below.
```
file appendonly.aof.2.base.rdb seq 2 type b
file appendonly.aof.2.incr.aof seq 2 type i startoffset 224
```
And if we close the AOF file gracefully rather than crashing, e.g. on
`shutdown`, `kill -15` or `config set appendonly no`, we also add the end
replication offset, as below.
```
file appendonly.aof.2.base.rdb seq 2 type b
file appendonly.aof.2.incr.aof seq 2 type i startoffset 224 endoffset 532
```

#### Things to pay attention to
- For the BASE AOF, we do not add `startoffset` and `endoffset` info, since
we cannot know the start replication offset of its data, and it would not
help us determine which AOF has more up-to-date data.
- For AOFs from old versions, we also don't add `startoffset` and
`endoffset` info, since we don't know their start replication offset either.
If we assumed a start offset of 0, we might make the judgment even less
accurate. For example, if the master has just rewritten the AOF, its INCR AOF
will inevitably be very small, while a replica that has not rewritten its AOF
for a long time may have a much larger one; comparing offsets derived this
way could lead to incorrect decisions, so for such files we still just check
the timestamp instead of adding offset info.
- If the last INCR AOF has `startoffset` or `endoffset`, we need to restore
`server.master_repl_offset` from them to avoid a rollback of the
`startoffset` of the next INCR AOF. If it has `endoffset`, we just use this
value as `server.master_repl_offset`, and, very importantly, we must then
remove this information from the manifest file so that we don't load a stale
`endoffset` next time. If it only has `startoffset`, we calculate
`server.master_repl_offset` as `startoffset` plus the file size.

### How to determine which one has more up-to-date data
If AOF has a larger replication offset, it will have more up-to-date
data. The following is how to get AOF offset:

Read the AOF manifest file to obtain information about **the last INCR
AOF**:
1. If the last INCR AOF has an `endoffset` field, we can directly use
`endoffset` as the replication offset of the AOF
2. If there is no `endoffset` (e.g. Redis crashed abnormally) but the last
INCR AOF has a `startoffset` field, we can get the replication offset of the
AOF from `startoffset` plus the file size
3. Finally, if the AOF has neither `startoffset` nor `endoffset` (perhaps it
is from an old version and the new version of Redis has not rewritten the AOF
yet), we still need to check the modification timestamp of the last INCR AOF
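The three steps above reduce to a small decision function. This is an illustrative sketch under assumed names (`aof_effective_offset` is hypothetical, not the actual implementation); a return of -1 means "no offset info, compare modification timestamps instead":

```c
#include <assert.h>

/* Hedged sketch of the offset-selection rule for the last INCR AOF. */
static long long aof_effective_offset(int has_endoffset, long long endoffset,
                                      int has_startoffset, long long startoffset,
                                      long long incr_file_size) {
    if (has_endoffset) return endoffset;                       /* step 1 */
    if (has_startoffset) return startoffset + incr_file_size;  /* step 2 */
    return -1;                          /* step 3: fall back to timestamps */
}
```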

### TODO
Fix ping causing inconsistency between AOF size and replication
offset in the future PR. Because we increment the replication offset
when sending PING/REPLCONF to the replica but do not write data to the
AOF file, this might cause the starting offset of the AOF file plus its
size to be inconsistent with the actual replication offset.
MEMORY USAGE on a List samples quicklist entries, but does not account
for how many elements are in each sampled node. This can skew the
calculation when the sampled nodes are not balanced.
The fix calculates the average element size in the sampled nodes instead
of the average node size.
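A sketch of the corrected arithmetic (illustrative only, not the actual code in object.c): divide the sampled bytes by the sampled element count rather than by the node count, then scale by the total number of elements.

```c
#include <stddef.h>
#include <assert.h>

/* Hedged sketch: estimate total list size from sampled quicklist nodes. */
static size_t estimate_list_bytes(const size_t *node_bytes,
                                  const size_t *node_elems,
                                  int sampled_nodes,
                                  size_t total_elems) {
    size_t bytes = 0, elems = 0;
    for (int i = 0; i < sampled_nodes; i++) {
        bytes += node_bytes[i];
        elems += node_elems[i];
    }
    if (elems == 0) return 0;
    /* average element size in the sample, scaled to all elements */
    return bytes / elems * total_elems;
}
```

With two sampled nodes of 100 bytes/1 element and 300 bytes/3 elements, the old per-node average would weight both nodes equally; the per-element average correctly yields 100 bytes per element.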
sundb and others added 5 commits February 14, 2025 22:02
This PR adds three new hash commands: HGETDEL, HGETEX and HSETEX. These
commands enable the user to perform multiple operations in one atomic step,
e.g. set a hash field and update its TTL with a single command. Previously,
this was only possible by calling the HSET and HEXPIRE commands
subsequently.

- **HGETDEL command**

  ```
  HGETDEL <key> FIELDS <numfields> field [field ...]
  ```
  
  **Description**  
  Get and delete the value of one or more fields of a given hash key
  
  **Reply**  
Array reply: list of the values associated with each field, or nil if the
field doesn't exist.

- **HGETEX command**

  ```
   HGETEX <key>  
[EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT
unix-time-milliseconds | PERSIST]
     FIELDS <numfields> field [field ...]
  ```

  **Description**
Get the value of one or more fields of a given hash key, and optionally
set their expiration

  **Options:**
  EX seconds: Set the specified expiration time, in seconds.
  PX milliseconds: Set the specified expiration time, in milliseconds.
EXAT timestamp-seconds: Set the specified Unix time at which the field
will expire, in seconds.
PXAT timestamp-milliseconds: Set the specified Unix time at which the
field will expire, in milliseconds.
  PERSIST: Remove the time to live associated with the field.

  **Reply** 
Array reply: list of the values associated with each field, or nil if the
field doesn't exist.

- **HSETEX command**

  ```
  HSETEX <key>
     [FNX | FXX]
[EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT
unix-time-milliseconds | KEEPTTL]
     FIELDS <numfields> field value [field value...]
  ```
  **Description**
Set the value of one or more fields of a given hash key, and optionally
set their expiration

  **Options:**
  FNX: Only set the fields if all do not already exist.
  FXX: Only set the fields if all already exist.

  EX seconds: Set the specified expiration time, in seconds.
  PX milliseconds: Set the specified expiration time, in milliseconds.
EXAT timestamp-seconds: Set the specified Unix time at which the field
will expire, in seconds.
PXAT timestamp-milliseconds: Set the specified Unix time at which the
field will expire, in milliseconds.
  KEEPTTL: Retain the time to live associated with the field.

  
Note: If no option is provided, any associated expiration time will be
discarded, similar to how the SET command behaves.

  **Reply**
  Integer reply: 0 if no fields were set
  Integer reply: 1 if all the fields were set
Remove DENYOOM flag from hexpire / hexpireat / hpexpire / hpexpireat
commands.

h(p)expire(at) commands may allocate some memory, but not that much.
Similarly, we don't have the DENYOOM flag for the EXPIRE command. This change
aligns the EXPIRE and HEXPIRE commands in this manner.
@@ -1305,7 +1306,7 @@ REDISMODULE_API int *(*RedisModule_GetCommandKeys)(RedisModuleCtx *ctx, RedisMod
REDISMODULE_API int *(*RedisModule_GetCommandKeysWithFlags)(RedisModuleCtx *ctx, RedisModuleString **argv, int argc, int *num_keys, int **out_flags) REDISMODULE_ATTR;
REDISMODULE_API const char *(*RedisModule_GetCurrentCommandName)(RedisModuleCtx *ctx) REDISMODULE_ATTR;
REDISMODULE_API int (*RedisModule_RegisterDefragFunc)(RedisModuleCtx *ctx, RedisModuleDefragFunc func) REDISMODULE_ATTR;
-REDISMODULE_API int (*RedisModule_RegisterDefragCallbacks)(RedisModuleCtx *ctx, RedisModuleDefragFunc start, RedisModuleDefragFunc end) REDISMODULE_ATTR;
+REDISMODULE_API int (*RedisModule_RegisterDefragCallbacks)(RedisModuleCtx *ctx, RedisModuleDefragEventFunc start, RedisModuleDefragEventFunc end) REDISMODULE_ATTR;
sundb (Owner, Author):

because this API hasn't been released yet, it can't be considered a breaking change.


i'm not sure about that.
in RE we backported that to 7.4.
@MeirShpilraien do you happen to know if the internal modules we use rely on cross version ABI compatibility nowadays?
i.e. if the JSON that's packed in RE is pinned to the redis version and they're always upgraded together?


p.s. considering the new "Event" type is exactly the same as the old type, this is actually an API break that's not an ABI break.
but considering my other comment about not breaking RedisModuleDefragFunc and all its callers, and introducing a new API, this one won't require a change.


> @MeirShpilraien do you happen to know if the internal modules we use rely on cross version ABI compatibility nowadays?
> i.e. if the JSON that's packed in RE is pinned to the redis version and they're always upgraded together?

I believe they are pinned now.


even if we are, i'm still afraid to rely on that.
in any case with the recent updates to this PR, there's no worry now.
feel free to comment if you think the API is ok.

@@ -839,7 +839,8 @@ typedef struct RedisModuleDefragCtx RedisModuleDefragCtx;
/* Function pointers needed by both the core and modules, these needs to be
* exposed since you can't cast a function pointer to (void *). */
typedef void (*RedisModuleInfoFunc)(RedisModuleInfoCtx *ctx, int for_crash_report);
-typedef void (*RedisModuleDefragFunc)(RedisModuleDefragCtx *ctx);
+typedef int (*RedisModuleDefragFunc)(RedisModuleDefragCtx *ctx);
sundb (Owner, Author):

please note that this is a breaking change.


right, this is an ABI breaking change (and maybe an API break, depending on the type check of the caller).
we must avoid that.

interesting: depending on the calling convention, if we negate the return value, maybe this would not be an ABI break, since 0 is implicitly returned by the old one?

but anyway, we need a new interface, mark the old one as deprecated, and add a new one.
both should be operational (not necessarily at the same time for the same module).


p.s. this change requires a documentation update.
i.e. we now expect the module to call RM_DefragShouldStop and store/load a cursor, as well as return the right return value, right?
so even if we were ok with breaking the ABI, we should have documented the new usage.
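Under the expected new contract, a module defrag callback would resume from a stored cursor, periodically check whether it should yield, and report whether work remains. Below is a self-contained model of that pattern, not the real RedisModule API; the `budget` parameter stands in for the RM_DefragShouldStop check, and all names are hypothetical:

```c
#include <assert.h>

#define NITEMS 100

/* Hedged model: defrag NITEMS items incrementally. Returns 1 if more work
 * remains (cursor saved for the next call), 0 when the sweep is complete
 * (cursor reset for the next sweep). */
static int module_defrag_step(unsigned long *cursor, int budget) {
    while (*cursor < NITEMS) {
        /* ... defrag the item at *cursor ... */
        (*cursor)++;
        if (--budget == 0 && *cursor < NITEMS)
            return 1; /* out of budget: store cursor, resume next call */
    }
    *cursor = 0;
    return 0;
}
```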


src/defrag.c Outdated
Comment on lines 1520 to 1526
while ((de = dictNext(di)) != NULL) {
    struct RedisModule *module = dictGetVal(de);
    if (module->defrag_cb) {
        defragModuleCtx *ctx = zmalloc(sizeof(defragModuleCtx));
        ctx->module = module;
        ctx->cursor = 0;
        addDefragStage(defragModuleGlobals, ctx);


so we add a defrag stage per module, right?
and we re-create the stages on each sweep we start?

what happens if a module is un-loaded during a sweep?

i wonder if we have any observability feature to know when one is slow, or hangs, and which one it is.
probably not mandatory.

sundb (Owner, Author):

> so we add a defrag stage per module, right?

yes

> and we re-create the stages on each sweep we start?

we only recreate the stages when starting a new defragmentation.

> what happens if a module is un-loaded during a sweep?

good catch, we can't store the pointer to the module in the context; instead we need to find a way to get the module by id or by name.

> i wonder if we have any observability feature to know when one is slow, or hangs, and which one it is.
> probably not mandatory.

not now, I think in the future we can count the elapsed time of each stage.

@oranagra

to sum up: what we need is to add RM_RegisterDefragCallbacks2, keep two callback vars in redis, and call the appropriate one when set, so we don't break anything in the old (6.x?) defrag callbacks, and we also don't have to change the ones introduced in the 8.0 milestones.
but we add new interfaces for whoever needs to use the incremental defrag callbacks.
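That scheme could look roughly like the sketch below. All names here are hypothetical (the real registration functions, structs and context types differ); it only illustrates "keep two callback vars and call the appropriate one when set":

```c
#include <stddef.h>
#include <assert.h>

/* Legacy callbacks finish in one call; incremental ones return 1 while
 * work remains. */
typedef void (*LegacyDefragFunc)(void *ctx);
typedef int  (*IncrementalDefragFunc)(void *ctx);

struct module_defrag_cbs {
    LegacyDefragFunc legacy_cb;           /* set by the old registration API */
    IncrementalDefragFunc incremental_cb; /* set by the new API */
};

/* Dispatch to whichever callback the module registered; returns 1 if the
 * module still has defrag work pending. */
static int call_module_defrag(struct module_defrag_cbs *cbs, void *ctx) {
    if (cbs->incremental_cb) return cbs->incremental_cb(ctx);
    if (cbs->legacy_cb) cbs->legacy_cb(ctx);
    return 0; /* legacy path always completes in one call */
}

/* demo callbacks for illustration */
static void demo_legacy(void *ctx) { (void)ctx; }
static int demo_incremental(void *ctx) { (void)ctx; return 1; }
```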

for (int i = 0; i < global_strings_len; i++) {
RedisModuleString *new = RedisModule_DefragRedisModuleString(ctx, global_strings[i]);
global_attempts++;
unsigned long cursor = 0;
sundb (Owner, Author):

@oranagra i don't use a global index here; i think it's better to show module developers another way to use the cursor.

@sundb sundb (Owner, Author) commented Feb 17, 2025

@oranagra we can't mark RedisModule_RegisterDefragFunc as deprecated, because we have the flag -Werror=deprecated-declarations in the makefile:

In file included from server.h:68,
                 from tls.c:11:
redismodule.h: In function ‘RedisModule_Init’:
redismodule.h:1684:5: error: ‘RedisModule_RegisterDefragFunc’ is deprecated [-Werror=deprecated-declarations]
 1684 |     REDISMODULE_GET_API(RegisterDefragFunc);
      |     ^~~~~~~~~~~~~~~~~~~
redismodule.h:1309:23: note: declared here
 1309 | REDISMODULE_API int (*RedisModule_RegisterDefragFunc)(RedisModuleCtx *ctx, RedisModuleDefragFunc func) REDISMODULE_ATTR __attribute__ ((deprecated));
      |                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

src/defrag.c Outdated
Comment on lines 1222 to 1224
sds module_name = sdsnew(defrag_module_ctx->module_name);
RedisModule *module = moduleGetHandleByName(module_name);
sdsfree(module_name);
sundb (Owner, Author):

This is only a temporary fix; I'm going to let stages support their own free callback.

@oranagra

> @oranagra we can't mark RedisModule_RegisterDefragFunc as deprecated, because we have the flag -Werror=deprecated-declarations in the makefile.

i meant just to document it as obsolete, like we did for RM_KeyAtPos.
actually, looking at other things we replaced, like mem_usage2 instead of mem_usage, and unlink2 instead of unlink, we didn't even bother to say they're deprecated, just that the new version is more powerful. so i suppose that's enough too.

sundb and others added 7 commits February 18, 2025 15:26
Co-authored-by: Oran Agra <[email protected]>
closes redis#13797; just fixes a syntax
issue in comments, not in real code.
redis#13804)

The `dictGetSignedIntegerVal` function should be used here because,
in some cases (especially on 32-bit systems), `long` may be 4 bytes,
while the TTL saved in `expires` is a Unix timestamp in milliseconds,
which needs more than 4 bytes. In that case we may not get the correct
idle time, which may cause eviction disorder; in other words, keys that
should be evicted later may be evicted earlier.
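To see why 64 bits matter here: a current millisecond Unix timestamp is around 1.7e12, far beyond what a 4-byte signed long can hold, so truncation scrambles the relative order of TTLs. A minimal demonstration, with `int32_t` standing in for a 4-byte `long` (the helper name is illustrative):

```c
#include <stdint.h>
#include <assert.h>

/* Hedged sketch: does a value survive a round trip through 32 bits?
 * A millisecond Unix timestamp (~1.7e12) does not. */
static int fits_in_32_bits(long long v) {
    return v == (long long)(int32_t)v;
}
```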
In case a replica connection was closed mid-RDB, we should not send a \n
to that replica; otherwise it may reach the replica BEFORE the replica
realizes that the RDB transfer failed, causing it to treat the \n as if it
were read from the RDB stream.
This PR is based on: valkey-io/valkey#1462

## Issue/Problems

Duty Cycle: Active Defrag has configuration values which determine the
intended percentage of CPU to be used based on a gradient of the
fragmentation percentage. However, Active Defrag performs its work on
the 100ms serverCron timer. It then computes a duty cycle and performs a
single long cycle. For example, if the intended CPU is computed to be
10%, Active Defrag will perform 10ms of work on this 100ms timer cron.

* This type of cycle introduces large latencies on the client (up to
25ms with default configurations)
* This mechanism is subject to starvation when slow commands delay the
serverCron

Maintainability: The current Active Defrag code is difficult to read &
maintain. Refactoring of the high level control mechanisms and functions
will allow us to more seamlessly adapt to new defragmentation needs.
Specific examples include:

* A single function (activeDefragCycle) includes the logic to
start/stop/modify the defragmentation as well as performing one "step"
of the defragmentation. This should be separated out, so that the actual
defrag activity can be performed on an independent timer (see duty cycle
above).
* The code is focused on kvstores, with other actions just thrown in at
the end (defragOtherGlobals). There's no mechanism to break this up to
reduce latencies.
* For the main dictionary (only), there is a mechanism to set aside
large keys to be processed in a later step. However this code creates a
separate list in each kvstore (main dict or not), bleeding/exposing
internal defrag logic. We only need 1 list - inside defrag. This logic
should be more contained for the main key store.
* The structure is not well suited towards other non-main-dictionary
items. For example, pub-sub and pub-sub-shard was added, but it's added
in such a way that in CMD mode, with multiple DBs, we will defrag
pub-sub repeatedly after each DB.

## Description of the feature

Primarily, this feature will split activeDefragCycle into 2 functions.

1. One function will be called from serverCron to determine if a defrag
cycle (a complete scan) needs to be started. It will also determine if
the CPU expenditure needs to be adjusted.
2. The 2nd function will be a timer proc dedicated to performing defrag.
This will be invoked independently from serverCron.

Once the functions are split, there is more control over the latency
created by the defrag process. A new configuration will be used to
determine the running time for the defrag timer proc. The default for
this will be 500us (one-half of the current minimum time). Then the
timer will be adjusted to achieve the desired CPU. As an example, 5% of
CPU will run the defrag process for 500us every 10ms. This is much
better than running for 5ms every 100ms.
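The duty-cycle arithmetic in that example can be sketched as follows; the function and parameter names are illustrative, not actual configuration names:

```c
#include <assert.h>

/* Hedged sketch: with a fixed run quantum per timer pop, choose the timer
 * period so that quantum / period equals the target CPU fraction. */
static long long defrag_timer_period_us(long long run_quantum_us,
                                        int target_cpu_pct) {
    if (target_cpu_pct <= 0) return -1; /* defrag effectively disabled */
    return run_quantum_us * 100 / target_cpu_pct;
}
```

For a 500us quantum at 5% CPU this yields a 10ms period, i.e. "500us every 10ms" as described above, instead of 5ms every 100ms.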

The timer function will also adjust to compensate for starvation. If a
slow command delays the timer, the process will run proportionately
longer to ensure that the configured CPU is achieved. Given the presence
of slow commands, the proportional extra time is insignificant to
latency. This also addresses the overload case. At 100% CPU, if the
event loop slows, defrag will run proportionately longer to achieve the
configured CPU utilization.

Optionally, in low CPU situations, there would be little impact in
utilizing more than the configured CPU. We could optionally allow the
timer to pop more often (even with a 0ms delay) and the (tail) latency
impact would not change.

And we add a time limit for the defrag duty cycle to prevent excessive
latency. When latency is already high (indicated by a long time between
calls), we don't want to make it worse by running defrag for too long.

Addressing maintainability:

* The basic code structure can more clearly be organized around a
"cycle".
* Have clear begin/end functions and a set of "stages" to be executed.
* Rather than stages being limited to "kvstore" type data, a cycle
should be more flexible, incorporating the ability to incrementally
perform arbitrary work. This will likely be necessary in the future for
certain module types. It can be used today to address oddballs like
defragOtherGlobals.
* We reduced some of the globals and some of the coupling;
defrag_later should be removed from serverDb.
* Each stage should begin on a fresh cycle. So if there are
non-time-bounded operations like kvstoreDictLUTDefrag, these would be
less likely to introduce additional latency.


Signed-off-by: Jim Brunner
[[email protected]](mailto:[email protected])
Signed-off-by: Madelyn Olson
[[email protected]](mailto:[email protected])
Co-authored-by: Madelyn Olson
[[email protected]](mailto:[email protected])

Co-authored-by: ShooterIT <[email protected]>