chore: flush after serializing big string #5401

Merged: 4 commits merged into main on Jul 17, 2025

Conversation

kostasrim (Contributor):

Entries of a bucket serialized during a snapshot may contain large strings that cross the serialization threshold enforced by FlushIfNeeded(). If the next element in the bucket is a big value (say, a hash table), we would first serialize the first element of that container and only then flush, which can lead to memory pressure. To avoid that, we now call FlushIfNeeded() after serializing each string.

Resolves #5394
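
For context, here is a minimal sketch of the scenario and the fix. It is not the actual RdbSerializer code: the class, the bucket loop, and the threshold are simplified stand-ins, and only FlushIfNeeded / FlushState::kFlushEndEntry correspond to the real call shown in the diff below.

#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for the serializer; only the FlushIfNeeded call mirrors the real change.
enum class FlushState { kFlushEndEntry };

class BucketSerializerSketch {
 public:
  void SerializeBucket(const std::vector<std::string>& entries) {
    for (const auto& entry : entries) {
      buf_.append(entry);  // a big string can push buf_ well past the flush threshold
      // Before this change, the oversized buffer would only be flushed after the *next*
      // entry started serializing (e.g. after the first field of a hash), raising peak
      // memory. Flushing right after each string keeps the buffer bounded.
      FlushIfNeeded(FlushState::kFlushEndEntry);
    }
  }

 private:
  void FlushIfNeeded(FlushState) {
    if (buf_.size() >= kFlushThreshold)
      buf_.clear();  // the real serializer writes the buffer out to the sink here
  }

  static constexpr std::size_t kFlushThreshold = 32 * 1024;  // made-up limit for illustration
  std::string buf_;
};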

@kostasrim kostasrim self-assigned this Jul 2, 2025
// We flush here because if the next element in the bucket we are serializing is a container,
// it will first serialize the first entry and then flush the internal buffer, even if
// crossed the limit.
FlushIfNeeded(FlushState::kFlushEndEntry);
Contributor Author:

@adiholden This is a general issue, not only limited to strings, right? The same applies to other data types that are not chunked (treated as big values)? For example, we don't flush after we serialize a JSON value. What if all the elements in the bucket are JSON data types? Shall I take care of that as well?

Contributor Author:

P.S. I could also add a test, but 🤷

Contributor:

Why here and not in RdbSerializer::SaveEntry, at the end of the function? That way you can be sure it is actually after you finished saving the entry and not in the middle, in case there is some other serialization after SaveValue (which I see we don't have today, but might have in the future). By moving it to the end of SaveEntry it will also apply to all data types.
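
Roughly what this suggestion amounts to, as a hedged sketch; the function shapes and bodies below are made up for illustration, only the idea of a single flush at the end of the entry reflects the comment:

#include <string_view>
#include <system_error>

enum class FlushState { kFlushEndEntry };

// Stand-ins for the real serialization steps.
std::error_code SaveKey(std::string_view) { return {}; }
std::error_code SaveValue(std::string_view) { return {}; }
void FlushIfNeeded(FlushState) { /* flush the internal buffer to the sink if over the limit */ }

std::error_code SaveEntry(std::string_view key, std::string_view value) {
  if (auto ec = SaveKey(key)) return ec;
  if (auto ec = SaveValue(value)) return ec;
  // Flushing once at the end of the entry covers every data type (strings, json, ...)
  // and any serialization step that might be added after SaveValue in the future.
  FlushIfNeeded(FlushState::kFlushEndEntry);
  return {};
}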

Contributor:

P.S. Add a test.

@@ -64,7 +64,7 @@ size_t SliceSnapshot::GetThreadLocalMemoryUsage() {
}

bool SliceSnapshot::IsSnaphotInProgress() {
-  return tl_slice_snapshots.size() > 0;
+  return !tl_slice_snapshots.empty();
Contributor Author:

I could not resist; I will refactor all those cases eventually.

@kostasrim kostasrim requested a review from adiholden July 2, 2025 09:06
await wait_for_replicas_state(c_replica)

# make sure capacity hasn't changed after seeding
new_capacity = await get_memory(c_master, "prime_capacity")
Contributor:

Let's move this right after line 3419.

@dfly_args({"proactor_threads": 1})
async def test_big_strings(df_factory):
master = df_factory.create(proactor_threads=1, serialization_max_chunk_size=1)
replica = df_factory.create(proactor_threads=1, serialization_max_chunk_size=1)
Contributor:

Nit: serialization_max_chunk_size has no effect on the replica.

# inbetween. This is not a great test for this but we are limited because we can't fill
# a bucket with big strings as the memory in the gh runner is fairly limited. We at
# least check for correctness and *some* improvement in the memory foot print.
assert peak_memory < used_memory + five_mb
Contributor:

I feel this is not very stable. If we want to check that we actually flushed the data, we can add a statistic for the serializer's peak buffered bytes (updated every time we call SerializerBase::FlushToSink), print it at the end of replication as part of the statistics, and read it in this test to make sure the buffered bytes never exceed a single entry's size.
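
One possible shape for such a statistic, as a sketch; all names below are hypothetical, only SerializerBase::FlushToSink is referenced from the comment above:

#include <algorithm>
#include <cstddef>

// Hypothetical counter: remember the largest buffer ever handed to the sink, so a test
// can assert the serializer never buffered much more than one entry at a time.
class SerializerPeakBytesSketch {
 public:
  // Would be called at the start of SerializerBase::FlushToSink with the current buffer size.
  void OnFlushToSink(std::size_t buffered_bytes) {
    peak_flush_bytes_ = std::max(peak_flush_bytes_, buffered_bytes);
  }

  // Printed at the end of replication as part of the statistics and read by the test.
  std::size_t peak_flush_bytes() const { return peak_flush_bytes_; }

 private:
  std::size_t peak_flush_bytes_ = 0;
};

The test could then read this value after replication finishes and assert it stays within a single big entry's size, instead of comparing peak memory against used memory plus a 5 MB slack.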

@kostasrim kostasrim requested a review from adiholden July 17, 2025 07:28
@kostasrim kostasrim merged commit da59e40 into main Jul 17, 2025
10 checks passed
@kostasrim kostasrim deleted the kpr10 branch July 17, 2025 12:34
Successfully merging this pull request may close these issues.

Big value serialization - Add Flushing Support While Iterating Bucket Entries