Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using ObjectReader in the SqlSegmentsMetadataManager.doPollSegments #17732

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

umisan
Copy link

@umisan umisan commented Feb 16, 2025

Description

This PR aims to speed up metadata reading, improving performance during metadata polling.

This patch changes the code to use ObjectReader instead of ObjectMapper when reading multiple JSON objects. Since ObjectReader is slightly faster in this scenario, this change should improve the performance of metadata polling.
jackson document

Release note

Improved: You can now load newly added segments more quickly.


Key changed/added classes in this PR
  • SqlSegmentsMetadataManager

This PR has:

  • been self-reviewed.

@kfaraz
Copy link
Contributor

kfaraz commented Feb 16, 2025

Thanks for creating the PR, @umisan !

Have you been able to compare the performance of this code while polling real segments before and after the change?

In my experience, most of the time in the poll is actually spent in the IO itself rather than the Jackson deserialization.

@umisan
Copy link
Author

umisan commented Feb 16, 2025

@kfaraz
Thanks for reviewing my PR.

Sorry, I haven't tested this change on our Druid cluster yet.

I completely agree with your point that most of the time spent during polling is due to I/O, and the improvement from deserialization optimization might be negligible.

Our Druid cluster has about 1 million segments and takes several minutes to load newly added segments. Unfortunately, we don't have a staging Druid cluster, so I haven't been able to test this change in an environment with a large number of segments.

I am considering setting up a test Druid cluster to evaluate this change. However, it's possible that the results will show that this PR doesn't provide meaningful improvements.

@kfaraz
Copy link
Contributor

kfaraz commented Feb 16, 2025

I am considering setting up a test Druid cluster to evaluate this change. However, it's possible that the results will show that this PR doesn't provide meaningful improvements.

@umisan , yes, that's what I fear as well.
I had actually tried out a change to reduce the amount of deserialization done but it did not affect the total polling time at all.

FYI, we have recently merged a segment caching feature in #17653 .
The code in this patch is able to do a delta poll of ~600k segments in just a couple of seconds.
Essentially, it does not fetch segments from the metadata store which have not been updated.
While this does not currently impact the polling time in SqlSegmentsMetadataManager,
I intend to plug in the same logic there once I have had time to test out and benchmark the cache changes thoroughly.

@umisan
Copy link
Author

umisan commented Feb 16, 2025

I understand the current situation.
For now, I’m planning to set up a test Druid cluster to test this PR and also to prepare for future contributions. However, I don’t have experience building and running Druid locally, and I can only work on this during weekends, so it will take some time.
Given this, I’m thinking of closing this PR for now. If testing shows that this change provides meaningful improvements, I will reopen it.

Thank you for reviewing my PR and sharing your insights!
Thanks to this amazing open-source project, our team has been able to improve system performance while reducing costs.
Moving forward, I hope to contribute, even if only in a small way, to help improve Druid.

@kfaraz
Copy link
Contributor

kfaraz commented Feb 17, 2025

Thanks a lot, @umisan !

I am really glad to hear that you have enjoyed using Druid.
I look forward to the contributions from you!

@FrankChen021
Copy link
Member

@umisan i didn't find any information about ObjectReader from the link you gave. If the ObjectReader has better performance, what I think is we can apply it to the ingestion module which uses ObjectMapper heavily.

@umisan
Copy link
Author

umisan commented Feb 17, 2025

@FrankChen021
Sorry, my previous link was incorrect.
Here is the correct link:
https://github.com/fasterxml/jackson-docs/wiki/presentation:-jackson-performance
This document explains that ObjectReader performs better than ObjectMapper.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants