feat(ingest/superset): leverage threads for superset API calls #13006

Merged
merged 11 commits into from
Apr 2, 2025

Conversation

PeteMango
Contributor

@PeteMango PeteMango commented Mar 27, 2025

Previously, long dataset ingestions timed out: a full run took approximately 6 hours, while a single Preset JWT is valid for only 5 hours, so we could not ingest all the datasets within the lifespan of the token. This PR speeds up ingestion by using threads for the API calls, which brought the ingestion time down to 1 hour and 6 minutes.

This PR also aims to resolve this issue, which describes a similar problem with an expiring access token. If ingestion is still too slow for some large runs after this change, I plan to put up a follow-up PR that refreshes the JWT during the ingestion.
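
To illustrate the general idea of running blocking API calls on threads, here is a minimal sketch. It is not the code from this PR: `fetch_dataset_detail` is a hypothetical stand-in that sleeps instead of calling Superset, and the worker count is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List


def fetch_dataset_detail(dataset_id: int) -> Dict[str, object]:
    """Hypothetical stand-in for one blocking Superset API call."""
    time.sleep(0.05)  # simulate network latency
    return {"id": dataset_id, "name": f"dataset_{dataset_id}"}


def fetch_all(dataset_ids: List[int], max_workers: int = 10) -> List[Dict[str, object]]:
    """Fetch details concurrently; results arrive in completion order."""
    results: List[Dict[str, object]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fetch_dataset_detail, i) for i in dataset_ids]
        for future in as_completed(futures):
            # result() re-raises any exception raised inside the worker
            results.append(future.result())
    return results


if __name__ == "__main__":
    start = time.perf_counter()
    details = fetch_all(list(range(20)))
    elapsed = time.perf_counter() - start
    print(f"fetched {len(details)} datasets in {elapsed:.2f}s")
```

Because the calls are I/O-bound, the threads spend most of their time waiting on the network, so the wall-clock time shrinks roughly with the number of workers until the server or rate limits become the bottleneck.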

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@PeteMango PeteMango marked this pull request as ready for review March 27, 2025 22:36
@github-actions github-actions bot added the ingestion and community-contribution labels Mar 27, 2025
@PeteMango PeteMango changed the title from "infra(ingestion/superset): threads for processing, to improve ingestion by ~6x" to "infra(ingestion/superset): threads for processing, improved ingestion speed by ~6x" Mar 27, 2025

codecov bot commented Mar 27, 2025

Codecov Report

Attention: Patch coverage is 63.93443% with 22 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| ...ingestion/src/datahub/ingestion/source/superset.py | 63.93% | 22 Missing ⚠️ |

❌ Your patch status has failed because the patch coverage (63.93%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

@PeteMango PeteMango changed the title from "infra(ingestion/superset): threads for processing, improved ingestion speed by ~6x" to "infra(ingestion/superset): threads for processing ingestions" Mar 28, 2025
@PeteMango PeteMango changed the title from "infra(ingestion/superset): threads for processing ingestions" to "infra(ingestion/superset): leverage threads for processing ingestions" Mar 28, 2025
    executor.submit(process_chart, chart_data): chart_data
    for chart_data in self.paginate_entity_api_results("chart/", PAGE_SIZE)
}
for future in as_completed(future_to_chart):
Collaborator

let's use our ThreadedIteratorExecutor class for this

Contributor Author

Good call. Updated to use DataHub's ThreadedIteratorExecutor class.
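
For readers unfamiliar with the pattern, the sketch below shows the rough idea of an executor that runs generator-style workers on threads and streams their output back as a single iterator. It does not import DataHub: `process_in_threads` is a hypothetical, simplified stand-in, and its signature (a worker function that yields items, a list of argument tuples, a worker limit) is an assumption for illustration, not the actual ThreadedIteratorExecutor API.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel: one worker has finished


def process_in_threads(
    worker_func: Callable[..., Iterable[T]],
    args_list: Iterable[Tuple[Any, ...]],
    max_workers: int,
) -> Iterator[T]:
    """Hypothetical helper: run worker_func(*args) per tuple, yield items as produced."""
    out_q: "queue.Queue[Any]" = queue.Queue()

    def run(args: Tuple[Any, ...]) -> None:
        try:
            for item in worker_func(*args):
                out_q.put(item)
        finally:
            out_q.put(_DONE)  # always signal completion, even on error

    all_args = list(args_list)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(run, args) for args in all_args]
        remaining = len(all_args)
        while remaining:
            item = out_q.get()
            if item is _DONE:
                remaining -= 1
            else:
                yield item
        for future in futures:
            future.result()  # re-raise any worker exception


def chart_work_units(chart_id: int) -> Iterable[str]:
    """Hypothetical worker: yields two work units per chart."""
    yield f"chart-{chart_id}-info"
    yield f"chart-{chart_id}-status"


if __name__ == "__main__":
    for unit in process_in_threads(chart_work_units, [(i,) for i in range(5)], max_workers=3):
        print(unit)
```

Compared with building a dict of futures and looping over `as_completed`, a shared helper like this keeps the threading details in one place and lets the caller stay a plain generator. Items from different workers may interleave, so callers should not rely on ordering.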

@@ -1024,17 +1050,15 @@ def construct_dataset_from_dataset_data(
        return dataset_snapshot

    def emit_dataset_mces(self) -> Iterable[MetadataWorkUnit]:
        for dataset_data in self.paginate_entity_api_results("dataset/", PAGE_SIZE):
        def process_dataset(dataset_data):
Collaborator

Nesting methods within other methods is typically not a great practice.

Contributor Author

Yep. Moved it out into a private method on the class.
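
A minimal sketch of that refactor, assuming a stripped-down class: `ExampleSource` and its string "work units" are hypothetical and only illustrate the shape of the change, with the threading left out for brevity.

```python
from typing import Dict, Iterable, List


class ExampleSource:
    """Hypothetical, stripped-down source; not the real Superset source class."""

    def __init__(self, datasets: List[Dict[str, str]]) -> None:
        self._datasets = datasets

    def _process_dataset(self, dataset_data: Dict[str, str]) -> Iterable[str]:
        """Private method instead of a nested function: callable and testable on its own."""
        yield f"workunit-for-{dataset_data['name']}"

    def emit_dataset_mces(self) -> Iterable[str]:
        """The emitter only orchestrates; per-dataset logic lives in the private method."""
        for dataset_data in self._datasets:
            yield from self._process_dataset(dataset_data)


if __name__ == "__main__":
    source = ExampleSource([{"name": "a"}, {"name": "b"}])
    print(list(source.emit_dataset_mces()))
```

Keeping the per-dataset logic at class level avoids capturing variables from the enclosing method by accident and makes it straightforward to unit test the processing step in isolation.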

@datahub-cyborg datahub-cyborg bot added the pending-submitter-response label and removed the needs-review label Mar 29, 2025
@PeteMango PeteMango requested a review from hsheth2 March 31, 2025 14:40
@datahub-cyborg datahub-cyborg bot added the needs-review label and removed the pending-submitter-response label Mar 31, 2025
Collaborator

@hsheth2 hsheth2 left a comment

Looks pretty reasonable.

@hsheth2 hsheth2 changed the title from "infra(ingestion/superset): leverage threads for processing ingestions" to "feat(ingest/superset): leverage threads for superset API calls" Apr 2, 2025
@datahub-cyborg datahub-cyborg bot added the merge-pending-ci label and removed the needs-review label Apr 2, 2025
@hsheth2 hsheth2 merged commit 719cc67 into datahub-project:master Apr 2, 2025
101 of 105 checks passed
Labels
  • community-contribution: PR or Issue raised by member(s) of DataHub Community
  • ingestion: PR or Issue related to the ingestion of metadata
  • merge-pending-ci: A PR that has passed review and should be merged once CI is green.
Development

Successfully merging this pull request may close these issues.

Superset ingestor is inefficient to ingest large amount of data
2 participants