-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ingest/superset): leverage threads for superset API calls #13006
Conversation
Codecov ReportAttention: Patch coverage is
❌ Your patch status has failed because the patch coverage (63.93%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage. 📢 Thoughts on this report? Let us know! 🚀 New features to boost your workflow:
|
executor.submit(process_chart, chart_data): chart_data | ||
for chart_data in self.paginate_entity_api_results("chart/", PAGE_SIZE) | ||
} | ||
for future in as_completed(future_to_chart): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's use our ThreadedIteratorExecutor
class for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call. Updated to use datahub's ThreadedIteratorExecutor
class
@@ -1024,17 +1050,15 @@ def construct_dataset_from_dataset_data( | |||
return dataset_snapshot | |||
|
|||
def emit_dataset_mces(self) -> Iterable[MetadataWorkUnit]: | |||
for dataset_data in self.paginate_entity_api_results("dataset/", PAGE_SIZE): | |||
def process_dataset(dataset_data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nesting methods within other methods is typically not a great practice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. Moved it to a private method outside.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty reasonable
Before we had timeouts on long dataset ingestions because it would take approx 6 hours, while the lifespan of a single preset jwt was 5 hours. This means we were unable to ingest all the datasets in the lifespan of the jwt. This pr improves the speed by leveraging threads for api calls. This was able to improve the ingestion time down to 1 hour and 6 minutes.
This pr also hopes to resolve this issue, which is a similar problem with expiring access token. I plan to put up another pr after this if it is still too slow for some ingestions to refresh the jwt on large ingestions.
Checklist