feat(dbt): enhance DBTCloud integration with bulk job ingestion #15264
Codecov Report❌ Patch coverage is
🔴 Meticulous spotted visual differences in 2 of 1067 screens tested. Meticulous evaluated ~8 hours of user flows against your PR. Last updated for commit aa64b17.
Loved the PR description 🔝
sgomezvillamor left a comment
Overall LGTM
My main comments/suggestions:
- Give more visibility in the UI to some info/warn/error log traces.
- Is there anything preventing auto-discovery from being enabled by default?
logger.debug(f"Fetching jobs for account {account_id} from dbt Cloud: {url}")
response = requests.get(
    url,
Potential user input in HTTP request may allow SSRF attack - medium severity
If an attacker can control the URL input leading into this HTTP request, the attacker might be able to perform an SSRF attack. This kind of attack is even more dangerous if the application returns the response of the request to the user. It could allow them to retrieve information from higher-privileged services within the network (such as the metadata service, which is commonly available in cloud services, and could allow them to retrieve credentials).
Remediation - medium confidence
This patch mitigates the opening of potentially unsafe URLs by implementing validation for URLs passed to urllib_urlopen.
-     url,
+     (
+         url
+         if url.startswith(("https://", "http://"))
+         else ValueError("Invalid URL scheme")
+     ),
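Note that a Python conditional expression of the form `x if cond else ValueError(...)` evaluates to the exception instance rather than raising it, so the patch as written would pass a `ValueError` object in place of the URL. A minimal sketch of a scheme check that actually raises is shown below. The helper name `validate_url`, the `ALLOWED_SCHEMES` set, and the example URL are assumptions for illustration, not part of this PR, and scheme validation alone does not stop requests to internal hosts.

```python
from urllib.parse import urlparse

# Hypothetical allow-list; adjust to the schemes the integration really needs.
ALLOWED_SCHEMES = {"https", "http"}


def validate_url(url: str) -> str:
    """Return url unchanged if it has an allowed scheme and a host, else raise."""
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES or not parsed.netloc:
        # Raise explicitly so the invalid value never reaches the HTTP client.
        raise ValueError(f"Invalid URL: {url!r}")
    return url


# Placeholder URL for illustration only.
checked = validate_url("https://example.com/api/v2/accounts/1/jobs/")
```

The validated value could then be passed to `requests.get` in place of the raw `url`.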
"""
if not self._is_auto_discovery_enabled():
    return []
assert self.config.auto_discovery is not None
Dangerous use of assert - low severity
When Python runs in optimized mode, assert statements are not executed. This mode is enabled with the -O command-line flag or the PYTHONOPTIMIZE environment variable. If optimized mode is on in production, any safety check done using assert will not be executed.
Remediation: Raise an exception instead of using assert.
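A minimal sketch of that remediation follows, assuming the assert exists to guard against (and narrow the type of) an optional config field. The `Config` and `AutoDiscoveryConfig` classes here are simplified, hypothetical stand-ins for the real config models, and `require_auto_discovery` is an illustrative helper name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AutoDiscoveryConfig:
    """Hypothetical stand-in for the real auto-discovery settings."""
    enabled: bool = True


@dataclass
class Config:
    """Hypothetical stand-in for the source config."""
    auto_discovery: Optional[AutoDiscoveryConfig] = None


def require_auto_discovery(config: Config) -> AutoDiscoveryConfig:
    """Return the auto-discovery settings, raising if they are missing.

    Unlike an assert, this check still runs under `python -O`.
    """
    if config.auto_discovery is None:
        raise ValueError("auto_discovery must be configured for auto-discovery mode")
    return config.auto_discovery
```

Returning the narrowed value also gives static type checkers the same non-None guarantee the assert was providing.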
Bundle Report
Changes will increase total bundle size by 5.22kB (0.02%) ⬆️. This is within the configured threshold ✅
Affected bundle: datahub-react-web-esm
        return [], {}
    run_id = None  # Always use latest run in auto-discovery
else:
    assert self.config.job_id is not None
Dangerous use of assert - low severity
When Python runs in optimized mode, assert statements are not executed. This mode is enabled with the -O command-line flag or the PYTHONOPTIMIZE environment variable. If optimized mode is on in production, any safety check done using assert will not be executed.
Remediation: Raise an exception instead of using assert.
- Updated DBTCloudConfig to include optional project_id and job_id patterns for auto-discovery.
- Added validation to ensure project_id is provided when job_id is specified.
- Implemented job fetching and filtering logic in DBTCloudSource to support both single job ingestion and auto-discovery.
- Enhanced error handling and logging for job fetching processes.
…ionality
- Added new models for DBTCloud API responses to facilitate data validation.
- Implemented auto-discovery configuration in DBTCloudConfig to streamline job ingestion.
- Enhanced DBTCloudSource to support both explicit job ingestion and auto-discovery, improving flexibility in job management.
- Updated logging and error handling for better traceability during job fetching processes.
…ion process
- Simplified the validation logic in DBTCloudConfig for auto-discovery and explicit mode.
- Introduced a helper method to check auto-discovery status in DBTCloudSource.
- Enhanced job fetching and filtering logic to improve clarity and maintainability.
- Updated logging messages for better context during job discovery and ingestion.
…ery functionality
- Introduced comprehensive integration tests for the dbt Cloud auto-discovery process, covering end-to-end workflows, error handling, and job ingestion scenarios.
- Added unit tests for configuration validation, API response parsing, and job/environment filtering logic in dbt Cloud.
- Enhanced test coverage for both explicit and auto-discovery modes, ensuring consistent behavior and error handling across different configurations.
…modes
- Introduced two operating modes for dbt Cloud source: Explicit Mode for single job ingestion and Auto-Discovery Mode for automatic metadata retrieval from all eligible jobs.
- Updated documentation to reflect new modes and their configurations.
- Enhanced logging to provide better insights during job processing and filtering.
- Added new metrics to track jobs retrieved and processed, improving reporting capabilities.
aa64b17 to bc31fd3
DBT Cloud Auto-Discovery Mode
Overview
This document describes the new auto-discovery mode feature for the DBT Cloud integration in DataHub. This enhancement enables automatic discovery of jobs within a specified DBT Cloud project, eliminating the need to manually specify individual job IDs.
Motivation
Previously, users had to manually specify a single job_id to ingest DBT Cloud metadata. For organizations with multiple jobs in a project, this required creating separate ingestion configurations for each job. Auto-discovery mode solves this by automatically discovering and ingesting all qualifying jobs in a project.
Features
Automatic Job Discovery
generate_docs=True enabled
Dual Mode Operation
The integration now supports two modes:
Explicit Mode (existing, backward compatible):
- job_id
- run_id (defaults to latest)
Auto-Discovery Mode (new):
- generate_docs flag
Configuration
Basic Auto-Discovery Configuration
Advanced Configuration with Job Filtering
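The configuration example for this section did not survive on this page. As a rough illustration of what job filtering could look like in code, the sketch below keeps only jobs that have generate_docs set and whose ID matches an allow/deny pattern. The `JobIdPattern` class, the `filter_jobs` function, and the dictionary shape of a job are all assumptions for illustration, not the actual implementation or config schema in this PR.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class JobIdPattern:
    """Hypothetical allow/deny filter over job IDs, using regular expressions."""
    allow: List[str] = field(default_factory=lambda: [".*"])
    deny: List[str] = field(default_factory=list)

    def allowed(self, job_id: int) -> bool:
        text = str(job_id)
        # A deny match always wins over an allow match.
        if any(re.fullmatch(p, text) for p in self.deny):
            return False
        return any(re.fullmatch(p, text) for p in self.allow)


def filter_jobs(jobs: List[Dict[str, Any]], pattern: JobIdPattern) -> List[Dict[str, Any]]:
    """Keep jobs that generate docs and pass the ID pattern."""
    return [
        job for job in jobs
        if job.get("generate_docs") and pattern.allowed(job["id"])
    ]


# Made-up sample data for illustration.
jobs = [
    {"id": 101, "generate_docs": True},
    {"id": 102, "generate_docs": False},
    {"id": 201, "generate_docs": True},
]
selected = filter_jobs(jobs, JobIdPattern(allow=[r"1\d+"]))
print([job["id"] for job in selected])  # [101]
```

Giving deny patterns precedence is one possible design choice; it keeps an explicit exclusion effective even when the allow list is broad.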
Explicit Mode (Backward Compatible)
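The configuration example for this section is also missing here. The sketch below illustrates, under stated assumptions, how the two modes might be told apart: explicit mode stays the default and requires job_id, while auto-discovery always uses the latest run. `CloudConfig`, its fields, and the rule that auto-discovery takes precedence when both are set are hypothetical simplifications, not the validation logic actually merged in this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AutoDiscoveryConfig:
    """Hypothetical stand-in for the auto-discovery settings."""
    enabled: bool = True


@dataclass
class CloudConfig:
    """Hypothetical, simplified source config covering both modes."""
    project_id: int
    job_id: Optional[int] = None
    run_id: Optional[int] = None
    auto_discovery: Optional[AutoDiscoveryConfig] = None

    def __post_init__(self) -> None:
        # Explicit mode needs a job_id; auto-discovery finds jobs itself.
        if not self.is_auto_discovery_enabled() and self.job_id is None:
            raise ValueError("job_id is required unless auto-discovery is enabled")

    def is_auto_discovery_enabled(self) -> bool:
        return self.auto_discovery is not None and self.auto_discovery.enabled

    def mode(self) -> str:
        return "auto_discovery" if self.is_auto_discovery_enabled() else "explicit"

    def effective_run_id(self) -> Optional[int]:
        # None means "use the latest run"; auto-discovery always uses the latest.
        return None if self.is_auto_discovery_enabled() else self.run_id


# An existing explicit-mode config keeps working without any new fields.
legacy = CloudConfig(project_id=1, job_id=42, run_id=7)
print(legacy.mode(), legacy.effective_run_id())  # explicit 7
```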
Future Enhancements
The AutoDiscoveryConfig structure is designed to support future enhancements:
References