Conversation

@askumar27 (Contributor) commented Nov 10, 2025

DBT Cloud Auto-Discovery Mode

Overview

This document describes the new auto-discovery mode feature for the DBT Cloud integration in DataHub. This enhancement enables automatic discovery of jobs within a specified DBT Cloud project, eliminating the need to manually specify individual job IDs.

Motivation

Previously, users had to manually specify a single job_id to ingest DBT Cloud metadata. For organizations with multiple jobs in a project, this required creating separate ingestion configurations for each job. Auto-discovery mode solves this by automatically discovering and ingesting all qualifying jobs in a project.

Features

Automatic Job Discovery

  • Discovers all jobs for a specified DBT Cloud project
  • Filters to only production environment jobs
  • Only ingests jobs that have generate_docs enabled
  • Supports optional regex-based job filtering
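The discovery filters above can be sketched in a few lines. Note that `discover_jobs`, the job dicts, and the field names here are hypothetical illustrations, not DataHub's actual implementation; in the real source the environment list would come from the DBT Cloud API.

```python
def discover_jobs(jobs, production_env_ids):
    """Keep only the jobs auto-discovery would ingest (illustrative sketch)."""
    return [
        job
        for job in jobs
        if job["environment_id"] in production_env_ids  # production environments only
        and job.get("generate_docs", False)             # docs generation must be enabled
    ]
```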

Dual Mode Operation

The integration now supports two modes:

  1. Explicit Mode (existing, backward compatible):
     • Manually specify a single job_id
     • Optionally specify run_id (defaults to the latest run)
  2. Auto-Discovery Mode (new):
     • Automatically discovers jobs for a project
     • Always uses the latest run for each job
     • Filters by production environment and the generate_docs flag
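The mode selection described above can be sketched as follows, assuming a plain dict config and the validation rules implied by this description (`resolve_mode` is a hypothetical helper, not the PR's actual code):

```python
def resolve_mode(config):
    """Pick the operating mode for a dbt Cloud source config (illustrative)."""
    auto = config.get("auto_discovery") or {}
    if auto.get("enabled"):
        # Auto-discovery always uses the latest run, so an explicit job_id
        # would be ambiguous here (assumed validation rule)
        if config.get("job_id") is not None:
            raise ValueError("job_id cannot be combined with auto_discovery")
        return "auto-discovery"
    if config.get("job_id") is None:
        raise ValueError("either job_id or auto_discovery.enabled must be set")
    return "explicit"
```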

Configuration

Basic Auto-Discovery Configuration

```yaml
source:
  type: "dbt-cloud"
  config:
    # Required fields
    account_id: 107298
    project_id: 466863
    target_platform: "snowflake"
    token: "${DBT_CLOUD_TOKEN}"

    # Auto-discovery configuration
    auto_discovery:
      enabled: true
```

Advanced Configuration with Job Filtering

```yaml
source:
  type: "dbt-cloud"
  config:
    account_id: 107298
    project_id: 466863
    target_platform: "snowflake"
    token: "${DBT_CLOUD_TOKEN}"

    # Optional: platform_instance for DBT project identification
    platform_instance: "dbt_cloud_project_466863"

    # Optional: target_platform_instance to link to specific platform instance
    target_platform_instance: "snowflake_prod"

    # Auto-discovery with job filtering
    auto_discovery:
      enabled: true
      job_id_pattern:
        allow:
          - "96.*"      # Only jobs starting with 96
          - "148094"    # Specific job
        deny:
          - ".*test.*"  # Exclude test jobs
```
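The allow/deny patterns above behave roughly like this sketch. `job_id_allowed` is a hypothetical helper, and the anchored `re.match` semantics and deny-wins precedence mirror DataHub's usual `AllowDenyPattern` convention (an assumption worth verifying against the implementation):

```python
import re

def job_id_allowed(job_id, allow, deny):
    """Return True if a job id matches the allow list and is not denied."""
    s = str(job_id)
    if any(re.match(pattern, s) for pattern in deny):
        return False  # deny patterns take precedence (assumed)
    return any(re.match(pattern, s) for pattern in allow)
```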

Explicit Mode (Backward Compatible)

```yaml
source:
  type: "dbt-cloud"
  config:
    account_id: 107298
    project_id: 466863
    target_platform: "snowflake"
    token: "${DBT_CLOUD_TOKEN}"

    # Explicit job specification
    job_id: 148094
    run_id: 12345  # Optional, defaults to latest run
```
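Run selection in explicit mode (use `run_id` if given, otherwise fall back to the job's latest run) reduces to something like the following; `resolve_run_id` and `fetch_latest_run` are hypothetical stand-ins for the DBT Cloud API lookup:

```python
def resolve_run_id(job_id, run_id, fetch_latest_run):
    # An explicit run_id wins; otherwise look up the job's latest run
    return run_id if run_id is not None else fetch_latest_run(job_id)
```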

Future Enhancements

The AutoDiscoveryConfig structure is designed to support future enhancements:

  • Account-level discovery: Discover projects and jobs across an entire account
  • Environment filtering: Support non-production environments
  • Job status filtering: Filter by job success/failure status
  • Schedule-based filtering: Filter jobs by schedule frequency
  • Tag-based filtering: Filter jobs by DBT Cloud tags

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Nov 10, 2025

codecov bot commented Nov 10, 2025

Codecov Report

❌ Patch coverage is 99.28058% with 1 line in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...tion/src/datahub/ingestion/source/dbt/dbt_cloud.py | 99.11% | 1 Missing ⚠️ |


alwaysmeticulous bot commented Nov 10, 2025

🔴 Meticulous spotted visual differences in 2 of 1067 screens tested: view and approve the detected differences.

Meticulous evaluated ~8 hours of user flows against your PR.

Last updated for commit aa64b17. This comment will update as new commits are pushed.

@sgomezvillamor (Contributor) commented:

Loved the PR description 🔝

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Nov 12, 2025
@sgomezvillamor (Contributor) left a review:

Overall LGTM

My main comments/suggestions:

  • give more visibility in the UI to some info/warn/error log traces
  • is there anything preventing auto-discovery from being enabled by default?

@datahub-cyborg datahub-cyborg bot added pending-submitter-merge and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Nov 12, 2025

```python
logger.debug(f"Fetching jobs for account {account_id} from dbt Cloud: {url}")
response = requests.get(
    url,
```
@aikido-pr-checks bot commented Nov 12, 2025

Potential user input in HTTP request may allow SSRF attack - medium severity
If an attacker can control the URL input leading into this HTTP request, the attacker might be able to perform an SSRF attack. This kind of attack is even more dangerous if the application returns the response of the request to the user, as it could allow them to retrieve information from higher-privileged services within the network (such as the metadata service, which is commonly available in cloud environments and could expose credentials).

Remediation (medium confidence): this patch mitigates the opening of potentially unsafe URLs by validating the URL scheme before the HTTP request is made.

Suggested change

```python
# Validate the scheme first so an unexpected URL raises instead of being fetched
if not url.startswith(("https://", "http://")):
    raise ValueError("Invalid URL scheme")
response = requests.get(
    url,
```

View details in Aikido Security

```python
"""
if not self._is_auto_discovery_enabled():
    return []
assert self.config.auto_discovery is not None
```
@aikido-pr-checks bot commented Nov 12, 2025

Dangerous use of assert - low severity
When running Python in production in optimized mode, assert calls are not executed. This mode is enabled by setting the PYTHONOPTIMIZE command line flag. Optimized mode is usually ON in production. Any safety check done using assert will not be executed.

Remediation: Raise an exception instead of using assert.
View details in Aikido Security

codecov bot commented Nov 12, 2025

Bundle Report

Changes will increase total bundle size by 5.22kB (0.02%) ⬆️. This is within the configured threshold ✅

Detailed changes

| Bundle name | Size | Change |
| --- | --- | --- |
| datahub-react-web-esm | 28.64MB | 5.22kB (0.02%) ⬆️ |

Affected assets:

| Asset Name | Size Change | Total Size | Change (%) |
| --- | --- | --- | --- |
| assets/index-*.js | 5.22kB | 19.01MB | 0.03% |

```python
            return [], {}
        run_id = None  # Always use latest run in auto-discovery
    else:
        assert self.config.job_id is not None
```
@aikido-pr-checks bot commented Nov 12, 2025

Dangerous use of assert - low severity
Same finding as above: in optimized mode (PYTHONOPTIMIZE), assert calls are not executed, so any safety check done with assert is skipped. Remediation: raise an exception instead of using assert.
View details in Aikido Security

- Updated DBTCloudConfig to include optional project_id and job_id patterns for auto-discovery.
- Added validation to ensure project_id is provided when job_id is specified.
- Implemented job fetching and filtering logic in DBTCloudSource to support both single job ingestion and auto-discovery.
- Enhanced error handling and logging for job fetching processes.
…ionality

- Added new models for DBTCloud API responses to facilitate data validation.
- Implemented auto-discovery configuration in DBTCloudConfig to streamline job ingestion.
- Enhanced DBTCloudSource to support both explicit job ingestion and auto-discovery, improving flexibility in job management.
- Updated logging and error handling for better traceability during job fetching processes.
…ion process

- Simplified the validation logic in DBTCloudConfig for auto-discovery and explicit mode.
- Introduced a helper method to check auto-discovery status in DBTCloudSource.
- Enhanced job fetching and filtering logic to improve clarity and maintainability.
- Updated logging messages for better context during job discovery and ingestion.
…ery functionality

- Introduced comprehensive integration tests for the dbt Cloud auto-discovery process, covering end-to-end workflows, error handling, and job ingestion scenarios.
- Added unit tests for configuration validation, API response parsing, and job/environment filtering logic in dbt Cloud.
- Enhanced test coverage for both explicit and auto-discovery modes, ensuring consistent behavior and error handling across different configurations.
…modes

- Introduced two operating modes for dbt Cloud source: Explicit Mode for single job ingestion and Auto-Discovery Mode for automatic metadata retrieval from all eligible jobs.
- Updated documentation to reflect new modes and their configurations.
- Enhanced logging to provide better insights during job processing and filtering.
- Added new metrics to track jobs retrieved and processed, improving reporting capabilities.
@askumar27 askumar27 force-pushed the feature/acr-6674/bulk-dbt-job-ingestion branch from aa64b17 to bc31fd3 Compare November 13, 2025 19:21
@askumar27 askumar27 merged commit 2f31635 into master Nov 13, 2025
63 checks passed
@askumar27 askumar27 deleted the feature/acr-6674/bulk-dbt-job-ingestion branch November 13, 2025 19:55