Conversation

@askumar27 (Contributor) commented Nov 10, 2025

DBT Cloud Auto-Discovery Mode

Overview

This document describes the new auto-discovery mode feature for the DBT Cloud integration in DataHub. This enhancement enables automatic discovery of jobs within a specified DBT Cloud project, eliminating the need to manually specify individual job IDs.

Motivation

Previously, users had to manually specify a single job_id to ingest DBT Cloud metadata. For organizations with multiple jobs in a project, this required creating separate ingestion configurations for each job. Auto-discovery mode solves this by automatically discovering and ingesting all qualifying jobs in a project.

Features

Automatic Job Discovery

  • Discovers all jobs for a specified DBT Cloud project
  • Filters to only production environment jobs
  • Only ingests jobs that have generate_docs enabled
  • Supports optional regex-based job filtering
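The discovery filters above can be sketched in a few lines. Note that `discover_jobs`, the job dicts, and the field names here are hypothetical illustrations, not DataHub's actual implementation; in the real source the environment list would come from the DBT Cloud API.

```python
def discover_jobs(jobs, production_env_ids):
    """Keep only the jobs auto-discovery would ingest (illustrative sketch)."""
    return [
        job
        for job in jobs
        if job["environment_id"] in production_env_ids  # production environments only
        and job.get("generate_docs", False)             # docs generation must be enabled
    ]
```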

Dual Mode Operation

The integration now supports two modes:

  1. Explicit Mode (existing, backward compatible):
     • Manually specify a single job_id
     • Optionally specify run_id (defaults to the latest run)
  2. Auto-Discovery Mode (new):
     • Automatically discovers jobs for a project
     • Always uses the latest run for each job
     • Filters by production environment and the generate_docs flag
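The mode selection described above can be sketched as follows, assuming a plain dict config and the validation rules implied by this description (`resolve_mode` is a hypothetical helper, not the PR's actual code):

```python
def resolve_mode(config):
    """Pick the operating mode for a dbt Cloud source config (illustrative)."""
    auto = config.get("auto_discovery") or {}
    if auto.get("enabled"):
        # Auto-discovery always uses the latest run, so an explicit job_id
        # would be ambiguous here (assumed validation rule)
        if config.get("job_id") is not None:
            raise ValueError("job_id cannot be combined with auto_discovery")
        return "auto-discovery"
    if config.get("job_id") is None:
        raise ValueError("either job_id or auto_discovery.enabled must be set")
    return "explicit"
```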

Configuration

Basic Auto-Discovery Configuration

```yaml
source:
  type: "dbt-cloud"
  config:
    # Required fields
    account_id: 107298
    project_id: 466863
    target_platform: "snowflake"
    token: "${DBT_CLOUD_TOKEN}"

    # Auto-discovery configuration
    auto_discovery:
      enabled: true
```

Advanced Configuration with Job Filtering

```yaml
source:
  type: "dbt-cloud"
  config:
    account_id: 107298
    project_id: 466863
    target_platform: "snowflake"
    token: "${DBT_CLOUD_TOKEN}"

    # Optional: platform_instance for DBT project identification
    platform_instance: "dbt_cloud_project_466863"

    # Optional: target_platform_instance to link to specific platform instance
    target_platform_instance: "snowflake_prod"

    # Auto-discovery with job filtering
    auto_discovery:
      enabled: true
      job_id_pattern:
        allow:
          - "96.*"      # Only jobs starting with 96
          - "148094"    # Specific job
        deny:
          - ".*test.*"  # Exclude test jobs
```
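The allow/deny patterns above behave roughly like this sketch. `job_id_allowed` is a hypothetical helper, and the anchored `re.match` semantics and deny-wins precedence mirror DataHub's usual `AllowDenyPattern` convention (an assumption worth verifying against the implementation):

```python
import re

def job_id_allowed(job_id, allow, deny):
    """Return True if a job id matches the allow list and is not denied."""
    s = str(job_id)
    if any(re.match(pattern, s) for pattern in deny):
        return False  # deny patterns take precedence (assumed)
    return any(re.match(pattern, s) for pattern in allow)
```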

Explicit Mode (Backward Compatible)

```yaml
source:
  type: "dbt-cloud"
  config:
    account_id: 107298
    project_id: 466863
    target_platform: "snowflake"
    token: "${DBT_CLOUD_TOKEN}"

    # Explicit job specification
    job_id: 148094
    run_id: 12345  # Optional, defaults to latest run
```
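Run selection in explicit mode (use `run_id` if given, otherwise fall back to the job's latest run) reduces to something like the following; `resolve_run_id` and `fetch_latest_run` are hypothetical stand-ins for the DBT Cloud API lookup:

```python
def resolve_run_id(job_id, run_id, fetch_latest_run):
    # An explicit run_id wins; otherwise look up the job's latest run
    return run_id if run_id is not None else fetch_latest_run(job_id)
```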

Future Enhancements

The AutoDiscoveryConfig structure is designed to support future enhancements:

  • Account-level discovery: Discover projects and jobs across an entire account
  • Environment filtering: Support non-production environments
  • Job status filtering: Filter by job success/failure status
  • Schedule-based filtering: Filter jobs by schedule frequency
  • Tag-based filtering: Filter jobs by DBT Cloud tags

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Nov 10, 2025

codecov bot commented Nov 10, 2025

Codecov Report

❌ Patch coverage is 99.28058% with 1 line in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...tion/src/datahub/ingestion/source/dbt/dbt_cloud.py | 99.11% | 1 Missing ⚠️ |


alwaysmeticulous bot commented Nov 10, 2025

🔴 Meticulous spotted visual differences in 2 of 1067 screens tested: view and approve the detected differences.

Meticulous evaluated ~8 hours of user flows against your PR.

Last updated for commit aa64b17. This comment will update as new commits are pushed.

@sgomezvillamor (Contributor) commented:

Loved the PR description 🔝

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Nov 12, 2025
@sgomezvillamor (Contributor) left a review:

Overall LGTM

My main comments/suggestions:

  • give more visibility in the UI to some info/warn/error log traces
  • is there anything preventing auto-discovery from being enabled by default?

@datahub-cyborg datahub-cyborg bot added pending-submitter-merge and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Nov 12, 2025

```python
logger.debug(f"Fetching jobs for account {account_id} from dbt Cloud: {url}")
response = requests.get(
    url,
```
@aikido-pr-checks bot commented Nov 12, 2025

Potential user input in HTTP request may allow SSRF attack - medium severity
If an attacker can control the URL input leading into this HTTP request, the attacker might be able to perform an SSRF attack. This kind of attack is even more dangerous if the application returns the response of the request to the user, as it could allow them to retrieve information from higher-privileged services within the network (such as the metadata service, which is commonly available in cloud environments and could expose credentials).

Remediation (medium confidence): this patch mitigates the opening of potentially unsafe URLs by validating the URL scheme before the HTTP request is made.

Suggested change

```python
# Validate the scheme first so an unexpected URL raises instead of being fetched
if not url.startswith(("https://", "http://")):
    raise ValueError("Invalid URL scheme")
response = requests.get(
    url,
```

View details in Aikido Security

```python
"""
if not self._is_auto_discovery_enabled():
    return []
assert self.config.auto_discovery is not None
```
@aikido-pr-checks bot commented Nov 12, 2025

Dangerous use of assert - low severity
When running Python in production in optimized mode, assert calls are not executed. This mode is enabled by setting the PYTHONOPTIMIZE command line flag. Optimized mode is usually ON in production. Any safety check done using assert will not be executed.

Remediation: Raise an exception instead of using assert.
View details in Aikido Security

codecov bot commented Nov 12, 2025

Bundle Report

Changes will increase total bundle size by 5.22kB (0.02%) ⬆️. This is within the configured threshold ✅

Detailed changes

| Bundle name | Size | Change |
| --- | --- | --- |
| datahub-react-web-esm | 28.64MB | 5.22kB (0.02%) ⬆️ |

Affected assets:

| Asset Name | Size Change | Total Size | Change (%) |
| --- | --- | --- | --- |
| assets/index-*.js | 5.22kB | 19.01MB | 0.03% |

```python
            return [], {}
        run_id = None  # Always use latest run in auto-discovery
    else:
        assert self.config.job_id is not None
```
@aikido-pr-checks bot commented Nov 12, 2025

Dangerous use of assert - low severity
Same finding as above: in optimized mode (PYTHONOPTIMIZE), assert calls are not executed, so any safety check done with assert is skipped. Remediation: raise an exception instead of using assert.
View details in Aikido Security

- Updated DBTCloudConfig to include optional project_id and job_id patterns for auto-discovery.
- Added validation to ensure project_id is provided when job_id is specified.
- Implemented job fetching and filtering logic in DBTCloudSource to support both single job ingestion and auto-discovery.
- Enhanced error handling and logging for job fetching processes.
…ionality

- Added new models for DBTCloud API responses to facilitate data validation.
- Implemented auto-discovery configuration in DBTCloudConfig to streamline job ingestion.
- Enhanced DBTCloudSource to support both explicit job ingestion and auto-discovery, improving flexibility in job management.
- Updated logging and error handling for better traceability during job fetching processes.
…ion process

- Simplified the validation logic in DBTCloudConfig for auto-discovery and explicit mode.
- Introduced a helper method to check auto-discovery status in DBTCloudSource.
- Enhanced job fetching and filtering logic to improve clarity and maintainability.
- Updated logging messages for better context during job discovery and ingestion.
…ery functionality

- Introduced comprehensive integration tests for the dbt Cloud auto-discovery process, covering end-to-end workflows, error handling, and job ingestion scenarios.
- Added unit tests for configuration validation, API response parsing, and job/environment filtering logic in dbt Cloud.
- Enhanced test coverage for both explicit and auto-discovery modes, ensuring consistent behavior and error handling across different configurations.
…modes

- Introduced two operating modes for dbt Cloud source: Explicit Mode for single job ingestion and Auto-Discovery Mode for automatic metadata retrieval from all eligible jobs.
- Updated documentation to reflect new modes and their configurations.
- Enhanced logging to provide better insights during job processing and filtering.
- Added new metrics to track jobs retrieved and processed, improving reporting capabilities.
@askumar27 askumar27 force-pushed the feature/acr-6674/bulk-dbt-job-ingestion branch from aa64b17 to bc31fd3 Compare November 13, 2025 19:21
@askumar27 askumar27 merged commit 2f31635 into master Nov 13, 2025
63 checks passed
@askumar27 askumar27 deleted the feature/acr-6674/bulk-dbt-job-ingestion branch November 13, 2025 19:55