Skip to content

Conversation

@abdullahtariqq
Copy link
Contributor

@abdullahtariqq abdullahtariqq commented Sep 5, 2025

Overview

Adds advanced filtering for dbt nodes based on their materialized database location to enable catalog consistency across multi-team dbt projects.

Changes

  • Added MaterializedNodePatternConfig with database_pattern, schema_pattern, table_pattern fields
  • Clean separation: node_name_pattern (internal dbt refs) vs source_pattern (external table location)
  • Hierarchical filtering: databaseschematable

Use Case

Problem: Multi-database dbt projects (operational_db.*, analytics_db.*, ...) need consistent catalog filtering
Solution: Match source system ingestion patterns for clean, focused data catalog

Example:

  materialized_node_pattern:
    database_pattern:
      allow: ["analytics_db"]
    schema_pattern:
      allow: ["analytics_db\\.marts.*"]
    table_pattern:
      allow: ["fct_.*", "stg_.*"]
      deny: ["int_.*", "tmp_.*"]

Benefits

  • Multi-team separation by database
  • Catalog consistency with source ingestion
  • Backward compatible - existing configs unchanged
  • Performance optimized with early-exit logic

Enables fine-grained dbt ingestion control for complex organizational data architectures.

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Sep 5, 2025
- Add database_pattern and schema_pattern config fields with AllowDenyPattern support
- Enhance _is_allowed_node() to filter nodes by database and schema in addition to node names
- Add comprehensive integration tests for new filtering capabilities
- Support combined filtering patterns for fine-grained dbt ingestion control
@abdullahtariqq abdullahtariqq force-pushed the feat-dbt-improve-filtering branch from b7cbd87 to d702148 Compare September 5, 2025 15:00
@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Sep 5, 2025
@codecov
Copy link

codecov bot commented Sep 5, 2025

Bundle Report

Changes will increase total bundle size by 1.43kB (0.0%) ⬆️. This is within the configured threshold ✅

Detailed changes
Bundle name Size Change
datahub-react-web-esm 28.57MB 1.43kB (0.0%) ⬆️

Affected Assets, Files, and Routes:

view changes for bundle: datahub-react-web-esm

Assets Changed:

Asset Name Size Change Total Size Change (%)
assets/index-*.js 1.43kB 18.91MB 0.01%

@codecov
Copy link

codecov bot commented Sep 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Sep 8, 2025
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Sep 8, 2025
@skrydal
Copy link
Collaborator

skrydal commented Sep 8, 2025

Hello @abdullahtariqq , thank you for sharing this contribution. I think the feature is a bit complex, due to the details of dbt references handling. To be specific, node.dbt_name is completely independent of node.schema, node.database or node.name - as the former is an internal dbt reference, while others might refer to an external table.
Therefore, I would introduce a clear separation between those two checks, to avoid confusion, for example we could have separate config field: source_pattern, it should contain trifecta "table", "schema", "database", and follow DataHub-wide pattern matching:

  1. database pattern is matched against database name
  2. schema pattern is matched against f"{database}.{schema}" (fully qualified name)
  3. table pattern is matched against f"{database}.{schema}.{table}" (fully qualified name)

There is an open question whether it should be done against nodes of type source or all external references, what do you think about it?

The _is_node_allowed function would work the following way:

  1. if dbt_name is not allowed, return False
  2. if node is of type source, then match node.name, node.schema, node.database against the patterns, as specified above

Is this aligned with the use-case you had in mind when creating it?

Of course golden files need to be aligned properly.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Sep 8, 2025
@abdullahtariqq
Copy link
Contributor Author

abdullahtariqq commented Sep 10, 2025

@skrydal Thank you for the thoughtful feedback! You're absolutely right about the architectural separation.

However, I'd like to clarify the actual use case to ensure we design the right solution:

The filtering need is for materialized nodes (tables/views) rather than just sources. Here's the scenario:

  • Multiple teams work within the same dbt project across different databases/schemas
  • Team A: operational_db.orders.*, operational_db.inventory.*
  • Team B: analytics_db.marts.*, analytics_db.reporting.*
  • Some databases contain operational data, others analytical data

The goal is catalog consistency:

  • BigQuery/Snowflake/Redshift or other source ingestion filters to analytics_db.* only
  • dbt ingestion should filter to the same analytics_db.* to maintain a clean, consistent catalog view
  • This prevents operational tables from appearing in the data catalog when they're not meant for analytical consumption

So the filtering should apply to all materialized nodes (table, view, etc.) based on where they're physically materialized, not just sources.

Proposed Configuration Example

  source_pattern:
    database_pattern:
      allow: ["analytics_db", "marts_db"]
    schema_pattern:
      allow: ["analytics_db\\.marts.*", "analytics_db\\.reporting.*"]
    table_pattern:
      allow: ["analytics_db\\.marts\\.customer_.*"]

Benefits

  • Internal vs external reference separation
  • Matches source system ingestion patterns
  • Legacy patterns still supported
  • Database-level separation for multi-team projects

Does this align with your architectural vision? I'm happy to adjust the approach based on any additional feedback!

@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Sep 10, 2025
@abdullahtariqq abdullahtariqq changed the title feat(dbt): add database and schema pattern filtering support feat(dbt): add source_pattern filtering for database and schema patterns Sep 10, 2025
@abdullahtariqq abdullahtariqq force-pushed the feat-dbt-improve-filtering branch from 45b970e to 43d7772 Compare September 10, 2025 22:59
@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Sep 12, 2025
@datahub-project datahub-project deleted a comment from Newstudo Sep 12, 2025
@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Sep 12, 2025
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Sep 12, 2025
@abdullahtariqq abdullahtariqq changed the title feat(dbt): add source_pattern filtering for database and schema patterns feat(dbt): add source_pattern(s) filtering for materialized nodes Sep 12, 2025
@abdullahtariqq
Copy link
Contributor Author

abdullahtariqq commented Sep 13, 2025

@skrydal thank you for the feedback. materialized_nodes_pattern is a good alternative and conveys proper meaning too.

@abdullahtariqq abdullahtariqq changed the title feat(dbt): add source_pattern(s) filtering for materialized nodes feat(dbt): add materialized_node_pattern(s) filtering for materialized nodes Sep 13, 2025
@abdullahtariqq abdullahtariqq changed the title feat(dbt): add materialized_node_pattern(s) filtering for materialized nodes feat(dbt): add filtering for materialized nodes based on their physical location Sep 13, 2025
@abdullahtariqq
Copy link
Contributor Author

@skrydal Thank you for the thorough review. Could you please approve it in case everything looks good?

Copy link
Collaborator

@skrydal skrydal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution to the project.

@skrydal skrydal enabled auto-merge (squash) September 17, 2025 12:04
@datahub-cyborg datahub-cyborg bot added merge-pending-ci A PR that has passed review and should be merged once CI is green. and removed needs-review Label for PRs that need review from a maintainer. labels Sep 17, 2025
@skrydal skrydal merged commit acffdce into datahub-project:master Sep 17, 2025
55 checks passed
@skrydal
Copy link
Collaborator

skrydal commented Sep 18, 2025

FYI @abdullahtariqq this change has been released in acryl-datahub version 1.2.0.10rc2.

yoonhyejin pushed a commit that referenced this pull request Oct 9, 2025
…al location (#14689)

Co-authored-by: Abdullah Tariq <[email protected]>
Co-authored-by: skrydal <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata merge-pending-ci A PR that has passed review and should be merged once CI is green.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants