@btkcodedev btkcodedev commented Nov 26, 2025

Closes #12914

End-to-end Analysis:

Requirements:

A client needs support for the TeradataOperator in DataHub's Airflow plugin and has provided a sample DAG and an Airflow log.

Analysis of the requirement:

-> DataHub supports lineage extraction from several databases, but Teradata is missing from the list.
-> Teradata is a popular enterprise data warehouse, and users running Teradata queries in Airflow cannot get automatic lineage tracking.

Documentation Analysis

  1. Reviewed Airflow documentation for the Teradata operator provider
    https://airflow.apache.org/docs/apache-airflow-providers-teradata/stable/_api/airflow/providers/teradata/operators/teradata/index.html

  2. Reviewed Teradata provider from pypi, https://pypi.org/project/apache-airflow-providers-teradata/

Codebase Analysis and Action Plan

  1. Searching the codebase for 'airflow.providers' shows that the Airflow plugin already exists (as the client mentioned) and already contains extractor classes for other operators, all inheriting from BaseExtractor

  2. An existing extractor is a useful template for writing a similar one for Teradata
    The main file to change is _extractors.py in the Airflow plugin:
    metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/_extractors.py

  3. Teradata is very similar to Athena but uses 2-tier naming, so the database can be read from the SQL itself and the default database and schema can be passed as None

  4. A mock can stand in for the execute call in the DAG, as is done for the Snowflake operator

  5. Tests to add: basic SQL extraction, platform, and 2-tier naming
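
The mocking idea in step 4 can be sketched roughly as follows. This is a minimal illustration, not the DAG code from this PR: `FakeTeradataOperator` and `run_task_without_database` are hypothetical stand-ins so the snippet runs without Airflow or a Teradata instance. It only shows that patching `execute` lets the task "run" while the `sql` field stays available for lineage extraction.

```python
from unittest import mock


class FakeTeradataOperator:
    """Hypothetical stand-in for an Airflow SQL operator (not the real TeradataOperator)."""

    def __init__(self, task_id: str, sql: str) -> None:
        self.task_id = task_id
        self.sql = sql

    def execute(self, context: dict) -> None:
        # A real operator would open a database connection here.
        raise ConnectionError("no Teradata instance available in tests")


def run_task_without_database(operator: FakeTeradataOperator) -> str:
    """Call execute() with the method mocked out, then return the SQL the task carries."""
    with mock.patch.object(FakeTeradataOperator, "execute", return_value=None) as fake_execute:
        operator.execute(context={})
        fake_execute.assert_called_once()
    return operator.sql


task = FakeTeradataOperator(
    task_id="transform",
    sql="INSERT INTO sales_db.daily SELECT * FROM sales_db.raw",
)
print(run_task_without_database(task))
```

The patch is scoped to the `with` block, so the original `execute` is restored afterwards.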

Main Changes:

  1. [_extractors.py]
    Implemented the TeradataOperatorExtractor class that extracts lineage from Teradata SQL queries using a 2-tier naming convention.

  2. [test_teradata_extractor.py]
    Created 4 unit tests validating SQL extraction, platform configuration, and proper handling of Teradata's 2-tier architecture.

  3. [airflow.md]
    Added TeradataOperator to the list of supported operators in the documentation.

  4. [teradata_operator.py]
    Created an integration test DAG with a sample Teradata transform task following standard patterns.

  5. [setup.py]
    Added apache-airflow-providers-teradata to integration test requirements.

Key Learnings for me

  1. Learned what lineage means: it tracks how data flows and how it is transformed
  2. Understood the difference between 2-tier naming (database.table) and 3-tier naming (database.schema.table)
  3. Understood the difference between a plugin and an operator
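
To make the naming difference concrete, here is a small sketch of how the same dotted name reads under each convention. `TableRef` and `split_qualified_name` are hypothetical helpers written for this illustration; they are not part of DataHub, and quoted identifiers that contain dots are not handled.

```python
from typing import NamedTuple, Optional


class TableRef(NamedTuple):
    database: Optional[str]
    schema: Optional[str]
    table: str


def split_qualified_name(name: str, tiers: int) -> TableRef:
    """Split a dotted table name under 2-tier or 3-tier naming."""
    if tiers not in (2, 3):
        raise ValueError("tiers must be 2 or 3")
    parts = name.split(".")
    if not 1 <= len(parts) <= tiers:
        raise ValueError(f"{name!r} has too many parts for {tiers}-tier naming")
    table = parts[-1]
    if tiers == 2:
        # 2-tier: [database.]table, with no schema layer
        database = parts[0] if len(parts) == 2 else None
        return TableRef(database, None, table)
    # 3-tier: [[database.]schema.]table
    schema = parts[-2] if len(parts) >= 2 else None
    database = parts[-3] if len(parts) == 3 else None
    return TableRef(database, schema, table)


print(split_qualified_name("sales_db.orders", tiers=2))
print(split_qualified_name("sales_db.orders", tiers=3))
```

Under 2-tier naming `sales_db` is the database; under 3-tier naming the same token is read as a schema.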

Final output:

  1. DataHub connects to Airflow
  2. Finds a DAG with a TeradataOperator task
  3. Checks the airflow plugin
  4. Finds: TeradataOperator → use DefaultSqlExtractor
  5. Reads the sql field from the operator
  6. DefaultSqlExtractor parses: "SELECT * FROM customers"
  7. Extracts: upstream = [customers]
  8. Creates lineage in DataHub:
    [customers table] → [TeradataOperator task]
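
The flow above can be sketched in a few lines. Everything here is a simplified assumption for illustration: `EXTRACTORS`, `FakeTask`, and `toy_sql_extractor` are hypothetical names, and the regex is a toy stand-in for a real SQL parser, not what the plugin actually uses.

```python
import re
from typing import Callable, Dict, List


def toy_sql_extractor(sql: str) -> List[str]:
    """Toy stand-in for a SQL parser: collect table names that follow FROM or JOIN."""
    return re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", sql, flags=re.IGNORECASE)


# Hypothetical registry mapping operator class names to extractor functions.
EXTRACTORS: Dict[str, Callable[[str], List[str]]] = {
    "TeradataOperator": toy_sql_extractor,
}


class FakeTask:
    """Hypothetical task record holding only the fields used in this sketch."""

    def __init__(self, operator_name: str, task_id: str, sql: str) -> None:
        self.operator_name = operator_name
        self.task_id = task_id
        self.sql = sql


def lineage_edges(task: FakeTask) -> List[str]:
    """Look up the extractor for the task's operator and build upstream edges."""
    extractor = EXTRACTORS.get(task.operator_name)
    if extractor is None:
        return []  # no extractor registered for this operator, so no lineage
    return [f"{table} -> {task.task_id}" for table in extractor(task.sql)]


task = FakeTask("TeradataOperator", "transform", "SELECT * FROM customers")
print(lineage_edges(task))
```

An operator with no registered extractor yields no edges, which mirrors the situation for Teradata before this change.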

Unit test results: [screenshot]

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata docs Issues and Improvements to docs community-contribution PR or Issue raised by member(s) of DataHub Community labels Nov 26, 2025
@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Nov 26, 2025
@btkcodedev btkcodedev changed the title from "feat: add teradata operator support for Airflow plugin" to "feat(ingest/airflow): add teradata operator support for Airflow plugin" Nov 26, 2025
@btkcodedev
Author

Update: Added new tests for improving test coverage.

Reason for adding a custom extractor rather than falling back on GenericSqlExtractor

  1. Two-tier architecture: Teradata uses database.table, not database.schema.table
    According to the Teradata SQL Reference:
    -> URL: docs.teradata.com
    -> Key point: CREATE TABLE database_name.table_name
    -> Evidence: all object references use the database.table format, with no schema layer in between

The generic extractor assumes three-tier naming, so default_schema=None must be explicitly enforced.
The Athena operator follows a similar pattern of non-standard naming. Thus, a custom extractor is necessary.
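
As a rough illustration of why the defaults matter, consider a toy qualifier that fills in missing leading name parts from configured defaults. This is an assumed simplification, not DataHub's actual SQL parsing logic; `qualify` and the default values `warehouse` and `public` are hypothetical.

```python
from typing import Optional


def qualify(name: str, default_database: Optional[str], default_schema: Optional[str]) -> str:
    """Toy qualifier: prepend whichever defaults are set to fill missing leading parts."""
    parts = name.split(".")
    if len(parts) == 1 and default_schema is not None:
        parts.insert(0, default_schema)
    if len(parts) == 2 and default_database is not None:
        parts.insert(0, default_database)
    return ".".join(parts)


# With both defaults set to None, a Teradata reference is left exactly as written.
print(qualify("sales_db.orders", default_database=None, default_schema=None))

# With three-tier style defaults, the same reference gains a third part
# that has no meaning in Teradata's database.table model.
print(qualify("sales_db.orders", default_database="warehouse", default_schema="public"))
```

Passing None for both defaults keeps the names in the two-part form that Teradata itself uses.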

@btkcodedev
Author

Updated test results: [screenshot]

Author

Added an integration test and generated the golden file.
It uses ANSI as the extra param, as declared in the docs.

Author

Integration test results:
pytest 'tests/integration/test_plugin.py::test_airflow_plugin[v2_teradata_operator]' -v
[screenshot]

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Nov 27, 2025
Contributor

@sgomezvillamor sgomezvillamor left a comment

Overall this looks very consistent, with great coverage in the unit tests. Kudos!

A test case for the DAG integration tests is likely missing. Once that is addressed, this can be merged.

Contributor

@sgomezvillamor sgomezvillamor left a comment

(accidentally pressed wrong button before, sorry)

@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Nov 28, 2025
Contributor

@sgomezvillamor sgomezvillamor left a comment

Thanks for the contrib and for being so responsive when addressing the comments!
You can merge as soon as the PR passes CI checks.

@datahub-cyborg datahub-cyborg bot removed the needs-review Label for PRs that need review from a maintainer. label Nov 29, 2025
@datahub-cyborg datahub-cyborg bot added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Nov 29, 2025
Development

Successfully merging this pull request may close these issues.

FEATURE Request: Add Teradata as a supported operator for Airflow Plugin