feat(ingest/airflow): add teradata operator support for Airflow plugin #15418
Conversation
Update: Added new tests to improve test coverage.
Reason for adding a custom extractor rather than falling back on GenericSqlExtractor: the generic extractor assumes three-tier naming, and default_schema=None must be explicitly enforced.
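The three-tier pitfall can be sketched as follows. This helper is purely illustrative (its name and behavior are assumptions, not DataHub's actual API); it only shows why a two-tier platform like Teradata must keep `default_schema=None`:

```python
# Illustrative only: why a two-tier platform like Teradata needs
# default_schema=None. Function and names are hypothetical, not DataHub's API.
from typing import Optional


def qualify_table(name: str, default_schema: Optional[str]) -> str:
    """Qualify a dotted table reference, padding with a default schema if set."""
    parts = name.split(".")
    if len(parts) == 2 and default_schema is not None:
        # A three-tier extractor pads "db.table" into "db.<schema>.table",
        # which is wrong for Teradata's database.table convention.
        return f"{parts[0]}.{default_schema}.{parts[1]}"
    return name  # with default_schema=None the 2-tier name is kept as-is


print(qualify_table("sales_db.customers", "public"))  # sales_db.public.customers (wrong)
print(qualify_table("sales_db.customers", None))      # sales_db.customers (correct)
```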
I would expect this DAG integration test to produce a golden file in
https://github.com/btkcodedev/datahub/tree/btkcodedev/teradataOperatorExtractor/metadata-ingestion-modules/airflow-plugin/tests/integration/goldens
Perhaps the test case in https://github.com/btkcodedev/datahub/blob/btkcodedev/teradataOperatorExtractor/metadata-ingestion-modules/airflow-plugin/tests/integration/test_plugin.py was missed?
Added the integration test and generated the golden file.
Uses the extra param set to ANSI, as declared in the docs.
sgomezvillamor
left a comment
Overall looks very consistent and great coverage in the unit tests. Kudos!
Likely missing a test case for the DAG integration tests. Once that is addressed, this can be merged.
sgomezvillamor
left a comment
(accidentally pressed wrong button before, sorry)
sgomezvillamor
left a comment
Thanks for the contribution and for being so responsive when addressing the comments!
You can merge as soon as the PR passes CI checks.


Closes #12914
End-to-end Analysis:
Requirements:
One of the clients needs support for the Teradata operator in DataHub's Airflow plugin, and provided a sample DAG and an Airflow log.
Analysis of the requirement:
-> DataHub supports lineage extraction from several databases, but Teradata is missing from the list.
-> Teradata is a popular enterprise data warehouse, and users running Teradata queries in Airflow couldn't get automatic lineage tracking.
Documentation Analysis
Reviewed Airflow documentation for the Teradata operator provider
https://airflow.apache.org/docs/apache-airflow-providers-teradata/stable/_api/airflow/providers/teradata/operators/teradata/index.html
Reviewed the Teradata provider on PyPI: https://pypi.org/project/apache-airflow-providers-teradata/
Codebase Analysis and Action Plan
After searching the codebase for the 'airflow.providers' keyword, it is clear that the Airflow plugin already exists (as the client mentioned) and has other operators that inherit from BaseExtractor.
Having an existing operator as a reference is very useful for building a similar extractor for Teradata.
metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/_extractors.py is the main file that needs to change. Teradata is very similar to Athena but uses 2-tier naming, so the default schema and database should be given as None and extracted from the SQL itself.
A mock function could be used for calling execute in the DAG, similar to the Snowflake operator.
Tests: basic SQL extraction test, platform test, 2-tier naming test.
Main Changes:
[_extractors.py]
Implemented the TeradataOperatorExtractor class that extracts lineage from Teradata SQL queries using a 2-tier naming convention.
[test_teradata_extractor.py]
Created 4 unit tests validating SQL extraction, platform configuration, and proper handling of Teradata's 2-tier architecture.
[airflow.md]
Added TeradataOperator to the list of supported operators in the documentation.
[teradata_operator.py]
Created an integration test DAG with a sample Teradata transform task following standard patterns.
[setup.py]
Added apache-airflow-providers-teradata to integration test requirements.
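The unit tests described above might look roughly like this. The helper and test names are illustrative sketches of the covered areas (platform configuration and 2-tier name handling), not the PR's actual test code:

```python
# Illustrative, self-contained sketch of the test areas described above;
# not the PR's real tests. Helper and names are assumptions.
import unittest

PLATFORM = "teradata"


def split_teradata_name(name: str):
    """Split a 2-tier Teradata reference into (database, table)."""
    parts = name.split(".")
    if len(parts) == 2:
        return parts[0], parts[1]
    return None, name  # bare table: database unknown at parse time


class TeradataExtractorSketchTests(unittest.TestCase):
    def test_platform(self):
        self.assertEqual(PLATFORM, "teradata")

    def test_two_tier_qualified_name(self):
        self.assertEqual(
            split_teradata_name("sales_db.customers"), ("sales_db", "customers")
        )

    def test_two_tier_bare_name(self):
        self.assertEqual(split_teradata_name("customers"), (None, "customers"))


if __name__ == "__main__":
    unittest.main()
```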
Key Learnings for me
Lineage tracks how data flows and is transformed.
The extractor reads the sql field from the operator.
Final output: [customers table] → [TeradataOperator task]
Unit test results:
