[Reconcile] Adding Teradata connector #1275

Draft
wants to merge 459 commits into main

Conversation

srikanth-db

No description provided.

ganeshdogiparthi-db and others added 30 commits May 31, 2024 11:55
* Removing quotes from the exception message
* During an exception, no output is given for metrics, not even default
values.
Creates coverage tests for TSQL, in the same manner as the coverage for
Snowflake.

Coverage is a non-blocking test suite that allows us to measure how far
along the path of syntax coverage we are, and how much of the syntax we
are able to convert to DBSQL IR.

We capture parsing errors and check the IR to see whether it contains any
UnresolvedXXX nodes, which flag constructs that we do not currently cover.

Closes: databrickslabs#406

Co-authored-by: Valentin Kasas <[email protected]>
…atabrickslabs#402)

We remove explicit specification of the builtIn functions, leaving only
those that definitely require special syntax.

Progresses: databrickslabs#275

---------

Co-authored-by: Valentin Kasas <[email protected]>
…ckslabs#415)

As we now have translation capability for them, we start by examining the
TSQL functions and providing the equivalent Databricks SQL for each.

- Some functions, mainly the common ANSI defined SQL functions are
directly equivalent.
  - Some have the same name but mean different things.
  - Some functions have equivalent ways to achieve the same thing.
- Some functions will likely be unsupportable without changes to
Databricks SQL.
- Some functions may be unsupportable due to fundamental differences
between TSQL and Databricks SQL engines/models.
  
Files are supplied under the tests/resources/functional/tsql directory.
The TSQL and the Databricks SQL equivalent are provided, and the conversion is
documented before the TSQL is introduced, as we may be able to automate
documentation to a certain degree using this commentary.

---------

Co-authored-by: Valentin Kasas <[email protected]>
…bs#407)

First batch of acceptance-test-based improvements targeting missing
implementations (materialized as NullPointerException during acceptance
tests).

The making of this change could serve as a template for future
acceptance-test-based improvements:

- focus on a specific type of failure in the acceptance tests
- take measures to fix the error (this can incur changes in the relevant grammar)
- add unit tests to fully cover the changes introduced in the previous step

Analytical windowing functions were specified in the grammar instead of
our standard function lookup table.

These functions are now removed from the parser grammar and specified in
the standard tables.

Additionally, the WITHIN GROUP syntax is supported using an additional
`expression` rule alt. This makes function handling orthogonal for all
but the so-called `builtinFunctions`.

We now support analytical aggregate functions, as evidenced by the
following examples:

```sql
... FIRST_VALUE(Salary) OVER (PARTITION BY DepartmentID ORDER BY Salary DESC)
``` 

```sql
... PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Salary) OVER (PARTITION BY DepartmentID)
```

Progresses: databrickslabs#357 
Progresses: databrickslabs#275
…bs#420)

More functions are added to the set of coverage tests for TSQL
conversions
…atabrickslabs#414)

Some functions must be translated from their TSQL or Snowflake versions into
the equivalent IR for Databricks SQL. In some cases a function must be
translated in one dialect, say TSQL, but is directly equivalent in another,
say Snowflake.

Here we upgrade the FunctionBuilder system to be dialect aware, and provide
a ConversionStrategy system that allows for any type of conversion, from a
simple name translation to a more complicated specific IR representation
when there is no direct equivalent.

For example, the TSQL code
```tsql
SELECT ISNULL(x, 0)
```

should translate to:

```sql
SELECT IFNULL(x, 0)
```

in Databricks SQL. In Snowflake SQL, however:

```snowflake
SELECT ISNULL(col)
```

is directly equivalent to Databricks SQL and needs no conversion.

---------

Co-authored-by: Valentin Kasas <[email protected]>
With Python 3.11, when trying to initialize the config classes, we were
getting `ValueError: mutable default <class
'databricks.labs.remorph.reconcile.recon_config.StatusOutput'> for field
status is not allowed: use default_factory`.

This change addresses the error by using `default_factory`.
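
A minimal sketch of the fix, with illustrative class and field names rather
than the actual `recon_config` schema:

```python
from dataclasses import dataclass, field


@dataclass
class StatusOutput:
    # Hypothetical fields; the real recon_config.StatusOutput may differ.
    row: bool | None = None
    column: bool | None = None


@dataclass
class ReconcileOutput:
    # Python 3.11 rejects `status: StatusOutput = StatusOutput()` because the
    # default is a mutable (unhashable) instance; default_factory constructs a
    # fresh StatusOutput for every ReconcileOutput instead.
    status: StatusOutput = field(default_factory=StatusOutput)
```
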
An internal cleanup of grammar and keywords:

  - Removes superfluous comments
  - Corrects hacks with "" strings
  - Expands keywords - for instance SELECT * FROM TABLE now works
  - Removes unused accumulated rules from TSQL
Here we relocate the tests so that they do not occupy one directory per
test type, and we add coverage for more TSQL builtin functions.

---------

Co-authored-by: Valentin Kasas <[email protected]>
We implement dataType conversions such that we can implement translation
of the following functions and pseudo-functions:

```tsql
NEXT VALUE FOR sequence
```
```tsql
CAST(col AS sometype)
```
```tsql
TRY_CAST(col AS sometype)
```

```tsql
JSON_ARRAY(x, 42, y)
JSON_ARRAY(x, 42, y NULL ON NULL)
JSON_ARRAY(x, 42, y ABSENT ON NULL)
```
```tsql
JSON_OBJECT('one': 1, 'two': 2, 'three': 3 ABSENT ON NULL)
JSON_OBJECT('a': a, 'b': b, 'c': c NULL ON NULL)
```

Progresses: databrickslabs#275

---------

Co-authored-by: Serge Smertin <[email protected]>
Co-authored-by: Valentin Kasas <[email protected]>
We implement CTE support for TSQL as in the following example:

```tsql
WITH cteTable (col1, col2, col3count)
AS
(
    SELECT col1, fred, COUNT(OrderDate) AS counter
    FROM Table1
    WHERE fred IS NOT NULL
)
SELECT col2, col1, col3count
FROM cteTable
```
Add entries in `SnowflakeFunctionBuilder` for functions that were used
in the acceptance tests but not "defined" yet.

Note that the grammar around `function_call` has been reworked; this
work is still in progress. Future work will include:
- completing the definition of `builtin_function` (functions that have
special syntax)
- improving implementation around `aggregate_function` and
`ranking_windowed_function`
Removes superfluous alts defined in the select list element rule in the TSQL
grammar. Also removes the accepts on it, which were rightly not used.

Note that the rule for parsing udt column declarations may still be
relevant for `CREATE TYPE`, hence it is not removed.
## Changes

Added CLI command `databricks labs remorph reconcile` to trigger the
`Data Reconciliation` process.

Logic:
- It loads the `reconcile.yml` config and `recon_config.json` from the
Workspace. If those are not present, it prompts the user to re-install
`reconcile` module and exits the command.
- It triggers the `Remorph_Reconciliation_Job` based on the Job ID
stored in the `reconcile.yml`.
- This feature is intended to enhance the tool's capabilities by
simplifying the `reconcile` execution process.

- The following are the prerequisites:
* Use `databricks labs remorph install` to configure the `reconcile`
module. It generates the `reconcile.yml` on the Databricks Workspace.
* Use `databricks labs remorph generate-recon-config` to generate the
`recon_config_<SOURCE>.json` and store it on the Databricks Workspace.
[PR#249 In
Progress](databrickslabs#249)

- Once the prerequisites are met, the user can run the `databricks labs
remorph reconcile` CLI command to start the `Data Reconciliation` process.
- The `reconcile` CLI command invokes the reconciliation on Databricks
Workflows and returns the Job Run URL for further tracking.

### Linked issues
Resolves 
databrickslabs#249

### Functionality
- [x]  added a new CLI command

### Tests
- [x]  manually tested
- [x]  added unit tests

### TODO

- [ ]  add integration tests
- [ ]  verify on staging environment (attach screenshot )

---------

Co-authored-by: Bishwajit <[email protected]>
Co-authored-by: Ganesh Dogiparthi <[email protected]>
Co-authored-by: SundarShankar89 <[email protected]>
- Added README.md for Reconciliation process

---------

Co-authored-by: Ganesh Dogiparthi <[email protected]>
Co-authored-by: ganeshdogiparthi-db <[email protected]>
Co-authored-by: SundarShankar89 <[email protected]>
Change rule names and Context names to camelCase from snake_case.

While this appears to be a huge PR, it is strictly confined to name
changes, so a review should not take long.
Here we improve the SELECT statement grammar such that it resembles the
Snowflake grammar, without trying to force semantic checks into the
syntax. The Scala code to build the IR is essentially reused, with a few
changes to accommodate the differences between TSQL and Snowflake.
   
   - TODO markers in TSqlRelationBuilder show what is left to implement

---------

Co-authored-by: Valentin Kasas <[email protected]>
jimidle and others added 18 commits August 30, 2024 16:00
NB: The previous PR was merged before I was totally finished.

Normalizes parameter generation to always use ${} for clarity and to conform
to the Databricks notebook examples for widgets.

Adds an additional coverage test for variable refs in strings.
…slabs#882)

Adds a starting point for useful ANTLR utilities.

- Given an ANTLR ParserRuleContext, retrieve the original text from the
input source (a sketch of the technique follows).
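
As a rough sketch of that technique (shown here with the ANTLR Python runtime;
the repo's utility itself is Scala, and the helper name is illustrative):

```python
from antlr4 import ParserRuleContext


def original_text(ctx: ParserRuleContext) -> str:
    # ctx.getText() concatenates token text and loses the whitespace between
    # tokens; slicing the underlying character stream by the context's
    # start/stop character indices preserves the source exactly as written.
    stream = ctx.start.getInputStream()
    return stream.getText(ctx.start.start, ctx.stop.stop)
```
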
1. Updated the Spark setup script to check whether the Spark gzip file already exists.

1. Refactored the script to remove the warnings:
    * `Double quote to prevent globbing and word splitting`
    * `Use 'cd ... || exit' or 'cd ... || return' in case cd fails.`
    * `Consider using 'grep -c' instead of 'grep|wc -l'.`
    * `Usage: sleep seconds`, converted 2m to 120 seconds

1. Tested the following scenarios:
    1. Scenario: Extracted Spark folder is already present.
       Outcome: Directly starts the Spark server using `sbin/start-connect-server.sh`.
    1. Scenario: Extracted Spark folder **is not present**, and the gzip file (`spark-<VERSION>.tgz`) is present.
       Outcome: Extracts the gzip file and starts the Spark server.
    1. Scenario: Extracted Spark folder **is not present**, and the gzip file **is not present**.
       Outcome: Downloads, extracts, and starts the Spark server.
…te task` (databrickslabs#864)

`alter session | stream...` `create stream` `create task` and `execute
task`
now it generates either Project or Deduplicates.

---------

Co-authored-by: Serge Smertin <[email protected]>
@sundarshankar89 changed the title from "adding Teradata connector" to "[Reconcile] Adding Teradata connector" on Dec 4, 2024
@@ -821,14 +823,14 @@ def get_record_count(self, table_conf: Table, report_type: str) -> ReconcileReco
table=table_conf.source_name,
query=source_count_query,
options=None,
).collect()[0]["count"]
).collect()[0][0]

Suggested change
).collect()[0][0]
).first()

target_count = self._target.read_data(
catalog=self._database_config.target_catalog,
schema=self._database_config.target_schema,
table=table_conf.target_name,
query=target_count_query,
options=None,
).collect()[0]["count"]
).collect()[0][0]

Suggested change
).collect()[0][0]
).first()
source_count = int(source_count_row[0]) if source_count_row is not None else 0
target_count = int(target_count_row[0]) if target_count_row is not None else 0
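
For context, a minimal PySpark sketch of the pattern this suggestion leads to
(the query and table name are placeholders, not the connector's actual calls):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder for the count query that read_data() would issue.
count_df = spark.sql("SELECT COUNT(*) AS count FROM some_table")

# first() returns a Row or None, so an empty result cannot raise the
# IndexError that collect()[0][0] would.
source_count_row = count_df.first()
source_count = int(source_count_row[0]) if source_count_row is not None else 0
```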

@@ -821,14 +823,14 @@ def get_record_count(self, table_conf: Table, report_type: str) -> ReconcileReco
table=table_conf.source_name,
query=source_count_query,
options=None,
).collect()[0]["count"]
).collect()[0][0]
target_count = self._target.read_data(

Suggested change
target_count = self._target.read_data(
target_count_row = self._target.read_data(

@@ -166,7 +172,7 @@ def _get_mismatch_df(source: DataFrame, target: DataFrame, key_columns: list[str
select_expr = key_cols + source_aliased + target_aliased + match_expr

filter_columns = " and ".join([column + "_match" for column in column_list])
filter_expr = ~expr(filter_columns)
filter_expr = ~expr(filter_columns) if filter_columns else lit(True)

mismatch_df = (

Suggested change
mismatch_df = (
mismatch_df = (
source.alias('base')
.join(other=target.alias('compare'), on=key_columns, how="inner")
.select(*select_expr)
)
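
For context, a minimal self-contained PySpark sketch of the guarded mismatch
filter above (the data and column names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr, lit

spark = SparkSession.builder.getOrCreate()

# Illustrative frames; the real ones come from read_data() on source/target.
source = spark.createDataFrame([(1, 10), (2, 20)], ["id", "amount"])
target = spark.createDataFrame([(1, 10), (2, 25)], ["id", "amount"])
key_columns = ["id"]
column_list = ["amount"]

select_expr = [col("id")] + [
    (col(f"base.{c}") == col(f"compare.{c}")).alias(f"{c}_match") for c in column_list
]
filter_columns = " and ".join(c + "_match" for c in column_list)
# expr("") would fail to parse, so an empty column list falls back to a no-op filter.
filter_expr = ~expr(filter_columns) if filter_columns else lit(True)

mismatch_df = (
    source.alias("base")
    .join(other=target.alias("compare"), on=key_columns, how="inner")
    .select(*select_expr)
    .filter(filter_expr)  # keep only rows where some compared column disagrees
)
mismatch_df.show()  # id=2 survives because 20 != 25
```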
