[Reconcile] Adding Teradata connector #1275
base: main
Conversation
* Removing quotes from the exception message
* During an exception, no output is given for metrics, not even default values.
* skipped preview versions
Creates coverage tests for TSQL, in the same manner as the coverage for Snowflake. Coverage is a non-blocking test suite that allows us to measure how far along the path of syntax coverage we are, and how much of the syntax we are able to convert to DBSQL IR. We capture parsing errors and check the IR to see if it contains any UnresolvedXXX, which flags constructs that we do not currently cover. Closes: databrickslabs#406 Co-authored-by: Valentin Kasas <[email protected]>
…atabrickslabs#402) We remove explicit specification of the builtIn functions, leaving only those that definitely require a special syntax. Progresses: databrickslabs#275 --------- Co-authored-by: Valentin Kasas <[email protected]>
…ckslabs#415) As we now have translation capability for them, we start out by examining the TSQL functions and providing the equivalent Databricks SQL for them.
- Some functions, mainly the common ANSI-defined SQL functions, are directly equivalent.
- Some have the same name but mean different things.
- Some functions have equivalent ways to achieve the same thing.
- Some functions will likely be unsupportable without changes to Databricks SQL.
- Some functions may be unsupportable due to fundamental differences between the TSQL and Databricks SQL engines/models.

Files are supplied under the tests/resources/functional/tsql directory. The TSQL and the Databricks SQL equivalent are provided, and the conversion is documented before the TSQL is introduced, as we may be able to automate documentation to a certain degree using this commentary. --------- Co-authored-by: Valentin Kasas <[email protected]>
…bs#407) First batch of acceptance-test-based improvements targeting missing implementations (materialized as NullPointerException during acceptance tests). The making of this change could serve as a template for future acceptance-test-based improvements:
- focus on a specific type of failure in the acceptance tests
- take measures to fix the error (this can incur changes in the relevant grammar)
- add unit tests to fully cover the changes introduced in the previous step
Analytical windowing functions were specified in the grammar instead of our standard function lookup table. These functions are now removed from the parser grammar and specified in the standard tables. Additionally, the WITHIN GROUP syntax is supported using an additional `expression` rule alt. This makes function handling orthogonal for all but the so-called `builtinFunctions`. We now support analytical aggregate functions, as evinced by the following examples:

```sql
... FIRST_VALUE(Salary) OVER (PARTITION BY DepartmentID ORDER BY Salary DESC)
```

```sql
... PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Salary) OVER (PARTITION BY DepartmentID)
```

Progresses: databrickslabs#357 Progresses: databrickslabs#275
…bs#420) More functions are added to the set of coverage tests for TSQL conversions
…atabrickslabs#414) Some functions must be translated from their TSQL or Snowflake versions into the equivalent IR for Databricks SQL. In some cases a function must be translated in one dialect, say TSQL, but is directly equivalent in another, say Snowflake. Here we upgrade the FunctionBuilder system to be dialect-aware and provide a ConversionStrategy system that allows any type of conversion, from a simple name translation to more complicated specific IR representations when there is no equivalent. For example, the TSQL code

```tsql
SELECT ISNULL(x, 0)
```

should translate to

```sql
SELECT IFNULL(x, 0)
```

in Databricks SQL, but in Snowflake SQL:

```snowflake
SELECT ISNULL(col)
```

is directly equivalent to Databricks SQL and needs no conversion. --------- Co-authored-by: Valentin Kasas <[email protected]>
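To make the idea concrete, here is a minimal Python sketch of a dialect-keyed name translation. It is illustrative only: the actual FunctionBuilder/ConversionStrategy live in the Scala transpiler, and the names `RENAMES` and `convert_function_name` below are hypothetical.

```python
# Illustrative sketch of dialect-aware function-name conversion; not the
# project's actual (Scala) ConversionStrategy implementation.
RENAMES = {
    "tsql": {"ISNULL": "IFNULL"},   # TSQL ISNULL(x, 0) -> Databricks IFNULL(x, 0)
    "snowflake": {},                # Snowflake ISNULL needs no rename here
}

def convert_function_name(dialect: str, name: str) -> str:
    """Return the Databricks SQL name for a source-dialect function,
    falling back to the original name when no rename is registered."""
    return RENAMES.get(dialect, {}).get(name.upper(), name)

print(convert_function_name("tsql", "ISNULL"))       # IFNULL
print(convert_function_name("snowflake", "ISNULL"))  # ISNULL
```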
With Python version 3.11, when trying to initialize the config classes, we get `ValueError: mutable default <class 'databricks.labs.remorph.reconcile.recon_config.StatusOutput'> for field status is not allowed: use default_factory`. This change addresses the error by using `default_factory`.
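For reference, a minimal sketch of the pattern the fix relies on; the field names below are illustrative, not the exact `recon_config` definitions.

```python
from dataclasses import dataclass, field

@dataclass
class StatusOutput:
    row: bool | None = None      # illustrative fields, not the real schema
    column: bool | None = None

@dataclass
class ReconcileOutput:
    # Python 3.11 rejects `status: StatusOutput = StatusOutput()` as a mutable
    # default; default_factory builds a fresh StatusOutput per instance instead.
    status: StatusOutput = field(default_factory=StatusOutput)

print(ReconcileOutput())  # ReconcileOutput(status=StatusOutput(row=None, column=None))
```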
An internal cleanup of grammar and keywords:
- Removes superfluous comments
- Corrects hacks with `""` strings
- Expands keywords - for instance, `SELECT * FROM TABLE` now works
- Removes unused accumulated rules from TSQL
Here we relocate the tests so that they do not occupy one directory per test type, and we add coverage for more TSQL builtin functions. --------- Co-authored-by: Valentin Kasas <[email protected]>
We implement dataType conversions such that we can implement translation of the following functions and pseudo-functions:

```tsql
NEXT VALUE FOR sequence
```

```tsql
CAST(col AS sometype)
```

```tsql
TRY_CAST(col AS sometype)
```

```tsql
JSON_ARRAY(x, 42, y)
JSON_ARRAY(x, 42, y NULL ON NULL)
JSON_ARRAY(x, 42, y ABSENT ON NULL)
```

```tsql
JSON_OBJECT('one': 1, 'two': 2, 'three': 3 ABSENT ON NULL)
JSON_OBJECT('a': a, 'b': b, 'c': c NULL ON NULL)
```

Progresses: databrickslabs#275 --------- Co-authored-by: Serge Smertin <[email protected]> Co-authored-by: Valentin Kasas <[email protected]>
We implement CTE support for TSQL as in the following example:

```tsql
WITH cteTable (col1, col2, col3count) AS (
  SELECT col1, fred, COUNT(OrderDate) AS counter
  FROM Table1
  WHERE fred IS NOT NULL
)
SELECT col2, col1, col3count FROM cteTable
```
Add entries in `SnowflakeFunctionBuilder` for functions that were used in the acceptance tests but not "defined" yet. Note that the grammar around `function_call` has been reworked; this work is still in progress. Future work will include:
- completing the definition of `builtin_function` (functions that have special syntax)
- improving the implementation around `aggregate_function` and `ranking_windowed_function`
Removes superfluous alts defined in the select list element rule in the TSQL grammar. Also removes the accepts on it, which were rightly not used. Note that the rule for parsing udt column declarations may still be relevant for `CREATE TYPE`, hence it is not removed.
## Changes
Added CLI command `databricks labs remorph reconcile` to trigger the `Data Reconciliation` process.

Logic:
- It loads the `reconcile.yml` config and `recon_config.json` from the Workspace. If those are not present, it prompts the user to re-install the `reconcile` module and exits the command.
- It triggers the `Remorph_Reconciliation_Job` based on the Job ID stored in `reconcile.yml`.
- This feature is intended to enhance our tool's capabilities by simplifying the `reconcile` execution process.
- The following are the prerequisites:
  * Use `databricks labs remorph install` to configure the `reconcile` module. It generates the `reconcile.yml` on the Databricks Workspace.
  * Use `databricks labs remorph generate-recon-config` to generate the `recon_config_<SOURCE>.json` and store it on the Databricks Workspace. [PR#249 In Progress](databrickslabs#249)
- Once the prerequisites are met, the user can run the `databricks labs remorph reconcile` CLI command to start the `Data Reconciliation` process.
- The `reconcile` CLI command invokes the `reconciliation` on Databricks Workflows and gives the Job Run URL to check further.

### Linked issues
Resolves databrickslabs#249

### Functionality
- [x] added a new CLI command

### Tests
- [x] manually tested
- [x] added unit tests

### TODO
- [ ] add integration tests
- [ ] verify on staging environment (attach screenshot)

--------- Co-authored-by: Bishwajit <[email protected]> Co-authored-by: Ganesh Dogiparthi <[email protected]> Co-authored-by: SundarShankar89 <[email protected]>
- Added README.md for Reconciliation process --------- Co-authored-by: Ganesh Dogiparthi <[email protected]> Co-authored-by: ganeshdogiparthi-db <[email protected]> Co-authored-by: SundarShankar89 <[email protected]>
Change rule names and Context names to camelCase from snake_case. While this appears to be a huge PR, it is strictly confined to name changes, so a review should not take long.
Here we improve the SELECT statement grammar so that it resembles the Snowflake grammar, without trying to force semantic checks into the syntax. The Scala code to build the IR is essentially reused, with a few changes to accommodate the differences between TSQL and Snowflake. - TODO markers in TSqlRelationBuilder show what is left to implement --------- Co-authored-by: Valentin Kasas <[email protected]>
NB: The previous PR was merged before I was totally finished. Normalizes parameter generation to always use ${} for clarity, conforming to the Databricks notebook examples for widgets. Adds an additional coverage test for variable refs in strings.
…slabs#882) Adds a starting point for useful ANTLR utilities. - Given an ANTLR ParserRuleContext: retrieve the original text from the input source
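As an illustration of the idiom (the utility added here lives on the Scala/Java side of the transpiler), the same thing with the ANTLR Python runtime looks roughly like this; `original_text` is a hypothetical helper name.

```python
from antlr4 import ParserRuleContext

def original_text(ctx: ParserRuleContext) -> str:
    """Recover the exact source text a rule matched, including the whitespace
    and comments that ctx.getText() drops (it concatenates token texts only)."""
    stream = ctx.start.getInputStream()                     # original character stream
    return stream.getText(ctx.start.start, ctx.stop.stop)  # char offsets of first/last token
```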
1. Updated the Spark setup script to check whether the `spark gzip file` exists.
2. Refactored the script to remove the warnings:
   * `Double quote to prevent globbing and word splitting`
   * `Use 'cd ... || exit' or 'cd ... || return' in case cd fails.`
   * `Consider using 'grep -c' instead of 'grep|wc -l'.`
   * `Usage: sleep seconds`, converted 2m to 120 seconds
3. Tested the following scenarios:
   1. Scenario: Extracted Spark folder is already present. Outcome: Directly starts the Spark server using `sbin/start-connect-server.sh`
   2. Scenario: Extracted Spark folder **is not present**, and the zip file (`spark-<VERSION>.tgz`) is present. Outcome: Extract the zip file and start the Spark server
   3. Scenario: Extracted Spark folder **is not present**, and the zip file **is not present**. Outcome: Download, extract, and start the Spark server
…te task` (databrickslabs#864) `alter session | stream...` `create stream` `create task` and `execute task`
Now it generates either Project or Deduplicates. --------- Co-authored-by: Serge Smertin <[email protected]>
```diff
@@ -821,14 +823,14 @@ def get_record_count(self, table_conf: Table, report_type: str) -> ReconcileReco
             table=table_conf.source_name,
             query=source_count_query,
             options=None,
-        ).collect()[0]["count"]
+        ).collect()[0][0]
```
Suggested change:
```suggestion
        ).first()
```
```diff
         target_count = self._target.read_data(
             catalog=self._database_config.target_catalog,
             schema=self._database_config.target_schema,
             table=table_conf.target_name,
             query=target_count_query,
             options=None,
-        ).collect()[0]["count"]
+        ).collect()[0][0]
```
Suggested change:
```suggestion
        ).first()
        source_count = int(source_count_row[0]) if source_count_row is not None else 0
        target_count = int(target_count_row[0]) if target_count_row is not None else 0
```
```diff
@@ -821,14 +823,14 @@ def get_record_count(self, table_conf: Table, report_type: str) -> ReconcileReco
             table=table_conf.source_name,
             query=source_count_query,
             options=None,
-        ).collect()[0]["count"]
+        ).collect()[0][0]
         target_count = self._target.read_data(
```
Suggested change:
```suggestion
        target_count_row = self._target.read_data(
```
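Put together, the suggested changes amount to reading the single count row with `first()` and guarding against an empty result before converting to `int`. Below is a self-contained sketch of that pattern, with `run_count_query` standing in for the real `read_data(...)` call.

```python
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

def run_count_query(df: DataFrame) -> DataFrame:
    """Stand-in for read_data(...): returns a one-row DataFrame with a count column."""
    return df.selectExpr("count(1) AS count")

# first() returns a Row (or None for an empty result), so guard before int().
source_count_row = run_count_query(spark.range(5)).first()
source_count = int(source_count_row[0]) if source_count_row is not None else 0
print(source_count)  # 5
```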
```diff
@@ -166,7 +172,7 @@ def _get_mismatch_df(source: DataFrame, target: DataFrame, key_columns: list[str
     select_expr = key_cols + source_aliased + target_aliased + match_expr

     filter_columns = " and ".join([column + "_match" for column in column_list])
-    filter_expr = ~expr(filter_columns)
+    filter_expr = ~expr(filter_columns) if filter_columns else lit(True)

     mismatch_df = (
```
Suggested change:
```suggestion
    mismatch_df = (
        source.alias('base')
        .join(other=target.alias('compare'), on=key_columns, how="inner")
        .select(*select_expr)
    )
```
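The `lit(True)` fallback in the `_get_mismatch_df` diff above guards the case where `column_list` is empty: joining zero `*_match` columns yields an empty string, which is not a valid expression for `expr()`. A small sketch of that guard (the empty `column_list` is hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, lit

spark = SparkSession.builder.getOrCreate()  # Column helpers need an active session

column_list: list[str] = []  # hypothetical: no comparison columns configured
filter_columns = " and ".join([column + "_match" for column in column_list])
# ~expr("") would fail to parse, so fall back to a constant-true Column.
filter_expr = ~expr(filter_columns) if filter_columns else lit(True)
```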