[Reconcile] Adding Teradata connector #1275

Draft
wants to merge 459 commits into main

Conversation

srikanth-db

No description provided.

ganeshdogiparthi-db and others added 30 commits May 31, 2024 11:55
* Removing quotes from the exception message
* During an exception, no output is given for metrics, not even default
values.
Creates coverage tests for TSQL, in the same manner as the coverage for
Snowflake.

Coverage is a non-blocking test suite that allows us to measure how far
along the path of syntax coverage we are, and how much of the syntax we
are able to convert to DBSQL IR.

We capture parsing errors and check the IR to see whether it contains any
UnresolvedXXX nodes, which flag constructs that we do not currently cover.

Closes: databrickslabs#406

Co-authored-by: Valentin Kasas <[email protected]>
…atabrickslabs#402)

We remove explicit specification of the builtIn functions, leaving only
those that definitely require special syntax.

Progresses: databrickslabs#275

---------

Co-authored-by: Valentin Kasas <[email protected]>
…ckslabs#415)

As we now have translation capability for them, we start by examining the
TSQL functions and providing the equivalent Databricks SQL for each.

- Some functions, mainly the common ANSI defined SQL functions are
directly equivalent.
  - Some have the same name but mean different things.
  - Some functions have equivalent ways to achieve the same thing.
- Some functions will likely be unsupportable without changes to
Databricks SQL.
- Some functions may be unsupportable due to fundamental differences
between TSQL and Databricks SQL engines/models.
  
Files are supplied under the tests/resources/functional/tsql directory.
The TSQL and the Databricks SQL equivalent are provided, and the conversion is
documented before the TSQL is introduced, as we may be able to automate
documentation to a certain degree using this commentary.

---------

Co-authored-by: Valentin Kasas <[email protected]>
…bs#407)

First batch of acceptance-test-based improvements targeting missing
implementations (materialized as NullPointerException during acceptance
tests).

The making of this change could serve as a template for future
acceptance-test-based improvements:

- focus on a specific type of failure in the acceptance tests
- take measures to fix the error (this can incur changes in the relevant grammar)
- add unit tests to fully cover the changes introduced in the previous step

Analytical windowing functions were specified in the grammar instead of
our standard function lookup table.

These functions are now removed from the parser grammar and specified in
the standard tables.

Additionally, the WITHIN GROUP syntax is supported using an additional
`expression` rule alt. This makes function handling orthogonal for all
but the so-called `builtinFunctions`.

We now support analytical aggregate functions, as evidenced by the
following examples:

```sql
... FIRST_VALUE(Salary) OVER (PARTITION BY DepartmentID ORDER BY Salary DESC)
``` 

```sql
... PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Salary) OVER (PARTITION BY DepartmentID)
```

Progresses: databrickslabs#357 
Progresses: databrickslabs#275
…bs#420)

More functions are added to the set of coverage tests for TSQL
conversions
…atabrickslabs#414)

Some functions must be translated from their TSQL or Snowflake versions into
the equivalent IR for Databricks SQL. In some cases a function must be
translated in one dialect, say TSQL, but is directly equivalent in another,
say Snowflake.

Here we upgrade the FunctionBuilder system to be dialect aware, and provide
a ConversionStrategy system that allows for any type of conversion, from a
simple name translation to a more complicated specific IR representation
when there is no direct equivalent.

For example, the TSQL code
```tsql
SELECT ISNULL(x, 0)
```

should translate to:

```sql
SELECT IFNULL(x, 0)
```

in Databricks SQL. In Snowflake SQL, however:

```snowflake
SELECT ISNULL(col)
```

is directly equivalent to Databricks SQL and needs no conversion.

---------

Co-authored-by: Valentin Kasas <[email protected]>
With Python 3.11, when trying to initialize the config classes, we were
getting `ValueError: mutable default <class
'databricks.labs.remorph.reconcile.recon_config.StatusOutput'> for field
status is not allowed: use default_factory`.

This change addresses the error by using `default_factory`.
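
A minimal sketch of the fix, with illustrative class and field names rather
than the actual `recon_config` schema:

```python
from dataclasses import dataclass, field


@dataclass
class StatusOutput:
    # Hypothetical fields; the real recon_config.StatusOutput may differ.
    row: bool | None = None
    column: bool | None = None


@dataclass
class ReconcileOutput:
    # Python 3.11 rejects `status: StatusOutput = StatusOutput()` because the
    # default is a mutable (unhashable) instance; default_factory constructs a
    # fresh StatusOutput for every ReconcileOutput instead.
    status: StatusOutput = field(default_factory=StatusOutput)
```
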
An internal cleanup of grammar and keywords:

  - Removes superfluous comments
  - Corrects hacks with "" strings
  - Expands keywords - for instance SELECT * FROM TABLE now works
  - Removes unused accumulated rules from TSQL
Here we relocate the tests so that they do not occupy one directory per
test type, and we add coverage for more TSQL builtin functions.

---------

Co-authored-by: Valentin Kasas <[email protected]>
We implement dataType conversions such that we can implement translation
of the following functions and pseudo-functions:

```tsql
NEXT VALUE FOR sequence
```
```tsql
CAST(col AS sometype)
```
```tsql
TRY_CAST(col AS sometype)
```

```tsql
JSON_ARRAY(x, 42, y)
JSON_ARRAY(x, 42, y NULL ON NULL)
JSON_ARRAY(x, 42, y ABSENT ON NULL)
```
```tsql
JSON_OBJECT('one': 1, 'two': 2, 'three': 3 ABSENT ON NULL)
JSON_OBJECT('a': a, 'b': b, 'c': c NULL ON NULL)
```

Progresses: databrickslabs#275

---------

Co-authored-by: Serge Smertin <[email protected]>
Co-authored-by: Valentin Kasas <[email protected]>
We implement CTE support for TSQL as in the following example:

```tsql
WITH cteTable (col1, col2, col3count)
AS
(
    SELECT col1, fred, COUNT(OrderDate) AS counter
    FROM Table1
    WHERE fred IS NOT NULL
)
SELECT col2, col1, col3count
FROM cteTable
```
Add entries in `SnowflakeFunctionBuilder` for functions that were used
in the acceptance tests but not "defined" yet.

Note that the grammar around `function_call` has been reworked; this
work is still in progress. Future work will include:
- completing the definition of `builtin_function` (functions that have
special syntax)
- improving implementation around `aggregate_function` and
`ranking_windowed_function`
Removes superfluous alts defined in the select list element rule in the TSQL
grammar. Also removes the accepts on it, which were rightly not used.

Note that the rule for parsing udt column declarations may still be
relevant for `CREATE TYPE`, hence it is not removed.
## Changes

Added CLI command `databricks labs remorph reconcile` to trigger the
`Data Reconciliation` process.

Logic:
- It loads the `reconcile.yml` config and `recon_config.json` from the
Workspace. If those are not present, it prompts the user to re-install
`reconcile` module and exits the command.
- It triggers the `Remorph_Reconciliation_Job` based on the Job ID
stored in the `reconcile.yml`.
- This feature is intended to enhance the tool's capabilities by
simplifying the `reconcile` execution process.

- The following are the prerequisites:
* Use `databricks labs remorph install` to configure the `reconcile`
module. It generates the `reconcile.yml` on the Databricks Workspace.
* Use `databricks labs remorph generate-recon-config` to generate the
`recon_config_<SOURCE>.json` and store it on the Databricks Workspace.
[PR#249 In
Progress](databrickslabs#249)

- Once the prerequisites are met, the user can run the `databricks labs
remorph reconcile` CLI command to start the `Data Reconciliation` process.
- The `reconcile` CLI command invokes the reconciliation on Databricks
Workflows and returns the Job Run URL for further tracking.

### Linked issues
Resolves 
databrickslabs#249

### Functionality
- [x]  added a new CLI command

### Tests
- [x]  manually tested
- [x]  added unit tests

### TODO

- [ ]  add integration tests
- [ ]  verify on staging environment (attach screenshot )

---------

Co-authored-by: Bishwajit <[email protected]>
Co-authored-by: Ganesh Dogiparthi <[email protected]>
Co-authored-by: SundarShankar89 <[email protected]>
- Added README.md for Reconciliation process

---------

Co-authored-by: Ganesh Dogiparthi <[email protected]>
Co-authored-by: ganeshdogiparthi-db <[email protected]>
Co-authored-by: SundarShankar89 <[email protected]>
Change rule names and Context names to camelCase from snake_case.

While this appears to be a huge PR, it is strictly confined to name
changes, so a review should not take long.
Here we improve the SELECT statement grammar such that it resembles the
Snowflake grammar, without trying to force semantic checks into the
syntax. The Scala code to build the IR is essentially reused, with a few
changes to accommodate the differences between TSQL and Snowflake.
   
   - TODO markers in TSqlRelationBuilder show what is left to implement

---------

Co-authored-by: Valentin Kasas <[email protected]>
jimidle and others added 18 commits August 30, 2024 16:00
NB: The previous PR was merged before I was totally finished.

Normalizes parameter generation to always use ${} for clarity and to conform
to the Databricks notebook examples for widgets.

Adds an additional coverage test for variable refs in strings.
…slabs#882)

Adds a starting point for useful ANTLR utilities.

- Given an ANTLR ParserRuleContext, retrieve the original text from the
input source (a sketch of the technique follows).
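
As a rough sketch of that technique (shown here with the ANTLR Python runtime;
the repo's utility itself is Scala, and the helper name is illustrative):

```python
from antlr4 import ParserRuleContext


def original_text(ctx: ParserRuleContext) -> str:
    # ctx.getText() concatenates token text and loses the whitespace between
    # tokens; slicing the underlying character stream by the context's
    # start/stop character indices preserves the source exactly as written.
    stream = ctx.start.getInputStream()
    return stream.getText(ctx.start.start, ctx.stop.stop)
```
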
1. Updated the Spark setup script to check whether the Spark gzip file already exists.

1. Refactored the script to remove the warnings:
    * `Double quote to prevent globbing and word splitting`
    * `Use 'cd ... || exit' or 'cd ... || return' in case cd fails.`
    * `Consider using 'grep -c' instead of 'grep|wc -l'.`
    * `Usage: sleep seconds`, converted 2m to 120 seconds

1. Tested the following scenarios:
    1. Scenario: Extracted Spark folder is already present.
       Outcome: Directly starts the Spark server using `sbin/start-connect-server.sh`.
    1. Scenario: Extracted Spark folder **is not present**, and the gzip file (`spark-<VERSION>.tgz`) is present.
       Outcome: Extracts the gzip file and starts the Spark server.
    1. Scenario: Extracted Spark folder **is not present**, and the gzip file **is not present**.
       Outcome: Downloads, extracts, and starts the Spark server.
…te task` (databrickslabs#864)

`alter session | stream...` `create stream` `create task` and `execute
task`
now it generates either Project or Deduplicates.

---------

Co-authored-by: Serge Smertin <[email protected]>
@sundarshankar89 changed the title from "adding Teradata connector" to "[Reconcile] Adding Teradata connector" on Dec 4, 2024
@@ -821,14 +823,14 @@ def get_record_count(self, table_conf: Table, report_type: str) -> ReconcileReco
table=table_conf.source_name,
query=source_count_query,
options=None,
).collect()[0]["count"]
).collect()[0][0]

Suggested change
).collect()[0][0]
).first()

target_count = self._target.read_data(
catalog=self._database_config.target_catalog,
schema=self._database_config.target_schema,
table=table_conf.target_name,
query=target_count_query,
options=None,
).collect()[0]["count"]
).collect()[0][0]

Suggested change
).collect()[0][0]
).first()
source_count = int(source_count_row[0]) if source_count_row is not None else 0
target_count = int(target_count_row[0]) if target_count_row is not None else 0
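
For context, a minimal PySpark sketch of the pattern this suggestion leads to
(the query and table name are placeholders, not the connector's actual calls):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder for the count query that read_data() would issue.
count_df = spark.sql("SELECT COUNT(*) AS count FROM some_table")

# first() returns a Row or None, so an empty result cannot raise the
# IndexError that collect()[0][0] would.
source_count_row = count_df.first()
source_count = int(source_count_row[0]) if source_count_row is not None else 0
```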

@@ -821,14 +823,14 @@ def get_record_count(self, table_conf: Table, report_type: str) -> ReconcileReco
table=table_conf.source_name,
query=source_count_query,
options=None,
).collect()[0]["count"]
).collect()[0][0]
target_count = self._target.read_data(

Suggested change
target_count = self._target.read_data(
target_count_row = self._target.read_data(

@@ -166,7 +172,7 @@ def _get_mismatch_df(source: DataFrame, target: DataFrame, key_columns: list[str
select_expr = key_cols + source_aliased + target_aliased + match_expr

filter_columns = " and ".join([column + "_match" for column in column_list])
filter_expr = ~expr(filter_columns)
filter_expr = ~expr(filter_columns) if filter_columns else lit(True)

mismatch_df = (

Suggested change
mismatch_df = (
mismatch_df = (
source.alias('base')
.join(other=target.alias('compare'), on=key_columns, how="inner")
.select(*select_expr)
)
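
For context, a minimal self-contained PySpark sketch of the guarded mismatch
filter above (the data and column names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr, lit

spark = SparkSession.builder.getOrCreate()

# Illustrative frames; the real ones come from read_data() on source/target.
source = spark.createDataFrame([(1, 10), (2, 20)], ["id", "amount"])
target = spark.createDataFrame([(1, 10), (2, 25)], ["id", "amount"])
key_columns = ["id"]
column_list = ["amount"]

select_expr = [col("id")] + [
    (col(f"base.{c}") == col(f"compare.{c}")).alias(f"{c}_match") for c in column_list
]
filter_columns = " and ".join(c + "_match" for c in column_list)
# expr("") would fail to parse, so an empty column list falls back to a no-op filter.
filter_expr = ~expr(filter_columns) if filter_columns else lit(True)

mismatch_df = (
    source.alias("base")
    .join(other=target.alias("compare"), on=key_columns, how="inner")
    .select(*select_expr)
    .filter(filter_expr)  # keep only rows where some compared column disagrees
)
mismatch_df.show()  # id=2 survives because 20 != 25
```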
