Queries for automated freshness detection #1254

Open
bastienboutonnet opened this issue Apr 4, 2022 · 0 comments

The automated freshness detection process will need to select data from users' datasets in order to suggest two things: the column on which to place a freshness test, and a data-driven time delta to use as the threshold in the check's configuration.

The detection algorithm needs to see data from the datetime-like columns of a dataset. This means that, for each dataset, we should be able to check the database type of each column, select from the datetime-like columns, and pass the results to the freshness detector.

We assume soda-core knows:

  • how to introspect a dataset's metadata (column database types)
  • how to select from all supported databases

Soda-core should make available to the freshness detector a way to do something like:
"Give me a certain number of rows from dataset x's datetime-like columns"

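As a rough illustration of that request, here is a minimal sketch using SQLite from the Python standard library. It is not soda-core's API: the function names, the `DATETIME_LIKE_TYPES` set and the default limit are all assumptions made for the example, and real warehouses expose column types differently.

```python
import sqlite3
from typing import List, Tuple

# Assumption: declared types treated as "datetime-like". Each warehouse would
# need its own mapping; this set is only illustrative.
DATETIME_LIKE_TYPES = {"DATE", "DATETIME", "TIMESTAMP"}


def datetime_like_columns(conn: sqlite3.Connection, table: str) -> List[str]:
    """Return the columns of `table` whose declared type looks datetime-like."""
    # pragma_table_info is SQLite's table-valued form of PRAGMA table_info.
    rows = conn.execute(
        "SELECT name, type FROM pragma_table_info(?)", (table,)
    ).fetchall()
    return [name for name, col_type in rows if col_type.upper() in DATETIME_LIKE_TYPES]


def fetch_datetime_rows(
    conn: sqlite3.Connection, table: str, limit: int = 1000
) -> Tuple[List[str], list]:
    """Give me up to `limit` rows from the datetime-like columns of `table`."""
    columns = datetime_like_columns(conn, table)
    if not columns:
        return columns, []
    # Identifiers are interpolated, so `table` is assumed to come from trusted
    # metadata rather than user input.
    column_list = ", ".join(f'"{c}"' for c in columns)
    rows = conn.execute(
        f'SELECT {column_list} FROM "{table}" LIMIT ?', (limit,)
    ).fetchall()
    return columns, rows
```

The detector would then receive the column names together with the sampled values, and never needs to know which database they came from.
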
Scale Considerations

  • How not to blow up memory?
    Because data will have to be brought into memory, we will have to be careful not to always select the entire row set of a table. Based on our POC, we think that around 1000 records is usually enough to make a decision. We could therefore do the following:
    • Quickly check the number of rows we are about to retrieve from the table. If this exceeds, say, 1000 rows, we can issue our select with a limit 1000 clause.
    • We may want to consider taking several samples with a limit 1000 to get random sections of the dataset on which to repeatedly attempt to detect the freshness column. The freshness detector would then work on the median across the sample iterations. For soda-core, this means automated freshness may want to call for either a sampling approach or a simple limit 1000. This is not necessary in the first iteration unless we see a low success rate.
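
The two options above could look roughly like the following sketch, again using SQLite only for illustration. The function names and the `score` callable are hypothetical: `score` stands in for whatever per-sample result the detector produces, which this issue does not specify. `ORDER BY RANDOM()` is a SQLite idiom that scans the whole table, so other warehouses would likely want their own sampling clause.

```python
import sqlite3
from statistics import median
from typing import Callable, List

MAX_ROWS = 1000  # from the POC: roughly 1000 records is usually enough


def fetch_capped(
    conn: sqlite3.Connection, table: str, column: str, max_rows: int = MAX_ROWS
) -> List:
    """Count first, then add a LIMIT only when the table exceeds max_rows."""
    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    query = f'SELECT "{column}" FROM "{table}"'
    if total > max_rows:
        query += f" LIMIT {int(max_rows)}"
    return [row[0] for row in conn.execute(query)]


def median_over_samples(
    conn: sqlite3.Connection,
    table: str,
    column: str,
    score: Callable[[List], float],  # hypothetical per-sample detector output
    n_samples: int = 5,
    max_rows: int = MAX_ROWS,
) -> float:
    """Draw several random samples and return the median of their scores."""
    scores = []
    for _ in range(n_samples):
        rows = conn.execute(
            f'SELECT "{column}" FROM "{table}" ORDER BY RANDOM() LIMIT ?',
            (max_rows,),
        ).fetchall()
        scores.append(score([r[0] for r in rows]))
    return median(scores)
```

Either way at most `max_rows` values are held in memory at a time, which is the point of the cap. The sampling variant costs `n_samples` queries in exchange for a more stable result.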

@vijaykiran and @tombaeyens I'd like to get your eyes on this so that we can start grooming the integration. On our side, the detector POC will be turned into production code soon, as we were pretty successful at cracking this problem, so the next step is to work out the integration.

Let's start some thinking async and set up a call to iron out uncertainty, questions and final decisions.
