Queries for automated freshness detection #1254

bastienboutonnet · 2022-04-04T11:49:03Z

The automated freshness detection process will need to select data from users' datasets in order to suggest two things: the column on which to place a freshness test, and a data-driven time delta to use as the threshold in the check's configuration.

The detection algo needs to see data from the datetime-like columns of a dataset. This means that, for each dataset, we should be able to check the database type of each column, select from the datetime-like ones, and pass the results to the freshness detector.

We assume soda-core knows:

- how to introspect a dataset's metadata (column database types)
- how to select from all supported databases

Soda-core should give the freshness detector a way to do something like:
"Give me a certain number of rows from dataset x's datetime-like columns"
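To make the request concrete, here is a minimal sketch of what such a call could look like. It is not soda-core's API: it uses Python's built-in `sqlite3` as a stand-in database, and the names `datetime_like_columns`, `fetch_datetime_rows`, and `DATETIME_TYPE_HINTS` are hypothetical. Matching on the declared type name is also an assumption; each supported database would need its own type mapping.

```python
import sqlite3
from typing import Dict, List

# Assumption: a column is "datetime-like" if its declared type contains one
# of these fragments. Real warehouses would need a per-database mapping.
DATETIME_TYPE_HINTS = ("DATE", "TIME")


def datetime_like_columns(conn: sqlite3.Connection, table: str) -> List[str]:
    """Return the names of columns whose declared type looks datetime-like."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    return [
        name
        for _cid, name, col_type, *_rest in rows
        if any(hint in (col_type or "").upper() for hint in DATETIME_TYPE_HINTS)
    ]


def fetch_datetime_rows(
    conn: sqlite3.Connection, table: str, limit: int = 1000
) -> Dict[str, list]:
    """Select up to `limit` rows from the datetime-like columns of `table`."""
    columns = datetime_like_columns(conn, table)
    if not columns:
        return {}
    column_sql = ", ".join(f'"{c}"' for c in columns)
    rows = conn.execute(
        f'SELECT {column_sql} FROM "{table}" LIMIT ?', (limit,)
    ).fetchall()
    # Return one list of values per column, ready to hand to a detector.
    return {col: [row[i] for row in rows] for i, col in enumerate(columns)}
```

The detector would then receive a mapping from column name to values and never needs to know which database the data came from.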

Scale Considerations

How not to blow up memory?
Because the data has to be brought into memory, we need to be careful not to select the entire row set of a table every time. Based on our POC, we think around 1000 records is usually enough to make a decision. We could therefore do the following:
- Quickly check the number of rows we are about to retrieve from the table. If it exceeds, say, 1000 rows, we can issue our select with a limit 1000 clause.
- We may want to consider taking several samples, each with a limit 1000, to get random sections of the dataset on which to repeatedly attempt to detect the freshness column. The freshness detector would then work on the median across the sample iterations. For soda-core, this means automated freshness may call for either a sampling approach or a simple limit 1000. This is not necessary in the first iteration unless we see a low success rate.
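The two options above can be sketched as follows. This is only an illustration under stated assumptions: it uses `sqlite3` as a stand-in, the function names `fetch_capped` and `median_over_samples` are hypothetical, and `score` stands for whatever per-sample number the detector produces, which this issue does not define. `ORDER BY RANDOM()` is a SQLite idiom that scans the whole table; other databases offer their own sampling syntax.

```python
import sqlite3
import statistics
from typing import Callable, Sequence

MAX_ROWS = 1000  # row cap suggested by the POC; treat as tunable


def fetch_capped(
    conn: sqlite3.Connection, table: str, column: str, max_rows: int = MAX_ROWS
) -> list:
    """Count rows first and add a LIMIT only when the table exceeds the cap."""
    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    query = f'SELECT "{column}" FROM "{table}"'
    if total > max_rows:
        query += f" LIMIT {int(max_rows)}"
    return [row[0] for row in conn.execute(query)]


def median_over_samples(
    conn: sqlite3.Connection,
    table: str,
    column: str,
    score: Callable[[Sequence], float],
    n_samples: int = 5,
    sample_size: int = MAX_ROWS,
) -> float:
    """Score several random samples and return the median of the scores."""
    scores = []
    for _ in range(n_samples):
        rows = conn.execute(
            f'SELECT "{column}" FROM "{table}" ORDER BY RANDOM() LIMIT ?',
            (sample_size,),
        ).fetchall()
        # `score` is a placeholder for the detector's per-sample output.
        scores.append(score([row[0] for row in rows]))
    return statistics.median(scores)
```

Only one sample is held in memory at a time, so peak memory stays bounded by `sample_size` regardless of how many iterations are run.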

@vijaykiran and @tombaeyens I'd like to get your eyes on this so that we can start grooming the integration. From our side, the detector POC will be made into production code soon, as we were pretty successful at cracking this problem, so what we need to do next is work out the integration.

Let's start some thinking async and set up a call to iron out uncertainty, questions and final decisions.

bastienboutonnet added the freshness label Apr 4, 2022

bastienboutonnet assigned vijaykiran, tombaeyens and bastienboutonnet Apr 4, 2022

vijaykiran removed their assignment Oct 28, 2024
