The automated freshness detection process will need to select data from users' datasets in order to suggest two things: the column on which to place a freshness test, and a data-driven time delta to use as the threshold in the check's configuration.
The detection algorithm needs to see data from the datetime-like columns of a dataset. This means that, for each dataset, we should be able to check the database type of each column, select from the datetime-like columns, and pass the results to the freshness detector.
We assume soda-core knows:
how to introspect a dataset's metadata (column database types)
how to select from all supported databases
Soda-core should give the freshness detector a way to do something like:
"Give me a certain number of rows from dataset x's datetime-like columns"
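To make the request above concrete, here is a minimal sketch of what such a helper could look like. It uses the standard-library `sqlite3` module as a stand-in data source; the function names (`datetime_like_columns`, `sample_datetime_columns`) and the type-matching rule are assumptions for illustration and are not part of soda-core's actual API.

```python
import sqlite3

# Assumption: a column is "datetime-like" if its declared type contains one of
# these tokens (covers DATE, DATETIME, TIME, TIMESTAMP). Real warehouses expose
# types differently, so this rule would need to be per data source.
DATETIME_TOKENS = ("DATE", "TIME")


def datetime_like_columns(conn, table):
    """Return the names of columns whose declared type looks datetime-like."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    return [
        name
        for _, name, col_type, *_ in rows
        if any(token in col_type.upper() for token in DATETIME_TOKENS)
    ]


def sample_datetime_columns(conn, table, limit=1000):
    """Return {column name: list of values} for up to `limit` rows."""
    columns = datetime_like_columns(conn, table)
    if not columns:
        return {}
    col_list = ", ".join(f'"{c}"' for c in columns)
    rows = conn.execute(
        f'SELECT {col_list} FROM "{table}" LIMIT ?', (limit,)
    ).fetchall()
    return {col: [row[i] for row in rows] for i, col in enumerate(columns)}
```

The detector would then receive one list of values per candidate column and never needs to know which database the values came from.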
Scale Considerations
How not to blow up memory?
Because the data has to be brought into memory, we have to be careful not to select the entire row set of a table every time. Based on our POC, we think something like 1000 records is usually enough to make a decision. We could therefore do the following:
Quickly check the number of rows we are about to retrieve from the table. If this exceeds, say, 1000 rows, we can issue our select with a limit 1000 clause.
We may want to consider drawing several samples, each with a limit 1000, to get random sections of the dataset on which to repeatedly attempt to detect the freshness column. The freshness detector would then work on the median across the sample iterations. For soda-core, this means automated freshness may want to call for either a sampling approach or a simple limit 1000. This is not necessary in the first iteration unless we see a low success rate.
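The two options above could be sketched roughly as follows, again using `sqlite3` as a stand-in. All function names are hypothetical, `detect` stands for whatever numeric estimate the freshness detector produces from one sample (an assumption, since the detector's interface is not defined here), and `ORDER BY RANDOM()` is just one simple way to get a random section; other databases offer different sampling mechanisms.

```python
import sqlite3
import statistics


def fetch_bounded(conn, table, column, max_rows=1000):
    """Select the column, adding a LIMIT only when the table exceeds max_rows."""
    (row_count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    query = f'SELECT "{column}" FROM "{table}"'
    if row_count > max_rows:
        query += f" LIMIT {int(max_rows)}"
    return [value for (value,) in conn.execute(query)]


def fetch_random_sample(conn, table, column, max_rows=1000):
    """Select up to max_rows randomly chosen values from the column."""
    # The database still scans the table to shuffle it, but only max_rows
    # values are brought into memory on our side.
    query = f'SELECT "{column}" FROM "{table}" ORDER BY RANDOM() LIMIT ?'
    return [value for (value,) in conn.execute(query, (max_rows,))]


def median_over_samples(conn, table, column, detect, iterations=5, max_rows=1000):
    """Run `detect` on several random samples and return the median estimate."""
    estimates = [
        detect(fetch_random_sample(conn, table, column, max_rows))
        for _ in range(iterations)
    ]
    return statistics.median(estimates)
```

In a first iteration only `fetch_bounded` would be needed; the repeated-sampling path can be added later if the success rate turns out to be low.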
@vijaykiran and @tombaeyens, I'd like to get your eyes on this so that we can start grooming the integration. On our side, the detector POC will be turned into production code soon, as we were pretty successful at cracking this problem, so the next step is to work out the integration.
Let's start thinking about this async and set up a call to iron out uncertainties, open questions and final decisions.