feat: Implement the statistics_cache function
#19054
base: main
statistics_cache function
datafusion-cli/src/main.rs
Outdated
let sql = "SELECT split_part(path, '/', -1) as filename, file_size_bytes, num_rows, num_columns, table_size_bytes from statistics_cache() order by filename";
let df = ctx.sql(sql).await?;
let rbs = df.collect().await?;
assert_snapshot!(batches_to_string(&rbs),@r"
++
++
");
This confirms that the file statistics cache is not populated when the table is created, only after accessing it once.
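The behavior described in this comment (an empty cache right after table creation, populated on first access) can be illustrated with a minimal, standard-library-only sketch. It is not DataFusion's implementation: `FileStats`, `LazyStatsCache`, and `get_or_compute` are hypothetical names chosen for illustration.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical stand-in for per-file statistics (not DataFusion's `Statistics`).
#[derive(Debug, Clone, PartialEq)]
struct FileStats {
    num_rows: usize,
    file_size_bytes: u64,
}

/// Toy cache that is filled lazily: nothing is stored until a file is first read.
#[derive(Default)]
struct LazyStatsCache {
    entries: Mutex<HashMap<String, FileStats>>,
}

impl LazyStatsCache {
    /// Returns the cached statistics for `path`, computing and storing them
    /// only if this is the first access for that path.
    fn get_or_compute(&self, path: &str, compute: impl FnOnce() -> FileStats) -> FileStats {
        let mut entries = self.entries.lock().unwrap();
        entries.entry(path.to_string()).or_insert_with(compute).clone()
    }

    /// Number of entries currently cached.
    fn len(&self) -> usize {
        self.entries.lock().unwrap().len()
    }
}
```

In this sketch, a freshly created cache reports zero entries, which mirrors the empty snapshot in the test above; the first call to `get_or_compute` for a path inserts an entry, and later calls for the same path reuse it.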
Do we want to provide a pre-warming option for it, similar to metadataCache?
I think so, it's being implemented here: #18971.
num_rows: stats.num_rows,
num_columns: stats.column_statistics.len(),
table_size_bytes: stats.total_byte_size,
statistics_size_bytes: 0, // TODO: set to the real size in the future
In the future we need to set this to the real size.
Also the
CacheAccessor<Path, Arc<Statistics>, Extra = ObjectMeta>
{
    /// Retrieves the information about the entries currently cached.
    fn list_entries(&self) -> HashMap<Path, FileStatisticsCacheEntry>;
Do we want to make this thread-safe? When we bring in a size limit, we might want to evict cache objects as well.
Not sure I understand; this should be thread-safe as is, right?
Sorry, I confused this with an inner object created as a HashMap; this makes sense here.
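To illustrate why a `list_entries(&self)` method that returns an owned `HashMap` can be thread-safe, here is a simplified sketch using only the standard library. It assumes the cache keeps its state behind a lock and hands out a cloned snapshot. `StatsCache`, `SharedStatsCache`, `Entry`, and `fill_concurrently` are hypothetical names, and the real cache may use a different concurrent map.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical, simplified entry; the PR's `FileStatisticsCacheEntry` has more fields.
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    num_rows: usize,
}

/// Simplified stand-in for the trait discussed in this thread.
trait StatsCache: Send + Sync {
    fn put(&self, path: &str, entry: Entry);
    /// Returns an owned snapshot, so callers never hold the internal lock.
    fn list_entries(&self) -> HashMap<String, Entry>;
}

/// Assumed implementation: shared state guarded by an `RwLock`.
#[derive(Default)]
struct SharedStatsCache {
    inner: RwLock<HashMap<String, Entry>>,
}

impl StatsCache for SharedStatsCache {
    fn put(&self, path: &str, entry: Entry) {
        self.inner.write().unwrap().insert(path.to_string(), entry);
    }

    fn list_entries(&self) -> HashMap<String, Entry> {
        // Clone under a read lock; the lock is released when this returns.
        self.inner.read().unwrap().clone()
    }
}

/// Inserts one entry per writer thread, then returns a snapshot of the cache.
fn fill_concurrently(cache: Arc<dyn StatsCache>, writers: usize) -> HashMap<String, Entry> {
    let handles: Vec<_> = (0..writers)
        .map(|i| {
            let cache = Arc::clone(&cache);
            thread::spawn(move || {
                cache.put(&format!("file_{}.parquet", i), Entry { num_rows: i })
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    cache.list_entries()
}
```

Because the returned map is a copy, later insertions or evictions do not change a snapshot that a caller already holds. The trade-off is the cost of cloning on every call.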
There is a problem with the spell checking in the output of the function example, but I don't know how to deal with it since it should be treated verbatim. cc: @alamb
Co-authored-by: Martin Grigorov <[email protected]>
Which issue does this PR close?
datafusion-cli #18953.
Rationale for this change
Allow a way to check the contents of the file statistics cache.
What changes are included in this PR?
Added the statistics_cache function to datafusion-cli.
Changed FileStatisticsCache to a trait and implemented the list_entries method.
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes, FileStatisticsCache has been changed to a trait. Previous implementations need to implement the list_entries method.