@nuno-faria
Contributor

Which issue does this PR close?

Rationale for this change

Provide a way to inspect the contents of the file statistics cache.

What changes are included in this PR?

  • Added the statistics_cache function to datafusion-cli.
  • Converted FileStatisticsCache to a trait and implemented the list_entries method.
  • Added unit tests.
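
As a rough illustration of the second bullet, the shape of a cache trait with an inspection method can be sketched as follows. This is a minimal sketch, not the DataFusion definition: `StatsCache`, `EntrySummary`, and `InMemoryStatsCache` are hypothetical names, paths are plain strings, and the entry carries only a few of the fields discussed in this PR.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical, simplified stand-in for the per-file entry reported by the cache.
#[derive(Debug, Clone, PartialEq)]
struct EntrySummary {
    file_size_bytes: u64,
    num_rows: Option<u64>,
    num_columns: usize,
}

/// Hypothetical, simplified cache trait with an inspection method.
trait StatsCache {
    /// Stores (or overwrites) the entry for a file path.
    fn put(&self, path: &str, entry: EntrySummary);
    /// Returns an owned snapshot of the entries currently cached.
    fn list_entries(&self) -> HashMap<String, EntrySummary>;
}

/// Illustrative implementation backed by a mutex-protected map.
#[derive(Default)]
struct InMemoryStatsCache {
    entries: Mutex<HashMap<String, EntrySummary>>,
}

impl StatsCache for InMemoryStatsCache {
    fn put(&self, path: &str, entry: EntrySummary) {
        self.entries.lock().unwrap().insert(path.to_string(), entry);
    }

    fn list_entries(&self) -> HashMap<String, EntrySummary> {
        // Clone under the lock so callers never hold a reference into the cache.
        self.entries.lock().unwrap().clone()
    }
}

fn main() {
    let cache = InMemoryStatsCache::default();
    cache.put(
        "data/a.parquet",
        EntrySummary { file_size_bytes: 1024, num_rows: Some(10), num_columns: 3 },
    );
    for (path, entry) in cache.list_entries() {
        println!("{path}: {entry:?}");
    }
}
```

Because `list_entries` is a required method in this sketch, any other implementation of the trait would have to provide it too, which mirrors the user-facing change described in this PR.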

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes. FileStatisticsCache is now a trait, so existing implementations must also implement the list_entries method.

@nuno-faria changed the title from "feat: Implement statistics_cache function" to "feat: Implement the statistics_cache function" on Dec 2, 2025
@github-actions (bot) added the labels documentation (Improvements or additions to documentation), catalog (Related to the catalog crate), and execution (Related to the execution crate) on Dec 2, 2025
Comment on lines 699 to 705
let sql = "SELECT split_part(path, '/', -1) as filename, file_size_bytes, num_rows, num_columns, table_size_bytes from statistics_cache() order by filename";
let df = ctx.sql(sql).await?;
let rbs = df.collect().await?;
assert_snapshot!(batches_to_string(&rbs),@r"
++
++
");
Contributor Author

This confirms that the file statistics cache is not populated when the table is created, only after accessing it once.
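
The behavior this test checks, an empty cache after table creation that fills on first access, can be sketched in isolation. This is a minimal sketch under stated assumptions: `LazyTable` and `fake_row_count` are hypothetical, and the "statistics" are a placeholder row count rather than anything DataFusion computes.

```rust
use std::collections::HashMap;

/// Placeholder for reading statistics from a file (hypothetical: uses the path length).
fn fake_row_count(path: &str) -> u64 {
    path.len() as u64
}

/// Hypothetical table handle that computes file statistics lazily.
struct LazyTable {
    files: Vec<String>,
    /// path -> row count; stays empty until the table is scanned.
    stats_cache: HashMap<String, u64>,
}

impl LazyTable {
    /// Creating the table only records the file list; no statistics are computed.
    fn create(files: &[&str]) -> Self {
        LazyTable {
            files: files.iter().map(|f| f.to_string()).collect(),
            stats_cache: HashMap::new(),
        }
    }

    /// Scanning fills the cache for every file that has not been seen yet.
    fn scan(&mut self) -> u64 {
        let mut total = 0;
        for f in &self.files {
            let rows = *self
                .stats_cache
                .entry(f.clone())
                .or_insert_with(|| fake_row_count(f));
            total += rows;
        }
        total
    }
}

fn main() {
    let mut table = LazyTable::create(&["a.parquet", "b.parquet"]);
    println!("entries after create: {}", table.stats_cache.len());
    table.scan();
    println!("entries after first scan: {}", table.stats_cache.len());
}
```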

Contributor

@alchemist51 Dec 3, 2025

Do we want to provide a pre-warming option for it, similar to metadataCache?

Contributor Author

I think so, it's being implemented here: #18971.

num_rows: stats.num_rows,
num_columns: stats.column_statistics.len(),
table_size_bytes: stats.total_byte_size,
statistics_size_bytes: 0, // TODO: set to the real size in the future
Contributor Author

In the future we need to set this to the real size.

@nuno-faria
Contributor Author

Also, datafusion-cli never collects file statistics, even after querying a table, because no cache is passed to the runtime. This is easy to fix by updating the runtime creation in datafusion-cli/main_inner.rs.

CacheAccessor<Path, Arc<Statistics>, Extra = ObjectMeta>
{
/// Retrieves the information about the entries currently cached.
fn list_entries(&self) -> HashMap<Path, FileStatisticsCacheEntry>;
Contributor

Do we want to make this thread safe? When we bring in a size limit, we might want to evict the cache object as well.

Contributor Author

Not sure I understand. This should be thread safe as is, right?

Contributor

Sorry, I confused this with an inner object created as a HashMap. This makes sense here.
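
To make the point of this thread concrete: a `list_entries(&self)` that returns an owned `HashMap` can stay thread safe as long as the cache's internal storage is synchronized, because the returned map is only a snapshot. The sketch below assumes a hypothetical `SharedCache` guarded by a standard-library `RwLock`; the actual DataFusion cache may use a different concurrent map.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical cache whose internal map is protected by a read-write lock.
#[derive(Default)]
struct SharedCache {
    inner: RwLock<HashMap<String, u64>>,
}

impl SharedCache {
    /// Inserts an entry; takes `&self` because the lock provides interior mutability.
    fn put(&self, path: String, num_rows: u64) {
        self.inner.write().unwrap().insert(path, num_rows);
    }

    /// Returns an owned copy, so later writes or evictions cannot affect the caller.
    fn list_entries(&self) -> HashMap<String, u64> {
        self.inner.read().unwrap().clone()
    }
}

/// Spawns `writers` threads that insert and list entries concurrently.
fn fill_concurrently(writers: usize) -> HashMap<String, u64> {
    let cache = Arc::new(SharedCache::default());
    let handles: Vec<_> = (0..writers)
        .map(|i| {
            let cache = Arc::clone(&cache);
            thread::spawn(move || {
                cache.put(format!("file_{i}.parquet"), i as u64);
                // Taking a snapshot while other threads may be writing is safe.
                let _snapshot = cache.list_entries();
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    cache.list_entries()
}

fn main() {
    let snapshot = fill_concurrently(4);
    println!("{} entries cached", snapshot.len());
}
```

The snapshot is a plain `HashMap`, which is not itself shared between threads; only the storage behind the lock is.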

@nuno-faria
Contributor Author

There is a problem with the spell checking in the output of the function example, but I don't know how to deal with it, since the output should be treated verbatim. cc: @alamb

Development

Successfully merging this pull request may close these issues.

Add a way to show the contents of the FileStatisticsCache in datafusion-cli