Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only force ALTREP compact row names (i.e. from duckplyr) if requested #745

Merged
merged 10 commits into from
Apr 1, 2025

Conversation

DavisVaughan
Copy link
Contributor

@DavisVaughan DavisVaughan commented Mar 13, 2025

Addresses posit-dev/positron#4158 at the request of @hadley (but not the part about errors during materialization crashing the kernel, that is a much deeper and harder to fix problem that requires us to adjust our assumptions about what a simple INTEGER_ELT() call can do - i.e. in an ALTREP world that can run arbitrary code that can error 😢)

Joint PR with tidyverse/duckplyr#661

The problem is that duckplyr uses ALTREP compact row names as the R_RowNamesSymbol attribute. Calling INTEGER_ELT() or INTEGER() on this to get the number of rows will trigger the entire duckdb query to run, something we want to avoid. This is hard though, because that's how we determine the number of rows in the data frame. We've determined that rather than trying to carefully avoid ALTREP compact row names every time we look at the number of rows in a data frame, a better solution is to use @dfalbel's hooks for allowing S3 classes to provide custom Variables Pane implementations - avoiding these troublesome code paths entirely.

Most of the work that was required for this is actually in tidyverse/duckplyr#661. However, in variable_length() here on the Ark side we were calling table_info() which queried the number of rows (bad) and number of columns even though we only needed the number of columns. I've reworked some things to expose ways to compute just the number of columns or just the number of rows for a data frame. With that in place, tidyverse/duckplyr#661 does the rest of the work.

This won't trigger the duckdb query:

  • Creating a duckdb tibble and having it show up in the Variables pane

This will trigger the duckdb query, and we are ok with this, and this is similar to RStudio:

  • "Expanding" the tibble in the Variables pane to look at the column values
  • Viewing the tibble in the Data Explorer

Here are some videos of me toying around with this:

Screen.Recording.2025-03-13.at.5.58.43.PM.mov
Screen.Recording.2025-03-13.at.5.59.48.PM.mov

See also, RStudio's rs_dim() here:
https://github.com/rstudio/rstudio/blob/7a9ab7afe9ae006e60897596a189a904a716ec4f/src/cpp/session/modules/environment/SessionEnvironment.cpp#L536
https://github.com/rstudio/rstudio/blob/7a9ab7afe9ae006e60897596a189a904a716ec4f/src/cpp/session/modules/SessionEnvironment.R#L560

@DavisVaughan DavisVaughan requested review from lionel- and dfalbel March 13, 2025 22:18
@dfalbel
Copy link
Contributor

dfalbel commented Mar 13, 2025

I didn't look at the code still. I wonder if it would make sense to implement duckdb specifics using the variables pane extension mechanism. For instance by implementing custom ark_positron_variable_display_value() for prudent_duckplyr_df / duckplyr_df. This could potentially even live in duckplyr's codebase.

/// Compute the number of columns in a data frame
///
/// This is easy, it's the length of the VECSXP
pub fn df_n_col(x: SEXP) -> crate::Result<i32> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have all of the type checking in the DataFrame class instead? I feel like we're creating lots of different ways of accessing information. If possible, it would be nice if there was only one way of validating a dataframe and accessing its fields by instantiating DataFrame.

The rows could be an option and there could be a method to materialise the rows.

@krlmlr
Copy link

krlmlr commented Mar 14, 2025

Thanks for looking into it.

Calling head() on the duck frame (the same way pillar does) won't materialize and will be faster in many cases.

Why is it hard to handle ALTREP failures? Does this change mean that the kernel won't crash for a stingy duck frame unless the user clicks a button? Could this be worked around with calling head() ?

@hadley
Copy link

hadley commented Mar 24, 2025

@krlmlr why does an ALTREP access generate an error? That seems weird to be since you'd normally expect (e.g.) INTEGER(x) to always work.

@krlmlr
Copy link

krlmlr commented Mar 24, 2025

This may happen for any number of reasons, including out-of-memory conditions. The "prudence" feature tries to avoid these OOM conditions and makes the errors more visible.

Can you think of an alternative?

@DavisVaughan DavisVaughan merged commit 78fe9aa into main Apr 1, 2025
6 checks passed
@DavisVaughan DavisVaughan deleted the feature/duckplyr branch April 1, 2025 13:59
@github-actions github-actions bot locked and limited conversation to collaborators Apr 1, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants