SNOW-1918055: Update agg error for unsupported aggregation functions #3133

Open
wants to merge 6 commits into
base: main

sfc-gh-lmukhopadhyay
Contributor

@sfc-gh-lmukhopadhyay sfc-gh-lmukhopadhyay commented Mar 7, 2025

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1918055

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
  3. Please describe how your code solves the related issue.

    Updates the error raised by groupby.agg and agg for unsupported aggregation functions to match pandas. It will now raise an AttributeError such as:
    'SeriesGroupBy' object has no attribute 'COUNT'

@sfc-gh-lmukhopadhyay sfc-gh-lmukhopadhyay added NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs and removed snowpark-pandas labels Mar 7, 2025
Signed-off-by: Labanya Mukhopadhyay <[email protected]>
@sfc-gh-lmukhopadhyay sfc-gh-lmukhopadhyay marked this pull request as ready for review March 7, 2025 23:55
@sfc-gh-lmukhopadhyay sfc-gh-lmukhopadhyay requested a review from a team as a code owner March 7, 2025 23:56
@sfc-gh-lmukhopadhyay sfc-gh-lmukhopadhyay changed the title SNOW-1918055: Update groupby.agg error for unsupported aggregation functions SNOW-1918055: Update agg error for unsupported aggregation functions Mar 7, 2025
Comment on lines +876 to +880
bool
True if all functions in the list are snowflake supported aggregation functions, otherwise,
return False
list
The list of unsupported functions used for aggregation.
Contributor

nit: usually I see docstrings for functions like this explicitly mention tuple[bool, list] as the return type, and describe what each member of the tuple means rather than separating out the values.
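
A minimal sketch of the docstring style suggested above, assuming the function returns a pair. The function name `check_agg_funcs_supported` and the `SUPPORTED_AGG_NAMES` set are hypothetical stand-ins for illustration, not the PR's actual code:

```python
# Hypothetical stand-in for the aggregations Snowflake can run natively.
SUPPORTED_AGG_NAMES = {"count", "max", "min", "sum"}


def check_agg_funcs_supported(agg_funcs: list[str]) -> tuple[bool, list[str]]:
    """
    Check whether every aggregation function in a list is supported.

    Returns
    -------
    tuple[bool, list[str]]
        A pair ``(is_supported, unsupported_funcs)``. ``is_supported`` is True
        only if every function is supported. ``unsupported_funcs`` holds the
        names of the functions that are not supported, and is empty when
        ``is_supported`` is True.
    """
    unsupported = [func for func in agg_funcs if func not in SUPPORTED_AGG_NAMES]
    return not unsupported, unsupported
```

Documenting the tuple as one return value keeps the annotation and the docstring in agreement, and lets the meaning of each member be described in one place.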

for value in agg_func.values()
)
if not is_supported_func:
supported_flag = False
Contributor

Why not just return here? Is your intent to combine the unsupported_arguments lists?

Contributor Author

The first function that's unsupported will be returned. The case with multiple unsupported functions needs to be handled which will require returning in this loop, so I'll make those changes as well as using repr_aggregate_function(agg_func, agg_kwargs)!
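
One way the multiple-unsupported-functions case for a dict argument could be handled is to keep iterating and collect every unsupported name. This is only a sketch: `SUPPORTED` and `find_unsupported_in_dict` are hypothetical names, and a simple set lookup stands in for the real per-function check.

```python
# Hypothetical stand-in for the real per-function support check.
SUPPORTED = {"count", "max", "min", "sum"}


def find_unsupported_in_dict(agg_func: dict) -> tuple[bool, list[str]]:
    """Collect every unsupported function named in a dict-style agg argument."""
    unsupported: list[str] = []
    for value in agg_func.values():
        # A dict value may be one function or a list of functions.
        funcs = value if isinstance(value, list) else [value]
        unsupported.extend(str(f) for f in funcs if f not in SUPPORTED)
    # The flag is derived from the list, so the two can never disagree.
    return not unsupported, unsupported
```

For example, `find_unsupported_in_dict({"a": "sum", "b": ["COUNT", "max"]})` reports `"COUNT"` without stopping at the first column.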

"""
# validate agg_func, only snowflake builtin agg function or dict of snowflake builtin agg
# function can be implemented in distributed way.
unsupported_arguments: list[str] = []
supported_flag = True
Contributor

I prefer the name "is_supported" so there's 0 ambiguity about the meaning of the flag's T/F value.

) = check_is_aggregation_supported_in_snowflake(agg_func, agg_kwargs, axis)
if not is_supported:
raise AttributeError(
f"'SeriesGroupBy' object has no attribute '{unsupported_arguments}'"
Contributor

what happens if this is an aggregation that native pandas supports but we do not?

Contributor Author

This should return False, since we check whether there is a corresponding Snowflake aggregation function via get_snowflake_agg_func(). The overall logic for checking whether a function is supported should not change here.
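
A side note on the `raise` statement in this thread: if `unsupported_arguments` is a list, interpolating it directly into the f-string embeds the list's repr in the message rather than the bare function name. The sketch below illustrates this. The helper `format_missing_attribute` is hypothetical, and using the first unsupported name is an assumption about the desired behavior.

```python
unsupported_arguments = ["COUNT"]

# Interpolating the list itself puts brackets and quotes into the message.
message = f"'SeriesGroupBy' object has no attribute '{unsupported_arguments}'"


def format_missing_attribute(type_name: str, unsupported: list[str]) -> str:
    """Hypothetical helper: build the message from the first unsupported name."""
    return f"'{type_name}' object has no attribute '{unsupported[0]}'"


fixed_message = format_missing_attribute("SeriesGroupBy", unsupported_arguments)
```

Here `message` ends in `'['COUNT']'`, while `fixed_message` ends in `'COUNT'`, which is the form quoted in the PR description.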

basic_snowpark_pandas_df = pd.DataFrame(
data=8 * [range(3)], columns=["a", "b", "c"]
)
# basic_snowpark_pandas_df = basic_snowpark_pandas_df.groupby(['a', 'b']).sum()
Contributor

did you mean to delete this?

@@ -54,12 +54,13 @@
- Fixed a bug where creating a Dataframe with large number of values raised `Unsupported feature 'SCOPED_TEMPORARY'.` error if thread-safe session was disabled.
- Fixed a bug where `df.describe` raised internal SQL execution error when the dataframe is created from reading a stage file and CTE optimization is enabled.
- Fixed a bug where `df.order_by(A).select(B).distinct()` would generate invalid SQL when simplified query generation was enabled using `session.conf.set("use_simplified_query_generation", True)`.
- Disabled simplified query generation by default.
Contributor

Good catch!

@@ -851,40 +851,50 @@ def _is_supported_snowflake_agg_func(
The value can be different for different aggregation functions.
Returns:
is_valid: bool. Whether it is valid to implement with snowflake or not.
Contributor

I suspect this comment is the result of copy-pasting: "Whether it is valid to implement with snowflake or not." is not consistent with the semantics of this function, which checks whether the aggregation function is supported in Snowflake.

"""
if isinstance(agg_func, tuple) and len(agg_func) == 2:
# For named aggregations, like `df.agg(new_col=("old_col", "sum"))`,
# take the second part of the named aggregation.
agg_func = agg_func[0]
return get_snowflake_agg_func(agg_func, agg_kwargs, axis, _is_df_agg) is not None
if get_snowflake_agg_func(agg_func, agg_kwargs, axis, _is_df_agg) is None:
return False, agg_func
Contributor

is agg_func guaranteed to be a list?
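
The question matters because a caller that treats the second return value as a list can misbehave when it is a plain string. A minimal sketch, where `describe_unsupported` is a hypothetical helper that always returns a one-element list of names:

```python
from typing import Any


def describe_unsupported(agg_func: Any) -> list[str]:
    """Hypothetical helper: always return a one-element list naming agg_func."""
    if isinstance(agg_func, str):
        return [agg_func]
    if callable(agg_func):
        # Fall back to repr for callables without a __name__ (e.g. partials).
        return [getattr(agg_func, "__name__", repr(agg_func))]
    return [repr(agg_func)]


# Pitfall: extending a list with a bare string adds one entry per character.
broken: list[str] = []
broken.extend("COUNT")

fixed: list[str] = []
fixed.extend(describe_unsupported("COUNT"))
```

After this runs, `broken` holds five single-character strings while `fixed` holds the single name `"COUNT"`.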

Comment on lines +883 to +887
unsupported_list: list[str] = []
for func in agg_funcs:
is_supported, unsupported_list = _is_supported_snowflake_agg_func(
func, agg_kwargs, axis, _is_df_agg
)
Contributor

@sfc-gh-jjiao sfc-gh-jjiao Mar 8, 2025

Does unsupported_list need to be appended to in the for loop? It seems to be overwritten on every iteration here.
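
The concern can be reproduced with a small sketch. It assumes nothing after the call inside the loop preserves earlier results, and it uses a hypothetical `_is_supported` in place of `_is_supported_snowflake_agg_func`:

```python
SUPPORTED = {"count", "max", "min", "sum"}  # hypothetical stand-in


def _is_supported(func: str) -> tuple[bool, list[str]]:
    """Hypothetical per-function check: (is_supported, unsupported_names)."""
    return (True, []) if func in SUPPORTED else (False, [func])


def check_overwriting(agg_funcs: list[str]) -> tuple[bool, list[str]]:
    """Mirrors the loop under review: each iteration replaces both results."""
    is_supported, unsupported_list = True, []
    for func in agg_funcs:
        is_supported, unsupported_list = _is_supported(func)
    return is_supported, unsupported_list


def check_accumulating(agg_funcs: list[str]) -> tuple[bool, list[str]]:
    """Accumulates unsupported names across all iterations."""
    unsupported_list: list[str] = []
    for func in agg_funcs:
        _, unsupported = _is_supported(func)
        unsupported_list.extend(unsupported)
    return not unsupported_list, unsupported_list
```

With the input `["COUNT", "sum"]`, the overwriting version reports everything as supported because only the last function's result survives, while the accumulating version reports `"COUNT"`.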

Contributor

@sfc-gh-jjiao sfc-gh-jjiao left a comment

Could we please add a unit test for the function _are_all_agg_funcs_supported_by_snowflake? We could use some functions that are certain not to be supported. I suspect the current code change has a bug. Thanks

Labels
NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs snowpark-pandas
3 participants