SNOW-1984396: Snowpark Local Testing minus filters out rows that match values across multiple rows in the subtracted set #3163

Open
matt-comity opened this issue Mar 14, 2025 · 1 comment
Labels: bug (Something isn't working), status-triage_done (Initial triage done, will be further handled by the driver team)

Comments

@matt-comity

  1. What version of Python are you using?

Python 3.11.9 (main, Jun 24 2024, 14:49:51) [Clang 15.0.0 (clang-1500.3.9.4)]

  2. What operating system and processor architecture are you using?

macOS-15.3.1-arm64-arm-64bit

  3. What are the component versions in the environment (pip freeze)?

    There are a lot, but relevant for the test example:

pandas==2.1.4
snowflake-snowpark-python==1.27.0

  4. What did you do?

I was using subtract / minus / except_ and writing unit tests for my Snowpark code when I ran into some odd behavior. The toy example below illustrates the problem.

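from snowflake.snowpark import Session

# Local testing mode: runs against the local emulator instead of a Snowflake connection
session = Session.builder.config("local_testing", True).create()
df1 = session.create_dataframe([[1, 2], [3, 4]])
df2 = session.create_dataframe([[1, 1], [2, 2]])
# Neither row of df1 appears in df2, so both rows should survive the subtraction
df1.subtract(df2).show()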

  5. What did you expect to see?

Expected:

---------------
|"_1"  |"_2"  |
---------------
|1     |2     |
|3     |4     |
---------------

Got:

---------------
|"_1"  |"_2"  |
---------------
|3     |4     |
---------------

As you can see, the row [1, 2] is filtered out even though it does not exist in the dataframe being subtracted. This happens because both 1 and 2 appear as values somewhere in df2, just not in the same row. The bug is on this line of code: it checks whether every value of a row in df1 shows up somewhere in df2, not whether the whole row matches a single row of df2. The cause is that all of the df2 values are flattened together via cur_df.values.ravel(), so the row boundaries are lost.
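
To make the difference concrete, here is a minimal pure-Python sketch of the two behaviors. It is not the Snowpark implementation: the function names `subtract_flattened` and `subtract_rowwise` are hypothetical, plain tuples stand in for dataframe rows, and a set of flattened values stands in for the result of cur_df.values.ravel(). The row-wise version also removes duplicate rows, on the assumption that MINUS has set semantics.

```python
df1_rows = [(1, 2), (3, 4)]
df2_rows = [(1, 1), (2, 2)]


def subtract_flattened(left, right):
    """Mimic the buggy check: compare against a flat pool of values."""
    # Flattening discards which row each value came from: {1, 2}
    flat_values = {value for row in right for value in row}
    # A left row is dropped when every one of its values is in the pool
    return [row for row in left if not all(v in flat_values for v in row)]


def subtract_rowwise(left, right):
    """Compare whole rows, which is what a set difference needs."""
    right_rows = set(right)
    seen = set()
    result = []
    for row in left:
        # Keep a row only if no identical row exists on the right;
        # `seen` removes duplicates (assumed set semantics for MINUS)
        if row not in right_rows and row not in seen:
            seen.add(row)
            result.append(row)
    return result


buggy = subtract_flattened(df1_rows, df2_rows)
correct = subtract_rowwise(df1_rows, df2_rows)
print(buggy)
print(correct)
```

With the data from the toy example, the flattened check drops (1, 2) because 1 comes from the first row of df2 and 2 from the second, while the row-wise check keeps both rows.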

In Snowflake itself, an equivalent query does what you'd expect:

select * from (select * from values (1, 2), (3, 4)) minus (select * from values (1, 1), (2, 2));
COLUMN1 | COLUMN2
--      | --
3       | 4
1       | 2
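
Note that Snowflake returns the two rows in a different order than the expected table above, since a set operation without ORDER BY gives no ordering guarantee. A unit test for this case may therefore want to compare rows without regard to order. The helper below is a hypothetical sketch (`same_rows` is not part of Snowpark) and assumes the rows have already been converted to tuples.

```python
from collections import Counter


def same_rows(actual, expected):
    """Return True when both collections hold the same rows, ignoring order."""
    # Counter compares as a multiset, so duplicate rows are still counted
    return Counter(map(tuple, actual)) == Counter(map(tuple, expected))


expected_rows = [(1, 2), (3, 4)]
snowflake_rows = [(3, 4), (1, 2)]   # order shown in the Snowflake output
local_testing_rows = [(3, 4)]       # result reported in this issue

print(same_rows(snowflake_rows, expected_rows))
print(same_rows(local_testing_rows, expected_rows))
```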

  6. Can you set logging to DEBUG and collect the logs?

N/A

@matt-comity matt-comity added bug Something isn't working needs triage Initial RCA is required labels Mar 14, 2025
@github-actions github-actions bot changed the title Snowpark Local Testing minus removing rows that match values across multiple removed rows SNOW-1984396: Snowpark Local Testing minus removing rows that match values across multiple removed rows Mar 14, 2025
@matt-comity matt-comity changed the title SNOW-1984396: Snowpark Local Testing minus removing rows that match values across multiple removed rows SNOW-1984396: Snowpark Local Testing minus filters out rows that match values across multiple rows in the subtracted set Mar 15, 2025
@sfc-gh-sghosh sfc-gh-sghosh self-assigned this Mar 18, 2025
@sfc-gh-sghosh sfc-gh-sghosh added status-triage Issue is under initial triage and removed needs triage Initial RCA is required labels Mar 18, 2025
@sfc-gh-sghosh

Hello @matt-comity,

Thanks for raising the issue.
We are able to reproduce it with a local testing session.
The issue is being fixed via PR #3167.

from snowflake.snowpark import Session

session = Session.builder.config("local_testing", True).create()
df1 = session.create_dataframe([[1, 2], [3, 4]])
df2 = session.create_dataframe([[1, 1], [2, 2]])
# subtract, minus, and except_ all drop the row [1, 2] in local testing
df1.subtract(df2).show()
df1.minus(df2).show()
df1.except_(df2).show()
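
---------------
|"_1"  |"_2"  |
---------------
|3     |4     |
---------------

---------------
|"_1"  |"_2"  |
---------------
|3     |4     |
---------------

---------------
|"_1"  |"_2"  |
---------------
|3     |4     |
---------------
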
@sfc-gh-sghosh sfc-gh-sghosh added status-triage_done Initial triage done, will be further handled by the driver team and removed status-triage Issue is under initial triage labels Mar 18, 2025