
Add a functionality in apply_in_pandas to support spark api #3162

Open · wants to merge 5 commits into base: main
Conversation

@sfc-gh-dyadav commented Mar 14, 2025

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1800723

If the DataFrame comes from Spark, its underlying column names are not the same as the original Spark column names, yet the user's function assumes the Spark names and operates on them; this change resolves that mismatch. It also adds support for functions with the (key, dataframe) signature, which Spark's applyInPandas accepts, as sketched below.
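As a hedged sketch (none of these names are from this PR's diff), the Spark-style dispatch could look like this: pass the group-key tuple only when the user's function takes two parameters, mirroring pyspark's applyInPandas contract.

```python
import inspect

import pandas as pd


def _call_user_func(func, keys: tuple, pdf: pd.DataFrame):
    # Hypothetical helper: Spark's applyInPandas accepts either f(pdf) or
    # f(key, pdf). Inspect the arity to decide whether to pass the group key.
    if len(inspect.signature(func).parameters) == 2:
        return func(keys, pdf)
    return func(pdf)
```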

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines

The test for this will be added in https://github.com/snowflakedb/sas/pull/725/files; that repository is a fork introduced for the non-public use case of the Snowpark library.

  3. Please describe how your code solves the related issue.

    I am extracting the Spark column names from the column_map, which will only be present if the DataFrame is being sent from the accelerated Spark layer.
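A minimal sketch of that renaming step, assuming column_map is a dict from Snowpark column names to the original Spark names (the helper name and the map's shape are assumptions, not the PR's actual code):

```python
import pandas as pd


def _restore_spark_names(pdf: pd.DataFrame, column_map: dict) -> pd.DataFrame:
    # column_map is only populated when the DataFrame came from the
    # accelerated Spark layer; otherwise leave the columns untouched.
    return pdf.rename(columns=column_map) if column_map else pdf
```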

@sfc-gh-dyadav added the NO-CHANGELOG-UPDATES label (this pull request does not need to update CHANGELOG.md) on Mar 14, 2025
@sfc-gh-snowflakedb-snyk-sa commented Mar 15, 2025

🎉 Snyk checks have passed. No issues have been found so far.

security/snyk check is complete. No issues have been found.

license/snyk check is complete. No issues have been found.

Comment on lines +429 to +438
```python
if key_columns is not None:
    import numpy as np

    key_list = [pdf[key].iloc[0] for key in key_columns]
    numpy_array = np.array(key_list)
    keys = tuple(numpy_array)
if original_columns is not None:
    pdf.columns = original_columns
if key_columns is not None:
    return func(keys, pdf)
```

Suggested change

```diff
-if key_columns is not None:
-    import numpy as np
-
-    key_list = [pdf[key].iloc[0] for key in key_columns]
-    numpy_array = np.array(key_list)
-    keys = tuple(numpy_array)
-if original_columns is not None:
-    pdf.columns = original_columns
-if key_columns is not None:
-    return func(keys, pdf)
+if original_columns is not None:
+    pdf.columns = original_columns
+if key_columns is not None:
+    import numpy as np
+
+    key_list = [pdf[key].iloc[0] for key in key_columns]
+    numpy_array = np.array(key_list)
+    keys = tuple(numpy_array)
+    return func(keys, pdf)
```

nit: can we restructure it this way, grouping the if statements together?
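For readers unfamiliar with the Spark-style signature this PR enables, a hedged usage sketch (the grouping column, column names, and data are illustrative, not taken from the PR):

```python
import pandas as pd


def subtract_mean(key: tuple, pdf: pd.DataFrame) -> pd.DataFrame:
    # key holds this group's grouping-column values as a tuple,
    # mirroring pyspark's applyInPandas contract.
    pdf["V"] = pdf["V"] - pdf["V"].mean()
    return pdf

# With this change, a (key, pdf) function like the one above can be passed
# to group_by(...).apply_in_pandas(...) just as in Spark.
```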
