Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1882075,SNOW-1990301,SNOW-1990302,SNOW-1821290: AST finalizations for stateless batches, and DataframeWriter,DataframeReader new APIs #3180

Conversation

sfc-gh-heshah
Copy link
Collaborator

@sfc-gh-heshah sfc-gh-heshah commented Mar 20, 2025

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    This PR includes all changes required to complete the following tickets.

    It also includes changes in support of SNOW-1997136: Ensure AST batch is stateless.

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
    • If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support.
  3. Please describe how your code solves the related issue.

    AST generation for DataframeReader APIs which return the same DataframeReader instance have been updated to follow the same pattern as DataframeWriter. Since all such methods are setters, it is only necessary to have the last set attributes in the AST sent to the serveer. This is in support of the new format API in both Snowpark classes.

    AST generation was added for the new APIs DataframeReader.{format,load}, and DataframeWriter.{format,read,insert_into} along with expectation test updates.

    Changes were also made in support of AST batch statelessness in the IR, by changing fields typed VarId to be Expr typed. This allows more consistent usage of _set_ast_ref as a helper method across all Snowpark APIs.

    This PR is accompanied by a server-side PR.

Sorry, something went wrong.

@sfc-gh-snowflakedb-snyk-sa
Copy link

sfc-gh-snowflakedb-snyk-sa commented Mar 20, 2025

🎉 Snyk checks have passed. No issues have been found so far.

security/snyk check is complete. No issues have been found. (View Details)

license/snyk check is complete. No issues have been found. (View Details)

@github-actions github-actions bot added the local testing Local Testing issues/PRs label Mar 20, 2025
@sfc-gh-heshah sfc-gh-heshah added NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs snowpark-ast Change materially affects the Snowpark AST snowpark-pandas labels Mar 20, 2025 — with Graphite App
@sfc-gh-heshah sfc-gh-heshah added the snowpark-server Server-side snowpark work label Mar 20, 2025 — with Graphite App
@sfc-gh-heshah sfc-gh-heshah force-pushed the heshah-SNOW-1882075_1990301_1990302_1821290-AST-finalizations-for-stateless-batches-and-DataframeWriter-DataframeReader-new-APIs branch from c4dfaea to dfc31e6 Compare March 20, 2025 09:50
@sfc-gh-heshah sfc-gh-heshah marked this pull request as ready for review March 20, 2025 16:49
@sfc-gh-heshah sfc-gh-heshah requested review from a team as code owners March 20, 2025 16:49
@sfc-gh-heshah sfc-gh-heshah force-pushed the heshah-SNOW-1882075_1990301_1990302_1821290-AST-finalizations-for-stateless-batches-and-DataframeWriter-DataframeReader-new-APIs branch from dfc31e6 to 64f26b6 Compare March 21, 2025 18:35
@sfc-gh-heshah sfc-gh-heshah force-pushed the heshah-SNOW-1882075_1990301_1990302_1821290-AST-finalizations-for-stateless-batches-and-DataframeWriter-DataframeReader-new-APIs branch from 64f26b6 to 38c3df8 Compare March 22, 2025 18:25
Copy link
Contributor

@sfc-gh-vbudati sfc-gh-vbudati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Hemit!

@sfc-gh-heshah sfc-gh-heshah force-pushed the heshah-SNOW-1882075_1990301_1990302_1821290-AST-finalizations-for-stateless-batches-and-DataframeWriter-DataframeReader-new-APIs branch 2 times, most recently from 8fa300c to 5a59f68 Compare March 27, 2025 04:29
@@ -250,6 +250,7 @@ class Column:
# For example, running: df.filter(col("A").isin(1, 2, 3) & col("B")) would fail since the boolean operator
# '&' would try to construct an AST using that of the new col("A").isin(1, 2, 3) column (which we currently
# don't fill if the only argument provided in the Column constructor is 'expr1' of type Expression)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recall there used to be some issue adding this to the constructor, do we have tests covering on/off for using Column objects (i.e. calling this API, and also col)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify, do you mean on/off in terms of our global AST enabled flag?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh like with AST flag enabled/disabled.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes we do have tests in both cases then (public calls made by tests, and internal calls to the constructor) if I'm understanding correctly.

Copy link
Contributor

@sfc-gh-lspiegelberg sfc-gh-lspiegelberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No questions re the minor changes, but confused about _flush_ast. What's the role of it? Why do we need it? Could we get away with not adding yet another parameter?

Copy link
Collaborator Author

save_as_table and copy_into_location (to which the _flush_ast parameter was added) are both public APIs with eager evaluation and internal uses. However, most other DataframeWriter APIs rely on them to perform the eager evaluation (in terms of Snowpark execution), and to flush the AST.

Only using _emit_ast meant that internal callers can't ask save_as_table and copy_into_location to both suppress their AST statement generation and flush the AST batch, because the latter was still guarded by our if _emit_ast check.

Using the combination means that internal callers can set _emit_ast to False to suppress the AST generation from these APIs, while also setting _flush_ast to true to indicate that the eager request sent through the underlying connection should still include the AST batch.

It's a unique case: public APIs, eager (in terms of actually performing the execution within the function itself rather than in a helper), and used internally by other public APIs. I don't think combination exists anywhere else in the Snowpark codebase unless I'm mistaken.

Elsewhere we seem to pass an _ast argument containing a pre-built proto.Expr itself to be inherited by the callee, so we do use another parameter. In this case I think _flush_ast makes more sense to indicate what's actually happening, however, we can follow the common pattern and just use an _ast parameter with the built proto.Expr() for save_as_table and copy_into_table to add to the Request before flushing.

@sfc-gh-lspiegelberg
Copy link
Contributor

save_as_table and copy_into_location (to which the _flush_ast parameter was added) are both public APIs with eager evaluation and internal uses. However, most other DataframeWriter APIs rely on them to perform the eager evaluation (in terms of Snowpark execution), and to flush the AST.

Only using _emit_ast meant that internal callers can't ask save_as_table and copy_into_location to both suppress their AST statement generation and flush the AST batch, because the latter was still guarded by our if _emit_ast check.

Using the combination means that internal callers can set _emit_ast to False to suppress the AST generation from these APIs, while also setting _flush_ast to true to indicate that the eager request sent through the underlying connection should still include the AST batch.

It's a unique case: public APIs, eager (in terms of actually performing the execution within the function itself rather than in a helper), and used internally by other public APIs. I don't think combination exists anywhere else in the Snowpark codebase unless I'm mistaken.

Elsewhere we seem to pass an _ast argument containing a pre-built proto.Expr itself to be inherited by the callee, so we do use another parameter. In this case I think _flush_ast makes more sense to indicate what's actually happening, however, we can follow the common pattern and just use an _ast parameter with the built proto.Expr() for save_as_table and copy_into_table to add to the Request before flushing.

Hm, internal use of save_as_table should not generate AST/not flush it. My mind model here is that all of this moves to the server-side, i.e. what is right now internal will simply execute server-side.

@sfc-gh-heshah sfc-gh-heshah force-pushed the heshah-SNOW-1882075_1990301_1990302_1821290-AST-finalizations-for-stateless-batches-and-DataframeWriter-DataframeReader-new-APIs branch 5 times, most recently from 52ff093 to 6ff3690 Compare March 28, 2025 06:18
…s for stateless batches, and DataframeWriter,DataframeReader new APIs
@sfc-gh-heshah sfc-gh-heshah force-pushed the heshah-SNOW-1882075_1990301_1990302_1821290-AST-finalizations-for-stateless-batches-and-DataframeWriter-DataframeReader-new-APIs branch from 6ff3690 to 82a5026 Compare March 28, 2025 07:33
@sfc-gh-heshah sfc-gh-heshah merged commit 9500f7f into main Mar 28, 2025
41 checks passed
@sfc-gh-heshah sfc-gh-heshah deleted the heshah-SNOW-1882075_1990301_1990302_1821290-AST-finalizations-for-stateless-batches-and-DataframeWriter-DataframeReader-new-APIs branch March 28, 2025 19:10
@github-actions github-actions bot locked and limited conversation to collaborators Mar 28, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
local testing Local Testing issues/PRs NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs snowpark-ast Change materially affects the Snowpark AST snowpark-server Server-side snowpark work
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants