
[SPARK-51065][SQL] Disallowing non-nullable schema when Avro encoding is used for TransformWithState #49751

Open · wants to merge 19 commits into base: master

Conversation

ericm-db
Contributor

@ericm-db ericm-db commented Jan 31, 2025

What changes were proposed in this pull request?

Right now, we effectively set all fields in a schema to nullable, regardless of what the user specifies.

  • However, when Avro encoding is used, we want to enforce nullability in order to enable the schema evolution cases we support.
  • Nullability can only be set by the user in Python, so when non-nullable fields are defined there, we throw an error.
  • In Scala, Encoders.product sets primitive fields to non-nullable by default (the user cannot configure this), so we flip those fields to nullable; see the sketch below.
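
As a rough illustration of the Scala behavior (a minimal sketch; the StateValue case class is hypothetical):

import org.apache.spark.sql.Encoders

// Hypothetical state value type; nullability here is not user-configurable.
case class StateValue(id: Int, name: String)

val schema = Encoders.product[StateValue].schema
schema.fields.foreach(f => println(s"${f.name}: nullable=${f.nullable}"))
// id: nullable=false   -- JVM primitives come out non-nullable
// name: nullable=true  -- reference types come out nullable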

Why are the changes needed?

To keep the actual schema we use consistent with the user-specified schema, and to enable the schema evolution use cases we want to support.

Does this PR introduce any user-facing change?

Yes. This error is thrown if the schema is defined as non-nullable:

Traceback (most recent call last):
  File "/Users/eric.marnadi/spark/python/pyspark/sql/tests/pandas/test_pandas_transform_with_state.py", line 1496, in test_not_nullable_fails
    self._run_evolution_test(
  File "/Users/eric.marnadi/spark/python/pyspark/sql/tests/pandas/test_pandas_transform_with_state.py", line 1344, in _run_evolution_test
    q.processAllAvailable()
  File "/Users/eric.marnadi/spark/python/pyspark/sql/streaming/query.py", line 351, in processAllAvailable
    return self._jsq.processAllAvailable()
  File "/Users/eric.marnadi/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/java_gateway.py", line 1362, in __call__
    return_value = get_return_value(
  File "/Users/eric.marnadi/spark/python/pyspark/errors/exceptions/captured.py", line 258, in deco
    raise converted from None
pyspark.errors.exceptions.captured.StreamingQueryException: [STREAM_FAILED] Query [id = 541c5df0-24e4-4702-b87a-c4edfb6a952c, runId = 4259c7b9-3846-4f73-9204-c3d71b07018c] terminated with exception: [STATE_STORE_SCHEMA_MUST_BE_NULLABLE] If schema evolution is enabled, all the fields in the schema for column family state must be nullable.
Please set the 'spark.sql.streaming.stateStore.encodingFormat' to 'UnsafeRow' or make the schema nullable.
Current schema: StructType(StructField(id,IntegerType,false),StructField(name,StringType,false)) SQLSTATE: XXKST SQLSTATE: XXKST
=== Streaming Query ===
Identifier: evolution_test [id = 541c5df0-24e4-4702-b87a-c4edfb6a952c, runId = 4259c7b9-3846-4f73-9204-c3d71b07018c]
Current Committed Offsets: {}
Current 

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

No

@anishshri-db
Contributor

@ericm-db - can you add the SPARK ticket to the PR title?

@anishshri-db
Contributor

@ericm-db - also, is the test failure related to the change?

@ericm-db ericm-db changed the title Disallowing non-nullable schema when Avro encoding is used for TransformWithState [SPARK-51065] Disallowing non-nullable schema when Avro encoding is used for TransformWithState Feb 3, 2025
@HyukjinKwon HyukjinKwon changed the title [SPARK-51065] Disallowing non-nullable schema when Avro encoding is used for TransformWithState [SPARK-51065][SQL] Disallowing non-nullable schema when Avro encoding is used for TransformWithState Feb 3, 2025
@ericm-db
Contributor Author

ericm-db commented Feb 5, 2025

@HeartSaVioR Can you PTAL when you get a chance?

Contributor

@HeartSaVioR HeartSaVioR left a comment


First pass. I feel like I don't fully understand the full picture of this yet, so I need to get answers to my review comments.

@@ -1470,6 +1470,39 @@ def check_exception(error):
check_exception=check_exception,
)

def test_not_nullable_fails(self):
Contributor

Why not have an identical test in Scala as well? I don't see a new test verifying the error.

Contributor Author

The thing is, there is no way for the user to specify this using Scala.

Contributor

Yes and probably also no.

I agree most users may never try to get around this and will just stick with a case class or POJO. But "we" can imagine a way around it, exactly the same way we support PySpark:

override protected val stateEncoder: ExpressionEncoder[Any] =
    ExpressionEncoder(stateType).resolveAndBind().asInstanceOf[ExpressionEncoder[Any]]

This is how we build the state encoder for the Python version of FMGWS. It does serde through the Row interface - my rough memory says it's Row rather than InternalRow, so it most likely works with GenericRow, but we can try both GenericRow and InternalRow.
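
For illustration, a minimal round-trip through such an encoder might look like this (a sketch against Spark's internal encoder API; stateType is a placeholder schema):

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val stateType = new StructType()
  .add("id", IntegerType, nullable = true)
  .add("name", StringType, nullable = true)

// ExpressionEncoder(schema) yields a Row-based encoder, so a GenericRow works underneath.
val enc = ExpressionEncoder(stateType).resolveAndBind()
val internal = enc.createSerializer()(Row(1, "a")) // Row -> InternalRow
val back = enc.createDeserializer()(internal)      // InternalRow -> Row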

I'm OK with deferring this as a follow-up.

Contributor

Now I see that you actually can't test this, since we have to accept the non-nullable column and change it to nullable. Again, I suspect this is just a bug, though.

@@ -47,7 +47,7 @@ object StateStoreColumnFamilySchemaUtils {
// Byte type is converted to Int in Avro, which doesn't work for us as Avro
// uses zig-zag encoding as opposed to big-endian for Ints
Seq(
StructField(s"${field.name}_marker", BinaryType, nullable = false),
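
To illustrate the zig-zag point in the comment above (an illustrative sketch, not code from this PR): Avro's zig-zag interleaves negative and positive values, so encoded bytes do not preserve numeric order the way a big-endian encoding does.

// Avro's zig-zag mapping for Int: negatives and positives interleave,
// so encoded byte order no longer matches numeric order.
def zigZag(n: Int): Int = (n << 1) ^ (n >> 31)

assert(zigZag(0) == 0)
assert(zigZag(-1) == 1)
assert(zigZag(1) == 2)
assert(zigZag(-2) == 3) // numerically -2 < 1, but it encodes larger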
Contributor

Let's say nullable = true explicitly?

Contributor

@HeartSaVioR HeartSaVioR left a comment


I see the divergence comes from the fact that the Encoder of a case class gives non-nullable columns in its schema definition, and I wonder whether this is really correct behavior.

I'd suggest experimenting with a case class where a field is explicitly given null, and seeing whether that is really safe or whether we end up with an NPE. If it's the latter, it's definitely a bug we need to fix. We can do this as a follow-up, but maybe before the Spark 4.0 release, as it looks a bit odd to me.
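
A minimal sketch of that experiment (the Item case class and session setup are hypothetical):

import org.apache.spark.sql.SparkSession

// Hypothetical case class; the encoder marks id (a JVM primitive) non-nullable.
case class Item(id: Int, name: String)

val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

// A null in the String field round-trips fine, even though the usage is dubious:
Seq(Item(1, null)).toDS().collect()

// Smuggling a null into the non-nullable primitive field is expected to fail at
// deserialization with an error like "Null value appeared in non-nullable field"
// rather than succeeding silently:
// spark.sql("SELECT CAST(null AS INT) AS id, 'a' AS name").as[Item].collect()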

@@ -149,8 +149,11 @@ case class TransformWithStateExec(
0, keyExpressions.toStructType, 0, DUMMY_VALUE_ROW_SCHEMA,
Some(NoPrefixKeyStateEncoderSpec(keySchema)))

// For Scala, the user can't explicitly set nullability on schema, so there is
Contributor

As I mentioned in the other comment, it is not impossible to set nullability on an encoder (although I tend to agree most users won't). Let's not make this conditional.

Also, this concerns me - if we are very confident that users will never be able to set a column to nullable, why do we need to change the schema, when we all know it has to be nullable? What are we worried about if we just do the same with Python?

Contributor

#49751 (comment)
I realized you had to go this way due to the case class encoder. Sorry about that.

@@ -5072,6 +5072,14 @@
],
"sqlState" : "42601"
},
"TRANSFORM_WITH_STATE_SCHEMA_MUST_BE_NULLABLE" : {
"message" : [
"If schema evolution is enabled, all the fields in the schema for column family <columnFamilyName> must be nullable",
Contributor

Which do we think is easier to understand: "using Avro" or "schema evolution is enabled"?

I foresee us moving toward using Avro for all stateful operators (unless there is an outstanding regression), and once we make Avro the default, this will be a confusing message to consume, because users won't have done anything about schema evolution. IMO it is "indirect" information, and they would probably try to figure out how to disable schema evolution instead, without knowing that Avro and schema evolution are coupled.

cc. @anishshri-db to hear his voice.

Contributor

Yeah, I think it's fine to refer to the transformWithState case in terms of Avro being used - we don't need to explicitly call out schema evolution here.

* true when using Python, as this is the only avenue through
* which users can set nullability
* @param shouldSetNullable Whether we need to set the fields as nullable. This is set to
* true when using Scala, as case classes are set to
Contributor

case classes are set to non-nullable by default.

I'm actually surprised and it sounds like a bug to me. (Sorry, you had to handle Python and Scala differently due to this. My bad.)

What if you set null on any of the fields in a case class? Will it work, and if it works, how?

If this is indeed a bug and we can fix it, then we can simplify things a lot. I'm OK if you want to defer this, but we definitely need a follow-up ticket for it.
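
For reference, a minimal sketch of what flipping a schema to nullable amounts to (not the PR's exact code; a real change may also need to recurse into nested structs):

import org.apache.spark.sql.types.StructType

// Copy every top-level field with nullable = true.
def toNullable(schema: StructType): StructType =
  StructType(schema.fields.map(_.copy(nullable = true)))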

@@ -145,6 +145,12 @@ object StateStoreErrors {
new StateStoreValueSchemaNotCompatible(storedValueSchema, newValueSchema)
}

def twsSchemaMustBeNullable(
Contributor

I think TWS deserves its own error collection class, but I agree this is out of scope. Let's make a follow-up.
