[SPARK-51065][SQL] Disallowing non-nullable schema when Avro encoding is used for TransformWithState #49751
base: master
Changes from 19 commits
```diff
@@ -1470,6 +1470,39 @@ def check_exception(error):
             check_exception=check_exception,
         )

+    def test_not_nullable_fails(self):
+        with self.sql_conf({"spark.sql.streaming.stateStore.encodingFormat": "avro"}):
+            with tempfile.TemporaryDirectory() as checkpoint_dir:
+                input_path = tempfile.mkdtemp()
+                self._prepare_test_resource1(input_path)
+
+                df = self._build_test_df(input_path)
+
+                def check_basic_state(batch_df, batch_id):
+                    result = batch_df.collect()[0]
+                    assert result.value["id"] == 0  # First ID from test data
+                    assert result.value["name"] == "name-0"
+
+                def check_exception(error):
+                    from pyspark.errors.exceptions.captured import StreamingQueryException
+
+                    if not isinstance(error, StreamingQueryException):
+                        return False
+
+                    error_msg = str(error)
+                    return (
+                        "[TRANSFORM_WITH_STATE_SCHEMA_MUST_BE_NULLABLE]" in error_msg
+                        and "column family state must be nullable" in error_msg
+                    )
+
+                self._run_evolution_test(
+                    BasicProcessorNotNullable(),
+                    checkpoint_dir,
+                    check_basic_state,
+                    df,
+                    check_exception=check_exception,
+                )
+
+
 class SimpleStatefulProcessorWithInitialState(StatefulProcessor):
     # this dict is the same as input initial state dataframe
```

Review discussion on `test_not_nullable_fails`:

- Why not have an identical test in Scala as well? I don't see a new test verifying the error.
- The thing is, there is no way for the user to specify this using Scala.
- Yes, and probably also no. I agree moderate users may never try to get around it and will just stick with case classes or POJOs. But "we" can imagine a way to get around it, exactly the same way we could support PySpark: this is how we came up with the state encoder for the Python version of FMGWS. It does serde through the Row interface - my rough memory says it's Row rather than InternalRow, so it most likely works with GenericRow, but we can try both GenericRow and InternalRow. I'm OK with deferring this as a follow-up.
- Now I get that you are not actually able to test this, as we have to just accept a non-nullable column and change it to nullable. Again, I doubt this is just a bug though.
- (Not really a bug, we figured out.)
```diff
@@ -1893,6 +1926,27 @@ def close(self) -> None:
         pass


+class BasicProcessorNotNullable(StatefulProcessor):
+    # Schema definitions
+    state_schema = StructType(
+        [StructField("id", IntegerType(), False), StructField("name", StringType(), False)]
+    )
+
+    def init(self, handle):
+        self.state = handle.getValueState("state", self.state_schema)
+
+    def handleInputRows(self, key, rows, timerValues) -> Iterator[pd.DataFrame]:
+        for pdf in rows:
+            pass
+        id_val = int(key[0])
+        name = f"name-{id_val}"
+        self.state.update((id_val, name))
+        yield pd.DataFrame({"id": [key[0]], "value": [{"id": id_val, "name": name}]})
+
+    def close(self) -> None:
+        pass
+
+
 class AddFieldsProcessor(StatefulProcessor):
     state_schema = StructType(
         [
```
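`BasicProcessorNotNullable` declares both value fields with `nullable=False`, which the Avro path now rejects. The rule can be sketched as a small framework-free Python check; the names `Field` and `validate_value_schema` are hypothetical and not Spark APIs:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    dtype: str
    nullable: bool = True


def validate_value_schema(col_family: str, fields: list) -> None:
    # Mirrors the new check: with Avro encoding, every value-schema field
    # must be nullable; otherwise raise the dedicated error.
    for f in fields:
        if not f.nullable:
            raise ValueError(
                "[TRANSFORM_WITH_STATE_SCHEMA_MUST_BE_NULLABLE] column family "
                f"{col_family} state must be nullable, got non-nullable {f.name!r}"
            )


# The schema shape used by BasicProcessorNotNullable trips the check:
bad = [Field("id", "int", nullable=False), Field("name", "string", nullable=False)]
try:
    validate_value_schema("state", bad)
    raised = False
except ValueError as e:
    raised = "MUST_BE_NULLABLE" in str(e)
print(raised)  # True
```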
```diff
@@ -47,7 +47,7 @@ object StateStoreColumnFamilySchemaUtils {
       // Byte type is converted to Int in Avro, which doesn't work for us as Avro
       // uses zig-zag encoding as opposed to big-endian for Ints
       Seq(
-        StructField(s"${field.name}_marker", BinaryType, nullable = false),
+        StructField(s"${field.name}_marker", BinaryType, nullable = true),
         field.copy(name = s"${field.name}_value", BinaryType)
       )
     } else {
```
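The code comment in this hunk notes that Avro encodes ints with zig-zag rather than big-endian, which is why these byte fields are kept as `BinaryType`. A pure-Python sketch (the `zigzag` helper is illustrative, not a Spark or Avro API) shows why zig-zag output cannot be compared byte-wise for ordering:

```python
def zigzag(n: int) -> int:
    # ZigZag maps signed ints onto unsigned ones so small magnitudes stay
    # small: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
    # (64-bit form; Python ints are arbitrary precision, so this works as-is.)
    return (n << 1) ^ (n >> 63)


# Big-endian fixed-width bytes preserve numeric order; zig-zag interleaves
# negatives and positives, so comparing the encoded values no longer matches
# numeric comparison of the originals.
vals = [-2, -1, 0, 1, 2]
encoded = [zigzag(v) for v in vals]
print(encoded)  # [3, 1, 0, 2, 4] -- not monotonically increasing
```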
```diff
@@ -117,7 +117,7 @@ object StateStoreColumnFamilySchemaUtils {
       getRowCounterCFName(stateName), keySchemaId = 0,
       keyEncoder.schema,
       valueSchemaId = 0,
-      StructType(Seq(StructField("count", LongType, nullable = false))),
+      StructType(Seq(StructField("count", LongType, nullable = true))),
       Some(NoPrefixKeyStateEncoderSpec(keyEncoder.schema)))
     schemas.put(counterSchema.colFamilyName, counterSchema)
```

```diff
@@ -149,7 +149,7 @@ object StateStoreColumnFamilySchemaUtils {
       keySchemaId = 0,
       keyEncoder.schema,
       valueSchemaId = 0,
-      StructType(Seq(StructField("count", LongType, nullable = false))),
+      StructType(Seq(StructField("count", LongType, nullable = true))),
       Some(NoPrefixKeyStateEncoderSpec(keyEncoder.schema)))
     schemas.put(countSchema.colFamilyName, countSchema)
   }
```
```diff
@@ -363,18 +363,36 @@ class DriverStatefulProcessorHandleImpl(timeMode: TimeMode, keyExprEnc: Expressi
     addTimerColFamily()
   }

+  /**
+   * This method returns all column family schemas, and checks and enforces nullability
+   * if need be. The nullability check and set are only enabled when Avro is used.
+   * @param shouldCheckNullable Whether we need to check the nullability. This is set to
+   *                            true when using Python, as this is the only avenue through
+   *                            which users can set nullability.
+   * @param shouldSetNullable Whether we need to set the fields as nullable. This is set to
+   *                          true when using Scala, as case classes are set to
+   *                          non-nullable by default.
+   * @return column family schemas used by this stateful processor.
+   */
   def getColumnFamilySchemas(
-      setNullableFields: Boolean
+      shouldCheckNullable: Boolean,
+      shouldSetNullable: Boolean
   ): Map[String, StateStoreColFamilySchema] = {
     val schemas = columnFamilySchemas.toMap
-    if (setNullableFields) {
-      schemas.map { case (colFamilyName, stateStoreColFamilySchema) =>
-        colFamilyName -> stateStoreColFamilySchema.copy(
-          valueSchema = stateStoreColFamilySchema.valueSchema.toNullable
-        )
-      }
-    } else {
-      schemas
-    }
+    schemas.map { case (colFamilyName, schema) =>
+      schema.valueSchema.fields.foreach { field =>
+        if (!field.nullable && shouldCheckNullable) {
+          throw StateStoreErrors.twsSchemaMustBeNullable(
+            schema.colFamilyName, schema.valueSchema.toString())
+        }
+      }
+      if (shouldSetNullable) {
+        colFamilyName -> schema.copy(
+          valueSchema = schema.valueSchema.toNullable
+        )
+      } else {
+        colFamilyName -> schema
+      }
+    }
   }
```

Review discussion on `shouldSetNullable` (Scala case classes defaulting to non-nullable):

- I'm actually surprised, and it sounds like a bug to me. (Sorry, you had to handle Python and Scala differently due to this. My bad.) What if you set …? If this is indeed a bug and we can fix it, then we can simplify things a lot. I'm OK if you want to defer this, but we definitely need a follow-up ticket for it.
- Maybe it is only true for primitive types? If so, it might make sense as an optimization for types that can never be null. If you see non-nullable for String or the like, that should be a bug.
- I found this, …
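To summarize the two flags above: `shouldCheckNullable` (the Python path) rejects non-nullable fields, while `shouldSetNullable` (the Scala path) silently coerces them to nullable. A condensed pure-Python model of that branch logic, with `normalize_schema` as a hypothetical name and fields modeled as `(name, nullable)` pairs:

```python
def normalize_schema(fields, should_check_nullable, should_set_nullable):
    # Python path: users can declare nullability, so reject bad schemas.
    if should_check_nullable:
        for name, nullable in fields:
            if not nullable:
                raise ValueError(f"field {name!r} must be nullable")
    # Scala path: users cannot declare nullability, so coerce everything
    # to nullable (the equivalent of StructType.toNullable).
    if should_set_nullable:
        return [(name, True) for name, _ in fields]
    return fields


fields = [("count", False)]
# Scala-style call site: no check, coerce.
coerced = normalize_schema(fields, should_check_nullable=False, should_set_nullable=True)
print(coerced)  # [('count', True)]
```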
```diff
@@ -549,7 +567,7 @@ class DriverStatefulProcessorHandleImpl(timeMode: TimeMode, keyExprEnc: Expressi
       elementKeySchema: StructType): StateStoreColFamilySchema = {
     val countIndexName = s"$$count_$stateName"
     val countValueSchema = StructType(Seq(
-      StructField("count", LongType, nullable = false)
+      StructField("count", LongType)
     ))

     StateStoreColFamilySchema(
```
```diff
@@ -140,7 +140,7 @@ case class TransformWithStateExec(
    * after init is called.
    */
   override def getColFamilySchemas(
-      setNullableFields: Boolean
+      shouldBeNullable: Boolean
   ): Map[String, StateStoreColFamilySchema] = {
     val keySchema = keyExpressions.toStructType
     // we have to add the default column family schema because the RocksDBStateEncoder
```

```diff
@@ -149,8 +149,11 @@ case class TransformWithStateExec(
       0, keyExpressions.toStructType, 0, DUMMY_VALUE_ROW_SCHEMA,
       Some(NoPrefixKeyStateEncoderSpec(keySchema)))

+    // For Scala, the user can't explicitly set nullability on schema, so there is
+    // no reason to throw an error, and we can simply set the schema to nullable.
     val columnFamilySchemas = getDriverProcessorHandle()
-      .getColumnFamilySchemas(setNullableFields) ++
+      .getColumnFamilySchemas(
+        shouldCheckNullable = false, shouldSetNullable = shouldBeNullable) ++
       Map(StateStore.DEFAULT_COL_FAMILY_NAME -> defaultSchema)
     closeProcessorHandle()
     columnFamilySchemas
```

Review discussion on the new comment:

- As I mentioned in another comment, it is not impossible to set nullability on the encoder (although I tend to agree most users won't). Let's not make this conditional. Also, this concerns me: if we are very confident that users would never be able to set a column's nullability, why do we need to change the schema when we all know it has to be nullable? What are we worried about if we just do the same as Python?
- Reply: #49751 (comment)
```diff
@@ -145,6 +145,12 @@ object StateStoreErrors {
     new StateStoreValueSchemaNotCompatible(storedValueSchema, newValueSchema)
   }

+  def twsSchemaMustBeNullable(
+      columnFamilyName: String,
+      schema: String): TWSSchemaMustBeNullable = {
+    new TWSSchemaMustBeNullable(columnFamilyName, schema)
+  }
+
   def stateStoreInvalidValueSchemaEvolution(
       oldValueSchema: String,
       newValueSchema: String): StateStoreInvalidValueSchemaEvolution = {
```

Review comment on `twsSchemaMustBeNullable`:

- I think TWS deserves its own error collection class, but I agree this is out of scope. Let's make a follow-up.

```diff
@@ -346,6 +352,15 @@ class StateStoreValueSchemaNotCompatible(
     "storedValueSchema" -> storedValueSchema,
     "newValueSchema" -> newValueSchema))

+class TWSSchemaMustBeNullable(
+    columnFamilyName: String,
+    schema: String)
+  extends SparkUnsupportedOperationException(
+    errorClass = "TRANSFORM_WITH_STATE_SCHEMA_MUST_BE_NULLABLE",
+    messageParameters = Map(
+      "columnFamilyName" -> columnFamilyName,
+      "schema" -> schema))
+
 class StateStoreInvalidValueSchemaEvolution(
     oldValueSchema: String,
     newValueSchema: String)
```
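The `TRANSFORM_WITH_STATE_SCHEMA_MUST_BE_NULLABLE` error class carries `columnFamilyName` and `schema` as message parameters. As a rough sketch of how such a parameterized template renders (the template text and `render` helper here are illustrative, not the exact message or mechanism in Spark's error-conditions file):

```python
# Hypothetical template keyed by error class. Spark keeps such templates in a
# JSON error-conditions file and substitutes <param> placeholders.
TEMPLATES = {
    "TRANSFORM_WITH_STATE_SCHEMA_MUST_BE_NULLABLE": (
        "Schema for the <columnFamilyName> column family state must be "
        "nullable when Avro encoding is used. Schema: <schema>"
    ),
}


def render(error_class: str, params: dict) -> str:
    # Substitute each <key> placeholder, then prepend the bracketed class
    # name the way Spark error messages surface it to users.
    msg = TEMPLATES[error_class]
    for key, value in params.items():
        msg = msg.replace(f"<{key}>", value)
    return f"[{error_class}] {msg}"


msg_out = render(
    "TRANSFORM_WITH_STATE_SCHEMA_MUST_BE_NULLABLE",
    {"columnFamilyName": "state", "schema": "StructType(...)"},
)
print(msg_out)
```

Note that the PySpark test above matches on both the bracketed class name and a substring of the rendered message, which is what this sketch reproduces.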
Review discussion on the error message wording:

- Which do we think is easier to understand, "using Avro" or "schema evolution is enabled"? I foresee the direction of using Avro for all stateful operators (unless there is an outstanding regression), and once we make Avro the default, this message will be confusing to consume because users don't do anything with schema evolution. IMO it is "indirect" information, and they would probably try to figure out how to disable schema evolution instead, without knowing that Avro and schema evolution are coupled. cc @anishshri-db to hear his voice.
- Yeah, I think it's fine to say that we refer to the `transformWithState` case relative to `Avro` being used - we don't need to explicitly call out schema evolution here.