[SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs #49840
+121
−4
What changes were proposed in this pull request?
Simplify the resolution of EXISTS_DEFAULT in ResolveDefaultColumns::getExistenceDefaultValues(), which is called from file readers on executors.
Why are the changes needed?
Spark executors unnecessarily contact catalogs when resolving EXISTS_DEFAULT (the default value applied to existing data) for a column.
Detailed explanation: The code path for default values first analyzes the user-provided CURRENT_DEFAULT value for a column (to evaluate functions, etc.) and saves the resulting SQL as the column's EXISTS_DEFAULT. EXISTS_DEFAULT then avoids having to rewrite existing data files with a backfill of this value. When reading existing files, Spark resolves the EXISTS_DEFAULT metadata and uses the value for null values it finds in that column.
The problem is that this second step, on read, redundantly runs all the analyzer rules and finish-analysis rules on EXISTS_DEFAULT again, some of which contact the catalog unnecessarily. Some of those rules are unnecessary because they were already run to produce the value.
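To illustrate the two metadata values described above, here is a minimal toy model in plain Python, not Spark code. The helpers `fold_to_literal`, `add_column_with_default`, and `read_existing_rows` are hypothetical names made up for this sketch, and the "analysis" is reduced to folding simple arithmetic into a literal.

```python
import ast
import operator

# Hypothetical stand-in for the analysis step. Real Spark runs its SQL
# analyzer here; this toy only folds +, -, * over constants.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def fold_to_literal(expr_sql):
    """Evaluate a small arithmetic expression once and return literal text."""
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression in this toy model")
    return repr(ev(ast.parse(expr_sql, mode="eval").body))


def add_column_with_default(schema, name, current_default_sql):
    """Record the user's expression and its already-evaluated result."""
    schema[name] = {
        "CURRENT_DEFAULT": current_default_sql,
        "EXISTS_DEFAULT": fold_to_literal(current_default_sql),
    }


def read_existing_rows(schema, rows):
    """Fill absent or null column values from EXISTS_DEFAULT at read time."""
    out = []
    for row in rows:
        filled = dict(row)
        for name, meta in schema.items():
            if filled.get(name) is None and "EXISTS_DEFAULT" in meta:
                filled[name] = ast.literal_eval(meta["EXISTS_DEFAULT"])
        out.append(filled)
    return out
```

In this model the old rows are never rewritten: a row written before the column existed picks up the stored literal only when it is read.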
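The difference between re-analyzing the stored value and resolving it directly can be sketched as follows. This is a toy model in plain Python under stated assumptions, not the code in this PR: `UnreachableCatalog`, `resolve_with_full_analysis`, and `resolve_as_literal` are hypothetical names, and the assumption that the stored value is a plain literal is a simplification.

```python
import ast


class UnreachableCatalog:
    """Hypothetical catalog that cannot be reached from an executor."""

    def __init__(self):
        self.calls = 0

    def lookup_function(self, name):
        self.calls += 1
        raise ConnectionError("catalog unreachable from executor")


def resolve_with_full_analysis(exists_default_sql, catalog):
    # Models re-running every analyzer rule: a lookup rule contacts the
    # catalog even though the stored text is already an evaluated value.
    catalog.lookup_function("any_function")
    return ast.literal_eval(exists_default_sql)


def resolve_as_literal(exists_default_sql):
    # Models the simplified path: the value was already analyzed when it
    # was stored, so parsing the literal needs no catalog at all.
    return ast.literal_eval(exists_default_sql)
```

With an unreachable catalog, the first function fails before it returns anything, while the second returns the stored value without any catalog call.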
Worse, it may cause exceptions if the executors are not configured properly to reach the catalog.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added a test in StructTypeSuite. I had to expose some members of ResolveDefaultColumns for testing.
Was this patch authored or co-authored using generative AI tooling?
No