Add configurable Parquet Data Anonymization feature #24559
Description
This patch introduces a pluggable data anonymization feature into Presto, enabling on-the-fly, column-level anonymization (e.g., redacting, hashing, or any other transformation of sensitive data) when reading Parquet files.
The changes add two new configuration properties (enable-parquet-anonymization and parquet-anonymization-manager-class) and implement a new AnonymizedColumnReader in the Parquet module, which is activated when Parquet anonymization is enabled.
The design allows users to plug in custom anonymization managers to control how specific columns (e.g., PII) are transformed before being returned to the user.
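As a rough illustration of the plug-in idea, the sketch below shows how a manager could be loaded from a fully qualified class name and applied only to the columns it marks as sensitive. The interface shape (`shouldAnonymize`, `anonymize`), the `RedactingManager` class, the column names, and the `load`/`readValue` helpers are all hypothetical and chosen for illustration; the actual interface in this patch may differ.

```java
import java.util.Set;

public class AnonymizationSketch {
    /** Hypothetical plug-in contract; the real interface in the patch may differ. */
    public interface AnonymizationManager {
        boolean shouldAnonymize(String columnPath);

        String anonymize(String columnPath, String value);
    }

    /** Example manager that partially redacts a fixed, made-up set of columns. */
    public static class RedactingManager implements AnonymizationManager {
        private static final Set<String> SENSITIVE = Set.of("user.name", "user.email");

        @Override
        public boolean shouldAnonymize(String columnPath) {
            return SENSITIVE.contains(columnPath);
        }

        @Override
        public String anonymize(String columnPath, String value) {
            if (value == null || value.isEmpty()) {
                return value;
            }
            // Keep the first character and mask the rest: "John" -> "J***".
            return value.charAt(0) + "***";
        }
    }

    /** Instantiates a manager from its fully qualified class name via reflection. */
    public static AnonymizationManager load(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(AnonymizationManager.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load anonymization manager: " + className, e);
        }
    }

    /** Applies the manager only to columns it marks as sensitive. */
    public static String readValue(AnonymizationManager manager, String columnPath, String value) {
        return manager.shouldAnonymize(columnPath) ? manager.anonymize(columnPath, value) : value;
    }
}
```

In this sketch, columns the manager does not flag pass through untouched, which mirrors the intent that only selected columns are transformed.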
Motivation and Context
Limiting sensitive data exposure is a growing concern, and we wanted a way to redact / obfuscate certain columns' data by default.
In particular, we use this for hash anonymization, where hashed values can still be used in joins, counts, and GROUP BY clauses without revealing the original values. We can also use it to redact values (e.g., partially replace them with ***).
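The reason hashed values stay usable for joins and grouping is that a deterministic hash maps equal inputs to equal outputs. A minimal sketch of that property follows, assuming SHA-256 with hex encoding; the description does not say which hash function the deployed manager uses, and the `HashAnonymizer` class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashAnonymizer {
    /** Returns the lowercase hex SHA-256 digest of the UTF-8 bytes of the value. */
    public static String hash(String value) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        // Equal inputs hash to the same token, so equality joins and GROUP BY still work.
        System.out.println(hash("alice@example.com").equals(hash("alice@example.com")));
        // Different inputs produce different tokens.
        System.out.println(hash("alice@example.com").equals(hash("bob@example.com")));
    }
}
```

One caveat on this design: a plain unsalted hash of low-entropy values (such as phone numbers) can be reversed by hashing candidate values, so a keyed hash may be preferable where that matters.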
Impact
New session properties:
enable_parquet_anonymization (boolean) – controls whether anonymization is enabled - default false.
parquet_anonymization_manager_class (string) – specifies the fully qualified class name of the anonymization manager - default empty.
When anonymization is disabled, performance remains unchanged and Parquet reading is not affected at all.
When anonymization is enabled, the impact is minimal and depends on how many columns are anonymized and on the type of transformation the anonymization manager performs (e.g., hashing values has a higher impact, while redacting strings such as "John" -> "J***" is marginal).
Test Plan
Added comprehensive tests in TestAnonymizedParquetReader and TestAnonymizedColumnReader to verify that:
Data is correctly anonymized when enabled.
Non-anonymized reads return original values.
This feature has also been deployed at scale at Uber since October.
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.