Add configurable Parquet Data Anonymization feature #24559

Pavitheran · 2025-02-13T22:58:15Z

Description

This patch introduces a pluggable data anonymization feature into Presto, enabling on-the-fly, column-level anonymization (e.g., redacting, hashing, or any other transformation of sensitive data) when reading Parquet files.

The changes add two new configuration properties (enable-parquet-anonymization and parquet-anonymization-manager-class) and implement a new AnonymizedColumnReader in the Parquet module that is activated when Parquet anonymization is enabled.

The design allows users to plug in custom anonymization managers to control how specific columns (e.g., PII) are transformed before being returned to the user.
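To illustrate the plug-in idea, here is a minimal sketch of how a manager could be loaded from a fully qualified class name. The `AnonymizationManager` interface, its two methods, the `MaskingManager` class, and the column names are all hypothetical and chosen for illustration; the interface in this patch may look different.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class AnonymizationManagerSketch {
    /** Hypothetical plug-in contract; not the interface from the patch. */
    public interface AnonymizationManager {
        // Decide whether a column should be transformed at read time.
        boolean shouldAnonymize(String columnPath);

        // Transform a single value of an anonymized column.
        String anonymize(String columnPath, String value);
    }

    /** Example manager that masks every value of a fixed set of columns. */
    public static class MaskingManager implements AnonymizationManager {
        // Assumed column names, for illustration only.
        private final Set<String> sensitive = new HashSet<>(Arrays.asList("email", "ssn"));

        @Override
        public boolean shouldAnonymize(String columnPath) {
            return sensitive.contains(columnPath);
        }

        @Override
        public String anonymize(String columnPath, String value) {
            return value == null ? null : "***";
        }
    }

    /** Instantiate a manager by class name, as a class-name property might be resolved. */
    public static AnonymizationManager load(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(AnonymizationManager.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load anonymization manager: " + className, e);
        }
    }
}
```

In this sketch the reader would call `shouldAnonymize` once per column and `anonymize` per value, so columns that are not marked sensitive pass through unchanged.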

Motivation and Context

Limiting sensitive data exposure is a growing concern, and we wanted a way to redact or obfuscate the data in certain columns by default.
In particular, we use this for hash anonymization, where hashed values can still be used in joins, counts, and group-bys without revealing the original values. We can also use it to redact values (i.e., partially replace them with ***).
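The two transformations mentioned above can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and SHA-256 is an assumed choice of hash, not necessarily what the patch or any deployment uses.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class AnonymizationTransforms {
    /** Deterministic hash: equal inputs give equal outputs, so joins and group-bys still line up. */
    public static String sha256Hex(String value) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Partial redaction: keep the first character and replace the rest with "***". */
    public static String redact(String value) {
        if (value == null || value.isEmpty()) {
            return value;
        }
        return value.charAt(0) + "***";
    }
}
```

With these definitions, `redact("John")` yields `J***`, and two rows holding the same value produce the same `sha256Hex` output, which is what keeps equality-based operations working on anonymized columns.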

Impact

New session properties:
enable_parquet_anonymization (boolean) – controls whether anonymization is enabled - default false.
parquet_anonymization_manager_class (string) – specifies the fully qualified class name of the anonymization manager - default empty.

When anonymization is disabled, performance is unchanged and Parquet reading is not affected at all.
When anonymization is enabled, the performance impact is minimal and depends on how many columns are anonymized and on the kind of transformation the anonymization manager applies (e.g., hashing values has a higher cost, while redacting strings such as "John" -> "J***" has a marginal one).

Test Plan

Added comprehensive tests in TestAnonymizedParquetReader and TestAnonymizedColumnReader to verify that:
Data is correctly anonymized when enabled.
Non-anonymized reads return original values.

This has also been deployed at scale at Uber since October.

Contributor checklist

Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Added pluggable data anonymization support for reading Parquet in Presto.
* Added new session properties: `enable_parquet_anonymization` and `parquet_anonymization_manager_class` to control anonymization behavior.

linux-foundation-easycla · 2025-02-13T22:58:19Z

✅login: Pavitheran / (33380eb)

The committers listed above are authorized under a signed CLA.

ZacBlanco · 2025-02-14T01:49:33Z

I think this is a great feature, but there is some parallel work going on to solicit community feedback on a more generic implementation of row filtering/column masking that would seem to apply to this use case too. I would recommend reading the RFC and seeing if that proposal satisfies your organization's current needs. I think we need something that handles this use case in Presto, but I'm not sure that focusing solely on a particular file type is the right solution. It would be great to get your thoughts and facilitate some more discussion on that RFC.

Pavitheran · 2025-02-14T15:59:12Z

Thanks for reviewing this, @ZacBlanco. We've actually implemented a row filtering/column masking mechanism at our org that is similar to the design from the RFC. The main reason for introducing this file-format-level anonymization is that we have Parquet column encryption enabled, and we wanted a way to return anonymized values even when the user running the query doesn't have access to the original column data.

Specifically, this allows:
Users with Parquet column encryption access to request either the original or anonymized data.
Users without access to still retrieve anonymized values rather than being fully restricted.
I believe this is not possible with the plan-level column masking approach (but I may be lacking creativity).

I initially planned to introduce this option in a follow-up PR, but I can include it here if that would be more helpful. If this overlaps too much with the broader column masking/row filtering efforts, that's understandable. Let me know your thoughts.

Add configurable parquet data anonymization feature

33380eb

Pavitheran requested review from shangxinli, a team, hantangwangd, ZacBlanco, vinothchandar and 7c00 as code owners February 13, 2025 22:58

Pavitheran requested a review from presto-oss February 13, 2025 22:58
