Add configurable Parquet Data Anonymization feature #24559

Open
wants to merge 1 commit into
base: master

@Pavitheran Pavitheran commented Feb 13, 2025

Description

This patch introduces a pluggable data anonymization feature into Presto, enabling on-the-fly, column-level anonymization (e.g., redacting, hashing, or any other transformation of sensitive data) when reading Parquet files.

The changes add two new configuration properties (enable-parquet-anonymization and parquet-anonymization-manager-class) and implement a new AnonymizedColumnReader in the Parquet module that is activated when anonymization is enabled.

The design allows users to plug in custom anonymization managers to control how specific columns (e.g., PII) are transformed before being returned to the user.
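
To illustrate the plug-in idea, here is a minimal sketch of how a manager named by a class-name property could be loaded and applied per column. The `AnonymizationManager` interface, its two methods, `MaskAllManager`, and the `email` column are hypothetical names chosen for illustration; the interface in this patch may look different.

```java
// Hypothetical plug-in contract; the interface in the actual patch may differ.
interface AnonymizationManager
{
    // Returns true if values of the given column should be transformed.
    boolean shouldAnonymize(String columnPath);

    // Transforms a single value read from the given column.
    String anonymize(String columnPath, String value);
}

// Example manager (hypothetical) that fully masks the "email" column.
final class MaskAllManager
        implements AnonymizationManager
{
    @Override
    public boolean shouldAnonymize(String columnPath)
    {
        return "email".equals(columnPath);
    }

    @Override
    public String anonymize(String columnPath, String value)
    {
        return value == null ? null : "***";
    }
}

final class AnonymizationManagers
{
    private AnonymizationManagers() {}

    // Instantiates a manager from a fully qualified class name,
    // which is how a class-name property could be resolved.
    static AnonymizationManager load(String className)
    {
        try {
            return Class.forName(className)
                    .asSubclass(AnonymizationManager.class)
                    .getDeclaredConstructor()
                    .newInstance();
        }
        catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot load anonymization manager: " + className, e);
        }
    }

    // Applies the manager only to columns it claims; other columns pass through unchanged.
    static String read(AnonymizationManager manager, String columnPath, String value)
    {
        return manager.shouldAnonymize(columnPath) ? manager.anonymize(columnPath, value) : value;
    }
}
```

Loading by class name keeps the reader independent of any particular transformation, so an organization can ship its own manager without changing the Parquet module.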

Motivation and Context

Limiting sensitive data exposure is a growing concern, and we wanted a way to redact or obfuscate the data in certain columns by default.
In particular, we use this for hash anonymization, where hashed values can still be used in joins, counts, and GROUP BY without revealing the original values. We can also use it to redact values (i.e., partially replace them with ***).
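
As a rough illustration of the two transformations mentioned above, the sketch below shows a deterministic hash and a partial redaction. The class and method names are hypothetical, and SHA-256 is an assumption; this description does not say which hash function is used.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class; not part of the patch.
final class AnonymizationTransforms
{
    private AnonymizationTransforms() {}

    // Deterministic hash: equal inputs give equal outputs, so hashed
    // columns can still be joined, counted, and grouped.
    static String sha256Hex(String value)
    {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
        catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Partial redaction: keep the first character and mask the rest.
    static String redactAfterFirst(String value)
    {
        if (value == null || value.length() <= 1) {
            return value;
        }
        StringBuilder masked = new StringBuilder(value.length());
        masked.append(value.charAt(0));
        for (int i = 1; i < value.length(); i++) {
            masked.append('*');
        }
        return masked.toString();
    }
}
```

For example, `redactAfterFirst("John")` yields `J***`, and two rows holding the same value produce the same `sha256Hex` output, which is what keeps joins and grouping usable. A plain unsalted hash of low-entropy values can be reversed by dictionary lookup, so a keyed hash may be preferable where that matters.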

Impact

New session properties:
  • enable_parquet_anonymization (boolean): controls whether anonymization is enabled. Default: false.
  • parquet_anonymization_manager_class (string): the fully qualified class name of the anonymization manager. Default: empty.

When anonymization is disabled, Parquet reading is not affected and performance is unchanged.
When anonymization is enabled, the impact is minimal and depends on how many columns are anonymized and on the transformation the anonymization manager applies (e.g., hashing values costs more, while redacting strings such as "John" -> "J***" adds only marginal overhead).

Test Plan

Added comprehensive tests in TestAnonymizedParquetReader and TestAnonymizedColumnReader to verify that:
  • Data is correctly anonymized when enabled.
  • Non-anonymized reads return original values.

This has also been deployed at scale at Uber since October.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with their default values), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Added pluggable data anonymization support for reading Parquet in Presto.
* Added new session properties: `enable_parquet_anonymization` and `parquet_anonymization_manager_class` to control anonymization behavior.

linux-foundation-easycla bot commented Feb 13, 2025

CLA Signed


The committers listed above are authorized under a signed CLA.

@ZacBlanco
Contributor

I think this is a great feature, but there is parallel work going on to solicit community feedback on a more generic implementation of row filtering/column masking that would seem to apply to this use case too. I would recommend reading the RFC and seeing whether that proposal satisfies your organization's current needs. I think we need something that handles this use case in Presto, but I'm not sure that focusing solely on a particular file type is the right solution. It would be great to get your thoughts and facilitate some more discussion on that RFC.

@Pavitheran
Author

Thanks for reviewing this, @ZacBlanco. We’ve actually implemented a row filtering/column masking mechanism at our org, similar to the design in the RFC. The main reason for introducing this file-format-level anonymization is that we have Parquet column encryption enabled, and we wanted a way to return anonymized values even when the user running the query doesn’t have access to the original column data.

Specifically, this allows:
  • Users with Parquet column encryption access to request either the original or the anonymized data.
  • Users without access to still retrieve anonymized values rather than being fully restricted.
I believe this is not possible with the plan-level column masking approach (but I may be lacking creativity).
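
The two cases above boil down to a small decision, sketched here with hypothetical names (`ReadMode`, `ReadModeResolver`, and both parameters are illustrative only). How the reader obtains the underlying values for users without key access is not covered by this sketch.

```java
// Hypothetical names for illustration; not taken from the patch.
enum ReadMode
{
    ORIGINAL,
    ANONYMIZED
}

final class ReadModeResolver
{
    private ReadModeResolver() {}

    // hasColumnKeyAccess: the user may decrypt the encrypted Parquet column.
    // anonymizedRequested: the user asked for anonymized output.
    static ReadMode resolve(boolean hasColumnKeyAccess, boolean anonymizedRequested)
    {
        if (!hasColumnKeyAccess) {
            // Without access, anonymized values are returned instead of failing the read.
            return ReadMode.ANONYMIZED;
        }
        // With access, the user chooses between original and anonymized data.
        return anonymizedRequested ? ReadMode.ANONYMIZED : ReadMode.ORIGINAL;
    }
}
```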

I initially planned to introduce this option in a follow-up PR, but I can include it here if that would be more helpful. But yeah, if this overlaps too much with the broader column masking/row filtering efforts, that's understandable. Let me know your thoughts.
