
Task: redaction and anonymization #7

kamwoods opened this issue Jun 17, 2022 · 0 comments

Operations in core lib:

Redact: Remove the identified feature from the source text. Takes no params.
Replace: Replace an identified entity with a pattern/value supplied via parameter. Optionally, for entities identified by spaCy, replace with the entity label output by spaCy (e.g. PERSON, ORG).
Mask: Substitute each character of the identified feature with a single char supplied in param (e.g. "*"). Optionally mask only a subset of the characters (for example, leaving the first and last char unmasked), based on supplied start and end char counts.
Hash: Replace the identified feature with its sha256 or md5 hash (should be designed for extensibility to other hash options).
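
A minimal sketch of how the four ops might look, assuming each identified feature arrives as a half-open character span `(start, end)` into the source text. The function names, the parameter names (`value`, `label`, `keep_start`, `keep_end`, `algorithm`) and the `[LABEL]` placeholder format are hypothetical and not part of any existing API:

```python
import hashlib


def redact(text, start, end):
    """Remove the span text[start:end] from the source text."""
    return text[:start] + text[end:]


def replace(text, start, end, value=None, label=None):
    """Replace the span with a supplied value, or with the entity label.

    Assumption: when only a label is given, it is rendered as "[LABEL]".
    """
    if value is None:
        if label is None:
            raise ValueError("either value or label is required")
        value = f"[{label}]"
    return text[:start] + value + text[end:]


def mask(text, start, end, char="*", keep_start=0, keep_end=0):
    """Mask each character of the span, optionally keeping leading/trailing chars."""
    if len(char) != 1:
        raise ValueError("mask char must be a single character")
    span = text[start:end]
    n = len(span)
    # Clamp the kept counts so they never exceed the span length.
    keep_start = min(max(keep_start, 0), n)
    keep_end = min(max(keep_end, 0), n - keep_start)
    masked = span[:keep_start] + char * (n - keep_start - keep_end) + span[n - keep_end:]
    return text[:start] + masked + text[end:]


def hash_span(text, start, end, algorithm="sha256"):
    """Replace the span with the hex digest of its UTF-8 bytes.

    hashlib.new() accepts any algorithm name the local build supports,
    which leaves room for options beyond sha256 and md5.
    """
    digest = hashlib.new(algorithm, text[start:end].encode("utf-8")).hexdigest()
    return text[:start] + digest + text[end:]


text = "Contact Jane Doe at jane@example.org"
print(replace(text, 8, 16, label="PERSON"))
print(mask(text, 8, 16, char="*", keep_start=1, keep_end=1))
```

Each function returns a new string and does not adjust the offsets of other spans, so a caller applying several ops to one text would probably want to work from the last span backwards.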

Managing potential overlaps:

The spaCy EntityRecognizer should not produce overlapping entity spans. However, features identified by regex within the pipeline may overlap with each other or with spaCy entities. This should (potentially?) be tracked in the db and handled accordingly:

  • No overlap identified: Apply requested ops as normal.
  • Full overlap: Warn/log and apply in pipeline order (e.g. if a name is recognized by spaCy and a replacement is requested for named ents, do not apply any further op requested for a pattern match on the same span). This also applies to substring matches (i.e. one match contained within another).
  • Partial overlap: Apply the requested op only to the partial span.
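
The three cases above could be told apart along these lines. This is only a sketch: spans are assumed to be half-open `(start, end)` character offsets, `classify_overlap` and `resolve` are hypothetical names, and the partial-overlap handling (trimming the later span to the part not already claimed by an earlier one) is one possible reading of "apply only to the partial span", not a settled design:

```python
import logging

logger = logging.getLogger(__name__)


def classify_overlap(a, b):
    """Return "none", "full", or "partial" for two half-open (start, end) spans."""
    a_start, a_end = a
    b_start, b_end = b
    if a_end <= b_start or b_end <= a_start:
        return "none"
    a_contains_b = a_start <= b_start and b_end <= a_end
    b_contains_a = b_start <= a_start and a_end <= b_end
    if a_contains_b or b_contains_a:
        # Identical spans and substring matches both count as full overlap.
        return "full"
    return "partial"


def resolve(spans):
    """Filter (start, end, op) tuples given in pipeline order.

    Earlier entries win. A later entry that fully overlaps an accepted one
    is logged and dropped; a partially overlapping one is trimmed
    (assumption, see the note above the code).
    """
    accepted = []
    for start, end, op in spans:
        keep = True
        for a_start, a_end, a_op in accepted:
            kind = classify_overlap((a_start, a_end), (start, end))
            if kind == "full":
                logger.warning(
                    "skipping %s on [%d, %d): overlaps %s on [%d, %d)",
                    op, start, end, a_op, a_start, a_end,
                )
                keep = False
                break
            if kind == "partial":
                if start < a_start:
                    end = a_start   # keep only the part before the earlier span
                else:
                    start = a_end   # keep only the part after the earlier span
        if keep and start < end:
            accepted.append((start, end, op))
    return accepted


requested = [(0, 8, "replace"), (4, 12, "mask"), (2, 6, "hash"), (20, 25, "redact")]
print(resolve(requested))
```

If overlaps end up being tracked in the db, the value returned by `classify_overlap` is the kind of thing that could be stored alongside each pair of spans.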