This Jupyter Notebook demonstrates the process of detecting and redacting Personally Identifiable Information (PII) from text extracted from images. It uses EasyOCR for Optical Character Recognition (OCR) and SpaCy for Named Entity Recognition (NER) to identify and redact sensitive information such as names, emails, phone numbers, and zip codes.
- OCR with EasyOCR: Extract text from images.
- PII Detection with SpaCy: Identify sensitive entities like names, emails, phone numbers, and zip codes.
- PII Redaction: Redact sensitive information using SpaCy and regex patterns.
- Visualization: Display the original image alongside the redacted text.
-
Install the required libraries:
pip install easyocr spacy torch matplotlib pillow
-
Download the SpaCy language model:
python -m spacy download en_core_web_sm
-
Open the notebook
pytorch_training.ipynb
in Jupyter Notebook or JupyterLab and begin training the model. -
Once you like what the model is returning, you can use the
pii_detection_redaction.ipynb
notebook to add a file upload feature and test the PII detection and redaction functionality. -
Follow the steps in the notebook to:
- Extract text from an image.
- Detect and redact PII.
- Visualize the results.
-
Replace the
image_path
variable with the path to your image file to test with your own data.
- Original Image: Displays the uploaded image.
- Redacted Text: Shows the text with sensitive information redacted.
This notebook serves as a prototype for a Flask web application that will allow users to upload images via a web interface and receive redacted results. Stay tuned for the Flask app implementation!
- EasyOCR
- SpaCy
- Torch
- Matplotlib
- Pillow
This project is licensed under the MIT License. See the LICENSE file for more details.