this project implements a Python-based solution to detect and redact PII and SPI from text documents, PDFs, and images.

manansvr/pii-detection-redaction

PII & SPI: Detection + Redaction System

A comprehensive Python-based solution for detecting and redacting personally identifiable information (PII) and sensitive personal information (SPI) from text documents, PDFs, and images using natural language processing and optical character recognition.

Project Scope

This system helps organizations protect sensitive information by automatically detecting and redacting PII/SPI such as names, email addresses, phone numbers, and financial information. The system is designed for secure document sharing, compliance requirements, and data privacy protection.

Features

Multi-Format Support

  • Text Files: process plain text documents (.txt)
  • PDF Documents: handle both text-based and scanned PDFs using OCR
  • Images: detect and redact PII from image files

Dual Interface

  • Web Application: user-friendly Streamlit interface for interactive processing
  • Command-Line Tools: scriptable CLI for batch processing and automation

Entity Types Detected

  • Personal Names (PERSON)
  • Email Addresses
  • Phone Numbers
  • URLs
  • Credit Card Numbers
  • Social Security Numbers
  • Bank Account Numbers
  • Australian Business Numbers (ABN)
  • Custom Pattern Recognition
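
Pattern-based entities such as credit card numbers and ABNs are commonly confirmed with a checksum, so that arbitrary digit runs are not flagged. The sketch below shows the two standard checks in plain Python: Luhn for card numbers and the weighted mod-89 rule for ABNs. The function names are hypothetical, and it is an assumption that this project validates matches this way; Presidio's built-in recognizers may already cover it.

```python
import re

# Weights applied to the 11 ABN digits (after subtracting 1 from the first digit).
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in re.sub(r"\D", "", number)]
    if len(digits) < 12:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def abn_valid(abn: str) -> bool:
    """Return True if `abn` has 11 digits and passes the mod-89 check."""
    digits = [int(c) for c in re.sub(r"\D", "", abn)]
    if len(digits) != 11:
        return False
    digits[0] -= 1
    return sum(w * d for w, d in zip(ABN_WEIGHTS, digits)) % 89 == 0
```

A checksum only rules out malformed numbers; it does not show that a number belongs to a real account or business.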

Installation

Prerequisites

  • Python 3.8
  • Tesseract OCR (for scanned PDFs): brew install tesseract on macOS

Setup

  1. clone the repository
git clone <repository-url>
cd pii-detection-redaction
  2. install dependencies
pip install -r requirements.txt
  3. download spaCy language model
python -m spacy download en_core_web_sm

# or

python -m spacy download en_core_web_lg

Usage

Web Application

streamlit run src/app.py

then open your browser to http://localhost:8501

Features:

  • upload or paste text for analysis
  • upload PDF files for redaction
  • adjust detection parameters (chunk size, overlap, confidence threshold)
  • preview and download redacted files
  • view command execution logs
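
The Web Application Flow diagram suggests the app drives the CLI modules as subprocesses and shows their logs. A minimal sketch of how such a command could be assembled from the UI parameters follows. The helper name `build_text_command` is hypothetical; only the flags come from the CLI usage documented in this README.

```python
import sys
from typing import List


def build_text_command(
    input_path: str,
    output_path: str,
    size: int = 5000,
    overlap: int = 300,
    min_score: float = 0.0,
) -> List[str]:
    """Assemble an argument list for `python -m textDetector`."""
    return [
        sys.executable, "-m", "textDetector",
        "--in", input_path,
        "--size", str(size),
        "--overlap", str(overlap),
        "--min-score", str(min_score),
        "--mask-to-file", output_path,
    ]
```

Passing the list to `subprocess.run(cmd, capture_output=True, text=True)` avoids shell quoting problems and returns stdout and stderr for a log view.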

Command-Line Interface

Text Detection

basic usage:

python -m textDetector --text "contact manan rathi at [email protected]"

from file:

python -m textDetector --in input.txt --mask-to-file output.txt

advanced options:

python -m textDetector \
  --in input.txt \
  --size 4000 \
  --overlap 300 \
  --min-score 0.3 \
  --mask-to-file redacted.txt \
  --print-text

PDF Redaction

python -m pdfRedactor.cli --in document.pdf --out redacted.pdf

Image Redaction

python -m imageRedactor.analyzer --input image.jpg --output redacted.jpg

📁 Project Structure

pii-detection-redaction/
├── src/
│   ├── app.py                      # streamlit web application
│   ├── common/                     # shared utilities
│   │   ├── common.py               # presidio analyzer builders
│   │   └── __init__.py
│   ├── textDetector/               # text PII detection module
│   │   ├── analyzer.py             # analyzer engine setup
│   │   ├── chunker.py              # long text processing
│   │   ├── anonymize.py            # text anonymization
│   │   ├── formatter.py            # results formatting
│   │   ├── relationships.py        # context-aware masking
│   │   ├── cli.py                  # command-line interface
│   │   └── __main__.py
│   ├── pdfRedactor/                # PDF PII redaction module
│   │   ├── analyzer.py             # PDF text extraction & analysis
│   │   ├── redactor.py             # PDF redaction engine
│   │   ├── cli.py                  # command-line interface
│   │   └── __init__.py
│   └── imageRedactor/              # image PII redaction module
│       ├── analyzer.py             # image analysis
│       ├── redactor.py             # image redaction engine
│       ├── types.py                # data classes
│       └── __init__.py
├── styles/
│   └── theme.css                   # web app styling
├── requirements.txt                # python dependencies
└── README.md                       # this file

Architecture & Flow Diagrams

Web Application Flow

sequenceDiagram
    participant User
    participant Streamlit as app.py
    participant Common as common/common.py
    participant TextDetector as textDetector/*
    participant PDFRedactor as pdfRedactor/*
    participant Session as Session State

    User->>Streamlit: Upload File / Paste Text
    Streamlit->>Streamlit: processFile()
    Streamlit->>Streamlit: buildCommand()
    Streamlit->>Streamlit: runModuleCommand()

    alt Text Processing
        Streamlit->>TextDetector: python -m textDetector
        TextDetector->>Common: buildPresidioAnalyzer()
        Common-->>TextDetector: AnalyzerEngine
        TextDetector->>TextDetector: analyzeLongText()
        TextDetector->>TextDetector: maskWithRelationships()
        TextDetector-->>Streamlit: Redacted File
    else PDF Processing
        Streamlit->>PDFRedactor: python -m pdfRedactor.cli
        PDFRedactor->>Common: buildPresidioAnalyzer()
        Common-->>PDFRedactor: AnalyzerEngine
        PDFRedactor->>PDFRedactor: analyzePdfToBboxes()
        PDFRedactor->>PDFRedactor: writeRedactedPdf()
        PDFRedactor-->>Streamlit: Redacted PDF
    end

    Streamlit->>Session: Store file Bytes & Name
    Streamlit->>Streamlit: displayCommandLogs()
    Streamlit->>Streamlit: renderDownloadAndPreview()
    Streamlit-->>User: Display Results & Preview

Text Detection Module Flow

sequenceDiagram
    participant CLI as cli.py
    participant Analyzer as analyzer.py
    participant Common as common.py
    participant Chunker as chunker.py
    participant Relationships as relationships.py
    participant Anonymize as anonymize.py
    participant Presidio as Presidio Engine
    participant Spacy as spaCy NLP

    CLI->>CLI: parseArgs()
    CLI->>CLI: readInputText()

    CLI->>Analyzer: buildAnalyzer(language)
    Analyzer->>Common: buildPresidioAnalyzer()
    Common->>Common: pickSpacyModel()
    Common->>Spacy: Load Model
    Spacy-->>Common: NLP Model
    Common->>Presidio: Create AnalyzerEngine
    Presidio-->>Analyzer: AnalyzerEngine

    CLI->>Chunker: analyzeLongText(text, size, overlap)
    loop For Each Chunk
        Chunker->>Presidio: analyze(chunk)
        Presidio->>Spacy: NLP Processing
        Spacy-->>Presidio: Entities
        Presidio-->>Chunker: RecognizerResults
    end
    Chunker->>Chunker: Merge Overlapping Results
    Chunker-->>CLI: All Results

    alt Anonymize Mode
        CLI->>Anonymize: anonymizeText(text, results)
        Anonymize-->>CLI: Type-Based Anonymization
    else Relationship Masking
        CLI->>Relationships: maskWithRelationships(text, results)
        Relationships->>Relationships: assignRelationships()
        Relationships-->>CLI: Context-Aware Masked Text
    end

    CLI-->>CLI: Write Output File
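
The "Merge Overlapping Results" step matters because overlapping chunks can report the same entity twice, or report competing entity types for the same characters. One possible merge rule is sketched here: keep the highest-scoring detection among spans that overlap. The `Detection` class and the tie-breaking rule are assumptions for illustration, not the project's actual implementation.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Detection:
    entity_type: str
    start: int  # document-level character offset, inclusive
    end: int    # document-level character offset, exclusive
    score: float


def merge_overlapping(detections: Iterable[Detection]) -> List[Detection]:
    """Keep the highest-scoring detection among spans that overlap."""
    kept: List[Detection] = []
    # Highest score first; longer spans win ties.
    ranked = sorted(detections, key=lambda d: (-d.score, -(d.end - d.start)))
    for candidate in ranked:
        overlaps = any(
            candidate.start < k.end and k.start < candidate.end for k in kept
        )
        if not overlaps:
            kept.append(candidate)
    # Return in reading order so masking can proceed left to right.
    return sorted(kept, key=lambda d: d.start)
```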

PDF Redaction Module Flow

sequenceDiagram
    participant CLI as cli.py
    participant Analyzer as analyzer.py
    participant Common as common.py
    participant Redactor as redactor.py
    participant PDFMiner as pdfminer.six
    participant Presidio as Presidio Engine
    participant Pikepdf as pikepdf

    CLI->>CLI: parseArgs()

    CLI->>Analyzer: buildAnalyzer()
    Analyzer->>Common: buildPresidioAnalyzer()
    Common-->>Analyzer: AnalyzerEngine

    CLI->>Analyzer: analyzePdfToBboxes(pdfPath)

    loop For Each Page
        Analyzer->>PDFMiner: extract_pages()
        PDFMiner-->>Analyzer: Page Layout

        loop For Each Text Container
            Analyzer->>Analyzer: Extract Characters & Positions
            Analyzer->>Presidio: analyze(text)
            Presidio-->>Analyzer: Entity Results
            Analyzer->>Analyzer: Map Entities To Bounding Boxes
        end

        Analyzer-->>Analyzer: Page Bounding Boxes
    end

    Analyzer-->>CLI: All Pages With bboxes

    CLI->>Redactor: writeRedactedPdf(srcPdf, dstPdf, bboxes)
    Redactor->>Pikepdf: Open Source PDF

    loop For Each Page With Entities
        Redactor->>Redactor: Build Redaction Rectangles
        Redactor->>Redactor: Generate Overlay Stream
        Redactor->>Pikepdf: Add Redaction Layer
    end

    Redactor->>Pikepdf: Save Redacted PDF
    Pikepdf-->>Redactor: Success
    Redactor-->>CLI: Output File Path
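
The "Map Entities To Bounding Boxes" step can be pictured as taking the union of the per-character boxes that fall inside an entity's character span. The sketch below assumes one box per extracted character, in the same order as the analyzed text; the name `span_to_bbox` and the tuple layout are hypothetical.

```python
from typing import Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def span_to_bbox(char_boxes: Sequence[BBox], start: int, end: int) -> BBox:
    """Union the boxes of the characters in [start, end)."""
    boxes = char_boxes[start:end]
    if not boxes:
        raise ValueError("span covers no characters")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )
```

An entity that wraps across lines would need one box per line, otherwise the union covers unrelated text between the two line fragments.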

Image Redaction Module Flow

sequenceDiagram
    participant CLI as analyzer.py
    participant Redactor as redactor.py
    participant Common as common.py
    participant Presidio as Presidio Engine
    participant ImageRedactor as presidio_image_redactor
    participant Tesseract as pytesseract OCR
    participant PIL as Pillow

    CLI->>CLI: Parse Arguments
    CLI->>CLI: hexToRgb() For Colors

    CLI->>Common: buildPresidioAnalyzer()
    Common-->>CLI: AnalyzerEngine

    CLI->>Redactor: ImageRedactor(analyzer, ocrLanguages)

    CLI->>Redactor: redactFile(inputPath, outputPath)
    Redactor->>PIL: Open Image
    PIL-->>Redactor: Image Object

    Redactor->>Tesseract: Extract Text & Positions
    Tesseract-->>Redactor: OCR Results (bboxes)

    Redactor->>Presidio: Analyze(Extracted Text)
    Presidio-->>Redactor: Entity Results

    Redactor->>Redactor: Map Entities To Image Coordinates
    Redactor->>ImageRedactor: Apply Redaction Style

    alt Fill Style
        ImageRedactor->>PIL: Draw Filled Rectangles
    else Blur Style
        ImageRedactor->>PIL: Apply Blur Filter
    else Pixelate Style
        ImageRedactor->>PIL: Pixelate Regions
    end

    Redactor->>PIL: Save Redacted Image
    PIL-->>Redactor: Success
    Redactor-->>CLI: Output File Path
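
The flow above mentions a `hexToRgb()` step for colors. A minimal sketch of such a conversion is shown here, assuming colors arrive as `#RRGGBB` or shorthand `#RGB` strings; the snake_case name and the accepted formats are assumptions.

```python
from typing import Tuple


def hex_to_rgb(value: str) -> Tuple[int, int, int]:
    """Convert '#RRGGBB' or '#RGB' (leading '#' optional) to (r, g, b)."""
    text = value.strip().lstrip("#")
    if len(text) == 3:
        # Expand shorthand: 'f0a' -> 'ff00aa'.
        text = "".join(ch * 2 for ch in text)
    if len(text) != 6:
        raise ValueError(f"expected 3 or 6 hex digits, got {value!r}")
    red, green, blue = (int(text[i:i + 2], 16) for i in (0, 2, 4))
    return red, green, blue
```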

Common Module Interaction

graph TB
    subgraph "Common Module"
        Common[common.py]
        PickModel[pickSpacyModel]
        BuildAnalyzer[buildPresidioAnalyzer]
    end

    subgraph "Text Detection"
        TextAnalyzer[textDetector/analyzer.py]
        TextCLI[textDetector/cli.py]
    end

    subgraph "PDF Redaction"
        PDFAnalyzer[pdfRedactor/analyzer.py]
        PDFCLI[pdfRedactor/cli.py]
    end

    subgraph "Image Redaction"
        ImageAnalyzer[imageRedactor/analyzer.py]
        ImageRedactor[imageRedactor/redactor.py]
    end

    subgraph "Web Application"
        App[app.py]
    end

    subgraph "External Libraries"
        Spacy[spaCy NLP]
        Presidio[Presidio Analyzer]
    end

    TextAnalyzer --> Common
    PDFAnalyzer --> Common
    ImageAnalyzer --> Common

    Common --> PickModel
    Common --> BuildAnalyzer

    BuildAnalyzer --> Spacy
    BuildAnalyzer --> Presidio

    App --> TextCLI
    App --> PDFCLI
    App --> ImageAnalyzer

    TextCLI --> TextAnalyzer
    PDFCLI --> PDFAnalyzer
    ImageAnalyzer --> ImageRedactor

    style Common fill:#e1f5ff
    style BuildAnalyzer fill:#b3e5fc
    style PickModel fill:#b3e5fc

Configuration

Text Detection Parameters

| Parameter | Description | Default |
|---|---|---|
| --size | chunk size in characters | 5000 |
| --overlap | overlap between chunks in characters | 300 |
| --min-score | minimum confidence threshold | 0.0 |
| --lang | language code | en |
| --print-text | echo input preview | False |
| --anonymize | enable anonymization mode | False |
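
To make the interaction of `--size` and `--overlap` concrete, here is a minimal sketch of overlapping windows over a long text. It assumes both values are measured in characters and that each window starts `size - overlap` characters after the previous one; the helper name `iter_chunks` is hypothetical.

```python
from typing import Iterator, Tuple


def iter_chunks(
    text: str, size: int = 5000, overlap: int = 300
) -> Iterator[Tuple[int, str]]:
    """Yield (offset, chunk) pairs covering `text` with overlapping windows."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    start = 0
    while start < len(text):
        yield start, text[start:start + size]
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
```

Adding the yielded offset to each per-chunk entity position gives document-level positions, and the overlap keeps an entity that straddles a boundary intact in at least one chunk, provided it is shorter than the overlap.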

Detection Confidence Levels

  • 1.0: exact pattern matches (emails, URLs)
  • 0.85+: high confidence (names with context)
  • 0.5-0.85: medium confidence
  • <0.5: low confidence (may include false positives)
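
As a rough illustration of how the bands above relate to `--min-score`, the sketch below labels a score and filters a list of scored detections. Both helper names are hypothetical, and the band boundaries simply restate the list above.

```python
from typing import List, Tuple


def confidence_band(score: float) -> str:
    """Map a detection score to the bands listed above."""
    if score >= 1.0:
        return "exact"
    if score >= 0.85:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"


def keep_confident(
    scored: List[Tuple[str, float]], min_score: float = 0.0
) -> List[Tuple[str, float]]:
    """Drop (entity_type, score) pairs that fall below `min_score`."""
    return [(label, score) for label, score in scored if score >= min_score]
```

With the default of 0.0 nothing is filtered out, so raising the threshold trades missed detections for fewer false positives.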

Examples

Example 1: Basic Text Redaction

Input:

contact jane smith at [email protected] or call (555) 123-4567

Output:

contact <PERSON_1> at <EMAIL_ADDRESS_1> or call <PHONE_NUMBER_1>
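
The numbered placeholders in this output can be produced by walking the detected spans left to right and keeping one counter per entity type. The sketch below assumes non-overlapping spans given as `(start, end, entity_type)` tuples. Reusing the same number for a repeated value is also an assumption, and `mask_numbered` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity_type)


def mask_numbered(text: str, spans: List[Span]) -> str:
    """Replace each span with <TYPE_n>, numbering distinct values per type."""
    counters: Dict[str, int] = defaultdict(int)
    seen: Dict[Tuple[str, str], int] = {}
    pieces: List[str] = []
    cursor = 0
    for start, end, entity_type in sorted(spans):
        key = (entity_type, text[start:end].lower())
        if key not in seen:
            counters[entity_type] += 1
            seen[key] = counters[entity_type]
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(f"<{entity_type}_{seen[key]}>")
        cursor = end
    pieces.append(text[cursor:])  # trailing text after the last span
    return "".join(pieces)
```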

Example 2: Relationship-Aware Masking

Input:

john's email is [email protected]
sarah's phone is 555-1234

Output:

john's email is <John's EMAIL_ADDRESS>
sarah's phone is <Sarah's PHONE_NUMBER>
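
One simple way to get output of this shape is to attribute each non-name entity to the nearest PERSON that precedes it on the same line. The sketch below illustrates that heuristic only; the real `relationships.py` may use different rules, and `mask_with_owner` is a hypothetical name. Spans are assumed to be non-overlapping `(start, end, entity_type)` tuples.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (start, end, entity_type)


def mask_with_owner(text: str, spans: List[Span]) -> str:
    """Mask non-PERSON entities, labelling them with the preceding name."""
    people = sorted(s for s in spans if s[2] == "PERSON")
    pieces: List[str] = []
    cursor = 0
    for start, end, entity_type in sorted(spans):
        if entity_type == "PERSON":
            continue  # names stay visible, as in the example output
        line_start = text.rfind("\n", 0, start) + 1
        owner: Optional[str] = None
        for p_start, p_end, _ in people:
            # The last match is the nearest name before the entity on this line.
            if line_start <= p_start and p_end <= start:
                owner = text[p_start:p_end].title()
        label = f"<{owner}'s {entity_type}>" if owner else f"<{entity_type}>"
        pieces.append(text[cursor:start])
        pieces.append(label)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Leaving the owner's name in the label keeps the text readable but also keeps the name in the output, so this style suits cases where only the linked values need hiding.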

Example 3: PDF with Multiple Pages

python -m pdfRedactor.cli --in contract.pdf --out contract_redacted.pdf

creates a new PDF with all detected PII regions permanently redacted with black boxes.

Technology Stack

  • Microsoft Presidio: PII detection and anonymization framework
  • spaCy: advanced NLP and named entity recognition
  • Streamlit: modern web application framework
  • pdfminer.six: PDF text extraction
  • pikepdf: PDF manipulation and editing
  • pytesseract: OCR for scanned documents
  • Pillow: image processing

Requirements

see requirements.txt for the complete list:

streamlit>=1.20.0
streamlit-pdf-viewer>=0.0.15
presidio-analyzer>=2.2.0
presidio-anonymizer>=2.2.0
presidio-image-redactor>=0.0.50
spacy>=3.5.0
pdfminer.six>=20221105
pikepdf>=8.0.0
Pillow>=10.0.0
pytesseract>=0.3.10

Privacy & Security

  • all processing happens locally: no data is sent to external servers
  • redacted files are generated with permanent removal (not just visual masking)
  • original files remain unchanged
  • temporary files are automatically cleaned up

Limitations

  • spaCy model accuracy varies by entity type and context
  • scanned PDF processing requires Tesseract OCR installation
  • very large files may require increased memory
  • some context-dependent PII may be missed (requires human review)

Note: this tool aids in PII detection but should not be solely relied upon for compliance. always review redacted documents manually for sensitive use cases.
