MedTextCN

Intelligent Chinese Medical Text Analysis Toolkit

NLP Toolkit for Chinese Medical Text — Named Entity Recognition, Privacy De-identification, Medical Record Structuring — all in one line of code


Language

简体中文 | English | 繁體中文


Introduction

MedTextCN is a natural language processing toolkit designed specifically for Chinese medical text. As healthcare digitization and smart medicine advance, unstructured Chinese medical records hold a wealth of clinical information, yet putting that information to use must be balanced against protecting patient privacy.

MedTextCN aims to address the following core challenges:

  • Medical Named Entity Recognition (NER): Accurately extract diseases, symptoms, drugs, examinations, anatomical sites, and treatment methods from unstructured medical text, providing foundational capabilities for downstream clinical decision support and medical knowledge graph construction.
  • Patient Privacy Protection: Automatically detect Personally Identifiable Information (PII) in medical records with multiple de-identification strategies, helping healthcare institutions meet compliance requirements under the Personal Information Protection Law (PIPL).
  • Medical Record Structuring: Automatically parse free-text medical records into standardized SOAP format (Subjective, Objective, Assessment, Plan), improving data usability and interoperability.

Whether you are a medical AI R&D team, a hospital IT department, or a healthcare data scientist, MedTextCN provides ready-to-use Chinese medical text processing capabilities.


Key Features

🏥 Chinese Medical NER Engine

Built-in knowledge base with 831+ medical entities across 6 categories: diseases (158), symptoms (116), drugs (175), examinations (135), anatomical sites (163), and treatments (84). It uses a hybrid dictionary and rule-based matching strategy, so entity recognition runs efficiently without a GPU.

🔒 Chinese PII Detection & De-identification

Supports detection of 8 PII types: ID card numbers, mobile phone numbers, patient names, medical insurance card numbers, email addresses, home addresses, dates of birth, and visit dates. Provides 4 de-identification modes (mask, replace, hash, and remove) to support compliance with the PIPL.

📋 Medical Record Structuring

Intelligently parses 9 medical record section types: chief complaint, present illness history, past history, personal history, family history, physical examination, auxiliary examination, diagnosis, and treatment plan. Automatically outputs standardized SOAP format structured results.

✂️ Medical-Enhanced Tokenizer

Enhanced domain-specific tokenization based on the jieba engine, with a built-in medical vocabulary and entity-priority matching strategy, effectively solving segmentation challenges for long medical terms and professional terminology.

⚡ Batch Processing Engine

Built-in concurrent processing framework supporting parallel processing of large volumes of medical text, with progress callback mechanisms for easy integration into data processing pipelines.

🌐 FastAPI REST Service

Ready-to-use REST API service with 7 API endpoints, covering comprehensive analysis, entity extraction, PII detection, text de-identification, text structuring, health check, and API documentation.

💻 CLI Command-Line Tool

Provides four commands: analyze, serve, demo, and version, enabling direct text analysis, API service startup, demo execution, and version checking from the terminal.

🐳 Docker Containerized Deployment

Complete Dockerfile for one-click build and deployment, simplifying deployment in server and cloud environments.


Quick Start

Requirements

  • Python >= 3.9
  • pip (Python package manager)

Installation

pip install medtextcn

One-Line Quick Start

from medtextcn import analyze_text

# One-line comprehensive analysis (NER + PII Detection + Structuring)
result = analyze_text("Patient Zhang San, male, 65 years old, admitted for diabetes")

print(result)

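Example output (the start and end values are character offsets that evidently refer to the Chinese-language input, not to the English rendering shown above):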

{
  "entities": [
    {"text": "diabetes", "type": "disease", "start": 12, "end": 15}
  ],
  "pii": [
    {"text": "Zhang San", "type": "name", "start": 2, "end": 4}
  ],
  "structured": {
    "subjective": "Patient Zhang San, male, 65 years old, admitted for diabetes"
  }
}
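
Because the result shown above is an ordinary dictionary, it can be post-processed with standard Python. The following is a minimal sketch that assumes the result has exactly the shape of the example output; summarize is a hypothetical helper for illustration and is not part of MedTextCN.

```python
from collections import defaultdict

# Sample result copied from the example output above.
result = {
    "entities": [{"text": "diabetes", "type": "disease", "start": 12, "end": 15}],
    "pii": [{"text": "Zhang San", "type": "name", "start": 2, "end": 4}],
    "structured": {
        "subjective": "Patient Zhang San, male, 65 years old, admitted for diabetes"
    },
}


def summarize(result):
    """Group entity and PII strings by type label (hypothetical helper)."""
    summary = {"entities": defaultdict(list), "pii": defaultdict(list)}
    for key in ("entities", "pii"):
        for item in result.get(key, []):
            summary[key][item["type"]].append(item["text"])
    # Convert the defaultdicts to plain dicts for clean printing and comparison.
    return {key: dict(groups) for key, groups in summary.items()}


summary = summarize(result)
print(summary)
```

Grouping by type label like this is one convenient way to feed the output into a report or a downstream filter.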

Detailed Usage Guide

1. Medical Named Entity Recognition (NER)

Extract diseases, symptoms, drugs, examinations, anatomical sites, and treatments from medical text.

from medtextcn import extract_entities

entities = extract_entities("Hypertension with coronary heart disease, need ECG examination")

for entity in entities:
    print(f"Entity: {entity['text']}, Type: {entity['type']}, Position: {entity['start']}-{entity['end']}")

Supported Entity Types:

| Type | Label | Built-in Count | Examples |
|------|-------|----------------|----------|
| Disease | disease | 158 | Diabetes, Hypertension, Coronary heart disease |
| Symptom | symptom | 116 | Headache, Fever, Cough |
| Drug | drug | 175 | Amoxicillin, Metformin |
| Examination | examination | 135 | Blood routine, ECG, CT |
| Anatomy | anatomy | 163 | Heart, Liver, Left lung |
| Treatment | treatment | 84 | Surgery, Transfusion, Dialysis |
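
To illustrate the general idea of dictionary-based entity matching, here is a minimal sketch of a longest-match scan over a toy dictionary. It is not MedTextCN's implementation: ENTITY_DICT, match_entities, and the six sample terms are assumptions made for this example, and the real engine is described as combining dictionary lookup with additional rules.

```python
# Toy dictionary (an assumption for illustration); the real knowledge base
# is described as holding 831 entries across six categories.
ENTITY_DICT = {
    "糖尿病": "disease",
    "高血压": "disease",
    "冠心病": "disease",
    "头痛": "symptom",
    "心电图": "examination",
    "二甲双胍": "drug",
}
MAX_LEN = max(len(term) for term in ENTITY_DICT)


def match_entities(text):
    """Scan left to right, preferring the longest dictionary term at each position."""
    entities = []
    i = 0
    while i < len(text):
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in ENTITY_DICT:
                entities.append({
                    "text": candidate,
                    "type": ENTITY_DICT[candidate],
                    "start": i,
                    "end": i + length,  # end offset is exclusive
                })
                i += length
                break
        else:
            # No dictionary term starts here; move on one character.
            i += 1
    return entities


found = match_entities("高血压合并冠心病，需行心电图检查")
for entity in found:
    print(entity)
```

Preferring the longest candidate keeps a long term from being split into shorter dictionary entries that happen to be its substrings.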

2. PII Detection & De-identification

PII Detection

from medtextcn import detect_pii

pii_results = detect_pii("Patient Li Si, ID card 110101199003076039, phone 13800138000")

for item in pii_results:
    print(f"Type: {item['type']}, Content: {item['text']}")
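
For the fixed-format PII types (18-digit ID card numbers, 11-digit mobile numbers, email addresses), regular expressions are a natural starting point. The sketch below is an assumption-laden illustration, not the package's rule set: PII_PATTERNS, find_pii, and id_checksum_ok are hypothetical names, and the patterns are simplified. The check-character rule follows the GB 11643-1999 resident ID standard. Names and addresses need contextual rules and are not covered here. The sample ID number is synthetic and was chosen to satisfy the check-character rule.

```python
import re

# Illustrative patterns only (assumed); the project keeps its real rules in
# data/pii_patterns.json.
PII_PATTERNS = {
    "id_card": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    "phone": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}

ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CHARS = "10X98765432"


def id_checksum_ok(id_number):
    """Verify the check character of an 18-character resident ID number."""
    total = sum(int(d) * w for d, w in zip(id_number[:17], ID_WEIGHTS))
    return ID_CHECK_CHARS[total % 11] == id_number[17].upper()


def find_pii(text):
    """Return regex hits sorted by position; end offsets are exclusive."""
    hits = []
    for pii_type, pattern in PII_PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append({"type": pii_type, "text": m.group(),
                         "start": m.start(), "end": m.end()})
    return sorted(hits, key=lambda h: h["start"])


hits = find_pii("Patient Li Si, ID card 110101199003076034, phone 13800138000")
for hit in hits:
    print(hit["type"], hit["text"])
```

The digit lookarounds stop the phone pattern from firing on an 11-digit run inside a longer number, and the checksum can be used to lower false positives on arbitrary 18-digit strings.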

Text De-identification

from medtextcn import deidentify_text

# Mask mode (default)
safe = deidentify_text("Patient Wang Wu, phone 15098765432", mode="mask")
print(safe)
# Output: Patient Wang *, phone 150****5432

# Replace mode
safe = deidentify_text("Patient Wang Wu, phone 15098765432", mode="replace")
print(safe)
# Output: Patient [Patient Name], phone [Phone Number]

# Hash mode
safe = deidentify_text("Patient Wang Wu, phone 15098765432", mode="hash")
print(safe)
# Output: Patient a3f2e1..., phone 7b8c9d...

# Remove mode
safe = deidentify_text("Patient Wang Wu, phone 15098765432", mode="remove")
print(safe)
# Output: Patient, phone
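
The four modes can be pictured as four replacement functions applied to each detected span. The sketch below shows this for phone numbers only; deidentify_phone and PHONE_RE are hypothetical names, and the choice of SHA-256 truncated to 12 hex characters is an assumption, since the source does not say which hash MedTextCN uses.

```python
import hashlib
import re

# Simplified mainland mobile number pattern (assumed for illustration).
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")


def deidentify_phone(text, mode="mask"):
    """Apply one of four strategies to every phone number found in text."""
    def transform(match):
        value = match.group()
        if mode == "mask":
            # Keep the first 3 and last 4 digits, hide the middle 4.
            return value[:3] + "****" + value[-4:]
        if mode == "replace":
            return "[Phone Number]"
        if mode == "hash":
            return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]
        if mode == "remove":
            return ""
        raise ValueError(f"unknown mode: {mode}")

    return PHONE_RE.sub(transform, text)


masked = deidentify_phone("phone 15098765432", mode="mask")
print(masked)
```

One design point worth keeping in mind: an unsalted hash of a short numeric identifier can be reversed by enumerating all candidates, so a keyed hash (for example HMAC with a secret key) is a common choice when hashed values must resist re-identification.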

Supported PII Types:

| Type | Label | Description |
|------|-------|-------------|
| Patient Name | name | Chinese name recognition |
| ID Card | id_card | 18-digit ID card number |
| Phone Number | phone | 11-digit mobile phone number |
| Medical Insurance Card | medical_card | Medical insurance card number |
| Email | email | Email address |
| Address | address | Chinese address information |
| Date of Birth | birth_date | Date of birth |
| Visit Date | visit_date | Visit/admission date |

3. Medical Record Structuring

Parse free-text medical records into standard SOAP format.

from medtextcn import structure_text

text = """Chief Complaint: Recurrent cough for 3 days.
Present Illness: Cough appeared 3 days ago after catching cold, with white thin sputum, no fever.
Past History: 5-year history of hypertension, controlled with oral antihypertensive drugs.
Physical Examination: Body temperature 36.5°C, coarse breath sounds in both lungs.
Diagnosis: Acute bronchitis.
Treatment Plan: Anti-infection treatment, symptomatic management."""

structured = structure_text(text)
print(structured)

Output:

{
  "subjective": {
    "chief_complaint": "Recurrent cough for 3 days.",
    "present_illness": "Cough appeared 3 days ago after catching cold, with white thin sputum, no fever.",
    "past_history": "5-year history of hypertension, controlled with oral antihypertensive drugs."
  },
  "objective": {
    "physical_exam": "Body temperature 36.5\u00b0C, coarse breath sounds in both lungs."
  },
  "assessment": {
    "diagnosis": "Acute bronchitis."
  },
  "plan": {
    "treatment_plan": "Anti-infection treatment, symptomatic management."
  }
}
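
The mapping from section headers to SOAP fields can be sketched with a header pattern and a lookup table. This is an illustrative sketch only: SECTION_MAP, HEADER_RE, and to_soap are hypothetical names, the table covers just the six headers used in the example above (in English), and the real parser is described as using pattern matching plus keyword extraction.

```python
import re

# Hypothetical header-to-SOAP mapping for the six sections in the example.
SECTION_MAP = {
    "Chief Complaint": ("subjective", "chief_complaint"),
    "Present Illness": ("subjective", "present_illness"),
    "Past History": ("subjective", "past_history"),
    "Physical Examination": ("objective", "physical_exam"),
    "Diagnosis": ("assessment", "diagnosis"),
    "Treatment Plan": ("plan", "treatment_plan"),
}
# Accept both the ASCII colon and the full-width colon after a header.
HEADER_RE = re.compile(
    r"^(%s)\s*[:：]\s*(.*)$" % "|".join(map(re.escape, SECTION_MAP))
)


def to_soap(text):
    """Assign each 'Header: content' line to its SOAP group and field."""
    soap = {"subjective": {}, "objective": {}, "assessment": {}, "plan": {}}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = HEADER_RE.match(line)
        if match:
            current = SECTION_MAP[match.group(1)]
            soap[current[0]][current[1]] = match.group(2)
        elif current:
            # A line without a header continues the previous section.
            soap[current[0]][current[1]] += " " + line
    return soap


sample = "Chief Complaint: Recurrent cough for 3 days.\nDiagnosis: Acute bronchitis."
soap = to_soap(sample)
print(soap)
```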

4. Medical-Enhanced Tokenization

from medtextcn import tokenize

tokens = tokenize("Patient admitted for acute myocardial infarction, needs coronary angiography examination")
print(tokens)
# Output: ['Patient', 'admitted', 'for', 'acute myocardial infarction', '...', 'coronary angiography', 'examination']
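
To show why a domain vocabulary matters for long terms, here is a small forward maximum matching tokenizer in pure Python. It is a stand-in technique chosen so the example has no dependencies; it is not jieba's algorithm and not MedTextCN's tokenizer. MEDICAL_TERMS and tokenize_fmm are hypothetical names. With jieba itself, extra vocabulary is typically supplied through jieba.load_userdict or jieba.add_word.

```python
# Toy medical vocabulary (assumed for illustration).
MEDICAL_TERMS = {"急性心肌梗死", "冠状动脉造影", "心肌梗死", "冠状动脉"}
MAX_TERM_LEN = max(len(term) for term in MEDICAL_TERMS)


def tokenize_fmm(text):
    """Forward maximum matching: take the longest known term, else one character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try candidates from longest to shortest (length >= 2).
        for length in range(min(MAX_TERM_LEN, len(text) - i), 1, -1):
            if text[i:i + length] in MEDICAL_TERMS:
                tokens.append(text[i:i + length])
                i += length
                break
        else:
            # No multi-character term matched; emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens


tokens = tokenize_fmm("急性心肌梗死需行冠状动脉造影")
print(tokens)
```

Because the longest entry wins, the six-character terms stay whole even though shorter entries such as 心肌梗死 and 冠状动脉 are also in the vocabulary.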

5. Batch Processing

from medtextcn import batch_analyze

texts = [
    "Patient Zhang San, male, 65 years old, admitted for diabetes",
    "Patient Li Si, female, 45 years old, hypertension with coronary heart disease",
    "Patient Wang Wu, male, 72 years old, acute exacerbation of COPD",
]

results = batch_analyze(texts, max_workers=4, progress_callback=lambda i, n: print(f"Progress: {i}/{n}"))
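
A batch helper with the same call shape (max_workers plus a two-argument progress callback) can be sketched on top of the standard library's ThreadPoolExecutor. This is an assumption about the general pattern, not the package's code: batch_apply is a hypothetical function, and len stands in for the per-text analysis.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def batch_apply(func, texts, max_workers=4, progress_callback=None):
    """Run func over texts in a thread pool; results keep the input order."""
    results = [None] * len(texts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember each future's input position so order can be restored.
        futures = {pool.submit(func, text): idx for idx, text in enumerate(texts)}
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            if progress_callback:
                progress_callback(done, len(texts))
    return results


progress = []
lengths = batch_apply(
    len,
    ["糖尿病", "高血压合并冠心病", "慢阻肺急性加重"],
    max_workers=2,
    progress_callback=lambda i, n: progress.append((i, n)),
)
print(lengths, progress)
```

The callback runs in the calling thread as each task finishes, so it can update a progress bar without extra locking. Threads help most when the per-text work waits on I/O; for CPU-bound pure-Python work the speedup is usually limited.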

6. FastAPI REST Service

Start Service

medtextcn serve --host 0.0.0.0 --port 8080

API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| POST | /api/v1/analyze | Comprehensive text analysis |
| POST | /api/v1/entities | Entity extraction |
| POST | /api/v1/pii/detect | PII detection |
| POST | /api/v1/pii/deidentify | Text de-identification |
| POST | /api/v1/structure | Text structuring |
| GET | /api/v1/health | Health check |
| GET | /docs | API Documentation (Swagger UI) |

Request Examples

# Comprehensive analysis
curl -X POST http://localhost:8080/api/v1/analyze \
  -H "Content-Type: application/json" \
  -d '{"text": "Patient Zhang San, male, 65 years old, admitted for diabetes"}'

# Entity extraction
curl -X POST http://localhost:8080/api/v1/entities \
  -H "Content-Type: application/json" \
  -d '{"text": "Hypertension with coronary heart disease, need ECG examination"}'

# PII detection
curl -X POST http://localhost:8080/api/v1/pii/detect \
  -H "Content-Type: application/json" \
  -d '{"text": "Patient Li Si, ID card 110101199003076039"}'

# Text de-identification
curl -X POST http://localhost:8080/api/v1/pii/deidentify \
  -H "Content-Type: application/json" \
  -d '{"text": "Patient Wang Wu, phone 15098765432", "mode": "mask"}'
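
The same requests can be issued from Python with only the standard library. The sketch below mirrors the de-identification curl call above; build_request and BASE_URL are hypothetical names, and the response format is not assumed, so the sending step is left commented out.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # host and port from the serve example above


def build_request(path, payload):
    """Build a JSON POST request for one of the documented endpoints."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request(
    "/api/v1/pii/deidentify",
    {"text": "Patient Wang Wu, phone 15098765432", "mode": "mask"},
)
print(req.get_method(), req.full_url)

# To send it against a running server:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(json.loads(resp.read().decode("utf-8")))
```

Encoding the body as UTF-8 with ensure_ascii=False keeps Chinese text readable in logs while remaining valid JSON.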

7. CLI Command-Line Tool

# Text analysis
medtextcn analyze "Patient Zhang San, male, 65 years old, admitted for diabetes"

# Start API service
medtextcn serve --host 0.0.0.0 --port 8080

# Run demo
medtextcn demo

# Check version
medtextcn version

8. Docker Deployment

# Build image
docker build -t medtextcn .

# Run container
docker run -p 8080:8080 medtextcn

# Run in background
docker run -d --name medtextcn-server -p 8080:8080 medtextcn

Technical Architecture

MedTextCN uses a layered, modular architecture: each layer has a clear responsibility and is loosely coupled to the others, so modules can be used independently and extended.

+-------------------------------------------------------------+
|                    Access Layer                             |
|  +----------+  +---------------+  +----------------------+  |
|  | CLI Tool |  | FastAPI REST  |  |   Python SDK (API)   |  |
|  +----+-----+  +-------+-------+  +----------+-----------+  |
+-------------------------------------------------------------+
|                    Service Layer                            |
|  +----------+  +----------+  +----------+  +-------------+  |
|  | Analyzer |  | Batch    |  | De-ident |  | Structurer  |  |
|  | Engine   |  | Engine   |  | Engine   |  | Engine      |  |
|  +----+-----+  +----+-----+  +----+-----+  +------+------+  |
+-------------------------------------------------------------+
|                    Core Layer                               |
|  +----------+  +----------+  +----------+  +-------------+  |
|  | NER      |  | PII      |  | Tokenizer|  | Section     |  |
|  | Engine   |  | Detector |  | Engine   |  | Parser      |  |
|  +----+-----+  +----+-----+  +----+-----+  +------+------+  |
+-------------------------------------------------------------+
|                    Data Layer                               |
|  +----------+  +----------+  +----------+  +-------------+  |
|  | Medical  |  | PII      |  | Medical  |  | Section     |  |
|  | Entity KB|  | Rules    |  | Dict     |  | Templates   |  |
|  |  (831+)  |  | (8 types)|  | (jieba+) |  |  (9 types)  |  |
|  +----------+  +----------+  +----------+  +-------------+  |
+-------------------------------------------------------------+

Module Overview:

| Module | Responsibility | Key Technology |
|--------|----------------|----------------|
| NER Engine | Medical entity recognition | Dictionary matching + Rule engine |
| PII Detector | Personal information detection | Regex + Context analysis |
| Tokenizer | Medical text tokenization | jieba + Medical dictionary enhancement |
| Section Parser | Medical record section classification | Pattern matching + Keyword extraction |
| De-identification Engine | Text de-identification | Multi-strategy de-identification pipeline |
| Structuring Engine | SOAP format output | Section classification + Field mapping |
| Batch Engine | Concurrent batch processing | ThreadPoolExecutor |
| REST Service | HTTP API | FastAPI + Pydantic |

Project Structure

medtextcn/
├── docs/
│   ├── logo.jpg                  # Project Logo
│   ├── README.en.md              # English README
│   └── README.zh-TW.md           # Traditional Chinese README
├── medtextcn/
│   ├── __init__.py               # Package entry, exports public API
│   ├── cli.py                    # CLI command-line tool
│   ├── api/
│   │   ├── __init__.py
│   │   ├── app.py                # FastAPI application
│   │   ├── routes.py             # API route definitions
│   │   └── schemas.py            # Pydantic data models
│   ├── core/
│   │   ├── __init__.py
│   │   ├── ner.py                # NER entity recognition engine
│   │   ├── pii.py                # PII detection engine
│   │   ├── tokenizer.py          # Medical-enhanced tokenizer
│   │   └── parser.py             # Medical record section parser
│   ├── services/
│   │   ├── __init__.py
│   │   ├── analyzer.py           # Comprehensive analysis service
│   │   ├── deidentifier.py       # De-identification service
│   │   ├── structurer.py         # Structuring service
│   │   └── batch.py              # Batch processing engine
│   └── data/
│       ├── entities/             # Medical entity dictionaries
│       │   ├── diseases.json     # Disease entities (158)
│       │   ├── symptoms.json     # Symptom entities (116)
│       │   ├── drugs.json        # Drug entities (175)
│       │   ├── examinations.json  # Examination entities (135)
│       │   ├── anatomy.json      # Anatomy entities (163)
│       │   └── treatments.json   # Treatment entities (84)
│       ├── pii_patterns.json     # PII detection rules
│       ├── section_patterns.json # Section classification rules
│       └── medical_words.txt    # Medical tokenization dictionary
├── tests/
│   ├── test_ner.py               # NER engine tests
│   ├── test_pii.py               # PII detection tests
│   ├── test_deidentify.py        # De-identification tests
│   ├── test_structure.py         # Structuring tests
│   ├── test_tokenizer.py         # Tokenizer tests
│   └── test_api.py               # API endpoint tests
├── Dockerfile                     # Docker build file
├── setup.py                      # Package installation config
├── pyproject.toml                # Project metadata
├── README.md                     # Project documentation (Chinese)
├── CONTRIBUTING.md               # Contributing guide
├── LICENSE                       # MIT open source license
└── .github/
    └── workflows/
        └── ci.yml                # CI/CD configuration

Development Guide

Environment Setup

# Clone repository
git clone https://github.com/gitstq/MedTextCN.git
cd MedTextCN

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/macOS
# venv\Scripts\activate   # Windows

# Install the project in editable mode together with its development dependencies
pip install -e ".[dev]"

# Or install the project alone, without development dependencies
pip install -e .

Running Tests

# Run all tests
pytest tests/ -v

# Run specific module tests
pytest tests/test_ner.py -v
pytest tests/test_pii.py -v

# Check test coverage
pytest tests/ --cov=medtextcn --cov-report=html

Code Style

# Code formatting
black medtextcn/ tests/

# Linting
flake8 medtextcn/ tests/

# Import sorting
isort medtextcn/ tests/

Roadmap

v1.1 - Enhancement & Optimization (Planned)

  • Expand medical entity library to 2000+ entities
  • Add surgical records, nursing records, and other section type parsing
  • Support custom entity dictionary loading
  • Performance optimization: target a 50% speedup when processing large texts

v1.2 - Model Integration (In Planning)

  • Integrate pre-trained Chinese medical language models, drawing on resources such as the CBLUE benchmark and the CMeKG knowledge graph
  • Provide BERT/BiLSTM-CRF model inference interface
  • Support model fine-tuning and custom training pipelines
  • Raise entity recognition F1 score to 90%+

v2.0 - Platformization (Long-term Vision)

  • Medical knowledge graph construction tools
  • Multi-modal support (imaging reports, lab reports)
  • Clinical Decision Support (CDS) basic interfaces
  • Distributed deployment and high availability architecture
  • Web-based visual management interface

Contributing

We welcome and appreciate contributions in any form, whether bug reports, documentation improvements, or code pull requests.

Please read the Contributing Guide for detailed contribution workflows and guidelines.

Quick Contribution Workflow:

  1. Fork this repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Create a Pull Request

License

This project is licensed under the MIT License.

MIT License

Copyright (c) 2024 gitstq (Qiqi)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Acknowledgements

The development of MedTextCN has benefited from the following outstanding open-source projects and technical communities:

  • jieba - Chinese text segmentation, the underlying engine for MedTextCN's medical tokenizer
  • FastAPI - High-performance Python web framework, providing foundational support for the REST API
  • Pydantic - Data validation and serialization, ensuring API data consistency
  • CBLUE - Chinese Biomedical Language Understanding Evaluation benchmark, providing a reference for model evaluation
  • CMeKG - Chinese Medical Knowledge Graph, an important reference for entity library construction
  • PyPI - Python Package Index, the distribution platform for MedTextCN

Thanks to all researchers and developers who have contributed to the Chinese medical NLP field.


MedTextCN - Making Chinese Medical Text Analysis Simpler

Made with ❤️ by gitstq (Qiqi)