Intelligent Chinese Medical Text Analysis Toolkit
NLP Toolkit for Chinese Medical Text — Named Entity Recognition, Privacy De-identification, Medical Record Structuring — all in one line of code
MedTextCN is a natural language processing (NLP) toolkit designed specifically for Chinese medical text. As healthcare digitization and smart medicine advance, unstructured Chinese medical records hold rich clinical information, but using that information means balancing data utilization against privacy protection.
MedTextCN aims to address the following core challenges:
- Medical Named Entity Recognition (NER): Accurately extract diseases, symptoms, drugs, examinations, anatomical sites, and treatment methods from unstructured medical text, providing foundational capabilities for downstream clinical decision support and medical knowledge graph construction.
- Patient Privacy Protection: Automatically detect Personally Identifiable Information (PII) in medical records with multiple de-identification strategies, helping healthcare institutions meet compliance requirements under the Personal Information Protection Law (PIPL).
- Medical Record Structuring: Automatically parse free-text medical records into standardized SOAP format (Subjective, Objective, Assessment, Plan), improving data usability and interoperability.
Whether you are a medical AI R&D team, a hospital IT department, or a healthcare data scientist, MedTextCN provides ready-to-use Chinese medical text processing capabilities.
Built-in knowledge base with 831+ medical entities across 6 categories: diseases (158), symptoms (116), drugs (175), examinations (135), anatomical sites (163), and treatments (84). Using a hybrid of dictionary lookup and rule-based matching, it recognizes entities efficiently without a GPU.
Supports detection of 8 PII types: ID card numbers, mobile phone numbers, patient names, medical insurance card numbers, email addresses, home addresses, dates of birth, and visit dates. Provides 4 de-identification modes (mask, replace, hash, and remove) designed to help meet PIPL requirements.
Intelligently parses 9 medical record section types: chief complaint, present illness history, past history, personal history, family history, physical examination, auxiliary examination, diagnosis, and treatment plan. Automatically outputs standardized SOAP format structured results.
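As a rough illustration of how header-based section parsing could feed a SOAP structure, the sketch below maps section headers to SOAP buckets with a regular expression. It is not MedTextCN's implementation: `structure_sketch` and `SECTION_MAP` are hypothetical names, the field names `personal_history`, `family_history`, and `auxiliary_exam` are assumptions, and so is the bucket chosen for those three sections. The English headers stand in for the Chinese headers a real record would use.

```python
import re

# header -> (SOAP bucket, field name); entries marked "assumed" are not
# confirmed by this README.
SECTION_MAP = {
    "Chief Complaint": ("subjective", "chief_complaint"),
    "Present Illness": ("subjective", "present_illness"),
    "Past History": ("subjective", "past_history"),
    "Personal History": ("subjective", "personal_history"),    # assumed
    "Family History": ("subjective", "family_history"),        # assumed
    "Physical Examination": ("objective", "physical_exam"),
    "Auxiliary Examination": ("objective", "auxiliary_exam"),  # assumed
    "Diagnosis": ("assessment", "diagnosis"),
    "Treatment Plan": ("plan", "treatment_plan"),
}

# A header is a known section name at the start of a line, followed by an
# ASCII or full-width colon.
HEADER_RE = re.compile(
    r"^\s*(" + "|".join(re.escape(h) for h in SECTION_MAP) + r")\s*[:：]\s*",
    re.MULTILINE,
)


def structure_sketch(text):
    """Split a record on section headers and group the bodies by SOAP bucket."""
    result = {"subjective": {}, "objective": {}, "assessment": {}, "plan": {}}
    matches = list(HEADER_RE.finditer(text))
    for i, m in enumerate(matches):
        # A section body runs until the next header or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        bucket, field = SECTION_MAP[m.group(1)]
        result[bucket][field] = text[m.end():end].strip()
    return result
```

Text that precedes the first recognized header is ignored in this sketch; a production parser would need a policy for such unlabelled text.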
Enhanced domain-specific tokenization based on the jieba engine, with a built-in medical vocabulary and entity-priority matching strategy, effectively solving segmentation challenges for long medical terms and professional terminology.
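To make the idea of entity-priority matching concrete, here is a minimal forward maximum matching sketch in plain Python. It does not use jieba and is not MedTextCN's tokenizer: `tokenize_sketch` and `MEDICAL_VOCAB` are hypothetical, and the vocabulary is a tiny stand-in for a real medical dictionary.

```python
# Hypothetical mini-vocabulary (English glosses in comments).
MEDICAL_VOCAB = {
    "急性心肌梗死",  # acute myocardial infarction
    "心肌",          # myocardium
    "冠状动脉造影",  # coronary angiography
    "冠状动脉",      # coronary artery
    "患者",          # patient
    "入院",          # admitted
    "检查",          # examination
}


def tokenize_sketch(text, vocab=MEDICAL_VOCAB):
    """Forward maximum matching: at each position take the longest vocab term."""
    max_len = max(map(len, vocab))
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


if __name__ == "__main__":
    print(tokenize_sketch("患者因急性心肌梗死入院"))
```

Because the longest term wins, a long term such as 急性心肌梗死 stays in one piece instead of being split at 心肌. With jieba, a comparable effect is commonly obtained by registering terms through `jieba.add_word` or `jieba.load_userdict`; whether MedTextCN does exactly that is not stated here.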
Built-in concurrent processing framework supporting parallel processing of large volumes of medical text, with progress callback mechanisms for easy integration into data processing pipelines.
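A concurrent batch runner with a progress callback can be sketched with the standard library's `ThreadPoolExecutor`. The function name `batch_process_sketch` and its `worker` parameter are hypothetical; the `max_workers` and `progress_callback` parameters mirror the names used by `batch_analyze` in this README, but the library's internals may differ.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def batch_process_sketch(texts, worker, max_workers=4, progress_callback=None):
    """Run worker(text) for every text in a thread pool, preserving input order."""
    results = [None] * len(texts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which input index each future belongs to.
        futures = {pool.submit(worker, t): i for i, t in enumerate(texts)}
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            if progress_callback:
                # Called from the submitting thread, so no locking is needed.
                progress_callback(done, len(texts))
    return results


if __name__ == "__main__":
    lengths = batch_process_sketch(
        ["text one", "text two", "text three"],
        worker=len,
        progress_callback=lambda i, n: print(f"Progress: {i}/{n}"),
    )
    print(lengths)
```

Results are written back by index, so the output order matches the input order even though tasks finish in arbitrary order. An exception raised by any worker propagates from `future.result()`; a real pipeline would decide whether to abort or record the failure per item.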
Ready-to-use REST API service with 7 API endpoints, covering comprehensive analysis, entity extraction, PII detection, text de-identification, text structuring, health check, and API documentation.
Provides four commands: analyze, serve, demo, and version, enabling direct text analysis, API service startup, demo execution, and version checking from the terminal.
Complete Dockerfile for one-click build and deployment, simplifying deployment in server and cloud environments.
- Python >= 3.9
- pip (Python package manager)
pip install medtextcn

from medtextcn import analyze_text
# One-line comprehensive analysis (NER + PII Detection + Structuring)
result = analyze_text("Patient Zhang San, male, 65 years old, admitted for diabetes")
print(result)

Example output:
{
  "entities": [
    {"text": "diabetes", "type": "disease", "start": 12, "end": 15}
  ],
  "pii": [
    {"text": "Zhang San", "type": "name", "start": 2, "end": 4}
  ],
  "structured": {
    "subjective": "Patient Zhang San, male, 65 years old, admitted for diabetes"
  }
}

Extract diseases, symptoms, drugs, examinations, anatomical sites, and treatments from medical text.
from medtextcn import extract_entities
entities = extract_entities("Hypertension with coronary heart disease, need ECG examination")
for entity in entities:
    print(f"Entity: {entity['text']}, Type: {entity['type']}, Position: {entity['start']}-{entity['end']}")

Supported Entity Types:
| Type | Label | Built-in Count | Examples |
|---|---|---|---|
| Disease | disease | 158 | Diabetes, Hypertension, Coronary heart disease |
| Symptom | symptom | 116 | Headache, Fever, Cough |
| Drug | drug | 175 | Amoxicillin, Metformin |
| Examination | examination | 135 | Blood routine, ECG, CT |
| Anatomy | anatomy | 163 | Heart, Liver, Left lung |
| Treatment | treatment | 84 | Surgery, Transfusion, Dialysis |
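The dictionary-matching side of entity recognition can be illustrated with a longest-match scan over a small term dictionary. This is a sketch under stated assumptions rather than the library's code: `ENTITY_DICT`, `build_matcher`, and `extract_entities_sketch` are hypothetical, the six terms stand in for the built-in knowledge base, and English terms are used to mirror the translated examples in this README.

```python
import re

# Hypothetical mini-dictionary: term -> label (labels taken from the table above).
ENTITY_DICT = {
    "hypertension": "disease",
    "coronary heart disease": "disease",
    "diabetes": "disease",
    "heart": "anatomy",
    "ecg": "examination",
    "metformin": "drug",
}


def build_matcher(entity_dict):
    # Longest terms first: regex alternation takes the first alternative that
    # matches, so this ordering yields the longest dictionary match.
    terms = sorted(entity_dict, key=len, reverse=True)
    # Word boundaries suit this English stand-in; unsegmented Chinese text
    # would be matched without them.
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def extract_entities_sketch(text, entity_dict=ENTITY_DICT):
    """Return non-overlapping dictionary matches with labels and offsets."""
    matcher = build_matcher(entity_dict)
    return [
        {
            "text": m.group(),
            "type": entity_dict[m.group().lower()],
            "start": m.start(),
            "end": m.end(),
        }
        for m in matcher.finditer(text)
    ]


if __name__ == "__main__":
    sample = "Hypertension with coronary heart disease, need ECG examination"
    for entity in extract_entities_sketch(sample):
        print(entity)
```

Here "coronary heart disease" is reported as one disease entity instead of also yielding "heart" as an anatomy entity, which is the behaviour longest-match ordering is meant to guarantee.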
from medtextcn import detect_pii
pii_results = detect_pii("Patient Li Si, ID card 110101199003076039, phone 13800138000")
for item in pii_results:
    print(f"Type: {item['type']}, Content: {item['text']}")

from medtextcn import deidentify_text
# Mask mode (default)
safe = deidentify_text("Patient Wang Wu, phone 15098765432", mode="mask")
print(safe)
# Output: Patient Wang *, phone 150****5432
# Replace mode
safe = deidentify_text("Patient Wang Wu, phone 15098765432", mode="replace")
print(safe)
# Output: Patient [Patient Name], phone [Phone Number]
# Hash mode
safe = deidentify_text("Patient Wang Wu, phone 15098765432", mode="hash")
print(safe)
# Output: Patient a3f2e1..., phone 7b8c9d...
# Remove mode
safe = deidentify_text("Patient Wang Wu, phone 15098765432", mode="remove")
print(safe)
# Output: Patient, phone

Supported PII Types:
| Type | Label | Description |
|---|---|---|
| Patient Name | name | Chinese name recognition |
| ID Card | id_card | 18-digit ID card number |
| Phone Number | phone | 11-digit mobile phone number |
| Medical Insurance Card | medical_card | Medical insurance card number |
| Email | email | Email address |
| Address | address | Chinese address information |
| Date of Birth | birth_date | Date of birth |
| Visit Date | visit_date | Visit/admission date |
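The following sketch shows one way regex-based detection and the four de-identification modes could fit together. It covers only three of the eight PII types, because names and addresses need context analysis that a plain regex cannot provide. The patterns, the placeholder strings, the 12-character digest length, and the names `detect_pii_sketch` and `deidentify_sketch` are all assumptions made for illustration.

```python
import hashlib
import re

# Assumed patterns for three of the PII labels listed above.
PII_PATTERNS = {
    "id_card": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    "phone": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"),
}
PLACEHOLDERS = {"id_card": "[ID Card]", "phone": "[Phone Number]", "email": "[Email]"}


def detect_pii_sketch(text):
    """Return regex hits sorted by position."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append({"type": label, "text": m.group(),
                         "start": m.start(), "end": m.end()})
    return sorted(hits, key=lambda h: h["start"])


def _transform(hit, mode):
    value = hit["text"]
    if mode == "mask":
        if len(value) <= 7:
            return "*" * len(value)
        # Keep the first 3 and last 4 characters, star out the middle.
        return value[:3] + "*" * (len(value) - 7) + value[-4:]
    if mode == "replace":
        return PLACEHOLDERS[hit["type"]]
    if mode == "hash":
        # Plain digest for brevity; low-entropy values such as phone numbers
        # would call for a keyed hash (HMAC) in practice.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]
    if mode == "remove":
        return ""
    raise ValueError(f"unknown mode: {mode}")


def deidentify_sketch(text, mode="mask"):
    """Rewrite each detected span according to the chosen mode."""
    out, cursor = [], 0
    for hit in detect_pii_sketch(text):
        if hit["start"] < cursor:
            continue  # skip spans overlapping one already rewritten
        out.append(text[cursor:hit["start"]])
        out.append(_transform(hit, mode))
        cursor = hit["end"]
    out.append(text[cursor:])
    return "".join(out)


if __name__ == "__main__":
    print(deidentify_sketch("Patient Wang Wu, phone 15098765432", mode="mask"))
```

The digit lookarounds in the phone pattern keep it from matching inside an 18-digit ID number, which is why both can be detected in the same sentence without overlapping.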
Parse free-text medical records into standard SOAP format.
from medtextcn import structure_text
text = """Chief Complaint: Recurrent cough for 3 days.
Present Illness: Cough appeared 3 days ago after catching cold, with white thin sputum, no fever.
Past History: 5-year history of hypertension, controlled with oral antihypertensive drugs.
Physical Examination: Body temperature 36.5°C, coarse breath sounds in both lungs.
Diagnosis: Acute bronchitis.
Treatment Plan: Anti-infection treatment, symptomatic management."""
structured = structure_text(text)
print(structured)

Output:
{
  "subjective": {
    "chief_complaint": "Recurrent cough for 3 days.",
    "present_illness": "Cough appeared 3 days ago after catching cold, with white thin sputum, no fever.",
    "past_history": "5-year history of hypertension, controlled with oral antihypertensive drugs."
  },
  "objective": {
    "physical_exam": "Body temperature 36.5°C, coarse breath sounds in both lungs."
  },
  "assessment": {
    "diagnosis": "Acute bronchitis."
  },
  "plan": {
    "treatment_plan": "Anti-infection treatment, symptomatic management."
  }
}

from medtextcn import tokenize
tokens = tokenize("Patient admitted for acute myocardial infarction, needs coronary angiography examination")
print(tokens)
# Output: ['Patient', 'admitted', 'for', 'acute myocardial infarction', '...', 'coronary angiography', 'examination']

from medtextcn import batch_analyze
texts = [
"Patient Zhang San, male, 65 years old, admitted for diabetes",
"Patient Li Si, female, 45 years old, hypertension with coronary heart disease",
"Patient Wang Wu, male, 72 years old, acute exacerbation of COPD",
]
results = batch_analyze(texts, max_workers=4, progress_callback=lambda i, n: print(f"Progress: {i}/{n}"))

medtextcn serve --host 0.0.0.0 --port 8080

| Method | Path | Description |
|---|---|---|
| POST | /api/v1/analyze | Comprehensive text analysis |
| POST | /api/v1/entities | Entity extraction |
| POST | /api/v1/pii/detect | PII detection |
| POST | /api/v1/pii/deidentify | Text de-identification |
| POST | /api/v1/structure | Text structuring |
| GET | /api/v1/health | Health check |
| GET | /docs | API Documentation (Swagger UI) |
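For callers who prefer Python to a shell, the POST endpoints in the table can be reached with the standard library alone. The sketch below assumes a server listening on `localhost:8080` and a JSON body with a `text` field (plus `mode` for de-identification), as in this README's curl examples. The helper names `build_request` and `post_json` are hypothetical, and the response schema is not assumed.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumes `medtextcn serve --port 8080` is running


def build_request(path, payload):
    """Build a JSON POST request for one of the /api/v1 endpoints."""
    data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path, payload, timeout=10):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example call (requires a running server):
# post_json("/api/v1/pii/deidentify",
#           {"text": "Patient Wang Wu, phone 15098765432", "mode": "mask"})
```

Encoding with `ensure_ascii=False` and UTF-8 keeps Chinese input readable on the wire, although the default ASCII-escaped form would be equally valid JSON.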
# Comprehensive analysis
curl -X POST http://localhost:8080/api/v1/analyze \
-H "Content-Type: application/json" \
-d '{"text": "Patient Zhang San, male, 65 years old, admitted for diabetes"}'
# Entity extraction
curl -X POST http://localhost:8080/api/v1/entities \
-H "Content-Type: application/json" \
-d '{"text": "Hypertension with coronary heart disease, need ECG examination"}'
# PII detection
curl -X POST http://localhost:8080/api/v1/pii/detect \
-H "Content-Type: application/json" \
-d '{"text": "Patient Li Si, ID card 110101199003076039"}'
# Text de-identification
curl -X POST http://localhost:8080/api/v1/pii/deidentify \
-H "Content-Type: application/json" \
-d '{"text": "Patient Wang Wu, phone 15098765432", "mode": "mask"}'

# Text analysis
medtextcn analyze "Patient Zhang San, male, 65 years old, admitted for diabetes"
# Start API service
medtextcn serve --host 0.0.0.0 --port 8080
# Run demo
medtextcn demo
# Check version
medtextcn version

# Build image
docker build -t medtextcn .
# Run container
docker run -p 8080:8080 medtextcn
# Run in background
docker run -d --name medtextcn-server -p 8080:8080 medtextcn

MedTextCN adopts a layered modular architecture with clear responsibilities and loose coupling for independent use and extensibility.
+-----------------------------------------------------------+
|                       Access Layer                        |
| +----------+  +---------------+   +---------------------+ |
| | CLI Tool |  | FastAPI REST  |   |  Python SDK (API)   | |
| +----+-----+  +-------+-------+   +----------+----------+ |
+------+----------------+----------------------+------------+
|                       Service Layer                       |
| +----------+  +----------+  +----------+  +-------------+ |
| | Analyzer |  |  Batch   |  | De-ident |  | Structurer  | |
| |  Engine  |  |  Engine  |  |  Engine  |  |   Engine    | |
| +----+-----+  +----+-----+  +----+-----+  +------+------+ |
+------+-------------+-------------+---------------+--------+
|                        Core Layer                         |
| +----------+  +----------+  +----------+  +-------------+ |
| |   NER    |  |   PII    |  | Tokenizer|  |   Section   | |
| |  Engine  |  | Detector |  |  Engine  |  |   Parser    | |
| +----+-----+  +----+-----+  +----+-----+  +------+------+ |
+------+-------------+-------------+---------------+--------+
|                        Data Layer                         |
| +----------+  +----------+  +----------+  +-------------+ |
| | Medical  |  |   PII    |  | Medical  |  |   Section   | |
| | Entity KB|  |  Rules   |  |   Dict   |  |  Templates  | |
| |  (831+)  |  | (8 types)|  | (jieba+) |  |  (9 types)  | |
| +----------+  +----------+  +----------+  +-------------+ |
+-----------------------------------------------------------+
Module Overview:
| Module | Responsibility | Key Technology |
|---|---|---|
| NER Engine | Medical entity recognition | Dictionary matching + Rule engine |
| PII Detector | Personal information detection | Regex + Context analysis |
| Tokenizer | Medical text tokenization | jieba + Medical dictionary enhancement |
| Section Parser | Medical record section classification | Pattern matching + Keyword extraction |
| De-identification Engine | Text de-identification | Multi-strategy de-identification pipeline |
| Structuring Engine | SOAP format output | Section classification + Field mapping |
| Batch Engine | Concurrent batch processing | ThreadPoolExecutor |
| REST Service | HTTP API | FastAPI + Pydantic |
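The loose coupling between the service and core layers can be pictured as a service object that receives its core-layer functions as plain callables. This is only an architectural sketch: `AnalyzerSketch` and the stub functions are hypothetical, and the README does not say how the real analyzer is wired. The result keys `entities`, `pii`, and `structured` follow the example output of `analyze_text`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AnalyzerSketch:
    """Service-layer object composed from core-layer callables."""

    extract_entities: Callable[[str], List[dict]]
    detect_pii: Callable[[str], List[dict]]
    structure: Callable[[str], Dict[str, object]]

    def analyze(self, text: str) -> dict:
        # Each core component is invoked independently, so any of them can be
        # swapped or tested in isolation.
        return {
            "entities": self.extract_entities(text),
            "pii": self.detect_pii(text),
            "structured": self.structure(text),
        }


if __name__ == "__main__":
    # Stub core functions stand in for the real NER, PII, and parser engines.
    analyzer = AnalyzerSketch(
        extract_entities=lambda t: [],
        detect_pii=lambda t: [],
        structure=lambda t: {"subjective": t},
    )
    print(analyzer.analyze("Patient admitted for diabetes"))
```

Passing the engines in as arguments is one common way to get the "independent use and extensibility" described above, since a caller can replace a single engine without touching the others.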
medtextcn/
├── docs/
│   ├── logo.jpg  # Project Logo
│   ├── README.en.md  # English README
│   └── README.zh-TW.md  # Traditional Chinese README
├── medtextcn/
│   ├── __init__.py  # Package entry, exports public API
│   ├── cli.py  # CLI command-line tool
│   ├── api/
│   │   ├── __init__.py
│   │   ├── app.py  # FastAPI application
│   │   ├── routes.py  # API route definitions
│   │   └── schemas.py  # Pydantic data models
│   ├── core/
│   │   ├── __init__.py
│   │   ├── ner.py  # NER entity recognition engine
│   │   ├── pii.py  # PII detection engine
│   │   ├── tokenizer.py  # Medical-enhanced tokenizer
│   │   └── parser.py  # Medical record section parser
│   ├── services/
│   │   ├── __init__.py
│   │   ├── analyzer.py  # Comprehensive analysis service
│   │   ├── deidentifier.py  # De-identification service
│   │   ├── structurer.py  # Structuring service
│   │   └── batch.py  # Batch processing engine
│   └── data/
│       ├── entities/  # Medical entity dictionaries
│       │   ├── diseases.json  # Disease entities (158)
│       │   ├── symptoms.json  # Symptom entities (116)
│       │   ├── drugs.json  # Drug entities (175)
│       │   ├── examinations.json  # Examination entities (135)
│       │   ├── anatomy.json  # Anatomy entities (163)
│       │   └── treatments.json  # Treatment entities (84)
│       ├── pii_patterns.json  # PII detection rules
│       ├── section_patterns.json  # Section classification rules
│       └── medical_words.txt  # Medical tokenization dictionary
├── tests/
│   ├── test_ner.py  # NER engine tests
│   ├── test_pii.py  # PII detection tests
│   ├── test_deidentify.py  # De-identification tests
│   ├── test_structure.py  # Structuring tests
│   ├── test_tokenizer.py  # Tokenizer tests
│   └── test_api.py  # API endpoint tests
├── Dockerfile  # Docker build file
├── setup.py  # Package installation config
├── pyproject.toml  # Project metadata
├── README.md  # Project documentation (Chinese)
├── CONTRIBUTING.md  # Contributing guide
├── LICENSE  # MIT open source license
└── .github/
    └── workflows/
        └── ci.yml  # CI/CD configuration
# Clone repository
git clone https://github.com/gitstq/MedTextCN.git
cd MedTextCN
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/macOS
# venv\Scripts\activate # Windows
# Install development dependencies
pip install -e ".[dev]"
# Install project
pip install -e .

# Run all tests
pytest tests/ -v
# Run specific module tests
pytest tests/test_ner.py -v
pytest tests/test_pii.py -v
# Check test coverage
pytest tests/ --cov=medtextcn --cov-report=html

# Code formatting
black medtextcn/ tests/
# Linting
flake8 medtextcn/ tests/
# Import sorting
isort medtextcn/ tests/

- Expand medical entity library to 2000+ entities
- Add surgical records, nursing records, and other section type parsing
- Support custom entity dictionary loading
- Performance optimization: 50% improvement in large text processing speed
- Integrate pre-trained Chinese medical language models, drawing on resources such as the CBLUE benchmark and the CMeKG knowledge graph
- Provide BERT/BiLSTM-CRF model inference interface
- Support model fine-tuning and custom training pipelines
- Entity recognition F1 score improvement to 90%+
- Medical knowledge graph construction tools
- Multi-modal support (imaging reports, lab reports)
- Clinical Decision Support (CDS) basic interfaces
- Distributed deployment and high availability architecture
- Web-based visual management interface
We welcome and appreciate contributions of any kind, whether bug reports, documentation improvements, or code pull requests.
Please read the Contributing Guide for detailed contribution workflows and guidelines.
Quick Contribution Workflow:
- Fork this repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Create a Pull Request
This project is licensed under the MIT License.
MIT License
Copyright (c) 2024 gitstq (Qiqi)
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
The development of MedTextCN has benefited from the following outstanding open-source projects and technical communities:
- jieba - Chinese text segmentation, the underlying engine for MedTextCN's medical tokenizer
- FastAPI - High-performance Python web framework, providing foundational support for the REST API
- Pydantic - Data validation and serialization, ensuring API data consistency
- CBLUE - Chinese Biomedical Language Understanding Evaluation benchmark, providing reference for model evaluation
- CMeKG - Chinese Medical Knowledge Graph, an important reference for entity library construction
- PyPI - Python Package Index, the distribution platform for MedTextCN
Thanks to all researchers and developers who have contributed to the Chinese medical NLP field.
MedTextCN - Making Chinese Medical Text Analysis Simpler
Made with ❤️ by gitstq (Qiqi)