Identification and annotation of RNA Polymerase IV (Pol IV) orthologs in plant genomes using a modular bioinformatics pipeline.
- Overview
- Biological Background
- The Problem
- Key Features
- Project Structure
- Functional Modules
- Pipeline Overview
- Methodology
- Key Contributions
- Future Improvements
- Installation
- Notes
RNA Polymerase IV and V are plant-specific enzymes involved in RNA-directed DNA methylation (RdDM) and transposable element (TE) silencing, both essential for genome stability. However, their sequences are highly similar to RNA Polymerase II, making automatic annotation difficult.
This project develops a robust, reproducible, and flexible computational pipeline to accurately identify Pol IV sequences, especially the NRPD1 subunit, across plant species.
RNA polymerases are the enzymes responsible for transcription. Three of the plant nuclear RNA polymerases are relevant to this project:
| Polymerase | Function |
|---|---|
| Pol II | General transcription |
| Pol IV | siRNA production |
| Pol V | Scaffold RNA for methylation |
Pol IV plays a key role in:
- RNA-directed DNA methylation (RdDM): an epigenetic regulation pathway
- Transposable element silencing: a genome defense mechanism
- Genome stability: maintaining the structural integrity of plant genomes
- Pol II, IV, and V sequences are highly similar at the sequence level
- Automatic annotation is often incorrect, leading to misclassification
- Manual curation is slow and not scalable across multiple plant genomes
There is a need for a reliable computational method to distinguish these polymerases accurately.
- Automated sequence retrieval and database construction
- FASTA and tabular data cleaning & preprocessing
- Local BLAST+ integration (via Python)
- Custom domain database for Pol II / IV / V distinction
- Modular rule-based validation system (JSON configurable)
- Statistical analysis of sequences and datasets
- Unit-tested and version-controlled pipeline
polymerase_checker/
│
├── Functions/                     # Core cleaning & utility functions
│   ├── cleaning.py                # String/FASTA/DataFrame cleaning
│   ├── processing.py              # FASTA & tabular data processing
│   ├── alignment.py               # Pairwise alignment & BLAST wrappers
│   └── statistics.py              # Sequence & dataset statistics
│
├── json_files_domains_files/      # Domain rules and configuration (JSON)
│   ├── domain_rules.json          # Rule-based validation definitions
│   ├── pol_iv_domains.json        # Pol IV conserved domain specs
│   └── motif_definitions.json     # Distinguishing motifs (Pol II/IV/V)
│
├── testing_files_jupyter/         # Jupyter notebooks for testing & exploration
│   ├── exploration.ipynb
│   └── validation_tests.ipynb
│
├── requirements.txt               # Python dependencies
└── README.md                      # Project documentation
`cleaning.py` handles normalization and formatting of raw biological data:
- Clean strings (case normalization, trimming, formatting)
- Apply cleaning to lists, DataFrame columns, and FASTA headers
- Normalize sequence formats (uppercase/lowercase consistency)
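A minimal sketch of what such cleaning helpers might look like. The function names (`clean_string`, `clean_fasta_header`, `clean_sequence`) and the exact normalization choices are assumptions made for illustration, not the actual contents of `cleaning.py`.

```python
import re


def clean_string(value: str, case: str = "upper") -> str:
    """Collapse runs of whitespace, trim the ends, and normalize case."""
    text = re.sub(r"\s+", " ", value).strip()
    if case == "upper":
        return text.upper()
    if case == "lower":
        return text.lower()
    return text  # any other value of `case` leaves the case untouched


def clean_fasta_header(header: str) -> str:
    """Drop the leading '>' and tidy whitespace, keeping the original case."""
    return clean_string(header.lstrip(">"), case="keep")


def clean_sequence(sequence: str) -> str:
    """Remove whitespace and digits (e.g. position counters), then uppercase."""
    return re.sub(r"[\s\d]", "", sequence).upper()


def clean_list(values, case: str = "upper"):
    """Apply clean_string to every item of a list (or a DataFrame column's values)."""
    return [clean_string(v, case=case) for v in values]
```

The same `clean_string` function can be mapped over a pandas column (for example with `Series.map`) if the tabular data is held in a DataFrame.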
`processing.py` reads and structures biological data for downstream analysis:
- Read and clean FASTA files
- Read and clean tabular datasets
- Convert biological sequence objects into dictionaries and structured lists
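As an illustration of the FASTA-to-dictionary step, here is a small dependency-free parser. It is a sketch under the assumption that the record ID is the first whitespace-delimited token of the header; the names `parse_fasta` and `to_rows` are hypothetical and the project itself may rely on a library parser instead.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record_id: sequence}."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            # Fall back to a generated ID when the header is empty.
            current_id = parts[0] if parts else f"unnamed_{len(records) + 1}"
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line.upper()
    return records


def to_rows(records: Dict[str, str]) -> List[dict]:
    """Convert the dictionary into a list of rows suitable for a table."""
    return [
        {"id": rec_id, "sequence": seq, "length": len(seq)}
        for rec_id, seq in records.items()
    ]
```

With a file on disk, the parser can be fed the open file handle directly: `parse_fasta(open(path))`.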
`alignment.py` is the core bioinformatics analysis module:
- Pairwise sequence alignment
- Local BLAST database creation
- BLAST search execution (local & remote if applicable)
- Pattern detection in sequences
- Extraction of relevant reads from FASTA files
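The database creation and search steps listed above could be wrapped roughly as follows. This is a sketch, not the project's wrapper: the function names and default values are assumptions, while the command-line flags (`-in`, `-dbtype`, `-out`, `-query`, `-db`, `-outfmt`, `-evalue`) are standard BLAST+ options. Running the commands requires BLAST+ to be installed and on the `PATH`.

```python
import subprocess
from typing import List


def makeblastdb_cmd(fasta_path: str, db_name: str) -> List[str]:
    """Build the command that creates a local protein BLAST database."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "prot", "-out", db_name]


def blastp_cmd(query_path: str, db_name: str, out_path: str,
               evalue: float = 1e-5) -> List[str]:
    """Build a blastp command that writes tabular output (outfmt 6)."""
    return [
        "blastp",
        "-query", query_path,
        "-db", db_name,
        "-out", out_path,
        "-outfmt", "6",
        "-evalue", str(evalue),
    ]


def run(cmd: List[str]) -> None:
    """Execute a command and raise CalledProcessError if it fails."""
    subprocess.run(cmd, check=True)
```

Building the argument list separately from executing it keeps the parameters easy to inspect and test without BLAST+ installed; a typical call would be `run(makeblastdb_cmd("proteins.fasta", "embryophyta_db"))`, where both file names are placeholders.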
`statistics.py` covers quality control and exploratory analysis:
- Missing values analysis
- Column-wise statistics
- Correlation analysis
- FASTA summary: sequence length distribution, GC content, protein-level analysis
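A FASTA summary of this kind might be computed as in the sketch below. The helper names are hypothetical, and GC content is only meaningful for nucleotide sequences. The mean and median are computed by hand rather than imported from the standard-library `statistics` module, since a project file named `statistics.py` could shadow it.

```python
from typing import Dict, List


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence (0.0 if empty)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def median(values: List[int]) -> float:
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2


def summarize_lengths(records: Dict[str, str]) -> dict:
    """Basic length statistics for a {record_id: sequence} dictionary."""
    lengths = [len(seq) for seq in records.values()]
    if not lengths:
        return {"count": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
        "median": median(lengths),
    }
```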
Plant Genome Sequences
         │
         ▼
┌───────────────────┐
│ 1. Database       ├─── Custom "Embryophyta" database
│    Construction   │    (taxonomy-based filtering)
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│ 2. Sequence       ├─── Automated extraction
│    Retrieval      │    Multi-FASTA generation
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│ 3. Preprocessing  ├─── Cleaning sequences & headers
│                   │    Format validation & stats
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│ 4. BLAST Analysis ├─── Local BLAST+ execution
│                   │    Parameter tuning & parsing
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│ 5. Domain         ├─── Conserved domains (A–D, DeCL)
│    Analysis       │    Pol II vs IV vs V distinction
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│ 6. Rule-Based     ├─── JSON-configurable validation
│    Validation ⭐  │    (core innovation)
└────────┬──────────┘
         │
         ▼
Validated Pol IV Orthologs
The custom "Embryophyta" database is built from local genomic resources and contains plant sequences across taxa. It can be filtered by taxonomy and by family for targeted searches.
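Taxonomy-based filtering can be pictured with the short sketch below. The record layout (a dictionary with a `lineage` list) and the function name `filter_by_taxon` are assumptions for illustration; the actual database format is not described here.

```python
from typing import Dict, List


def filter_by_taxon(records: List[Dict], taxon: str) -> List[Dict]:
    """Keep records whose lineage contains the given taxon (case-insensitive)."""
    wanted = taxon.lower()
    return [
        rec for rec in records
        if wanted in (name.lower() for name in rec.get("lineage", []))
    ]


# Hypothetical example records; IDs are placeholders.
records = [
    {"id": "rec1", "species": "Arabidopsis thaliana",
     "lineage": ["Viridiplantae", "Embryophyta", "Brassicaceae"]},
    {"id": "rec2", "species": "Chlamydomonas reinhardtii",
     "lineage": ["Viridiplantae", "Chlorophyta"]},
]

land_plants = filter_by_taxon(records, "Embryophyta")
```

The same function works at any rank present in the lineage, so a family-level search is simply `filter_by_taxon(records, "Brassicaceae")`.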
- Local BLAST execution for performance
- Protein-level comparison (`blastp`)
- Python wrapper for automation, parameter control, and output parsing
- Cleaning FASTA files and fixing sequence inconsistencies
- Statistical validation: length distribution, GC content, molecular properties
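For the output-parsing part of the BLAST step, the sketch below reads BLAST tabular output (`-outfmt 6`), whose twelve default columns are fixed by BLAST+. The function name `parse_blast6` and the default E-value cutoff are assumptions, not values taken from the pipeline.

```python
from typing import Dict, Iterable, List

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = ["length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"]
FLOAT_COLUMNS = ["pident", "evalue", "bitscore"]


def parse_blast6(lines: Iterable[str], max_evalue: float = 1e-5) -> List[Dict]:
    """Parse outfmt 6 lines into dictionaries, keeping hits at or below max_evalue."""
    hits = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < len(BLAST6_COLUMNS):
            continue  # skip malformed rows
        hit = dict(zip(BLAST6_COLUMNS, fields))
        for key in INT_COLUMNS:
            hit[key] = int(hit[key])
        for key in FLOAT_COLUMNS:
            hit[key] = float(hit[key])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    return hits
```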
The custom domain database is built from known sequences (e.g., Arabidopsis thaliana) to identify conserved domains (A–D, DeCL) and specific motifs.
This enables discrimination between Pol II, Pol IV, and Pol V:
| Domain Feature | Pol II | Pol IV | Pol V |
|---|---|---|---|
| Conserved domains A–D | ✓ | ✓ | ✓ |
| DeCL domain | ✗ | ✓ | ✓ |
| Specific motifs | Unique | Unique | Unique |
Each candidate sequence is validated against a set of configurable rules:
| Rule | Description |
|---|---|
| Minimum length | Sequence must meet a minimum length threshold |
| Domain presence | Required conserved domains must be detected |
| Domain order | Domains must appear in the expected order |
| Domain position | Domains must fall within expected positional ranges |
| Conflicting motifs | Absence of motifs specific to Pol II or Pol V |
All rules are stored in JSON files, which keeps the system modular, extensible, and reusable: validation criteria can be adjusted without modifying the Python code.
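To make the rule table concrete, here is a minimal sketch of a JSON-driven validator covering the five rules. The JSON schema, key names, thresholds, position ranges, and motif labels are all hypothetical placeholders chosen for illustration; they are not the contents of `domain_rules.json`.

```python
import json
from typing import Dict, List

# Hypothetical rule set; every value here is an illustrative placeholder.
RULES_JSON = """
{
  "min_length": 1000,
  "required_domains": ["A", "B", "C", "D", "DeCL"],
  "domain_order": ["A", "B", "C", "D", "DeCL"],
  "position_ranges": {"DeCL": [1000, 1500]},
  "forbidden_motifs": ["pol_ii_specific_example", "pol_v_specific_example"]
}
"""


def validate(candidate: Dict, rules: Dict) -> List[str]:
    """Return the names of the rules the candidate fails (empty list = valid)."""
    failures = []
    starts = {d["name"]: d["start"] for d in candidate["domains"]}

    if candidate["length"] < rules["min_length"]:
        failures.append("min_length")

    if any(name not in starts for name in rules["required_domains"]):
        failures.append("domain_presence")

    # Detected domains must start in the same order as the expected order.
    positions = [starts[n] for n in rules["domain_order"] if n in starts]
    if positions != sorted(positions):
        failures.append("domain_order")

    for name, (low, high) in rules["position_ranges"].items():
        if name in starts and not low <= starts[name] <= high:
            failures.append("domain_position")
            break

    if set(candidate["motifs"]) & set(rules["forbidden_motifs"]):
        failures.append("conflicting_motifs")

    return failures


rules = json.loads(RULES_JSON)
candidate = {
    "length": 1450,
    "domains": [
        {"name": "A", "start": 100}, {"name": "B", "start": 300},
        {"name": "C", "start": 450}, {"name": "D", "start": 600},
        {"name": "DeCL", "start": 1200},
    ],
    "motifs": [],
}
failed_rules = validate(candidate, rules)
```

Returning the list of failed rule names, rather than a single boolean, makes it easier to see why a candidate was rejected. Tightening or relaxing a criterion then only requires editing the JSON.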
- Entire pipeline automated in Python
- Integration of BLAST, FASTA processing, and domain analysis
- Format conversion for downstream analysis
- Unit tests for all validation functions
- Git used for version tracking, code stability, and collaboration
- Development of a modular bioinformatics pipeline for polymerase annotation
- Automation of the sequence annotation workflow from retrieval to validation
- Creation of a flexible JSON-based rule validation system (core innovation)
- Improved accuracy in identifying Pol IV orthologs across plant species
- Phylogenetic analysis integration
- Expansion to Pol V and other polymerase complexes
- Additional statistical validation methods
- Broader database integration (e.g., Phytozome)
- Full pipeline automation (reduce remaining manual steps)
# Clone the repository
git clone https://github.com/YOUR_USERNAME/polymerase_checker.git
cd polymerase_checker
# Install dependencies
pip install -r requirements.txt
- All functions are modular and can be used independently
- JSON configuration files allow rule modification without touching the codebase
- The pipeline is designed for reproducibility and extensibility across plant species