🧬 Polymerase Checker

Identification and annotation of RNA Polymerase IV (Pol IV) orthologs in plant genomes using a modular bioinformatics pipeline.

📌 Table of Contents

Overview
Biological Background
The Problem
Key Features
Project Structure
Functional Modules
Pipeline Overview
Methodology
Key Contributions
Future Improvements
Installation
Notes

🔬 Overview

RNA Polymerase IV and V are plant-specific enzymes involved in RNA-directed DNA methylation (RdDM) and transposable element (TE) silencing, both essential for genome stability. However, their sequences are highly similar to RNA Polymerase II, making automatic annotation difficult.

This project provides a reproducible, configurable computational pipeline for identifying Pol IV sequences across plant species, with a focus on NRPD1, the largest subunit of Pol IV.

🧬 Biological Background

RNA polymerases are the enzymes responsible for transcription. Three of the plant RNA polymerases are relevant to this project:

| Polymerase | Function |
|------------|----------|
| Pol II | General transcription |
| Pol IV | siRNA production |
| Pol V | Scaffold RNA for methylation |

Pol IV plays a key role in:

RNA-directed DNA methylation (RdDM) — epigenetic regulation pathway
Transposable element silencing — genome defense mechanism
Genome stability — maintaining structural integrity of plant genomes

⚠️ The Problem

Pol II, IV, and V sequences are highly similar at the sequence level
Automatic annotation is often incorrect, leading to misclassification
Manual curation is slow and not scalable across multiple plant genomes

👉 There is a need for a reliable computational method to distinguish these polymerases accurately.

🧠 Key Features

Automated sequence retrieval and database construction
FASTA and tabular data cleaning & preprocessing
Local BLAST+ integration (via Python)
Custom domain database for Pol II / IV / V distinction
Modular rule-based validation system (JSON configurable)
Statistical analysis of sequences and datasets
Unit-tested and version-controlled pipeline

📂 Project Structure

polymerase_checker/
│
├── Functions/                      # Core cleaning & utility functions
│   ├── cleaning.py                 # String/FASTA/DataFrame cleaning
│   ├── processing.py               # FASTA & tabular data processing
│   ├── alignment.py                # Pairwise alignment & BLAST wrappers
│   └── statistics.py               # Sequence & dataset statistics
│
├── json_files_domains_files/       # Domain rules and configuration (JSON)
│   ├── domain_rules.json           # Rule-based validation definitions
│   ├── pol_iv_domains.json         # Pol IV conserved domain specs
│   └── motif_definitions.json      # Distinguishing motifs (Pol II/IV/V)
│
├── testing_files_jupyter/          # Jupyter notebooks for testing & exploration
│   ├── exploration.ipynb
│   └── validation_tests.ipynb
│
├── requirements.txt                # Python dependencies
└── README.md                       # Project documentation

⚙️ Functional Modules

🧹 Cleaning Functions

Handles normalization and formatting of raw biological data:

Clean strings (case normalization, trimming, formatting)
Apply cleaning to lists, DataFrame columns, and FASTA headers
Normalize sequence formats (uppercase/lowercase consistency)
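
The cleaning steps listed above could look roughly like the following. This is a minimal sketch using only the standard library; the function names (`clean_string`, `clean_fasta_header`, `normalize_sequence`) and the header `seq1` are hypothetical and may not match what `cleaning.py` actually exposes.

```python
import re


def clean_string(text: str) -> str:
    """Trim, collapse runs of whitespace, and lowercase a free-text label."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_fasta_header(header: str) -> str:
    """Drop the leading '>' and keep only the first token (the sequence ID)."""
    tokens = header.lstrip(">").split()
    return tokens[0] if tokens else ""


def normalize_sequence(seq: str) -> str:
    """Remove all whitespace (including line breaks) and uppercase residues."""
    return re.sub(r"\s+", "", seq).upper()


if __name__ == "__main__":
    print(clean_string("  Arabidopsis   THALIANA "))
    print(clean_fasta_header(">seq1 NRPD1 candidate"))
    print(normalize_sequence("mkt ayi\nAKQ"))
```

Keeping each operation in its own small function makes it easy to apply the same cleaning to a list or to a DataFrame column with a simple map.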

🔄 Processing Functions

Reads and structures biological data for downstream analysis:

Read and clean FASTA files
Read and clean tabular datasets
Convert biological sequence objects into dictionaries and structured lists
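
As an illustration of turning a FASTA file into a dictionary, here is a small standard-library sketch. The name `read_fasta` and the choice to key records by the first header token are assumptions made for this example; the project's `processing.py` may rely on a dedicated parser instead.

```python
from io import StringIO
from typing import Dict, List, Optional, TextIO


def read_fasta(handle: TextIO) -> Dict[str, str]:
    """Parse FASTA text into {sequence_id: uppercase_sequence}.

    The ID is the first whitespace-delimited token of the header line.
    A repeated ID overwrites the earlier record.
    """
    records: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            records[current] = []
        elif current is not None:
            records[current].append(line.upper())
    return {seq_id: "".join(chunks) for seq_id, chunks in records.items()}


# In-memory example so the sketch runs without any file on disk.
example = StringIO(">seq1 demo\nmkta\nYIAK\n\n>seq2\nMSTN\n")
sequences = read_fasta(example)
```

The same function accepts an open file, for example `with open(path) as fh: read_fasta(fh)`.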

🧬 Alignment & Sequence Analysis

Core bioinformatics analysis module:

Pairwise sequence alignment
Local BLAST database creation
BLAST search execution (local & remote if applicable)
Pattern detection in sequences
Extraction of relevant reads from FASTA files
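
Pattern detection and the extraction of matching records can be sketched with regular expressions. The helper names and the pattern `WG|GW` below are illustrative placeholders only, not the motifs stored in `motif_definitions.json`.

```python
import re
from typing import Dict, List, Tuple


def find_motif(sequence: str, pattern: str) -> List[Tuple[int, str]]:
    """Return (1-based start, matched text) for each non-overlapping match."""
    return [(m.start() + 1, m.group()) for m in re.finditer(pattern, sequence)]


def extract_matching(records: Dict[str, str], pattern: str) -> Dict[str, str]:
    """Keep only the records whose sequence contains the pattern."""
    return {sid: seq for sid, seq in records.items() if re.search(pattern, seq)}


# Toy sequences, invented for the example.
records = {"seq1": "MKWGAAGWLL", "seq2": "MKTAYIAK"}
hits = find_motif(records["seq1"], r"WG|GW")
selected = extract_matching(records, r"WG|GW")
```

Positions are reported 1-based because that is the convention used for residue coordinates in most sequence annotation formats.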

📊 Statistics

Quality control and exploratory analysis:

Missing values analysis
Column-wise statistics
Correlation analysis
FASTA summary: sequence length distribution, GC content, protein-level analysis
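
A FASTA summary of the kind described above might be computed as follows. This is a sketch under the assumption that sequences are already loaded into a dictionary; `gc_content` and `length_summary` are hypothetical names, and GC content is only meaningful for nucleotide input.

```python
from statistics import mean, median
from typing import Dict


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def length_summary(records: Dict[str, str]) -> Dict[str, float]:
    """Basic length statistics over all sequences in a dataset."""
    lengths = [len(seq) for seq in records.values()]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0, "median": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }


summary = length_summary({"a": "ATGC", "b": "AT"})
```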

🧪 Pipeline Overview

Plant Genome Sequences
        │
        ▼
┌───────────────────┐
│ 1. Database       │──→ Custom "Embryophyta" database
│    Construction   │    (taxonomy-based filtering)
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│ 2. Sequence       │──→ Automated extraction
│    Retrieval      │    Multi-FASTA generation
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│ 3. Preprocessing  │──→ Cleaning sequences & headers
│                   │    Format validation & stats
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│ 4. BLAST Analysis │──→ Local BLAST+ execution
│                   │    Parameter tuning & parsing
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│ 5. Domain         │──→ Conserved domains (A–D, DeCL)
│    Analysis       │    Pol II vs IV vs V distinction
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│ 6. Rule-Based     │──→ JSON-configurable validation
│    Validation ⭐  │    (core innovation)
└───────────────────┘
         │
         ▼
    ✅ Validated Pol IV Orthologs

🔬 Methodology

1. Custom Database — Embryophyta

Built from local genomic resources, containing plant sequences across taxa. Allows filtering by taxonomy and families for targeted searches.

2. Sequence Alignment (BLAST+)

Local BLAST execution for performance
Protein-level comparison (blastp)
Python wrapper for automation, parameter control, and output parsing
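
A thin wrapper around the BLAST+ command-line tools could be structured as below. This is a sketch, not the project's actual `alignment.py`: the function names, default E-value, and hit limit are assumptions. It uses only flags that `makeblastdb` and `blastp` provide, and parses the default 12-column tabular output (`-outfmt 6`).

```python
import csv
import subprocess
from io import StringIO
from typing import Dict, List

# Default column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")


def build_makeblastdb_command(fasta: str, out: str) -> List[str]:
    """Command line that builds a protein BLAST database from a FASTA file."""
    return ["makeblastdb", "-in", fasta, "-dbtype", "prot", "-out", out]


def build_blastp_command(query: str, db: str, evalue: float = 1e-5,
                         max_target_seqs: int = 50) -> List[str]:
    """Command line for a protein search with tabular output."""
    return [
        "blastp", "-query", query, "-db", db,
        "-evalue", str(evalue),
        "-max_target_seqs", str(max_target_seqs),
        "-outfmt", "6",
    ]


def parse_outfmt6(text: str) -> List[Dict[str, object]]:
    """Convert tabular BLAST output into a list of typed dictionaries."""
    hits: List[Dict[str, object]] = []
    for row in csv.reader(StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        hit: Dict[str, object] = dict(zip(OUTFMT6_COLUMNS, row))
        for key in INT_COLUMNS:
            hit[key] = int(row[OUTFMT6_COLUMNS.index(key)])
        for key in FLOAT_COLUMNS:
            hit[key] = float(row[OUTFMT6_COLUMNS.index(key)])
        hits.append(hit)
    return hits


def run_blastp(query: str, db: str, **kwargs) -> List[Dict[str, object]]:
    """Run blastp (must be on PATH) and return the parsed hits."""
    result = subprocess.run(
        build_blastp_command(query, db, **kwargs),
        capture_output=True, text=True, check=True,
    )
    return parse_outfmt6(result.stdout)
```

Separating command construction and output parsing from the actual `subprocess` call keeps both parts testable on machines where BLAST+ is not installed.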

3. Data Preprocessing

Cleaning FASTA files and fixing sequence inconsistencies
Statistical validation: length distribution, GC content, molecular properties

4. Domain Database Construction

Built using known sequences (e.g., Arabidopsis thaliana) to identify conserved domains (A–D, DeCL) and specific motifs.

This enables discrimination between:

| Domain Feature | Pol II | Pol IV | Pol V |
|----------------|--------|--------|-------|
| Conserved domains A–D | ✅ | ✅ | ✅ |
| DeCL domain | ❌ | ✅ | ❌ |
| Specific motifs | Unique | Unique | Unique |

5. Rule-Based Validation System ⭐ (Core Innovation)

Each candidate sequence is validated against a set of configurable rules:

| Rule | Description |
|------|-------------|
| Minimum length | Sequence must meet a minimum length threshold |
| Domain presence | Required conserved domains must be detected |
| Domain order | Domains must appear in the expected order |
| Domain position | Domains must fall within expected positional ranges |
| Conflicting motifs | Absence of motifs specific to Pol II or Pol V |

👉 All rules are stored in JSON files, making the system modular, extensible, and reusable — no need to modify Python code to adjust validation criteria.
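
To make the five rules concrete, here is a minimal sketch of a JSON-driven validator. Everything in it is an assumption for illustration: the JSON keys, the thresholds, the coordinate ranges, the motif name `pol_ii_ctd_repeat`, and the candidate layout are invented and do not reflect the contents of `domain_rules.json`.

```python
import json
from typing import Dict, List

# Hypothetical rule file content; all values are placeholders.
RULES_JSON = """
{
  "min_length": 1000,
  "required_domains": ["A", "B", "C", "D", "DeCL"],
  "domain_order": ["A", "B", "C", "D", "DeCL"],
  "domain_positions": {"A": [1, 400]},
  "forbidden_motifs": ["pol_ii_ctd_repeat"]
}
"""


def validate(candidate: Dict, rules: Dict) -> List[str]:
    """Return the names of failed rules; an empty list means the candidate passes.

    candidate = {"length": int,
                 "domains": {name: (start, end)},
                 "motifs": [motif names detected in the sequence]}
    """
    failures: List[str] = []
    domains = candidate["domains"]

    if candidate["length"] < rules["min_length"]:
        failures.append("min_length")

    if any(name not in domains for name in rules["required_domains"]):
        failures.append("domain_presence")

    # Detected domains, sorted by start, must follow the expected order.
    expected = [name for name in rules["domain_order"] if name in domains]
    observed = sorted(expected, key=lambda name: domains[name][0])
    if observed != expected:
        failures.append("domain_order")

    # Each constrained domain must lie fully inside its allowed range.
    for name, (low, high) in rules["domain_positions"].items():
        if name in domains:
            start, end = domains[name]
            if start < low or end > high:
                failures.append("domain_position")
                break

    if set(candidate["motifs"]) & set(rules["forbidden_motifs"]):
        failures.append("conflicting_motifs")

    return failures


rules = json.loads(RULES_JSON)
good = {
    "length": 1450,
    "domains": {"A": (50, 300), "B": (350, 600), "C": (650, 900),
                "D": (950, 1100), "DeCL": (1200, 1400)},
    "motifs": [],
}
bad = {
    "length": 800,
    "domains": {"A": (500, 700), "B": (100, 300)},
    "motifs": ["pol_ii_ctd_repeat"],
}
```

Returning the list of failed rule names, rather than a single boolean, makes it possible to report why a candidate was rejected and to adjust individual thresholds in the JSON file accordingly.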

6. Automation & Integration

Pipeline steps scripted in Python (a few manual steps remain; see Future Improvements)
Integration of BLAST, FASTA processing, and domain analysis
Format conversion for downstream analysis

7. Testing & Version Control

Unit tests for all validation functions
Git used for version tracking, code stability, and collaboration
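
A unit test for a single validation rule might look like the sketch below, written with the standard `unittest` module. The function `meets_min_length` is a hypothetical stand-in for one of the project's validation functions, and the test framework actually used by the project is not stated here.

```python
import unittest


def meets_min_length(sequence: str, min_length: int) -> bool:
    """Hypothetical rule: the sequence must have at least min_length residues."""
    return len(sequence) >= min_length


class TestMinLength(unittest.TestCase):
    def test_accepts_long_enough(self):
        self.assertTrue(meets_min_length("MKTAYIAK", 5))

    def test_rejects_too_short(self):
        self.assertFalse(meets_min_length("MKT", 5))

    def test_boundary_is_inclusive(self):
        self.assertTrue(meets_min_length("MKTAY", 5))


def run_tests() -> bool:
    """Run the test case programmatically and report overall success."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMinLength)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()


if __name__ == "__main__":
    run_tests()
```

Testing the boundary case explicitly documents whether a threshold is inclusive, which matters when thresholds are edited in JSON rather than in code.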

🚀 Key Contributions

Development of a modular bioinformatics pipeline for polymerase annotation
Automation of the sequence annotation workflow from retrieval to validation
Creation of a flexible JSON-based rule validation system (core innovation)
Improved accuracy in identifying Pol IV orthologs across plant species

🔮 Future Improvements

Phylogenetic analysis integration
Expansion to Pol V and other polymerase complexes
Additional statistical validation methods
Broader database integration (e.g., Phytozome)
Full pipeline automation (reduce remaining manual steps)

🛠️ Installation

# Clone the repository
git clone https://github.com/YOUR_USERNAME/polymerase_checker.git
cd polymerase_checker

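# Install Python dependencies
pip install -r requirements.txt

# Local BLAST searches additionally require the NCBI BLAST+ command-line tools
# (makeblastdb, blastp) to be installed separately and available on PATH.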

📎 Notes

All functions are modular and can be used independently
JSON configuration files allow rule modification without touching the codebase
The pipeline is designed for reproducibility and extensibility across plant species
