FatineHic/Polymerase-checker

🧬 Polymerase Checker

Identification and annotation of RNA Polymerase IV (Pol IV) orthologs in plant genomes using a modular bioinformatics pipeline.

Python BLAST+



🔬 Overview

RNA Polymerase IV and V are plant-specific enzymes involved in RNA-directed DNA methylation (RdDM) and transposable element (TE) silencing, both essential for genome stability. However, their sequences are highly similar to RNA Polymerase II, making automatic annotation difficult.

This project develops a robust, reproducible, and flexible computational pipeline to accurately identify Pol IV sequences, especially the NRPD1 subunit, across plant species.


🧬 Biological Background

RNA polymerases are the enzymes responsible for transcription. In plants, three polymerases are relevant to this project:

| Polymerase | Function |
|---|---|
| Pol II | General transcription |
| Pol IV | siRNA production |
| Pol V | Scaffold RNA for methylation |

Pol IV plays a key role in:

  • RNA-directed DNA methylation (RdDM): epigenetic regulation pathway
  • Transposable element silencing: genome defense mechanism
  • Genome stability: maintaining structural integrity of plant genomes

⚠️ The Problem

  • Pol II, IV, and V sequences are highly similar at the sequence level
  • Automatic annotation is often incorrect, leading to misclassification
  • Manual curation is slow and not scalable across multiple plant genomes

👉 There is a need for a reliable computational method to distinguish these polymerases accurately.


🧠 Key Features

  • Automated sequence retrieval and database construction
  • FASTA and tabular data cleaning & preprocessing
  • Local BLAST+ integration (via Python)
  • Custom domain database for Pol II / IV / V distinction
  • Modular rule-based validation system (JSON configurable)
  • Statistical analysis of sequences and datasets
  • Unit-tested and version-controlled pipeline

📂 Project Structure

polymerase_checker/
│
├── Functions/                      # Core cleaning & utility functions
│   ├── cleaning.py                 # String/FASTA/DataFrame cleaning
│   ├── processing.py               # FASTA & tabular data processing
│   ├── alignment.py                # Pairwise alignment & BLAST wrappers
│   └── statistics.py               # Sequence & dataset statistics
│
├── json_files_domains_files/       # Domain rules and configuration (JSON)
│   ├── domain_rules.json           # Rule-based validation definitions
│   ├── pol_iv_domains.json         # Pol IV conserved domain specs
│   └── motif_definitions.json      # Distinguishing motifs (Pol II/IV/V)
│
├── testing_files_jupyter/          # Jupyter notebooks for testing & exploration
│   ├── exploration.ipynb
│   └── validation_tests.ipynb
│
├── requirements.txt                # Python dependencies
└── README.md                       # Project documentation

⚙️ Functional Modules

🧹 Cleaning Functions

Handles normalization and formatting of raw biological data:

  • Clean strings (case normalization, trimming, formatting)
  • Apply cleaning to lists, DataFrame columns, and FASTA headers
  • Normalize sequence formats (uppercase/lowercase consistency)
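
The cleaning steps listed above could look roughly like the following. This is a minimal sketch, not the contents of `cleaning.py`: the function names `clean_string`, `clean_fasta_header`, and `normalize_sequence` are hypothetical, and the header in the example is made up.

```python
import re


def clean_string(text: str, lowercase: bool = True) -> str:
    """Trim the string, collapse runs of whitespace, and optionally lowercase it."""
    cleaned = re.sub(r"\s+", " ", text).strip()
    return cleaned.lower() if lowercase else cleaned


def clean_fasta_header(header: str) -> str:
    """Drop the leading '>' and replace characters that are awkward in record IDs."""
    body = clean_string(header.lstrip(">"), lowercase=False)
    return re.sub(r"[^A-Za-z0-9_.-]+", "_", body).strip("_")


def normalize_sequence(sequence: str) -> str:
    """Remove all whitespace and uppercase the sequence."""
    return re.sub(r"\s+", "", sequence).upper()


# Made-up inputs for illustration only.
print(clean_fasta_header(">  seq001 | NRPD1  [Arabidopsis thaliana] "))
print(normalize_sequence("mkt ayi\nakq"))
```

The same functions can be applied to a list with a comprehension, for example `[clean_string(x) for x in values]`, or to a DataFrame column with the column's `map` method.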

🔄 Processing Functions

Reads and structures biological data for downstream analysis:

  • Read and clean FASTA files
  • Read and clean tabular datasets
  • Convert biological sequence objects into dictionaries and structured lists
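
As an illustration of turning FASTA input into a dictionary, here is a small standard-library sketch. The repository may well rely on a dedicated parser instead; `parse_fasta` is a hypothetical name and the records are made up.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record_id: sequence} dictionary."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            tokens = line[1:].split()
            # The record ID is the first token of the header line.
            current_id = tokens[0] if tokens else f"unnamed_{len(records) + 1}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are concatenated and uppercased.
            records[current_id] += line.upper()
    return records


# Made-up records for illustration only.
example = """>seq1 made-up protein
mkt
ayi
>seq2 another made-up protein
MGS
""".splitlines()

print(parse_fasta(example))
```

To read from disk, the same function accepts an open file handle, since a file object iterates over its lines.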

🧬 Alignment & Sequence Analysis

Core bioinformatics analysis module:

  • Pairwise sequence alignment
  • Local BLAST database creation
  • BLAST search execution (local & remote if applicable)
  • Pattern detection in sequences
  • Extraction of relevant reads from FASTA files
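
Pattern detection can be sketched with a regular expression search that reports where each hit falls in the sequence. The helper name `find_motif`, the toy sequence, and the pattern below are illustrative assumptions and are not taken from `motif_definitions.json`.

```python
import re
from typing import List, Tuple


def find_motif(sequence: str, pattern: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, match) for each non-overlapping hit.

    Positions are 1-based and inclusive, the usual convention for
    residue coordinates.
    """
    return [
        (m.start() + 1, m.end(), m.group())
        for m in re.finditer(pattern, sequence)
    ]


# Toy sequence and pattern, for illustration only.
toy_sequence = "MKNADFDGDEMNLHDFDGDK"
hits = find_motif(toy_sequence, r"DFDGD")
print(hits)
```

Because the pattern is a regular expression, degenerate positions can be written directly, for example `D.DGD` to allow any residue at the second position.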

📊 Statistics

Quality control and exploratory analysis:

  • Missing values analysis
  • Column-wise statistics
  • Correlation analysis
  • FASTA summary: sequence length distribution, GC content, protein-level analysis
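
A FASTA summary of the kind described above might be computed as follows. This is a sketch under assumptions: `gc_content` and `fasta_summary` are hypothetical names, and the input is a made-up `{id: sequence}` dictionary. GC content is only meaningful for nucleotide sequences.

```python
from statistics import mean, median
from typing import Dict


def gc_content(sequence: str) -> float:
    """Fraction of G and C characters in a nucleotide sequence (0.0 if empty)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def fasta_summary(records: Dict[str, str]) -> Dict[str, float]:
    """Summarize sequence lengths and mean GC content for a set of records."""
    if not records:
        raise ValueError("no sequences to summarize")
    lengths = [len(seq) for seq in records.values()]
    return {
        "n_sequences": len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": mean(lengths),
        "median_length": median(lengths),
        "mean_gc": mean(gc_content(seq) for seq in records.values()),
    }


# Made-up nucleotide records for illustration only.
summary = fasta_summary({"a": "ATGC", "b": "GGCCAT", "c": "AT"})
print(summary)
```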

🧪 Pipeline Overview

Plant Genome Sequences
        │
        ▼
┌───────────────────┐
│ 1. Database       │──→ Custom "Embryophyta" database
│    Construction   │    (taxonomy-based filtering)
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│ 2. Sequence       │──→ Automated extraction
│    Retrieval      │    Multi-FASTA generation
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│ 3. Preprocessing  │──→ Cleaning sequences & headers
│                   │    Format validation & stats
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│ 4. BLAST Analysis │──→ Local BLAST+ execution
│                   │    Parameter tuning & parsing
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│ 5. Domain         │──→ Conserved domains (A–D, DeCL)
│    Analysis       │    Pol II vs IV vs V distinction
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│ 6. Rule-Based     │──→ JSON-configurable validation
│    Validation ⭐  │    (core innovation)
└───────────────────┘
         │
         ▼
    ✅ Validated Pol IV Orthologs

🔬 Methodology

1. Custom Database: Embryophyta

The database is built from local genomic resources and contains land-plant (Embryophyta) sequences across taxa. It can be filtered by taxonomic group or family for targeted searches.
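
Taxonomy-based filtering can be pictured as selecting records whose lineage contains a given taxon. The sketch below assumes each record carries a `lineage` list; that record layout, the `filter_by_taxon` helper, and the abbreviated lineages are illustrative assumptions, not the database's actual schema.

```python
from typing import Dict, List

# Made-up records with abbreviated lineages, for illustration only.
RECORDS: List[Dict] = [
    {"id": "seq1", "species": "Arabidopsis thaliana",
     "lineage": ["Viridiplantae", "Embryophyta", "Brassicaceae"]},
    {"id": "seq2", "species": "Oryza sativa",
     "lineage": ["Viridiplantae", "Embryophyta", "Poaceae"]},
    {"id": "seq3", "species": "Chlamydomonas reinhardtii",
     "lineage": ["Viridiplantae", "Chlorophyta"]},
]


def filter_by_taxon(records: List[Dict], taxon: str) -> List[Dict]:
    """Keep records whose lineage contains the taxon (case-insensitive)."""
    wanted = taxon.lower()
    return [
        record for record in records
        if wanted in (name.lower() for name in record["lineage"])
    ]


land_plants = filter_by_taxon(RECORDS, "Embryophyta")
grasses = filter_by_taxon(RECORDS, "poaceae")
print([r["id"] for r in land_plants])
print([r["id"] for r in grasses])
```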


2. Sequence Alignment (BLAST+)

  • Local BLAST execution for performance
  • Protein-level comparison (blastp)
  • Python wrapper for automation, parameter control, and output parsing
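
A thin Python wrapper around BLAST+ could build the command lines, run them with `subprocess`, and parse the tabular output. The sketch below assumes the `makeblastdb` and `blastp` executables are on `PATH` and that the default 12-column tabular format (`-outfmt 6`) is used. The function names, file names, and the sample hit line are hypothetical.

```python
import subprocess
from typing import Dict, List

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")


def makeblastdb_cmd(fasta_path: str, db_name: str) -> List[str]:
    """Command that builds a local protein BLAST database."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "prot", "-out", db_name]


def blastp_cmd(query_path: str, db_name: str,
               evalue: float = 1e-5, threads: int = 1) -> List[str]:
    """Command that runs blastp and writes tabular hits to stdout."""
    return ["blastp", "-query", query_path, "-db", db_name, "-outfmt", "6",
            "-evalue", str(evalue), "-num_threads", str(threads)]


def run_command(cmd: List[str]) -> str:
    """Run a command and return its stdout; raises if the tool exits non-zero."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def parse_tabular(text: str) -> List[Dict]:
    """Parse -outfmt 6 text into a list of dictionaries with numeric fields converted."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        hit = dict(zip(BLAST_COLUMNS, line.split("\t")))
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits


# Made-up hit line; a real run would use:
#   run_command(makeblastdb_cmd("embryophyta.fasta", "embryophyta"))
#   text = run_command(blastp_cmd("candidates.fasta", "embryophyta"))
sample = "cand1\tref_NRPD1\t62.5\t800\t280\t10\t1\t790\t5\t800\t1e-120\t1050"
hits = parse_tabular(sample)
print(hits[0]["sseqid"], hits[0]["pident"], hits[0]["evalue"])
```

Keeping command construction separate from execution makes the parameters easy to inspect and to unit-test without BLAST+ installed.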

3. Data Preprocessing

  • Cleaning FASTA files and fixing sequence inconsistencies
  • Statistical validation: length distribution, GC content, molecular properties

4. Domain Database Construction

The domain database is built from known sequences (e.g., Arabidopsis thaliana) to identify conserved domains (A–D, DeCL) and specific motifs.

This enables discrimination between:

| Domain feature | Pol II | Pol IV | Pol V |
|---|---|---|---|
| Conserved domains A–D | ✅ | ✅ | ✅ |
| DeCL domain | ❌ | ✅ | ❌ |
| Specific motifs | Unique | Unique | Unique |

5. Rule-Based Validation System ⭐ (Core Innovation)

Each candidate sequence is validated against a set of configurable rules:

| Rule | Description |
|---|---|
| Minimum length | Sequence must meet a minimum length threshold |
| Domain presence | Required conserved domains must be detected |
| Domain order | Domains must appear in the expected order |
| Domain position | Domains must fall within expected positional ranges |
| Conflicting motifs | Absence of motifs specific to Pol II or Pol V |

👉 All rules are stored in JSON files, making the system modular, extensible, and reusable: validation criteria can be adjusted without modifying any Python code.
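
To make the five rules concrete, here is a minimal sketch of a JSON rule set and a validator that applies it. The JSON schema, the thresholds, the `validate` function, and the toy inputs are all assumptions for illustration and do not reproduce `domain_rules.json`. The motif `YSPTSPS` is the consensus heptad repeat of the Pol II C-terminal domain, used here only as an example of a conflicting motif.

```python
import json
import re
from typing import Dict, Tuple

# Hypothetical rule set; the real files may use a different schema.
RULES = json.loads("""
{
  "min_length": 30,
  "required_domains": ["A", "B", "C", "D", "DeCL"],
  "domain_order": ["A", "B", "C", "D", "DeCL"],
  "domain_ranges": {"A": [1, 10], "DeCL": [20, 40]},
  "conflicting_motifs": ["YSPTSPS"]
}
""")


def validate(sequence: str,
             domains: Dict[str, Tuple[int, int]],
             rules: dict) -> Dict[str, bool]:
    """Apply each rule and report pass/fail per rule.

    `domains` maps a domain name to its (start, end) position,
    1-based and inclusive, as produced by an upstream domain search.
    """
    # Start positions of detected domains, listed in the expected order.
    starts = [domains[d][0] for d in rules["domain_order"] if d in domains]
    # Positional check applies only to domains that were detected.
    in_range = all(
        low <= domains[name][0] and domains[name][1] <= high
        for name, (low, high) in rules["domain_ranges"].items()
        if name in domains
    )
    return {
        "min_length": len(sequence) >= rules["min_length"],
        "domain_presence": all(d in domains for d in rules["required_domains"]),
        "domain_order": starts == sorted(starts),
        "domain_position": in_range,
        "no_conflicting_motifs": not any(
            re.search(motif, sequence) for motif in rules["conflicting_motifs"]
        ),
    }


# Toy candidates for illustration only.
good = validate(
    "M" * 40,
    {"A": (2, 8), "B": (10, 14), "C": (15, 18), "D": (19, 22), "DeCL": (25, 38)},
    RULES,
)
bad = validate("M" * 30 + "YSPTSPS", {"B": (3, 6), "A": (10, 14)}, RULES)
print(good)
print(bad)
```

Returning one boolean per rule, rather than a single verdict, keeps the reason for a rejection visible, and tightening a threshold only requires editing the JSON.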


6. Automation & Integration

  • Pipeline steps automated in Python (a few manual steps remain; see Future Improvements)
  • Integration of BLAST, FASTA processing, and domain analysis
  • Format conversion for downstream analysis

7. Testing & Version Control

  • Unit tests for all validation functions
  • Git used for version tracking, code stability, and collaboration
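
A unit test for one validation function might look like the sketch below. It assumes the standard-library `unittest` framework (the project may use a different one), and `meets_min_length` is a hypothetical stand-in for a real validation function.

```python
import unittest


def meets_min_length(sequence: str, min_length: int) -> bool:
    """Hypothetical validation function: True if the sequence is long enough."""
    return len(sequence) >= min_length


class TestMinLength(unittest.TestCase):
    def test_accepts_long_sequence(self):
        self.assertTrue(meets_min_length("M" * 50, 30))

    def test_rejects_short_sequence(self):
        self.assertFalse(meets_min_length("MKT", 30))

    def test_boundary_is_inclusive(self):
        self.assertTrue(meets_min_length("M" * 30, 30))


# Run the test case explicitly so the example also works inside a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMinLength)
result = unittest.TextTestRunner(verbosity=1).run(suite)
```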

🚀 Key Contributions

  • Development of a modular bioinformatics pipeline for polymerase annotation
  • Automation of the sequence annotation workflow from retrieval to validation
  • Creation of a flexible JSON-based rule validation system (core innovation)
  • Improved accuracy in identifying Pol IV orthologs across plant species

🔮 Future Improvements

  • Phylogenetic analysis integration
  • Expansion to Pol V and other polymerase complexes
  • Additional statistical validation methods
  • Broader database integration (e.g., Phytozome)
  • Full pipeline automation (reduce remaining manual steps)

🛠️ Installation

# Clone the repository
git clone https://github.com/FatineHic/Polymerase-checker.git
cd Polymerase-checker

# Install Python dependencies
pip install -r requirements.txt
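
Since the pipeline calls local BLAST+ programs, it can help to confirm they are reachable before running anything. The sketch below assumes the BLAST+ executables are installed separately from the Python dependencies. `check_tools` is a hypothetical helper, and the tool names are the standard BLAST+ program names.

```python
import shutil
from typing import Dict, Iterable


def check_tools(names: Iterable[str] = ("makeblastdb", "blastp")) -> Dict[str, bool]:
    """Report whether each executable can be found on PATH."""
    return {name: shutil.which(name) is not None for name in names}


for tool, found in check_tools().items():
    print(f"{tool}: {'found' if found else 'NOT found on PATH'}")
```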

📎 Notes

  • All functions are modular and can be used independently
  • JSON configuration files allow rule modification without touching the codebase
  • The pipeline is designed for reproducibility and extensibility across plant species

About

Modular Python pipeline for identifying RNA Polymerase IV (NRPD1) orthologs in plant genomes. Features BLAST+ integration, custom domain analysis, and a JSON-configurable rule-based validation system for Pol II/IV/V distinction.
