Command-line tools for sequence analysis, BLAST processing, protein domain analysis, genome assembly QC, and visualization.
Scripts and utilities for working with genomic and proteomic data, from raw sequence processing to visualization. Includes support for HPC cluster environments.
- Sequence Manipulation: FASTA/FASTQ processing, header modification, length filtering
- BLAST Suite: Result parsing, hit filtering, taxonomic analysis
- Domain Analysis: PFAM/HMM processing, motif discovery with ML
- Assembly QC: N50 calculation, k-mer analysis, contamination detection
- Visualization: Circos plots, UMAP projections, custom graphics
- HPC Integration: SLURM job scripts for cluster computing
```
bioinformatics-toolkit/
├── sequence-analysis/   # FASTA/Q manipulation, format conversion
├── blast-tools/         # BLAST processing and analysis
├── domain-analysis/     # Protein domain and motif tools
├── assembly-tools/      # Genome assembly statistics
├── visualization/       # Data visualization scripts
├── hpc-scripts/         # SLURM batch job scripts
└── utils/               # General bioinformatics utilities
```
```
# Core requirements
python >= 3.7
perl >= 5.10
bash >= 4.0

# Bioinformatics software
blast+ >= 2.10.0
hmmer >= 3.3
diamond >= 2.0.0
circos >= 0.69

# Python packages
pip install biopython pandas numpy matplotlib umap-learn
```

```bash
# Clone the repository
git clone https://github.com/olympus-terminal/bioinformatics-toolkit.git
cd bioinformatics-toolkit
# Make scripts executable
find . \( -name "*.sh" -o -name "*.py" -o -name "*.pl" \) -print0 | xargs -0 chmod +x
# Add to PATH (optional)
export PATH="$PATH:$(pwd)"
```

Sequence manipulation tools:

| Tool | Description | Usage |
|---|---|---|
| `filter_fasta_on_length` | Filter sequences by length | `./filter_fasta_on_length input.fa 1000 > filtered.fa` |
| `rename-fasta-headers-simple.sh` | Batch rename FASTA headers | `./rename-fasta-headers-simple.sh sequences.fa prefix` |
| `interleave.sh` | Interleave paired-end reads | `./interleave.sh read1.fq read2.fq > interleaved.fq` |
| `remove_line_wraps_fa.py` | Convert multi-line to single-line FASTA | `python remove_line_wraps_fa.py wrapped.fa > unwrapped.fa` |
| `csv2fa.py` | Convert CSV to FASTA format | `python csv2fa.py data.csv > sequences.fa` |
| `count-amino-acid-residues` | Count AA composition | `./count-amino-acid-residues proteins.fa` |
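The sequence tools can be chained into simple pipelines. A minimal sketch, assuming the scripts are run from the repository root with the argument order shown in the table above (file names are illustrative):

```bash
# Unwrap a multi-line FASTA, then keep only sequences of at least 500 bp
python sequence-analysis/remove_line_wraps_fa.py wrapped.fa > unwrapped.fa
./sequence-analysis/filter_fasta_on_length unwrapped.fa 500 > filtered_500bp.fa
```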
BLAST processing tools:

| Tool | Description | Usage |
|---|---|---|
| `ExtractSeqFromBLASTresults.sh` | Extract sequences from BLAST hits | `./ExtractSeqFromBLASTresults.sh blast.out seqs.fa` |
| `TakeOnlyBestBLASThit.sh` | Keep only the best hit per query | `./TakeOnlyBestBLASThit.sh results.blast6` |
| `tally_blastp_by_genera.sh` | Summarize hits by genus | `./tally_blastp_by_genera.sh blastp.out` |
| `make_taxid_diamond-blastdb` | Create a DIAMOND database with taxonomy IDs | `./make_taxid_diamond-blastdb proteins.fa taxdb` |
Protein domain and motif tools:

| Tool | Description | Usage |
|---|---|---|
| `extractPFAMs.sh` | Extract PFAM domains | `./extractPFAMs.sh hmmscan.out` |
| `Find-ECs-from-PFAMs.sh` | Map EC numbers from domains | `./Find-ECs-from-PFAMs.sh pfam_results.txt` |
| `LAAASR_motifMinerPro.py` | ML-based motif discovery | `python LAAASR_motifMinerPro.py -i seqs.fa -m model/` |
| `pivot_table_pfams` | Create PFAM pivot tables | `./pivot_table_pfams domain_counts.txt` |
Assembly QC tools:

| Tool | Description | Usage |
|---|---|---|
| `n50calc.py` | Calculate assembly statistics | `python n50calc.py contigs.fa` |
| `cid_make_unitigs` | Process unitigs/contigs | `./cid_make_unitigs assembly.fa` |
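For reference, N50 is the contig length at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly size. A minimal shell sketch of that calculation (not necessarily how `n50calc.py` implements it; file names are illustrative):

```bash
# One contig length per line
awk '/^>/ {if (len) print len; len = 0; next} {len += length($0)} END {if (len) print len}' \
    contigs.fa > lengths.txt

# Total assembly size
total=$(awk '{sum += $1} END {print sum}' lengths.txt)

# Walk lengths from longest to shortest until the running sum reaches half the total
sort -rn lengths.txt | awk -v total="$total" \
    '{run += $1; if (run >= total / 2) {print "N50 =", $1; exit}}'
```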
Visualization tools:

| Tool | Description | Usage |
|---|---|---|
| `circos-make.py` | Generate Circos configuration | `python circos-make.py -i data.txt -o circos.conf` |
| `run_pycirclize_v8.py` | Create circular plots | `python run_pycirclize_v8.py -i matrix.csv -o plot.png` |
| `UMAP-matrix-1.py` | UMAP dimensionality reduction | `python UMAP-matrix-1.py expression.csv` |
Pre-configured SLURM job scripts for common bioinformatics tasks:
- `BLEACH_3long.sbatch` - Long-read contamination screening
- `meryl_merqury_long-reads-k-mers-hapQC.sbatch` - K-mer based QC
- `gmap.sbatch` - GMAP alignment pipeline
- `COUNT_AA.sbatch` - Parallel amino acid counting
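These scripts usually need site-specific resource settings before submission. A minimal sketch, assuming a SLURM cluster where `compute` is a valid partition name (partition, walltime, and memory values are placeholders; any script-specific arguments go after the script name):

```bash
# Inspect the resource requests and usage notes in the script header first
head -n 20 hpc-scripts/gmap.sbatch

# Override partition, walltime, and memory at submission time instead of editing the script
sbatch --partition=compute --time=24:00:00 --mem=64G hpc-scripts/gmap.sbatch
```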
Example workflow for assembly QC:

```bash
# Calculate assembly statistics
python assembly-tools/n50calc.py genome.fa > stats.txt
# Check for contamination
sbatch hpc-scripts/QuickCheckContam.sbatch genome.fa
# Analyze k-mer distribution
sbatch hpc-scripts/meryl_merqury_long-reads-k-mers-hapQC.sbatch reads.fq genome.fa
```

Example workflow for protein domain annotation:

```bash
# Run PFAM scan (using HMMER)
hmmscan --tblout domains.tbl Pfam-A.hmm proteins.fa
# Extract and format results
./domain-analysis/extractPFAMs.sh domains.tbl > pfam_hits.txt
./domain-analysis/format_pfams.sh pfam_hits.txt > formatted.txt
# Find enzyme classifications
./domain-analysis/Find-ECs-from-PFAMs.sh formatted.txt > ec_numbers.txt
```

Example workflow for BLAST analysis:

```bash
# Run BLASTP
blastp -query proteins.fa -db nr -out blast.out -outfmt 6 -num_threads 8
# Filter best hits
./blast-tools/TakeOnlyBestBLASThit.sh blast.out > best_hits.txt
# Summarize by genera
./blast-tools/tally_blastp_by_genera.sh best_hits.txt > genera_summary.txt
```

The `LAAASR_motifMinerPro.py` tool uses GPT-NeoX language models for advanced motif discovery:

```bash
python domain-analysis/LAAASR_motifMinerPro.py \
    -i sequences.fa \
    -m gpt-neox-model/ \
    -o motifs_discovered.txt \
    --min-support 0.1
```

The BLEACH pipeline identifies potential contamination in sequencing data:

```bash
# For long reads
sbatch hpc-scripts/BLEACH_3long.sbatch sample.fastq reference.fa
# For short reads (5x5 layout)
sbatch hpc-scripts/BLEACH_5x5L.sbatch sample_R1.fq sample_R2.fq
```

Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-tool`)
- Commit your changes (`git commit -am 'Add new tool'`)
- Push to the branch (`git push origin feature/new-tool`)
- Create a Pull Request
- Include usage documentation for new tools
- Add example commands in tool headers (see the sketch after this list)
- Follow existing naming conventions
- Test on Linux and macOS when possible
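A hypothetical header for a new shell tool, showing the kind of usage documentation and example commands to include (the tool name and exact format here are illustrative, not a repository requirement):

```bash
#!/bin/bash
# dedup_fasta_headers.sh - collapse duplicate FASTA headers (hypothetical example tool)
#
# Usage:
#   ./dedup_fasta_headers.sh input.fa > deduplicated.fa
#
# Requirements: bash >= 4.0, awk
set -euo pipefail
```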
If you use these tools in your research, please cite:
```bibtex
@software{bioinformatics_toolkit,
  author = {olympus-terminal},
  title = {Bioinformatics Toolkit},
  url = {https://github.com/olympus-terminal/bioinformatics-toolkit},
  year = {2024}
}
```
This project is licensed under the MIT License - see the LICENSE file for details.
- Many scripts support parallel processing - check for `-threads` or `-j` options
- Use SLURM scripts as templates for your own HPC workflows
- Combine tools with Unix pipes for complex analyses (see the example below)
- Check script headers for detailed usage examples
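For example, the BLAST tools can be combined with standard Unix utilities in a single pipeline. A minimal sketch, assuming `TakeOnlyBestBLASThit.sh` writes tab-separated BLAST outfmt 6 lines to stdout, as in the workflow above (file names are illustrative):

```bash
# Keep the best hit per query, then list the 20 most frequently hit subject sequences
./blast-tools/TakeOnlyBestBLASThit.sh results.blast6 \
    | cut -f2 \
    | sort \
    | uniq -c \
    | sort -rn \
    | head -n 20
```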
Common issues and solutions:
- Permission denied: run `chmod +x script_name.sh`
- Module not found: install Python dependencies with pip
- SLURM errors: Check partition names and resource requirements
- Memory issues: Use HPC scripts for large datasets
For questions, issues, or collaborations:
- GitHub Issues: [create an issue](https://github.com/olympus-terminal/bioinformatics-toolkit/issues)
- GitHub: [@olympus-terminal](https://github.com/olympus-terminal)