VENKATESH-282/bioinformatics-fasta-parser

Bioinformatics FASTA Parser and Analyzer

Welcome to the FASTA Sequence Parser and Analyzer, a friendly Python tool designed to make working with FASTA files a breeze!
Whether you're diving into DNA or protein sequences, this script helps you parse files, crunch numbers like sequence length and GC content, and whip up neat reports and visuals.

Built with BioPython, Pandas, and Matplotlib, it’s perfect for bioinformaticians, students, or anyone curious about sequence analysis.


About

This tool processes FASTA files, the plain-text format used in bioinformatics to store nucleotide or protein sequences.
It extracts key stats like sequence length, GC percentage, and nucleotide frequencies, then wraps it all up in a tidy CSV report and a histogram of sequence lengths.

Whether you're analyzing one file or a batch, this script has you covered with a simple command-line interface.


Features

  • Parse FASTA Files: Reads sequences and metadata using BioPython’s SeqIO.
  • Compute Stats: Calculates sequence length, GC content, and percentages of A, T, G, C, and non-standard bases.
  • Summarize Data: Provides dataset-wide stats like total bases, average length, and standard deviation.
  • Visualize Results: Generates a clear histogram of sequence lengths.
  • Batch Processing: Handles multiple FASTA files in one go.
  • Custom Output: Saves results to a directory of your choice (defaults to results/).
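
To make the "Compute Stats" feature concrete, here is a minimal, dependency-free sketch of how per-sequence numbers like these could be computed. The function name `sequence_stats` and the dictionary keys are hypothetical and are not taken from the script. It also assumes percentages are measured against the full sequence length, which may differ from how `Bio.SeqUtils.gc_fraction` treats ambiguous bases.

```python
from collections import Counter


def sequence_stats(seq_id, sequence):
    """Return length, GC percentage, and base percentages for one sequence.

    Hypothetical helper for illustration; not the script's actual code.
    """
    seq = sequence.upper()
    length = len(seq)
    counts = Counter(seq)
    # Anything that is not A, T, G, or C is grouped as "other" (e.g. N).
    other = length - sum(counts[base] for base in "ATGC")

    def pct(n):
        # Guard against empty sequences to avoid division by zero.
        return round(100.0 * n / length, 2) if length else 0.0

    return {
        "id": seq_id,
        "length": length,
        "gc_percent": pct(counts["G"] + counts["C"]),
        "A_percent": pct(counts["A"]),
        "T_percent": pct(counts["T"]),
        "G_percent": pct(counts["G"]),
        "C_percent": pct(counts["C"]),
        "other_percent": pct(other),
    }


stats = sequence_stats("seq1", "ATGCGCNN")
print(stats)
```

A list of dictionaries like this could be handed to `pandas.DataFrame` to build the per-sequence table.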

Installation

You’ll need Python 3.6+ and a few dependencies.

Install the required packages:

pip install biopython pandas matplotlib

Clone or download this repository:

git clone https://github.com/[YourGitHubUsername]/fasta-analyzer.git
cd fasta-analyzer

Prerequisites:

  • Python 3.6 or higher
  • BioPython (biopython); Bio.SeqUtils.gc_fraction requires version 1.80 or later
  • Pandas (pandas)
  • Matplotlib (matplotlib)

Usage

Run the script from your terminal, point it to your FASTA file(s), and optionally pick an output folder.

Basic command:

python scripts/fasta_analyzer.py data/sample.fasta --output my_results

Options

  • fasta_files: Path(s) to your FASTA file(s). You can list multiple files!
  • --output: Where to save the results (defaults to results/).
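
The two options above map naturally onto `argparse`. The following is a sketch of how such a command-line interface could be declared; the function name `build_parser` and the help strings are assumptions, and the real script may define its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser with one or more FASTA paths and an optional output folder."""
    parser = argparse.ArgumentParser(
        description="Parse FASTA files and report sequence statistics."
    )
    # nargs="+" accepts one or more positional file paths.
    parser.add_argument("fasta_files", nargs="+", help="Path(s) to FASTA file(s)")
    # Assumed default, based on the documented results/ folder.
    parser.add_argument("--output", default="results", help="Output directory")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["data/sample.fasta", "--output", "my_results"])
print(args.fasta_files, args.output)
```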

What You Get

  • CSV file (sequence_stats.csv) → Per-sequence stats (ID, length, GC content, nucleotide frequencies).
  • Histogram (length_hist.png) → Distribution of sequence lengths.
  • Console Summary → Dataset-wide stats.

Example

If you have a FASTA file called sample.fasta in the data/ folder, run:

python scripts/fasta_analyzer.py data/sample.fasta

Expected output:

Processing data/sample.fasta...
 - 3 sequences, 200 total bases
Per-sequence stats saved to results/sequence_stats.csv
Dataset Summary:
 Num Sequences: 3
 Total Bases: 200
 Avg Length: 66.67
 Avg Gc Percent: 45.0
 Std Length: 32.46
Length histogram saved to results/length_hist.png

Check the results/ folder for your CSV and histogram!
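
For readers wondering where summary figures like these come from, here is a small sketch using only the standard library. The lengths and GC values are made up for illustration and are not the contents of sample.fasta. The function name `summarize` is hypothetical, and the choice of sample standard deviation (n - 1 denominator, which is also the pandas default) is an assumption about the script.

```python
import statistics


def summarize(lengths, gc_percents):
    """Compute dataset-wide stats from per-sequence lengths and GC percentages."""
    return {
        "num_sequences": len(lengths),
        "total_bases": sum(lengths),
        "avg_length": round(statistics.mean(lengths), 2),
        "avg_gc_percent": round(statistics.mean(gc_percents), 2),
        # Sample standard deviation needs at least two values.
        "std_length": round(statistics.stdev(lengths), 2) if len(lengths) > 1 else 0.0,
    }


# Hypothetical inputs: three sequences of 40, 60, and 100 bases.
summary = summarize([40, 60, 100], [50.0, 40.0, 45.0])
for key, value in summary.items():
    print(f" {key}: {value}")
```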


Project Structure

fasta-analyzer/
├── scripts/
│   └── fasta_analyzer.py      # The main script
├── data/
│   └── sample.fasta           # Your input FASTA files
├── results/
│   ├── sequence_stats.csv     # Per-sequence stats
│   └── length_hist.png        # Length distribution plot
└── README.md                  # You’re reading it!

How It Works

  1. Parsing → BioPython’s SeqIO.parse reads FASTA files.
  2. Stats Calculation → Sequence length, GC content (Bio.SeqUtils.gc_fraction), and nucleotide frequencies.
  3. Aggregation → Pandas summarizes stats across sequences.
  4. Visualization → Matplotlib plots a histogram of sequence lengths.
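
To illustrate what the parsing step does conceptually, here is a dependency-free stand-in for `SeqIO.parse`. It is a simplified sketch, not BioPython's implementation: the helper name `read_fasta` is hypothetical, and it only returns the ID (first token of the header) and the joined sequence.

```python
import io


def read_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA-formatted text handle."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if record_id is not None:
                yield record_id, "".join(chunks)
            header = line[1:].split()
            record_id = header[0] if header else ""
            chunks = []
        elif record_id is not None:
            # Sequence data may span several lines; collect them all.
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


example = io.StringIO(">seq1 toy record\nATGC\nGGCC\n>seq2\nTTAA\n")
records = list(read_fasta(example))
print(records)
```

In the real tool, BioPython handles this step and also exposes the full description line and other record metadata.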

Testing

  1. Place a FASTA file in the data/ folder (e.g., sample.fasta).

  2. Run:

    python scripts/fasta_analyzer.py data/sample.fasta

  3. Check results:

    • Console output for summary
    • results/sequence_stats.csv for detailed stats
    • results/length_hist.png for histogram
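
If you do not have a FASTA file handy for step 1, a tiny one can be generated. The sketch below uses made-up toy records and a hypothetical helper name, `write_sample_fasta`. It writes into a temporary directory so nothing in your project is overwritten; point it at data/sample.fasta if you want to use it with the commands above.

```python
import tempfile
from pathlib import Path


def write_sample_fasta(path):
    """Write three toy records (invented for testing) to a FASTA file."""
    records = {
        "toy1": "ATGCGC",
        "toy2": "GGGCCCAAATTT",
        "toy3": "ATATNN",  # includes non-standard bases on purpose
    }
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as handle:
        for record_id, seq in records.items():
            handle.write(f">{record_id}\n{seq}\n")
    return path


with tempfile.TemporaryDirectory() as tmp:
    sample = write_sample_fasta(Path(tmp) / "data" / "sample.fasta")
    content = sample.read_text()
    print(content)
```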

Ideas for Expansion

  • Add RNA support (handle U instead of T).
  • Flag invalid bases with warnings.
  • Speed up with multi-threading for huge files.
  • Add codon usage / sequence complexity metrics.
  • Export results in JSON or Excel.
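
As a starting point for the first two ideas and the JSON export, here is a rough sketch. None of this exists in the current script: the names `check_bases` and `to_json` and the `rna` flag are hypothetical.

```python
import json
import warnings

DNA_BASES = set("ACGT")
RNA_BASES = set("ACGU")


def check_bases(seq_id, sequence, rna=False):
    """Return the sorted non-standard bases in a sequence, warning if any exist."""
    allowed = RNA_BASES if rna else DNA_BASES
    invalid = sorted(set(sequence.upper()) - allowed)
    if invalid:
        warnings.warn(f"{seq_id}: non-standard bases found: {''.join(invalid)}")
    return invalid


def to_json(rows):
    """Serialize a list of per-sequence stat dictionaries to a JSON string."""
    return json.dumps(rows, indent=2)


print(check_bases("rna1", "AUGGCU", rna=True))  # U is valid for RNA
print(check_bases("dna1", "ATGNX"))             # N and X get flagged
print(to_json([{"id": "dna1", "length": 5}]))
```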

Author

Venkatesh R

📧 Email: [email protected]

🌐 GitHub: YourGitHubProfile


License

This project is licensed under the MIT License. See the LICENSE file for details.

