Welcome to the FASTA Sequence Parser and Analyzer, a friendly Python tool designed to make working with FASTA files a breeze!
Whether you're diving into DNA or protein sequences, this script helps you parse files, crunch numbers like sequence length and GC content, and whip up neat reports and visuals.
Built with BioPython, Pandas, and Matplotlib, it’s perfect for bioinformaticians, students, or anyone curious about sequence analysis.
This tool processes FASTA files, the plain-text format used in bioinformatics to store nucleotide or protein sequences.
It extracts key stats like sequence length, GC percentage, and nucleotide frequencies, then wraps it all up in a tidy CSV report and a histogram of sequence lengths.
Whether you're analyzing one file or a batch, this script has you covered with a simple command-line interface.
- Parse FASTA Files: Reads sequences and metadata using BioPython’s SeqIO.
- Compute Stats: Calculates sequence length, GC content, and percentages of A, T, G, C, and non-standard bases.
- Summarize Data: Provides dataset-wide stats like total bases, average length, and standard deviation.
- Visualize Results: Generates a clear histogram of sequence lengths.
- Batch Processing: Handles multiple FASTA files in one go.
- Custom Output: Saves results to a directory of your choice (defaults to results/).
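To make the Compute Stats feature concrete, here is a minimal sketch of how per-sequence numbers could be derived. It uses plain Python rather than BioPython so it runs without extra packages. The function name `sequence_stats` and the dictionary keys are illustrative assumptions, not the script's actual names, and the GC value is computed over the full sequence length, which may differ from what Bio.SeqUtils.gc_fraction reports for sequences with ambiguous bases.

```python
from collections import Counter


def sequence_stats(seq_id, sequence):
    """Return length, GC percent, and base percentages for one sequence.

    Hypothetical helper for illustration; not taken from fasta_analyzer.py.
    """
    seq = sequence.upper()
    length = len(seq)
    counts = Counter(seq)
    # Anything outside A/T/G/C (N, gaps, IUPAC codes) counts as non-standard.
    other = length - sum(counts[base] for base in "ATGC")

    def pct(n):
        # Guard against empty sequences to avoid dividing by zero.
        return round(100.0 * n / length, 2) if length else 0.0

    return {
        "id": seq_id,
        "length": length,
        "gc_percent": pct(counts["G"] + counts["C"]),
        "pct_A": pct(counts["A"]),
        "pct_T": pct(counts["T"]),
        "pct_G": pct(counts["G"]),
        "pct_C": pct(counts["C"]),
        "pct_other": pct(other),
    }


stats = sequence_stats("seq1", "ATGCGCNNAT")
print(stats["length"], stats["gc_percent"], stats["pct_other"])  # 10 40.0 20.0
```

Dividing by the full length keeps the five percentages summing to 100, which is convenient for a report but is a design choice of this sketch.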
You’ll need Python 3.6+ and a few dependencies.
Install the required packages:
pip install biopython pandas matplotlib
Clone or download this repository:
git clone https://github.com/[YourGitHubUsername]/fasta-analyzer.git
cd fasta-analyzer
Requirements:
- Python 3.6 or higher
- BioPython (biopython)
- Pandas (pandas)
- Matplotlib (matplotlib)
Run the script from your terminal, point it to your FASTA file(s), and optionally pick an output folder.
Basic command:
python scripts/fasta_analyzer.py data/sample.fasta --output my_results
- fasta_files: Path(s) to your FASTA file(s). You can list multiple files!
- --output: Where to save the results (defaults to results/).
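The two arguments above map naturally onto Python's standard argparse module. The following is a sketch of what such a command-line definition might look like; the helper name `build_parser` and the help strings are assumptions, and the real script may define its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser with one or more FASTA paths and an optional output dir."""
    parser = argparse.ArgumentParser(
        description="Parse FASTA files and report sequence statistics."
    )
    # nargs="+" accepts one or more positional paths, enabling batch runs.
    parser.add_argument("fasta_files", nargs="+", help="Path(s) to FASTA file(s)")
    # Falls back to the results directory when --output is not given.
    parser.add_argument(
        "--output", default="results", help="Directory for the CSV and histogram"
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(["data/sample.fasta", "--output", "my_results"])
print(args.fasta_files, args.output)  # ['data/sample.fasta'] my_results
```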
Outputs:
- CSV file (sequence_stats.csv) → Per-sequence stats (ID, length, GC content, nucleotide frequencies).
- Histogram (length_hist.png) → Distribution of sequence lengths.
- Console Summary → Dataset-wide stats.
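For a feel of how the CSV output could be produced, here is a small sketch using only the standard library's csv module (the script itself is described as using Pandas, so treat this as a stand-in). The function name `write_stats_csv` and the column names are hypothetical; only the file name sequence_stats.csv and the results/ default come from this README.

```python
import csv
import os
import tempfile


def write_stats_csv(rows, output_dir="results", filename="sequence_stats.csv"):
    """Write a list of per-sequence dicts to a CSV file and return its path.

    Assumes `rows` is non-empty and every dict shares the same keys.
    """
    # Create the output directory if it does not exist yet.
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


# Illustrative rows; the values are made up for the example.
rows = [
    {"id": "seq1", "length": 8, "gc_percent": 75.0},
    {"id": "seq2", "length": 4, "gc_percent": 0.0},
]

# Write into a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = write_stats_csv(rows, output_dir=tmp)
    with open(csv_path, newline="") as handle:
        header = handle.readline().strip()
print(header)  # id,length,gc_percent
```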
If you have a FASTA file called sample.fasta in the data/ folder, run:
python scripts/fasta_analyzer.py data/sample.fasta
Expected output:
Processing data/sample.fasta...
- 3 sequences, 200 total bases
Per-sequence stats saved to results/sequence_stats.csv
Dataset Summary:
Num Sequences: 3
Total Bases: 200
Avg Length: 66.67
Avg Gc Percent: 45.0
Std Length: 32.46
Length histogram saved to results/length_hist.png
Check the results/ folder for your CSV and histogram!
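The dataset summary printed above boils down to a few aggregates over the per-sequence rows. Here is a hedged sketch using the standard statistics module instead of Pandas. The function name `summarize`, the dictionary keys, and the three example lengths are assumptions chosen for illustration, so the standard deviation below is not meant to reproduce the sample output. It uses the sample standard deviation (n - 1), which is also the Pandas default for .std(); whether the script follows that convention is not confirmed here.

```python
import statistics


def summarize(rows):
    """Aggregate per-sequence rows into dataset-wide statistics.

    Assumes `rows` is non-empty and each row has "length" and "gc_percent".
    """
    lengths = [row["length"] for row in rows]
    gc_values = [row["gc_percent"] for row in rows]
    return {
        "num_sequences": len(rows),
        "total_bases": sum(lengths),
        "avg_length": round(statistics.mean(lengths), 2),
        # Unweighted mean of the per-sequence GC percentages.
        "avg_gc_percent": round(statistics.mean(gc_values), 2),
        # Sample standard deviation needs at least two values.
        "std_length": round(statistics.stdev(lengths), 2) if len(lengths) > 1 else 0.0,
    }


# Hypothetical rows: three sequences totalling 200 bases.
example_rows = [
    {"length": 30, "gc_percent": 40.0},
    {"length": 70, "gc_percent": 50.0},
    {"length": 100, "gc_percent": 45.0},
]
summary = summarize(example_rows)
print(summary["num_sequences"], summary["total_bases"], summary["avg_length"])  # 3 200 66.67
```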
fasta-analyzer/
├── scripts/
│ └── fasta_analyzer.py # The main script
├── data/
│ └── sample.fasta # Your input FASTA files
├── results/
│ ├── sequence_stats.csv # Per-sequence stats
│ └── length_hist.png # Length distribution plot
└── README.md # You’re reading it!
- Parsing → BioPython’s SeqIO.parse reads FASTA files.
- Stats Calculation → Sequence length, GC content (Bio.SeqUtils.gc_fraction), and nucleotide frequencies.
- Aggregation → Pandas summarizes stats across sequences.
- Visualization → Matplotlib plots a histogram of sequence lengths.
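If you're curious what the parsing step involves, the sketch below reads FASTA records with plain Python. It is a simplified stand-in for SeqIO.parse, which the script relies on and which returns richer SeqRecord objects (with id, description, and seq attributes). The generator name `read_fasta` is an assumption made for this example. As a side note, Bio.SeqUtils.gc_fraction is available in Biopython 1.80 and later, so an older install may need upgrading.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle.

    Simplified illustration: header lines start with '>', and all following
    lines up to the next header are joined into one sequence string.
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the final record once the file is exhausted.
    if header is not None:
        yield header, "".join(chunks)


example = io.StringIO(">seq1 demo\nATGC\nGGCC\n>seq2\nTTAA\n")
records = list(read_fasta(example))
print(records)  # [('seq1 demo', 'ATGCGGCC'), ('seq2', 'TTAA')]
```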
Example workflow:
- Place a FASTA file in the data/ folder (e.g., sample.fasta).
- Run:
  python scripts/fasta_analyzer.py data/sample.fasta
- Check results:
  - Console output for summary
  - results/sequence_stats.csv for detailed stats
  - results/length_hist.png for histogram
Possible improvements:
- Add RNA support (handle U instead of T).
- Flag invalid bases with warnings.
- Speed up with multi-threading for huge files.
- Add codon usage / sequence complexity metrics.
- Export results in JSON or Excel.
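As a starting point for the first two ideas, here is one possible sketch of RNA handling and invalid-base warnings. None of this exists in the script today. The helper name `normalize_sequence` and the choice to map U to T before computing stats are assumptions, and a real implementation might prefer to keep RNA sequences as they are.

```python
import warnings

DNA_BASES = set("ATGC")


def normalize_sequence(seq_id, sequence):
    """Upper-case a sequence, map RNA 'U' to 'T', and warn about other bases.

    Hypothetical helper sketching two planned features; not part of the tool.
    """
    seq = sequence.upper().replace("U", "T")
    # Collect any characters that are still not standard DNA bases.
    invalid = sorted(set(seq) - DNA_BASES)
    if invalid:
        warnings.warn(f"{seq_id}: non-standard bases found: {''.join(invalid)}")
    return seq


print(normalize_sequence("rna1", "augcu"))  # ATGCT
```

Calling it on a sequence such as "ATGNX" would still return the upper-cased string but emit a warning naming N and X.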
Venkatesh R
📧 Email: [email protected]
🌐 GitHub: YourGitHubProfile
This project is licensed under the MIT License. See the LICENSE file for details.