GenomeKit

A comprehensive Python package for genome analysis, including SNP checking, microarray generation, and clinical variant annotation. GenomeKit provides a unified, modular toolkit for processing genomic data using various reference databases and third-party tools.

Features

Microarray Generation: Convert BAM/CRAM files to various microarray formats (23andMe, Ancestry, FTDNA) with smart checkpointing to resume interrupted operations and avoid redundant processing.
SNP Checking: Query clinically relevant SNPs with parallel processing for VCF files and display color-coded genotype interpretations (protective, risk, carrier status) with detailed explanations.

oks

Clinical Variant Analysis: Integrate OpenCravat for comprehensive variant annotation with data from multiple sources (ClinVar, gnomAD, COSMIC, OMIM) and generate reports of clinically significant variants.

Quick Start

1. Clone the Repository

git clone [email protected]:skryl/genome.git
cd genome

2. Install Dependencies and Reference Data

bash install_dependencies.sh

Use bash install_dependencies.sh --test for a dry run (shows actions, does not modify system).

3. Run the CLI Tool

You can use the CLI directly without installation using the included genome script:

# Make the script executable (first time only)
chmod +x genome

# View available commands
./genome --help

Or install the package and use it globally:

# Install the package
pip install .

# Run commands
genomekit --help

Command Examples

Microarray Generation

Generate microarray files from BAM/CRAM files:

# Generate a specific microarray format
./genome microarray --bam path/to/file.bam --formats CombinedKit--outdir ./output

# Generate multiple formats
./genome microarray --cram path/to/file.cram --formats CombinedKit 23andMe_V5 Ancestry_V2 --outdir ./output

# Use GATK instead of bcftools
./genome microarray --bam path/to/file.bam --formats 23andMe_V5 --outdir ./output --use-gatk

SNP Checking

Check SNPs in a VCF or microarray file:

# Check all SNPs in a 23andMe microarray file
./genome snps path/to/snp_report.txt

# Check all SNPs in a VCF file
./genome snps path/to/annotated_vcf.vcf.gz

# Check SNPs for a specific section
./genome snps path/to/annotated_vcf.vcf.gz --section metabolism

# Use a different reference directory (must contain snp_catalog.json)
./genome snps path/to/annotated_vcf.vcf.gz --refdir /path/to/reference --debug

OpenCravat Analysis

Run OpenCravat analysis on VCF files and extract clinical variants:

# Run basic OpenCravat analysis
./genome oc run --vcf path/to/file.vcf --outdir ./clinical_analysis

# Run with specific annotators
./genome oc run --vcf path/to/file.vcf --annotators clinvar gnomad cosmic --outdir ./clinical_analysis

# Find clinically significant variants
./genome oc find-clinvar --outdir ./clinical_analysis --limit 50

Directory Structure

├── data/
│   ├── dbsnp/                 # dbSNP VCF files (e.g., dbsnp_156_hg38.vcf.gz)
│   ├── reference/             # Reference genomes (e.g., hs38d1.fna.gz)
├── scripts/                   # Analysis and utility scripts
├── requirements.txt           # Python dependencies
├── install_dependencies.sh    # Automated setup script
└── README.md                  # This file

Requirements

System: bash, wget, bcftools, htslib, samtools, python3, (optionally: Homebrew or apt-get)
Python: See requirements.txt
Tools: OpenCravat (oc), GATK (manual install for some workflows)

Notes

Reference and dbSNP files are only downloaded if not already present.
All VCF and FASTA files are indexed for rapid access.
OpenCravat modules (clinvar, gnomad, cosmic, omim, dbsnp) are installed automatically.

Data Processing Pipeline

The analysis pipeline uses the unified CLI tool (via the top-level genome script) that supports microarray generation, SNP checking, and OpenCravat analysis.

1. Generate Microarray Files from BAM/CRAM

# Generate microarray files from CRAM (default bcftools approach)
./genome microarray \
  --cram data/nebula/sample.cram \
  --formats CombinedKit \
  --outdir data/output/microarray \
  --refdir data/reference

# Generate multiple formats at once
./genome microarray \
  --bam data/nebula/sample.bam \
  --formats CombinedKit 23andMe_V5 Ancestry_V2 FTDNA_V3 \
  --outdir data/output/microarray

# Alternative: Use GATK HaplotypeCaller for variant calling
./genome microarray \
  --cram data/nebula/sample.cram \
  --formats CombinedKit 23andMe_V5 \
  --outdir data/output/microarray \
  --use-gatk

The tool will:

Process the BAM/CRAM file to call variants
Generate microarray files in requested formats (e.g., 23andMe_V5, Ancestry_V2)
Reuse existing intermediate files when possible for faster processing
Store all intermediate files in the specified output directory's temp subdirectory

2. Analyze SNPs from VCF or Microarray Files

# Analyze all SNP categories from microarray
./genome snps data/output/microarray/CombinedKit.txt

# Analyze all SNP categories from VCF
./genome snps data/output/microarray/sample_annotated.vcf.gz

# Check only specific SNP categories
./genome snps data/output/microarray/CombinedKit.txt --section metabolism
./genome snps data/output/microarray/CombinedKit.txt --section cardiovascular

# Available sections: metabolism, inflammation, cardiovascular, neurotransmitters,
# antioxidant, vitamin, drug, athletic, longevity, sleep, cognitive, nutrition

The SNP analysis will:

Query your VCF or microarray file for clinically relevant SNPs
Display color-coded results (green/yellow/red) based on genotype interpretation
Provide detailed explanations for each variant
Support strand complementarity checking for accurate genotyping

3. Run OpenCravat Analysis

# Run OpenCravat analysis on a VCF file
./genome oc run --vcf data/output/microarray/sample_annotated_vcf.vcf.gz --outdir data/clinical_analysis

# Run with specific annotators
./genome oc run --vcf data/output/microarray/sample_annotated.vcf \
  --annotators clinvar gnomad cosmic omim dbsnp \
  --outdir data/clinical_analysis

# Find clinically significant variants from OpenCravat results
./genome oc find-clinvar --outdir data/clinical_analysis --limit 50

# List all installed OpenCravat annotators
./genome oc list-annotators

# Generate report with all variants and OpenCravat annotations
./genome oc generate-report --outdir data/clinical_analysis --format excel

The OpenCravat analysis will:

Annotate variants in your VCF file with data from various sources (ClinVar, gnomAD, COSMIC, OMIM, dbSNP)
Generate comprehensive reports in various formats (TSV, Excel, PDF)
Identify clinically significant variants by pathogenicity and review status
Allow filtering based on multiple annotation sources

License

MIT License. See LICENSE for details.

Author

skryl

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
data/reference		data/reference
examples		examples
genomekit		genomekit
scripts		scripts
.gitignore		.gitignore
MANIFEST.in		MANIFEST.in
PACKAGING.md		PACKAGING.md
README.md		README.md
genome		genome
install_dependencies.sh		install_dependencies.sh
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GenomeKit

Features

Quick Start

1. Clone the Repository

2. Install Dependencies and Reference Data

3. Run the CLI Tool

Command Examples

Microarray Generation

SNP Checking

OpenCravat Analysis

Directory Structure

Requirements

Notes

Data Processing Pipeline

1. Generate Microarray Files from BAM/CRAM

2. Analyze SNPs from VCF or Microarray Files

3. Run OpenCravat Analysis

License

Author

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

GenomeKit

Features

Quick Start

1. Clone the Repository

2. Install Dependencies and Reference Data

3. Run the CLI Tool

Command Examples

Microarray Generation

SNP Checking

OpenCravat Analysis

Directory Structure

Requirements

Notes

Data Processing Pipeline

1. Generate Microarray Files from BAM/CRAM

2. Analyze SNPs from VCF or Microarray Files

3. Run OpenCravat Analysis

License

Author

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages