DEGA is a reproducible pipeline and interactive Jupyter notebook for differential gene expression (DGE) analysis of gene expression datasets. The repository includes a fully documented notebook (DEGA.ipynb) in the colab folder that runs the analysis, plus an output folder containing publication-ready tables, plots, and summaries generated by the notebook.
2025-08-05: Analysis run and outputs exported. This release includes:
- `publication_ready_results.csv` and `comprehensive_deg_results.csv`.
- Figures: `volcano_plot.png`, `ma_plot.png`, `top_genes_heatmap.png`, `top_genes_boxplots.png`, `exploratory_analysis.png`, `quality_assessment.png`, and more.
- `statistical_summary.txt` summarizing the key metrics from the run.
DEGA is intended for researchers who want a clear, reproducible workflow to go from raw or pre-processed expression tables to:
- Quality assessment & exploratory data analysis (PCA, clustering, QC plots)
- Differential expression testing (fold-change, adjusted p-values)
- Visualization (volcano, MA, heatmaps, boxplots)
- Export of publication-ready tables
The repository contains notebooks that perform the entire analysis and write their output files to the outputs folder.
The notebook requires a standard Python stack. The first cell installs and imports the dependencies used in the analysis. Creating an isolated environment is recommended:
# create environment (conda recommended)
conda create -n dega python=3.10 -y
conda activate dega
# install core packages
pip install --upgrade pip
pip install jupyterlab geoquery GEOparse pandas numpy scipy matplotlib seaborn scikit-learn rpy2
# optional extras used by the notebook for exporting/figures
pip install openpyxl xlsxwriter plotly kaleido adjustText

The notebook's first cell includes `pip install` statements so it can be run in a fresh Colab/Binder session as well.
Open the notebook in Google Colab (or run locally) — the setup cell will install dependencies automatically.
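Before running the analysis cells locally, it can help to confirm that the stack imports cleanly. The following is a minimal sketch and not part of the notebook: `missing_modules` and `CORE_MODULES` are hypothetical names, and the import names listed (for example `sklearn` for scikit-learn) are the usual ones for the packages in the install commands above.

```python
import importlib.util

# Usual import names for the core packages; scikit-learn is imported as "sklearn".
CORE_MODULES = [
    "pandas", "numpy", "scipy", "matplotlib",
    "seaborn", "sklearn", "GEOparse", "rpy2",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(CORE_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All core modules are importable.")
```

An empty result only shows that the modules can be located; it does not check their versions.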
DEGA.ipynb is divided into the following high-level sections (run in this order):
- Install and import libraries: ensures all Python/R dependencies are available.
- Load data & sample metadata: load expression matrices and the sample metadata file (or download via GEO if configured).
- Preprocessing & filtering: low-expression filtering and optional normalization steps.
- Exploratory data analysis: PCA, sample QC, sample clustering, QC plots.
- Differential expression testing: statistical tests, p-value adjustment, fold-change calculation.
- Post-processing & filtering: select significant genes by p-value and log2 fold-change thresholds.
- Visualization: volcano plot, MA-plot, heatmaps, boxplots for top genes.
- Export results: write `comprehensive_deg_results.csv`, `publication_ready_results.csv`, figures, and a `statistical_summary.txt`.
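The differential expression step combines a fold-change calculation with p-value adjustment. The sketch below illustrates both under stated assumptions: it uses the Benjamini-Hochberg procedure for adjustment (the notebook's actual method is not specified here), it assumes linear-scale expression values with genes as rows and samples as columns, and the function names and the `pseudocount` parameter are hypothetical.

```python
import numpy as np


def log2_fold_change(control, treated, pseudocount=1.0):
    """Per-gene log2 ratio of group means (treated over control)."""
    control_mean = np.asarray(control, dtype=float).mean(axis=1)
    treated_mean = np.asarray(treated, dtype=float).mean(axis=1)
    # The pseudocount avoids taking the log of zero for unexpressed genes.
    return np.log2(treated_mean + pseudocount) - np.log2(control_mean + pseudocount)


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


control = [[3, 3], [7, 7]]
treated = [[7, 7], [3, 3]]
print(log2_fold_change(control, treated))
print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

If the expression table is already log-transformed, the fold change is usually taken as a plain difference of group means instead.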
Notes:
- The notebook defines threshold variables (e.g. `p_threshold`, `log2fc_threshold`) near the DGE section; adjust them before running the visualization cells.
- The notebook prints progress and places outputs in the local working directory (see the `deg_analysis_output` zip for an example layout).
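To make the role of the two thresholds concrete, here is a minimal sketch of how a gene might be labeled from its adjusted p-value and log2 fold-change. The variable names `p_threshold` and `log2fc_threshold` come from the notebook; the default values (0.05 and 1.0), the function name, and the label strings are assumptions for illustration, not the notebook's settings.

```python
def classify_gene(log2fc, padj, p_threshold=0.05, log2fc_threshold=1.0):
    """Label a gene as 'up', 'down', or 'not_significant'.

    The default thresholds are illustrative assumptions.
    """
    if padj >= p_threshold:
        return "not_significant"
    if log2fc >= log2fc_threshold:
        return "up"
    if log2fc <= -log2fc_threshold:
        return "down"
    # Statistically significant, but the effect size is below the cutoff.
    return "not_significant"


for log2fc, padj in [(2.3, 0.001), (-1.8, 0.01), (0.4, 0.001), (3.0, 0.2)]:
    print(log2fc, padj, classify_gene(log2fc, padj))
```

Tightening either threshold can only shrink the set of genes labeled `up` or `down`, which is why the values should be fixed before the visualization cells are run.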
- `comprehensive_deg_results.csv`: full results table containing expression means, standard deviations, log2 fold-change, raw and adjusted p-values, and flags for significance/regulation.
- `publication_ready_results.csv`: curated table ready for inclusion in papers/supplementary material.
- `supplementary_all_genes_analysis.csv`: additional summary metrics for all genes.
- `expression_filtered.csv`: filtered expression matrix used for downstream analysis.
- `statistical_summary.txt`: short text summary (date of analysis, number of genes analyzed, counts of up/downregulated genes, etc.).
- Figures: `volcano_plot.png`, `ma_plot.png`, `top_genes_heatmap.png`, `top_genes_boxplots.png`, `exploratory_analysis.png`, `quality_assessment.png`, `expression_clusters.png`, `treatment_effect_preview.png`, `comprehensive_treatment_validation.png`.
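The filtered expression matrix is the result of low-expression filtering. The exact rule the notebook applies is not described here, so the following is only a sketch of one common approach: keep a gene when it exceeds a minimum value in at least a minimum number of samples. The function name and both cutoffs (`min_value`, `min_samples`) are hypothetical.

```python
import numpy as np


def filter_low_expression(matrix, min_value=1.0, min_samples=2):
    """Keep genes (rows) with expression above min_value in at least min_samples samples.

    Returns the filtered matrix and the boolean mask of kept rows.
    """
    values = np.asarray(matrix, dtype=float)
    keep = (values > min_value).sum(axis=1) >= min_samples
    return values[keep], keep


expression = [
    [0, 0, 5],  # above the cutoff in one sample only: dropped
    [2, 3, 0],  # above the cutoff in two samples: kept
    [4, 4, 4],  # above the cutoff in all samples: kept
]
filtered, keep = filter_low_expression(expression)
print(keep)
print(filtered.shape)
```

Returning the mask alongside the matrix makes it easy to apply the same selection to a list of gene identifiers.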
The latest statistical_summary.txt (analysis date: 2025-08-05) reports:
- Total genes analyzed: 1000
- Significant genes: 1 (Percent significant: 0.10%)
- Upregulated genes: 54
- Downregulated genes: 77
- Mean fold change (significant): 2.04
See `deg_analysis_output/statistical_summary.txt` for the full summary and top gene lists.
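As a small worked example of the percentage reported above, the sketch below formats summary lines from raw counts. The function name and the output layout are hypothetical and are not taken from the notebook; the counts are passed independently, so the sketch makes no assumption about how the up/downregulated counts relate to the significant count.

```python
def format_summary(total_genes, significant, upregulated, downregulated):
    """Build summary lines from raw counts (layout is illustrative only)."""
    percent = 100.0 * significant / total_genes if total_genes else 0.0
    return "\n".join([
        f"Total genes analyzed: {total_genes}",
        f"Significant genes: {significant} (Percent significant: {percent:.2f}%)",
        f"Upregulated genes: {upregulated}",
        f"Downregulated genes: {downregulated}",
    ])


# With 1 significant gene out of 1000, the percentage is 100 * 1 / 1000 = 0.10%.
print(format_summary(1000, 1, 54, 77))
```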
The notebook also demonstrates how to regenerate figures and adjust significance thresholds.
repo-root/
├── colab             # Main analysis notebook (DEGA.ipynb)
├── notebooks         # Each notebook cell in a separate file
├── outputs           # Expected results (exported CSVs and figures)
├── requirements.txt  # Requirements for using the repo
└── README.md         # This file
- The notebook tries to install exact Python packages at runtime (see the first cell).
- For full reproducibility, record the output of `pip freeze` or export the conda environment before running the analysis.
- If results are to be used in publications, set random seeds and record software versions used (the notebook prints the analysis date in `statistical_summary.txt`).
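One way to follow these reproducibility notes from inside Python is sketched below. It is an assumption-laden illustration, not code from the notebook: the function names, the seed value 42, and the report file name are hypothetical, and installed package versions are read with the standard library's `importlib.metadata` rather than by calling `pip freeze`.

```python
import importlib.metadata
import platform
import random


def set_seed(seed=42):
    """Seed Python's RNG, and NumPy's legacy global RNG if NumPy is installed."""
    random.seed(seed)
    try:
        import numpy as np
    except ImportError:
        return
    np.random.seed(seed)


def installed_versions():
    """Map each installed distribution name to its version, sorted by name."""
    versions = {}
    for dist in importlib.metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            versions[name] = dist.version
    return dict(sorted(versions.items(), key=lambda item: item[0].lower()))


def write_environment_report(path, seed):
    """Write the Python version, the seed, and all package versions to a text file."""
    lines = [f"python=={platform.python_version()}", f"seed=={seed}"]
    lines += [f"{name}=={version}" for name, version in installed_versions().items()]
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")


set_seed(42)
write_environment_report("environment_report.txt", seed=42)
```

Keeping this report next to the exported results ties each set of output files to the environment that produced it.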
Contributions and issues are welcome. Please open an issue describing the request or submit a pull request with tests and updated notebook outputs where appropriate. Suggested improvements:
- Add a command-line wrapper to run the pipeline headlessly.
- Add unit tests for core pre-processing functions.
- Add support for common normalization methods (DESeq2 via rpy2, limma-voom, edgeR).