Tools to use and expand the capabilities of the original GenePT. This repository contains utilities and notebooks for working with gene embeddings and single-cell RNA sequencing data.
This project builds upon the GenePT paper and provides tools to:
- Compare different embedding approaches (GenePT vs scGPT)
- Work with large single-cell datasets like Tabula Sapiens
- Generate composable embeddings across different dimensions
- Perform cell type classification using embeddings
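This README does not say how embeddings from different dimensions are combined. As a minimal sketch, assuming that "composable" means each dimension has its own vector and that the selected vectors are L2-normalized and concatenated, it could look like this. The function `compose_embeddings` and the keys `associated_genes` and `pathways` are hypothetical names used for illustration, not the repository's API.

```python
import numpy as np


def compose_embeddings(parts, dimensions):
    """Concatenate the unit-normalized vectors of the selected dimensions.

    parts: dict mapping a dimension name to a 1-D embedding vector
    dimensions: ordered list of dimension names to include
    """
    blocks = []
    for name in dimensions:
        vec = np.asarray(parts[name], dtype=float)
        # Normalize so that no single dimension dominates by scale alone.
        blocks.append(vec / max(np.linalg.norm(vec), 1e-12))
    return np.concatenate(blocks)


# Toy vectors for one gene (hypothetical values, far shorter than real embeddings).
gene_parts = {
    "associated_genes": np.array([3.0, 4.0]),
    "pathways": np.array([0.0, 2.0, 0.0]),
}
combined = compose_embeddings(gene_parts, ["associated_genes", "pathways"])
```

Concatenation keeps each dimension in its own block of the vector, so a dimension can be added or dropped without recomputing the others. Summing or averaging would be an equally plausible choice.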
The following image summarizes the results so far of comparing GenePT and scGPT on zero-shot classification:
The output was formatted in a Google Sheet.
- Python 3.10 (required for scGPT compatibility)
- Standard scientific Python packages (pandas, numpy, scikit-learn)
- Special dependencies:
  - scGPT
  - AnnData
  - Hugging Face datasets/models
# Create venv
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate on Windows
# Install in development mode with all tools
pip install -e ".[dev]"
# Format code
black .
isort --gitignore .
# Run tests
pytest
GenePT-tools/
├── src/ # utility functions
└── notebooks/ # analysis notebooks
See generate_genept_embeddings.ipynb for how to generate the GenePT embeddings and dataset and upload them to the Hugging Face Hub. create_hf_repos.ipynb creates the repositories for the embeddings and dataset.
See the tabula_sapiens_*.ipynb notebooks for a comparison of cell type classification using GenePT and scGPT embeddings.
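As a rough illustration of the idea behind these notebooks, the sketch below builds cell embeddings as expression-weighted averages of gene embeddings, along the lines of the weighted averaging described in the GenePT paper. It then labels held-out cells by cosine similarity to per-type centroids. The nearest-centroid step is a simple stand-in chosen for brevity, not necessarily the classifier used in the notebooks, and the function names and toy data are hypothetical.

```python
import numpy as np


def cell_embeddings(expr, gene_emb):
    """Expression-weighted average of gene embeddings, one unit-norm row per cell.

    expr: (n_cells, n_genes) non-negative expression matrix
    gene_emb: (n_genes, dim) matrix whose rows are aligned with expr's columns
    """
    expr = np.asarray(expr, dtype=float)
    # Turn each cell's expression values into weights that sum to 1.
    weights = expr / np.clip(expr.sum(axis=1, keepdims=True), 1e-12, None)
    emb = weights @ np.asarray(gene_emb, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.clip(norms, 1e-12, None)


def nearest_centroid_predict(train_emb, train_labels, test_emb):
    """Assign each test cell the label of the most cosine-similar class centroid."""
    train_labels = np.asarray(train_labels)
    classes = np.unique(train_labels)
    centroids = np.stack([train_emb[train_labels == c].mean(axis=0) for c in classes])
    centroids /= np.clip(np.linalg.norm(centroids, axis=1, keepdims=True), 1e-12, None)
    sims = test_emb @ centroids.T  # rows are unit vectors, so this is cosine similarity
    return classes[sims.argmax(axis=1)]


# Toy data: six genes with 2-D embeddings; type "A" expresses genes 0-2, type "B" genes 3-5.
gene_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                     [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
train_expr = np.array([[5, 4, 6, 0, 0, 0], [6, 5, 5, 0, 1, 0],
                       [0, 0, 1, 5, 6, 4], [0, 1, 0, 6, 5, 5]])
train_labels = ["A", "A", "B", "B"]
test_expr = np.array([[4, 6, 5, 0, 0, 1], [1, 0, 0, 5, 4, 6]])

pred = nearest_centroid_predict(
    cell_embeddings(train_expr, gene_emb), train_labels,
    cell_embeddings(test_expr, gene_emb),
)
print(pred.tolist())  # ['A', 'B']
```

For real data, the expression matrix would typically come from an AnnData object and may be sparse, in which case it needs converting or a sparse-aware version of the weighting step.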
| Notebook | Description |
|---|---|
| generate_genept_embeddings.ipynb | Generates the GenePT embeddings and dataset for upload to Hugging Face Hub |
| tabula_sapiens_embed_genept.ipynb | Evaluates GenePT embeddings' cell classification performance on Tabula Sapiens |
| create_hf_repos.ipynb | Creates the initial Hugging Face repositories for the GenePT embeddings and dataset |
| tabula_sapiens_eda.ipynb | Exploratory analysis of the Tabula Sapiens single-cell dataset |
| tabula_sapiens_embed_genept.ipynb | Embeds a subset of the Tabula Sapiens dataset using GenePT embeddings |
| tabula_sapiens_embed_scgpt.ipynb | Embeds a subset of the Tabula Sapiens dataset using scGPT embeddings |
| tabula_sapiens_analysis_all.ipynb | Compares GenePT and scGPT embeddings for cell type classification on Tabula Sapiens |
- Support for loading and processing large sparse AnnData files
- Integration with Hugging Face datasets
- GenePT original embeddings
- scGPT embeddings
- Composable embeddings across different dimensions:
  - Associated genes
  - Aging-related information
  - Drug interactions
  - Pathways and biological processes
- Cell type classification
- Embedding comparison utilities
- Visualization tools for high-dimensional data
- Exact comparison between scGPT and GenePT embeddings
- Minimum cell count filtering per cell type
- AnnData integration
- Original GenePT embeddings support
- Prompt improvements:
  - Remove aging
  - Add cell type
  - Add tissue type
  - Add dysfunctional cell type
- scGPT with batch tokens
- scGPT with modality tokens
- scGPT with combined batch/modality tokens
- Complete Tabula Sapiens cell embedding
- Cell-document bidirectional lookups
- Cell separation analysis
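The minimum cell count filtering listed above can be sketched as follows. This is an illustrative version under stated assumptions, not the repository's implementation: the function name `min_cell_count_mask` and the default threshold of 10 are hypothetical.

```python
import numpy as np


def min_cell_count_mask(cell_types, min_cells=10):
    """Boolean mask keeping cells whose type has at least `min_cells` members."""
    cell_types = np.asarray(cell_types)
    types, counts = np.unique(cell_types, return_counts=True)
    keep = types[counts >= min_cells]  # cell types that are frequent enough
    return np.isin(cell_types, keep)


# Toy labels: 5 T cells, 3 B cells and a single mast cell.
labels = ["T cell"] * 5 + ["B cell"] * 3 + ["mast cell"]
mask = min_cell_count_mask(labels, min_cells=3)
kept = sorted(set(np.asarray(labels)[mask].tolist()))
print(kept)  # ['B cell', 'T cell']
```

With an AnnData object, the same mask could be built from an annotation column and applied as `adata[mask]`. The column name, for example `adata.obs["cell_type"]`, depends on the dataset and is an assumption here.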
This is a preliminary repository with work in progress. Code is mostly untested but being actively developed. Contributions and collaborations are welcome.
This project is licensed under the MIT License. The original GenePT weights are governed by the license of the original GenePT repository.