Search genomes for chemical synthetic potential with chemsearch

chemsearch is a Python package that allows you to query genomes for their potential to synthesize specific chemical compounds. This tool is designed for researchers in bioinformatics and synthetic biology who want to identify microorganisms capable of producing valuable natural products or other chemicals of interest.

About the Project

The core functionality of chemsearch is to bridge the gap between chemical structures and genomic data. It does this by:

Preprocessing Biochemical Data: It takes reaction and compound data from sources like ModelSEED and Rhea to build a comprehensive database that links chemical compounds (represented by SMILES strings) to the enzymatic reactions (EC numbers) that produce them.
Querying by Chemical Structure: You can provide one or more chemical structures (as SMILES strings) and a similarity threshold to find related compounds in the database. This is useful for finding enzymes that might produce compounds similar to your target molecule.
Identifying Promising Genomes: chemsearch then searches a collection of genomes for those that contain the EC numbers identified in the previous step. This provides a list of candidate organisms that may have the metabolic pathways to synthesize your compound of interest.
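
To make the similarity threshold concrete, here is a minimal pure-Python sketch. It assumes a Tanimoto (Jaccard) coefficient computed over fingerprint bit sets, which is a common choice for structure similarity; this README does not state which metric chemsearch actually uses. The functions `tanimoto` and `similar_compounds`, the compound names, and the toy fingerprints are hypothetical illustrations, not part of the chemsearch API.

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of 'on' bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def similar_compounds(query_fp, database, threshold):
    """Return (name, score) pairs with score >= threshold, best match first."""
    scored = ((name, tanimoto(query_fp, fp)) for name, fp in database.items())
    return sorted(
        ((name, score) for name, score in scored if score >= threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Toy fingerprints (hypothetical bit indices, not real RDKit output)
database = {
    "compound_a": {1, 2, 3, 4},
    "compound_b": {1, 2, 3, 9},
    "compound_c": {7, 8},
}
query = {1, 2, 3, 4}
hits = similar_compounds(query, database, threshold=0.5)
print(hits)
```

Raising the threshold narrows the result to closer structural matches; lowering it casts a wider net at the cost of more distant, possibly irrelevant compounds.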

Getting Started

To use chemsearch, you need to have Python 3.9 or higher installed, along with the following packages:

rdkit: For chemical informatics and fingerprinting.
pandas: For data manipulation.
ipykernel: To run the example notebook.

You can install these dependencies using pip:

pip install rdkit-pypi pandas ipykernel
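
After installing, a quick check along the following lines can confirm that the interpreter version and the three packages are visible to Python. This is only a sketch: `check_environment` is a hypothetical helper written for this example and is not shipped with chemsearch.

```python
import sys
from importlib.util import find_spec


def check_environment(min_version=(3, 9), modules=("rdkit", "pandas", "ipykernel")):
    """Report whether the interpreter and the listed modules are available."""
    report = {"python": sys.version_info >= min_version}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        report[name] = find_spec(name) is not None
    return report


for item, ok in check_environment().items():
    print(f"{item}: {'ok' if ok else 'missing'}")
```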

Usage

The best way to understand how to use chemsearch is to follow the example notebook. Here's a summary of the workflow:

1. Preprocess Reaction Data

First, you need to preprocess the reaction and compound data to create a database that can be queried. The chemsearch.preprocess module provides functions for this purpose.

from pathlib import Path
from chemsearch.preprocess import preprocess_reactions, process_reactions_to_dataframe

reactions_json = Path("data/modelseed/reactions.json")
compounds_json = Path("data/modelseed/compounds.json")

reactions = preprocess_reactions(reactions_json, compounds_json, complete_smiles=True)

df = process_reactions_to_dataframe(reactions)
df.to_csv("data/compound_ec_database.tsv", sep="\t", index=False)
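
To illustrate the kind of table this step produces, the sketch below flattens a few reaction records into one row per (product compound, EC number) pair and writes them as tab-separated text. The record layout, the column names, and the helper `reactions_to_rows` are assumptions made for this example; the real ModelSEED schema and the actual columns written by `process_reactions_to_dataframe` may differ.

```python
import csv
import io

# Hypothetical, simplified reaction records (not the real ModelSEED schema)
reactions = [
    {
        "id": "rxn_1",
        "ec_numbers": ["1.1.1.1"],  # alcohol dehydrogenase
        "products": [{"name": "acetaldehyde", "smiles": "CC=O"}],
    },
    {
        "id": "rxn_2",
        "ec_numbers": ["4.1.1.1"],  # pyruvate decarboxylase
        "products": [
            {"name": "acetaldehyde", "smiles": "CC=O"},
            {"name": "carbon dioxide", "smiles": "O=C=O"},
        ],
    },
]


def reactions_to_rows(reactions):
    """Emit one row per (product compound, EC number) pair."""
    rows = []
    for rxn in reactions:
        for product in rxn["products"]:
            for ec in rxn["ec_numbers"]:
                rows.append({
                    "compound": product["name"],
                    "smiles": product["smiles"],
                    "ec_number": ec,
                    "reaction_id": rxn["id"],
                })
    return rows


rows = reactions_to_rows(reactions)

# Write the rows as tab-separated text (to an in-memory buffer here)
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Keeping one row per compound and EC pair means the same compound can appear several times, once for each enzyme reported to produce it, which is what a later lookup from compound to EC numbers needs.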

2. Query the Database by Chemical Structure

Once you have the database, you can query it using SMILES strings. The chemsearch.query module contains the necessary functions.

import pandas as pd
from chemsearch.query import query_reaction_database_by_smiles

# Load the preprocessed database (the file name suggests a version extended with Rhea data)
rxn_df = pd.read_csv("data/compound_ec_database_plus_rhea.tsv", sep="\t")

# Define your target compounds and similarity threshold
target_smiles = ["CC1=C(/C=C/C(C)=C/C=C/C(C)=C/C=C/C=C(C)/C=C/C=C(C(C)(C)C)C1)C", "O=C1C(=C(C=CC(=CC=CC(=CC=CC=C(C=CC=C(C=CC2=C(C)C(O)CC2(C)C)C)C)C)C)C)C(C)(C)C1"]
threshold = 0.7

# Query the database
results = query_reaction_database_by_smiles(rxn_df, target_smiles, threshold, only_best_hit=True)
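
The README does not document what `only_best_hit=True` does. One plausible reading, sketched below under that assumption, is that only the highest-scoring database compound at or above the threshold is kept for each query. The helper `best_hits`, the query and compound names, and the scores are all hypothetical.

```python
def best_hits(scored_pairs, threshold):
    """Keep, for each query, the highest-scoring match at or above the threshold."""
    best = {}
    for query, compound, score in scored_pairs:
        if score < threshold:
            continue  # below the similarity cutoff
        if query not in best or score > best[query][1]:
            best[query] = (compound, score)
    return best


# Hypothetical (query, database compound, similarity score) triples
scored_pairs = [
    ("query_1", "compound_a", 0.92),
    ("query_1", "compound_b", 0.75),
    ("query_2", "compound_c", 0.64),
    ("query_2", "compound_d", 0.71),
]
print(best_hits(scored_pairs, threshold=0.7))
```

Under this reading, disabling the option would return every match above the threshold, which is useful when several related compounds, and therefore several candidate enzymes, are of interest.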

3. Identify Genomes with Synthetic Potential

Finally, you can use the results from the previous step to find genomes that are likely to produce your target compounds.

from chemsearch.preprocess import parse_genome_ec_numbers
from chemsearch.query import extract_genome_hits

# Load genome data
directory_path = "data/genomes_ec"
genome_data = parse_genome_ec_numbers(directory_path)

# Find genomes with matching EC numbers
hits = extract_genome_hits(results, genome_data, taxonomy='data/taxonomy_mapping.tsv')

The hits DataFrame will contain a list of genomes, the enzymes they contain, and the compounds they are predicted to produce.
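
Conceptually, this step is a set intersection between the EC numbers returned by the query and the EC numbers annotated in each genome. The sketch below shows that idea with plain dictionaries and sets; `find_genome_hits` and the genome annotations are hypothetical and do not reflect the actual return types of `parse_genome_ec_numbers` or `extract_genome_hits`.

```python
def find_genome_hits(target_ecs, genome_ecs):
    """Map each genome to the sorted target EC numbers it contains, skipping genomes with none."""
    hits = {}
    for genome_id, ecs in genome_ecs.items():
        shared = target_ecs & ecs
        if shared:
            hits[genome_id] = sorted(shared)
    return hits


# Hypothetical genome annotations: genome ID -> set of annotated EC numbers
genome_ecs = {
    "genome_1": {"5.5.1.19", "2.5.1.32"},
    "genome_2": {"1.13.11.63"},
    "genome_3": {"3.1.1.1"},
}
target_ecs = {"5.5.1.19", "1.13.11.63"}
print(find_genome_hits(target_ecs, genome_ecs))
```

A shared EC number only shows that a genome is annotated with a relevant enzyme; it does not by itself establish that the complete pathway to the target compound is present.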

MCP Server Integration

chemsearch includes an MCP (Model Context Protocol) server that enables natural language queries through AI assistants like Claude Code. For setup instructions and configuration details, see the MCP Server Documentation.

MCP Server Demo: Natural Language Queries

The demo below shows how the MCP server answers a natural language question about which organisms can produce a specific compound:

Demo: "Which organisms produce beta-carotenes and display their taxonomy?"

User: Which organisms produce beta-carotenes? Display their taxonomy.

Claude Code (via MCP): I'll search for organisms that can produce beta-carotene using the ChEBI identifier for beta-carotene (CHEBI:17579) and analyze the archaea genome database.

[MCP server configured with archaea genome database: /home/robaina/Documents/chemSearch/data/refseq/archaea]

Query Results: Found 3 archaea genomes with beta-carotene production potential:

Organism	Genome ID	Taxonomy	EC Numbers	Similarity Score
Halobacterium rubrum	GCF_029872365.1	Archaea; Halobacterium rubrum; strain=R9-3	5.5.1.19	1.0 (exact match)
Halorussus pelagicus	GCF_004087835.1	Archaea; Halorussus pelagicus; strain=RC-68	1.13.11.63	1.0 (exact match)
Halobacterium salinarum	GCF_004799605.1	Archaea; Halobacterium salinarum; strain=91-R6	5.5.1.19	1.0 (exact match)

Scientific Summary: Why These Archaea Produce Beta-Carotene

These halophilic (salt-loving) archaea produce beta-carotene as part of their sophisticated carotenoid biosynthesis pathways, serving multiple critical biological functions:

Evolutionary Adaptation to Extreme Environments

Osmotic Protection: These organisms thrive in hypersaline environments (>15% salt concentration) where beta-carotene and its derivatives strengthen cell membranes against osmotic stress
UV Radiation Shield: Living in salt flats and solar salterns exposes them to intense UV radiation; carotenoids provide photoprotection
Oxidative Stress Defense: Extreme salinity generates reactive oxygen species; beta-carotene acts as a powerful antioxidant

Metabolic Integration

Dual-Purpose Biosynthesis: Beta-carotene serves as a precursor for retinal, the chromophore in bacteriorhodopsin (a light-driven proton pump for energy production)
Regulatory Balance: These organisms carefully regulate the conversion of lycopene to either beta-carotene (for retinal synthesis) or bacterioruberin (C50 carotenoids for membrane protection)
Resource Optimization: Sophisticated regulatory mechanisms ensure optimal allocation between protective pigments and functional cofactors

Biotechnological Significance

High Production Capacity: Halophilic archaea can accumulate carotenoids up to 335 μg/g dry mass
Unique Pigments: They produce bacterioruberin (C50 carotenoid) found almost exclusively in these organisms
Industrial Applications: These carotenoids have superior antioxidant properties compared to conventional beta-carotene and show promise for food, pharmaceutical, and cosmetic applications

Enzymatic Machinery

The identified EC numbers correspond to:

EC 5.5.1.19: Lycopene beta-cyclase (converts lycopene to beta-carotene)
EC 1.13.11.63: Beta-carotene 15,15'-dioxygenase (cleaves beta-carotene to retinal)

These enzymes represent key steps in the carotenoid biosynthetic pathway, confirming these organisms' genetic capacity for beta-carotene production and utilization.

This natural language query capability demonstrates how the MCP server enables intuitive exploration of complex genomic and chemical data, making sophisticated bioinformatics analyses accessible through conversational interfaces.

License

This project is licensed under the GPL-3.0 License. See the LICENSE file for details.

Contact

Semidán Robaina Estévez - [email protected]

Project Link: https://github.com/Robaina/chemsearch
