chemsearch is a Python package that allows you to query genomes for their potential to synthesize specific chemical compounds. This tool is designed for researchers in bioinformatics and synthetic biology who want to identify microorganisms capable of producing valuable natural products or other chemicals of interest.
The core functionality of chemsearch is to bridge the gap between chemical structures and genomic data. It does this by:
- Preprocessing Biochemical Data: It takes reaction and compound data from sources like ModelSEED and Rhea to build a comprehensive database that links chemical compounds (represented by SMILES strings) to the enzymatic reactions (EC numbers) that produce them.
- Querying by Chemical Structure: You can provide one or more chemical structures (as SMILES strings) and a similarity threshold to find related compounds in the database. This is useful for finding enzymes that might produce compounds similar to your target molecule.
- Identifying Promising Genomes: chemsearch then searches a collection of genomes for those that contain the EC numbers identified in the previous step. This provides a list of candidate organisms that may have the metabolic pathways to synthesize your compound of interest.
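The similarity filter in the query step can be pictured with a small, self-contained sketch. It assumes Tanimoto similarity over fingerprint bit sets, a common choice in cheminformatics; the description above does not say which metric or fingerprint chemsearch actually uses. The names `tanimoto` and `filter_by_similarity` and the toy fingerprints are hypothetical. In practice the bit sets would come from a fingerprinting library such as rdkit rather than being written by hand.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_by_similarity(query, database, threshold):
    """Return (compound_id, score) pairs scoring at or above threshold, best first."""
    scored = ((cid, tanimoto(query, fp)) for cid, fp in database.items())
    hits = [(cid, score) for cid, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy fingerprints (hypothetical): each set holds the indices of bits that are on.
toy_database = {
    "cpd_A": frozenset({1, 2, 3, 4}),
    "cpd_B": frozenset({1, 2, 3, 9}),
    "cpd_C": frozenset({7, 8}),
}
query_fp = frozenset({1, 2, 3, 4})

hits = filter_by_similarity(query_fp, toy_database, threshold=0.5)
print(hits)
```

Raising the threshold narrows the result to closer structural analogues; lowering it casts a wider net at the cost of more loosely related compounds.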
To use chemsearch, you need to have Python 3.9 or higher installed, along with the following packages:
- rdkit: For chemical informatics and fingerprinting.
- pandas: For data manipulation.
- ipykernel: To run the example notebook.
You can install these dependencies using pip:
pip install rdkit-pypi pandas ipykernel

The best way to understand how to use chemsearch is to follow the example notebook. Here's a summary of the workflow:
First, you need to preprocess the reaction and compound data to create a database that can be queried. The chemsearch.preprocess module provides functions for this purpose.
from pathlib import Path
from chemsearch.preprocess import preprocess_reactions, process_reactions_to_dataframe
reactions_json = Path("data/modelseed/reactions.json")
compounds_json = Path("data/modelseed/compounds.json")
reactions = preprocess_reactions(reactions_json, compounds_json, complete_smiles=True)
df = process_reactions_to_dataframe(reactions)
df.to_csv("data/compound_ec_database.tsv", sep="\t", index=False)

Once you have the database, you can query it using SMILES strings. The chemsearch.query module contains the necessary functions.
import pandas as pd
from chemsearch.query import query_reaction_database_by_smiles
# Load the preprocessed database
rxn_df = pd.read_csv("data/compound_ec_database_plus_rhea.tsv", sep="\t")
# Define your target compounds and similarity threshold
target_smiles = ["CC1=C(/C=C/C(C)=C/C=C/C(C)=C/C=C/C=C(C)/C=C/C=C(C(C)(C)C)C1)C", "O=C1C(=C(C=CC(=CC=CC(=CC=CC=C(C=CC=C(C=CC2=C(C)C(O)CC2(C)C)C)C)C)C)C)C(C)(C)C1"]
threshold = 0.7
# Query the database
results = query_reaction_database_by_smiles(rxn_df, target_smiles, threshold, only_best_hit=True)

Finally, you can use the results from the previous step to find genomes that are likely to produce your target compounds.
from chemsearch.preprocess import parse_genome_ec_numbers
from chemsearch.query import extract_genome_hits
# Load genome data
directory_path = "data/genomes_ec"
genome_data = parse_genome_ec_numbers(directory_path)
# Find genomes with matching EC numbers
hits = extract_genome_hits(results, genome_data, taxonomy='data/taxonomy_mapping.tsv')

The hits DataFrame will contain a list of genomes, the enzymes they contain, and the compounds they are predicted to produce.
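Conceptually, the genome step is a set intersection between the EC numbers returned by the query and the EC numbers annotated in each genome. The sketch below illustrates that idea with plain dictionaries. It is not the implementation of `extract_genome_hits`; the function name `match_genomes`, the data layout, and the toy genome IDs are assumptions made for illustration.

```python
def match_genomes(query_ecs, genome_ecs):
    """Map each genome ID to the sorted query EC numbers it carries.

    Genomes that share no EC number with the query are left out.
    """
    wanted = set(query_ecs)
    matches = {}
    for genome_id, ecs in genome_ecs.items():
        shared = wanted & set(ecs)
        if shared:
            matches[genome_id] = sorted(shared)
    return matches


# Hypothetical inputs: EC numbers from a query, and EC annotations per genome.
query_ecs = ["5.5.1.19", "1.13.11.63"]
genome_ecs = {
    "genome_1": {"5.5.1.19", "2.7.1.1"},
    "genome_2": {"1.1.1.1"},
    "genome_3": {"1.13.11.63", "5.5.1.19"},
}

genome_matches = match_genomes(query_ecs, genome_ecs)
for genome_id, ecs in genome_matches.items():
    print(genome_id, ecs)
```

A genome carrying a matching EC number is only a candidate: a single annotated enzyme does not by itself show that the full pathway to the target compound is present.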
chemsearch includes an MCP (Model Context Protocol) server that enables natural language queries through AI assistants like Claude Code. For setup instructions and configuration details, see the MCP Server Documentation.
The following demo shows how the MCP server answers a natural language question about organisms that produce a specific compound:
User: Which organisms produce beta-carotenes? Display their taxonomy.
Claude Code (via MCP): I'll search for organisms that can produce beta-carotene using the ChEBI identifier for beta-carotene (CHEBI:17579) and analyze the archaea genome database.
[MCP server configured with archaea genome database: /home/robaina/Documents/chemSearch/data/refseq/archaea]
Query Results: Found 3 archaea genomes with beta-carotene production potential:
| Organism | Genome ID | Taxonomy | EC Numbers | Similarity Score | 
|---|---|---|---|---|
| Halobacterium rubrum | GCF_029872365.1 | Archaea; Halobacterium rubrum; strain=R9-3 | 5.5.1.19 | 1.0 (exact match) | 
| Halorussus pelagicus | GCF_004087835.1 | Archaea; Halorussus pelagicus; strain=RC-68 | 1.13.11.63 | 1.0 (exact match) | 
| Halobacterium salinarum | GCF_004799605.1 | Archaea; Halobacterium salinarum; strain=91-R6 | 5.5.1.19 | 1.0 (exact match) | 
These halophilic (salt-loving) archaea produce beta-carotene as part of their sophisticated carotenoid biosynthesis pathways, serving multiple critical biological functions:
- Osmotic Protection: These organisms thrive in hypersaline environments (>15% salt concentration) where beta-carotene and its derivatives strengthen cell membranes against osmotic stress
- UV Radiation Shield: Living in salt flats and solar salterns exposes them to intense UV radiation; carotenoids provide photoprotection
- Oxidative Stress Defense: Extreme salinity generates reactive oxygen species; beta-carotene acts as a powerful antioxidant
- Dual-Purpose Biosynthesis: Beta-carotene serves as a precursor for retinal, the chromophore in bacteriorhodopsin (a light-driven proton pump for energy production)
- Regulatory Balance: These organisms carefully regulate the conversion of lycopene to either beta-carotene (for retinal synthesis) or bacterioruberin (C50 carotenoids for membrane protection)
- Resource Optimization: Sophisticated regulatory mechanisms ensure optimal allocation between protective pigments and functional cofactors
- High Production Capacity: Halophilic archaea can accumulate carotenoids up to 335 μg/g dry mass
- Unique Pigments: They produce bacterioruberin (C50 carotenoid) found almost exclusively in these organisms
- Industrial Applications: These carotenoids have superior antioxidant properties compared to conventional beta-carotene and show promise for food, pharmaceutical, and cosmetic applications
The identified EC numbers correspond to:
- EC 5.5.1.19: Lycopene beta-cyclase (converts lycopene to beta-carotene)
- EC 1.13.11.63: Beta-carotene 15,15'-dioxygenase (cleaves beta-carotene to retinal)
These enzymes represent key steps in the carotenoid biosynthetic pathway, confirming these organisms' genetic capacity for beta-carotene production and utilization.
This natural language query capability demonstrates how the MCP server enables intuitive exploration of complex genomic and chemical data, making sophisticated bioinformatics analyses accessible through conversational interfaces.
This project is licensed under the GPL-3.0 License. See the LICENSE file for details.
Semidán Robaina Estévez
Project Link: https://github.com/Robaina/chemsearch
