Search genomes for chemical synthetic potential with chemsearch

chemsearch is a Python package that allows you to query genomes for their potential to synthesize specific chemical compounds. This tool is designed for researchers in bioinformatics and synthetic biology who want to identify microorganisms capable of producing valuable natural products or other chemicals of interest.

About the Project

The core functionality of chemsearch is to bridge the gap between chemical structures and genomic data. It does this by:

  1. Preprocessing Biochemical Data: It takes reaction and compound data from sources like ModelSEED and Rhea to build a comprehensive database that links chemical compounds (represented by SMILES strings) to the enzymatic reactions (EC numbers) that produce them.
  2. Querying by Chemical Structure: You can provide one or more chemical structures (as SMILES strings) and a similarity threshold to find related compounds in the database. This is useful for finding enzymes that might produce compounds similar to your target molecule.
  3. Identifying Promising Genomes: chemsearch then searches a collection of genomes for those that contain the EC numbers identified in the previous step. This provides a list of candidate organisms that may have the metabolic pathways to synthesize your compound of interest.

Getting Started

To use chemsearch, you need to have Python 3.9 or higher installed, along with the following packages:

  • rdkit: For chemical informatics and fingerprinting.
  • pandas: For data manipulation.
  • ipykernel: To run the example notebook.

You can install these dependencies using pip:

pip install rdkit pandas ipykernel
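To confirm that the dependencies can be imported before opening the notebook, a small check along these lines may help. It is only a sketch using the standard library; the helper name `missing_modules` is illustrative and not part of chemsearch.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names of the dependencies listed above
required = ["rdkit", "pandas", "ipykernel"]
missing = missing_modules(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies found.")
```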

Usage

The best way to understand how to use chemsearch is to follow the example notebook. Here's a summary of the workflow:

1. Preprocess Reaction Data

First, you need to preprocess the reaction and compound data to create a database that can be queried. The chemsearch.preprocess module provides functions for this purpose.

from pathlib import Path
from chemsearch.preprocess import preprocess_reactions, process_reactions_to_dataframe

# ModelSEED reaction and compound files
reactions_json = Path("data/modelseed/reactions.json")
compounds_json = Path("data/modelseed/compounds.json")

# Preprocess the reactions using the compound data
reactions = preprocess_reactions(reactions_json, compounds_json, complete_smiles=True)

# Convert to a DataFrame and save it as a tab-separated file
df = process_reactions_to_dataframe(reactions)
df.to_csv("data/compound_ec_database.tsv", sep="\t", index=False)
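To illustrate the idea behind this step, the sketch below flattens a few toy reaction records into one row per product and EC number, then writes them as tab-separated text. The record layout, the field names (`ec_numbers`, `product_smiles`), and the helper `flatten_reactions` are assumptions made for illustration; the actual ModelSEED schema and the internals of chemsearch may differ.

```python
import csv
import io

# Toy reaction records with a hypothetical, simplified layout.
reactions = [
    {"id": "rxn_A", "ec_numbers": ["1.1.1.1"], "product_smiles": ["CC=O"]},
    {"id": "rxn_B", "ec_numbers": ["4.1.1.1"], "product_smiles": ["CC=O", "O=C=O"]},
    {"id": "rxn_C", "ec_numbers": [], "product_smiles": ["O"]},  # no EC number: yields no rows
]


def flatten_reactions(records):
    """Return one row per (product SMILES, EC number) pair in each reaction."""
    rows = []
    for rec in records:
        for smiles in rec["product_smiles"]:
            for ec in rec["ec_numbers"]:
                rows.append({"smiles": smiles, "ec_number": ec, "reaction_id": rec["id"]})
    return rows


rows = flatten_reactions(reactions)

# Write the rows as tab-separated text, similar in spirit to the TSV export above
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["smiles", "ec_number", "reaction_id"], delimiter="\t"
)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

A flat table of this kind makes it straightforward to look up every EC number associated with a given compound.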

2. Query the Database by Chemical Structure

Once you have the database, you can query it using SMILES strings. The chemsearch.query module contains the necessary functions.

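import pandas as pd
from chemsearch.query import query_reaction_database_by_smiles

# Load the preprocessed database. Note that this path differs from the file
# written in step 1; adjust it to the TSV you want to query.
rxn_df = pd.read_csv("data/compound_ec_database_plus_rhea.tsv", sep="\t")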

# Define your target compounds and similarity threshold
target_smiles = [
    "CC1=C(/C=C/C(C)=C/C=C/C(C)=C/C=C/C=C(C)/C=C/C=C(C(C)(C)C)C1)C",
    "O=C1C(=C(C=CC(=CC=CC(=CC=CC=C(C=CC=C(C=CC2=C(C)C(O)CC2(C)C)C)C)C)C)C)C(C)(C)C1",
]
threshold = 0.7

# Query the database
results = query_reaction_database_by_smiles(rxn_df, target_smiles, threshold, only_best_hit=True)
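The role of the threshold can be illustrated with a toy similarity search. The sketch below assumes Tanimoto similarity over fingerprint bit sets, a common choice for structure comparison; this README does not state which metric or fingerprint chemsearch uses, and the `search` helper with its `only_best_hit` flag is only one plausible reading of the parameter of the same name above.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def search(query_bits, db, threshold, only_best_hit=False):
    """Return (name, score) pairs with score >= threshold, best first."""
    hits = [(name, tanimoto(query_bits, bits)) for name, bits in db.items()]
    hits = [(name, score) for name, score in hits if score >= threshold]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:1] if only_best_hit else hits


# Toy fingerprints: each compound is a set of "on" bit positions (made up).
database = {
    "compound_1": {1, 2, 3, 4},
    "compound_2": {1, 2, 3, 9},
    "compound_3": {7, 8},
}
query = {1, 2, 3, 4, 5}

print(search(query, database, threshold=0.7))
print(search(query, database, threshold=0.5, only_best_hit=True))
```

Raising the threshold narrows the results to close structural analogues, while lowering it admits more distant compounds and therefore more candidate enzymes.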

3. Identify Genomes with Synthetic Potential

Finally, you can use the results from the previous step to find genomes that are likely to produce your target compounds.

from chemsearch.preprocess import parse_genome_ec_numbers
from chemsearch.query import extract_genome_hits

# Load genome data
directory_path = "data/genomes_ec"
genome_data = parse_genome_ec_numbers(directory_path)

# Find genomes with matching EC numbers
hits = extract_genome_hits(results, genome_data, taxonomy='data/taxonomy_mapping.tsv')

The hits DataFrame will contain a list of genomes, the enzymes they contain, and the compounds they are predicted to produce.
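Conceptually, this step comes down to intersecting the EC numbers returned by the query with the EC numbers annotated in each genome. The sketch below shows that idea with plain sets; the genome identifiers and their EC assignments are made up, and `find_genome_hits` is a hypothetical helper, not the implementation of `extract_genome_hits`.

```python
# EC numbers returned by the structure query (toy example)
target_ecs = {"5.5.1.19", "1.13.11.63"}

# EC numbers annotated in each genome (made-up identifiers and assignments)
genome_ecs = {
    "genome_A": {"5.5.1.19", "2.7.1.1"},
    "genome_B": {"2.7.1.1"},
    "genome_C": {"1.13.11.63", "5.5.1.19"},
}


def find_genome_hits(targets, genomes):
    """Map each genome to the sorted list of target EC numbers it contains."""
    hits = {}
    for genome_id, ecs in genomes.items():
        shared = targets & ecs
        if shared:
            hits[genome_id] = sorted(shared)
    return hits


hits = find_genome_hits(target_ecs, genome_ecs)
for genome_id, ecs in hits.items():
    print(genome_id, ecs)
```

Genomes without any of the target EC numbers, such as `genome_B` here, are left out of the result.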

MCP Server Integration

chemsearch includes an MCP (Model Context Protocol) server that enables natural language queries through AI assistants like Claude Code. For setup instructions and configuration details, see the MCP Server Documentation.

MCP Server Demo: Natural Language Queries

The demo below shows how you can ask natural language questions about organisms that produce specific compounds:

Demo: "Which organisms produce beta-carotene, and what is their taxonomy?"

User: Which organisms produce beta-carotenes? Display their taxonomy.

Claude Code (via MCP): I'll search for organisms that can produce beta-carotene using the ChEBI identifier for beta-carotene (CHEBI:17579) and analyze the archaea genome database.

[MCP server configured with archaea genome database: /home/robaina/Documents/chemSearch/data/refseq/archaea]

Query Results: Found 3 archaea genomes with beta-carotene production potential:

| Organism | Genome ID | Taxonomy | EC Numbers | Similarity Score |
|---|---|---|---|---|
| Halobacterium rubrum | GCF_029872365.1 | Archaea; Halobacterium rubrum; strain=R9-3 | 5.5.1.19 | 1.0 (exact match) |
| Halorussus pelagicus | GCF_004087835.1 | Archaea; Halorussus pelagicus; strain=RC-68 | 1.13.11.63 | 1.0 (exact match) |
| Halobacterium salinarum | GCF_004799605.1 | Archaea; Halobacterium salinarum; strain=91-R6 | 5.5.1.19 | 1.0 (exact match) |

Scientific Summary: Why These Archaea Produce Beta-Carotene

These halophilic (salt-loving) archaea produce beta-carotene as part of their sophisticated carotenoid biosynthesis pathways, serving multiple critical biological functions:

Evolutionary Adaptation to Extreme Environments

  • Osmotic Protection: These organisms thrive in hypersaline environments (>15% salt concentration) where beta-carotene and its derivatives strengthen cell membranes against osmotic stress
  • UV Radiation Shield: Living in salt flats and solar salterns exposes them to intense UV radiation; carotenoids provide photoprotection
  • Oxidative Stress Defense: Extreme salinity generates reactive oxygen species; beta-carotene acts as a powerful antioxidant

Metabolic Integration

  • Dual-Purpose Biosynthesis: Beta-carotene serves as a precursor for retinal, the chromophore in bacteriorhodopsin (a light-driven proton pump for energy production)
  • Regulatory Balance: These organisms carefully regulate the conversion of lycopene to either beta-carotene (for retinal synthesis) or bacterioruberin (C50 carotenoids for membrane protection)
  • Resource Optimization: Sophisticated regulatory mechanisms ensure optimal allocation between protective pigments and functional cofactors

Biotechnological Significance

  • High Production Capacity: Halophilic archaea can accumulate carotenoids up to 335 μg/g dry mass
  • Unique Pigments: They produce bacterioruberin (C50 carotenoid) found almost exclusively in these organisms
  • Industrial Applications: These carotenoids have superior antioxidant properties compared to conventional beta-carotene and show promise for food, pharmaceutical, and cosmetic applications

Enzymatic Machinery

The identified EC numbers correspond to:

  • EC 5.5.1.19: Lycopene beta-cyclase (converts lycopene to beta-carotene)
  • EC 1.13.11.63: Beta-carotene 15,15'-dioxygenase (cleaves beta-carotene to retinal)

These enzymes represent key steps in the carotenoid biosynthetic pathway, confirming these organisms' genetic capacity for beta-carotene production and utilization.

This natural language query capability demonstrates how the MCP server enables intuitive exploration of complex genomic and chemical data, making sophisticated bioinformatics analyses accessible through conversational interfaces.

License

This project is licensed under the GPL-3.0 License. See the LICENSE file for details.

Contact

Semidán Robaina Estévez - [email protected]

Project Link: https://github.com/Robaina/chemsearch
