alpi314/mass_spectra

Mass Spectra

Article

https://aile3.ijs.si/dunja/SiKDD2023/Papers/IS_2023_-_SIKDD_paper_11.pdf

Purpose

Create a pipeline that extracts various molecular fingerprints and embeddings from mass spectra data, then trains ML models to predict a fingerprint from an embedding.
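
To make the prediction task concrete, here is a minimal sketch of its input and output shapes: an embedding (a vector of floats) goes in and a fingerprint (a list of bits) comes out. The 1-nearest-neighbour predictor and the toy data are assumptions made purely for illustration; this README does not say which models the pipeline trains.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_fingerprint(embedding, train_embeddings, train_fingerprints):
    """Return the fingerprint paired with the closest training embedding.

    This is a stand-in predictor (1-nearest-neighbour), not the project's model.
    """
    nearest = min(
        range(len(train_embeddings)),
        key=lambda i: euclidean(embedding, train_embeddings[i]),
    )
    return train_fingerprints[nearest]


# Toy, made-up data: two 3-dimensional embeddings and their 4-bit fingerprints.
train_embeddings = [[0.0, 0.1, 0.9], [0.8, 0.7, 0.1]]
train_fingerprints = [[1, 0, 0, 1], [0, 1, 1, 0]]

predicted = predict_fingerprint([0.7, 0.6, 0.2], train_embeddings, train_fingerprints)
print(predicted)  # the query is closest to the second training embedding
```

A trained model would replace predict_fingerprint, but the contract stays the same: one embedding in, one bit vector out.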

Usage

  1. Install
  • Install Maven (required by the scyjava Python package) (install guide) and add it to PATH.
  • Install conda (install guide).
  • Run conda env create -f conda.yml --name mass_spectra to create a conda environment called mass_spectra. (NOTE: some packages may report non-fatal errors.)
  • Run conda activate mass_spectra to activate the environment.
  • Run pip install -r requirements.txt to install any Python packages that conda skipped.
  2. Setup
  • Extract the files from data into the source/dataset/ folder. (NOTE: source/dataset/ should contain the extracted files directly, without any subfolders.)
  • Run the Jupyter notebooks in the pipeline/ folder. (NOTE: run the notebooks in the order they are numbered.)
  • Running the notebooks in order executes the basic pipeline and should generate all embeddings, the Spec2Vec models, and the ML model, along with the evaluation files.
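
Two of the requirements above are easy to get wrong: Maven must be on PATH, and source/dataset/ must not contain subfolders. The following is a minimal sketch of a pre-flight check for both. The function names are hypothetical and not part of this repository, and the executable name mvn is assumed to be how Maven appears on PATH.

```python
import shutil
from pathlib import Path


def maven_on_path():
    """True if an executable named 'mvn' can be found on PATH."""
    return shutil.which("mvn") is not None


def nested_folders(dataset_dir):
    """Return the names of subfolders found directly inside dataset_dir."""
    return sorted(p.name for p in Path(dataset_dir).iterdir() if p.is_dir())


def check_setup(dataset_dir="source/dataset"):
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = []
    if not maven_on_path():
        problems.append("Maven ('mvn') was not found on PATH")
    root = Path(dataset_dir)
    if not root.is_dir():
        problems.append(f"{dataset_dir} does not exist")
    else:
        for name in nested_folders(root):
            problems.append(f"unexpected subfolder in {dataset_dir}: {name}")
    return problems


for problem in check_setup():
    print(problem)
```

Running this from the repository root before opening the first notebook reports a missing Maven install or a nested dataset folder up front, before a notebook can fail on it.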

Project Structure

  • helper/ - helper functions used across the project
  • mass_spectra - main command-line tools, also used in the Jupyter notebooks
  • pipeline/ - main pipeline, built as Jupyter notebooks, from embedding to evaluation of the trained models
  • playground/ - Jupyter notebooks used for testing and experimentation
  • source/ - the dataset and the generated files (models, spectra, embeddings, etc.)
  • conda.yml - conda environment setup
  • requirements.txt - fallback list of Python packages that conda does not install

General Information

  1. Articles
  2. Other important external links
  3. Terminology
  • InChIKey = International Chemical Identifier key, a fixed-length hashed form of the InChI
  • TBDMS = tert-butyldimethylsilyl, a silyl group attached to analytes by derivatization before GC–MS
  • TMS = trimethylsilyl, a silyl group attached to analytes by derivatization before GC–MS
  • RAW = raw data, BS = preprocessed data
  • GC–MS-EI = Gas Chromatography–Mass Spectrometry (GC–MS) with Electron Ionization (EI)
  • Molecular fingerprint = bit-string representation of a molecule, where each bit typically marks the presence or absence of a structural feature
  • SMILES = Simplified Molecular-Input Line-Entry System
  • metabolomics = the large-scale study of small molecules, commonly known as metabolites, within cells, biofluids, tissues or organisms
  • metabolite = a substance made or used when the body breaks down food, drugs or chemicals, or its own tissue
  • spec2vec = embedding approach that uses word2vec to create embeddings from spectral data
  • precursor (ion) = the ion that is the source of a fragmentation, either spontaneous or induced by collisions; also known as the "mother ion"
  • m/z = mass-to-charge ratio, where m is the mass and z is the charge number of the ion (z is often 1, so m/z is often the same as the mass)
  • Tanimoto similarity = measure of similarity between two sets, defined as the size of their intersection divided by the size of their union; commonly used to compare molecular fingerprints
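
As a worked example of the last entry, the sketch below computes the Tanimoto similarity of two fingerprints, treating each fingerprint as the set of positions where its bit is 1. The helper names and the 8-bit example strings are made up for illustration, and returning 1.0 for two empty fingerprints is a convention chosen here.

```python
def on_bits(bit_string):
    """Positions of the '1' characters in a fingerprint written as a bit string."""
    return {i for i, ch in enumerate(bit_string) if ch == "1"}


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity |A & B| / |A | B| of two collections of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    return len(a & b) / len(union)


fp_a = on_bits("11010010")  # on bits: 0, 1, 3, 6
fp_b = on_bits("10010110")  # on bits: 0, 3, 5, 6
print(tanimoto(fp_a, fp_b))  # 3 shared bits out of 5 distinct bits -> 0.6
```

The value ranges from 0 (no shared on bits) to 1 (identical sets of on bits).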
  4. Chemistry Development Kit (CDK)
  • written in Java
  • Python wrapper: scyjava
  • the Java library and its Python wrapper both need Maven
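# Register the CDK bundle so scyjava can resolve it through Maven
from scyjava import config, jimport
config.endpoints.append('org.openscience.cdk:cdk-bundle:2.8')

# Import the Java class (the first jimport call starts the JVM)
CircularFingerprinter = jimport('org.openscience.cdk.fingerprint.CircularFingerprinter')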
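
Building on the import above, here is a hedged sketch of how a fingerprint could be computed from a SMILES string with CDK's CircularFingerprinter. The function names, the choice of ECFP4, and the folded length of 1024 bits are assumptions for illustration, not necessarily what the pipeline uses. The scyjava import is deferred into the function so that the pure-Python helper can be used without starting a JVM.

```python
def set_bits_to_string(set_bits, length):
    """Render on-bit positions as a '0'/'1' string of the given length."""
    on = set(set_bits)
    return "".join("1" if i in on else "0" for i in range(length))


def ecfp4_set_bits(smiles, length=1024):
    """Return the sorted on-bit positions of a folded ECFP4 fingerprint.

    Needs Java and Maven; the JVM is started by the first jimport call.
    ECFP4 and length=1024 are illustrative choices, not project settings.
    """
    from scyjava import config, jimport  # deferred: only needed when CDK is used

    endpoint = "org.openscience.cdk:cdk-bundle:2.8"
    if endpoint not in config.endpoints:
        config.endpoints.append(endpoint)

    CircularFingerprinter = jimport("org.openscience.cdk.fingerprint.CircularFingerprinter")
    SmilesParser = jimport("org.openscience.cdk.smiles.SmilesParser")
    SilentChemObjectBuilder = jimport("org.openscience.cdk.silent.SilentChemObjectBuilder")

    # Parse the SMILES string into a CDK molecule, then fingerprint it.
    molecule = SmilesParser(SilentChemObjectBuilder.getInstance()).parseSmiles(smiles)
    fingerprinter = CircularFingerprinter(CircularFingerprinter.CLASS_ECFP4, length)
    bit_fingerprint = fingerprinter.getBitFingerprint(molecule)
    return sorted(int(i) for i in bit_fingerprint.getSetbits())


# Pure-Python helper, usable without a JVM:
print(set_bits_to_string([0, 3], 8))  # 10010000
```

With the environment from the Install step active, set_bits_to_string(ecfp4_set_bits("CCO"), 1024) would produce the bit-string form of the fingerprint for ethanol.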
