Mass Spectra

Article

https://aile3.ijs.si/dunja/SiKDD2023/Papers/IS_2023_-_SIKDD_paper_11.pdf

Purpose

Create a pipeline that extracts various molecule fingerprints and their embeddings from mass spectra data. Then train ML models to predict the fingerprint from the embedding.
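To make the shape of the prediction task concrete, here is a minimal sketch that maps an embedding vector to a fingerprint bit vector. It uses a 1-nearest-neighbour lookup by cosine similarity as a stand-in baseline. That is an assumption for illustration only, not the models trained in this project, and the toy embeddings and fingerprints below are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_fingerprint(embedding, train_embeddings, train_fingerprints):
    """Return the fingerprint of the training embedding most similar to `embedding`."""
    best = max(
        range(len(train_embeddings)),
        key=lambda i: cosine(embedding, train_embeddings[i]),
    )
    return train_fingerprints[best]


def bit_accuracy(predicted, actual):
    """Fraction of fingerprint bits that were predicted correctly."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Toy data (hypothetical): 3-dimensional embeddings, 4-bit fingerprints.
train_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
train_fingerprints = [[1, 0, 1, 0], [0, 1, 1, 0], [0, 0, 0, 1]]

predicted = predict_fingerprint([0.9, 0.1, 0.0], train_embeddings, train_fingerprints)
print(predicted, bit_accuracy(predicted, [1, 0, 0, 0]))
```

Each fingerprint bit is a separate binary target, so a model for this task is evaluated per bit as well as per molecule.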

Usage

Install

Install Maven (required by the scyjava Python package; see the install guide) and add it to PATH
Install conda (install guide)
Run conda env create -f conda.yml --name mass_spectra to create a conda environment called mass_spectra (some packages may report non-fatal errors)
Run conda activate mass_spectra to activate the environment
Run pip install -r requirements.txt to install any Python packages that conda skipped
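Before creating the environment, it can help to confirm that the required tools are actually on PATH. The following is a small preflight sketch, not part of the repository. The function name find_missing_tools is hypothetical, and it assumes the executables are named mvn and conda.

```python
import shutil

# Executable names are assumptions: `mvn` for Maven, `conda` for conda.
REQUIRED_TOOLS = {
    "mvn": "Maven (needed by scyjava)",
    "conda": "conda",
}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of the tools in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


missing = find_missing_tools()
for name in missing:
    print(f"Not found on PATH: {name} - {REQUIRED_TOOLS[name]}")
if not missing:
    print("Maven and conda were both found on PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.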

Setup

Extract the files from data into the source/dataset/ folder. (NOTE: source/dataset/ should contain the extracted files directly, without any subfolders.)
Follow the Jupyter notebooks in the pipeline/ folder. (NOTE: run the notebooks in the order they are numbered.)
Running the notebooks in order executes the basic pipeline and should generate all embeddings, the Spec2Vec models, and the ML model, along with the evaluation files.
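The two notes above can be checked with a short script. This is a sketch under assumptions: the helper names are hypothetical, and it assumes the notebooks carry a leading number in their file names, which is how "numbered" is interpreted here.

```python
import re
from pathlib import Path


def subfolders(dataset_dir):
    """Return the names of any subfolders in `dataset_dir` (should be empty after extraction)."""
    return sorted(p.name for p in Path(dataset_dir).iterdir() if p.is_dir())


def notebooks_in_order(pipeline_dir):
    """Return the .ipynb files in `pipeline_dir`, sorted by their leading number.

    Notebooks without a leading number are placed last, in alphabetical order.
    """
    def leading_number(path):
        match = re.match(r"\d+", path.name)
        return int(match.group()) if match else float("inf")

    return sorted(
        Path(pipeline_dir).glob("*.ipynb"),
        key=lambda p: (leading_number(p), p.name),
    )
```

Calling subfolders("source/dataset") should return an empty list after a correct extraction, and notebooks_in_order("pipeline") gives the order in which to run the notebooks. Sorting by the parsed number rather than by name keeps a notebook numbered 10 after one numbered 2.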

Project Structure

helper/ - contains helper functions used by the project
mass_spectra/ - main command-line tools, also used in the Jupyter notebooks
pipeline/ - main pipeline built in Jupyter notebooks, from embedding to evaluation of the trained models
playgrounds/ - contains Jupyter notebooks used for testing and experimenting
source/ - contains the dataset and the generated files (models, spectra, embeddings, etc.)
conda.yml - contains the conda environment setup
requirements.txt - fallback list of Python packages that conda may fail to install

General Information

Articles

Other important external links

Terminology

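InChIKey = hashed, fixed-length form of the International Chemical Identifier (InChI)
TBDMS = tert-Butyldimethylsilyl (silyl group used for derivatization; introduced with tert-butyldimethylsilyl chloride)
TMS = Trimethylsilyl (silyl group used for derivatization; the same abbreviation is also used for tetramethylsilane)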
RAW = raw data, BS = preprocessed data
GC–MS-EI = Gas Chromatography–Mass Spectrometry (GC–MS) with Cold Electron Ionization (EI)
Molecular fingerprints = bit-string representations of a molecule, where each bit marks the presence or absence of a structural feature
SMILES = Simplified molecular-input line-entry system
metabolomics = the large-scale study of small molecules, commonly known as metabolites, within cells, biofluids, tissues or organisms
metabolites = substances made or used when the body breaks down food, drugs or chemicals, or its own tissue
spec2vec = embedding approach that utilizes word2vec to create embeddings from spectral data
precursor (ion) = ion that is the source of a fragmentation, either spontaneous or induced by collisions. Also known as the "mother ion".
m/z = mass-to-charge ratio, where m is the mass of an ion and z is its charge number. (z is often 1, so m/z is often numerically equal to the mass.)
Tanimoto similarity = measure of similarity between two sets: the size of their intersection divided by the size of their union. For two fingerprints, this is the number of bits set in both divided by the number of bits set in at least one of them.
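As a worked example of the last term, the sketch below computes the Tanimoto similarity of two fingerprints given as equal-length sequences of 0/1 bits. The function name is hypothetical, and returning 1.0 when neither fingerprint has any bit set is a convention assumed here, not something the project defines.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two equal-length bit sequences (values 0 or 1)."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)    # bits set in both
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)   # bits set in at least one
    # Assumed convention: two all-zero fingerprints count as identical.
    return both / either if either else 1.0


# Bits 0 and 3 are set in both; bits 0, 1 and 3 are set in at least one: 2 / 3.
print(tanimoto([1, 1, 0, 1], [1, 0, 0, 1]))
```

The value ranges from 0.0 (no shared bits) to 1.0 (identical fingerprints).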

Chemistry Development Kit (CDK)

written in Java
used from Python through scyjava
scyjava needs Maven to fetch the CDK Java libraries

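from scyjava import config, jimport

# Register the CDK bundle (Maven coordinates) before the JVM starts
config.endpoints.append('org.openscience.cdk:cdk-bundle:2.8')

# Import the Java class; the first jimport call starts the JVM
CircularFingerprinter = jimport('org.openscience.cdk.fingerprint.CircularFingerprinter')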
