Omics-leveler

by Du, Chao (杜超), PhD.

Member of MBT group of Microbio Science led by Prof.dr. G.P. van Wezel
Institute of Biology, Leiden University, the Netherlands
durand[dot]dc[at]hot[no space]mail.com

For you to analyse quantitative proteomics and transcriptomics data with ease!

Oleveler is short for Omics-leveler. It requires only very basic Python knowledge. The analysis starts from MaxQuant results (proteomics) or featureCounts results (transcriptomics), applies the appropriate statistics, and generates customised plots including PCA, PLS, volcano, and bar plots. More importantly, the tool is designed so that you can query the dataset whenever a new idea comes up.

1. Introduction

Thanks to advances in both technologies, quantitative proteomics and transcriptomics are applied more and more often in biological research. Scientists are generating huge amounts of data that may include more samples and/or more complex experimental designs. This poses a great challenge for data analysis. To apply proper statistics, a proteomics specialist or a bioinformatician analysing transcriptomics data often has to spend a lot of effort understanding the scientific question behind the experimental design. This process involves extensive communication between specialists and biologists, which is very time consuming and may lead to disastrous misunderstandings. Oleveler was created to solve this "last kilometre" problem by giving biologists ease and flexibility in data analysis.

Processing raw LC-MS/MS files or raw read files is not within the scope of this tool. For a typical biologist, I strongly suggest that you leave that part to a specialist.

Statistics, e.g. data transformation and the calculation of log2 fold changes and corresponding p-values, are done by running R code inside Oleveler using DESeq2 [1] (both proteomics and transcriptomics data) and MSstats [2]. Please cite them if you perform the differential analysis inside Oleveler. Also, please cite apeglm [3] if you use shrunken data for plotting etc.

Oleveler is provided as a single file to minimise the chance of operational errors. As it is built for Jupyter Notebook or JupyterLab, bioinformaticians can also use Oleveler to set up a JupyterHub that delivers the power of data analysis to end users without installing dependencies on the end users' computers.

The current design principle of this program is that every function can be called independently, with all information passed in as parameters.

from oleveler import *

2. Install Dependencies

Creating a conda environment for Oleveler is recommended. Mamba is recommended for installing the dependencies because it is much faster and more reliable than conda itself. To install Mamba in your base conda environment:

conda install mamba -n base -c conda-forge

Then you can clone this repository to your local dir by:

git clone --depth 1 https://github.com/snail123815/oleveler.git

Then create an environment with the dependencies:

cd oleveler
mamba env create -n oleveler -f oleveler_deps.yml

No error message should appear.

Before running your analysis, do not forget to activate the environment you just created:

mamba activate oleveler

3. Prepare your data - Example folder my_analysis

The analysis needs to start with a fresh (empty) folder. Assume the folder is named my_analysis. Copy the oleveler.py file from this repository (or download it) into this folder.

3.1 Proteomics data

Oleveler starts with MaxQuant processed data. From the analysis folder combined/txt/, please copy the following two files:

  • evidence.txt
  • proteinGroups.txt

Create a folder named MaxQuant_output in the my_analysis folder and put the above files in it.
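
Before starting the notebook, it can help to confirm that both files are where they are expected. The following is a minimal sketch using only the Python standard library; the helper name `missing_maxquant_files` is hypothetical and not part of Oleveler.

```python
from pathlib import Path

# File names come from the list above; the helper itself is hypothetical.
REQUIRED_MAXQUANT_FILES = ("evidence.txt", "proteinGroups.txt")


def missing_maxquant_files(analysis_dir):
    """Return the required MaxQuant files that are absent from
    <analysis_dir>/MaxQuant_output, as a sorted list of file names."""
    mq_dir = Path(analysis_dir) / "MaxQuant_output"
    return sorted(
        name for name in REQUIRED_MAXQUANT_FILES
        if not (mq_dir / name).is_file()
    )
```

Calling `missing_maxquant_files("my_analysis")` returns an empty list when both files are present.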

3.1.1 Edit Annotation.csv

Copy Annotation_proteomics_example.csv from this project, put it directly in my_analysis folder and rename it as Annotation.csv.

Open Annotation.csv with Excel, edit it to fit your proteomics project.

There are four columns in this file: Raw.file, Condition, BioReplicate, Experiment

Raw.file - For MaxQuant data or RNA-Seq featureCounts data, fill in the names of all raw files, without the file extension. E.g. for '210619_DC_01.raw', fill in '210619_DC_01'. The contents of this column need to be unique.
For other data input, fill in the same value as in the Experiment column.
Condition - fill in the experimental condition for each raw file. Samples of the same condition (biological replicates) should share the same name. Please put all experimental information in this column. E.g. for strain 'WT' in 'MM' medium collected at 24 hours, enter something like 'WT_MM_24'.
BioReplicate - fill in the number of the bio-replicate within one condition. E.g. if '210619_DC_01', '210619_DC_09', and '210619_DC_20' are samples from the same condition, number them '1', '2', and '3' in this column. The order does not matter.
Experiment - fill in the name of the experiment that the raw file belongs to. Each name should contain both condition and bio-replicate information. The contents of this column need to be unique only if there is one LC-MS/MS run per sample.

This file will be used by MSstats [2], so it should comply with its rules. Although it is possible to use data with multiple runs per sample (e.g. samples that were fractionated before the LC-MS/MS runs), this has not been tested in Oleveler.
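
The column rules above can be checked before running the analysis. The sketch below uses only the standard library and a small in-memory example; `check_annotation` is a hypothetical helper, not an Oleveler function, and the rule that each Condition/BioReplicate pair appears once assumes one LC-MS/MS run per sample.

```python
import csv
import io

REQUIRED_COLUMNS = ["Raw.file", "Condition", "BioReplicate", "Experiment"]


def check_annotation(rows):
    """Return a list of problems found in Annotation.csv rows (empty if none)."""
    if not rows:
        return ["no rows found"]
    missing = [c for c in REQUIRED_COLUMNS if c not in rows[0]]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    raw_files = [r["Raw.file"] for r in rows]
    if len(set(raw_files)) != len(raw_files):
        problems.append("Raw.file values are not unique")
    # Assumption: one run per sample, so each replicate number
    # appears once within a condition.
    pairs = [(r["Condition"], r["BioReplicate"]) for r in rows]
    if len(set(pairs)) != len(pairs):
        problems.append("duplicate Condition/BioReplicate pair")
    return problems


# Illustrative content; file names follow the example in the text.
example = """Raw.file,Condition,BioReplicate,Experiment
210619_DC_01,WT_MM_24,1,WT_MM_24_1
210619_DC_09,WT_MM_24,2,WT_MM_24_2
210619_DC_20,WT_MM_24,3,WT_MM_24_3
"""
rows = list(csv.DictReader(io.StringIO(example)))
print(check_annotation(rows))
```

To check a real file, the same function can be applied to `list(csv.DictReader(open("Annotation.csv", newline="")))`.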

3.1.2 Edit comparisons.xlsx

This file provides the information the program needs to perform the differential analysis, i.e. to see which proteins change between conditions. For each comparison, the experimental condition and the corresponding control condition need to be specified.

Copy comparison_example.xlsx to the my_analysis folder and rename it to comparisons.xlsx. Open it with Excel and edit it to fit your project.

There are three columns in this file: id, exp, ctr

id - identifier for this particular comparison. Using this id, you can let the notebook show a specific analysis result (volcano plot etc.).
exp - the condition that will be defined as 'experiment condition'. Needs to be one of the conditions listed in Condition column of Annotation.csv file.
ctr - the condition that will be defined as 'control condition'.

The differential analysis reports the log2 fold change (LFC) of exp relative to ctr, i.e. log2(exp / ctr). When a zero is encountered in a comparison, the result is inf when a positive number is divided by zero, -inf when zero is divided by a positive number, and nan ('not a number') otherwise.
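
The zero-handling rules can be written out explicitly. This is an illustrative sketch for non-negative input values, not Oleveler's own code; the function name `log2_fold_change` is hypothetical.

```python
import math


def log2_fold_change(exp, ctr):
    """log2(exp / ctr) for non-negative values, with the zero cases spelled out."""
    if exp == 0 and ctr == 0:
        return math.nan       # 0 / 0: not a number
    if ctr == 0:
        return math.inf       # positive / 0
    if exp == 0:
        return -math.inf      # 0 / positive
    return math.log2(exp / ctr)


for exp, ctr in [(8, 2), (5, 0), (0, 5), (0, 0)]:
    print(exp, ctr, log2_fold_change(exp, ctr))
```

The four pairs cover a regular comparison and each of the three special cases (inf, -inf, nan) in that order.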

3.2 Transcriptomics data

Oleveler starts with feature count files (usually read counts for each gene). These files are usually generated by featureCounts. (Salmon support is on the way.)

Create a folder named quantResult in the my_analysis folder and put the read count files for all samples in it.

3.2.1 Edit Annotation.csv

Copy Annotation_example.csv from this project, put it directly in my_analysis folder and rename it as Annotation.csv.

Open Annotation.csv with Excel, edit it to fit your transcriptomics project.

There are four columns in this file: Raw.file, Condition, BioReplicate, Experiment
(If you do not see four columns, Excel did not recognise the file as CSV format; see the instructions from Microsoft.)

Raw.file - fill in the names of all read count files, without the file extension. E.g. for 'D24_1.txt', fill in 'D24_1'. The contents of this column need to be unique.
Condition - fill in the experimental condition for each read count file. Samples of the same condition (biological replicates) should share the same name. Please put all experimental information in this column. E.g. for strain 'WT' in 'MM' medium collected at 24 hours, enter something like 'WT_MM_24'.
BioReplicate - fill in the number of the bio-replicate within one condition. E.g. if 'D24_1', 'D24_2', and 'D24_3' are samples from the same condition, number them '1', '2', and '3' in this column. The order does not matter.
Experiment - fill in the name of the experiment that the read count file belongs to. Each name should contain both condition and bio-replicate information. The contents of this column need to be unique.
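
If the read count files already follow a regular naming scheme, the four columns can be drafted from the file names instead of typed by hand. The sketch below assumes, purely for illustration, that every file is named `<Condition>_<BioReplicate>.<ext>` (as in 'D24_1.txt'); `annotation_rows` is a hypothetical helper, and the output should still be reviewed before use.

```python
import csv
import io
from pathlib import Path

FIELDS = ["Raw.file", "Condition", "BioReplicate", "Experiment"]


def annotation_rows(count_files):
    """Build Annotation.csv rows from count file names.

    Assumes each file is named <Condition>_<BioReplicate>.<ext>.
    """
    rows = []
    for name in sorted(count_files):
        stem = Path(name).stem                      # 'D24_1.txt' -> 'D24_1'
        condition, _, replicate = stem.rpartition("_")
        rows.append({"Raw.file": stem, "Condition": condition,
                     "BioReplicate": replicate, "Experiment": stem})
    return rows


rows = annotation_rows(["D24_1.txt", "D24_2.txt", "D24_3.txt"])

# Write to an in-memory buffer here; use open("Annotation.csv", "w", newline="")
# to write the real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```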

3.2.2 Edit comparisons.xlsx

This file provides the information the program needs to perform the differential analysis, i.e. to see which genes change between conditions. For each comparison, the experimental condition and the corresponding control condition need to be specified.

Copy comparison_example.xlsx to the my_analysis folder and rename it to comparisons.xlsx. Open it with Excel and edit it to fit your project.

There are three columns in this file: id, exp, ctr

id - identifier for this particular comparison. Using this id, you can let the notebook show a specific analysis result (volcano plot etc.).
exp - the condition that will be defined as 'experiment condition'. Needs to be one of the conditions listed in Condition column of Annotation.csv file.
ctr - the condition that will be defined as 'control condition'.
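
A quick consistency check between the comparison table and the Condition column can catch typos early. This is a minimal sketch on in-memory data; `check_comparisons` is a hypothetical helper, the condition and id names are made up, and treating ids as unique and requiring ctr to be a known condition are assumptions based on the descriptions above.

```python
def check_comparisons(comparisons, conditions):
    """Return a list of problems in the comparison table (empty if none).

    comparisons: list of dicts with keys 'id', 'exp', 'ctr'
    conditions:  names from the Condition column of Annotation.csv
    """
    known = set(conditions)
    problems = []
    seen_ids = set()
    for row in comparisons:
        if row["id"] in seen_ids:
            problems.append(f"duplicate id: {row['id']}")
        seen_ids.add(row["id"])
        for role in ("exp", "ctr"):
            if row[role] not in known:
                problems.append(
                    f"{row['id']}: unknown {role} condition {row[role]!r}"
                )
    return problems


# Made-up conditions and comparisons for illustration.
conditions = ["WT_MM_24", "WT_MM_48", "MU_MM_24"]
comparisons = [
    {"id": "mutant_vs_wt_24", "exp": "MU_MM_24", "ctr": "WT_MM_24"},
    {"id": "wt_48_vs_24", "exp": "WT_MM_48", "ctr": "WT_MM_24"},
]
print(check_comparisons(comparisons, conditions))
```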

The differential analysis reports the log2 fold change (LFC) of exp relative to ctr, i.e. log2(exp / ctr). When a zero is encountered in a comparison, the result is inf when a positive number is divided by zero, -inf when zero is divided by a positive number, and nan ('not a number') otherwise.

4. Start analysing

4.1 Proteomics data

Assume your data is in the dir MaxQuant_output.

4.1.1 Proteomics data import

# Load data
lfqDf, id2group = loadMQLfqData('MaxQuant_output')
# Load meta
metaDf, conditions, experiments = loadMeta('Annotation.csv')
# Calculate mean and var
meanDf, nquantDf, varDf, stdDf, semDf = getStats(lfqDf, experiments)
# Transformation vst using DESeq2, change the `ref` value to the sample name you want to be control
vstDf = deseq2Process(lfqDf, metaDf, ref='QC')
# Process raw data using MSstats
msstatQuantDf, logDf = processMSstats('MaxQuant_output', 'Annotation.csv')
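# Fill missing values with a value just below the smallest observed one
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
logDfFilled = logDf.replace(np.nan, logDf.min().min()-np.log(2))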
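
The last line of the block above fills missing values with a single number slightly below the smallest observed value. The toy example below reproduces that step on made-up data so the effect is visible; protein and sample names are invented, and only pandas and NumPy are needed.

```python
import numpy as np
import pandas as pd

# Toy log-intensity table; protein and sample names are made up.
logDf = pd.DataFrame(
    {"WT_1": [20.0, np.nan, 18.5], "WT_2": [19.5, 17.0, np.nan]},
    index=["protA", "protB", "protC"],
)

# Smallest observed value in the whole table, lowered by ln(2).
# np.log is the natural logarithm, so the offset is about 0.693.
fillValue = logDf.min().min() - np.log(2)
logDfFilled = logDf.replace(np.nan, fillValue)
print(logDfFilled)
```

Every missing entry receives the same value, which sits below all measured values in the table.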

4.1.2 General statistics

4.2 Transcriptomics data

Assume the count tables are in the directory featureCounts and the annotation information is stored as a .gff file. The following example is from Streptomyces coelicolor M145 experiments. This strain does not have natural plasmids, so genes from plasmids (gene IDs containing 'SCP') are excluded from the analysis. Annotation in .gbk format is also supported.

4.2.1 Transcriptomics data import and preprocessing

# Load data
selfAlignCt = gatherCountTable("featureCounts/")
saDf = calculateTPM(selfAlignCt, 'featureCounts/GCF_000203835.1_ASM20383v1_genomic.gff', 
                    tagsForGeneName='locus_tag', removerRNA=True, removeIDcontains=['SCP'])

# Load meta
metaDf, conditions, experiments = loadMeta('Annotation.csv')

# Calculate mean and var
meanDf, nquantDf, varDf, stdDf, semDf = getStats(saDf, experiments)

# Transformation vst using DESeq2
# https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization
vstDf = deseq2Process(selfAlignCt, metaDf, ref='WT_45')
# Calculate mean and var
# These data should only be used in plotting, principal component analysis, or other statistical analyses.
vstMeanDf, vstNquantDf, vstVarDf, vstStdDf, vstSemDf = getStats(vstDf, experiments, title='vst')
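
For readers unfamiliar with TPM (transcripts per million), the standard definition is sketched below in plain Python. This shows the general formula only; how `calculateTPM` is implemented inside Oleveler is not shown here, and the gene names, counts, and lengths are made up. The sketch assumes at least one gene has a non-zero count.

```python
def tpm(counts, lengths):
    """Transcripts per million for one sample.

    counts:  dict gene -> read count
    lengths: dict gene -> gene length in bases
    """
    # Length-normalise first: reads per base for every gene.
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    # Scale so that the values of one sample sum to one million.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# Made-up genes and numbers for illustration.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 1500, "geneC": 3000}
values = tpm(counts, lengths)
print(values)
```

In this example geneB and geneC get the same TPM even though their raw counts differ, because geneC is twice as long.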

4.2.2 General statistics

  1. DEG analysis

# Differential expression analysis
# https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#differential-expression-analysis
deseq2CompResults, comparisons = makeCompMatrixDeseq2('comparisons.xlsx',
                                                      selfAlignCt,
                                                      'Annotation.csv',
                                                      shrink=None)
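
A common follow-up is to select genes that pass fold change and adjusted p-value thresholds. The sketch below works on a toy table that uses DESeq2's standard result column names (`log2FoldChange`, `padj`); how Oleveler stores its comparison results is not documented here, so the table layout, the helper name `significant`, and the default thresholds are all assumptions.

```python
import pandas as pd

# Toy result table with DESeq2's standard result column names;
# gene IDs and numbers are made up.
res = pd.DataFrame(
    {"log2FoldChange": [2.5, -0.3, -1.8, 1.2],
     "padj": [0.001, 0.60, 0.02, 0.20]},
    index=["gene1", "gene2", "gene3", "gene4"],
)


def significant(res, lfc=1.0, alpha=0.05):
    """Rows with |log2FoldChange| >= lfc and padj < alpha."""
    mask = (res["log2FoldChange"].abs() >= lfc) & (res["padj"] < alpha)
    return res[mask]


print(significant(res).index.tolist())
```

Here gene4 has a large enough fold change but is dropped because its adjusted p-value is above the threshold.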


5. Known Issues

[under construction]
- [x] Close temp files that are opened for Windows compatibility
- [x] DESeq2 dies when too many comparisons are run in the same R kernel. This often happens because each comparison consumes too much memory, which leads to a memory surge or other problems. Solved by removing temp files etc.
- [x] Add queryLfc()
- [ ] Add queryViolin()
  - For query gene list
- [ ] Pack related functions
- [x] Add references list

6. References

1. Love, M.I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15, 550 (2014). https://doi.org/10.1186/s13059-014-0550-8

2. Choi, M., Chang, C.-Y., Clough, T., Broudy, D., Killeen, T., MacLean, B. & Vitek, O. MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments. Bioinformatics 30(17), 2524-2526 (2014). https://doi.org/10.1093/bioinformatics/btu305

3. Zhu, A., Ibrahim, J.G. & Love, M.I. Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences. Bioinformatics 35(12), 2084-2092 (2019). https://doi.org/10.1093/bioinformatics/bty895