Improve documentation and file structure of the sample data.
gieses committed Jan 5, 2021
1 parent 2375e07 commit 87989ae
Showing 35 changed files with 53,931 additions and 17 deletions.
42 changes: 30 additions & 12 deletions README.md
Expand Up @@ -11,7 +11,7 @@
![pytest](https://github.com/Rappsilber-Laboratory/xiRT/workflows/pytest/badge.svg)

A python package for multi-dimensional retention time prediction for linear and crosslinked
peptides using a (siamese) deep neural network architecture.
peptides using a (Siamese) deep neural network architecture.
---

- [Overview](#overview)
Expand Down Expand Up @@ -54,9 +54,16 @@ to the [documentation](https://xirt.readthedocs.io/en/latest/) for more details
xiRT is a python package that comes with an executable python file. To run xiRT follow the steps
below.

#### Requirements
xiRT requires a running python installation on windows/mac/linux. All further requirements
are managed during the installation process via pip or conda. xiRT was tested using python >3.7 with
TensorFlow 1.4 and python >3.8 and TensorFlow >2.0. A GPU is not mandatory to run xiRT, however
it can greatly decrease runtime. Further system requirements depend on the data sets to be used.

#### Installation
To install xiRT simply run the command below. We recommend using an isolated python environment,
for example by using pipenv **or** conda.
for example by using pipenv **or** conda. Installation should finish within minutes.

Using pipenv:
>pipenv shell
>
Expand All @@ -78,7 +85,8 @@ Hint:
pydot and graphviz sometimes cause trouble when they are installed via pip. If on linux,
simply use *sudo apt-get install graphviz*; on windows, download the latest graphviz package from
[here](https://www2.graphviz.org/Packages/stable/windows/), unzip the content of the file and add the
*bin* directory path to the windows PATH variable.
*bin* directory path to the windows PATH variable. These two packages allow the visualization
of the neural network architecture.

#### Usage
The command line interface (CLI) requires three inputs:
Expand All @@ -104,25 +112,35 @@ fraction numbers are possible too).
|--------------------|----------------------|--------------------------------------------------------------------------------|-------------|
| peptide sequence 1 | Peptide1 | First peptide sequence for crosslinks | PEPRTIDER |
| peptide sequence 2 | Peptide2 | Second peptide sequence for crosslinks, or empty | ELRVIS |
| fasta description 1 | Fasta1 | FASTA header / description of protein 1 | SUCD_ECOLI Succinate--CoA ligase [ADP-forming] |
| fasta description 2 | Fasta2 | FASTA header / description of protein 2 | SUCC_ECOLI Succinate--CoA ligase [ADP-forming] |
| link site 1 | LinkPos1 | Crosslink position in the first peptide (0-based) | 3 |
| link site 2 | LinkPos2 | Crosslink position in the second peptide (0-based) | 2 |
| precursor charge | Charge | Precursor charge of the crosslinked peptide | 3 |
| score | Score | Single score from the search engine | 17.12 |
| unique id | CSMID | A unique index for each entry in the result table | 0 |
| decoy | isTT | Binary column which is True for any TT identification and False for TD, DD ids | TT |
| score | score | Single score from the search engine | 17.12 |
| unique id | PSMID | A unique index for each entry in the result table | 0 |
| TT | isTT | Binary column which is True for any TT identification and False for TD, DD ids | True |
| fdr | fdr | Estimated false discovery rate | 0.01 |
| fdr level | fdrGroup | String identifier for heteromeric and self links (split FDR) | heteromeric |

The first four columns should be self explanatory, if not check the [sample input](https://github.com/Rappsilber-Laboratory/xiRT/tree/master/sample_data).
The fifth column ("CSMID") is a unique(!) integer that can be used as to retrieve CSMs. In addition,
depending on the number retention time domains that should to be learned/predicted the RT columns
The fifth column ("PSMID") is a unique(!) integer that can be used to retrieve CSMs. In addition,
depending on the number of retention time domains that should be learned/predicted, the RT columns
need to be present. The column names need to match the configuration in the network parameter yaml.
Note that xiRT swaps the sequences such that peptide1 is longer than peptide 2. In order to
keep track of this process all columns that follow the convention <prefix>1 and <prefix>2 are swapped.
Make sure to only have such paired columns and not single columns ending with 1/2.
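
The swapping convention described above can be illustrated with a minimal sketch. This is not xiRT's own implementation; the helper name `swap_paired_columns` and the dictionary-based row are assumptions made only for illustration. It swaps every `<prefix>1`/`<prefix>2` pair when the second peptide is longer than the first.

```python
import re


def swap_paired_columns(row):
    """Return a copy of row in which Peptide1 is the longer peptide.

    Hypothetical helper (not part of xiRT): every <prefix>1/<prefix>2
    column pair is swapped together with the peptide sequences.
    """
    if len(row.get("Peptide1", "")) >= len(row.get("Peptide2", "")):
        return dict(row)  # already in the expected order
    swapped = dict(row)
    for name in row:
        match = re.fullmatch(r"(.+)1", name)
        if match and match.group(1) + "2" in row:
            partner = match.group(1) + "2"
            swapped[name], swapped[partner] = row[partner], row[name]
    return swapped


row = {"Peptide1": "ELRVIS", "Peptide2": "PEPRTIDER",
       "LinkPos1": 2, "LinkPos2": 3, "Charge": 3}
result = swap_paired_columns(row)
```

In this sketch an unpaired column such as `Charge` is left untouched, while `LinkPos1`/`LinkPos2` follow their peptides.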

#### xiRT config
This file determines the network architecture and training behaviour used in xiRT.
This file determines the network architecture and training behaviour used in xiRT. Please see
the [documentation](https://xirt.readthedocs.io/en/latest/parameters.html#xirt-parameters) for a
detailed example. For crosslinks the most important parameter sections to adapt are the *output* and
the *predictions* section. Here the parameters must be adapted for the used chromatography
dimensions and modelling choices. See also the provided
[examples](https://xirt.readthedocs.io/en/latest/usage.html#examples).

#### Setup config
This file determines the input data to be used and gives some training procedure options.
This file determines the input data to be used and gives some training procedure options. Please see
the [documentation](https://xirt.readthedocs.io/en/latest/parameters.html#learning-parameters) for
a detailed example.

### Contributors
- Sven Giese
Expand Down
Binary file added documentation/imgs/qc_plots/cv_epochs_loss.png
Binary file added documentation/imgs/qc_plots/qc_cv01_obs_pred.png
1 change: 1 addition & 0 deletions documentation/source/index.rst
Expand Up @@ -39,6 +39,7 @@ similar tasks such has collision-cross section prediction can be learned.
readme
installation
usage
results
parameters
modules
development
Expand Down
145 changes: 145 additions & 0 deletions documentation/source/results.rst
@@ -0,0 +1,145 @@
Results
=======

This section covers the results that are generated after a successful xiRT run. In the command
line call the output folder needs to be specified. Typically, the csv/xls files are the most
important outputs for most applications. The created folder will contain the following results:

1) log file
2) callbacks
3) quality control visualizations
4) tables (CSV/XLS)


More details on each item are given in the following paragraphs.

Log File
********
The log file contains useful information, including the xiRT version and parameters. Moreover,
various steps that are performed during the analysis with xiRT are documented, for example
the number of duplicated entries, the amino acid alphabet and the maximum sequence length. The logs
also contain short numeric summaries from the CV training of xiRT.

.. code-block:: console

    2021-01-04 17:21:31,708 - xirt - INFO - Init logging file.
    2021-01-04 17:21:31,708 - xirt - INFO - Starting Time: 17:21:31
    2021-01-04 17:21:31,708 - xirt - INFO - Starting xiRT.
    2021-01-04 17:21:31,708 - xirt - INFO - Using xiRT version: 1.0.63
    2021-01-04 17:21:31,781 - xirt.__main__ - INFO - xi params: sample_data/xirt_params_3RT.yaml
    2021-01-04 17:21:31,781 - xirt.__main__ - INFO - learning_params: sample_data/learning_params_training_cv.yaml
    2021-01-04 17:21:31,781 - xirt.__main__ - INFO - peptides: sample_data/DSS_xisearch_fdr_CSM50percent.csv
    2021-01-04 17:21:31,781 - xirt.predictor - INFO - Preprocessing peptides.
    2021-01-04 17:21:31,781 - xirt.predictor - INFO - Input peptides: 17886
    2021-01-04 17:21:31,781 - xirt.predictor - INFO - Reordering peptide sequences. (mode: crosslink)
    2021-01-04 17:21:43,726 - xirt.processing - INFO - Preparing peptide sequences for columns: Peptide1,Peptide2
    2021-01-04 17:21:44,296 - xirt.predictor - INFO - Duplicatad entries (by sequence only): 5426/17886
    2021-01-04 17:21:44,312 - xirt.predictor - INFO - Encode crosslinked residues.
    2021-01-04 17:21:46,910 - xirt.predictor - INFO - Applying length filter: 17886 peptides left
    2021-01-04 17:21:46,920 - xirt.processing - INFO - Setting max_length to: 59
    2021-01-04 17:21:47,012 - xirt.processing - INFO - alphabet: ['-OH' 'A' 'D' 'E' 'F' 'G' 'H' 'H-' 'I' 'K' 'L' 'M' 'N' 'O' 'P' 'Q' 'R'
    'S' 'T' 'V' 'W' 'Y' 'clA' 'clD' 'clE' 'clF' 'clG' 'clI' 'clK' 'clL' 'clM'
    'clN' 'clP' 'clQ' 'clR' 'clS' 'clT' 'clV' 'clY' 'clcmC' 'cloxM' 'cmC'
    ...
    ...
    2021-01-04 17:28:38,903 - xirt.qc - INFO - Metrics: r2: 0.30 f1: 0.16 acc: 0.25 racc: 0.61
    2021-01-04 17:28:39,207 - xirt.qc - INFO - QC: rp
    2021-01-04 17:28:39,215 - xirt.qc - INFO - Metrics: r2: 0.69
    2021-01-04 17:28:43,643 - xirt.__main__ - INFO - Writing output tables.
    2021-01-04 17:29:01,207 - xirt.__main__ - INFO - Completed xiRT run.
    2021-01-04 17:29:01,207 - xirt.__main__ - INFO - End Time: 17:29:01
    2021-01-04 17:29:01,208 - xirt.__main__ - INFO - xiRT CV-training took: 7.20 minutes
    2021-01-04 17:29:01,209 - xirt.__main__ - INFO - xiRT took: 7.49 minutes

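
The short numeric summaries in the log can also be collected programmatically. The following is a minimal sketch, assuming the "Metrics:" line layout shown in the log excerpt above; the function name `parse_metrics` is hypothetical and not part of xiRT.

```python
import re

# matches pairs such as "r2: 0.30" in the text after "Metrics:"
METRIC_PATTERN = re.compile(r"(\w+): (-?\d+(?:\.\d+)?)")


def parse_metrics(line):
    """Extract metric name/value pairs from a 'Metrics:' log line.

    Returns an empty dict if the line contains no 'Metrics:' section.
    """
    _, _, tail = line.partition("Metrics:")
    return {name: float(value) for name, value in METRIC_PATTERN.findall(tail)}


line = ("2021-01-04 17:28:38,903 - xirt.qc - INFO - "
        "Metrics: r2: 0.30 f1: 0.16 acc: 0.25 racc: 0.61")
metrics = parse_metrics(line)
```

Applying the function to every line of the log file yields one dictionary per reported QC step.
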
Callbacks
*********
Callbacks are used throughout xiRT to select the best performing model which is not necessarily
the last (epoch) model that was trained. To reuse the already trained models for transfer-learning
and predictions on other data sets the neural network model ("xirt_model_XX.h5") as well as the
parameters/weights ("xirt_weights_XX.h5") are stored. In addition training results per epoch
are stored ("xirt_epochlog_XX.log"). XX refers to the cross-validation fold, e.g. 01, 02 and 03 for
k=3. The epoch log contains losses and metrics for the training and validation data. For some
applications the used encoder (mapping of amino acids to integers) needs to be transferred.
Therefore, the callbacks also include a trained label encoder from sklearn as pickled object
("encoder.p"). The last file ("Xy_data.p") contains the formatted input data, again as a pickled
object. It can be used programmatically for debugging, exploration and manual retention time
prediction using an already existing model. The data can be parsed in python via:

.. code-block:: python

    import pickle

    # load the formatted input data written by xiRT
    with open("Xy_data.p", "rb") as handle:
        X, y = pickle.load(handle)

    alpha_peptides, beta_peptides = X[0], X[1]
    # assuming 3 RT dimensions
    RT1, RT2, RT3 = y

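
Before reusing a callback folder, it can help to check that the files described above are present. The sketch below builds the expected file names from the naming pattern given in this section. Both helper names are hypothetical, and it is an assumption that "encoder.p" and "Xy_data.p" are written once per run rather than once per fold.

```python
from pathlib import Path


def expected_callback_files(n_folds):
    """List the callback file names described above for n_folds CV folds."""
    files = []
    for fold in range(1, n_folds + 1):
        tag = f"{fold:02d}"  # folds are numbered 01, 02, 03, ...
        files.append(f"xirt_model_{tag}.h5")
        files.append(f"xirt_weights_{tag}.h5")
        files.append(f"xirt_epochlog_{tag}.log")
    # assumption: encoder and formatted data are stored once per run
    files.extend(["encoder.p", "Xy_data.p"])
    return files


def missing_callback_files(callback_dir, n_folds):
    """Return the expected files that are not present in callback_dir."""
    folder = Path(callback_dir)
    return [name for name in expected_callback_files(n_folds)
            if not (folder / name).exists()]


files = expected_callback_files(3)
```

Calling ``missing_callback_files`` with the path of the callback folder and the number of folds returns an empty list when everything is in place.
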
Visualizations
**************
xiRT will create a rich set of QC plots that should always be investigated. The plots are stored
in svg/png/pdf format.

Epoch Loss / Metrics
'''''''''''''''''''''
.. image:: ../imgs/qc_plots/cv_epochs_loss.png

The epoch loss/metrics plot shows the training behavior over the epochs and is a good diagnostic tool to
assess robustness across CV-folds, learning rate adjustment, overfit-detection and general learning
behavior across tasks. In the example above, we see quick convergence and robust learning behavior
after 10 epochs. In non-regression tasks loss and metrics are not necessarily the same.

CV Summary
'''''''''''
.. image:: ../imgs/qc_plots/cv_summary_strip_loss.png

The CV summary shows the point estimates of the loss/metric for the training, validation
and prediction folds for all training tasks. "Unvalidation" refers to the data that did not pass
the training FDR cutoff.

CV Observations
'''''''''''''''
.. image:: ../imgs/qc_plots/qc_cv01_obs_pred.png

This plot shows the prediction performance for each CV-fold on all tasks. It also reports some
key metrics that are not reported in the epoch log (r2, f1, accuracy, relaxed accuracy).
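
To make the reported metrics more concrete, the sketch below computes r2 for a regression task and a relaxed accuracy for a fraction task on toy values. The definition of relaxed accuracy used here (a prediction counts as correct if it is at most one fraction away from the observation) is an assumption for illustration; the function names are hypothetical and the exact computation inside xiRT may differ.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def relaxed_accuracy(observed, predicted, tolerance=1):
    """Share of predictions within `tolerance` fractions of the observation.

    tolerance=0 gives the ordinary accuracy; tolerance=1 is the assumed
    definition of the relaxed accuracy.
    """
    hits = sum(abs(o - p) <= tolerance for o, p in zip(observed, predicted))
    return hits / len(observed)


observed_fractions = [1, 2, 3, 4]
predicted_fractions = [1, 3, 3, 6]
acc = relaxed_accuracy(observed_fractions, predicted_fractions, tolerance=0)
racc = relaxed_accuracy(observed_fractions, predicted_fractions)
r2 = r_squared(observed_fractions, predicted_fractions)
```
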


Tables
******
The tables contain a lot of extra information (some of which is used for the plots above). Please
find an example of each file on GitHub.

Processed PSMs
''''''''''''''
This table ("processed_psms.csv") contains the input data together with the results of the
internal processing steps. The additional columns are:

- swapped (indicator if peptide order was swapped)
- Seq_Peptide1/Seq_Peptide2 (peptide sequences in modX format)
- Seqar_Peptide1/Seqar_Peptide2 (peptide sequences in array format)
- Duplicate (indicator if combination of sequences and charge is unique within the xiRT definition)
- scx0_based (0-based fraction number)
- scx_1hot (1-hot encoded fraction variable)
- scx_ordinal (ordinal encoded fraction variable)
- fdr_mask (indicator if PSM passed the FDR for training)
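
The three fraction encodings listed above can be illustrated with a small sketch. It assumes 1-based fraction numbers in the input and uses one common cumulative ("thermometer") scheme for the ordinal encoding; the exact layout that xiRT writes to the table may differ, and the function name `encode_fraction` is hypothetical.

```python
def encode_fraction(fraction, n_fractions):
    """Return (0-based index, 1-hot vector, ordinal vector) for a fraction.

    Assumptions: `fraction` is 1-based; the ordinal vector marks all
    positions below the 0-based index with 1 (cumulative scheme).
    """
    zero_based = fraction - 1
    one_hot = [1 if i == zero_based else 0 for i in range(n_fractions)]
    ordinal = [1 if i < zero_based else 0 for i in range(n_fractions)]
    return zero_based, one_hot, ordinal


zero_based, one_hot, ordinal = encode_fraction(3, 5)
```

For the third of five fractions this gives the index 2, a 1-hot vector with a single 1 at position 2, and an ordinal vector whose first two entries are 1.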



Epoch History
'''''''''''''
This table ("epoch_history.csv") contains data similar to the epoch logs in the callbacks, but the
CV results are concatenated and the learning rate decay is documented.

Error Features
''''''''''''''
This table ("error_features.csv") contains the input PSMID, the cross-validation split annotation
and the predicted retention times (including their basic error terms).

Error Features Interactions
'''''''''''''''''''''''''''
This table ("error_features_interactions.csv") contains the input PSMID
and some engineered error terms derived from the previous table.

Model Summary
'''''''''''''
This table ("model_summary.csv") contains important metrics that summarize the performance of the
learned models across CV-splits and their corresponding train/validation/prediction splits.