Improve documentation and file structure of the sample data.

Rappsilber-Laboratory · Jan 5, 2021 · 87989ae · 87989ae
1 parent 2375e07
commit 87989ae
Show file tree

Hide file tree

Showing 35 changed files with 53,931 additions and 17 deletions.
diff --git a/README.md b/README.md
@@ -11,7 +11,7 @@
 ![pytest](https://github.com/Rappsilber-Laboratory/xiRT/workflows/pytest/badge.svg)
 
 A python package for multi-dimensional retention time prediction for linear and crosslinked 
-peptides using a (siamese) deep neural network architecture.
+peptides using a (Siamese) deep neural network architecture.
 ---
 
 - [Overview](#overview)
@@ -54,9 +54,16 @@ to the [documentation](https://xirt.readthedocs.io/en/latest/) for more details
 xiRT is a python package that comes with a executable python file. To run xiRT follow the steps 
 below.
 
+#### Requirements
+xiRT requires a running python installation on windows/mac/linux. All further requirements
+are managed during the installation process via pip or conda. xiRT was tested using python >3.7 with
+TensorFlow 1.4 and python >3.8 and TensorFlow >2.0. A GPU is not mandatory to run xiRT, however
+it can greatly decrease runtime. Further system requirements depend on the data sets to be used.
+
 #### Installation
 To install xiRT simply run the command below. We recommend to use an isolated python environment,
-for example by using pipenv **or** conda. 
+for example by using pipenv **or** conda. Installation should finish within minutes.
+
 Using pipenv:
 >pipenv shell
 >
@@ -78,7 +85,8 @@ Hint:
 pydot and graphviz sometimes make trouble when they are installed via pip. If on linux,
 simply use *sudo apt-get install graphviz*, on windows download latest graphviz package from 
 [here](https://www2.graphviz.org/Packages/stable/windows/), unzip the content of the file and the
-*bin* directory path to the windows PATH variable.
+*bin* directory path to the windows PATH variable. These two packages allow the vizualization
+of the neural network architecture.
 
 #### Usage
 The command line interface (CLI) requires three inputs:
@@ -104,25 +112,35 @@ fraction numbers are possible too).
 |--------------------|----------------------|--------------------------------------------------------------------------------|-------------|
 | peptide sequence 1 | Peptide1             | First peptide sequence for crosslinks                                        | PEPRTIDER   |
 | peptide sequence 2 | Peptide2             | Second peptide sequence for crosslinks, or empty                                 | ELRVIS      |
+| fasta descrition 1        | Fasta1             | FASTA header / description of protein 1                                    | SUCD_ECOLI Succinate--CoA ligase [ADP-forming]           |
+| fasta descrition 2        | Fasta2             | FASTA header / description of protein 2                                | SUCC_ECOLI Succinate--CoA ligase [ADP-forming]           |
 | link site 1        | LinkPos1             | Crosslink position in the first peptide (0-based)                                    | 3           |
 | link site 2        | LinkPos2             | Crosslink position in the second peptide (0-based                                | 2           |
-| precursor charge   | Charge               | Precursor charge of the crosslinked peptide                                    | 3           |
-| score              | Score                | Single score from the search engine                                            | 17.12       |
-| unique id          | CSMID                | A unique index for each entry in the result table                              | 0           |
-| decoy              | isTT                 | Binary column which is True for any TT identification and False for TD, DD ids | TT          |
+| score              | score                | Single score from the search engine                                            | 17.12       |
+| unique id          | PSMID                | A unique index for each entry in the result table                              | 0           |
+| TT              | isTT                 | Binary column which is True for any TT identification and False for TD, DD ids | True          |
 | fdr                | fdr                  | Estimated false discovery rate                                                 | 0.01        |
-| fdr level          | fdrGroup             | String identifier for heteromeric and self links (splitted FDR)                | heteromeric |
 
 The first four columns should be self explanatory, if not check the [sample input](https://github.com/Rappsilber-Laboratory/xiRT/tree/master/sample_data). 
-The fifth column ("CSMID") is a unique(!) integer that can be used as to retrieve CSMs. In addition, 
-depending on the number retention time domains that should to be learned/predicted the RT columns 
+The fifth column ("PSMID") is a unique(!) integer that can be used as to retrieve CSMs. In addition, 
+depending on the number retention time domains that should be learned/predicted the RT columns 
 need to be present. The column names need to match the configuration in the network parameter yaml.
+Note that xiRT swaps the sequences such that peptide1 is longer than peptide 2. In order to
+keep track of this process all columns that follow the convention <prefix>1 and <prefix>2 are swapped.
+Make sure to only have such paired columns and not single columns ending with 1/2.
 
 #### xiRT config
-This file determines the network architecture and training behaviour used in xiRT.
+This file determines the network architecture and training behaviour used in xiRT. Please see
+the [documentation](https://xirt.readthedocs.io/en/latest/parameters.html#xirt-parameters) for a 
+detailed example. For crosslinks the most important parameter sections to adapt are the *output* and
+the *predictions* section. Here the parameters must be adapted for the used chromatography
+dimensions and modelling choices. See also the provided 
+[examples](https://xirt.readthedocs.io/en/latest/usage.html#examples).
 
 #### Setup config
-This file determines the input data to be used and gives some training procedure options.
+This file determines the input data to be used and gives some training procedure options. Please see
+the [documentation](https://xirt.readthedocs.io/en/latest/parameters.html#learning-parameters) for 
+a detailed example.
 
 ### Contributors
 - Sven Giese

diff --git a/documentation/imgs/qc_plots/cv_epochs_loss.png b/documentation/imgs/qc_plots/cv_epochs_loss.png
diff --git a/documentation/imgs/qc_plots/cv_summary_strip_loss.png b/documentation/imgs/qc_plots/cv_summary_strip_loss.png
diff --git a/documentation/imgs/qc_plots/qc_cv01_obs_pred.png b/documentation/imgs/qc_plots/qc_cv01_obs_pred.png
diff --git a/documentation/source/index.rst b/documentation/source/index.rst
@@ -39,6 +39,7 @@ similar tasks such has collision-cross section prediction can be learned.
    readme
    installation
    usage
+   results
    parameters
    modules
    development

diff --git a/documentation/source/results.rst b/documentation/source/results.rst
@@ -0,0 +1,145 @@
+Results
+=======
+
+This section covers the results that are generated after a successful xiRT run. In the command
+line call the output folder needs to be specified. Typically, the csv/xls files are the most
+important outputs for most applications. The created folder will contain the following results:
+
+1) log file
+2) callbacks
+3) quality control visualizations
+4) tables (CSV/XLS)
+
+
+For which you find more details in the following paragraphs.
+
+Log File
+********
+The log file contains useful information, including the xiRT version and parameters. Moreover
+various steps that are performed during the analysis with xiRT are documented. For example,
+the number of duplicated entries, the amino acid alphabet, maximum sequence length etc. The logs
+also contain short numeric summaries from the CV training of xiRT.
+
+.. code-block:: console
+
+
+    2021-01-04 17:21:31,708 - xirt - INFO - Init logging file.
+    2021-01-04 17:21:31,708 - xirt - INFO - Starting Time: 17:21:31
+    2021-01-04 17:21:31,708 - xirt - INFO - Starting xiRT.
+    2021-01-04 17:21:31,708 - xirt - INFO - Using xiRT version: 1.0.63
+    2021-01-04 17:21:31,781 - xirt.__main__ - INFO - xi params: sample_data/xirt_params_3RT.yaml
+    2021-01-04 17:21:31,781 - xirt.__main__ - INFO - learning_params: sample_data/learning_params_training_cv.yaml
+    2021-01-04 17:21:31,781 - xirt.__main__ - INFO - peptides: sample_data/DSS_xisearch_fdr_CSM50percent.csv
+    2021-01-04 17:21:31,781 - xirt.predictor - INFO - Preprocessing peptides.
+    2021-01-04 17:21:31,781 - xirt.predictor - INFO - Input peptides: 17886
+    2021-01-04 17:21:31,781 - xirt.predictor - INFO - Reordering peptide sequences. (mode: crosslink)
+    2021-01-04 17:21:43,726 - xirt.processing - INFO - Preparing peptide sequences for columns: Peptide1,Peptide2
+    2021-01-04 17:21:44,296 - xirt.predictor - INFO - Duplicatad entries (by sequence only): 5426/17886
+    2021-01-04 17:21:44,312 - xirt.predictor - INFO - Encode crosslinked residues.
+    2021-01-04 17:21:46,910 - xirt.predictor - INFO - Applying length filter: 17886 peptides left
+    2021-01-04 17:21:46,920 - xirt.processing - INFO - Setting max_length to: 59
+    2021-01-04 17:21:47,012 - xirt.processing - INFO - alphabet: ['-OH' 'A' 'D' 'E' 'F' 'G' 'H' 'H-' 'I' 'K' 'L' 'M' 'N' 'O' 'P' 'Q' 'R'
+     'S' 'T' 'V' 'W' 'Y' 'clA' 'clD' 'clE' 'clF' 'clG' 'clI' 'clK' 'clL' 'clM'
+     'clN' 'clP' 'clQ' 'clR' 'clS' 'clT' 'clV' 'clY' 'clcmC' 'cloxM' 'cmC'
+     ...
+     ...
+    2021-01-04 17:28:38,903 - xirt.qc - INFO - Metrics: r2: 0.30 f1: 0.16 acc: 0.25 racc: 0.61
+    2021-01-04 17:28:39,207 - xirt.qc - INFO - QC: rp
+    2021-01-04 17:28:39,215 - xirt.qc - INFO - Metrics: r2: 0.69
+    2021-01-04 17:28:43,643 - xirt.__main__ - INFO - Writing output tables.
+    2021-01-04 17:29:01,207 - xirt.__main__ - INFO - Completed xiRT run.
+    2021-01-04 17:29:01,207 - xirt.__main__ - INFO - End Time: 17:29:01
+    2021-01-04 17:29:01,208 - xirt.__main__ - INFO - xiRT CV-training took: 7.20 minutes
+    2021-01-04 17:29:01,209 - xirt.__main__ - INFO - xiRT took: 7.49 minutes
+
+Callbacks
+*********
+Callbacks are used throughout xiRT to select the best performing model which is not necessarily
+the last (epoch) model that was trained. To reuse the already trained models for transfer-learning
+and predictions on other data sets the neural network model ("xirt_model_XX.h5") as well as the
+parameters/weights ("xirt_weights_XX.h5") are stored. In addition training results per epoch
+are stored ("xirt_epochlog_XX.log"). XX refers to the cross-validation fold, e.g. 01, 02 and 03 for
+k=3. The epoch log contains losses and metrics for the training and validation data. For some
+applications the used encoder (mapping of amino acids to integers) needs to be transferred.
+Therefore, the callbacks also include a trained label encoder from sklearn as pickled object
+("encoder.p"). The last file contains the formatted input data again as pickled data. It can
+be used programmatically for debugging, exploration and manual retention time prediction using
+an already existing model. The data can be parsed in python via:
+
+.. code-block:: python
+
+    import pickle
+    X, y = pickle.load(open("Xy_data.p", "rb"))
+    alpha_peptides, beta_peptides = X[0], X[1]
+    # assuming 3 RT dimensions
+    RT1, RT2, RT3 = y
+
+Visualizations
+**************
+xiRT will create a rich set of QC plots that should always be investigated. The plots are stored
+in svg/png/pdf format.
+
+Epoch  Loss / Metrics
+'''''''''''''''''''''
+.. image:: ../imgs/qc_plots/cv_epochs_loss.png
+
+The epoch loss/metrics plot shows the training behavior over the epochs and is a good diagnostic tool to
+assess robustness across CV-folds, learning rate adjustment, overfit-detection and general learning
+behavior across tasks. In the example above, we see quick convergence and robust learning behavior
+after 10 epochs. In non-regression tasks loss and metrics are not necessarily the same.
+
+CV Summary
+'''''''''''
+.. image:: ../imgs/qc_plots/cv_summary_strip_loss.png
+
+The CV summary shows the point estimates of the loss/metric for the training, validation
+and prediction folds for all training tasks. Unvalidation refers to the data not passing the
+training FDR cutoff.
+
+CV Observations
+'''''''''''''''
+.. image:: ../imgs/qc_plots/qc_cv01_obs_pred.png
+
+This plot shows the prediction performance for each CV-fold on all tasks. It also reports some
+key metrics that are not reported in the epoch log (r2, f1, accuracy, relaxed accuracy).
+
+
+Tables
+******
+The tables contain a lot of extra information (some of which is used for the plots above). Please
+find an example of each file on (GitHub)[].
+
+Processed PSMS
+''''''''''''''
+This table ("processed_psms.csv") contains the input data together with internally done
+processing steps. The additional columns are:
+- swapped (indicator if peptide order was swapped)
+- Seq_Peptide1/Seq_Peptide2 (peptide sequences in modX format)
+- Seqar_Peptide1/Seqar_Peptide2 (peptide sequences in array format
+- Duplicate (indicator if combination of sequences and charge is unique within the xiRT definition)
+- scx0_based (0-based fraction number)
+- scx_1hot (1-hot encoded fraction variable)
+- scx_ordinal (ordinal encoded fraction variable)
+- fdr_mask (indicator if PSM passed the FDR for training)
+
+
+
+Epoch History
+'''''''''''''
+This table ("epoch_history.csv") has similar data as the callbacks version but the CV results are
+concatenated and learning rate decay is documented.
+
+Error Features
+''''''''''''''
+This table ("error_features.csv") contains the input PSMID, crossvalidation split annotation
+and the predicted retention times (including their basic error terms).
+
+Error Features Interactions
+'''''''''''''''''''''''''''
+This table ("error_features_interactions.csv") contains the input PSMID,
+and the some engineered error terms from the previous table.
+
+Model Summary
+'''''''''''''
+This table ("model_summary.csv") contains important metrics that summarize the performance of the
+learned models across CV-splits and their corresponding train/validation/prediction splits.
-Original file line number
+Diff line change
@@ Expand Up @@
        readme
        installation
        usage
+       results
        parameters
        modules
        development
@@ Expand Down @@