PyProphet

Overview

PyProphet [1] is a reimplementation of the mProphet [2] algorithm for targeted proteomics. It is particularly optimized for the analysis of large-scale data sets generated by OpenSWATH or DIANA.

This description represents the new SQLite-based workflow that is available since OpenMS 2.4.0 and PyProphet 2.0.1. This version includes the IPF [3] and large-scale data set optimizations [4]. You can alternatively follow the instructions for the PyProphet Legacy Workflow.

Contact and Support

We provide support for PyProphet on the GitHub repository.

You can contact the authors Uwe Schmitt, Johan Teleman, Hannes Röst and George Rosenberger.

Tutorial

Merging

Generate OSW output files according to the openswath_workflow section. PyProphet is then applied to one or several such SQLite-based reports. Several different commands are run consecutively to conduct the analysis:

pyprophet --help
pyprophet merge --help

The first command provides an overview of all available commands for manipulating OSW input files. Appending --help to an individual command (as shown here for merge) prints further instructions for that command.

pyprophet merge --template=library.pqp --out=merged.osw *.osw

In most scenarios, more than one DIA / SWATH-MS run is acquired and the samples should be compared qualitatively and/or quantitatively with the OpenSWATH workflow. After processing each run individually with OpenSWATH and the identical spectral library, the files can be merged by PyProphet.

This command will merge multiple files, using a reference PQP or OSW file containing the library as template. Please note that scoring a merged file applies the experiment-wide context on peptide query level, whereas scoring separate OSW files applies the run-specific context [4].

If semi-supervised learning is too slow, or if the run-specific context is required, the data can be subsampled with a smaller --subsample_ratio before learning the classifier. The learned model is stored in the subsampled output and can then be applied to the full file(s):

pyprophet subsample --in=merged.osw --out=subsampled.osw --subsample_ratio=0.1
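The classifier can then be learned on the subsampled file and transferred back to the full merged file. A minimal sketch, assuming the weights learned on subsampled.osw are applied with --apply_weights as in the Scaling up section below:

# Learn the classifier on the subsampled data (faster)
pyprophet score --in=subsampled.osw --level=ms2

# Apply the learned weights to the full merged file
pyprophet score --in=merged.osw --apply_weights=subsampled.osw --level=ms2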

Scoring

pyprophet score --in=merged.osw --level=ms2

The main command will conduct semi-supervised learning and error-rate estimation in a fully automated fashion. --help will show the full selection of parameters to adjust the process. The default parameters are recommended for SCIEX TripleTOF 5600/6600 instrument data, but can be adjusted in other scenarios.

When using the IPF extension, the parameter --level can be set to ms2, ms1 or transition. If MS1 and MS2 information should be considered together, --level can alternatively be set to ms1ms2. If MS1- and transition-level data should be scored in addition to MS2, the command is executed three times, e.g.:

pyprophet score --in=merged.osw --level=ms1 \
score --in=merged.osw --level=ms2 \
score --in=merged.osw --level=transition
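If the combined ms1ms2 level is preferred instead, a sketch of the equivalent invocation (using the same chained syntax) could be:

pyprophet score --in=merged.osw --level=ms1ms2 \
score --in=merged.osw --level=transition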

The scoring steps on MS1 and transition level have some dependencies on the MS2 peak group signals. The parameter --ipf_max_peakgroup_rank specifies how many peak group candidates should be assessed in IPF. For example, if this parameter is set to 1, only the top-scoring peak group will be investigated; if it is set to 3, the top 3 peak groups are investigated. In some scenarios, a single set of peptide query parameters might detect several peak groups of different peptidoforms that should be identified independently. Note that for higher values (or very generic applications), it might be a better option to disable the PyProphet assumption of a single best peak group per peptide query. This can be done by setting --group_id to feature_id, so that all high-scoring peak groups are treated as potential peptide signals. A sketch of both options is shown below.
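A minimal sketch of how these options could be passed to the scoring command; the values are purely illustrative and should be checked against pyprophet score --help:

# Assess the top 3 peak group candidates per peptide query during transition-level scoring for IPF
pyprophet score --in=merged.osw --level=transition --ipf_max_peakgroup_rank=3

# Treat every peak group (feature) as its own unit instead of assuming one best peak group per query
pyprophet score --in=merged.osw --level=ms2 --group_id=feature_id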

Importantly, PyProphet will store all results in the input OSW files. This can be changed by specifying --out. However, since all steps are non-destructive, this is not necessary.

IPF

If IPF should be applied after scoring, the following command can be used:

pyprophet ipf --in=merged.osw

To adjust the IPF-specific parameters, please consult pyprophet ipf --help. If MS1 or MS2 precursor data should not be used, e.g. due to poor instrument performance, this can be disabled by setting --no-ipf_ms1_scoring and --no-ipf_ms2_scoring. The experimental setting --ipf_grouped_fdr can be used for extremely heterogeneous spectral libraries, e.g. libraries containing mostly unmodified peptides that are readily detectable together with peptidoforms with various potential site-localizations that are mostly not detectable. This parameter estimates the FDR independently for groups defined by the number of site-localizations.
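As an example (a minimal sketch using only the flags mentioned above), IPF could be run without precursor-level information and with grouped FDR estimation:

# Run IPF without MS1 and MS2 precursor-level scoring and with grouped FDR estimation
pyprophet ipf --in=merged.osw --no-ipf_ms1_scoring --no-ipf_ms2_scoring --ipf_grouped_fdr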

Several thresholds (--ipf_max_precursor_pep, --ipf_max_peakgroup_pep, --ipf_max_precursor_peakgroup_pep, --ipf_max_transition_pep) are defined for IPF to exclude very poor signals. When disabled, the error model still works, but sensitivity is reduced. Tweaking of these parameters should only be conducted with a reference data set.
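If tweaking is nevertheless required, the thresholds are passed directly to the ipf command; the values below are purely illustrative assumptions and not recommendations:

# Illustrative threshold values only; validate any changes against a reference data set
pyprophet ipf --in=merged.osw \
--ipf_max_precursor_pep=0.7 --ipf_max_peakgroup_pep=0.7 \
--ipf_max_precursor_peakgroup_pep=0.4 --ipf_max_transition_pep=0.6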

Contexts & FDR

To conduct peptide inference in run-specific, experiment-wide and global contexts, the following command can be applied:

pyprophet peptide --in=merged.osw --context=run-specific \
peptide --in=merged.osw --context=experiment-wide \
peptide --in=merged.osw --context=global

This will generate individual PDF reports and store the scores in a non-redundant fashion in the OSW file.

Analogously, this can be conducted on protein-level as well:

pyprophet protein --in=merged.osw --context=run-specific \
protein --in=merged.osw --context=experiment-wide \
protein --in=merged.osw --context=global

Exporting

Finally, we can export the results to a legacy OpenSWATH TSV report:

pyprophet export --in=merged.osw --out=legacy.tsv

By default, both peptide- and transition-level quantification is reported, which is necessary for requantification or SWATH2stats. If peptide and protein inference in the global context was conducted, the results will be filtered to 1% FDR by default. Further details can be found by running pyprophet export --help.

Warning

By default, IPF results on peptidoform level will be used if available. This can be disabled by setting --ipf=disable. Note that the IPF results have different properties and require adjusted handling in TRIC. Please ensure that you want to analyze the results in the context of IPF; otherwise, use the --ipf=disable or --ipf=augmented settings.
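For instance, to export a report for a standard (non-IPF) downstream analysis with TRIC, the peptidoform-level results can be excluded; this sketch uses only the --ipf setting mentioned above:

# Export a legacy TSV report without IPF peptidoform-level results
pyprophet export --in=merged.osw --out=legacy.tsv --ipf=disable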

Scaling up

When moving to larger data sets that include tens to thousands of runs, the workflow described above might take a lot of time. For such applications, especially when the analysis is run on HPC infrastructure (cloud, cluster, etc.), we have implemented steps that can mostly be parallelized on the level of independent runs:

In the first step, we subsample the individual runs so that the classifier is much faster to learn:

# Here we recommend to set --subsample_ratio to 1/N, where N is the number of runs.
# Example for N=10 runs:
for run in run_*.osw
do
run_subsampled=${run}s # generates .osws files
pyprophet subsample --in=$run --out=$run_subsampled --subsample_ratio=0.1
done

pyprophet merge --template=library.pqp --out=model.osw *.osws

We then learn a classifier on MS1/MS2-level and store the results in model.osw:

pyprophet score --in=model.osw --level=ms1ms2

This classifier is then applied to all individual runs in parallel:

for run in run_*.osw
do
pyprophet score --in=$run --apply_weights=model.osw --level=ms1ms2
done

We then extract the data relevant for global scoring into small reduced files:

for run in run_*.osw
do
run_reduced=${run}r # generates .oswr files
pyprophet reduce --in=$run --out=$run_reduced
done

Next, global peptide and protein-level error rate control is conducted by merging the oswr files:

pyprophet merge --template=model.osw --out=model_global.osw *.oswr

pyprophet peptide --context=global --in=model_global.osw

pyprophet protein --context=global --in=model_global.osw

Now we backpropagate the global statistics to the individual runs:

for run in run_*.osw
do
pyprophet backpropagate --in=$run --apply_scores=model_global.osw
done

We can then export the results with confidence scores on peptide-query-level (run-specific context), peptide sequence level (global context) and protein level (global context) in parallel:

for run in run_*.osw
do
pyprophet export --in=$run
done

References

[1] Teleman J, Röst HL, Rosenberger G, Schmitt U, Malmström L, Malmström J, Levander F. DIANA–algorithmic improvements for analysis of data-independent acquisition MS data. Bioinformatics. 2015 Feb 15;31(4):555-62. doi: 10.1093/bioinformatics/btu686. Epub 2014 Oct 27. PMID: 25348213
[2] Reiter L, Rinner O, Picotti P, Hüttenhain R, Beck M, Brusniak MY, Hengartner MO, Aebersold R. mProphet: automated data processing and statistical validation for large-scale SRM experiments. Nat Methods. 2011 May;8(5):430-5. doi: 10.1038/nmeth.1584. Epub 2011 Mar 20. PMID: 21423193
[3] Rosenberger G, Liu Y, Röst HL, Ludwig C, Buil A, Bensimon A, Soste M, Spector TD, Dermitzakis ET, Collins BC, Malmström L, Aebersold R. Inference and quantification of peptidoforms in large sample cohorts by SWATH-MS. Nat Biotechnol. 2017 Aug;35(8):781-788. doi: 10.1038/nbt.3908. Epub 2017 Jun 12. PMID: 28604659
[4] Rosenberger G, Bludau I, Schmitt U, Heusel M, Hunter CL, Liu Y, MacCoss MJ, MacLean BX, Nesvizhskii AI, Pedrioli PGA, Reiter L, Röst HL, Tate S, Ting YS, Collins BC, Aebersold R. Statistical control of peptide and protein error rates in large-scale targeted data-independent acquisition analyses. Nat Methods. 2017 Sep;14(9):921-927. doi: 10.1038/nmeth.4398. Epub 2017 Aug 21. PMID: 28825704