PyProphet Legacy Workflow

Overview

PyProphet [1] is a reimplementation of the mProphet [2] algorithm for targeted proteomics. It is particularly optimized for the analysis of large-scale data sets generated by OpenSWATH or DIANA.

This description represents the legacy workflow using TSV file formats.

Contact and Support

We provide support for PyProphet on the GitHub repository.

You can contact the authors Uwe Schmitt, Johan Teleman, Hannes Röst and George Rosenberger.

Installation

The PyProphet legacy workflow is distributed as two modules:

PyProphet

PyProphet is the main Python package.

Currently, PyProphet requires Python 2.7 and several dependencies. Windows users should install Anaconda; Mac and Linux users should be able to install the legacy branch of PyProphet directly from GitHub using pip:

pip install git+https://github.com/PyProphet/pyprophet.git@legacy

PyProphet-cli aka Jumbo-PyProphet

To deal with larger data sets and to provide error rate control on the level of peptide sequences and proteins for different contexts (run-specific, experiment-wide and global), an extension of PyProphet is in development [4]. It is optimized to analyse hundreds of runs simultaneously and builds on the IBM LSF or OpenLava workflow managers, but the steps can also be executed independently. It can be installed from GitHub using pip:

pip install git+https://github.com/PyProphet/pyprophet.git@legacy
pip install git+https://github.com/PyProphet/pyprophet-cli.git@legacy
pip install git+https://github.com/PyProphet/pyprophet-brutus-driver.git

PyProphet-cli can be adapted to other workflow managers by developing lightweight modules that replace pyprophet-brutus-driver.

Tutorial

PyProphet

An extended tutorial describing a complete OpenSWATH analysis workflow including PyProphet was recently published [3] and is also available from bioRxiv.

PyProphet-cli aka Jumbo-PyProphet

If the three modules have been properly configured, PyProphet jobs can be submitted using the following command:

pyprophet-cli run_on_brutus \
--data-folder="/tmp/openswath_results/" \
--data-filename-pattern="openswath_output_*.tsv" --sample-factor=0.1 --job-count=10 \
--extra-args-prepare --extra-group-column=ProteinName \
--extra-args-score --lambda=0.8

The example works as follows:

  • --data-folder: /tmp/openswath_results/ contains 10 files, openswath_output_0.tsv to openswath_output_9.tsv.
  • --data-filename-pattern: This wildcard pattern selects the input files within the data folder.
  • --sample-factor: A value between 0 and 1. We recommend using 1/(number of runs), here 1/10 = 0.1.
  • --job-count: Specifies the number of parallel jobs to submit.
  • --extra-args-prepare --extra-group-column=ProteinName: Also compute protein-level q-values.
  • --extra-args-score --lambda=0.8: Set lambda to 0.8 for q-value estimation.
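The recommended sample factor of 1/(number of runs) can be derived from the data folder itself. The following is a minimal sketch, not part of PyProphet: the helper names count_runs and sample_factor and the variables DATA_FOLDER and PATTERN are made up for illustration, and it assumes that the file pattern behaves like a shell wildcard.

```shell
#!/bin/sh
# Sketch: derive --sample-factor as 1/(number of runs).
# Assumption: the file pattern behaves like a shell wildcard.

# Count the files in folder $1 whose names match the wildcard pattern $2.
count_runs() {
    n=0
    for f in "$1"/$2; do
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Print 1/$1, e.g. 0.1 for 10 runs; fail if the count is not positive.
sample_factor() {
    awk -v n="$1" 'BEGIN { if (n > 0) printf "%g\n", 1 / n; else exit 1 }'
}

# Illustrative defaults taken from the example above; override via the environment.
DATA_FOLDER=${DATA_FOLDER:-/tmp/openswath_results}
PATTERN=${PATTERN:-openswath_output_*.tsv}

runs=$(count_runs "$DATA_FOLDER" "$PATTERN")
if [ "$runs" -gt 0 ]; then
    echo "--sample-factor=$(sample_factor "$runs")"
else
    echo "no files matching $PATTERN in $DATA_FOLDER" >&2
fi
```

The printed option can be pasted into the pyprophet-cli call. Counting the files first also gives a quick check that the pattern selects the runs you expect.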

Further parameters can be set; for details, please refer to:

pyprophet-cli --help

Alternatively, if pyprophet-brutus-driver is not available, or for integration with other workflow managers, all steps can be executed independently. The following example uses three runs:

  1. Prepare data
pyprophet-cli prepare --data-folder="/tmp/openswath_results/" --data-filename-pattern="*.tsv" \
--work-folder="/tmp/pyprophet_work/" --separator="tab" --extra-group-column="ProteinName"
  2. Subsample
pyprophet-cli subsample --data-folder="/tmp/openswath_results/" --data-filename-pattern="*.tsv" \
--work-folder="/tmp/pyprophet_work/" --separator="tab" --job-number 1 --job-count 3 --sample-factor=0.4 &
pyprophet-cli subsample --data-folder="/tmp/openswath_results/" --data-filename-pattern="*.tsv" \
--work-folder="/tmp/pyprophet_work/" --separator="tab" --job-number 2 --job-count 3 --sample-factor=0.4 &
pyprophet-cli subsample --data-folder="/tmp/openswath_results/" --data-filename-pattern="*.tsv" \
--work-folder="/tmp/pyprophet_work/" --separator="tab" --job-number 3 --job-count 3 --sample-factor=0.4 &
wait  # let all background jobs finish before the next step
  3. Semi-supervised learning
pyprophet-cli learn --work-folder="/tmp/pyprophet_work/" --separator="tab" --ignore-invalid-scores
  4. Scoring
pyprophet-cli apply_weights --data-folder="/tmp/openswath_results/" --data-filename-pattern="*.tsv" \
--work-folder="/tmp/pyprophet_work/" --separator="tab" --job-number 1 --job-count 3 &
pyprophet-cli apply_weights --data-folder="/tmp/openswath_results/" --data-filename-pattern="*.tsv" \
--work-folder="/tmp/pyprophet_work/" --separator="tab" --job-number 2 --job-count 3 &
pyprophet-cli apply_weights --data-folder="/tmp/openswath_results/" --data-filename-pattern="*.tsv" \
--work-folder="/tmp/pyprophet_work/" --separator="tab" --job-number 3 --job-count 3 &
wait  # let all background jobs finish before the next step
  5. Statistical validation
  • Run-specific context
pyprophet-cli score --data-folder="/tmp/openswath_results/" --data-filename-pattern="*.tsv" \
--work-folder="/tmp/pyprophet_work/" --result-folder="/tmp/pyprophet_result_run_specific" --separator="tab" \
--job-number 1 --job-count 3 --lambda=0.4 --statistics-mode=run-specific --overwrite-results &
pyprophet-cli score --data-folder="/tmp/openswath_results/" --data-filename-pattern="*.tsv" \
--work-folder="/tmp/pyprophet_work/" --result-folder="/tmp/pyprophet_result_run_specific" --separator="tab" \
--job-number 2 --job-count 3 --lambda=0.4 --statistics-mode=run-specific --overwrite-results &
pyprophet-cli score --data-folder="/tmp/openswath_results/" --data-filename-pattern="*.tsv" \
--work-folder="/tmp/pyprophet_work/" --result-folder="/tmp/pyprophet_result_run_specific" --separator="tab" \
--job-number 3 --job-count 3 --lambda=0.4 --statistics-mode=run-specific --overwrite-results &
  • Experiment-wide context
pyprophet-cli score --data-folder="/tmp/openswath_results/" --data-filename-pattern="*.tsv" \
--work-folder="/tmp/pyprophet_work/" --result-folder="/tmp/pyprophet_result_experiment_wide" --separator="tab" \
--job-number 1 --job-count 3 --lambda=0.4 --statistics-mode=experiment-wide &
pyprophet-cli score --data-folder="/tmp/openswath_results/" --data-filename-pattern="*.tsv" \
--work-folder="/tmp/pyprophet_work/" --result-folder="/tmp/pyprophet_result_experiment_wide" --separator="tab" \
--job-number 2 --job-count 3 --lambda=0.4 --statistics-mode=experiment-wide &
pyprophet-cli score --data-folder="/tmp/openswath_results/" --data-filename-pattern="*.tsv" \
--work-folder="/tmp/pyprophet_work/" --result-folder="/tmp/pyprophet_result_experiment_wide" --separator="tab" \
--job-number 3 --job-count 3 --lambda=0.4 --statistics-mode=experiment-wide &
  • Global context
pyprophet-cli score --data-folder="/tmp/openswath_results/" --data-filename-pattern="*.tsv" \
--work-folder="/tmp/pyprophet_work/" --result-folder="/tmp/pyprophet_result_global" --separator="tab" \
--job-number 1 --job-count 3 --lambda=0.4 --statistics-mode=global &
pyprophet-cli score --data-folder="/tmp/openswath_results/" --data-filename-pattern="*.tsv" \
--work-folder="/tmp/pyprophet_work/" --result-folder="/tmp/pyprophet_result_global" --separator="tab" \
--job-number 2 --job-count 3 --lambda=0.4 --statistics-mode=global --overwrite-results &
pyprophet-cli score --data-folder="/tmp/openswath_results/" --data-filename-pattern="*.tsv" \
--work-folder="/tmp/pyprophet_work/" --result-folder="/tmp/pyprophet_result_global" --separator="tab" \
--job-number 3 --job-count 3 --lambda=0.4 --statistics-mode=global --overwrite-results &
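The per-job commands above differ only in the job number, so they can be generated with loops. The following is a minimal sketch, not an official driver. It uses only the subcommands and flags shown above and assumes that the order of options does not matter. The helper functions run and run_parallel, the variable names and the DRY_RUN switch are made up for illustration. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch: run the independent steps with loops instead of repeated commands.
# The helper names and the DRY_RUN switch are illustrative and not part of
# pyprophet-cli.

DATA=/tmp/openswath_results/
WORK=/tmp/pyprophet_work/
JOBS=3
DRY_RUN=${DRY_RUN:-1}   # 1: only print the commands; 0: execute them

# Print or execute one pyprophet-cli invocation.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "pyprophet-cli $*"
    else
        pyprophet-cli "$@"
    fi
}

# Start one background job per job number for the given step,
# then block until all of them have finished.
run_parallel() {
    step=$1
    shift
    job=1
    while [ "$job" -le "$JOBS" ]; do
        run "$step" --data-folder="$DATA" --data-filename-pattern="*.tsv" \
            --work-folder="$WORK" --separator="tab" \
            --job-number "$job" --job-count "$JOBS" "$@" &
        job=$((job + 1))
    done
    wait
}

run prepare --data-folder="$DATA" --data-filename-pattern="*.tsv" \
    --work-folder="$WORK" --separator="tab" --extra-group-column="ProteinName"
run_parallel subsample --sample-factor=0.4
run learn --work-folder="$WORK" --separator="tab" --ignore-invalid-scores
run_parallel apply_weights

# One result folder per statistics mode, named as in the commands above.
for mode in run-specific experiment-wide global; do
    result="/tmp/pyprophet_result_$(echo "$mode" | tr '-' '_')"
    run_parallel score --result-folder="$result" --lambda=0.4 \
        --statistics-mode="$mode"
done
```

The wait call in run_parallel keeps a step from starting before the background jobs of the previous step have finished. The --overwrite-results flag used in some of the commands above is left out here; it can be appended to the score call if existing results should be replaced.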

References

[1] Teleman J, Röst HL, Rosenberger G, Schmitt U, Malmström L, Malmström J, Levander F. DIANA–algorithmic improvements for analysis of data-independent acquisition MS data. Bioinformatics. 2015 Feb 15;31(4):555-62. doi: 10.1093/bioinformatics/btu686. Epub 2014 Oct 27. PMID: 25348213
[2] Reiter L, Rinner O, Picotti P, Hüttenhain R, Beck M, Brusniak MY, Hengartner MO, Aebersold R. mProphet: automated data processing and statistical validation for large-scale SRM experiments. Nat Methods. 2011 May;8(5):430-5. doi: 10.1038/nmeth.1584. Epub 2011 Mar 20. PMID: 21423193
[3] Röst HL, Aebersold R, Schubert OT. Automated SWATH Data Analysis Using Targeted Extraction of Ion Chromatograms. Methods Mol Biol. 2017;1550:289-307. doi: 10.1007/978-1-4939-6747-6_20. PMID: 28188537
[4] Rosenberger G, Bludau I, Schmitt U, Heusel M, Hunter CL, Liu Y, MacCoss MJ, MacLean BX, Nesvizhskii AI, Pedrioli PGA, Reiter L, Röst HL, Tate S, Ting YS, Collins BC, Aebersold R. Statistical control of peptide and protein error rates in large-scale targeted data-independent acquisition analyses. Nat Methods. 2017 Sep;14(9):921-927. doi: 10.1038/nmeth.4398. Epub 2017 Aug 21. PMID: 28825704