To estimate the number of false positive protein identifications in a more systematic fashion, you can use a decoy database containing reversed protein sequences. Since this initial application, many other researchers have used decoy searches to estimate the number of incorrect peptide spectrum matches (PSMs) that exceed a given threshold. With this approach, you can adjust the score threshold to obtain a target false discovery rate 1), 2).
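The decoy-based estimate described above can be sketched as follows. This is a minimal illustration, not the application's implementation: it assumes separate target and decoy searches against databases of equal size, scores where higher is better, and the simple estimator "decoy hits divided by target hits". The function names are hypothetical.

```python
def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimate the FDR among target PSMs scoring at or above `threshold`."""
    n_target = sum(1 for s in target_scores if s >= threshold)
    n_decoy = sum(1 for s in decoy_scores if s >= threshold)
    if n_target == 0:
        return 0.0
    # Each decoy hit above the threshold stands in for one incorrect target hit.
    return min(1.0, n_decoy / n_target)


def threshold_for_fdr(target_scores, decoy_scores, target_fdr):
    """Return the most permissive observed target score whose estimated FDR
    does not exceed `target_fdr`, or None if no threshold qualifies."""
    for threshold in sorted(set(target_scores)):
        if estimate_fdr(target_scores, decoy_scores, threshold) <= target_fdr:
            return threshold
    return None
```

Lowering the threshold admits more target PSMs but also more decoy hits, so the estimated FDR generally rises as the threshold falls.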
Although database-searching algorithms work well, current scoring methods produce significant overlap between the scores of correct and incorrect peptide identifications. To retain a large fraction of the true positive identifications, you must select a score threshold at which a certain percentage of the accepted peptide identifications is incorrect.
Because most database search algorithms return multiple scores (for example, XCorr and Δ Cn for Sequest HT), most proteomics studies apply separate thresholds to each score. Using multiple orthogonal score criteria is useful for eliminating false discoveries that might exceed one threshold but not another. However, in most cases these orthogonal scores are considered independently, ignoring the benefits that can be obtained if the features are considered jointly 3).
The Percolator node uses semi-supervised machine learning to discriminate correct from incorrect peptide spectrum matches and calculates accurate statistics, such as the q-value and posterior error probability (PEP), to improve the number of confidently identified peptides at a given false discovery rate. It also assigns a statistically meaningful q-value to each PSM and the probability of the individual PSM being incorrect 4).
Methodology
The Percolator algorithm uses a semi-supervised method to train a machine learning algorithm called a support vector machine (SVM) to discriminate between positive and negative PSMs 3) 5) 6). As negative examples for the classifier, Percolator uses the PSMs derived from searching a decoy database that consists of reversed protein sequences. As positive examples, it uses a subset of the high-scoring PSMs derived from searching the target database. The resulting classifier is the feature vector with its determined weights.
For each target and decoy PSM, Percolator computes a set of features related to the quality of the match, for example, search engine scores, precursor mass deviation, and average fragment mass deviation. Percolator then chooses the most relevant feature and filters the target and decoy PSMs by that feature to a fixed false discovery rate (FDR), for example, 1 percent. It then applies the learned classifier to all target and decoy PSMs and again filters to a fixed FDR to continue the training procedure. After a few iterations, the system converges and results in a robust classifier that then rescores each PSM in the data set. The whole process is fully automated and does not require any expert-driven or subjective decisions, eliminating any artificial biases. The learned classifier is specifically adapted to each data set and is unique to it, allowing adaptation to variations in data quality, protocols, and instrumentation.
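The iterative procedure can be illustrated with a deliberately simplified sketch. It is not Percolator's code: a difference-of-class-means direction stands in for the SVM, later iterations filter on the classifier score (an assumption), features are not normalized, and cross-validation is omitted. All function names are hypothetical.

```python
def targets_at_fdr(scores, is_decoy, fdr):
    """Indices of target PSMs accepted at the given FDR (decoys / targets)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    accepted, best, n_decoy = [], [], 0
    for i in order:
        if is_decoy[i]:
            n_decoy += 1
        else:
            accepted.append(i)
        if accepted and n_decoy / len(accepted) <= fdr:
            best = list(accepted)  # largest prefix that still meets the FDR
    return best


def train_direction(features, positives, negatives):
    """Stand-in for SVM training: weights = mean(positives) - mean(negatives)."""
    dim = len(features[0])

    def mean(rows):
        return [sum(features[i][d] for i in rows) / len(rows) for d in range(dim)]

    return [p - n for p, n in zip(mean(positives), mean(negatives))]


def percolate(features, is_decoy, fdr=0.01, iterations=5):
    """Return (weights, scores) after a few rounds of semi-supervised training."""
    dim = len(features[0])
    decoys = [i for i, d in enumerate(is_decoy) if d]
    # Start from the single feature that accepts the most targets at the FDR.
    columns = [[row[d] for row in features] for d in range(dim)]
    scores = max(columns, key=lambda col: len(targets_at_fdr(col, is_decoy, fdr)))
    weights = None
    for _ in range(iterations):
        positives = targets_at_fdr(scores, is_decoy, fdr)
        if not positives or not decoys:
            break
        weights = train_direction(features, positives, decoys)
        # Rescore every target and decoy PSM with the learned weights.
        scores = [sum(w * x for w, x in zip(weights, row)) for row in features]
    return weights, scores
```

Each round uses all decoy PSMs as negative examples and only the target PSMs that pass the FDR filter as positive examples, which is what makes the training semi-supervised.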
NOTE
To work properly, Percolator needs a sufficient number of PSMs from the target and the decoy search. If the search identified fewer than 200 target or decoy PSMs, or if fewer than 20 percent decoy PSMs are available compared to the number of target matches, Percolator rejects them for processing and displays an appropriate message in the Proteome Discoverer job queue or in the Result Summaries of an open report. In these cases, open the result file when the search finishes and perform a target- and decoy-based FDR calculation on the Peptide Confidence page.
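The acceptance rule in this note can be expressed as a short check. This sketch only restates the thresholds given above; the function and constant names are hypothetical and are not part of the application.

```python
MIN_PSMS = 200  # minimum number of target PSMs and of decoy PSMs


def has_enough_psms(n_target, n_decoy):
    """Return True if the PSM counts meet the thresholds stated in the note."""
    if n_target < MIN_PSMS or n_decoy < MIN_PSMS:
        return False
    # Decoys must amount to at least 20 percent of the target matches;
    # integer arithmetic avoids floating-point rounding at the boundary.
    return 5 * n_decoy >= n_target
```

For example, 5000 target PSMs with 600 decoy PSMs fails the check because the decoys amount to only 12 percent of the target matches.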
Feature sets
Except for the individual scores of the search engines, the application uses the same feature set for all available search engines. The following table lists the features in this feature set.
Feature | Description |
---|---|
Search engine/scoring-specific features | |
Mascot Ions Score | Refer to the Mascot documentation from Matrix Science. This feature is only used for Mascot searches. |
Sequest XCorr | Scores the number of fragment ions that are common to two different peptides with the same precursor mass and calculates the cross-correlation score for all candidate peptides queried from the database (Sequest HT searches only). |
Delta Cn | Normalized score distance from the second-best-scoring PSM of the spectrum. |
Binomial Score | Binomial score as described by S. A. Beausoleil, J. Villen, S. A. Gerber, J. Rush, and S. P. Gygi 7). |
Peptide-precursor-related features | |
% Isolation Interference | Fraction of ion current in the isolation window not attributed to the identified precursor. |
MH+ [Da] | Singly charged mass of the peptide. |
Delta Mass [Da] | Deviation of the measured mass from the theoretical mass of the peptide, in daltons. |
Delta Mass [ppm] | Deviation of the measured mass from the theoretical mass of the peptide, in ppm. |
Absolute Delta Mass [Da] | Absolute deviation of the measured mass from the theoretical mass of the peptide, in daltons. |
Absolute Delta Mass [ppm] | Absolute deviation of the measured mass from the theoretical mass of the peptide, in ppm. |
Peptide Length | Length of the peptide in residues. |
Is z=1 | Binary flag indicating whether the charge state of the PSM is 1. |
Is z=2 | Binary flag indicating whether the charge state of the PSM is 2. |
Is z=3 | Binary flag indicating whether the charge state of the PSM is 3. |
Is z=4 | Binary flag indicating whether the charge state of the PSM is 4. |
Is z=5 | Binary flag indicating whether the charge state of the PSM is 5. |
Is z>5 | Binary flag indicating whether the charge state of the PSM is above 5. |
Digestion-related features | |
# Missed Cleavages | Number of missed cleavages. |
FASTA-related features | |
Log Peptides Matched | Logarithm of the number of candidates in the precursor mass window. |
Spectrum-related features | |
Log Total Intensity | Logarithm of the total ion current of the fragment spectrum. |
Fraction Matched Intensity [%] | Fraction of the total ion current of the fragment spectrum that is matched by fragments of the PSM. |
Fragment-series-related features | |
Fragment Coverage Series A, B, C [%] | Coverage of the N-terminal fragment ion series. The coverage is separately calculated for each series used, and the maximum coverage is used. |
Fragment Coverage Series X, Y, Z [%] | Coverage of the C-terminal fragment ion series. The coverage is separately calculated for each series used, and the maximum coverage is used. |
Log Matched Fragment Series Intensities A, B, C | Logarithm of the intensity sum of all matched N-terminal fragment ion series peaks. The intensity sum is separately calculated for each series used, and the maximum intensity sum is used. |
Log Matched Fragment Series Intensities X, Y, Z | Logarithm of the intensity sum of all matched C-terminal fragment ion series peaks. The intensity sum is separately calculated for each series used, and the maximum intensity sum is used. |
Longest Sequence Series A, B, C | Longest consecutive matched sequence among the N-terminal fragment ion series peaks. The sequence length is separately calculated for each series used, and the maximum length is used. |
Longest Sequence Series X, Y, Z | Longest consecutive matched sequence among the C-terminal fragment ion series peaks. The sequence length is separately calculated for each series used, and the maximum length is used. |
Fragment-related features | |
IQR Fragment Delta Mass [Da] | Inter-quartile range of the distribution of mass errors of all fragments considered, in daltons. |
IQR Fragment Delta Mass [ppm] | Inter-quartile range of the distribution of mass errors of all fragments considered, in ppm. |
Mean Fragment Delta Mass [Da] | Arithmetic mean of the distribution of mass errors of all fragments considered, in daltons. |
Mean Fragment Delta Mass [ppm] | Arithmetic mean of the distribution of mass errors of all fragments considered, in ppm. |
Mean Absolute Fragment Delta Mass [Da] | Arithmetic mean of the distribution of absolute mass errors of all fragments considered, in daltons. |
Mean Absolute Fragment Delta Mass [ppm] | Arithmetic mean of the distribution of absolute mass errors of all fragments considered, in ppm. |
The Processing Step page of the Analysis Settings page of the Result Summaries lists the weights that Percolator assigns to individual features. The job queue also displays these weights.
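To make some of the table entries concrete, the following sketch computes the precursor mass deviation features and the charge-state flags for one PSM. It illustrates the definitions only. The function name is hypothetical, and which measured and theoretical masses the application actually compares is an assumption here.

```python
def precursor_features(measured_mass, theoretical_mass, charge):
    """Compute mass deviation and charge-state features for one PSM."""
    delta_da = measured_mass - theoretical_mass
    # Parts per million relative to the theoretical mass.
    delta_ppm = delta_da / theoretical_mass * 1e6
    features = {
        "Delta Mass [Da]": delta_da,
        "Delta Mass [ppm]": delta_ppm,
        "Absolute Delta Mass [Da]": abs(delta_da),
        "Absolute Delta Mass [ppm]": abs(delta_ppm),
    }
    # One binary flag per charge state, so exactly one flag is set per PSM.
    for z in range(1, 6):
        features[f"Is z={z}"] = int(charge == z)
    features["Is z>5"] = int(charge > 5)
    return features
```

Encoding the charge state as separate binary flags lets a linear classifier assign an independent weight to each charge state instead of assuming that the score changes linearly with charge.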
q-value
A q-value is the minimal false discovery rate at which the identification is considered correct 4). q-values are estimated using the distribution of scores from the decoy database search. A q-value of 0.01 for the PSM that matches the peptide EAMRQPK to spectrum s means that if you try all possible FDR thresholds, 1 percent is the minimal FDR threshold at which the PSM of EAMRQPK to s appears in the output list.
Although the q-value is associated with a single PSM, it also depends on the data set that the PSM occurs in.
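The definition of the q-value as a minimal FDR can be sketched as follows. This is an illustration under stated assumptions, not the application's estimator: it uses the simple "decoys divided by targets" FDR estimate, assumes that higher scores are better, and does not treat tied scores specially. The function name is hypothetical.

```python
def q_values(target_scores, decoy_scores):
    """Return (score, q-value) pairs for the target PSMs, best score first."""
    labelled = [(s, False) for s in target_scores] + [(s, True) for s in decoy_scores]
    labelled.sort(key=lambda pair: pair[0], reverse=True)

    # Estimated FDR when the threshold is set at each target PSM's own score.
    fdr_at_target = []
    n_target = n_decoy = 0
    for score, is_decoy in labelled:
        if is_decoy:
            n_decoy += 1
        else:
            n_target += 1
            fdr_at_target.append((score, min(1.0, n_decoy / n_target)))

    # q-value: the smallest FDR over this threshold and all more permissive ones.
    result, running_min = [], 1.0
    for score, fdr in reversed(fdr_at_target):
        running_min = min(running_min, fdr)
        result.append((score, running_min))
    result.reverse()
    return result
```

With target scores 5.0, 4.0, 3.0, and 2.0 and decoy scores 3.5 and 1.0, the estimated FDR at the threshold 3.0 is 1/3, but the q-value of that PSM is 0.25, because lowering the threshold to 2.0 admits one more target and no additional decoy. This also shows why the q-value depends on the rest of the data set.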
Posterior Error Probability (PEP)
The posterior error probability (PEP) is the probability that the observed PSM is incorrect. For example, if the PEP associated with (EAMRPK, s) is 5 percent, there is a 95 percent chance that the EAMRPK peptide was in the mass spectrometer when spectrum s was generated.
The PEP can be regarded as a local version of the FDR. Whereas the FDR measures the error rate associated with a collection of PSMs, the PEP measures the probability of error for a single PSM. Equivalently, the PEP measures the error rate among PSMs with a given score, x. Percolator computes the FDR from the PEPs, because the expected number of incorrect PSMs in a given set is equal to the sum of the PEPs 4).
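The relationship between the PEPs and the FDR of a set of PSMs can be written in a few lines. This is a minimal sketch of the arithmetic only, with hypothetical function names and made-up PEP values in the usage example.

```python
def expected_incorrect(peps):
    """Expected number of incorrect PSMs in a set: the sum of their PEPs."""
    return sum(peps)


def fdr_from_peps(peps):
    """FDR of a set of PSMs: expected incorrect PSMs divided by the set size."""
    if not peps:
        return 0.0
    return expected_incorrect(peps) / len(peps)
```

For a set of four PSMs with PEPs of 0.01, 0.02, 0.05, and 0.12, the expected number of incorrect PSMs is 0.2, which corresponds to an FDR of 5 percent for the set.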
Which score is best?
Suppose that you have just finished running a mass spectrometry experiment, and you have used a database search program to match each spectrum to a peptide. You now need to choose a piece of software to assign statistical scores to each of these PSMs. Assume that you have two choices: one program that computes accurate q-values and one program that computes accurate PEPs. Which program should you choose?
The answer depends on what you plan to do with your results 4). PEPs and q-values are complementary and useful in different situations. The q-value estimates the rate of misclassification among a set of PSMs. If you want to determine which proteins are expressed in a certain cell type under a certain set of conditions, or if your follow-up analysis involves looking at groups of PSMs, the q-value is an appropriate measure. For example, considering all proteins in a known pathway, evaluating enrichment with respect to Gene Ontology database categories, or performing experimental validation on a group of proteins involves analyzing groups of PSMs, so you would use the q-value.
If the goal of your experiment is instead to determine the presence of a specific peptide or protein, use the PEP. For example, suppose that you want to determine whether a certain protein is expressed in a certain cell type under a certain set of conditions. Examine the PEPs of your detected PSMs. Likewise, suppose that you have identified a large set of PSMs using a q-value threshold, and among them you identify a single PSM that is intriguing. Before deciding to dedicate significant resources to investigating a single result, you should examine the PEP associated with that PSM. Although the q-value associated with that PSM might be 0.01, the PEP is always greater than or equal to 0.01. In practice, the PEP values for PSMs near the q = 0.01 threshold are likely to be much larger than 1 percent.
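The difference between the two measures near a threshold can be seen in a small numerical sketch. It assumes that the q-value of a PSM is the average PEP of that PSM and all better-scoring PSMs, which holds when the PSMs are sorted from lowest to highest PEP. The function name and the PEP values in the usage note are illustrative, not taken from real data.

```python
def q_values_from_peps(sorted_peps):
    """Given PEPs sorted ascending (best PSM first), return the running mean
    of the PEPs, which serves as the q-value at each position in the list."""
    q, total = [], 0.0
    for rank, pep in enumerate(sorted_peps, start=1):
        total += pep
        q.append(total / rank)
    return q
```

For PEPs of 0.001, 0.002, 0.005, and 0.032, the last PSM has a q-value of 0.01 but a PEP of 0.032. The many confident PSMs above it pull the set-wide error rate down, while its own probability of being incorrect is more than three times higher.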
Using Percolator in a workflow
For instructions on using Percolator to calculate false discovery rates, see Calculating FDRs.
Parameters
The following table describes the parameters for the Percolator node.
Parameter | Definition |
---|---|
Target/Decoy selection | Impacts FDR calculation. |
Validation Based On | Determines which algorithm to use to calculate the score that the validation is based on: |
Maximum Delta Cn | Specifies a Δ Cn threshold and filters out all PSMs with a Δ Cn larger than this value. Δ Cn is the normalized score difference between the currently selected PSM and the highest-scoring PSM for that spectrum. The default value of 0.05 for the Maximum Delta Cn parameter means that from every spectrum, the PSM with the best score is selected, plus the PSMs that have scores with a normalized difference of no more than 5 percent. The Maximum Delta Cn parameter is a kind of flexible rank filter. PSMs that are excluded from processing by Percolator are classified as low-confidence. Range: 0–0.1; default: 0.05 |
Maximum Rank | Specifies a value that determines whether to use a PSM for validation. The application uses for validation all PSMs having a search engine rank less than or equal to this value. Range: 0–no maximum; default: 0 (the application uses all available PSMs) |
Target FDR (Strict) | Specifies a target false discovery rate for peptide matches of high confidence. Range: 0.0–1.0; default: 0.01 |
Target FDR (Relaxed) | Specifies a target false discovery rate for peptide matches of medium confidence. Range: 0.0–1.0; default: 0.05 |
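
The effect of the Maximum Delta Cn filter and of the two target FDR values can be illustrated with a short sketch. It is not the node's implementation. It assumes that Δ Cn is the score difference to the best PSM of the spectrum divided by the best score, that the best score is positive, and that confidence levels are assigned by comparing each PSM's q-value with the strict and relaxed targets. The function names are hypothetical.

```python
def filter_by_delta_cn(psm_scores, max_delta_cn=0.05):
    """Keep the best-scoring PSM of a spectrum plus every PSM whose
    normalized score difference to it is at most `max_delta_cn`."""
    best = max(psm_scores)
    return [s for s in psm_scores if (best - s) / best <= max_delta_cn]


def confidence(q_value, strict_fdr=0.01, relaxed_fdr=0.05):
    """Map a q-value to a confidence level using the two target FDRs."""
    if q_value <= strict_fdr:
        return "high"
    if q_value <= relaxed_fdr:
        return "medium"
    return "low"
```

With the default of 0.05, a spectrum whose candidate PSMs score 4.0, 3.9, and 3.0 keeps the first two: the second differs from the best by 2.5 percent and the third by 25 percent.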