The Protein FDR Validator node estimates the number of falsely identified proteins among all the identified proteins. It uses the decoy results as a proxy, assuming that the number of decoy matches approximates the number of false identifications among the target matches. It assigns protein confidence levels according to the number of decoy proteins above a native score threshold. It then calculates experimental q-values from the counts of target and decoy proteins above the current score threshold. Specifically, the application performs the following steps when you use this node:
Procedure
- First, the application sorts all identified target and decoy proteins by their search engine score (for example, the Sequest HT protein score) in decreasing order.
- The application goes through the list of target proteins from top to bottom and calculates the false discovery rate (FDR) that would result if it used the score of a particular target protein as a threshold. It obtains this FDR by dividing the number of decoy proteins that pass the threshold by the number of target proteins that pass the threshold, or: $\mathrm{FDR} = \dfrac{\text{number of decoy proteins passing the threshold}}{\text{number of target proteins passing the threshold}}$
- The application calculates this FDR for every target protein.
- However, the FDR values do not increase monotonically as the score threshold decreases. When you lower the threshold further, the FDR usually decreases first, because you add target proteins that have a score above the threshold before you add the next decoy protein. For example, at a score threshold of 143, 1000 targets pass. With 10 decoys, an FDR of 1.0% results. Next the application lowers the score threshold by going to the next target score, 142, then 141, then 137, and so forth. Each step adds targets that pass the threshold. Because there are far fewer decoys than targets, many targets typically pass before the threshold reaches the score of the next decoy.
- Assume the next decoy has a score of 90. From score 143 down, the FDR decreases at first as more targets, but not more decoys, pass. At score 91, 1050 targets but only 10 decoys yield an FDR of 0.95%. At score 90, 1100 targets and 11 decoys again yield an FDR of 1.0%.
- If you want to filter by 1% FDR, which score threshold is best to use, 143 or 90? Both result in an FDR of 1%, but 90 would yield 1100 targets, and 143 would yield only 1000. To circumvent this problem, the application uses the q-value, which is defined as the minimum FDR threshold at which a given target would be included in the results.
- The application traverses the list of target proteins sorted by score and calculates experimental q-values 1) 2). Then it assigns to each target protein the minimum FDR threshold required to allow it to be included in the results.
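The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the application's actual implementation: the function name `protein_q_values` is hypothetical, decoys are counted when their score is at or above the current target score (as in the example where the decoy at score 90 counts at threshold 90), and tied target scores are simply processed in list order.

```python
from typing import List, Tuple


def protein_q_values(
    target_scores: List[float], decoy_scores: List[float]
) -> List[Tuple[float, float, float]]:
    """Return (score, fdr, q_value) per target, sorted by decreasing score.

    Hypothetical helper for illustration only.
    """
    targets = sorted(target_scores, reverse=True)
    decoys = sorted(decoy_scores, reverse=True)

    # Pass 1: FDR at each target score = decoys passing / targets passing.
    rows = []
    n_decoys = 0
    for n_targets, score in enumerate(targets, start=1):
        # Assumption: a decoy passes when its score is >= the threshold.
        while n_decoys < len(decoys) and decoys[n_decoys] >= score:
            n_decoys += 1
        rows.append((score, n_decoys / n_targets))

    # Pass 2: q-value = minimum FDR at this threshold or any lower one.
    # Sweeping from the lowest score upward keeps a running minimum.
    result = []
    running_min = float("inf")
    for score, fdr in reversed(rows):
        running_min = min(running_min, fdr)
        result.append((score, fdr, running_min))
    result.reverse()
    return result
```

Because the q-value is a running minimum taken from the bottom of the list, it never decreases as the score decreases, which is what makes filtering at a fixed q-value unambiguous even though the raw FDR fluctuates.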
The Protein FDR Validator uses a protein score to rank the list of proteins from the target search and from the decoy search. It then uses these ranked lists to calculate q-values from the false discovery rates (FDR) at each score threshold. It calculates the FDR by counting the number of target and decoy proteins at a given score threshold and defines the q-value as the minimum FDR attained at or above a given score threshold 3).
When the search results contain posterior error probabilities (PEPs) for the identified peptides (for example, because Percolator was used for the PSM validation), the Protein FDR Validator node calculates a score based on these PEP values of the PSMs and uses it to rank the list of proteins.
The node first checks to see whether all PSMs have PEP values. If not, it calculates the protein q-values according to the protein scores calculated by the search engine. If PEP values are available, the node calculates a new protein score by multiplying the PEP values of the peptides connected to the protein. To make it numerically more stable, because PEP values are very small numbers, it actually sums the logarithms of the PEP values as follows:
For the calculation, the node first groups the PSMs of the protein by sequence, charge, and theoretical mass and then uses the best value (that is, the minimum PEP value within these groups) to calculate the sum PEP score:
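The grouping and log-sum described above might look like the following sketch. Several details are assumptions made for illustration and are not stated in the text: the `Psm` record and `sum_pep_score` function are hypothetical names, the logarithm is taken to base 10 with a negative sign so that better proteins get larger scores, masses are rounded to four decimals to form the group key, and PEP values are floored at `min_pep` to avoid taking the logarithm of zero.

```python
import math
from typing import Dict, Iterable, NamedTuple, Tuple


class Psm(NamedTuple):
    """Hypothetical PSM record holding only the fields used here."""
    sequence: str
    charge: int
    theoretical_mass: float
    pep: float


def sum_pep_score(psms: Iterable[Psm], min_pep: float = 1e-300) -> float:
    """Sum of negative log10 of the best PEP per PSM group (assumed form)."""
    best: Dict[Tuple[str, int, float], float] = {}
    for psm in psms:
        # Group by sequence, charge, and theoretical mass (rounded here
        # so that floating-point noise does not split a group).
        key = (psm.sequence, psm.charge, round(psm.theoretical_mass, 4))
        # Keep the best (minimum) PEP within each group.
        best[key] = min(best.get(key, float("inf")), psm.pep)

    # Summing logarithms is numerically safer than multiplying many
    # very small PEP values directly.
    return -sum(math.log10(max(pep, min_pep)) for pep in best.values())
```

For instance, a protein with two PSMs in one group (PEP 0.01 and 0.001) and one PSM in a second group (PEP 0.1) would score about 4 under these assumptions, because only the minimum PEP of each group contributes.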
When the application performs the validation on the new sum PEP scores, it adds these scores to a column called Sum PEP Score on the Proteins page. It also adds an Exp. q-value column, which displays the q-values derived from the validation, and a # Decoy Proteins column, which shows the number of decoy proteins above the given sum PEP score.
As a validation of the sum PEP score approach, consider the following example. First, the Protein FDR Validator node searches samples against two databases: a specific target proteome database, such as that of a cyanobacterium (Synechococcus elongatus), and a database containing a large number of non-overlapping protein sequences from a genetically distant organism, such as the Tasmanian devil (Sarcophilus harrisii; translated from the Ensembl genome, eight times larger than the Synechococcus database). The latter sequences serve as entrapment sequences that allow you to see the true and false identification rates in the final results.
The Protein FDR Validator node adds the following columns to the Proteins page for each search engine node that made a decoy search:
- Confidence column, which displays a green-, yellow-, or red-filled circle with these meanings:
- - Green: Displays high-confidence proteins, starting with either the first decoy protein or the protein FDR that reaches the high FDR threshold, whichever occurs later
- - Yellow: Displays medium-confidence proteins, starting with the protein following the first or highest protein threshold
- - Red: Displays low-confidence proteins, starting with the protein following the second reverse protein or the protein FDR that reaches the medium FDR threshold, whichever occurs later
- Decoy Protein Count column, which displays the number of the higher-ranked decoy or reverse proteins
- Exp. q-value column, which displays the q-values calculated from the number of target and decoy proteins above the current score threshold
The Protein FDR Validator node must be connected to the Protein Scorer node.
The following table describes the parameters for the Protein FDR Validator node.
Parameter | Definition |
---|---|
Target FDR (Strict) | Specifies a target false discovery rate for protein matches of high confidence. High-confidence proteins are those with a q-value at or below the specified threshold. Range: 0.0–1.0; default: 0.01 |
Target FDR (Relaxed) | Specifies a target false discovery rate for protein matches of medium confidence. Medium-confidence proteins are those with a q-value at or below the specified threshold that do not meet the strict threshold. Range: 0.0–1.0; default: 0.05 |
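
As a rough illustration of how the two parameters could map q-values to confidence levels, consider the sketch below. It is a simplification under stated assumptions: the function name `confidence_level` is hypothetical, the comparisons are taken to be inclusive, and the additional rule in the Confidence column description that involves the position of the first decoy protein is not modeled.

```python
def confidence_level(
    q_value: float, strict_fdr: float = 0.01, relaxed_fdr: float = 0.05
) -> str:
    """Map a protein q-value to High, Medium, or Low (assumed rule)."""
    if not 0.0 <= strict_fdr <= relaxed_fdr <= 1.0:
        raise ValueError("expected 0.0 <= strict_fdr <= relaxed_fdr <= 1.0")
    if q_value <= strict_fdr:
        return "High"    # green circle in the Confidence column
    if q_value <= relaxed_fdr:
        return "Medium"  # yellow circle
    return "Low"         # red circle
```

With the default values, a protein with a q-value of 0.005 would be classed as high confidence, one with 0.03 as medium confidence, and one with 0.2 as low confidence.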