You can use the MSF Files node to determine how the application parses the FASTA title lines of found proteins. When the application parses these title lines, it applies a set of predefined parsing rules to extract the accession and description of the protein. If it finds a protein in more than one results file from which it generates a report, and if the FASTA files of the two reports differ, the application displays the first available description and accession on the Proteins page. You can view all other accessions and descriptions when you move the cursor over the cells in the corresponding columns.
The MSF Files node has advanced parameters that you can set to change the FASTA title line parsing. Changing these settings affects how the application displays protein accession and description information in the report.
You can use the node’s Title Line Rule parameter to define an alternative parsing rule to use to parse the accession and description. If the specified rule does not match anything in the FASTA title line, the Proteome Discoverer application tries the standard parsing rules to match the protein accession and description.
In cases where multiple accessions and descriptions are available for a given protein, you can determine the order of the displayed accessions and descriptions by using three other node parameters. The application applies them in this order:
NOTE
The application applies the following rules only if multiple equally scored proteins are returned for the same set of PSMs, which rarely happens. Otherwise, it marks the highest-scoring protein reference as the master protein and displays it as such on the Protein Groups page. Even then, the protein marked as master is typically the longest of the group of equally scored protein references (the others are alternate master protein candidates).
Procedure
- Preferred Accession—Select a parsing rule to extract the preferred protein accession from the FASTA entry. If the application finds a preferred accession, it displays it instead of the primary accession.
- If you select a rule for the Preferred Accession parameter that matches one of the accessions, the application moves this accession and description to the first position.
- Preferred Taxonomy—Select a parsing rule to extract the preferred taxonomy from the FASTA entry. If the application finds a preferred taxonomy, it displays the accession and description of this entry, unless another entry contains a preferred accession. An entry with a preferred accession takes precedence over an entry that has the preferred taxonomy but no preferred accession.
- If you performed the search without a preferred taxonomy and the application identifies proteins with the same sequence from different species, you can select a rule with the Preferred Taxonomy parameter to display the accession and description from the right species.
- Here is an example showing the precedence of the Preferred Taxonomy parameter. Suppose that you have one protein with more than one accession and description. Both descriptions contain some common keywords, for example:
- Description 1: xxxxxxxxxx human abc
- Description 2: yyyyyyyyyy human abc
- You set the Preferred Taxonomy parameter to a rule that includes “human.” Then you set the Avoid Expressions parameter to a rule that includes “human.”
- In this case, the Preferred Taxonomy parameter has a higher precedence than the Avoid Expressions parameter.
- Avoid Expressions—Select the terms that the application should avoid when parsing the protein description. If more than one description is available, the application prefers the description containing none of the specified terms.
- Some of the publicly available protein databases, such as UniProt or NCBI, collect experimentally verified and curated proteins as well as unverified proteins, for example, the output of bioinformatic algorithms that predict potential proteins in a sequenced genome. In many cases, these unverified proteins contain words like “predicted” or “hypothetical” in the description. The Avoid Expressions parameter matches these words in the description. When a complex search uses two different databases that give two different descriptions for a protein, the application displays the description that does not contain a word that you selected with the Avoid Expressions parameter.
- The Proteome Discoverer application only applies the Avoid Expressions parameter if there are different FASTA title lines for the same protein that conform to the other title line rules described in this topic.
- The following example shows how the Avoid Expressions parameter works. Suppose that you have one protein with more than one accession and description. Both descriptions contain some common keywords, for example:
- Description 1: xxxxxxxxxx human abc
- Description 2: yyyyyyyyyy human abc
- You set the Avoid Expressions parameter to a rule that includes “abc.”
- In this case, both descriptions contain “abc,” so neither is preferred, and the application takes the first accession in the list.
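The precedence described in the list above can be sketched in Python as a stable sort. This is only an illustration under stated assumptions, not the application's actual implementation: the Entry type, the order_entries function, the accessions A1 and A2, and the use of plain regular expressions in place of the node's named rules are all hypothetical. The sketch also assumes that the preferred accession is matched against the accession and that the taxonomy and avoid expressions are matched against the description.

```python
import re
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    """One accession/description pair for a protein (hypothetical type)."""
    accession: str
    description: str


def order_entries(
    entries: List[Entry],
    preferred_accession: Optional[str] = None,
    preferred_taxonomy: Optional[str] = None,
    avoid: Optional[str] = None,
) -> List[Entry]:
    """Order entries by the precedence described in this topic.

    Assumed precedence: preferred accession first, then preferred
    taxonomy, then descriptions without avoided expressions.
    """
    def key(entry: Entry):
        has_acc = bool(preferred_accession
                       and re.search(preferred_accession, entry.accession))
        has_tax = bool(preferred_taxonomy
                       and re.search(preferred_taxonomy, entry.description))
        avoided = bool(avoid and re.search(avoid, entry.description))
        # False sorts before True, so better entries get smaller keys.
        return (not has_acc, not has_tax, avoided)

    # sorted() is stable: entries with equal keys keep their original order.
    return sorted(entries, key=key)


if __name__ == "__main__":
    entries = [
        Entry("A1", "xxxxxxxxxx human abc"),
        Entry("A2", "yyyyyyyyyy human abc"),
    ]
    # Both descriptions contain "abc", so the first accession stays first.
    print(order_entries(entries, avoid="abc")[0].accession)
```

Because the sort is stable, entries that tie on all three criteria keep their original order, which mirrors the statement that the application takes the first accession in the list.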
The application saves the parsing rules for the FASTA title lines in the FastaTitleParsingRules.xml file, which is stored in the C:\ProgramData\Thermo\Proteome Discoverer 3.1\MagellanDBs folder or equivalent. This file contains the parsing rules, along with name and meta information. A parsing rule is a list of regular expressions. When the application uses a parsing rule, it tries the regular expressions in list order, starting with the first, and uses the first regular expression that matches the title line to read out the accession and, if declared, the description.
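The first-match behavior described above can be sketched in Python as follows. This is a minimal illustration, not the application's code. The two expressions are the "sp|" entries of the SwissProt accession rule shown in this topic. The function names and the sample title line are assumptions. Python's re module writes named groups as (?P<name>...) rather than (?<name>...), so the sketch converts the syntax before matching.

```python
import re
from typing import List, Optional

# The two "sp|" expressions of the SwissProt accession rule in this topic.
SWISSPROT_RULE_PARTS = [
    r"sp\|(?<AC>[A-N,R-Z][\d][A-Z][A-Z,\d][A-Z,\d][\d])\|",
    r"sp\|(?<AC>[O,P,Q][\d][A-Z,\d][A-Z,\d][A-Z,\d][\d])\|",
]


def to_python_regex(pattern: str) -> str:
    """Rewrite (?<name>...) groups as (?P<name>...); lookbehinds are untouched."""
    return re.sub(r"\(\?<(?![=!])", "(?P<", pattern)


def extract_accession(title_line: str, rule_parts: List[str]) -> Optional[str]:
    """Try each expression in list order and return the first AC match."""
    for part in rule_parts:
        match = re.search(to_python_regex(part), title_line)
        if match:
            return match.group("AC")
    return None  # no expression in the rule matched


if __name__ == "__main__":
    # Sample title line in UniProt style (assumed input for illustration).
    title = ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens"
    print(extract_accession(title, SWISSPROT_RULE_PARTS))
```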
The file contains a section for each of the four parsing rule parameters of the MSF Files node. The basic *.xml document with the four sections without rules looks like this:
<?xml version="1.0" encoding="utf-16"?>
<FastaTitlelineRules xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<TitlelineRules>
... enter rules here ...
</TitlelineRules>
<AccessionRules>
... enter rules here ...
</AccessionRules>
<TaxonomyRules>
... enter rules here ...
</TaxonomyRules>
<AvoidExpressionRules>
... enter rules here ...
</AvoidExpressionRules>
</FastaTitlelineRules>
The parsing rules, with their lists of regular expressions, are defined as in the following example, which shows an accession rule that matches SwissProt accessions:
<ParsingRule name="swissprot" isVisible="true" changable="true">
<RuleParts>
<RulePart>\|SWISS-PROT:(?&lt;AC&gt;[A-N,R-Z][\d][A-Z][A-Z,\d][A-Z,\d][\d])</RulePart>
<RulePart>\|SWISS-PROT:(?&lt;AC&gt;[O,P,Q][\d][A-Z,\d][A-Z,\d][A-Z,\d][\d])</RulePart>
<RulePart>sp\|(?&lt;AC&gt;[A-N,R-Z][\d][A-Z][A-Z,\d][A-Z,\d][\d])\|</RulePart>
<RulePart>sp\|(?&lt;AC&gt;[O,P,Q][\d][A-Z,\d][A-Z,\d][A-Z,\d][\d])\|</RulePart>
</RuleParts>
</ParsingRule>
The first line of the rule defines the name that is displayed in the node parameter and specifies whether the rule is visible in the list for the parameter. For new rules, always set isVisible and changeable to true; otherwise, it is impossible to apply the rule.
The body of the rule is a list of statements containing one or more elements named RulePart. All regular expressions are “or” connected in the final parsing rule. For accession rules, a named capture group, (?<AC>), must match the accession. The Proteome Discoverer application evaluates it to extract the accession for display. You must use &lt; instead of < and &gt; instead of > because the < and > characters are not allowed in the XML entry. The title line rules contain two named capture groups, ?<AC1> and ?<Desc1>, for accession and description.
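The escaping requirement and the two title line capture groups can be illustrated with a short Python sketch. The regular expression below is a hypothetical title line expression written for this example and is not one of the application's predefined rules. The function names and the sample title line are also assumptions. The sketch uses the standard library's xml.sax.saxutils.escape to produce the entity form and xml.etree.ElementTree to read it back.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional, Tuple
from xml.sax.saxutils import escape

# Hypothetical title line expression with the AC1 and Desc1 groups.
TITLE_REGEX = r"(?<AC1>\S+)\s+(?<Desc1>.+)"


def to_rule_part(regex: str) -> str:
    """Wrap a regular expression in a RulePart element, escaping &, <, and >."""
    return f"<RulePart>{escape(regex)}</RulePart>"


def from_rule_part(xml_text: str) -> str:
    """Read the regular expression back; the XML parser undoes the escaping."""
    return ET.fromstring(xml_text).text


def parse_title(regex: str, title_line: str) -> Optional[Tuple[str, str]]:
    """Return (accession, description) if the expression matches."""
    # Python's re module needs (?P<name>...) instead of (?<name>...).
    python_regex = re.sub(r"\(\?<(?![=!])", "(?P<", regex)
    match = re.search(python_regex, title_line)
    if match is None:
        return None
    return match.group("AC1"), match.group("Desc1")


if __name__ == "__main__":
    rule_part = to_rule_part(TITLE_REGEX)
    print(rule_part)  # the form to paste into the XML file
    print(parse_title(from_rule_part(rule_part), "P69905 Hemoglobin subunit alpha"))
```

Generating the escaped text programmatically avoids hand-editing mistakes, such as leaving a single < unescaped, which would make the XML file invalid.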
To add or modify FASTA parsing rules, see Add or modify FASTA parsing rules.