After more careful testing of Crux Percolator in Windows, I see the following, some of which is different from what I asserted at the Crux developers meeting today.
Background: I am trying to use Percolator to process results from searches on crosslinked peptides. The representation of a crosslinked peptide will necessarily be more complicated than a string from the standard 20-letter amino acid alphabet. Furthermore, at this stage, I am not generating standard Crux tab-delimited output. Instead I am post-processing my search results to generate a feature file which I want to be equivalent to the new tab-delimited .pin format.
When I put a .txt suffix on my feature file, Crux Percolator complains at the console with many iterations of
ERROR: No sequence found...
and terminates without doing anything useful. When I put a .pin suffix on my feature file, Crux Percolator seems to run normally, as judged by what appears on the console, and produces somewhat useful output files. However, many instances of messages like
ERROR: The modification symbol '2' is not valid.
WARNING: There is an unidentifiable modification in sequence <mgkdnkehkesk*1-28*geaiavaiaqmstvdlascdhgvvasvkrcimerdlypr> at position 14.
subsequently appear on the console, I assume as a by-product of processing for the output files. The output files have several columns which are all zeros, with names like 'charge' and 'spectrum precursor m/z'; it appears Percolator attempted to calculate these without the requisite information being available. Also, the peptide sequences in the output files have all non-standard characters stripped out.
So, in summary, one can bypass sqt2pin/make-pin style pre-processing in Crux Percolator by naming the input with the suffix .pin. However, there is still some post-Percolator processing built into Crux which assumes the data came from an MS/MS experiment and tries to conjure certain output fields accordingly. Stand-alone Percolator does not have this last behavior.
Hi Jeff,
The post-processing is a necessary step to output the non-standard Percolator outputs (e.g. mzid, pepxml), since Percolator's internal objects must be converted to Crux objects before they can be written.
If you want the native Percolator output, you can use "--original-output T".
I think the idea in Percolator is supposed to be that you can either
provide a PIN file, in which case, as Jeff says, Percolator will assume
that the data came from an MS/MS experiment, or you can provide a
tab-delimited text file using the "--feature-in-file" option, in which case
Percolator will just do the machine learning part but makes no assumptions
about the meanings of the various input features.
Jeff, why don't you want to use the feature-in-file option and a
tab-delimited file as input?
Bill
On Thu, Jan 15, 2015 at 4:16 PM, Kaipo kaipot@users.sf.net wrote:
Related issues: #224
I reinvestigated this Issue using a Linux binary (not Windows) built from trunk on 3/11/15, with a different .pin file than previously (attached). The behaviors previously reported are still observed, with some variations.
1) Changing the extension from .pin to .txt causes Percolator to complain at
the console with many iterations of
ERROR: No sequence found...
and terminate without doing anything useful.
2) With this .pin file, the SVM training and PSM-level analysis seem to work properly, as judged by the .log and percolator.XXX.txt.psms files. However, the subsequent peptide-level analysis fails, with this message:
FATAL: PSMID should be (((target|decoy)_fileidx)|filestem)_scan_charge_rank, but was 121212_F2-ReACT-PA-BDP-XL-4hr-1.txt_12536
It appears Crux Percolator is looking at the Scan_Id field for encoded information on scan, charge, and rank, and not finding it. Previously, my .pin file had ScanIds constructed to hold this information, so the peptide-level analysis succeeded, although there were several fields created in the output which were filled with zeros.
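The ID shape named in that FATAL message can be approximated with a regular expression. The following is a minimal Python sketch, not Crux's actual parser: it assumes the last three underscore-separated fields are integers for scan, charge, and rank, and the names `PSMID_PATTERN` and `parse_psm_id` are hypothetical. It may be useful for checking IDs in a hand-built .pin file before running Percolator.

```python
import re

# Assumption: an acceptable ID ends in _<scan>_<charge>_<rank>, all integers.
# Whatever precedes those three fields (e.g. "target_0" or a file stem) is
# treated as an opaque prefix. This only approximates the pattern quoted in
# the error message; it is not taken from the Crux source.
PSMID_PATTERN = re.compile(
    r"^(?P<prefix>.+)_(?P<scan>\d+)_(?P<charge>\d+)_(?P<rank>\d+)$"
)


def parse_psm_id(psm_id):
    """Return (prefix, scan, charge, rank), or None if the ID does not fit."""
    match = PSMID_PATTERN.match(psm_id)
    if match is None:
        return None
    return (
        match.group("prefix"),
        int(match.group("scan")),
        int(match.group("charge")),
        int(match.group("rank")),
    )


if __name__ == "__main__":
    # The ID from the error message has only one trailing numeric field.
    print(parse_psm_id("121212_F2-ReACT-PA-BDP-XL-4hr-1.txt_12536"))
    # A made-up ID of the expected shape, for illustration only.
    print(parse_psm_id("target_0_12536_2_1"))
```

Under this approximation, the ID from the error message is rejected because it carries a scan number but no charge or rank fields.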
For the record:
Jeff, could you give this patch a try?
Last edit: Kaipo 2015-04-01
I applied Kaipo's updated version of the patch (from 2015-05-13) to a fresh copy of the trunk checked out on 2015-05-19. The patched code was compiled and tested on a Linux machine.
I ran tests on a small pin file, named as either test.pin or test.txt. Peptide strings in the pin file contain non-standard characters that capture crosslink information, e.g. -.KVKRNSTPPLSLFGQLLWR3-7TPEEIRKTFNIK_40444.-.
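To make concrete which characters count as non-standard in such a string, and what is lost when they are stripped (as reported earlier in this thread for the Crux output files), here is a rough Python sketch. It is an illustration only, not Crux's code: it assumes single-character flanking residues in the X.PEPTIDE.Y layout, and the helper names `split_flanks` and `strip_nonstandard` are hypothetical.

```python
# The 20 standard amino acid one-letter codes.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def split_flanks(pin_peptide):
    """Split 'X.PEPTIDE.Y' into (X, PEPTIDE, Y).

    Assumes single-character flanks; returns empty flanks if the string
    does not have that layout.
    """
    if len(pin_peptide) >= 5 and pin_peptide[1] == "." and pin_peptide[-2] == ".":
        return pin_peptide[0], pin_peptide[2:-2], pin_peptide[-1]
    return "", pin_peptide, ""


def strip_nonstandard(peptide):
    """Keep only standard amino acid letters, dropping everything else."""
    return "".join(ch for ch in peptide if ch in STANDARD_AA)


if __name__ == "__main__":
    _, core, _ = split_flanks("-.KVKRNSTPPLSLFGQLLWR3-7TPEEIRKTFNIK_40444.-")
    print(core)                     # crosslink annotation still present
    print(strip_nonstandard(core))  # annotation removed
```

With this kind of stripping, the link positions and the trailing numeric tag disappear, so the two linked peptides collapse into one plain residue string.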
Running Percolator on these two files, with or without --feature-in-file T, gave these results.
test.pin (--feature-in-file F)
Percolator runs and produces more or less useful output. Before the patch, the console was filled with many repetitions of messages like:
ERROR: The modification symbol '2' is not valid.
WARNING: There is an unidentifiable modification in sequence <mgkdnkehkesk*1-28*geaiavaiaqmstvdlascdhgvvasvkrcimerdlypr> at position 14.
After the patch, these no longer appear.
Otherwise, the behavior is the same as before the patch. In particular, the output files have several columns which are all zeros, with names like 'charge' and 'spectrum precursor m/z'; it appears Percolator attempted to calculate these without the requisite information being available. Also, the peptide sequences in the output files have all non-standard characters stripped out.
test.txt (--feature-in-file F)
Behavior unchanged by patch. Console has many repetitions of message:
ERROR: No sequence found...
and Percolator terminates without doing anything useful.
test.pin (--feature-in-file T)
Results identical to test.pin (--feature-in-file F).
test.txt (--feature-in-file T)
Results identical to test.pin (--feature-in-file F).
Summary: --feature-in-file parameter is now recognized (i.e. not rejected as invalid), but setting it to T doesn't cause Percolator to treat its input as a generic, non-proteomic feature file. Changing the file suffix does not help.
Last edit: Jeff Howbert 2015-05-20
Hi Jeff, Percolator should be treating the input as a generic feature file with feature-in-file=T. Can you try to turn on original-output=T and see if that works?
Hi Kaipo,
When I set --original-output=T, it gets rid of all the nonsense columns in the output and suppresses the deletion of non-standard characters from my peptide strings. In other words, I get the same output as from stand-alone Percolator, just as you predicted.
Additionally setting --feature-in-file=T does not change the behavior on my test.pin file. However, it does allow my test.txt file to be recognized as valid Percolator input; with this flag on, it gets processed exactly like the test.pin file.
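For anyone scripting this combination, the two flags discussed here can be assembled into a command line as in the Python sketch below. This is a sketch under assumptions: it presumes a `crux` executable on the PATH and that the input file follows the options, and the helper name `percolator_argv` is hypothetical. Only the two flags mentioned in this thread are used.

```python
def percolator_argv(input_path, feature_in_file=True, original_output=True):
    """Build an argument list for Crux Percolator.

    Assumptions: the executable is called 'crux' and the input file comes
    after the options. Flags are written in the 'T'/'F' style used in this
    thread.
    """
    def as_flag(value):
        return "T" if value else "F"

    return [
        "crux", "percolator",
        "--feature-in-file", as_flag(feature_in_file),
        "--original-output", as_flag(original_output),
        input_path,
    ]


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch works
    # without Crux installed.
    print(" ".join(percolator_argv("test.txt")))
```

The resulting list can be handed to `subprocess.run` once the executable name and argument order have been confirmed against the installed Crux build.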
I think you can apply the percolator_fixes patch to trunk and close this issue, along with Issue #221.
Thanks,
Jeff
Thanks Jeff, it's committed.