Re: [Open-ms-general] Retention Time prediction: RTModel parameter choice?

Hello Mathias,

thanks for your interest in this work.

> I'm having trouble understanding how to use the OpenMS tools for 
> retention time prediction.
>
> Let's picture I have 2 LCMS repeat experiments, same organism, with both 
> MS and MS/MS info in the experiments.
> My aim is to compare the predicted RT's with the RT corresponding to the 
> MS/MS peptides.
>
> After submitting the experiment files (mzXML) to Mascot, I get the 
> corresponding IdXML files.
> I use the first file as input-file to generate the SVM model, and use 
> the second IdXML file to predict the peptide-RT's against.
> I use IDFilter to obtain an IdXML file that will be used as training set.
> I'm using the OLIGO kernel type (maybe it is better to use the POLY 
> kernel?).
>   
This is exactly how it should work. You should use OLIGO. If you expect 
a shift in retention time between the two experiments, you should apply 
MapAligner to your data before submitting it to Mascot.
> How do I know what parameters are suited to IDFilter and RTModel to get 
> a good training set? Is it right that the OpenMS TOPP tools for 
> retention time prediction give NET values? 
If you want to use NET values then you have to specify the total time of 
the gradient via the parameter total_gradient_time in RTModel and 
RTPredict. We extended the functionality here so that you no longer 
need normalized elution times (in the latest development version and in 
the upcoming release). This means that if you leave total_gradient_time 
at the value 1, the predicted retention times will be in the range of your 
measured retention times.
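
To make the role of total_gradient_time concrete, here is a minimal 
Python sketch. It assumes that normalization is a plain division of the 
measured RT by the gradient length; the function name to_net and the 
example values are made up for illustration and are not part of OpenMS.

```python
def to_net(rt, total_gradient_time=1.0):
    """Scale a retention time by the gradient length (assumed: simple division)."""
    if total_gradient_time <= 0:
        raise ValueError("total_gradient_time must be positive")
    return rt / total_gradient_time


# Hypothetical measured retention times in seconds on a 3600 s gradient.
measured = [600.0, 1800.0, 3000.0]

# With the real gradient length the values fall into [0, 1] (NET).
net = [to_net(rt, 3600.0) for rt in measured]

# With total_gradient_time left at 1 the values stay on the measured scale.
raw = [to_net(rt) for rt in measured]
```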

For getting a high confidence data set for training you can use the 
-pep_fraction parameter if Mascot is your ID engine. If this parameter 
is set to 1 the peptide identifications will be filtered according to 
the significance threshold score of Mascot (all identifications with a 
smaller score will be filtered out). If you want to allow peptide 
identifications with a score that is at least 80% of the significance 
threshold you can set the parameter to 0.8, and so on (have a look 
at Fig. 6 of Pfeifer et al. 2007 for an application of this). For high 
confidence training sets you should set pep_fraction to 1.
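
The pep_fraction rule can be sketched in a few lines of Python. This is 
only an illustration of the comparison described above, not the IDFilter 
code; the function name, the peptide sequences, the scores and the 
single threshold of 40 are all hypothetical.

```python
def passes_pep_fraction(score, significance_threshold, pep_fraction=1.0):
    """Keep a hit if its score reaches pep_fraction times the threshold."""
    return score >= pep_fraction * significance_threshold


# Hypothetical (sequence, Mascot score) pairs and a hypothetical threshold.
hits = [("PEPTIDEA", 45.0), ("PEPTIDEB", 33.0), ("PEPTIDEC", 20.0)]
threshold = 40.0

# pep_fraction = 1: only hits at or above the significance threshold survive.
strict = [seq for seq, s in hits if passes_pep_fraction(s, threshold, 1.0)]

# pep_fraction = 0.8: hits with at least 80% of the threshold survive as well.
relaxed = [seq for seq, s in hits if passes_pep_fraction(s, threshold, 0.8)]
```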

Another general possibility is to use the FalseDiscoveryRate tool. 
For this you have to generate two IdXML files for your training set. 
The first one is the file you already have and the second one should be 
constructed by searching with the same parameters against a decoy 
database. Then you can use FalseDiscoveryRate to estimate false 
discovery rates or q values:

FalseDiscoveryRate -fwd_in identifications_to_standard_database.IdXML \
    -rev_in identifications_to_decoy_database.IdXML \
    -out identifications_with_significance_measure.IdXML \
    -peptides_only -q_values

In the output file all scores are replaced by the q_values/FDRs. The 
original Mascot score is stored as MetaInfo. This means that you can 
then directly filter for q_values/FDRs by using IDFilter with the 
pep_score option. Since IdXML stores whether a higher or a lower score 
is better, IDFilter with e.g. -pep_score 0.01 will filter out all 
identifications with a q_value/FDR higher than 0.01 (1%). I would prefer 
q values because they are directly suitable as a 
filter threshold. For a high confidence training set, you should not set 
the pep_score higher than 0.10.
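
For intuition, here is a generic target-decoy q value estimate in 
Python. It is a textbook-style sketch and I do not claim it matches the 
FalseDiscoveryRate tool step by step: it assumes that a higher score is 
better and that the FDR at a score is the number of decoy hits divided 
by the number of target hits at or above that score. The function name 
and the scores are made up.

```python
def target_decoy_q_values(target_scores, decoy_scores):
    """Return (score, q_value) pairs for the target hits, best score first."""
    labelled = [(s, False) for s in target_scores]
    labelled += [(s, True) for s in decoy_scores]
    labelled.sort(key=lambda pair: pair[0], reverse=True)

    targets = decoys = 0
    fdr_at_target = []
    for score, is_decoy in labelled:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
            fdr_at_target.append((score, decoys / targets))

    # The q value is the smallest FDR at this score or any worse score,
    # so walk from the worst hit upwards and keep a running minimum.
    result = []
    best = float("inf")
    for score, fdr in reversed(fdr_at_target):
        best = min(best, fdr)
        result.append((score, best))
    result.reverse()
    return result


# Hypothetical scores from the standard and the decoy search.
q = target_decoy_q_values([50.0, 40.0, 30.0, 20.0], [35.0, 10.0])

# Same cut as -pep_score 0.01 on q values: drop everything above 1%.
kept = [score for score, q_value in q if q_value <= 0.01]
```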

How big was your training set? Depending on the quality of your RTs 
maybe you have too little training data.

> I also tried plotting MS/MS RT 
> against the predicted RT of the peptides of its own LCMS experiment. I 
> get a similar fuzzy cloud with even points far away from the diagonal.
>
> Can you explain me what I am doing wrong?
>   
I suppose some parameters in the RTModel.ini file do not fit your data 
well. If you send me your RTModel ini file I can have a look at 
it. The C parameter for the cross-validation (CV) should be in the range 
of the maximal RT (0.001, 0.01, 0.1, 1 if you use NET and 1, 10, 100, 
1000 if you use total_gradient_time=1). Sigma should be probed between 2 and 12, and nu 
should be between 0.3 and 0.7.
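
Written out as a grid, the ranges above look like this in Python. The C 
values are the ones listed above; the step sizes for sigma (1) and nu 
(0.1) are my assumption for the example, and cv_grid is a hypothetical 
helper, not an RTModel option.

```python
from itertools import product


def cv_grid(use_net):
    """Enumerate (C, sigma, nu) candidates for the ranges suggested above."""
    # C follows the scale of the maximal RT.
    c_values = [0.001, 0.01, 0.1, 1] if use_net else [1, 10, 100, 1000]
    # Assumed step sizes: sigma in steps of 1, nu in steps of 0.1.
    sigma_values = list(range(2, 13))
    nu_values = [0.3, 0.4, 0.5, 0.6, 0.7]
    return list(product(c_values, sigma_values, nu_values))


net_grid = cv_grid(use_net=True)
raw_grid = cv_grid(use_net=False)
```

With these step sizes each grid has 4 * 11 * 5 = 220 combinations, so 
coarser steps may be sensible if the cross-validation takes too long.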

If this does not help, you could also send me your IdXML file and I could 
have a look at it.

Best regards,
    Nico