[Gusdev-gusdev] Re: LoadBlastSimFast issue

Ok,

Steve Fischer wrote:

> arnaud-
>
> ok, would you like to do the upgrade to LoadBlastSimFast?
>
So the plugin would take two extra parameters, '-queryIdAttr' and
'-subjectIdAttr', is that right?
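
To make the proposal concrete, here is a minimal sketch of what resolving an identifier through a configurable attribute could look like. It is written in Python against an in-memory SQLite table and is not taken from LoadBlastSimFast.pm, which is Perl. The function name, the whitelist, the demo table and its rows are all assumptions made for illustration; only the attribute names (source_id, secondary_identifier, name) come from this thread.

```python
import sqlite3

# Attribute names mentioned in this thread; the whitelist itself is an assumption.
ALLOWED_ID_ATTRS = {"source_id", "secondary_identifier", "name"}


def resolve_primary_key(conn, table, pk_column, identifier, id_attr=None):
    """Map an identifier from the BLAST output to a primary key.

    With id_attr=None the identifier is taken to be the primary key itself
    (the original behaviour described in the thread). Otherwise it is looked
    up through the given attribute, e.g. id_attr="source_id".
    """
    if id_attr is None:
        return int(identifier)
    if id_attr not in ALLOWED_ID_ATTRS:
        raise ValueError(f"unsupported identifier attribute: {id_attr}")
    # table and pk_column are supplied by the calling code, never by user input;
    # only the identifier value is bound as a query parameter.
    rows = conn.execute(
        f"SELECT {pk_column} FROM {table} WHERE {id_attr} = ?", (identifier,)
    ).fetchall()
    if len(rows) != 1:
        raise LookupError(f"{len(rows)} rows match {id_attr}={identifier!r}")
    return rows[0][0]


def make_demo_db():
    """Build a tiny hypothetical table standing in for a GUS sequence view."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE aa_sequence (aa_sequence_id INTEGER PRIMARY KEY, source_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO aa_sequence VALUES (?, ?)",
        [(101, "PROT_A"), (102, "12345")],
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    for ident in ("PROT_A", "12345"):
        pk = resolve_primary_key(
            conn, "aa_sequence", "aa_sequence_id", ident, id_attr="source_id"
        )
        print(ident, "->", pk)
```

The second demo row has a digits-only source_id on purpose: with an explicit id_attr it resolves to row 102, whereas guessing that a digits-only identifier is a primary key would treat it as key 12345.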

> if i understand correctly this would avoid having to modify the 
> schema, is that right?

I think so.

Another thing: the plugin requires 'use' statements to load the classes
of the sequence objects we want to attach similarity data to.
Could we somehow bypass this declaration? In theory we would want to
attach similarity data to any view on top of NASequenceImp or
AASequenceImp. Would it be feasible to instantiate AASequence or
NASequence superclass objects and use the subclass_view attribute to
assign the data to the correct view?
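
As a rough illustration of the idea, the sketch below looks a sequence up directly in the underlying Imp table and reads subclass_view from the row, so the caller does not need to know in advance which view the sequence belongs to. It is Python over an in-memory SQLite table, not GUS object-layer code. The primary key column names, the function name, the demo rows and the second view name are assumptions; the table names, source_id and subclass_view are the ones mentioned in this thread.

```python
import sqlite3

# Imp tables named in the thread, mapped to ASSUMED primary key column names.
IMP_TABLES = {
    "AASequenceImp": "aa_sequence_id",
    "NASequenceImp": "na_sequence_id",
}


def find_in_imp_table(conn, imp_table, source_id):
    """Return (primary_key, subclass_view) for one sequence.

    The lookup goes through the Imp table itself, so no per-view class
    has to be declared up front; subclass_view tells the caller which
    view the matching row belongs to.
    """
    pk_column = IMP_TABLES[imp_table]  # KeyError for an unknown table name
    rows = conn.execute(
        f"SELECT {pk_column}, subclass_view FROM {imp_table} WHERE source_id = ?",
        (source_id,),
    ).fetchall()
    if len(rows) != 1:
        raise LookupError(f"{len(rows)} rows match source_id={source_id!r}")
    return rows[0]


def make_demo_imp_db():
    """Build a tiny hypothetical stand-in for AASequenceImp."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE AASequenceImp ("
        "aa_sequence_id INTEGER PRIMARY KEY, subclass_view TEXT, source_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO AASequenceImp VALUES (?, ?, ?)",
        [
            (1, "TranslatedAASequence", "PROT_A"),
            (2, "SomeOtherAAView", "PROT_B"),  # illustrative view name
        ],
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_imp_db()
    print(find_in_imp_table(conn, "AASequenceImp", "PROT_A"))
```

One caveat with this approach: source_id would have to be unique across all views sharing the Imp table, which is why the sketch raises an error when more than one row matches.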

>
> steve
>
> Arnaud Kerhornou wrote:
>
>>
>> Steve Fischer wrote:
>>
>>> Arnaud-
>>>
>>> ok, i've looked at LoadBlastSimFast.pm.     I see the addition of 
>>> the logic to use name to get the query object (and i know that we 
>>> discussed this in mail back in august).
>>>
>>> I am having some second thoughts about that change as it stands. The 
>>> original intent of the plugin was that the sequences submitted to 
>>> the blast process have been extracted from the database and 
>>> therefore have the primary key in their definition line.
>>>
>>> I think I understand that it would be useful to be able to skip that
>>> step, i.e., blast sequences using their native identifiers, and then
>>> have the plugin discover what their internal primary key is.
>>> That's what you want to do, right?
>>>
>> that's right.
>>
>>> Does anybody know of any reason why that would not be ok?
>>>
>>> Assuming that nobody has any objections, maybe the best solution 
>>> would be to improve the plugin to take optional arguments that 
>>> specify the name of the query and/or subject identifier 
>>> attributes?    For example:  -queryIdAttr source_id.   This would 
>>> give us full flexibility (and also avoid the slightly risky 
>>> assumption that a digits-only identifier must be the primary key)
>>>
>> that sounds sensible.
>>
>>> steve
>>>
>>> Arnaud Kerhornou wrote:
>>>
>>>>
>>>> Steve Fischer wrote:
>>>>
>>>>> Arnaud-
>>>>>
>>>>> see below.
>>>>>
>>>>> steve
>>>>>
>>>>> Arnaud Kerhornou wrote:
>>>>>
>>>>>> Hi everyone
>>>>>>
>>>>>> To be able to reproduce the OrthoMCL method, I would like to 
>>>>>> raise two issues we've got:
>>>>>>
>>>>>> * The first issue relates to the view where the protein sequences
>>>>>> are stored. I was thinking of using the TranslatedAASequence view,
>>>>>> as it contains the translated sequences of our gene models. The
>>>>>> problem is that it is missing a name attribute, so I cannot match
>>>>>> the blast output query and subject names with the data in GUS (I
>>>>>> didn't want to use the TranslatedAASequence primary keys as the
>>>>>> identifiers of my proteins of interest).
>>>>>> Could we add a name attribute to this view?
>>>>>
>>>>> hmm.   not quite following.   what would this name be, where would 
>>>>> it be derived from?
>>>>
>>>> By default we assign the systematic id of the corresponding CDS to 
>>>> the protein name.
>>>>
>>>>>   why not use source_id and/or secondary_identifier? 
>>>>
>>>> We could do that, but in any case it would involve modifying the
>>>> code of the BLAST output loading plugin (LoadBlastSimFast.pm) to
>>>> get the sequence entries. At the moment the match is made on the
>>>> primary key (which I want to avoid) or the name attribute. The
>>>> source_id attribute would do instead of the name attribute. It must
>>>> work for any blast (DNA vs DNA or protein vs protein) with the
>>>> various potential GUS sequence objects we want to attach similarity
>>>> data to. As far as I can see, the source_id attribute is present in
>>>> all of them (AASequenceImp and NASequenceImp tables).
>>>>
>>>>>   or, presumably this translated sequence has a relationship back 
>>>>> to its na sequence (although i don't immediately see that in the 
>>>>> schema browser), so couldn't you get a name or source_id from there?
>>>>>
>>>> That would require a more sophisticated query to get the sequence
>>>> entry.
>>>>
>>>>>>
>>>>>> * The second issue relates to the BLAST output parsing, done by a
>>>>>> module called BlastAnal.pm in the CBIL package. This module seems
>>>>>> to parse only BLAST output files with a single query sequence. I
>>>>>> have more than one query sequence reported, so I had to change the
>>>>>> code of this module to allow more than one query sequence. Can my
>>>>>> code be integrated into the CBIL package? Note that I didn't change
>>>>>> the interface of this module, so it doesn't affect the scripts
>>>>>> that use it; I'm thinking in particular of
>>>>>> parseBlastFilesForSimilarity.pl
>>>>>
>>>>> this sounds ok.   how about we just take a quick look at this 
>>>>> together while you are visiting?   then we can fold it into the 
>>>>> code base.   do you want to send it by mail?
>>>>>
>>>> That's fine, the module is attached.
>>>>
>>>>>>
>>>>>> cheers
>>>>>> Arnaud
>>>>>>
>>>>>>
>