Re: [Gusdev-gusdev] LoadAnnotatedSeqs plugin underway

paul-

let me start digesting this by email.

about your extensions to EMBL.  the BioPerl model we are parsing into is 
based on generic features, tags and annotation.  as long as the 
extensions can be parsed into those objects we're halfway there.   are 
the extensions syntactically consistent w/ standard embl files, but 
varying only in the particulars of what the data is called?
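
To make the question concrete, here is a minimal Python sketch (not BioPerl, and not the plugin's code) of what "syntactically consistent" would mean in practice: a parser that collects any `/tag="value"` qualifier into generic tag/value pairs without a whitelist, so non-standard tags come through like standard ones. The function name `parse_qualifiers` and the `gene` and `systematic_id` values are hypothetical; the `/curation` line is the one quoted later in this thread. Multi-line qualifier values are not handled.

```python
import re

# Matches one EMBL-style qualifier, e.g. /gene="abc1" or a bare /pseudo.
QUALIFIER_RE = re.compile(r'^/(?P<tag>[A-Za-z_][\w-]*)(?:=(?P<value>.*))?$')


def parse_qualifiers(lines):
    """Collect qualifier lines into {tag: [values]} with no tag whitelist."""
    tags = {}
    for line in lines:
        match = QUALIFIER_RE.match(line.strip())
        if match is None:
            continue  # not a qualifier line; ignored in this sketch
        value = match.group("value")
        # Valueless qualifiers are stored as an empty string.
        value = "" if value is None else value.strip('"')
        tags.setdefault(match.group("tag"), []).append(value)
    return tags


example = [
    '/gene="abc1"',              # hypothetical value
    '/systematic_id="XYZ0001"',  # hypothetical value
    '/curation="Matt Berriman; genedb; 20020128; public; comment"',
]
tags = parse_qualifiers(example)

# The semicolon-separated curation value can be split after parsing.
curation_fields = [field.strip() for field in tags["curation"][0].split(";")]
```

If the extended files follow this shape, only the mapping of tag names to GUS would be centre-specific, not the parsing itself.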

about building the hierarchy.  if you looked at the BioPerl api for the 
unflattener, you'd see that it uses the gene name as a clue for 
deciding which features go together in a particular gene model.

can gene name be relied upon to identify all the features that are 
associated with this gene?
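
A rough Python sketch of the grouping idea, under stated assumptions: each feature is a (type, tags) pair, and features are bucketed by the first identifying qualifier they carry. This is not the BioPerl unflattener's algorithm. The precedence order in `ID_TAGS` and all example values are hypothetical, chosen only to show what happens when identifiers are mixed within one sequence.

```python
from collections import defaultdict

# Assumed precedence, not confirmed anywhere in this thread:
# prefer the stable name, then the temporary one, then /gene.
ID_TAGS = ("systematic_id", "temporary_systematic_id", "gene")


def grouping_key(tags):
    """Return the first identifier found among ID_TAGS, or None."""
    for tag in ID_TAGS:
        values = tags.get(tag)
        if values:
            return values[0]
    return None


def group_features(features):
    """Bucket feature types by their grouping key."""
    groups = defaultdict(list)
    for kind, tags in features:
        groups[grouping_key(tags)].append(kind)
    return dict(groups)


features = [
    ("gene", {"gene": ["abc1"]}),
    ("mRNA", {"gene": ["abc1"]}),
    ("CDS", {"systematic_id": ["XYZ0001"]}),
    ("CDS", {"temporary_systematic_id": ["XYZ0002"]}),
    ("misc_feature", {}),
]
groups = group_features(features)
```

Features with no identifier land under the `None` key, and a CDS that carries only a systematic_id is not joined to the gene and mRNA that carry only /gene. Those are the cases where name-based grouping alone would need help.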

finally, about the GO stuff, yes, we can probably reuse your code.

steve

Paul Mooney wrote:

>
> On 9 Dec 2004, at 19:31, Steve Fischer wrote:
>
>> paul-
>>
>> hey.  do you want to set up a time to chat so i can catch you up on 
>> what we have in mind?
>
>
>
> At the moment I'm curious how much can be achieved via a generic 
> plugin. I think the plugin will need plugins to do specialised parts 
> :) However I'd be glad to give my assistance to the effort. Below are 
> my random thoughts I've just had on the matter;
>
>
> Here at the PSU we store an awful lot of info that can not be stored 
> in a standard EMBL file, hence we have extended it to fit our own 
> needs. As an example we use several name qualifiers for genes;
>
>     . systematic_id           - the name cast in stone
>     . temporary_systematic_id - the name as it is currently known
>     . previous_systematic_id  - as it was known
>     . gene                    - EMBL standard qualifier
>
> Hence just trying to unflatten the EMBL file is tricky because 
> systematic_ids and temporary_systematic_ids are mixed in the same 
> sequence, so building the hierarchy would need specialised code. 
> TIGR XML has the same issue though, so maybe it's not too specialised 
> after all :/ (PUB_LOCUS and LOCUS have a direct mapping to 
> systematic_id and temporary_systematic_id).
>
> Something like this entry;
>     /curation="name; origin; date; permission; type; dbref; notes ..."
> i.e.
>     /curation="Matt Berriman; genedb; 20020128; public; comment"
> is unique to the PSU and I'm not sure where it fits in GUS.
>
> However;
>
> I have code that creates GO entries - supply a high level function 
> with all the standard GO fields and it creates the 5 rows (?) in the 
> different tables as required. This is definitely something that can be 
> shared across centres, perhaps in a code library. All your code has to 
> do is parse out the GO fields from the data. No reason why it couldn't 
> accept a GO Bioperl object (I presume one exists).
>
> Perhaps the parsing needs to be a super class for each data source, 
> which is then sub-classed by each centre?
>
> Ok, enough ramblings. Does any of this make sense?
> Paul.
>
>> steve
>>
>> Chris Stoeckert wrote:
>>
>>> Hi Steve,
>>> Thanks for putting this out on gusdev. Marie-Adele indicated that 
>>> Paul Mooney was very interested in this and I will likely meet with 
>>> him about this when I visit in January. Please include him in email 
>>> correspondence when not addressed to the general gusdev list.
>>> Thanks,
>>> Chris
>>>
>>> On Dec 9, 2004, at 2:11 PM, Steve Fischer wrote:
>>>
>>>> folks-
>>>>
>>>> the UGA folks and CBIL folks have started collaborating on a new 
>>>> plugin called LoadAnnotatedSeqs.   It will use BioPerl to parse the 
>>>> input data.
>>>>
>>>> We expect it to take annotated sequences (NA at first) in GenBank, 
>>>> TIGR XML and EMBL formats (plus any others supported by the BioPerl 
>>>> parser).
>>>>
>>>> It will take an XML file that describes the mapping from the input 
>>>> features to GUS features, and SO features.
>>>> It will also hard code special cases to handle qualifier data that 
>>>> is distributed to tables outside of the NAFeature tables.
>>>>
>>>> For our projects we will be developing a mapping that unifies the 
>>>> semantics of the data we are getting from our different sources and 
>>>> formats.
>>>> (we plan to work with the PSU folks to incorporate the knowledge 
>>>> they have acquired in their work to make an EMBL parser)
>>>>
>>>> ideas and suggestions are encouraged.
>>>>
>>>> steve