Re: [Gusdev-gusdev] LoadAnnotatedSeqs plugin underway

folks-

i took a look at Ed's dump of the bioperl objects created by the parse 
of genbank.

for genbank, the bioperl Annotation objects are only used to describe 
the sequence and not any individual features.

our mapping assumes that, so we're lucky so far.   we'll need to have a 
look at tigr.

nonetheless, i think we need to adjust our mapping XML schema a tad.   
the main insight is that the source of our mapping is not genbank, or 
tigr, etc, but... bioperl objects.   our mapping syntax must describe 
how to map bioperl feature objects into gus, regardless of the origin of 
the data.    and, bioperl features have 'tags' and 'annotation'.

right now we have a <qualifier> tag that was intended to map an input 
qualifier to a gus attribute.

but, bioperl doesn't have 'qualifiers' so, i think we need to replace 
<qualifier> with:
  <tag>

so far, we don't need <annotation> for the feature mapping, and let's 
hope we don't.    but, if we do, at least our xml will be forward 
compatible.
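
to make the idea concrete, here is a rough sketch of a mapping driven by 
<tag> elements.  it is python rather than perl, purely for illustration.  
everything named in it is an assumption and not our actual schema: the 
<mapping> and <feature> elements, the name/column/table attributes, the 
table name, and the helper functions are all made up for the example.

```python
import xml.etree.ElementTree as ET

# hypothetical mapping file: element, attribute and table names are
# illustrative only, not the real LoadAnnotatedSeqs schema
MAPPING_XML = """
<mapping>
  <feature name="CDS" table="ExampleFeatureTable">
    <tag name="product" column="product"/>
    <tag name="gene" column="name"/>
  </feature>
</mapping>
"""

def load_mapping(xml_text):
    """Return {feature name: (table, {tag name: column})}."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for feat in root.findall("feature"):
        tag_to_column = {t.get("name"): t.get("column")
                         for t in feat.findall("tag")}
        mapping[feat.get("name")] = (feat.get("table"), tag_to_column)
    return mapping

def map_feature(mapping, feature_name, tags):
    """Map a feature's tags (tag name -> list of values) to a row dict.

    Tags with no <tag> entry in the mapping are returned separately so the
    caller can decide whether to warn or fail.
    """
    table, tag_to_column = mapping[feature_name]
    row, unmapped = {}, []
    for tag, values in tags.items():
        if tag in tag_to_column:
            row[tag_to_column[tag]] = values[0]  # sketch: keep first value only
        else:
            unmapped.append(tag)
    return table, row, unmapped

mapping = load_mapping(MAPPING_XML)
table, row, unmapped = map_feature(
    mapping, "CDS", {"product": ["kinase"], "gene": ["abc1"], "note": ["x"]}
)
```

the point of the sketch is only that the mapping keys off feature and tag 
names as bioperl reports them, so the same file would work whether the 
feature came from genbank, embl or tigr xml.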

that said, we still owe ourselves a mapping for the tags and annotation 
that directly describe the sequence.

steve

Steve Fischer wrote:

> paul-  see in line
>
> steve
>
> Paul Mooney wrote:
>
>>
>> On 10 Dec 2004, at 12:52, Steve Fischer wrote:
>>
>>> paul-
>>>
>>> ok, i see.
>>>
>>> are there any other examples besides curation in which you have 
>>> placed structured data in qualifiers?     are there examples of 
>>> standard embl qualifiers in which you expect to find structured data 
>>> and parse it?
>>>
>>
>> After talking with Arnaud it seems we can take each 
>> qualifier/structured field and create a new feature, with each one of 
>> its qualifiers holding one piece of data. This would fit into your 
>> mapping scheme.
>>
> ok.  great.  i was wondering about that.
>
> so does that mean that we can expect that no qualifiers will contain 
> structured data that needs to be parsed?
>
>>> in the case of curation, where do you put that info in GUS?
>>
>>
>>
>> It will probably end up as a note, for now at least.
>>
>>>
>>> about systematic_ids, i understand what you've said.   one thing 
>>> though.  how do they relate to gene names?
>>
>>
>>
> ok, but, what i'm driving at is that the unflattener uses gene name 
> (/gene=) to decide what features go together in one gene model.   
> really, it wouldn't matter what the value of the /gene= is, as long as 
> it is identical for all features that belong to the gene.   is that 
> consistent with your use of /gene?
>
>> They are the gene names :)
>> Standard EMBL uses a /gene qualifier for the gene symbol and 
>> /standard_name for the human readable name.
>> During sequencing and annotation using a single /gene conveys no 
>> meaning as to how stable/temporary the ID is.
>>
>>> steve
>>>
>>> Paul Mooney wrote:
>>>
>>>>
>>>> On 9 Dec 2004, at 23:21, Steve Fischer wrote:
>>>>
>>>>> paul-
>>>>>
>>>>> let me start digesting this by email.
>>>>>
>>>>> about your extensions to EMBL.  the bioPerl model we are parsing 
>>>>> into is based on generic features, tags and annotation.  as long 
>>>>> as the extensions can be parsed into those objects we're half way 
>>>>> there.   are the extensions syntactically consistent w/ standard 
>>>>> embl files, but varying only in the particulars of what the data 
>>>>> is called?
>>>>
>>>>
>>>>
>>>>
>>>> We have additional qualifiers with values. The values hold 
>>>> structured information (say key=value pairs).
>>>> Bioperl will quite happily parse them into tags and values.
>>>> What controls the mapping of a tag to GUS object(s)?
>>>> What parses the structured information out to populate the 
>>>> object(s) and fill in the object's fields (which is another mapping)?
>>>>
>>>> Something like this non-EMBL standard entry, curation, has several 
>>>> values in a fixed field format;
>>>>
>>>>     /curation="name; origin; date; permission; type; dbref; notes ..."
>>>> i.e.
>>>>     /curation="Matt Berriman; genedb; 20020128; public; comment"
>>>>
>>>> How do we specify where to put this in GUS? It's very PSU specific. 
>>>> Perhaps some sort of hook with specifying some perl code elsewhere 
>>>> to handle it?
>>>> We currently store GO annotation in EMBL like this;
>>>>
>>>>     /GO="aspect=process; GOid=GO:0006810; term=transport; 
>>>> evidence=ISS; db_xref=GOC:unpublished; with=SPTR:Q9UQ36; 
>>>> date=20001122"
>>>>
>>>> as EMBL only has the format /db_xref="GO:00123" but I hope there is 
>>>> a GO flat file loader so we don't have to worry about this in the 
>>>> future.
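
a rough sketch of what such a hook might do with the two values quoted 
above.  python only for illustration; the function names are made up, and 
the field order is simply the one shown in the /curation example.  it 
says nothing about where the parsed pieces would go in gus.

```python
# field order taken from the /curation example quoted above
CURATION_FIELDS = ["name", "origin", "date", "permission", "type", "dbref", "notes"]

def parse_curation(value):
    """Split a fixed-field /curation value into a dict.

    Missing trailing fields become None; any extra ';' ends up in 'notes'.
    """
    parts = [p.strip() for p in value.split(";", len(CURATION_FIELDS) - 1)]
    parts += [None] * (len(CURATION_FIELDS) - len(parts))
    return dict(zip(CURATION_FIELDS, parts))

def parse_go(value):
    """Split a 'key=value; key=value' /GO value into a dict."""
    fields = {}
    # collapse whitespace first so values wrapped across lines still parse
    for item in " ".join(value.split()).split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, val = item.partition("=")  # split on the first '=' only
        fields[key.strip()] = val.strip()
    return fields

curation = parse_curation("Matt Berriman; genedb; 20020128; public; comment")
go = parse_go("aspect=process; GOid=GO:0006810; term=transport; "
              "evidence=ISS; db_xref=GOC:unpublished; with=SPTR:Q9UQ36; "
              "date=20001122")
```
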
>>>>
>>>>> about building the hierarchy.  if you looked at the bioperl api 
>>>>> for the unflattener, you'd see that its unflattening uses gene 
>>>>> name as a clue to deciding what features go together in a 
>>>>> particular gene model.
>>>>>
>>>>> can gene name be relied upon to identify all the features that are 
>>>>> associated with this gene?
>>>>
>>>>
>>>>
>>>>
>>>> You can switch to use any qualifier you like to identify groups, 
>>>> but you can only specify *one*.
>>>> We can have 2 :)
>>>> In the same sequence a gene may be identified by systematic_id.
>>>> Another gene in the same sequence may be identified by 
>>>> temporary_systematic_id.
>>>> Eventually all genes will get a systematic_id but not straight away.
>>>>
>>>> In theory it should be easy to modify the unflattener to use a 'best 
>>>> name first' policy.
>>>>
>>>> For TIGR XML you'd have PUB_LOCUS and LOCUS as the best names, in 
>>>> that order. They too mix identifiers but since the XML already has 
>>>> a hierarchy you might get away with it????
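
a sketch of the 'best name first' idea, in python for illustration.  this 
is not the bioperl unflattener or its api, just the grouping rule on its 
own.  the priority order (with /gene as a last resort) is an assumption, 
and the ids in the example are made up.

```python
# qualifiers to try, best first; the order is an assumption for this sketch
NAME_PRIORITY = ["systematic_id", "temporary_systematic_id", "gene"]

def best_name(tags, priority=NAME_PRIORITY):
    """Return the value of the first qualifier in `priority` that is present."""
    for key in priority:
        values = tags.get(key)
        if values:
            return values[0]
    return None

def group_features(features, priority=NAME_PRIORITY):
    """Group features (each a dict: tag name -> list of values) by best name."""
    groups = {}
    for feature in features:
        groups.setdefault(best_name(feature, priority), []).append(feature)
    return groups

# made-up identifiers, for illustration only
features = [
    {"systematic_id": ["SYS0001"], "gene": ["abc1"]},
    {"systematic_id": ["SYS0001"]},
    {"temporary_systematic_id": ["TMP0042"]},
]
groups = group_features(features)
```

features carrying none of the listed qualifiers land under the None key, 
which makes them easy to report.
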
>>>>
>>>>
>>>>> finally, about the GO stuff, yes, we can probably reuse your code.
>>>>>
>>>>> steve
>>>>>
>>>>>
>>>>> Paul Mooney wrote:
>>>>>
>>>>>>
>>>>>> On 9 Dec 2004, at 19:31, Steve Fischer wrote:
>>>>>>
>>>>>>> paul-
>>>>>>>
>>>>>>> hey.  do you want to set up a time to chat so i can catch you up 
>>>>>>> on what we have in mind?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> At the moment I'm curious how much can be achieved via a generic 
>>>>>> plugin. I think the plugin will need plugins to do specialised 
>>>>>> parts :) However I'd be glad to give my assistance to the effort. 
>>>>>> Below are my random thoughts I've just had on the matter;
>>>>>>
>>>>>>
>>>>>> Here at the PSU we store an awful lot of info that can not be 
>>>>>> stored in a standard EMBL file, hence we have extended it to fit 
>>>>>> our own needs. As an example we use several name qualifiers for 
>>>>>> genes;
>>>>>>
>>>>>>     . systematic_id           - the name cast in stone
>>>>>>     . temporary_systematic_id - the name as it is currently known
>>>>>>     . previous_systematic_id  - as it was known
>>>>>>     . gene                    - EMBL standard qualifier
>>>>>>
>>>>>> Hence just trying to unflatten the EMBL file is tricky because 
>>>>>> systematic_ids and temporary_systematic_ids are mixed in the same 
>>>>>> sequence, hence building the hierarchy would need specialised 
>>>>>> code. TIGR XML has the same issue though so maybe it's not too 
>>>>>> specialised after all :/ (PUB_LOCUS and LOCUS have a direct 
>>>>>> mapping to systematic_id and temporary_systematic_id).
>>>>>>
>>>>>> Something like this entry;
>>>>>>     /curation="name; origin; date; permission; type; dbref; notes 
>>>>>> ..."
>>>>>> i.e.
>>>>>>     /curation="Matt Berriman; genedb; 20020128; public; comment"
>>>>>> is unique to the PSU and I'm not sure where it fits in GUS.
>>>>>>
>>>>>> However;
>>>>>>
>>>>>> I have code that creates GO entries - supply a high level 
>>>>>> function with all the standard GO fields and it creates the 5 
>>>>>> rows (?) in the different tables as required. This is definitely 
>>>>>> something that can be shared across centres, perhaps in a code 
>>>>>> library. All your code has to do is parse out the GO fields from 
>>>>>> the data. No reason why it couldn't accept a GO Bioperl object (I 
>>>>>> presume one exists).
>>>>>>
>>>>>> Perhaps the parsing needs to be a super class for each data source 
>>>>>> and then sub-classed by each centre?
>>>>>>
>>>>>> Ok, enough ramblings. Does any of this make sense?
>>>>>> Paul.
>>>>>>
>>>>>>> steve
>>>>>>>
>>>>>>> Chris Stoeckert wrote:
>>>>>>>
>>>>>>>> Hi Steve,
>>>>>>>> Thanks for putting this out on gusdev. Marie-Adele indicated 
>>>>>>>> that Paul Mooney was very interested in this and I will likely 
>>>>>>>> meet with him about this when I visit in January. Please 
>>>>>>>> include him in email correspondence when not addressed to the 
>>>>>>>> general gusdev list.
>>>>>>>> Thanks,
>>>>>>>> Chris
>>>>>>>>
>>>>>>>> On Dec 9, 2004, at 2:11 PM, Steve Fischer wrote:
>>>>>>>>
>>>>>>>>> folks-
>>>>>>>>>
>>>>>>>>> the UGA folks and CBIL folks have started collaborating on a 
>>>>>>>>> new plugin called LoadAnnotatedSeqs.   It will use BioPerl to 
>>>>>>>>> parse the input data.
>>>>>>>>>
>>>>>>>>> We expect it to take annotated sequences (NA at first) in 
>>>>>>>>> genbank, tigr xml and embl formats (plus any others supported 
>>>>>>>>> by the bioPerl parser).
>>>>>>>>>
>>>>>>>>> It will take an XML file that describes the mapping from the 
>>>>>>>>> input features to GUS features, and SO features.
>>>>>>>>> It will also hard code special cases to handle qualifier data 
>>>>>>>>> that is distributed to tables outside of the NAFeature tables.
>>>>>>>>>
>>>>>>>>> For our projects we will be developing a mapping that unifies 
>>>>>>>>> the semantics of the data we are getting from our different 
>>>>>>>>> sources and formats.
>>>>>>>>> (we plan to work with the PSU folks to incorporate the 
>>>>>>>>> knowledge they have acquired in their work to make an EMBL 
>>>>>>>>> parser)
>>>>>>>>>
>>>>>>>>> ideas and suggestions are encouraged.
>>>>>>>>>
>>>>>>>>> steve
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -------------------------------------------------------
>>>>>>>>> SF email is sponsored by - The IT Product Guide
>>>>>>>>> Read honest & candid reviews on hundreds of IT Products from 
>>>>>>>>> real users.
>>>>>>>>> Discover which products truly live up to the hype. Start 
>>>>>>>>> reading now. http://productguide.itmanagersjournal.com/
>>>>>>>>> _______________________________________________
>>>>>>>>> Gusdev-gusdev mailing list
>>>>>>>>> Gus...@li...
>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/gusdev-gusdev
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>
>
>