Re: [Gusdev-gusdev] LoadAnnotatedSeqs plugin underway

paul-  see in line

steve

Paul Mooney wrote:

>
> On 10 Dec 2004, at 12:52, Steve Fischer wrote:
>
>> paul-
>>
>> ok, i see.
>>
>> are there any other examples besides curation in which you have 
>> placed structured data in qualifiers?     are there examples of 
>> standard embl qualifiers in which you expect to find structured data 
>> and parse it?
>>
>
> After talking with Arnaud it seems we can take each 
> qualifier/structured field and create a new feature, with each one of 
> its qualifiers holding one piece of data. This would fit into your 
> mapping scheme.
>
ok.  great.  i was wondering about that.

so does that mean that we can expect that no qualifiers will contain 
structured data that needs to be parsed?
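
A rough sketch of the "one feature per structured qualifier" idea, in Python
for brevity rather than Perl. The field order comes from the /curation example
quoted further down; the function names and the dict layout ("primary_tag" plus
"qualifiers") are assumptions made for illustration, not BioPerl or GUS API.

```python
# Field order taken from the /curation example quoted in this thread.
CURATION_FIELDS = ["name", "origin", "date", "permission", "type", "dbref", "notes"]


def curation_to_feature(value):
    """Split a fixed-field /curation value into a feature-like dict.

    The dict layout is hypothetical; it only loosely mirrors a generic
    feature carrying tags and values.
    """
    # Limit the split so that a free-text notes field may contain ';'.
    parts = [p.strip() for p in value.split(";", len(CURATION_FIELDS) - 1)]
    # zip stops at the shorter list, so absent trailing fields are left out.
    qualifiers = {f: p for f, p in zip(CURATION_FIELDS, parts) if p}
    return {"primary_tag": "curation", "qualifiers": qualifiers}


def go_to_feature(value):
    """Split a 'key=value; key=value' style /GO value into a feature-like dict."""
    qualifiers = {}
    for pair in value.split(";"):
        key, sep, val = pair.partition("=")
        if sep:  # ignore fragments that carry no '='
            qualifiers[key.strip()] = val.strip()
    return {"primary_tag": "GO", "qualifiers": qualifiers}


if __name__ == "__main__":
    print(curation_to_feature("Matt Berriman; genedb; 20020128; public; comment"))
```

Each resulting qualifier then holds a single piece of data, so it could go
through the same tag-to-GUS mapping as any ordinary qualifier.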

>> in the case of curation, where do you put that info in GUS?
>
>
> It will probably end up as a note, for now at least.
>
>>
>> about systematic_ids, i understand what you've said.   one thing 
>> though.  how do they relate to gene names?
>
>
ok, but, what i'm driving at is that the unflattener uses gene name 
(/gene=) to decide what features go together in one gene model.   
really, it wouldn't matter what the value of the /gene= is, as long as 
it is identical for all features that belong to the gene.   is that 
consistent with your use of /gene?
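
To make the grouping question concrete, here is a small Python sketch. It is
not the BioPerl unflattener, only an illustration of grouping on a shared
qualifier value; the name group_features and the dict layout are assumptions
made for the example.

```python
from collections import defaultdict


def group_features(features, name_tags=("gene",)):
    """Group feature dicts by the first tag in name_tags that each one carries.

    Returns (groups, ungrouped): groups maps the qualifier value to the
    features sharing it; ungrouped holds features with none of the tags.
    """
    groups = defaultdict(list)
    ungrouped = []
    for feature in features:
        qualifiers = feature.get("qualifiers", {})
        for tag in name_tags:
            if tag in qualifiers:
                groups[qualifiers[tag]].append(feature)
                break
        else:
            # No grouping tag present on this feature.
            ungrouped.append(feature)
    return dict(groups), ungrouped


if __name__ == "__main__":
    feats = [
        {"primary_tag": "gene", "qualifiers": {"gene": "x1"}},
        {"primary_tag": "CDS", "qualifiers": {"gene": "x1"}},
    ]
    print(group_features(feats))
```

With the default name_tags the actual value of /gene= is irrelevant so long as
it matches across the features of one gene. Passing
name_tags=("systematic_id", "temporary_systematic_id") would give the
'best name first' fallback mentioned further down the thread.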

> They are the gene names :)
> Standard EMBL uses a /gene qualifier for the gene symbol and 
> /standard_name for the human readable name.
> During sequencing and annotation using a single /gene conveys no 
> meaning as to how stable/temporary the ID is.
>
>> steve
>>
>> Paul Mooney wrote:
>>
>>>
>>> On 9 Dec 2004, at 23:21, Steve Fischer wrote:
>>>
>>>> paul-
>>>>
>>>> let me start digesting this by email.
>>>>
>>>> about your extensions to EMBL.  the bioPerl model we are parsing 
>>>> into is based on generic features, tags and annotation.  as long as 
>>>> the extensions can be parsed into those objects we're half way 
>>>> there.   are the extensions syntactically consistent w/ standard 
>>>> embl files, but varying only in the particulars of what the data is 
>>>> called?
>>>
>>>
>>>
>>> We have additional qualifiers with values. The values hold 
>>> structured information (say key=value pairs).
>>> Bioperl will quite happily parse them into tags and values.
>>> What controls the mapping of a tag to GUS object(s)?
>>> What parses the structured information out to populate the object(s) 
>>> and fill in the object's fields (which is another mapping)?
>>>
>>> Something like this non-EMBL standard entry, curation, has several 
>>> values in a fixed field format;
>>>
>>>     /curation="name; origin; date; permission; type; dbref; notes ..."
>>> i.e.
>>>     /curation="Matt Berriman; genedb; 20020128; public; comment"
>>>
>>> How do we specify where to put this in GUS? It's very PSU specific. 
>>> Perhaps some sort of hook with specifying some perl code elsewhere 
>>> to handle it?
>>> We currently store GO annotation in EMBL like this;
>>>
>>>     /GO="aspect=process; GOid=GO:0006810; term=transport; 
>>> evidence=ISS; db_xref=GOC:unpublished; with=SPTR:Q9UQ36; date=20001122"
>>>
>>> as EMBL only has the format /db_xref="GO:00123" but I hope there is 
>>> a GO flat file loader so we don't have to worry about this in the 
>>> future.
>>>
>>>> about building the hierarchy.  if you looked at the bioperl api for 
>>>> the unflattener, you'd see that its unflattening uses gene name as 
>>>> a clue to deciding what features go together in a particular gene 
>>>> model.
>>>>
>>>> can gene name be relied upon to identify all the features that are 
>>>> associated with this gene?
>>>
>>>
>>>
>>> You can switch to use any qualifier you like to identify groups, but 
>>> you can only specify *one*.
>>> We can have 2 :)
>>> In the same sequence a gene may be identified by systematic_id.
>>> Another gene in the same sequence may be identified by 
>>> temporary_systematic_id.
>>> Eventually all genes will get a systematic_id but not straight away.
>>>
>>> In theory it should be easy to modify the unflattener to use a 'best 
>>> name first' policy.
>>>
>>> For TIGR XML you'd have PUB_LOCUS and LOCUS as the best names, in 
>>> that order. They too mix identifiers, but since the XML already has 
>>> a hierarchy you might get away with it?
>>>
>>>
>>>> finally, about the GO stuff, yes, we can probably reuse your code.
>>>>
>>>> steve
>>>>
>>>>
>>>> Paul Mooney wrote:
>>>>
>>>>>
>>>>> On 9 Dec 2004, at 19:31, Steve Fischer wrote:
>>>>>
>>>>>> paul-
>>>>>>
>>>>>> hey.  do you want to set up a time to chat so i can catch you up 
>>>>>> on what we have in mind?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> At the moment I'm curious how much can be achieved via a generic 
>>>>> plugin. I think the plugin will need plugins to do specialised 
>>>>> parts :) However I'd be glad to give my assistance to the effort. 
>>>>> Below are my random thoughts I've just had on the matter;
>>>>>
>>>>>
>>>>> Here at the PSU we store an awful lot of info that cannot be 
>>>>> stored in a standard EMBL file, hence we have extended it to fit 
>>>>> our own needs. As an example we use several name qualifiers for 
>>>>> genes;
>>>>>
>>>>>     . systematic_id           - the name cast in stone
>>>>>     . temporary_systematic_id - the name as it is currently known
>>>>>     . previous_systematic_id  - as it was known
>>>>>     . gene                    - EMBL standard qualifier
>>>>>
>>>>> Hence just trying to unflatten the EMBL file is tricky because 
>>>>> systematic_ids and temporary_systematic_ids are mixed in the same 
>>>>> sequence, so building the hierarchy would need specialised 
>>>>> code. TIGR XML has the same issue though, so maybe it's not too 
>>>>> specialised after all :/ (PUB_LOCUS and LOCUS have a direct mapping 
>>>>> to systematic_id and temporary_systematic_id).
>>>>>
>>>>> Something like this entry;
>>>>>     /curation="name; origin; date; permission; type; dbref; notes 
>>>>> ..."
>>>>> i.e.
>>>>>     /curation="Matt Berriman; genedb; 20020128; public; comment"
>>>>> is unique to the PSU and I'm not sure where it fits in GUS.
>>>>>
>>>>> However;
>>>>>
>>>>> I have code that creates GO entries - supply a high level function 
>>>>> with all the standard GO fields and it creates the 5 rows (?) in 
>>>>> the different tables as required. This is definitely something 
>>>>> that can be shared across centres, perhaps in a code library. All 
>>>>> your code has to do is parse out the GO fields from the data. No 
>>>>> reason why it couldn't accept a GO Bioperl object (I presume one 
>>>>> exists).
>>>>>
>>>>> Perhaps the parsing needs to be a superclass for each data source 
>>>>> which is then subclassed by each centre?
>>>>>
>>>>> Ok, enough ramblings. Does any of this make sense?
>>>>> Paul.
>>>>>
>>>>>> steve
>>>>>>
>>>>>> Chris Stoeckert wrote:
>>>>>>
>>>>>>> Hi Steve,
>>>>>>> Thanks for putting this out on gusdev. Marie-Adele indicated 
>>>>>>> that Paul Mooney was very interested in this and I will likely 
>>>>>>> meet with him about this when I visit in January. Please include 
>>>>>>> him in email correspondence when not addressed to the general 
>>>>>>> gusdev list.
>>>>>>> Thanks,
>>>>>>> Chris
>>>>>>>
>>>>>>> On Dec 9, 2004, at 2:11 PM, Steve Fischer wrote:
>>>>>>>
>>>>>>>> folks-
>>>>>>>>
>>>>>>>> the UGA folks and CBIL folks have started collaborating on a 
>>>>>>>> new plugin called LoadAnnotatedSeqs.   It will use BioPerl to 
>>>>>>>> parse the input data.
>>>>>>>>
>>>>>>>> We expect it to take annotated sequences (NA at first) in 
>>>>>>>> genbank, tigr xml and embl formats (plus any others supported 
>>>>>>>> by the bioPerl parser).
>>>>>>>>
>>>>>>>> It will take an XML file that describes the mapping from the 
>>>>>>>> input features to GUS features, and SO features.
>>>>>>>> It will also hard code special cases to handle qualifier data 
>>>>>>>> that is distributed to tables outside of the NAFeature tables.
>>>>>>>>
>>>>>>>> For our projects we will be developing a mapping that unifies 
>>>>>>>> the semantics of the data we are getting from our different 
>>>>>>>> sources and formats.
>>>>>>>> (we plan to work with the PSU folks to incorporate the 
>>>>>>>> knowledge they have acquired in their work to make an EMBL parser)
>>>>>>>>
>>>>>>>> ideas and suggestions are encouraged.
>>>>>>>>
>>>>>>>> steve
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Gusdev-gusdev mailing list
>>>>>>>> Gus...@li...
>>>>>>>> https://lists.sourceforge.net/lists/listinfo/gusdev-gusdev
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>