Re: [Gusdev-gusdev] LoadAnnotatedSeqs plugin underway

On 10 Dec 2004, at 12:52, Steve Fischer wrote:

> paul-
>
> ok, i see.
>
> are there any other examples besides curation in which you have placed 
> structured data in qualifiers?     are there examples of standard embl 
> qualifiers in which you expect to find structured data and parse it?
>

After talking with Arnaud it seems we can take each 
qualifier/structured field and create a new feature, with each one of 
its qualifiers holding one piece of data. This would fit into your 
mapping scheme.
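
A minimal sketch of that idea, in Python rather than the plugin's Perl, purely for illustration. The function name, the dict layout and the curation field list are assumptions made for this example, not GUS or BioPerl API. It handles both styles seen in this thread: key=value pairs (as in /GO) and fixed-position fields (as in /curation).

```python
# Hypothetical field order for /curation, taken from the format line quoted in this thread.
CURATION_FIELDS = ("name", "origin", "date", "permission", "type", "dbref", "notes")


def split_structured_qualifier(tag, value, field_names=None):
    """Turn one structured qualifier into a feature-like dict whose
    qualifiers each hold a single piece of data.

    If field_names is None the value is read as 'key=value; key=value'.
    Otherwise it is read as fixed-position fields named by field_names.
    """
    parts = [p.strip() for p in value.split(";")]
    qualifiers = {}
    if field_names is None:
        for part in parts:
            if not part:
                continue  # tolerate a trailing ';'
            key, sep, val = part.partition("=")
            if not sep:
                raise ValueError(f"expected key=value, got {part!r}")
            qualifiers[key.strip()] = val.strip()
    else:
        # Trailing fields that are absent are simply left out.
        for name, part in zip(field_names, parts):
            if part:
                qualifiers[name] = part
    return {"primary_tag": tag, "qualifiers": qualifiers}


go_feature = split_structured_qualifier(
    "GO",
    "aspect=process; GOid=GO:0006810; term=transport; evidence=ISS; "
    "db_xref=GOC:unpublished; with=SPTR:Q9UQ36; date=20001122",
)
curation_feature = split_structured_qualifier(
    "curation",
    "Matt Berriman; genedb; 20020128; public; comment",
    CURATION_FIELDS,
)
```

Each resulting qualifier could then be mapped like any ordinary tag, which is the point of turning the structured value into a feature of its own.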

> in the case of curation, where do you put that info in GUS?

It will probably end up as a note, for now at least.

>
> about systematic_ids, i understand what you've said.   one thing 
> though.  how do they relate to gene names?

They are the gene names :)
Standard EMBL uses a /gene qualifier for the gene symbol and 
/standard_name for the human-readable name.
During sequencing and annotation, a single /gene qualifier says nothing 
about how stable or temporary the ID is.
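
A rough sketch of how a 'best name first' choice between these qualifiers might look, again in Python for illustration only. This is not how the BioPerl unflattener works; the priority order, the function names and the feature layout (a dict of tag to list of values) are assumptions for the example.

```python
# Assumed priority: the stable name wins, then the temporary one, then plain /gene.
NAME_PRIORITY = ("systematic_id", "temporary_systematic_id", "gene")


def best_name(qualifiers, priority=NAME_PRIORITY):
    """Return (tag, value) for the highest-priority name present, or None."""
    for tag in priority:
        values = qualifiers.get(tag)
        if values:
            return tag, values[0]
    return None


def group_by_best_name(features):
    """Group features by their best available name.

    Features carrying none of the name qualifiers end up under the key None.
    """
    groups = {}
    for feature in features:
        picked = best_name(feature["qualifiers"])
        key = picked[1] if picked else None
        groups.setdefault(key, []).append(feature)
    return groups


# Made-up identifiers, only to show mixed naming within one sequence.
features = [
    {"primary_tag": "CDS", "qualifiers": {"systematic_id": ["GENE0001"]}},
    {"primary_tag": "mRNA", "qualifiers": {"systematic_id": ["GENE0001"], "gene": ["abc1"]}},
    {"primary_tag": "CDS", "qualifiers": {"temporary_systematic_id": ["tmp.0042"]}},
]
groups = group_by_best_name(features)
```

With this layout a sequence can mix genes named by systematic_id and genes named by temporary_systematic_id and still group each gene's features together.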

> steve
>
> Paul Mooney wrote:
>
>>
>> On 9 Dec 2004, at 23:21, Steve Fischer wrote:
>>
>>> paul-
>>>
>>> let me start digesting this by email.
>>>
>>> about your extensions to EMBL.  the bioPerl model we are parsing 
>>> into is based on generic features, tags and annotation.  as long as 
>>> the extensions can be parsed into those objects we're half way 
>>> there.   are the extensions syntactically consistent w/ standard 
>>> embl files, but varying only in the particulars of what the data is 
>>> called?
>>
>>
>> We have additional qualifiers with values. The values hold structured 
>> information (say key=value pairs).
>> Bioperl will quite happily parse them into tags and values.
>> What controls the mapping of a tag to GUS object(s)?
>> What parses the structured information out to populate the object(s) 
>> and fill in the objects' fields (which is another mapping)?
>>
>> Something like this non-EMBL standard entry, curation, has several 
>> values in a fixed field format;
>>
>>     /curation="name; origin; date; permission; type; dbref; notes ..."
>> i.e.
>>     /curation="Matt Berriman; genedb; 20020128; public; comment"
>>
>> How do we specify where to put this in GUS? It's very PSU specific. 
>> Perhaps some sort of hook that specifies some Perl code elsewhere to 
>> handle it?
>> We currently store GO annotation in EMBL like this;
>>
>>     /GO="aspect=process; GOid=GO:0006810; term=transport; 
>> evidence=ISS; db_xref=GOC:unpublished; with=SPTR:Q9UQ36; 
>> date=20001122"
>>
>> as EMBL only has the format /db_xref="GO:00123" but I hope there is a 
>> GO flat file loader so we don't have to worry about this in the 
>> future.
>>
>>> about building the hierarchy.  if you looked at the bioperl api for 
>>> the unflattener, you'd see that its unflattening uses gene name as a 
>>> clue to deciding what features go together in a particular gene 
>>> model.
>>>
>>> can gene name be relied upon to identify all the features that are 
>>> associated with this gene?
>>
>>
>> You can switch to use any qualifier you like to identify groups, but 
>> you can only specify *one*.
>> We can have 2 :)
>> In the same sequence a gene may be identified by systematic_id.
>> Another gene in the same sequence may be identified by 
>> temporary_systematic_id.
>> Eventually all genes will get a systematic_id but not straight away.
>>
>> In theory it should be easy to modify the unflattener to use a 'best 
>> name first' policy.
>>
>> For TIGR XML you'd have PUB_LOCUS and LOCUS as the best names, in 
>> that order. They too mix identifiers, but since the XML already has a 
>> hierarchy you might get away with it?
>>
>>
>>> finally, about the GO stuff, yes, we can probably reuse your code.
>>>
>>> steve
>>>
>>>
>>> Paul Mooney wrote:
>>>
>>>>
>>>> On 9 Dec 2004, at 19:31, Steve Fischer wrote:
>>>>
>>>>> paul-
>>>>>
>>>>> hey.  do you want to set up a time to chat so i can catch you up 
>>>>> on what we have in mind?
>>>>
>>>>
>>>>
>>>>
>>>> At the moment I'm curious how much can be achieved via a generic 
>>>> plugin. I think the plugin will need plugins to do specialised 
>>>> parts :) However I'd be glad to give my assistance to the effort. 
>>>> Below are some random thoughts I've just had on the matter;
>>>>
>>>>
>>>> Here at the PSU we store an awful lot of info that cannot be 
>>>> stored in a standard EMBL file, hence we have extended it to fit 
>>>> our own needs. As an example we use several name qualifiers for 
>>>> genes;
>>>>
>>>>     . systematic_id           - the name cast in stone
>>>>     . temporary_systematic_id - the name as it is currently known
>>>>     . previous_systematic_id  - as it was known
>>>>     . gene                    - EMBL standard qualifier
>>>>
>>>> Hence just trying to unflatten the EMBL file is tricky because 
>>>> systematic_ids and temporary_systematic_ids are mixed in the same 
>>>> sequence, so building the hierarchy would need specialised code. 
>>>> TIGR XML has the same issue though, so maybe it's not too specialised 
>>>> after all :/ (PUB_LOCUS and LOCUS have a direct mapping to 
>>>> systematic_id and temporary_systematic_id).
>>>>
>>>> Something like this entry;
>>>>     /curation="name; origin; date; permission; type; dbref; notes ..."
>>>> i.e.
>>>>     /curation="Matt Berriman; genedb; 20020128; public; comment"
>>>> is unique to the PSU and I'm not sure where it fits in GUS.
>>>>
>>>> However;
>>>>
>>>> I have code that creates GO entries - supply a high level function 
>>>> with all the standard GO fields and it creates the 5 rows (?) in 
>>>> the different tables as required. This is definitely something that 
>>>> can be shared across centres, perhaps in a code library. All your 
>>>> code has to do is parse out the GO fields from the data. No reason 
>>>> why it couldn't accept a GO Bioperl object (I presume one exists).
>>>>
>>>> Perhaps the parsing needs to be a superclass for each data source, 
>>>> then subclassed by each centre?
>>>>
>>>> Ok, enough ramblings. Does any of this make sense?
>>>> Paul.
>>>>
>>>>> steve
>>>>>
>>>>> Chris Stoeckert wrote:
>>>>>
>>>>>> Hi Steve,
>>>>>> Thanks for putting this out on gusdev. Marie-Adele indicated that 
>>>>>> Paul Mooney was very interested in this and I will likely meet 
>>>>>> with him about this when I visit in January. Please include him 
>>>>>> in email correspondence when not addressed to the general gusdev 
>>>>>> list.
>>>>>> Thanks,
>>>>>> Chris
>>>>>>
>>>>>> On Dec 9, 2004, at 2:11 PM, Steve Fischer wrote:
>>>>>>
>>>>>>> folks-
>>>>>>>
>>>>>>> the UGA folks and CBIL folks have started collaborating on a new 
>>>>>>> plugin called LoadAnnotatedSeqs.   It will use BioPerl to parse 
>>>>>>> the input data.
>>>>>>>
>>>>>>> We expect it to take annotated sequences (NA at first) in 
>>>>>>> genbank, tigr xml and embl formats (plus any others supported by 
>>>>>>> the bioPerl parser).
>>>>>>>
>>>>>>> It will take an XML file that describes the mapping from the 
>>>>>>> input features to GUS features, and SO features.
>>>>>>> It will also hard code special cases to handle qualifier data 
>>>>>>> that is distributed to tables outside of the NAFeature tables.
>>>>>>>
>>>>>>> For our projects we will be developing a mapping that unifies 
>>>>>>> the semantics of the data we are getting from our different 
>>>>>>> sources and formats.
>>>>>>> (we plan to work with the PSU folks to incorporate the knowledge 
>>>>>>> they have acquired in their work to make an EMBL parser)
>>>>>>>
>>>>>>> ideas and suggestions are encouraged.
>>>>>>>
>>>>>>> steve
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>