From: Paul M. <pj...@sa...> - 2004-12-10 14:18:21
On 10 Dec 2004, at 12:52, Steve Fischer wrote:

> paul-
>
> ok, i see.
>
> are there any other examples besides curation in which you have placed
> structured data in qualifiers? are there examples of standard embl
> qualifiers in which you expect to find structured data and parse it?
>

After talking with Arnaud it seems we can take each qualifier/structured
field and create a new feature, with each one of its qualifiers holding
one piece of data. This would fit into your mapping scheme.

> in the case of curation, where do you put that info in GUS?

It will probably end up as a note, for now at least.

>
> about systematic_ids, i understand what you've said. one thing
> though. how do they relate to gene names?

They are the gene names :) Standard EMBL uses a /gene qualifier for the
gene symbol and /standard_name for the human-readable name. During
sequencing and annotation, a single /gene qualifier conveys nothing
about how stable or temporary the ID is.

> steve
>
> Paul Mooney wrote:
>
>>
>> On 9 Dec 2004, at 23:21, Steve Fischer wrote:
>>
>>> paul-
>>>
>>> let me start digesting this by email.
>>>
>>> about your extensions to EMBL. the bioPerl model we are parsing
>>> into is based on generic features, tags and annotation. as long as
>>> the extensions can be parsed into those objects we're half way
>>> there. are the extensions syntactically consistent w/ standard
>>> embl files, but varying only in the particulars of what the data is
>>> called?
>>
>> We have additional qualifiers with values. The values hold structured
>> information (say key=value pairs).
>> Bioperl will quite happily parse them into tags and values.
>> What controls the mapping of a tag to a GUS object(s)?
>> What parses the structured information out to populate the object(s)
>> and fill in the object's fields (which is another mapping)?
>>
>> Something like this non-EMBL standard entry, curation, has several
>> values in a fixed field format;
>>
>> /curation="name; origin; date; permission; type; dbref; notes ..."
>> e.g.
>> /curation="Matt Berriman; genedb; 20020128; public; comment"
>>
>> How do we specify where to put this in GUS? It's very PSU-specific.
>> Perhaps some sort of hook that specifies some Perl code elsewhere to
>> handle it?
>> We currently store GO annotation in EMBL like this;
>>
>> /GO="aspect=process; GOid=GO:0006810; term=transport;
>> evidence=ISS; db_xref=GOC:unpublished; with=SPTR:Q9UQ36;
>> date=20001122"
>>
>> as EMBL only has the format /db_xref="GO:00123" but I hope there is a
>> GO flat file loader so we don't have to worry about this in the
>> future.
>>
>>> about building the hierarchy. if you looked at the bioperl api for
>>> the unflattener, you'd see that its unflattening uses gene name as a
>>> clue to deciding what features go together in a particular gene
>>> model.
>>>
>>> can gene name be relied upon to identify all the features that are
>>> associated with this gene?
>>
>> You can switch to use any qualifier you like to identify groups, but
>> you can only specify *one*.
>> We can have 2 :)
>> In the same sequence a gene may be identified by systematic_id.
>> Another gene in the same sequence may be identified by
>> temporary_systematic_id.
>> Eventually all genes will get a systematic_id but not straight away.
>>
>> In theory it should be easy to modify the unflattener to use a 'best
>> name first' policy.
>>
>> For TIGR XML you'd have PUB_LOCUS and LOCUS as the best names, in
>> that order. They too mix identifiers, but since the XML already has a
>> hierarchy you might get away with it?
>>
>>> finally, about the GO stuff, yes, we can probably reuse your code.
>>>
>>> steve
>>>
>>> Paul Mooney wrote:
>>>
>>>>
>>>> On 9 Dec 2004, at 19:31, Steve Fischer wrote:
>>>>
>>>>> paul-
>>>>>
>>>>> hey.
>>>>> do you want to set up a time to chat so i can catch you up
>>>>> on what we have in mind?
>>>>
>>>> At the moment I'm curious how much can be achieved via a generic
>>>> plugin. I think the plugin will need plugins to do specialised
>>>> parts :) However I'd be glad to give my assistance to the effort.
>>>> Below are some random thoughts I've just had on the matter;
>>>>
>>>> Here at the PSU we store an awful lot of info that cannot be
>>>> stored in a standard EMBL file, hence we have extended it to fit
>>>> our own needs. As an example we use several name qualifiers for
>>>> genes;
>>>>
>>>> . systematic_id - the name cast in stone
>>>> . temporary_systematic_id - the name as it is currently known
>>>> . previous_systematic_id - as it was known
>>>> . gene - EMBL standard qualifier
>>>>
>>>> Hence just trying to unflatten the EMBL file is tricky because
>>>> systematic_ids and temporary_systematic_ids are mixed in the same
>>>> sequence, hence building the hierarchy would need specialised code.
>>>> TIGR XML has the same issue though so maybe it's not too specialised
>>>> after all :/ (PUB_LOCUS and LOCUS have a direct mapping to
>>>> systematic_id and temporary_systematic_id).
>>>>
>>>> Something like this entry;
>>>> /curation="name; origin; date; permission; type; dbref; notes ..."
>>>> e.g.
>>>> /curation="Matt Berriman; genedb; 20020128; public; comment"
>>>> is unique to the PSU and I'm not sure where it fits in GUS.
>>>>
>>>> However;
>>>>
>>>> I have code that creates GO entries - supply a high-level function
>>>> with all the standard GO fields and it creates the 5 rows (?) in
>>>> the different tables as required. This is definitely something that
>>>> can be shared across centres, perhaps in a code library. All your
>>>> code has to do is parse out the GO fields from the data. No reason
>>>> why it couldn't accept a GO Bioperl object (I presume one exists).
>>>>
>>>> Perhaps the parsing needs to be a superclass for each data source,
>>>> which is then subclassed by each centre?
>>>>
>>>> Ok, enough ramblings. Does any of this make sense?
>>>> Paul.
>>>>
>>>>> steve
>>>>>
>>>>> Chris Stoeckert wrote:
>>>>>
>>>>>> Hi Steve,
>>>>>> Thanks for putting this out on gusdev. Marie-Adele indicated that
>>>>>> Paul Mooney was very interested in this and I will likely meet
>>>>>> with him about this when I visit in January. Please include him
>>>>>> in email correspondence when not addressed to the general gusdev
>>>>>> list.
>>>>>> Thanks,
>>>>>> Chris
>>>>>>
>>>>>> On Dec 9, 2004, at 2:11 PM, Steve Fischer wrote:
>>>>>>
>>>>>>> folks-
>>>>>>>
>>>>>>> the UGA folks and CBIL folks have started collaborating on a new
>>>>>>> plugin called LoadAnnotatedSeqs. It will use BioPerl to parse
>>>>>>> the input data.
>>>>>>>
>>>>>>> We expect it to take annotated sequences (NA at first) in
>>>>>>> genbank, tigr xml and embl formats (plus any others supported by
>>>>>>> the bioPerl parser).
>>>>>>>
>>>>>>> It will take an XML file that describes the mapping from the
>>>>>>> input features to GUS features, and SO features.
>>>>>>> It will also hard code special cases to handle qualifier data
>>>>>>> that is distributed to tables outside of the NAFeature tables.
>>>>>>>
>>>>>>> For our projects we will be developing a mapping that unifies
>>>>>>> the semantics of the data we are getting from our different
>>>>>>> sources and formats.
>>>>>>> (we plan to work with the PSU folks to incorporate the knowledge
>>>>>>> they have acquired in their work to make an EMBL parser)
>>>>>>>
>>>>>>> ideas and suggestions are encouraged.
>>>>>>>
>>>>>>> steve
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Gusdev-gusdev mailing list
>>>>>>> Gus...@li...
>>>>>>> https://lists.sourceforge.net/lists/listinfo/gusdev-gusdev
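
A minimal Python sketch of two ideas discussed in this thread: splitting the structured /curation and /GO qualifier values into named fields, and picking a grouping name with a 'best name first' policy. The function names, the dict-of-lists representation of qualifiers, and the priority order are illustrative assumptions, not part of BioPerl, GUS, or the PSU code. Only the field layouts are taken from the examples quoted in the messages.

```python
# Hypothetical helpers; the names are invented for illustration only.

# Field order copied from the /curation example in the thread.
CURATION_FIELDS = ("name", "origin", "date", "permission", "type", "dbref", "notes")

# Assumed priority: the stable name first, then the temporary one.
NAME_PRIORITY = ("systematic_id", "temporary_systematic_id")


def parse_curation(value):
    """Split a fixed-field /curation value into a dict.

    Trailing fields that are absent are simply left out. The split is
    capped so that any ';' inside the final notes field is preserved.
    """
    parts = [p.strip() for p in value.split(";", len(CURATION_FIELDS) - 1)]
    return dict(zip(CURATION_FIELDS, parts))


def parse_key_value(value):
    """Split a 'key=value; key=value' qualifier value (such as /GO) into a dict."""
    result = {}
    for item in value.split(";"):
        item = item.strip()  # also removes line wraps inside the value
        if not item:
            continue
        key, sep, val = item.partition("=")  # split on the first '=' only
        if not sep:
            raise ValueError("expected key=value, got %r" % item)
        result[key.strip()] = val.strip()
    return result


def best_name(qualifiers, priority=NAME_PRIORITY):
    """Return the first value of the highest-priority name tag present.

    `qualifiers` is assumed to map a tag name to a list of values.
    Returns None when none of the priority tags is present.
    """
    for tag in priority:
        values = qualifiers.get(tag)
        if values:
            return values[0]
    return None
```

For instance, `parse_curation("Matt Berriman; genedb; 20020128; public; comment")` yields a dict with the keys name, origin, date, permission and type, and `best_name` falls back to `temporary_systematic_id` for a feature that has no `systematic_id` yet. For TIGR XML the same function could be called with `priority=("PUB_LOCUS", "LOCUS")`, again as an assumption about how the tags would be exposed.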