From: Steve F. <sfi...@pc...> - 2004-12-10 16:09:24
paul-

see in line

steve

Paul Mooney wrote:

>
> On 10 Dec 2004, at 12:52, Steve Fischer wrote:
>
>> paul-
>>
>> ok, i see.
>>
>> are there any other examples besides curation in which you have
>> placed structured data in qualifiers? are there examples of
>> standard embl qualifiers in which you expect to find structured data
>> and parse it?
>>
>
> After talking with Arnaud it seems we can take each
> qualifier/structured field and create a new feature, with each one of
> its qualifiers holding one piece of data. This would fit into your
> mapping scheme.
>

ok. great. i was wondering about that. so does that mean that we can
expect that no qualifiers will contain structured data that needs to be
parsed?

>> in the case of curation, where do you put that info in GUS?
>
> It will probably end up as a note, for now at least.
>
>>
>> about systematic_ids, i understand what you've said. one thing
>> though. how do they relate to gene names?
>

ok, but, what i'm driving at is that the unflattener uses gene name
(/gene=) to decide what features go together in one gene model. really,
it wouldn't matter what the value of the /gene= is, as long as it is
identical for all features that belong to the gene. is that consistent
with your use of /gene?

> They are the gene names :)
> Standard EMBL uses a /gene qualifier for the gene symbol and
> /standard_name for the human readable name.
> During sequencing and annotation using a single /gene conveys no
> meaning as to how stable/temporary the ID is.
>
>> steve
>>
>> Paul Mooney wrote:
>>
>>>
>>> On 9 Dec 2004, at 23:21, Steve Fischer wrote:
>>>
>>>> paul-
>>>>
>>>> let me start digesting this by email.
>>>>
>>>> about your extensions to EMBL. the bioPerl model we are parsing
>>>> into is based on generic features, tags and annotation. as long as
>>>> the extensions can be parsed into those objects we're half way
>>>> there.
>>>> are the extensions syntactically consistent w/ standard
>>>> embl files, but varying only in the particulars of what the data is
>>>> called?
>>>
>>> We have additional qualifiers with values. The values hold
>>> structured information (say key=value pairs).
>>> Bioperl will quite happily parse them into tags and values.
>>> What controls the mapping of a tag to a GUS object(s)?
>>> What parses the structured information out to populate the object(s)
>>> and fill in the object's fields (which is another mapping)?
>>>
>>> Something like this non-EMBL standard entry, curation, has several
>>> values in a fixed field format;
>>>
>>> /curation="name; origin; date; permission; type; dbref; notes ..."
>>> i.e.
>>> /curation="Matt Berriman; genedb; 20020128; public; comment"
>>>
>>> How do we specify where to put this in GUS? It's very PSU specific.
>>> Perhaps some sort of hook with specifying some perl code elsewhere
>>> to handle it?
>>> We currently store GO annotation in EMBL like this;
>>>
>>> /GO="aspect=process; GOid=GO:0006810; term=transport;
>>> evidence=ISS; db_xref=GOC:unpublished; with=SPTR:Q9UQ36; date=20001122"
>>>
>>> as EMBL only has the format /db_xref="GO:00123" but I hope there is
>>> a GO flat file loader so we don't have to worry about this in the
>>> future.
>>>
>>>> about building the hierarchy. if you looked at the bioperl api for
>>>> the unflattener, you'd see that its unflattening uses gene name as
>>>> a clue to deciding what features go together in a particular gene
>>>> model.
>>>>
>>>> can gene name be relied upon to identify all the features that are
>>>> associated with this gene?
>>>
>>> You can switch to use any qualifier you like to identify groups, but
>>> you can only specify *one*.
>>> We can have 2 :)
>>> In the same sequence a gene may be identified by systematic_id.
>>> Another gene in the same sequence may be identified by
>>> temporary_systematic_id.
>>> Eventually all genes will get a systematic_id but not straight away.
>>>
>>> In theory it should be easy to modify the flattener to use a 'best
>>> name first' policy.
>>>
>>> For TIGR XML you'd have PUB_LOCUS and LOCUS as the best names, in
>>> that order. They too mix identifiers but since the XML already has
>>> a hierarchy you might get away with it????
>>>
>>>> finally, about the GO stuff, yes, we can probably reuse your code.
>>>>
>>>> steve
>>>>
>>>> Paul Mooney wrote:
>>>>
>>>>>
>>>>> On 9 Dec 2004, at 19:31, Steve Fischer wrote:
>>>>>
>>>>>> paul-
>>>>>>
>>>>>> hey. do you want to set up a time to chat so i can catch you up
>>>>>> on what we have in mind?
>>>>>
>>>>> At the moment I'm curious how much can be achieved via a generic
>>>>> plugin. I think the plugin will need plugins to do specialised
>>>>> parts :) However I'd be glad to give my assistance to the effort.
>>>>> Below are my random thoughts I've just had on the matter;
>>>>>
>>>>> Here at the PSU we store an awful lot of info that can not be
>>>>> stored in a standard EMBL file, hence we have extended it to fit
>>>>> our own needs. As an example we use several name qualifiers for
>>>>> genes;
>>>>>
>>>>> . systematic_id - the name cast in stone
>>>>> . temporary_systematic_id - the name as it is currently known
>>>>> . previous_systematic_id - as it was known
>>>>> . gene - EMBL standard qualifier
>>>>>
>>>>> Hence just trying to unflatten the EMBL file is tricky because
>>>>> systematic and temporary_systematic_ids are mixed in the same
>>>>> sequence, hence building the hierarchy would need specialised
>>>>> code. TIGR XML has the same issue though so maybe it's not too
>>>>> specialised after all :/ (PUB_LOCUS and LOCUS have a direct mapping
>>>>> to systematic_id and temporary_systematic_id).
>>>>>
>>>>> Something like this entry;
>>>>> /curation="name; origin; date; permission; type; dbref; notes ..."
>>>>> i.e.
>>>>> /curation="Matt Berriman; genedb; 20020128; public; comment"
>>>>> is unique to the PSU and I'm not sure where it fits in GUS.
>>>>>
>>>>> However;
>>>>>
>>>>> I have code that creates GO entries - supply a high level function
>>>>> with all the standard GO fields and it creates the 5 rows (?) in
>>>>> the different tables as required. This is definitely something
>>>>> that can be shared across centres, perhaps in a code library. All
>>>>> your code has to do is parse out the GO fields from the data. No
>>>>> reason why it couldn't accept a GO Bioperl object (I presume one
>>>>> exists).
>>>>>
>>>>> Perhaps the parsing needs to be a super class for each data source
>>>>> and then sub-classed by each centre?
>>>>>
>>>>> Ok, enough ramblings. Does any of this make sense?
>>>>> Paul.
>>>>>
>>>>>> steve
>>>>>>
>>>>>> Chris Stoeckert wrote:
>>>>>>
>>>>>>> Hi Steve,
>>>>>>> Thanks for putting this out on gusdev. Marie-Adele indicated
>>>>>>> that Paul Mooney was very interested in this and I will likely
>>>>>>> meet with him about this when I visit in January. Please include
>>>>>>> him in email correspondence when not addressed to the general
>>>>>>> gusdev list.
>>>>>>> Thanks,
>>>>>>> Chris
>>>>>>>
>>>>>>> On Dec 9, 2004, at 2:11 PM, Steve Fischer wrote:
>>>>>>>
>>>>>>>> folks-
>>>>>>>>
>>>>>>>> the UGA folks and CBIL folks have started collaborating on a
>>>>>>>> new plugin called LoadAnnotatedSeqs. It will use BioPerl to
>>>>>>>> parse the input data.
>>>>>>>>
>>>>>>>> We expect it to take annotated sequences (NA at first) in
>>>>>>>> genbank, tigr xml and embl formats (plus any others supported
>>>>>>>> by the bioPerl parser).
>>>>>>>>
>>>>>>>> It will take an XML file that describes the mapping from the
>>>>>>>> input features to GUS features, and SO features.
>>>>>>>> It will also hard code special cases to handle qualifier data
>>>>>>>> that is distributed to tables outside of the NAFeature tables.
>>>>>>>>
>>>>>>>> For our projects we will be developing a mapping that unifies
>>>>>>>> the semantics of the data we are getting from our different
>>>>>>>> sources and formats.
>>>>>>>> (we plan to work with the PSU folks to incorporate the
>>>>>>>> knowledge they have acquired in their work to make an EMBL parser)
>>>>>>>>
>>>>>>>> ideas and suggestions are encouraged.
>>>>>>>>
>>>>>>>> steve
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Gusdev-gusdev mailing list
>>>>>>>> Gus...@li...
>>>>>>>> https://lists.sourceforge.net/lists/listinfo/gusdev-gusdev
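
The two structured qualifier styles quoted in the thread (the fixed-field /curation value and the key=value /GO value), together with the 'best name first' grouping idea, can be sketched roughly as below. This is only an illustration in Python, not the BioPerl or GUS plugin code the thread is about. The function names, the NAME_PRIORITY order, and the representation of qualifiers as a dict of tag to list of values are all assumptions made for the example; the field names come from the /curation template quoted above.

```python
# Illustrative sketch only: this is NOT the LoadAnnotatedSeqs plugin or any
# BioPerl API. Function names and NAME_PRIORITY are hypothetical.

# Field order taken from the /curation template quoted in the thread.
CURATION_FIELDS = ["name", "origin", "date", "permission", "type", "dbref", "notes"]

# Assumed priority: stable name first, then temporary name, then /gene.
NAME_PRIORITY = ["systematic_id", "temporary_systematic_id", "gene"]


def parse_curation(value):
    """Split a fixed-field /curation value into a dict.

    Trailing fields that are absent from the value are simply omitted.
    """
    parts = [part.strip() for part in value.split(";")]
    return dict(zip(CURATION_FIELDS, parts))


def parse_go(value):
    """Split a /GO value made of 'key=value; key=value' pairs into a dict."""
    fields = {}
    for pair in value.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first '=' only, so values such as 'GO:0006810' survive.
        key, _, val = pair.partition("=")
        fields[key.strip()] = val.strip()
    return fields


def best_name(qualifiers):
    """Return the first identifier found in priority order, or None.

    `qualifiers` is assumed to map a qualifier tag to a list of values.
    """
    for tag in NAME_PRIORITY:
        values = qualifiers.get(tag)
        if values:
            return values[0]
    return None


if __name__ == "__main__":
    print(parse_curation("Matt Berriman; genedb; 20020128; public; comment"))
    print(parse_go("aspect=process; GOid=GO:0006810; term=transport; evidence=ISS"))
    print(best_name({"temporary_systematic_id": ["tmp0001"], "gene": ["abc1"]}))
```

Grouping features by the value of best_name rather than by a single fixed qualifier is one way to read the 'best name first' suggestion; whether the real unflattener could be adapted this way is not settled in the thread.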