From: Steve F. <sfi...@pc...> - 2004-12-09 19:11:47
|
folks-

the UGA folks and CBIL folks have started collaborating on a new plugin called LoadAnnotatedSeqs. It will use BioPerl to parse the input data.

We expect it to take annotated sequences (NA at first) in GenBank, TIGR XML and EMBL formats (plus any others supported by the BioPerl parser).

It will take an XML file that describes the mapping from the input features to GUS features, and SO features.
It will also hard code special cases to handle qualifier data that is distributed to tables outside of the NAFeature tables.

For our projects we will be developing a mapping that unifies the semantics of the data we are getting from our different sources and formats.
(we plan to work with the PSU folks to incorporate the knowledge they have acquired in their work to make an EMBL parser)

ideas and suggestions are encouraged.

steve
From: Steve F. <sfi...@pc...> - 2004-12-09 23:21:12
|
paul-

let me start digesting this by email.

about your extensions to EMBL. the BioPerl model we are parsing into is based on generic features, tags and annotation. as long as the extensions can be parsed into those objects we're halfway there. are the extensions syntactically consistent w/ standard EMBL files, varying only in the particulars of what the data is called?

about building the hierarchy. if you look at the BioPerl API for the unflattener, you'll see that it uses gene name as a clue to deciding what features go together in a particular gene model.

can gene name be relied upon to identify all the features that are associated with a given gene?

finally, about the GO stuff, yes, we can probably reuse your code.

steve

Paul Mooney wrote:

> On 9 Dec 2004, at 19:31, Steve Fischer wrote:
>
>> paul-
>>
>> hey. do you want to set up a time to chat so i can catch you up on
>> what we have in mind?
>
> At the moment I'm curious how much can be achieved via a generic
> plugin. I think the plugin will need plugins to do specialised parts
> :) However I'd be glad to give my assistance to the effort. Below are
> my random thoughts I've just had on the matter;
>
> Here at the PSU we store an awful lot of info that cannot be stored
> in a standard EMBL file, hence we have extended it to fit our own
> needs. As an example we use several name qualifiers for genes;
>
> . systematic_id - the name cast in stone
> . temporary_systematic_id - the name as it is currently known
> . previous_systematic_id - as it was known
> . gene - EMBL standard qualifier
>
> Hence just trying to unflatten the EMBL file is tricky because
> systematic_ids and temporary_systematic_ids are mixed in the same
> sequence, hence building the hierarchy would need specialised code.
> TIGR XML has the same issue though so maybe it's not too specialised
> after all :/ (PUB_LOCUS and LOCUS have a direct mapping to
> systematic_id and temporary_systematic_id).
>
> Something like this entry;
> /curation="name; origin; date; permission; type; dbref; notes ..."
> e.g.
> /curation="Matt Berriman; genedb; 20020128; public; comment"
> is unique to the PSU and I'm not sure where it fits in GUS.
>
> However;
>
> I have code that creates GO entries - supply a high level function
> with all the standard GO fields and it creates the 5 rows (?) in the
> different tables as required. This is definitely something that can be
> shared across centres, perhaps in a code library. All your code has to
> do is parse out the GO fields from the data. No reason why it couldn't
> accept a GO BioPerl object (I presume one exists).
>
> Perhaps the parsing needs to be a super class for each data source and
> then sub-classed by each centre?
>
> Ok, enough ramblings. Does any of this make sense?
> Paul.
>
>> steve
>>
>> Chris Stoeckert wrote:
>>
>>> Hi Steve,
>>> Thanks for putting this out on gusdev. Marie-Adele indicated that
>>> Paul Mooney was very interested in this and I will likely meet with
>>> him about this when I visit in January. Please include him in email
>>> correspondence when not addressed to the general gusdev list.
>>> Thanks,
>>> Chris
>>>
>>>> _______________________________________________
>>>> Gusdev-gusdev mailing list
>>>> Gus...@li...
>>>> https://lists.sourceforge.net/lists/listinfo/gusdev-gusdev
From: Paul M. <pj...@sa...> - 2004-12-10 11:04:18
|
On 9 Dec 2004, at 23:21, Steve Fischer wrote:

> paul-
>
> let me start digesting this by email.
>
> about your extensions to EMBL. the BioPerl model we are parsing into
> is based on generic features, tags and annotation. as long as the
> extensions can be parsed into those objects we're halfway there.
> are the extensions syntactically consistent w/ standard EMBL files,
> varying only in the particulars of what the data is called?

We have additional qualifiers with values. The values hold structured information (say key=value pairs).
BioPerl will quite happily parse them into tags and values.
What controls the mapping of a tag to GUS object(s)?
What parses the structured information out to populate the object(s) and fill in the objects' fields (which is another mapping)?

Something like this non-EMBL-standard entry, curation, has several values in a fixed field format;

/curation="name; origin; date; permission; type; dbref; notes ..."
e.g.
/curation="Matt Berriman; genedb; 20020128; public; comment"

How do we specify where to put this in GUS? It's very PSU specific. Perhaps some sort of hook that points to some Perl code elsewhere to handle it?
We currently store GO annotation in EMBL like this;

/GO="aspect=process; GOid=GO:0006810; term=transport; evidence=ISS; db_xref=GOC:unpublished; with=SPTR:Q9UQ36; date=20001122"

as EMBL only has the format /db_xref="GO:00123", but I hope there is a GO flat file loader so we don't have to worry about this in the future.

> about building the hierarchy. if you look at the BioPerl API for
> the unflattener, you'll see that it uses gene name as a clue to
> deciding what features go together in a particular gene model.
>
> can gene name be relied upon to identify all the features that are
> associated with a given gene?

You can switch to use any qualifier you like to identify groups, but you can only specify *one*.
We can have 2 :)
In the same sequence a gene may be identified by systematic_id.
Another gene in the same sequence may be identified by temporary_systematic_id.
Eventually all genes will get a systematic_id but not straight away.

In theory it should be easy to modify the unflattener to use a 'best name first' policy.

For TIGR XML you'd have PUB_LOCUS and LOCUS as the best names, in that order. They too mix identifiers, but since the XML already has a hierarchy you might get away with it?

> finally, about the GO stuff, yes, we can probably reuse your code.
>
> steve
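The two structured qualifier layouts described in this message (the fixed-field /curation value and the key=value /GO value) could be pulled apart along the following lines. This is only a rough sketch in Python rather than BioPerl, and it is not the PSU code: the function names and the dict layout are hypothetical, and letting the last field ("notes") absorb any further semicolons is an assumption, not something stated in the thread.

```python
# Field order for /curation is taken from the template quoted in the message.
# Function names and the returned dict layout are hypothetical.
CURATION_FIELDS = ["name", "origin", "date", "permission", "type", "dbref", "notes"]


def parse_fixed_fields(value, field_names):
    """Pair each ';'-separated part of a value with its positional field name.

    Assumption: the last field (here "notes") keeps any extra semicolons,
    so the split is capped at len(field_names) parts. Trailing fields that
    are missing from the value are simply left out of the result.
    """
    parts = [part.strip() for part in value.split(";", len(field_names) - 1)]
    return dict(zip(field_names, parts))


def parse_key_value(value):
    """Turn a ';'-separated list of key=value pairs into a dict."""
    result = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue
        # Split on the first '=' only, so values such as "GO:0006810"
        # or "SPTR:Q9UQ36" are kept whole.
        key, _, val = part.partition("=")
        result[key.strip()] = val.strip()
    return result


curation = parse_fixed_fields(
    "Matt Berriman; genedb; 20020128; public; comment", CURATION_FIELDS
)
go = parse_key_value(
    "aspect=process; GOid=GO:0006810; term=transport; evidence=ISS; "
    "db_xref=GOC:unpublished; with=SPTR:Q9UQ36; date=20001122"
)
```

A per-tag hook of the kind suggested above could then be as simple as a table from qualifier name to one of these parser functions, with the mapping to GUS objects handled separately.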
From: Steve F. <sfi...@pc...> - 2004-12-10 12:51:07
|
paul-

ok, i see.

are there any other examples besides curation in which you have placed structured data in qualifiers? are there examples of standard EMBL qualifiers in which you expect to find structured data and parse it?

in the case of curation, where do you put that info in GUS?

about systematic_ids, i understand what you've said. one thing though. how do they relate to gene names?

steve
From: Paul M. <pj...@sa...> - 2004-12-10 14:18:21
|
On 10 Dec 2004, at 12:52, Steve Fischer wrote:

> paul-
>
> ok, i see.
>
> are there any other examples besides curation in which you have placed
> structured data in qualifiers? are there examples of standard EMBL
> qualifiers in which you expect to find structured data and parse it?

After talking with Arnaud it seems we can take each qualifier/structured field and create a new feature, with each one of its qualifiers holding one piece of data. This would fit into your mapping scheme.

> in the case of curation, where do you put that info in GUS?

It will probably end up as a note, for now at least.

> about systematic_ids, i understand what you've said. one thing
> though. how do they relate to gene names?

They are the gene names :)
Standard EMBL uses a /gene qualifier for the gene symbol and /standard_name for the human readable name.
During sequencing and annotation, using a single /gene conveys no meaning as to how stable/temporary the ID is.

> steve
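The idea of turning each structured qualifier into its own feature, with one plain qualifier per piece of data, might look roughly like this. It is a Python sketch under stated assumptions, not BioPerl and not anyone's actual loader: the dict-based feature layout, the function names, the choice to name the new feature after the original tag and to copy the parent's location, and the example identifier "GENE_0001" are all hypothetical.

```python
def parse_key_value(value):
    """Turn a ';'-separated list of key=value pairs into a dict."""
    result = {}
    for part in value.split(";"):
        part = part.strip()
        if part:
            key, _, val = part.partition("=")
            result[key.strip()] = val.strip()
    return result


def expand_structured_qualifiers(feature, structured_tags):
    """Move structured qualifiers off a feature and onto new features.

    Returns (stripped_feature, new_features). Each new feature is named
    after the original tag (an assumption), reuses the parent's location,
    and carries one simple qualifier per parsed field.
    """
    remaining = {}
    new_features = []
    for tag, values in feature["qualifiers"].items():
        if tag not in structured_tags:
            remaining[tag] = list(values)
            continue
        for value in values:
            fields = parse_key_value(value)
            new_features.append({
                "type": tag,
                "location": feature["location"],
                "qualifiers": {key: [val] for key, val in fields.items()},
            })
    stripped = dict(feature, qualifiers=remaining)
    return stripped, new_features


# Hypothetical input feature; only the /GO value comes from the thread.
cds = {
    "type": "CDS",
    "location": (100, 900),
    "qualifiers": {
        "systematic_id": ["GENE_0001"],
        "GO": ["aspect=process; GOid=GO:0006810; term=transport; evidence=ISS"],
    },
}
stripped, extra = expand_structured_qualifiers(cds, {"GO"})
```

After this step every qualifier holds a single value, so a generic tag-to-column mapping no longer needs to know about the embedded format.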
From: Steve F. <sfi...@pc...> - 2004-12-10 16:09:24
|
paul- see in line steve Paul Mooney wrote: > > On 10 Dec 2004, at 12:52, Steve Fischer wrote: > >> paul- >> >> ok, i see. >> >> are there any other examples besides curation in which you have >> placed structured data in qualifiers? are there examples of >> standard embl qualifiers in which you expect to find structured data >> and parse it? >> > > After talking with Arnaud it seems we can take each > qualifier/structured field and create a new feature, with each one of > its qualifiers holding one piece of data. This would fit into your > mapping scheme. > ok. great. i was wondering about that. so does that mean that we can expect that no qualifiers will contain structured data that needs to be parsed? >> in the case of curation, where do you put that info in GUS? > > > It will probably end up as a note, for now at least. > >> >> about systematic_ids, i understand what you've said. one thing >> though. how do they relate to gene names? > > ok, but, what i'm driving at is that the unflattener uses gene name (/gene=) to decide what features go together in one gene model. really, it wouldn't matter what the value of the /gene= is, as long as it is identical for all features that belong to the gene. is that consistent with your use of /gene? > They are the gene names :) > Standard EMBL uses a /gene qualifier for the gene symbol and > /standard_name for the human readable name. > During sequencing and annotation using a single /gene conveys no > meaning as to how stable/temporary the ID is. > >> steve >> >> Paul Mooney wrote: >> >>> >>> On 9 Dec 2004, at 23:21, Steve Fischer wrote: >>> >>>> paul- >>>> >>>> let me start digesting this by email. >>>> >>>> about your extensions to EMBL. the bioPerl model we are parsing >>>> into is based on generic features, tags and annotation. as long as >>>> the extensions can be parsed into those objects we're half way >>>> there. 
are the extensions syntactically consistent w/ standard >>>> embl files, but varying only in the particulars of what the data is >>>> called? >>> >>> >>> >>> We have additional qualifiers with values. The values hold >>> structured information (say key=value pairs). >>> Bioperl will quite happily parse them into tags and values. >>> What controls the mapping of a tag to a GUS objects(s)? >>> What parses the structured information out to populate the object(s) >>> and fill in the objects fields (which is another mapping)? >>> >>> Something like this non-EMBL standard entry, curation, has several >>> values in a fixed field format; >>> >>> /curation="name; origin; date; permission; type; dbref; notes ..." >>> i.e. >>> /curation="Matt Berriman; genedb; 20020128; public; comment" >>> >>> How do we specify where to put this in GUS? It's very PSU specific. >>> Perhaps some sort of hook with specifying some perl code elsewhere >>> to handle it? >>> We currently store GO annotation in EMBL like this; >>> >>> /GO="aspect=process; GOid=GO:0006810; term=transport; >>> evidence=ISS; db_xref=GOC:unpublished; with=SPTR:Q9UQ36; date=20001122" >>> >>> as EMBL only has the format /db_xref="GO:00123" but I hope there is >>> a GO flat file loader so we don't have to worry about this in the >>> future. >>> >>>> about building the hierarchy. if you looked at the bioperl api for >>>> the unflattener, you'd see that its unflattening uses gene name as >>>> a clue to deciding what features go together in a particular gene >>>> model. >>>> >>>> can gene name be relied upon to identify all the features that are >>>> associated with this gene? >>> >>> >>> >>> You can switch to use any qualifier you like to identify groups, but >>> you can only specify *one*. >>> We can have 2 :) >>> In the same sequence a gene may be identified by systematic_id. >>> Another gene in the same sequence maybe identified by >>> temporary_systematic_id. 
>>> Eventually all genes will get a systematic_id but not straight away. >>> >>> In theory it should be easy to modify the flattener to use a 'best >>> name first' policy. >>> >>> For TIGR XML you'd have PUB_LOCUS and LUCUS as the best names, in >>> that order. Their too mix identifiers but since the XML already has >>> a hierarchy you might get away with it???? >>> >>> >>>> finally, about the GO stuff, yes, we can probably reuse your code. >>>> >>>> steve >>>> >>>> >>>> Paul Mooney wrote: >>>> >>>>> >>>>> On 9 Dec 2004, at 19:31, Steve Fischer wrote: >>>>> >>>>>> paul- >>>>>> >>>>>> hey. do you want to set up a time to chat so i can catch you up >>>>>> on what we have in mind? >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> At the moment I'm curious how much can be achieved via a generic >>>>> plugin. I think the plugin will need plugin's to do specialised >>>>> parts :) However I'd be glad to give my assistance to the effort. >>>>> Below are my random thoughts I've just had on the matter; >>>>> >>>>> >>>>> Here at the PSU we store an awful lot of info that can not be >>>>> stored in a standard EMBL file, hence we have extended it to fit >>>>> out own needs. As an example we use several name qualifiers for >>>>> genes; >>>>> >>>>> . systematic_id - the name cast in stone >>>>> . temporary_systematic_id - the name as it is currently known >>>>> . previous_systematic_id - as it was known >>>>> . gene - EMBL standard qualifier >>>>> >>>>> Hence just trying to unflatten the EMBL file is tricky because >>>>> systematic and temporary_sysetmatic_ids are mixed in the same >>>>> sequence, hence building the hierarchy would need specialised >>>>> code. TIGR XML has the same issue though so maybe its not too >>>>> specialised after all :/ (PUB_LOCUS and LOCUS has a direct mapping >>>>> to systematic_id and temporary_systematic_id). >>>>> >>>>> Something like this entry; >>>>> /curation="name; origin; date; permission; type; dbref; notes >>>>> ..." >>>>> i.e. 
>>>>> /curation="Matt Berriman; genedb; 20020128; public; comment"
>>>>> is unique to the PSU and I'm not sure where it fits in GUS.
>>>>>
>>>>> However:
>>>>>
>>>>> I have code that creates GO entries - supply a high level function
>>>>> with all the standard GO fields and it creates the 5 rows (?) in
>>>>> the different tables as required. This is definitely something
>>>>> that can be shared across centres, perhaps in a code library. All
>>>>> your code has to do is parse out the GO fields from the data. No
>>>>> reason why it couldn't accept a GO Bioperl object (I presume one
>>>>> exists).
>>>>>
>>>>> Perhaps the parsing needs to be a super class for each data source
>>>>> and then sub-classed by each centre?
>>>>>
>>>>> Ok, enough ramblings. Does any of this make sense?
>>>>> Paul.
>>>>>
>>>>>> steve
>>>>>>
>>>>>> Chris Stoeckert wrote:
>>>>>>
>>>>>>> Hi Steve,
>>>>>>> Thanks for putting this out on gusdev. Marie-Adele indicated
>>>>>>> that Paul Mooney was very interested in this and I will likely
>>>>>>> meet with him about this when I visit in January. Please include
>>>>>>> him in email correspondence when not addressed to the general
>>>>>>> gusdev list.
>>>>>>> Thanks,
>>>>>>> Chris
>>>>>>>
>>>>>>> On Dec 9, 2004, at 2:11 PM, Steve Fischer wrote:
>>>>>>>
>>>>>>>> folks-
>>>>>>>>
>>>>>>>> the UGA folks and CBIL folks have started collaborating on a
>>>>>>>> new plugin called LoadAnnotatedSeqs. It will use BioPerl to
>>>>>>>> parse the input data.
>>>>>>>>
>>>>>>>> We expect it to take annotated sequences (NA at first) in
>>>>>>>> genbank, tigr xml and embl formats (plus any others supported
>>>>>>>> by the bioPerl parser).
>>>>>>>>
>>>>>>>> It will take an XML file that describes the mapping from the
>>>>>>>> input features to GUS features, and SO features.
>>>>>>>> It will also hard code special cases to handle qualifier data
>>>>>>>> that is distributed to tables outside of the NAFeature tables.
>>>>>>>>
>>>>>>>> For our projects we will be developing a mapping that unifies
>>>>>>>> the semantics of the data we are getting from our different
>>>>>>>> sources and formats.
>>>>>>>> (we plan to work with the PSU folks to incorporate the
>>>>>>>> knowledge they have acquired in their work to make an EMBL parser)
>>>>>>>>
>>>>>>>> ideas and suggestions are encouraged.
>>>>>>>>
>>>>>>>> steve
|
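The structured qualifier values discussed in this message (the key=value /GO qualifier and the fixed-field /curation qualifier) could be split into individual pieces of data along the following lines. This is only a minimal sketch in plain Python rather than BioPerl: the function names and the CURATION_FIELDS list are hypothetical, chosen to mirror the field order quoted in the thread, and are not part of GUS, BioPerl, or the PSU code.

```python
# Hypothetical helpers for illustration; not taken from GUS, BioPerl or PSU code.

# Field order as quoted in the thread for the PSU-specific /curation qualifier.
CURATION_FIELDS = ["name", "origin", "date", "permission", "type", "dbref", "notes"]


def parse_key_value_qualifier(value):
    """Split a 'k1=v1; k2=v2' qualifier value into a dict.

    Line wraps inside the value are collapsed to single spaces first.
    """
    fields = {}
    for part in value.split(";"):
        part = " ".join(part.split())
        if not part:
            continue
        key, sep, val = part.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {part!r}")
        fields[key.strip()] = val.strip()
    return fields


def parse_fixed_field_qualifier(value, field_names):
    """Pair positional ';'-separated values with field names.

    Trailing fields that are absent from the value are simply left out.
    """
    parts = [p.strip() for p in value.split(";")]
    if len(parts) > len(field_names):
        raise ValueError("more values than known field names")
    return dict(zip(field_names, parts))


go = parse_key_value_qualifier(
    "aspect=process; GOid=GO:0006810; term=transport;\n"
    "evidence=ISS; db_xref=GOC:unpublished; with=SPTR:Q9UQ36; date=20001122"
)
curation = parse_fixed_field_qualifier(
    "Matt Berriman; genedb; 20020128; public; comment", CURATION_FIELDS
)
```

Each resulting key/value pair could then be treated as an ordinary tag, which is the shape a generic tag-to-attribute mapping expects.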
From: Paul M. <pj...@sa...> - 2004-12-10 16:53:13
|
>>> are there any other examples besides curation in which you have
>>> placed structured data in qualifiers? are there examples of
>>> standard embl qualifiers in which you expect to find structured data
>>> and parse it?
>>>
>>
>> After talking with Arnaud it seems we can take each
>> qualifier/structured field and create a new feature, with each one of
>> its qualifiers holding one piece of data. This would fit into your
>> mapping scheme.
>>
> ok. great. i was wondering about that.

Arnaud had to point out the obvious to me :)

> so does that mean that we can expect that no qualifiers will contain
> structured data that needs to be parsed?

We can make it so there is no structured data to be parsed (we'll leave it
in but ignore it).

>>> about systematic_ids, i understand what you've said. one thing
>>> though. how do they relate to gene names?
>>
> ok, but, what i'm driving at is that the unflattener uses gene name
> (/gene=) to decide what features go together in one gene model.
> really, it wouldn't matter what the value of the /gene= is, as long as
> it is identical for all features that belong to the gene. is that
> consistent with your use of /gene?

We don't use /gene until we submit to EMBL, when we convert /systematic_ids
to /genes. All EMBL files parsed into GUS have the previously mentioned name
qualifiers. Look at this extract of an EMBL file:

FT   CDS             1..3
FT                   /systematic_id="name1"
.
.
FT   CDS             4..8
FT                   /temporary_systematic_id="name2"

Hence I need to tell the unflattener to look for systematic_id first, then
temporary_systematic_id. It is important for us to use both.
|
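The 'best name first' policy described in this message can be sketched as follows. This is an illustration in plain Python under stated assumptions, not the BioPerl unflattener: features are modelled as simple dicts whose "qualifiers" map a qualifier name to a list of values, and best_name and group_by_best_name are hypothetical names.

```python
# Hypothetical sketch of a 'best name first' grouping policy.
# Features are plain dicts here; this is not the BioPerl feature API.

# Qualifiers to try, most stable name first (order taken from the thread).
NAME_PRIORITY = ("systematic_id", "temporary_systematic_id")


def best_name(qualifiers, priority=NAME_PRIORITY):
    """Return the first available name following the priority order, or None."""
    for key in priority:
        values = qualifiers.get(key)
        if values:
            return values[0]
    return None


def group_by_best_name(features):
    """Group features that share the same best name into one candidate gene model."""
    groups = {}
    for feature in features:
        name = best_name(feature["qualifiers"])
        groups.setdefault(name, []).append(feature)
    return groups


features = [
    {"key": "CDS", "location": "1..3", "qualifiers": {"systematic_id": ["name1"]}},
    {"key": "CDS", "location": "4..8", "qualifiers": {"temporary_systematic_id": ["name2"]}},
    {"key": "mRNA", "location": "1..3", "qualifiers": {"systematic_id": ["name1"]}},
]
groups = group_by_best_name(features)
```

Since the unflattener is said in the thread to group on a single qualifier, another option with the same effect would be to copy the best name into one shared tag on every feature before unflattening.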
From: Steve F. <sfi...@pc...> - 2004-12-13 14:45:52
|
folks-

i took a look at Ed's dump of the bioperl objects created by the parse of genbank.

for genbank, the bioperl Annotation objects are only used to describe the sequence and not any individual features. our mapping assumes that, so we're lucky so far. we'll need to have a look at tigr.

nonetheless, i think we need to adjust our mapping XML schema a tad. the main insight is that the source of our mapping is not genbank, or tigr, etc, but... bioperl objects. our mapping syntax must describe how to map bioperl feature objects into gus, regardless of the origin of the data. and, bioperl features have 'tags' and 'annotation'.

right now we have a <qualifier> tag that was intended to map an input qualifier to a gus attribute. but, bioperl doesn't have 'qualifiers'. so, i think we need to replace <qualifier> with: <tag>

so far, we don't need <annotation> for the feature mapping, and let's hope we don't. but, if we do, at least our xml will be forward compatible.

that said, we still owe ourselves a mapping for the tags and annotation that directly describe the sequence.

steve

Steve Fischer wrote:

> paul- see in line
>
> steve
>
> Paul Mooney wrote:
>
>> On 10 Dec 2004, at 12:52, Steve Fischer wrote:
>>
>>> paul-
>>>
>>> ok, i see.
>>>
>>> are there any other examples besides curation in which you have
>>> placed structured data in qualifiers? are there examples of
>>> standard embl qualifiers in which you expect to find structured data
>>> and parse it?
>>>
>>
>> After talking with Arnaud it seems we can take each
>> qualifier/structured field and create a new feature, with each one of
>> its qualifiers holding one piece of data. This would fit into your
>> mapping scheme.
>>
> ok. great. i was wondering about that.
>
> so does that mean that we can expect that no qualifiers will contain
> structured data that needs to be parsed?
>
>>> in the case of curation, where do you put that info in GUS?
>>
>> It will probably end up as a note, for now at least.
>>
>>>
>>> about systematic_ids, i understand what you've said. one thing
>>> though. how do they relate to gene names?
>>
> ok, but, what i'm driving at is that the unflattener uses gene name
> (/gene=) to decide what features go together in one gene model.
> really, it wouldn't matter what the value of the /gene= is, as long as
> it is identical for all features that belong to the gene. is that
> consistent with your use of /gene?
>
>> They are the gene names :)
>> Standard EMBL uses a /gene qualifier for the gene symbol and
>> /standard_name for the human readable name.
>> During sequencing and annotation using a single /gene conveys no
>> meaning as to how stable/temporary the ID is.
|
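The idea of mapping feature tags (rather than format-specific qualifiers) to attributes through an XML file might look roughly like the sketch below. Everything except the <tag> element name is an assumption made for illustration: the thread does not show the real mapping schema, so the <mapping> and <feature> elements, the attribute names, and the table name "ExampleFeatureTable" are all hypothetical, and the code is plain Python, not the plugin itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping document. Only the <tag> element name comes from the
# thread; the other element names, attributes and the table name are invented.
MAPPING_XML = """
<mapping>
  <feature name="CDS" table="ExampleFeatureTable">
    <tag name="systematic_id" column="name"/>
    <tag name="product" column="product"/>
  </feature>
</mapping>
"""


def load_mapping(xml_text):
    """Read the mapping into {feature name: {"table": ..., "tags": {tag: column}}}."""
    root = ET.fromstring(xml_text.strip())
    mapping = {}
    for feature in root.findall("feature"):
        tags = {t.get("name"): t.get("column") for t in feature.findall("tag")}
        mapping[feature.get("name")] = {"table": feature.get("table"), "tags": tags}
    return mapping


def map_feature(primary_tag, tags, mapping):
    """Translate one parsed feature into (table, row, unmapped tag names).

    `tags` maps a tag name to a list of values, whatever file format the
    feature originally came from. Only the first value of each tag is used.
    """
    rule = mapping[primary_tag]
    row, unmapped = {}, []
    for tag_name, values in tags.items():
        column = rule["tags"].get(tag_name)
        if column is None:
            unmapped.append(tag_name)  # candidate for special-case handling
        else:
            row[column] = values[0]
    return rule["table"], row, unmapped


table, row, unmapped = map_feature(
    "CDS",
    {"systematic_id": ["name1"], "curation": ["Matt Berriman; genedb"]},
    load_mapping(MAPPING_XML),
)
```

Reporting unmapped tags separately, instead of dropping them silently, leaves room for the hard-coded special cases mentioned in the original announcement.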