From: <pi...@pc...> - 2004-10-26 15:31:29

Hi Ed,

The hierarchy for choosing the exemplar id is swiss-prot->pir->longest
description (sorry for the "etc." before, I had to go look in LoadNRDB to
remember). This hierarchy was chosen because we preferred having links to
swiss-prot; if that was not available we would link to PIR, and if neither
was available we would use the entry with what we hoped was the most
informative (longest) description. It was an internal decision, but I think
a reasonable one.

Debbie

Quoting Ed Robinson <ed_...@be...>:

> I knew it was a unique sequence over multiple redundant sources; I didn't
> realize that they retained all the source ids with it.
>
> What is the full hierarchy for choosing which exemplar? Is this hierarchy a
> GUS internal hierarchy, or is it a hierarchy used by the larger community
> (e.g., we start with GenBank, and then the next two primary depositories
> (EMBL, DDBJ), etc.)?
>
> Thanks, this is really helping me understand the logic internal to this
> plugin.
>
> -Ed
>
> > From: pi...@pc...
> > Date: 2004/10/26 Tue AM 10:52:48 EDT
> > To: Ed Robinson <ed_...@be...>
> > CC: gus...@li...
> > Subject: [Gusdev-gusdev] Re: Re: Corrupt NRDB?
> >
> > Hi Ed,
> >
> > The point of nrdb is that it is supposed to consolidate ids from multiple
> > sources that represent the same sequence, thus creating a non-redundant
> > database. The plugin was written to put each of the ids into
> > dots.nrdbentry, where the multiple rows from a single record would refer
> > to a single sequence in dots.externalaasequence. One of the ids would be
> > the exemplar in dots.externalaasequence based on a hierarchy
> > (swiss-prot->pir->etc.).
> >
> > This is one of the reasons that LoadNRDB is rather complicated: it
> > updates these two tables. In addition, LoadNRDB attaches a taxon_id to
> > each row of NRDBEntry, which is missing from the nr records from NCBI but
> > provided via a tax_id from the protein-gi to ncbi_tax_id file.
> >
> > Debbie
> >
> > Quoting Ed Robinson <ed_...@be...>:
> >
> > > Thanks. Everything is working fine then. I didn't realize that was
> > > part of the nrdb formatting.
> > >
> > > What exactly is the rule for using the escape header in these cases?
> > > Are they multiple ids associated with one sequence? And what exactly
> > > does LoadNRDB do with these? Does it just enter the last id number, or
> > > each of them?
> > >
> > > -ed
> > >
> > > > From: pi...@pc...
> > > > Date: 2004/10/25 Mon PM 04:01:06 EDT
> > > > To: Ed Robinson <ed_...@be...>
> > > > Subject: Re: Corrupt NRDB?
> > > >
> > > > Hi Ed,
> > > >
> > > > The last time I downloaded nrdb was on September 20 and the entry
> > > > with source_id = AAN61313.1 looked fine:
> > > >
> > > > >gi|24430922|gb|AAN61313.1| cytochrome oxidase subunit III [Cicindela aureola]gi|24430920|gb|AAN61312.1| cytochrome oxidase subunit III [Cicindela hemichrysea]
> > > > GFFHSSLSPTVELGAMWPPAGISPFNPLQIPLLNTLILLTSGITVTWAHHGLMENNYTQALQGLFFTVILGIYFTALQAY
> > > > EYFESPFTIADSVYGSTFFMATGFHGLHVIIGTTFLLVCLMRHWMNHFSSIHHFGFEAAAWYWHFVDVVWLFLYISIYWW
> > > >
> > > > Of course, you can't see the ^A here that separates the two sources.
> > > > I didn't have a problem with that file.
> > > >
> > > > I don't see the oddness you see, but in the past I have encountered
> > > > entries that have not conformed to the format, though not in the way
> > > > you are describing. If you have downloaded multiple times, I would be
> > > > suspicious of NCBI. The current version of LoadNRDB.pm handles
> > > > failures by printing the sequence to STDERR. I printed the sequence
> > > > because that seemed to be the only reliably present part of a record.
> > > > The whole thing should fail if the number of failures is over 100.
> > > > This wouldn't work well with what you are describing, but perhaps it
> > > > could get you close enough to the record(s) to find the error(s).
> > > >
> > > > -Debbie
> > > >
> > > > Quoting Ed Robinson <ed_...@be...>:
> > > >
> > > > > Debbie,
> > > > >
> > > > > I have been taking a very close look at the NR database, and I have
> > > > > found a number of bad entries near the tail end of the file.
> > > > > Generally, these show up as entries which have no sequence and,
> > > > > instead of having a ">" to start the next entry, the bad entries
> > > > > tail right into the next entry. In some editors, there are diamonds
> > > > > where the carriage return should be.
> > > > >
> > > > > I have downloaded the nr file a number of ways, but I am still
> > > > > finding these errors. Can you tell me if you find odd things with
> > > > > your NRDB also? A good set of IDs to look at are the following:
> > > > >
> > > > > GFFHSSLSPTVELGAMWPPAGISPFNPLQIPLLNTLILLTSGITVTWAHHGLMENNYTQALQGLFFTVILGIYFTALQAYEYFESPFTIADSVYGSTFFMATGFHGLHVIIGTTFLLVCLMRHWMNHFSSIHHFGFEAAAWYWHFV|146
> > > > > ||176382|37| cytochrome oxidase subunit III [Cicindela
> > > > > aureola]gi|24430922|AAN61313.1|ExternalAASequence|1|1|1|1|1|1|1|1|1|0|GFFHSSLSPTVELGAMWPPAGISPFNPLQIPLLNTLILLTSGITVTWAHHGLMENNYTQALQGLFFTVILGIYFTALQAYEYFESPFTIADSVYGSTFFMATGFHGLHVIIGTTFLLVCLMRHWMNHFSSIHHFGFEAAAWYWHFVDVVWLFLYISIYWWGS|162
> > > > >
> > > > > Let me know if you also have problems.
> > > > >
> > > > > thanks
> > > > >
> > > > > -Ed
> > > > >
> > > > > > From: pi...@pc...
> > > > > > Date: 2004/10/16 Sat AM 11:51:59 EDT
> > > > > > To: Ed Robinson <ed_...@be...>
> > > > > > CC: gus...@li...
> > > > > > Subject: Re: [Gusdev-gusdev] New LoadNRDB & Consolidated GUS install package
> > > > > >
> > > > > > I recently (within the last 3 weeks) loaded an entirely new
> > > > > > version of nrdb and it took less than 24 hours. This should have
> > > > > > been equivalent to a first load. I think that something else was
> > > > > > wrong when you ran the plugin, possibly with the database (e.g.
> > > > > > indexes missing, a need to update statistics).
> > > > > >
> > > > > > I agree that LoadNRDB needs an upgrade, but I think its poor
> > > > > > performance in this case is due to some other problem.
> > > > > >
> > > > > > -Debbie
> > > > > >
> > > > > > Quoting Ed Robinson <ed_...@be...>:
> > > > > >
> > > > > > > As many of you know, we have been doing quite a few GUS
> > > > > > > installs down here, and this has pushed me to try and simplify
> > > > > > > this process as much as possible. I am now far enough along on
> > > > > > > a couple things to bring them up on the list.
> > > > > > >
> > > > > > > First, installing NRDB the first time in GUS is a horribly
> > > > > > > painful process using the existing plugin, and this pain seems
> > > > > > > to be needless since it is an empty database. To this end, I
> > > > > > > have written a couple scripts and a batch process for Oracle
> > > > > > > SQL*Loader which accomplishes in about an hour what takes a
> > > > > > > few weeks with the plugin. However, to make this work, I have
> > > > > > > to reserve early rows in a number of SRes tables for
> > > > > > > meaningful entries in columns such as row_alg_invocation_id.
> > > > > > > Hence, my first discussion item: should we consider reserving
> > > > > > > early values in a number of the SRes tables to serve as
> > > > > > > standard values?
> > > > > > > We already require that some rows be entered in GUS early on
> > > > > > > to make some things work, such as LoadReviewType. It would
> > > > > > > seem that we should pre-populate some of these tables with
> > > > > > > basic values that we can then refer to as standard values for
> > > > > > > bootstrapping operations such as a bulk load of NRDB. Does
> > > > > > > anyone else see any value in this and, if so, what fields
> > > > > > > should we create standard entries for? Also, is there anything
> > > > > > > else that would be amenable to a batch process for
> > > > > > > bootstrapping? (Note: I do NOT think any organism-specific
> > > > > > > data is amenable to bootstrapping. That is what an
> > > > > > > (object-based) pipeline is for. Also, this batch process is
> > > > > > > only good if you are using Oracle, but a similar process can
> > > > > > > be written there too.)
> > > > > > >
> > > > > > > This also gets me to some of the other scripts we use to
> > > > > > > bootstrap GUS, such as the predefined set of ExternalDatabases
> > > > > > > we load. The XML which I use to load this is pretty messy and
> > > > > > > not well documented. Does anyone mind if I clean it up? If the
> > > > > > > answer is yes, is there anything I should know about this
> > > > > > > file? It seems that the XML for this table load is a nice one
> > > > > > > to clean up and make standard for GUS installations all over,
> > > > > > > since it will push GUS to be standardized across
> > > > > > > installations. What else should we standardize?
> > > > > > >
> > > > > > > Which now brings me to the last item I want to open up, which
> > > > > > > is that I am close to completing a full GUS installation
> > > > > > > wrapper script which essentially makes a GUS installation a
> > > > > > > click-and-play operation.
> > > > > > > One of our deliverables is supposed to be an easy-to-install
> > > > > > > GUS package. Regardless of the state of GUS with regards to an
> > > > > > > official release, this script is going to make my life a whole
> > > > > > > lot easier. I figure it might be nice to package the whole
> > > > > > > kit-n-kaboodle up into one nice fat tarball with a simple set
> > > > > > > of instructions for download from someplace. Is anyone else
> > > > > > > interested in this?
> > > > > > >
> > > > > > > Finally, one quick question I have about the NRDB load:
> > > > > > > working on it showed me that the description field in
> > > > > > > AASequenceIMP is too short for many of the descriptions in
> > > > > > > NRDB. Do we want to up the description field size for
> > > > > > > dots.aasequenceimp?
> > > > > > >
> > > > > > > Anyway, any feedback on this would be appreciated.
> > > > > > >
> > > > > > > -Ed R
> > > > > > >
> > > > > > > Ed Robinson
> > > > > > > 255 Deerfield Rd
> > > > > > > Bogart, GA 30622
> > > > > > > (706)425-9181
> > > > > > >
> > > > > > > --Learn more about the face of your neighbor, and less about
> > > > > > > your own.
> > > > > > > -Sargent Shriver
> > > > > > >
> > > > > > > -------------------------------------------------------
> > > > > > > This SF.net email is sponsored by: IT Product Guide on ITManagersJournal
> > > > > > > Use IT products in your business? Tell us what you think of them. Give us
> > > > > > > Your Opinions, Get Free ThinkGeek Gift Certificates! Click to find out more
> > > > > > > http://productguide.itmanagersjournal.com/guidepromo.tmpl
> > > > > > > _______________________________________________
> > > > > > > Gusdev-gusdev mailing list
> > > > > > > Gus...@li...
> > > > > > > https://lists.sourceforge.net/lists/listinfo/gusdev-gusdev
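
The exemplar rule described at the top of this thread (swiss-prot, then pir, then the longest description) can be sketched roughly as below. This is only an illustrative Python sketch, not the LoadNRDB Perl code. The names `SourceEntry`, `parse_defline`, `choose_exemplar`, and `DB_RANK` are hypothetical. The field layout (`gi|number|db|accession|description`, sources joined by ^A) is assumed from the header quoted above, and breaking ties within one database by description length is an assumption the thread does not confirm.

```python
from typing import List, NamedTuple

CTRL_A = "\x01"  # separator between merged sources in an nr header (shown as ^A)

# Lower rank is preferred. "sp" and "pir" are assumed to be the defline tags
# for Swiss-Prot and PIR; every other database falls through to rank 2.
DB_RANK = {"sp": 0, "pir": 1}


class SourceEntry(NamedTuple):
    gi: str
    db: str
    accession: str
    description: str


def parse_defline(defline: str) -> List[SourceEntry]:
    """Split one nr header into its per-source entries."""
    entries = []
    for part in defline.lstrip(">").split(CTRL_A):
        fields = part.split("|", 4)  # gi | number | db | accession | rest
        if len(fields) != 5 or fields[0] != "gi":
            raise ValueError(f"unexpected defline part: {part!r}")
        _, gi, db, accession, description = fields
        entries.append(SourceEntry(gi, db, accession, description.strip()))
    return entries


def choose_exemplar(entries: List[SourceEntry]) -> SourceEntry:
    """Prefer swiss-prot, then pir, then the longest description."""
    return min(
        entries,
        key=lambda e: (DB_RANK.get(e.db, len(DB_RANK)), -len(e.description)),
    )


# Made-up header with two sources; the pir entry outranks the gb entry.
header = (
    ">gi|1|gb|AAA00001.1| short description"
    + CTRL_A
    + "gi|2|pir||A00002 hypothetical protein"
)
print(choose_exemplar(parse_defline(header)).db)  # prints "pir"
```

Every parsed entry would still get its own row in dots.nrdbentry; only the chosen one would supply the id stored with the sequence in dots.externalaasequence.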