From: <pi...@pc...> - 2004-10-26 14:52:54
Hi Ed,

The point of nrdb is that it is supposed to consolidate ids from multiple
sources that represent the same sequence, thus creating a non-redundant
database. The plugin was written to put each of the ids into
dots.nrdbentry, where the multiple rows from a single record would refer to
a single sequence in dots.externalaasequence. One of the ids would be the
exemplar in dots.externalaasequence, based on a hierarchy
(swiss-prot->pir->etc.). This is one of the reasons that LoadNRDB is rather
complicated: it updates both of these tables.

In addition, LoadNRDB attaches a taxon_id to each row of NRDBEntry. The
taxon is missing from the nr records from NCBI but is provided via a tax_id
from the protein-gi to ncbi_tax_id file.

Debbie

Quoting Ed Robinson <ed_...@be...>:

> Thanks. Everything is working fine then. I didn't realize that was part
> of the nrdb formatting.
>
> What exactly is the rule for using the escape header in these cases? Are
> they multiple ids associated with one sequence? And what exactly does
> NRDBLoad do with these? Does it just enter the last id number, or each
> of them?
>
> -ed
>
> > From: pi...@pc...
> > Date: 2004/10/25 Mon PM 04:01:06 EDT
> > To: Ed Robinson <ed_...@be...>
> > Subject: Re: Corrupt NRDB?
> >
> > Hi Ed,
> >
> > The last time I downloaded nrdb was on September 20 and the entry with
> > source_id = AAN61313.1 looked fine:
> >
> > >gi|24430922|gb|AAN61313.1| cytochrome oxidase subunit III [Cicindela aureola]gi|24430920|gb|AAN61312.1| cytochrome oxidase subunit III [Cicindela hemichrysea]
> > GFFHSSLSPTVELGAMWPPAGISPFNPLQIPLLNTLILLTSGITVTWAHHGLMENNYTQALQGLFFTVILGIYFTALQAY
> > EYFESPFTIADSVYGSTFFMATGFHGLHVIIGTTFLLVCLMRHWMNHFSSIHHFGFEAAAWYWHFVDVVWLFLYISIYWW
> >
> > Of course, you can't see the ^A here, that separates the two sources. I
> > didn't have a problem with that file.
> >
> > I don't see the oddness you see, but in the past I have encountered
> > entries that have not conformed to the format, though not in the way you
> > are describing. If you have downloaded multiple times, I would be
> > suspicious of NCBI. The current version of LoadNRDB.pm handles failures
> > by printing the sequence to STDERR. I printed the sequence because that
> > seemed to be the only reliably present part of a record. The whole thing
> > should fail if the number of failures is over 100. This wouldn't work
> > well with what you are describing, but perhaps it could get you close
> > enough to the record(s) to find the error(s).
> >
> > -Debbie
> >
> > Quoting Ed Robinson <ed_...@be...>:
> >
> > > Debbie,
> > >
> > > I have been taking a very close look at the NR database, and I have
> > > found a number of bad entries near the tail end of the file. Generally,
> > > these show up as entries which have no sequence and, instead of having
> > > a ">" to start the next entry, the bad entries tail right into the next
> > > entry. In some editors, there are diamonds where the carriage return
> > > should be.
> > > I have downloaded the nr file a number of ways, but I am still finding
> > > these errors. Can you tell me if you find odd things with your NRDB
> > > also? A good set of IDs to look at are the following:
> > >
> > > GFFHSSLSPTVELGAMWPPAGISPFNPLQIPLLNTLILLTSGITVTWAHHGLMENNYTQALQGLFFTVILGIYFTALQAYEYFESPFTIADSVYGSTFFMATGFHGLHVIIGTTFLLVCLMRHWMNHFSSIHHFGFEAAAWYWHFV|146
> > > ||176382|37| cytochrome oxidase subunit III [Cicindela
> > >
> > > aureola]gi|24430922|AAN61313.1|ExternalAASequence|1|1|1|1|1|1|1|1|1|0|GFFHSSLSPTVELGAMWPPAGISPFNPLQIPLLNTLILLTSGITVTWAHHGLMENNYTQALQGLFFTVILGIYFTALQAYEYFESPFTIADSVYGSTFFMATGFHGLHVIIGTTFLLVCLMRHWMNHFSSIHHFGFEAAAWYWHFVDVVWLFLYISIYWWGS|162
> > >
> > > Let me know if you also have problems.
> > >
> > > thanks
> > >
> > > -Ed
> > >
> > > > From: pi...@pc...
> > > > Date: 2004/10/16 Sat AM 11:51:59 EDT
> > > > To: Ed Robinson <ed_...@be...>
> > > > CC: gus...@li...
> > > > Subject: Re: [Gusdev-gusdev] New LoadNRDB & Consolidated GUS install package
> > > >
> > > > I recently (within the last 3 weeks) loaded an entirely new version of
> > > > nrdb and it took less than 24 hours. This should have been equivalent
> > > > to a first load. I think that something else was wrong when you ran
> > > > the plugin, possibly with the database (e.g. indexes missing, a need
> > > > to update statistics).
> > > >
> > > > I agree that LoadNRDB needs an upgrade but I think its poor
> > > > performance in this case is due to some other problem.
> > > >
> > > > -Debbie
> > > >
> > > > Quoting Ed Robinson <ed_...@be...>:
> > > >
> > > > > As many of you know, we have been doing quite a few GUS installs
> > > > > down here, and this has pushed me to try and simplify this process
> > > > > as much as possible. I am now far enough along on a couple things
> > > > > to bring them up on the list.
> > > > >
> > > > > First, installing NRDB the first time in GUS is a horribly painful
> > > > > process using the existing plugin and this pain seems to be
> > > > > needless since it is an empty database. To this end, I have written
> > > > > a couple scripts and a batch process for Oracle SQL*Loader which
> > > > > accomplishes in about an hour what takes a few weeks with the
> > > > > plugin. However, to make this work, I have to reserve early rows in
> > > > > a number of SRes tables for meaningful entries in columns such as
> > > > > row_alg_invocation_id. Hence, my first discussion item: Should we
> > > > > consider reserving early values in a number of the SRes tables to
> > > > > serve as standard values? We already require that some rows be
> > > > > entered in GUS early on to make some things work, such as
> > > > > LoadReviewType.
> > > > > It would seem that we should pre-populate some of these tables with
> > > > > basic values that we can then refer to as standard values for
> > > > > bootstrapping operations such as a bulk load of NRDB. Does anyone
> > > > > else see any value in this and, if so, what fields should we create
> > > > > standard entries for? Also, is there anything else that would be
> > > > > amenable to a batch process for bootstrapping? (Note: I do NOT
> > > > > think any organism-specific data is amenable to bootstrapping. That
> > > > > is what an (object-based) pipeline is for. Also, this batch process
> > > > > is only good if you are using Oracle, but a similar process can be
> > > > > written there too.)
> > > > >
> > > > > This also gets me to some of the other scripts we use to bootstrap
> > > > > GUS, such as the predefined set of ExternalDatabases we load. The
> > > > > XML which I use to load this is pretty messy, and not well
> > > > > documented. Does anyone mind if I clean it up? If the answer is
> > > > > yes, is there anything I should know about this file? It seems that
> > > > > the XML for this table load is a nice one to clean up and make
> > > > > standard for GUS installations all over, since it will push GUS to
> > > > > be standardized across installations. What else should we
> > > > > standardize?
> > > > >
> > > > > Which now brings me to the last item I want to open up, which is
> > > > > that I am close to completing a full GUS installation wrapper
> > > > > script which essentially makes a GUS installation a click-and-play
> > > > > operation. One of our deliverables is supposed to be an easy to
> > > > > install GUS package. Regardless of the state of GUS with regards to
> > > > > an official release, this script is going to make my life a whole
> > > > > lot easier.
> > > > > I figure it might be nice to package the whole kit-n-kaboodle up
> > > > > into one nice fat tarball with a simple set of instructions for
> > > > > download from someplace. Is anyone else interested in this?
> > > > >
> > > > > Finally, one quick question I have about the NRDB load: working on
> > > > > it showed me that the description field in AASequenceIMP is too
> > > > > short for many of the descriptions in NRDB. Do we want to up the
> > > > > description field size for dots.aasequenceimp?
> > > > >
> > > > > Anyway, any feedback on this would be appreciated.
> > > > >
> > > > > -Ed R
> > > > >
> > > > > Ed Robinson
> > > > > 255 Deerfield Rd
> > > > > Bogart, GA 30622
> > > > > (706)425-9181
> > > > >
> > > > > --Learn more about the face of your neighbor, and less about your own.
> > > > > -Sargent Shriver
> > > > >
> > > > > -------------------------------------------------------
> > > > > This SF.net email is sponsored by: IT Product Guide on ITManagersJournal
> > > > > Use IT products in your business? Tell us what you think of them. Give us
> > > > > Your Opinions, Get Free ThinkGeek Gift Certificates! Click to find out more
> > > > > http://productguide.itmanagersjournal.com/guidepromo.tmpl
> > > > > _______________________________________________
> > > > > Gusdev-gusdev mailing list
> > > > > Gus...@li...
> > > > > https://lists.sourceforge.net/lists/listinfo/gusdev-gusdev
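
The consolidation described in the top message of this thread (several ids
per nr record, separated by ^A, with one of them chosen as the exemplar by a
source hierarchy) can be sketched roughly as follows. This is a minimal
Python illustration under stated assumptions, not the LoadNRDB.pm code. The
names NrEntry, parse_defline, choose_exemplar and SOURCE_PRIORITY are
hypothetical, and the priority order after swiss-prot and pir is a guess,
since the thread only says "swiss-prot->pir->etc.". The defline used is the
one quoted above, with the ^A written explicitly as "\x01".

```python
from dataclasses import dataclass
from typing import List

# Exemplar preference, best first. The thread only gives
# "swiss-prot->pir->etc."; everything after "pir" here is an assumption.
SOURCE_PRIORITY = ["sp", "pir", "ref", "gb", "emb", "dbj"]


@dataclass
class NrEntry:
    gi: str
    source: str       # database tag from the defline, e.g. "gb"
    source_id: str    # e.g. "AAN61313.1"
    description: str


def parse_defline(defline: str) -> List[NrEntry]:
    """Split one nr defline into one entry per source (Ctrl-A separated)."""
    entries = []
    for chunk in defline.lstrip(">").split("\x01"):
        ids, _, description = chunk.partition(" ")
        fields = ids.split("|")
        # Assumed layout: gi|<number>|<db>|<accession>|<name>
        if len(fields) < 4 or fields[0] != "gi":
            raise ValueError(f"malformed defline chunk: {chunk!r}")
        # Some sources (e.g. pir) leave the accession empty and use the name.
        source_id = fields[3] or (fields[4] if len(fields) > 4 else "")
        entries.append(
            NrEntry(fields[1], fields[2], source_id, description.strip())
        )
    return entries


def choose_exemplar(entries: List[NrEntry]) -> NrEntry:
    """Pick the entry whose source ranks highest; ties keep defline order."""
    def rank(entry: NrEntry) -> int:
        if entry.source in SOURCE_PRIORITY:
            return SOURCE_PRIORITY.index(entry.source)
        return len(SOURCE_PRIORITY)  # unknown sources sort last
    return min(entries, key=rank)


defline = (
    ">gi|24430922|gb|AAN61313.1| cytochrome oxidase subunit III [Cicindela aureola]"
    "\x01"
    "gi|24430920|gb|AAN61312.1| cytochrome oxidase subunit III [Cicindela hemichrysea]"
)
entries = parse_defline(defline)
exemplar = choose_exemplar(entries)
print(len(entries), exemplar.source_id)
```

In the terms used above, each NrEntry would correspond to one row in
dots.nrdbentry and the exemplar to the single row in
dots.externalaasequence that those rows point to. Raising on a malformed
chunk is only one option; the plugin as described instead reports the
failure and gives up once more than 100 have been seen.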