Re: [XMLPipeDB-developer] Test .gdb with new SGD table in SGDTest
From: Kam D. <kda...@lm...> - 2009-12-04 21:42:23
Hi,

Yes, I saw that list too. I told Kenny that I wanted to review them myself before we decided what to do with them. I may or may not get a chance to do this before I leave for the conference. If I don't get caught up, we'll just leave them on the "to do" list for later. Unfortunately, his list doesn't have the crucial piece of information: which UniProt record each ID is part of, and whether it is the gene that belongs to that record.

Cheers,

Kam

At 12:02 PM 12/4/2009, John David N. Dionisio wrote:
>OK, I'll adjust as specified and let you know when there is a new test gdb.
>
>I just checked and Kenny has listed "the 33" in his online notebook, under
>the December 3 entry. It looks like they also finished surveying where in
>the XML these IDs were found, so that may be ready for some action
>decisions :)
>
>John David N. Dionisio, PhD
>Assistant Professor, Computer Science
>Loyola Marymount University
>
>
>On Dec 4, 2009, at 11:20 AM, Kam Dahlquist wrote:
>
> > Hi,
> >
> > See below:
> >
> > Kam
> >
> > At 09:41 PM 12/3/2009, you wrote:
> >> Hi Kam,
> >>
> >> I've uploaded a test export to the wiki:
> >>
> >> https://www.cs.lmu.edu/biodb/fall2009/index.php/File:Sc-Std_20091203-test.gdb
> >>
> >> This .gdb has a table called SGDTest, which is a candidate for what
> >> the SGD table should really be. Please take a look and see if this
> >> appears correct (or at least the right track :) ). There are fewer
> >> records overall in this combo SGDTest table, as it represents only the
> >> records with all 3 IDs.
> >
> > This is the right track, but...
> >
> > If a gene does not have a gene symbol (like ACT1), its ORF ID is used
> > instead. The difference between the SGD table and the SGDTest is 1219
> > records; I am guessing that most, if not all, of them got left out because
> > they did not have a gene symbol. In that case, their ORF ID should be
> > copied over into the gene symbol field.
> >
> > Also, don't forget that some of the ORF IDs are not in the "Y" form,
> > but are in the form as follows: Q####. These are mitochondrial genes.
> >
> > Somehow in the 2006 yeast gdb, empty data is being tolerated in the ORF
> > or Symbol fields; I'm not sure how that is. I'm hoping that for every
> > S######### ID there is at least an ORF ID so that if you copy over that
> > to the gene symbol for the ones that are missing, we won't lose any records.
> >
> > The 2006 gdb is actually quite poor in terms of data integrity, now
> > that I look at it.
> >
> >
> >> Also, I noticed that the Ensembl table in the 2006 version also has
> >> more columns...should this also be replicated in the GenMAPP Builder
> >> export? Are there other tables that I might not be remembering?
> >
> > No. Keep in mind, GenMAPP.org is using Ensembl as a primary database
> > so they have the ability to capture more data from Ensembl directly. I
> > don't think we should try to replicate this table because the info there
> > is pretty much in the UniProt or SGD tables. The only issue for users is
> > that if they made a MAPP or Expression Dataset using Ensembl in a
> > previous version of the gdb, it won't be compatible with ours. However,
> > I would wager that 99% of the yeast community would choose SGD as their
> > choice of system, so I think we're OK.
> >
> >
> >> Meanwhile, before I left campus, Kenny and Don were off investigating
> >> matched XML IDs that were not found in the database. There were 33 in
> >> all, and by the time I left a few categories had already emerged --- IDs
> >> in comment text only, another ID in a paper title but nowhere else --- so
> >> this may turn out to be like A. thaliana. We'll see what their final
> >> report looks like.
> >
> > Yeah, except that if it's only found in comment text or a paper title,
> > we probably don't want them. If there's a list of the 33 somewhere, we
> > can go through them one by one to make these determinations.
> >
> > Progress is definitely being made!
> >
> >
> >> John David N. Dionisio, PhD
> >> Assistant Professor, Computer Science
> >> Loyola Marymount University
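
The ORF ID fallback discussed in the thread (when a gene has no gene symbol, copy its ORF ID into the gene symbol field so the record is not dropped) could be sketched roughly as below. This is only an illustration in Python and not GenMAPP Builder's actual code: the function name, the dictionary keys, and the sample rows (including the S- and Q-style ID values) are hypothetical placeholders, not real SGD data.

```python
def fill_missing_symbols(records):
    """Copy the ORF ID into the symbol field wherever the symbol is empty.

    Returns (filled, unresolved): rows that end up with a symbol, and rows
    that have neither a symbol nor an ORF ID and so cannot be repaired.
    The keys "sgd_id", "orf_id", and "symbol" are assumed names.
    """
    filled, unresolved = [], []
    for rec in records:
        symbol = (rec.get("symbol") or "").strip()
        orf_id = (rec.get("orf_id") or "").strip()
        if symbol or orf_id:
            # Keep the existing symbol; otherwise fall back to the ORF ID.
            filled.append({**rec, "symbol": symbol or orf_id})
        else:
            # Nothing to copy over; report the row instead of guessing.
            unresolved.append(rec)
    return filled, unresolved


# Placeholder rows for illustration only.
sample = [
    {"sgd_id": "S000000001", "orf_id": "YFL039C", "symbol": "ACT1"},
    {"sgd_id": "S000000002", "orf_id": "Q0001", "symbol": ""},
    {"sgd_id": "S000000003", "orf_id": "", "symbol": None},
]

filled, unresolved = fill_missing_symbols(sample)
print([row["symbol"] for row in filled])
print(len(unresolved))
```

Returning the unresolved rows separately makes it easy to check the hope expressed above, namely that every S######### ID has at least an ORF ID, before deciding whether any records would still be lost.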