Re: [XMLPipeDB-developer] Test .gdb with new SGD table in SGDTest
From: John D. N. D. <do...@lm...> - 2009-12-05 07:47:02
Greetings Kam,

In refining that modified SGD table query, I found some "interesting" sets of records. For your consideration...

  hjid   | id | symbol |   orf
---------+----+--------+---------
 1799877 |    | LPT1   |
 1801238 |    | ALD6   |
 1802608 |    |        | YDR539W
 1803959 |    |        |
 1805321 |    | IME1   | YJR094C
 1806664 |    | RME1   | YGR044C
 1808027 |    | RSF1   | YMR030W
 1809403 |    |        |

...these records are entries that have no SGD IDs. Some supplementary information about them:

LPT1 - A9EDP4_YEAST - Lysophospholipid acyltransferase
ALD6 - A9LRZ7_YEAST - Cytosolic aldehyde dehydrogenase
YDR539W - B2NII0_YEAST - Putative uncharacterized protein YDR539W
IME1 - B8XW28_YEAST - Ime1p
RME1 - B8XW41_YEAST - Rme1p
RSF1 - B8XW45_YEAST - Rsf1p

I don't know what to say about the 2 records with *none* of the IDs though...hjid doesn't help because that is generated for the relational database, and is not part of the XML file. I'm guessing that these simply don't make it to SGD because there is no SGD ID?

Here's another "interesting" set...

  hjid   |     id     |  symbol   |           orf
---------+------------+-----------+-------------------------
 1509564 | S000000068 | TY1A-PR1  | YAR010C
 1509564 | S000000068 | TY1A-PR1  | YPR137C-A
 1509564 | S000000068 | TY1A-A    | YAR010C
 1509564 | S000000068 | TY1A-A    | YPR137C-A
 1128516 | S000000168 | RPS8A     | YBL072C
 1128516 | S000000168 | RPS8B     | YBL072C
 1128516 | S000000168 | RPS8A     | YER102W
 1128516 | S000000168 | RPS8B     | YER102W
 1043286 | S000000183 | RPL23B    | YER117W
 1043286 | S000000183 | RPL23B    | YBL087C
 1043286 | S000000183 | RPL23A    | YBL087C
 1043286 | S000000183 | RPL23A    | YER117W

These guys (among others) either have 2 symbol tags, 2 ORF names, or both. Presumably an issue since the SGD IDs must be unique in the SGD table?
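As a rough illustration of the two checks described so far (records with no SGD ID, and SGD IDs that fan out into several symbol/ORF combinations), here is a minimal Python sketch. It is not the actual GenMAPP Builder query: the function names `missing_sgd_id` and `ambiguous_ids` are hypothetical, and the rows are a small sample copied from the listings in this message, with `None` standing in for an empty field.

```python
from collections import defaultdict

# Sample (hjid, id, symbol, orf) rows copied from this message.
# None marks an empty field. This is an illustrative subset only.
ROWS = [
    (1799877, None, "LPT1", None),
    (1803959, None, None, None),
    (1509564, "S000000068", "TY1A-PR1", "YAR010C"),
    (1509564, "S000000068", "TY1A-PR1", "YPR137C-A"),
    (1509564, "S000000068", "TY1A-A", "YAR010C"),
    (1509564, "S000000068", "TY1A-A", "YPR137C-A"),
    (1128516, "S000000168", "RPS8A", "YBL072C"),
    (1128516, "S000000168", "RPS8B", "YBL072C"),
    (1128516, "S000000168", "RPS8A", "YER102W"),
    (1128516, "S000000168", "RPS8B", "YER102W"),
]


def missing_sgd_id(rows):
    """Return the rows whose SGD ID field is empty."""
    return [row for row in rows if not row[1]]


def ambiguous_ids(rows):
    """Map each SGD ID with more than one distinct symbol or ORF
    to a (symbols, orfs) pair of sorted lists."""
    symbols = defaultdict(set)
    orfs = defaultdict(set)
    for _hjid, sgd_id, symbol, orf in rows:
        if not sgd_id:
            continue  # rows without an SGD ID are reported separately
        if symbol:
            symbols[sgd_id].add(symbol)
        if orf:
            orfs[sgd_id].add(orf)
    all_ids = sorted(set(symbols) | set(orfs))
    return {
        sgd_id: (sorted(symbols[sgd_id]), sorted(orfs[sgd_id]))
        for sgd_id in all_ids
        if len(symbols[sgd_id]) > 1 or len(orfs[sgd_id]) > 1
    }


if __name__ == "__main__":
    print(len(missing_sgd_id(ROWS)), "rows without an SGD ID")
    for sgd_id, (syms, orf_names) in ambiguous_ids(ROWS).items():
        print(sgd_id, syms, orf_names)
```

Grouping by SGD ID rather than by row makes the cross-product effect visible: two symbols and two ORFs under one ID show up as four rows in the query output but as a single entry here.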
And finally, here are some "old friends" (again a subset):

 1529063 | S000000200 |        | YBL104C/YBL103C-A
  712746 | S000000302 | MMS4   | YBR098W/YBR100W
  888697 | S000000510 | PGS1   | YCL004W/YCL003W
  145116 | S000000520 | BUD3   | YCL014W/YCL013W/YCL012W
 1738671 | S000005435 |        | YOL075C/YOL074C
 1745387 | S000005522 |        | YOL162W/YOL163W
 1745387 | S000005523 |        | YOL162W/YOL163W
 1792808 | S000005613 | YVC1   | YOR087W/YOR088W
    5540 | S000005765 | ABP140 | YOR239W/YOR240W

...there are 41 of these in all. Presumably we'll want to "split" somehow, but I figure that their being in the SGD table may cause an issue, since that would duplicate/triplicate the SGD ID record?

Let me know how you'd like to deal with these...meanwhile, I'll implement what you said about the ORF filling in for the symbol/primary when that isn't around. I'll let you know if there's a new test build lying around.

John David N. Dionisio, PhD
Assistant Professor, Computer Science
Loyola Marymount University

On Dec 4, 2009, at 11:20 AM, Kam Dahlquist wrote:

> Hi,
>
> See below:
>
> Kam
>
> At 09:41 PM 12/3/2009, you wrote:
>> Hi Kam,
>>
>> I've uploaded a test export to the wiki:
>>
>> https://www.cs.lmu.edu/biodb/fall2009/index.php/File:Sc-Std_20091203-test.gdb
>>
>> This .gdb has a table called SGDTest, which is a candidate for what the SGD table should really be. Please take a look and see if this appears correct (or at least the right track :) ). There are fewer records overall in this combo SGDTest table, as it represents only the records with all 3 IDs.
>
> This is the right track, but...
>
> If a gene does not have a gene symbol (like ACT1), its ORF ID is used instead. The difference between the SGD table and the SGDTest is 1219 records; I am guessing that most, if not all, of them got left out because they did not have a gene symbol. In that case, their ORF ID should be copied over into the gene symbol field.
>
> Also, don't forget that some of the ORF IDs are not in the "Y" form, but are in the form as follows: Q####. These are mitochondrial genes.
>
> Somehow in the 2006 yeast gdb, empty data is being tolerated in the ORF or Symbol fields; I'm not sure how that is. I'm hoping that for every S######### ID there is at least an ORF ID so that if you copy over that to the gene symbol for the ones that are missing, we won't lose any records.
>
> The 2006 gdb is actually quite poor in terms of data integrity, now that I look at it.
>
>> Also, I noticed that the Ensembl table in the 2006 version also has more columns...should this also be replicated in the GenMAPP Builder export? Are there other tables that I might not be remembering?
>
> No. Keep in mind, GenMAPP.org is using Ensembl as a primary database so they have the ability to capture more data from Ensembl directly. I don't think we should try to replicate this table because the info there is pretty much in the UniProt or SGD tables. The only issue for users is that if they made a MAPP or Expression Dataset using Ensembl in a previous version of the gdb, it won't be compatible with ours. However, I would wager that 99% of the yeast community would choose SGD as their choice of system, so I think we're OK.
>
>> Meanwhile, before I left campus, Kenny and Don were off investigating matched XML IDs that were not found in the database. There were 33 in all, and by the time I left a few categories had already emerged --- IDs in comment text only, another ID in a paper title but nowhere else --- so this may turn out to be like A. thaliana. We'll see what their final report looks like.
>
> Yeah, except that if it's only found in comment text or a paper title, we probably don't want them. If there's a list of the 33 somewhere, we can go through them one by one to make these determinations.
>
> Progress is definitely being made!
>
>> John David N. Dionisio, PhD
>> Assistant Professor, Computer Science
>> Loyola Marymount University
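
For reference, the two rules discussed in this thread (the ORF ID standing in for a missing gene symbol, and a possible "split" of slash-delimited ORF fields) can be sketched in Python as follows. This is only one possible reading, not the implemented export logic: the names `split_orfs` and `normalize` are hypothetical, the choice to keep an existing symbol on every split row is an assumption, and how the split records should actually be handled is still an open question in the thread.

```python
def split_orfs(orf_field):
    """Split a slash-delimited ORF field into individual ORF names.

    An empty or missing field yields an empty list.
    """
    return [name for name in (orf_field or "").split("/") if name]


def normalize(record):
    """Return one (id, symbol, orf) tuple per ORF in the record.

    Assumptions (not confirmed by the thread):
    - an existing symbol is kept on every split row;
    - when the symbol is empty, each row uses its own ORF as the symbol.
    No check is made on the ORF's shape, so IDs that are not in the
    "Y" form (such as the Q#### mitochondrial ones) pass through as-is.
    """
    sgd_id, symbol, orf_field = record
    orfs = split_orfs(orf_field) or [None]  # keep records that have no ORF
    return [(sgd_id, symbol or orf, orf) for orf in orfs]


if __name__ == "__main__":
    # Two records taken from the "old friends" listing in this message.
    for rec in [
        ("S000000200", None, "YBL104C/YBL103C-A"),
        ("S000000302", "MMS4", "YBR098W/YBR100W"),
    ]:
        for row in normalize(rec):
            print(row)
```

Note that splitting this way repeats the SGD ID on each resulting row, which is exactly the uniqueness concern raised above, so the sketch illustrates the trade-off rather than resolving it.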