Re: [XMLPipeDB-developer] Test .gdb with new SGD table in SGDTest


Greetings Kam,

In refining that modified SGD table query, I found some "interesting" sets of records.  For your consideration...

  hjid   | id | symbol |   orf   
---------+----+--------+---------
 1799877 |    | LPT1   | 
 1801238 |    | ALD6   | 
 1802608 |    |        | YDR539W
 1803959 |    |        | 
 1805321 |    | IME1   | YJR094C
 1806664 |    | RME1   | YGR044C
 1808027 |    | RSF1   | YMR030W
 1809403 |    |        | 

...these records are entries that have no SGD IDs.  Some supplementary information about them:

LPT1 - A9EDP4_YEAST - Lysophospholipid acyltransferase
ALD6 - A9LRZ7_YEAST - Cytosolic aldehyde dehydrogenase
YDR539W - B2NII0_YEAST - Putative uncharacterized protein YDR539W
IME1 - B8XW28_YEAST - Ime1p
RME1 - B8XW41_YEAST - Rme1p
RSF1 - B8XW45_YEAST - Rsf1p

I don't know what to say about the 2 records with *none* of the IDs though...hjid doesn't help because that is generated for the relational database, and is not part of the XML file.

I'm guessing that these simply don't make it to SGD because there is no SGD ID?
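
For illustration, here is a minimal Python sketch of how the records above could be sorted into three piles: those with an SGD ID, those with no SGD ID but at least a symbol or ORF, and those with nothing at all. The function name partition_by_sgd_id and the tuple layout are hypothetical; this is not what GenMAPP Builder actually does, just a way to make the categories concrete.

```python
# Rows copied from the query output above: (hjid, id, symbol, orf).
# Blank cells are represented as empty strings (an assumption of this sketch).
rows = [
    (1799877, "", "LPT1", ""),
    (1801238, "", "ALD6", ""),
    (1802608, "", "", "YDR539W"),
    (1803959, "", "", ""),
    (1805321, "", "IME1", "YJR094C"),
    (1806664, "", "RME1", "YGR044C"),
    (1808027, "", "RSF1", "YMR030W"),
    (1809403, "", "", ""),
]


def partition_by_sgd_id(rows):
    """Split rows into (has SGD ID, no SGD ID but named, entirely blank)."""
    with_id, named_only, blank = [], [], []
    for row in rows:
        _hjid, sgd_id, symbol, orf = row
        if sgd_id:
            with_id.append(row)
        elif symbol or orf:
            named_only.append(row)
        else:
            blank.append(row)
    return with_id, named_only, blank


with_id, named_only, blank = partition_by_sgd_id(rows)
print(len(with_id), len(named_only), len(blank))  # prints: 0 6 2
```

The two entirely blank rows are the ones I can't say anything about; the other six at least carry a symbol or an ORF name.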

Here's another "interesting" set...

  hjid   |     id     |  symbol   |           orf           
---------+------------+-----------+-------------------------
 1509564 | S000000068 | TY1A-PR1  | YAR010C
 1509564 | S000000068 | TY1A-PR1  | YPR137C-A
 1509564 | S000000068 | TY1A-A    | YAR010C
 1509564 | S000000068 | TY1A-A    | YPR137C-A
 1128516 | S000000168 | RPS8A     | YBL072C
 1128516 | S000000168 | RPS8B     | YBL072C
 1128516 | S000000168 | RPS8A     | YER102W
 1128516 | S000000168 | RPS8B     | YER102W
 1043286 | S000000183 | RPL23B    | YER117W
 1043286 | S000000183 | RPL23B    | YBL087C
 1043286 | S000000183 | RPL23A    | YBL087C
 1043286 | S000000183 | RPL23A    | YER117W

These guys (among others) either have 2 symbol tags, 2 ORF names, or both.  Presumably an issue since the SGD IDs must be unique in the SGD table?
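
To make the pattern concrete, here is a rough Python sketch that groups rows by SGD ID and reports any ID carrying more than one distinct symbol or more than one distinct ORF name. The name find_ambiguous_ids and the tuple layout are made up for this example; the real check would presumably be a query against the relational database.

```python
from collections import defaultdict

# Rows copied from the query output above: (hjid, id, symbol, orf).
rows = [
    (1509564, "S000000068", "TY1A-PR1", "YAR010C"),
    (1509564, "S000000068", "TY1A-PR1", "YPR137C-A"),
    (1509564, "S000000068", "TY1A-A", "YAR010C"),
    (1509564, "S000000068", "TY1A-A", "YPR137C-A"),
    (1128516, "S000000168", "RPS8A", "YBL072C"),
    (1128516, "S000000168", "RPS8B", "YBL072C"),
    (1128516, "S000000168", "RPS8A", "YER102W"),
    (1128516, "S000000168", "RPS8B", "YER102W"),
    (1043286, "S000000183", "RPL23B", "YER117W"),
    (1043286, "S000000183", "RPL23B", "YBL087C"),
    (1043286, "S000000183", "RPL23A", "YBL087C"),
    (1043286, "S000000183", "RPL23A", "YER117W"),
]


def find_ambiguous_ids(rows):
    """Map each SGD ID with >1 symbol or >1 ORF to (symbols, orfs)."""
    symbols = defaultdict(set)
    orfs = defaultdict(set)
    for _hjid, sgd_id, symbol, orf in rows:
        if symbol:
            symbols[sgd_id].add(symbol)
        if orf:
            orfs[sgd_id].add(orf)
    result = {}
    for sgd_id in sorted(set(symbols) | set(orfs)):
        if len(symbols[sgd_id]) > 1 or len(orfs[sgd_id]) > 1:
            result[sgd_id] = (sorted(symbols[sgd_id]), sorted(orfs[sgd_id]))
    return result


for sgd_id, (syms, names) in find_ambiguous_ids(rows).items():
    print(sgd_id, syms, names)
```

Each of the three IDs above has 2 symbols and 2 ORF names, which is why the query returns 4 rows (every symbol paired with every ORF) for each.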

And finally, here are some "old friends" (again a subset):

  hjid   |     id     | symbol |           orf
---------+------------+--------+-------------------------
 1529063 | S000000200 |        | YBL104C/YBL103C-A
  712746 | S000000302 | MMS4   | YBR098W/YBR100W
  888697 | S000000510 | PGS1   | YCL004W/YCL003W
  145116 | S000000520 | BUD3   | YCL014W/YCL013W/YCL012W
 1738671 | S000005435 |        | YOL075C/YOL074C
 1745387 | S000005522 |        | YOL162W/YOL163W
 1745387 | S000005523 |        | YOL162W/YOL163W
 1792808 | S000005613 | YVC1   | YOR087W/YOR088W
    5540 | S000005765 | ABP140 | YOR239W/YOR240W

...there are 41 of these in all.  Presumably we'll want to "split" somehow, but I figure that their being in the SGD table may cause an issue, since that would duplicate/triplicate the SGD ID record?
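
One possible way to "split," sketched in Python purely for discussion: break the slash-delimited ORF field into one row per ORF name. The function name split_orfs is hypothetical, and whether splitting is even the right treatment is exactly the open question.

```python
def split_orfs(record):
    """Expand one (hjid, id, symbol, orf) record into one record per ORF.

    Assumes multiple ORF names are separated by "/" as in the rows above.
    """
    hjid, sgd_id, symbol, orf_field = record
    return [
        (hjid, sgd_id, symbol, orf)
        for orf in orf_field.split("/")
        if orf  # skip empty pieces, e.g. from a blank ORF field
    ]


for row in split_orfs((145116, "S000000520", "BUD3", "YCL014W/YCL013W/YCL012W")):
    print(row)
```

For BUD3 this yields three rows that all share S000000520, which is the duplicated/triplicated SGD ID I'm worried about.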

Let me know how you'd like to deal with these...meanwhile, I'll implement what you said about the ORF filling in for the symbol/primary when that isn't around.  I'll let you know if there's a new test build lying around.
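
As I understand the rule, it amounts to something like the following Python sketch (the function name fill_symbol is hypothetical, and the actual change would live in the GenMAPP Builder export code, not in Python):

```python
def fill_symbol(sgd_id, symbol, orf):
    """Return (id, symbol, orf), using the ORF as the symbol when it is missing.

    Treats None and whitespace-only values as missing (an assumption of this
    sketch). If both symbol and ORF are missing, the symbol stays empty.
    """
    symbol = (symbol or "").strip()
    orf = (orf or "").strip()
    return (sgd_id, symbol or orf, orf)


print(fill_symbol("S000005613", "YVC1", "YOR087W/YOR088W"))  # symbol kept
print(fill_symbol("S000005435", "", "YOL075C/YOL074C"))      # ORF copied over
```

Note that this only rescues records that have an ORF; a record with neither a symbol nor an ORF would still come out with an empty symbol.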

John David N. Dionisio, PhD
Assistant Professor, Computer Science
Loyola Marymount University

On Dec 4, 2009, at 11:20 AM, Kam Dahlquist wrote:

> Hi,
> 
> See below:
> 
> Kam
> 
> At 09:41 PM 12/3/2009, you wrote:
>> Hi Kam,
>> 
>> I've uploaded a test export to the wiki:
>> 
>> https://www.cs.lmu.edu/biodb/fall2009/index.php/File:Sc-Std_20091203-test.gdb
>> 
>> This .gdb has a table called SGDTest, which is a candidate for what the SGD table should really be.  Please take a look and see if this appears correct (or at least on the right track  :)  ).  There are fewer records overall in this combo SGDTest table, as it contains only the records that have all 3 IDs.
> 
> This is the right track, but...
> 
> If a gene does not have a gene symbol (like ACT1), its ORF ID is used instead.  The difference between the SGD table and the SGDTest table is 1219 records; I am guessing that most, if not all, of them got left out because they did not have a gene symbol.  In that case, their ORF ID should be copied over into the gene symbol field.
> 
> Also, don't forget that some of the ORF IDs are not in the "Y" form, but are in the form as follows:  Q####.  These are mitochondrial genes.
> 
> Somehow in the 2006 yeast gdb, empty data is being tolerated in the ORF or Symbol fields; I'm not sure how that is.  I'm hoping that for every S######### ID there is at least an ORF ID so that if you copy over that to the gene symbol for the ones that are missing, we won't lose any records.
> 
> The 2006 gdb is actually quite poor in terms of data integrity, now that I look at it.
> 
> 
>> Also, I noticed that the Ensembl table in the 2006 version also has more columns...should this also be replicated in the GenMAPP Builder export?  Are there other tables that I might not be remembering?
> 
> No.  Keep in mind, GenMAPP.org is using Ensembl as a primary database so they have the ability to capture more data from Ensembl directly.  I don't think we should try to replicate this table because the info there is pretty much in the UniProt or SGD tables.  The only issue for users is that if they made a MAPP or Expression Dataset using Ensembl in a previous version of the gdb, it won't be compatible with ours.  However, I would wager that 99% of the yeast community would choose SGD as their choice of system, so I think we're OK.
> 
> 
>> Meanwhile, before I left campus, Kenny and Don were off investigating matched XML IDs that were not found in the database. There were 33 in all, and by the time I left a few categories had already emerged --- IDs in comment text only, another ID in a paper title but nowhere else --- so this may turn out to be like A. thaliana.  We'll see what their final report looks like.
> 
> Yeah, except that if it's only found in comment text or a paper title, we probably don't want them.  If there's a list of the 33 somewhere, we can go through them one by one to make these determinations.
> 
> Progress is definitely being made!
> 
> 
>> John David N. Dionisio, PhD
>> Assistant Professor, Computer Science
>> Loyola Marymount University
> 
>