#18 Make chado gff bulk loader respect alt ontology ids

open
Scott Cain
Chado (3)
5
2007-04-18
2007-04-18
Scott Cain
No

From Yuvan:

Hi Gmod,

When I load my genome data into CHADO, I get the following errors
(although the load itself was successful).

couldn't find GO:0005554 's cvterm_id in cvterm table
couldn't find GO:0008372 's cvterm_id in cvterm table
couldn't find GO:0000004 's cvterm_id in cvterm table

I have loaded all six ontologies that come as default options with CHADO (including GO).
These errors occur when my genome annotation uses GO ids that are actually "alt_ids".
While I don't know why alt_ids are used,
I am wondering whether CHADO can be made to store alt_ids as well.

Many thanks in advance,
Sincerely,
Yuvan

Discussion

  • I'm also finding this problem:

    (Re)creating the uniquename cache in the database...
    Creating table...
    Populating table...
    Creating indexes...
    Adjusting the primary key sequences (if necessary)...Done.
    Preparing data for inserting into the chado database
    (This may take a while ...)
    couldn't find GO:0009281 's cvterm_id in cvterm table
    couldn't find GO:0009054 's cvterm_id in cvterm table
    couldn't find GO:0009053 's cvterm_id in cvterm table
    couldn't find GO:0009281 's cvterm_id in cvterm table
    couldn't find GO:0009281 's cvterm_id in cvterm table
    couldn't find GO:0009281 's cvterm_id in cvterm table
    couldn't find GO:0009054 's cvterm_id in cvterm table
    couldn't find GO:0009054 's cvterm_id in cvterm table
    couldn't find GO:0009054 's cvterm_id in cvterm table
    couldn't find GO:0009054 's cvterm_id in cvterm table
    couldn't find GO:0009053 's cvterm_id in cvterm table
    couldn't find GO:0009053 's cvterm_id in cvterm table
    couldn't find GO:0009053 's cvterm_id in cvterm table
    couldn't find GO:0009053 's cvterm_id in cvterm table
    couldn't find GO:0009053 's cvterm_id in cvterm table
    couldn't find GO:0009054 's cvterm_id in cvterm table
    couldn't find GO:0009054 's cvterm_id in cvterm table
    couldn't find GO:0009281 's cvterm_id in cvterm table
    couldn't find GO:0009053 's cvterm_id in cvterm table
    [...]

    for many lines...

    Since I'm using GFF3 files generated by bp_genbank2gff3.pl, it seems this could be solved at either of two points in the process: at GFF3 creation or in the upload script. Either could check the IDs against AmiGO or a similar service (assuming a live net connection), without having to modify the core upload code.

     
  • Scott Cain
    2010-03-19

    First, I would say that finding out why the alt ids are being used is a good idea; perhaps there is a reason and that should be encoded into your data.

    I believe the alt IDs are stored in chado when the cv is loaded (probably in cvterm_dbxref). Making the bulk loader try looking them up when the lookup on the main ID fails could be done, at the expense of added time to the load.

    Having the genbank script do the conversion is a bad idea though--doing an automated change at that stage is a little too much magic and might lead to misleading results.

    Of course, the short term fix is to change the GFF file manually, and in my experience, there are relatively few ids so doing a global search and replace would probably not take long.
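The fallback lookup described in this comment could look roughly like the sketch below. It is not the bulk loader's code (the loader itself is a Perl script); it is a Python illustration that assumes alt IDs really are stored as rows in cvterm_dbxref, which the comment above only suggests is probable. The function name `find_cvterm_id` and the demo helper `build_demo_db` are hypothetical, and the demo uses an in-memory SQLite database with a cut-down subset of the Chado columns rather than a real Chado instance.

```python
import sqlite3


def find_cvterm_id(conn, db_name, accession):
    """Return the cvterm_id for db_name:accession, or None if not found.

    Tries the term's primary dbxref first, then falls back to
    cvterm_dbxref (assumed here to hold alt IDs).
    """
    cur = conn.cursor()
    # Primary lookup: cvterm.dbxref_id points at the term's main ID.
    cur.execute(
        "SELECT c.cvterm_id FROM cvterm c "
        "JOIN dbxref x ON x.dbxref_id = c.dbxref_id "
        "JOIN db d ON d.db_id = x.db_id "
        "WHERE d.name = ? AND x.accession = ?",
        (db_name, accession),
    )
    row = cur.fetchone()
    if row is not None:
        return row[0]
    # Fallback lookup: secondary dbxrefs attached to a term.
    cur.execute(
        "SELECT cd.cvterm_id FROM cvterm_dbxref cd "
        "JOIN dbxref x ON x.dbxref_id = cd.dbxref_id "
        "JOIN db d ON d.db_id = x.db_id "
        "WHERE d.name = ? AND x.accession = ?",
        (db_name, accession),
    )
    row = cur.fetchone()
    return row[0] if row is not None else None


def build_demo_db():
    """Create a tiny in-memory stand-in for the relevant Chado tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE db (db_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE dbxref (dbxref_id INTEGER PRIMARY KEY,
                             db_id INTEGER, accession TEXT);
        CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY,
                             name TEXT, dbxref_id INTEGER);
        CREATE TABLE cvterm_dbxref (cvterm_id INTEGER, dbxref_id INTEGER);
        INSERT INTO db VALUES (1, 'GO');
        -- Illustrative rows only: one main ID and one alt ID for a term.
        INSERT INTO dbxref VALUES (1, 1, '0003674');
        INSERT INTO dbxref VALUES (2, 1, '0005554');
        INSERT INTO cvterm VALUES (10, 'molecular_function', 1);
        INSERT INTO cvterm_dbxref VALUES (10, 2);
        """
    )
    return conn
```

The second query only runs when the first one misses, so files that use main IDs throughout pay no extra cost. Note that cvterm_dbxref can also hold cross-references that are not alt IDs, so a real implementation would probably want to log every fallback hit for review rather than accept it silently.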

     
  • Scott Cain
    2010-04-09

    When I wrote that there are usually only a few IDs in a given GFF file causing the problems, I meant relatively few unique IDs. Once you've looked at a given ID that is causing a problem, and satisfied yourself that a change to another ID is acceptable, a global search and replace for that ID is a simple task.
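The manual search and replace described here can also be scripted once each ID has been reviewed. The following is a minimal Python sketch, not part of any GMOD tool; the function name `replace_go_ids` is hypothetical, and the mapping in the usage example is illustrative and should be checked against the ontology file before use.

```python
import re

# Matches a complete GO ID, so a replacement can never hit part of a longer ID.
GO_ID = re.compile(r"GO:\d{7}")


def replace_go_ids(gff_text, replacements):
    """Swap every GO ID found in `replacements` for its reviewed substitute.

    IDs that are not keys of `replacements` are left untouched.
    """
    return GO_ID.sub(
        lambda m: replacements.get(m.group(0), m.group(0)), gff_text
    )


def find_go_ids(gff_text):
    """Return the sorted set of unique GO IDs, to review before replacing."""
    return sorted(set(GO_ID.findall(gff_text)))
```

For example, assuming GO:0005554 has been reviewed and GO:0003674 judged an acceptable substitute:

```python
line = "chr1\t.\tgene\t1\t100\t.\t+\t.\tID=g1;Ontology_term=GO:0005554,GO:0016020"
print(replace_go_ids(line, {"GO:0005554": "GO:0003674"}))
```

Because the mapping is built by hand, each change stays an explicit, reviewed decision rather than an automatic one.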

    I will look into automatic acceptance of alt IDs; doing this automatically makes me uncomfortable though.