Re: [GMOD-devel] Storing Blast hits and the blast2gff script

Eric (& Scott), 

I think there is still a problem with IDs in the common case of converting
blast output to chado analysis features.  The procedure I described here
  http://www.gmod.org/wiki/index.php/Load_BLAST_Into_Chado
avoids it by not creating IDs in the blast > gff step
(bp_search2gff.pl), so that the next step (gmod_bulk_load_gff3.pl) will
auto-generate unique IDs for the chado features.

bp_search2gff.pl is not smart enough to create unique IDs.  In some cases
you won't see this, but consider the case of chromosome DNA x 2 proteins/ESTs:
   geneA matches chr1:100-200,300-400 (first hit with 2 HSPs in blast-speak,
                                  or first match with 2 match_parts in SO-speak).
   geneB matches chr1:500-600
   geneA matches chr1:700-800,900-1000  (as above: a second hit, or match).
   (This is a fairly common case: ~50% of genes are in gene families.)

If run with the -match or -addid options, bp_search2gff.pl will add ID=geneA to both
geneA matches.  Given only the blast input, the two ways to make those IDs unique are to
(a) add the chromosome location (and possibly also the protein target location) to the ID string,
or (b) generate a unique number for each match ID.

The latter is better done in gmod_bulk_load_gff3.pl, which checks your chado DB for uniqueness.
The former (e.g. ID=geneA_chr1_700_1000) can get rather long and messy.
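
To make the collision and the two fixes concrete, here is a rough Python sketch
using the three example matches above.  The Match tuple, location_id() and
numbered_ids() are names I made up for illustration; none of this is code from
bp_search2gff.pl or gmod_bulk_load_gff3.pl, and the counter is only unique within
one run (it does not check a chado DB the way the loader does).

```python
from collections import namedtuple
from itertools import count

# Hypothetical record for one blast hit (an SO "match"); not a BioPerl type.
Match = namedtuple("Match", "query ref start end")

# The three example matches: geneA hits chr1 twice, geneB once.
matches = [
    Match("geneA", "chr1", 100, 400),
    Match("geneB", "chr1", 500, 600),
    Match("geneA", "chr1", 700, 1000),
]

def location_id(m):
    """Option (a): fold the chromosome location into the ID string."""
    return f"{m.query}_{m.ref}_{m.start}_{m.end}"

def numbered_ids(matches, prefix="match"):
    """Option (b): one running number per match (unique within this run only)."""
    counter = count(1)
    return [f"{prefix}{next(counter)}" for _ in matches]

naive_ids = [m.query for m in matches]       # query name used as the ID
loc_ids = [location_id(m) for m in matches]  # option (a)
num_ids = numbered_ids(matches)              # option (b)

print(len(naive_ids), "matches but only", len(set(naive_ids)), "distinct naive IDs")
print(loc_ids)
print(num_ids)
```

Option (a) stays readable for one match but grows quickly if target coordinates are
added as well; option (b) stays short but needs something that knows which numbers
are already taken.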

I think it would be good to have an option in gmod_bulk_load_gff3.pl, perhaps on by default for
analysis inputs but overridable, that replaces any ID in the input gff
with an autogenerated ID.  It would need to be smart enough to
also replace the Parent=XXX of any match parts with the same generated ID.  Either way,
assume that bp_search2gff.pl output does NOT have unique IDs.
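
A minimal sketch of that ID/Parent replacement, in Python rather than Perl and
not taken from gmod_bulk_load_gff3.pl.  The function name regenerate_ids(), the
"auto" prefix and the sample lines are hypothetical.  It assumes each match line
is followed by its own match_part lines before the same ID is reused, and that
each Parent holds a single ID; without the ordering assumption a duplicated ID
cannot be resolved from the gff alone.

```python
from itertools import count

def regenerate_ids(gff_lines, prefix="auto"):
    """Replace every ID= with a fresh ID and point Parent= at the same new ID.

    Assumes children directly follow their parent feature, so a Parent value
    refers to the most recent feature that carried that original ID.
    """
    counter = count(1)
    current = {}  # original ID -> most recently generated replacement
    out = []
    for line in gff_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        # Pass comments, blank lines and malformed rows through untouched.
        if line.startswith("#") or len(cols) != 9:
            out.append(line)
            continue
        new_attrs = []
        for attr in cols[8].split(";"):
            key, sep, value = attr.partition("=")
            if key == "ID":
                new_id = f"{prefix}{next(counter)}"
                current[value] = new_id
                value = new_id
            elif key == "Parent" and value in current:
                value = current[value]
            new_attrs.append(f"{key}{sep}{value}")
        cols[8] = ";".join(new_attrs)
        out.append("\t".join(cols))
    return out

# Hypothetical input with the duplicated ID=geneA from the example above.
sample = [
    "chr1\tblast\tmatch\t100\t400\t.\t+\t.\tID=geneA",
    "chr1\tblast\tmatch_part\t100\t200\t.\t+\t.\tParent=geneA",
    "chr1\tblast\tmatch_part\t300\t400\t.\t+\t.\tParent=geneA",
    "chr1\tblast\tmatch\t700\t1000\t.\t+\t.\tID=geneA",
    "chr1\tblast\tmatch_part\t700\t800\t.\t+\t.\tParent=geneA",
    "chr1\tblast\tmatch_part\t900\t1000\t.\t+\t.\tParent=geneA",
]
fixed = regenerate_ids(sample)
for row in fixed:
    print(row)
```

A real loader option would draw the new IDs from the DB rather than a local
counter, and would probably want to keep the original name (e.g. as Name=geneA)
so the hit can still be traced back to its query.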

-- Don