From: Don G. <gil...@cr...> - 2007-10-06 02:04:11
Eric (& Scott),

There is, I think, still a problem with IDs in the common case of converting blast output to chado analysis features. The case I described here

http://www.gmod.org/wiki/index.php/Load_BLAST_Into_Chado

avoids it by not creating IDs in the blast > gff step (bp_search2gff.pl), so that the next step (gmod_bulk_load_gff3.pl) auto-generates unique IDs for the chado features. bp_search2gff.pl is not smart enough to create unique IDs.

In some cases you won't see this, but consider the case of a chromosome DNA x 2 proteins/ESTs:

  geneA matches chr1:100-200,300-400
    (first hit with 2 HSPs in blast-speak, or first match with 2 match_parts in SO-speak)
  geneB matches chr1:500-600
  geneA matches chr1:700-800,900-1000
    (like above, second match or hit)

(This is a fairly common case: ~50% of genes are in gene families.)

bp_search2gff.pl, if run with the -match or -addid options, will add ID=geneA for both matches. The only ways to make those into two unique IDs, given the blast input, are either (a) to add the chromosome locations into the ID string (and possibly also the protein target locations), or (b) to generate a unique number for each match ID. The latter is better done in gmod_bulk_load_gff3.pl, which checks your chado DB for uniqueness. The former (e.g. ID=geneA_chr1_700_1000) can get rather long and messy.

I think it would be good to have an option in gmod_bulk_load_gff3.pl, maybe the default for analysis inputs and overridable if desired, where any ID in the input gff is replaced by an autogenerated ID. However, one would need to make it smart enough to also replace the input Parent=XXX of any match parts with the same generated ID.

But assume that bp_search2gff.pl output does NOT have unique IDs.

-- Don
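
The ID replacement proposed in the message above can be sketched in Python as follows. This is only an illustration under stated assumptions, not code from gmod_bulk_load_gff3.pl: the function name `regenerate_ids`, the `auto` prefix, and the in-memory counter are all hypothetical, the counter merely stands in for a uniqueness check against the chado DB, and the sketch assumes each match_part line comes after the match line it belongs to.

```python
import itertools


def regenerate_ids(gff_lines, prefix="auto"):
    """Replace every ID= in GFF3 lines with a generated ID and remap Parent=.

    Hypothetical helper: a Parent value is resolved to the generated ID of
    the most recent preceding feature that carried that original ID.
    """
    counter = itertools.count(1)  # stand-in for a DB-backed uniqueness check
    current = {}                  # original ID -> most recently generated ID
    out = []
    for line in gff_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        # Pass comments, directives, and malformed lines through unchanged.
        if line.startswith("#") or len(cols) != 9:
            out.append(line)
            continue
        pending = {}
        attrs = []
        for pair in cols[8].split(";"):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            if key == "Parent":
                # Resolve parents against IDs seen on earlier lines only.
                value = ",".join(current.get(p, p) for p in value.split(","))
            elif key == "ID":
                pending[value] = f"{prefix}{next(counter)}"
                value = pending[value]
            attrs.append(f"{key}={value}")
        current.update(pending)
        cols[8] = ";".join(attrs)
        out.append("\t".join(cols))
    return out


def _row(ftype, start, end, attrs):
    """Build one tab-separated GFF3 line on chr1 (demo data only)."""
    return "\t".join(
        ["chr1", "blast", ftype, str(start), str(end), ".", "+", ".", attrs]
    )


# The geneA / geneB / geneA example from the message, with duplicate IDs.
demo = [
    _row("match", 100, 400, "ID=geneA"),
    _row("match_part", 100, 200, "Parent=geneA"),
    _row("match_part", 300, 400, "Parent=geneA"),
    _row("match", 500, 600, "ID=geneB"),
    _row("match_part", 500, 600, "Parent=geneB"),
    _row("match", 700, 1000, "ID=geneA"),
    _row("match_part", 700, 800, "Parent=geneA"),
    _row("match_part", 900, 1000, "Parent=geneA"),
]

if __name__ == "__main__":
    for fixed in regenerate_ids(demo):
        print(fixed.split("\t")[8])
```

Run on the example, the two geneA matches receive distinct generated IDs and each match_part keeps pointing at its own match. The ordering assumption matters: if match parts could precede their match, or parts of different matches were interleaved, the input ID alone would not be enough to decide which match a Parent=geneA refers to.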