From: Eric J. <e-...@no...> - 2007-10-08 13:39:49
|
Hi Don, I have requested, however, that the bioperl script create the necessary IDs here: http://bugzilla.open-bio.org/show_bug.cgi?id=2376 This is a more robust solution, and if we can get the bioperl people on board, it will make for a better script on their end and less munging on our end. Scott had suggested that, and I think it's a good idea. Of course, we still want to be able to perform our use case of loading blast reports, so I'll create an ID hack here and contribute it to GMOD should the bioperl issue not get addressed. I think they are very good about tracking issues though so we'll have an answer soon enough. > > bp_search2gff.pl is not smart enough to create unique ids. In some cases > you wont see this, but consider the case of a chromosome dna x 2 proteins/ESTs: > geneA matches chr1:100-200,300-400 (first hit with 2 hsps, in blast-speak > or first match with 2 match_parts in SO-speak). > geneB matches chr1:500-600 > geneA matches chr1:700-800,900-1000 (like above, second match or hit). > (this is a fairly common case: ~50% of genes are in gene families). > bp_search2gff.pl if run with the -match or -addid options will add ID=geneA for both > matches. The only way to make those into two unique IDs given the blast input is either > (a) add the chromosome locations into the ID string (and possibly also protein target locations), > or (b) generate a unique number for each match ID. As far as having one gene match multiple places on a chromosome. BLAST is not smart enough to create different hit features for the two matches of geneA to the chromosome. Both sets of geneA HSPs would be reported in one hit to the chromosome in a BLAST report (four HSPs). I agree it's not an ideal situation, but for representing raw BLAST data in the bp_search2gff script, using the location in the ID would not be necessary . I agree though, after we sort out loading raw BLAST reports , it would be nice to have have some algorithm to split up those HSPs into different parent features, in that case a location would be necessary to make the parent feature ID unique. > > This latter is better done in gmod_bulk_load_gff3, which checks your chado DB for uniqueness. > The former (e.g. ID=geneA_chr1_700_1000) can get rather long and messy. > > I think it would be good to have an option in gmod_bulk_load_gff3.pl, maybe the default for > analysis inputs that could be overridden if desired, where any ID in the input gff > is replaced by an autogenerated ID. However one would need to make it smart enough to > also replace input Parent=XXX for any match parts, with the same generated ID. But > assume that bp_search2gff output does NOT have unique IDs. > > -- Don > > |