Re: [GMOD-devel] Storing Blast hits and the blast2gff script

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Hi Don,

I have requested, however, that the bioperl script create the
necessary IDs here:

http://bugzilla.open-bio.org/show_bug.cgi?id=2376

This is a more robust solution, and if we can get the bioperl people
on board, it will make for a better script on their end and less
munging on our end.  Scott had suggested that, and I think it's a good
idea.  Of course, we still want to be able to perform our use case of
loading blast reports, so I'll create an ID hack here and contribute
it to GMOD should the bioperl issue not get addressed.  I think they
are very good about  tracking issues though so we'll have an answer
soon enough.

>
> bp_search2gff.pl is not smart enough to create unique ids.  In some cases
> you wont see this, but consider the case of a chromosome dna x 2 proteins/ESTs:
>    geneA matches chr1:100-200,300-400 (first hit with 2 hsps, in blast-speak
>                                   or first match with 2 match_parts in SO-speak).
>    geneB matches chr1:500-600
>    geneA matches chr1:700-800,900-1000  (like above, second match or hit).
>    (this is a fairly common case: ~50% of genes are in gene families).

> bp_search2gff.pl if run with the -match or -addid options will add ID=geneA for both
> matches.  The only way to make those into two unique IDs given the blast input is either
> (a) add the chromosome locations into the ID string (and possibly also protein target locations),
> or (b) generate a unique number for each match ID.

As far as having one gene match multiple places on a chromosome.
BLAST is not smart enough to create different hit features for the two
matches of geneA to the chromosome.  Both sets of geneA HSPs would be
reported in one hit to the chromosome in a BLAST report (four HSPs).
I agree it's not an ideal situation, but for representing raw BLAST
data in the bp_search2gff script, using the location in the ID would
not be necessary .

I agree though, after we sort out loading raw BLAST reports , it would
be nice to have have some algorithm to split up those HSPs into
different parent features, in that case a location would be necessary
to make the parent feature ID unique.

>
> This latter is better done in gmod_bulk_load_gff3, which checks your chado DB for uniqueness.
> The former (e.g. ID=geneA_chr1_700_1000) can get rather long and messy.
>
> I think it would be good to have an option in gmod_bulk_load_gff3.pl, maybe the default for
> analysis inputs that could be overridden if desired, where any ID in the input gff
> is replaced by an autogenerated ID.  However one would need to make it smart enough to
> also replace input Parent=XXX for any match parts, with the same generated ID.  But
> assume that bp_search2gff output does NOT have unique IDs.
>
> -- Don
>
>