When a fasta header contains spaces, the bsml2fasta component tries to match either the id attribute of the Sequence tag (e.g. "g1_a") or the identifier attribute of the Seq-data-import tag (e.g. "g1|a") from the BSML against the description line of the fasta file. Neither matches, because both have everything after the first space truncated, while the fasta file retains it (e.g. "g1|a stuff"). The entire definition line is available in an Attribute element under the Sequence element.
This was found running bsml2fasta.prediction_CDS in the prokaryotic annotation pipeline.
Oops, I jumped the gun on this. I got the same error on a fasta file with no spaces in the description (BSML::Indexer::Fasta truncates the description at the first space). I tracked it down to a bug in the parse_multi_fasta subroutine in bsml2fasta.pl. The script builds the $sequencelookup hash keyed on the scrubbed fasta_id, and then parse_multi_fasta tries to access it with the unscrubbed version. The fix was relatively simple:
- if(exists $e{$h{$sequencelookup->{$specified_header}->{'fasta_id'}}}){
+ if(exists $e{$h{$specified_header}}){
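
For anyone hitting something similar, here is a minimal standalone sketch of the key mismatch described above. It is not code from bsml2fasta.pl: scrub_id is a hypothetical stand-in for whatever scrubbing the script really does (I am assuming it turns characters like "|" into "_", going by the "g1|a" / "g1_a" example), and %lookup only imitates the shape of $sequencelookup.

```perl
use strict;
use warnings;

# Hypothetical scrub rule (an assumption, not taken from bsml2fasta.pl):
# replace anything other than letters, digits and underscore with "_".
sub scrub_id {
    my ($id) = @_;
    (my $scrubbed = $id) =~ s/[^A-Za-z0-9_]/_/g;
    return $scrubbed;
}

my $header = 'g1|a';    # unscrubbed header, as it appears in the fasta file

# Stand-in for $sequencelookup: keyed on the scrubbed id.
my %lookup = ( scrub_id($header) => { fasta_id => $header } );

# Looking up with the unscrubbed header misses; the scrubbed key hits.
my $miss = exists $lookup{$header}             ? 1 : 0;
my $hit  = exists $lookup{ scrub_id($header) } ? 1 : 0;

print "unscrubbed key found: $miss\n";
print "scrubbed key found:   $hit\n";
```

Either side of the lookup can be changed to agree with the other; the diff above takes the route of not going through $sequencelookup at all for this check.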