My AMOS bank appears to contain errors. I detected these problems using both analyzeSNPs and hawkeye. My assembly is a hybrid and was produced by the Celera Assembler.
I used: toAmos -f gsb/antrc230_0.1_hybrid.frg -a gsb/gs35/9-terminator/gs35.asm -o - -utg | bank-transact -m - -b gsb-utg.bnk -c
I have included hawkeye screen shots of: a unitig with no reads showing (I realise that contigs can have missing reads because of surrogates); the basic statistics of that unitig; a unitig whose consensus does not match its reads; the corresponding section of the contig, showing the same reads with the correct consensus; and a unitig with some reads that seem misplaced.
The assembly has not yet been published, so I have not included it, but I may be able to provide it confidentially.
a hawkeye screen shot of a unitig with incorrect consensus sequence
a hawkeye screen shot of the corresponding contig showing the same reads with correct consensus sequence
a hawkeye screen shot of a surrogate with some reads that seem to be misplaced.
a hawkeye screen shot of a unitig with no reads shown.
a hawkeye screen shot of the corresponding data for the unitig with no reads shown.
From my supervisor:
"It appears to be a rather significant bug with how toAmos treats the data from an ASM file. It makes extensive use of associative arrays indexed by object identifiers. In the case of a normal output, there would only be a single reference to any read, whereas in the case of Unitig output, we get surrogates that break the assumption that associative identifiers are unique.
From the UTG you gave me, I can see that toAmos has used the same record in both placement instances. It might require extending the associative ids to avoid the redundancy."
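The failure mode described above can be illustrated with a small sketch. This is not toAmos code (toAmos is a Perl script); it is a Python illustration under assumed data, and the record layout, the IDs (`utg1`, `utg2`, `r1`, `r2`) and the function name are all hypothetical. It only shows how a table keyed by read ID alone loses a placement when the same read appears under two parents, as can happen with surrogates.

```python
# Hypothetical placement records: (parent_id, read_id, offset).
# Read "r1" is placed twice, as could happen when a surrogate unitig
# is used in more than one location.
placements = [
    ("utg1", "r1", 0),
    ("utg1", "r2", 120),
    ("utg2", "r1", 450),
]


def index_by_read(records):
    """Index placements by read ID only.

    A later record with the same read ID silently overwrites the
    earlier one, so only one placement per read survives.
    """
    table = {}
    for parent, read, offset in records:
        table[read] = (parent, offset)
    return table


by_read = index_by_read(placements)

# Three records went in, but only two keys remain.
print(len(placements), len(by_read))
# The utg1 placement of r1 has been replaced by the utg2 placement.
print(by_read["r1"])
```

Any code that later looks up `r1` while writing out `utg1` would receive the record that belongs to `utg2`, which would be consistent with the symptom of one record being used for both placement instances.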
From my supervisor:
"I believe I've fixed the problem. Let me know if you still have problems.
All I did was make sure that the hashes that looked to be the source of the errors had unique keys. I simply extended the key to be both the read ID and the parent contig ID. You won't see any difference in the AFG output with respect to IIDs or EIDs."
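The idea of the fix, extending the key to include both the read ID and the parent ID, can be sketched as follows. This is again a Python illustration and not the actual toAmos change; the record layout, IDs and function name are hypothetical.

```python
# Hypothetical placement records: (parent_id, read_id, offset).
# Read "r1" appears under two different parents.
placements = [
    ("utg1", "r1", 0),
    ("utg1", "r2", 120),
    ("utg2", "r1", 450),
]


def index_by_read_and_parent(records):
    """Index placements by the composite key (read ID, parent ID).

    Because the parent ID is part of the key, the same read placed
    under two parents yields two distinct entries.
    """
    table = {}
    for parent, read, offset in records:
        table[(read, parent)] = offset
    return table


by_pair = index_by_read_and_parent(placements)

# All three placements are preserved.
print(len(by_pair))
# Each placement of r1 can be looked up independently.
print(by_pair[("r1", "utg1")], by_pair[("r1", "utg2")])
```

Only the internal lookup key changes in this sketch; the read ID itself is untouched, which fits the statement that the IIDs and EIDs in the AFG output stay the same.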