
#10 toAmos failing to handle alphanumeric IIDs

Status: open
Owner: nobody
Labels: converters (5)
Priority: 5
Updated: 2009-05-20
Created: 2009-05-20
Private: No

When converting Celera WGS output to AMOS format, toAmos fails to honor the updated clear ranges defined by augmented fragment (AFG) records. This leads to invalid read coordinates in the .afg file, which show up as a high number of discrepancies in Hawkeye or with amosvalidate.

Investigating this issue led me to find that the problem lies in the method AMOS::AmosLib::getCAId.

The parsing regular expression

/\((\d+),(\d+)\)/

assumes both parts of the "paired" ID to be strictly numeric, but the Celera assembler imposes no such constraint on the first part. Consequently, when the first part is alphanumeric, the pattern does not match, and the method incorrectly treats the passed string as a real ID and returns it in its entirety.
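The behavior described above can be reproduced with a minimal sketch. It is written in Python for illustration only and is not the AmosLib source: the function name `get_ca_id`, the sample IDs, and the choice to return only the first captured part on a match are assumptions. The regular expression is the one quoted in this report.

```python
import re

# Pattern quoted in the report: both parts must be digits.
ORIGINAL_PATTERN = re.compile(r"\((\d+),(\d+)\)")

def get_ca_id(raw):
    """Hypothetical stand-in for getCAId (assumed behavior, not the real code)."""
    match = ORIGINAL_PATTERN.search(raw)
    if match:
        # Assumption: the first part of the pair is the ID of interest.
        return match.group(1)
    # No match: the string is treated as a plain ID and returned unchanged.
    return raw

numeric_result = get_ca_id("(1234,56)")           # pair is split as intended
alphanumeric_result = get_ca_id("(frag_A12,56)")  # pair comes back whole
print(numeric_result, alphanumeric_result)
```

With a numeric first part the pair is split; with an alphanumeric first part the whole "(frag_A12,56)" string is returned.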

The return values of this method are used as hash table keys by toAmos.pl, in particular for the hash table seq_range. This hash is first populated from the FRG file with keys that are real IDs; each AFG record in the ASM file should then update the corresponding element. Because getCAId returns the entire "paired" ID in the case described above, the hash lookup fails when the AFG records are read, and the clear range is never updated.
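The failed lookup can be sketched as follows. This is a Python illustration under stated assumptions, not the toAmos.pl code: `seq_range` is modeled as a dictionary, `apply_afg` is a hypothetical helper, and the IDs and clear ranges are made-up sample values.

```python
import re

def get_ca_id(raw):
    """Hypothetical stand-in for getCAId using the pattern from the report."""
    match = re.search(r"\((\d+),(\d+)\)", raw)
    return match.group(1) if match else raw

# Populated from the FRG file: keys are the real IDs (sample values).
seq_range = {"1234": (0, 800), "frag_A12": (0, 750)}

def apply_afg(paired_id, clear_range):
    """Update the clear range for an AFG record; report whether a key was found."""
    key = get_ca_id(paired_id)
    if key in seq_range:
        seq_range[key] = clear_range
        return True
    return False

updated_numeric = apply_afg("(1234,56)", (10, 700))           # key "1234" is found
updated_alphanumeric = apply_afg("(frag_A12,57)", (20, 690))  # key is the whole pair
print(updated_numeric, updated_alphanumeric, seq_range)
```

The record with the alphanumeric ID keeps its original FRG clear range, which matches the stale coordinates described in this report.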

I have attached a patch which expands the first capturing group in getCAId to accept any non-whitespace string.
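The attached patch is not reproduced here. As a rough sketch of the idea, assuming the change amounts to replacing the first `\d+` with `\S+`, the relaxed pattern behaves as follows. It is shown in Python for illustration, and `get_ca_id_patched` is a hypothetical name.

```python
import re

# Assumed form of the patched pattern: first part is any non-whitespace run,
# second part stays strictly numeric.
PATCHED_PATTERN = re.compile(r"\((\S+),(\d+)\)")

def get_ca_id_patched(raw):
    """Hypothetical stand-in for the patched getCAId."""
    match = PATCHED_PATTERN.search(raw)
    if match:
        return match.group(1)
    return raw

patched_alphanumeric = get_ca_id_patched("(frag_A12,56)")
patched_numeric = get_ca_id_patched("(1234,56)")
print(patched_alphanumeric, patched_numeric)
```

Because `\S+` is greedy, the regex engine backtracks to the last comma that is followed by digits and a closing parenthesis, so the first part may itself contain punctuation.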

This has resolved the issue in our environment.

Discussion

  • Matthew Z DeMaere

    Simple patch.

     
  • floflooo

    floflooo - 2010-07-02

    Hi Matt,
    Thanks for your bug report.
    I know that this is not an authoritative reference, but this webpage mentions only numbers as IDs for Celera Assembler: http://www.cbcb.umd.edu/research/contig_representation.shtml#CA
    Do you have a link that explains the Celera Assembler output in detail?
    Also, could you provide a small input file and the commands used to reproduce the bug you encountered? Thanks

     
