An Interactive Visual Analytics Tool for Genome Assemblies.
Version 2.0 - August 2011
Schatz, M.C., Phillippy, A.M., Shneiderman, B., Salzberg, S.L. (2007) Genome Biology 8:R34.
Genome assembly remains an inexact science. Even when accomplished with the best software available, the assembly of a genome often contains numerous errors, both small and large. Hawkeye is a visual analytics tool for genome assembly analysis and validation, designed to aid in identifying and correcting assembly errors. Hawkeye blends the best practices from information and scientific visualization to facilitate inspection of large-scale assembly data while minimizing the time needed to detect mis-assemblies and make accurate judgments of assembly quality.
All levels of the assembly data hierarchy are made accessible to users, along with summary statistics and common assembly metrics. A ranking component guides investigation towards likely mis-assemblies or interesting features to support the task at hand. Wherever possible, high-level overviews, dynamic filtering, and automated clustering are leveraged to focus attention and highlight anomalies in the data. Hawkeye's effectiveness has been proven on several genome projects, where it has been used both to improve quality and to validate the correctness of complex genomes.
Hawkeye is compatible with most widely used assemblers, including Phrap, ARACHNE, Celera Assembler, Newbler, AMOS, and assemblies deposited in the NCBI Assembly Archive.
- Click for a presentation on AMOS Assembly Validation and Visualization. [1.4MB]
- Click for a recorded demonstration of using Hawkeye to analyze a mis-assembly. [2.2MB]
Build & Installation
Hawkeye comes in source form with the AMOS distribution. You should build the entire AMOS distribution even if you only want to run Hawkeye, so that all of the necessary converters and libraries are available. You can download the AMOS source package from: http://sourceforge.net/project/showfiles.php?group_id=134326.
Hawkeye requires Qt 4.x to be installed. The latest version of Qt is currently 4.7.3 and can be downloaded from the Qt website for Unix and Mac OS X: http://qt.nokia.com/downloads. Many Linux distributions come with the Qt runtime libraries by default, but do not have the development package installed. You must install both the runtime libraries and the development package (header files) to build Hawkeye.
The general build process is to run './configure; make; make install' in the AMOS source directory. You may need to explicitly specify the Qt directories to configure when building AMOS with the following options:
$ configure --help
<snip>
  --with-Qt4-dir=DIR          DIR is equal to QTDIR if you have followed the
                              installation instructions of Trolltech. Header
                              files are in DIR/include, binary utilities are
                              in DIR/bin and the library is in DIR/lib. Use
                              the options below to override these defaults
  --with-Qt4-include-dir=DIR
  --with-Qt4-bin-dir=DIR
  --with-Qt4-lib-dir=DIR
  --with-Qt4-lib=LIB          Use -lLIB to link the Qt4 library
More information is available in the INSTALL file within the AMOS tarball.
If you cannot get configure to find or recognize Qt on your system, the alternative is to build AMOS without Hawkeye, and then use the Qt build tool qmake to build Hawkeye separately:
$ cd amos
$ ./configure
$ make
$ cd src/hawkeye
$ qmake
$ make
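When Qt lives in a non-standard location, it can help to check the directory before passing it to configure. The helper below is a minimal sketch of that idea: the function name qt4_configure_args is hypothetical, and the directory check simply mirrors the layout described in the configure help (DIR/include, DIR/bin, DIR/lib). Only the --with-Qt4-dir option itself comes from AMOS.

```shell
#!/bin/sh
# Hypothetical helper (not part of AMOS): print the configure option for a
# Qt 4 install prefix after checking that it has the expected layout.
qt4_configure_args() {
    qtdir=$1
    if [ -z "$qtdir" ]; then
        # No prefix given: print nothing and let configure search on its own.
        return 0
    fi
    for sub in include bin lib; do
        if [ ! -d "$qtdir/$sub" ]; then
            echo "warning: $qtdir/$sub not found; is this a Qt prefix?" >&2
            return 1
        fi
    done
    printf '%s\n' "--with-Qt4-dir=$qtdir"
}

# Example use from the AMOS source directory (assumes QTDIR is set):
#   ./configure $(qt4_configure_args "$QTDIR") && make && make install
```

If the check fails, the more specific --with-Qt4-include-dir, --with-Qt4-bin-dir, and --with-Qt4-lib-dir options listed above can be given individually instead.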
Hawkeye reads the assembly data from an AMOS bank. A bank is a special directory of binary encoded files containing all information on an assembly. A bank is created by the AMOS assemblers directly, or by converting the results of other assemblers into AMOS format. This is typically done with the tools toAmos and bank-transact. toAmos reads the assembly files and converts them to plaintext AMOS messages, and bank-transact reads those messages and creates the binary encoded bank directory. See the File conversion utilities for more information.
As a convenience, a few sample banks are available online at http://sourceforge.net/projects/amos/files/test_data/
After downloading and unpacking, these can be viewed using:
$ hawkeye staph_aureus.ctg2533.bnk
$ hawkeye staph_aureus.bnk
Create the bank human.bnk from the files human.frg and human.asm, which are the input and output files for the Celera Assembler. More information on converting to AMOS is available in the toAmos documentation.
$ toAmos -f human.frg -a human.asm -o - | bank-transact -m - -b human.bnk -c
Create the bank human.bnk from an ace file, which is an output format for many assemblers including Phrap, Arachne, and Newbler. Check your assembler's documentation for more information on creating ACE files. Note the ACE file contains all of the sequence information, so it is not necessary to import the fasta files separately. More information on converting to AMOS is available in the toAmos documentation.
$ toAmos -ace human.ace -o - | bank-transact -m - -b human.bnk -c
Create the bank human.bnk from an assembly archive XML file called ASSEMBLY.xml. Note that all of the read FASTA files should be concatenated into a single TRACEINFO.seq file, the read quality files should be concatenated into a single TRACEINFO.qual file, and the TRACEINFO.xml file should be present as well. More information is available in the tarchive2amos documentation.
$ tarchive2amos -o human -assembly ASSEMBLY.xml TRACEINFO.seq
$ bank-transact -m human.afg -b human.bnk -c
Once the bank has been built, launch the viewer by running hawkeye on the bank directory. This will open your assembly to the Hawkeye Launch Pad where you can see an overview of your assembly and select scaffolds or contigs for closer investigation:
$ hawkeye human.bnk
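The ACE and Celera Assembler recipes above differ only in the toAmos arguments, so the choice can be sketched as a small dry-run helper that prints the pipeline for a given input file. The function name bank_command is hypothetical, and the sketch assumes the bank is named after the input file and, for Celera Assembler output, that the matching .frg file sits next to the .asm file. It prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of AMOS): print the conversion
# pipeline for an assembly file, chosen by its extension.
bank_command() {
    input=$1
    prefix=${input%.*}      # e.g. human.ace -> human
    bank=$prefix.bnk
    case $input in
        *.ace)
            # ACE files already contain the sequence information.
            echo "toAmos -ace $input -o - | bank-transact -m - -b $bank -c" ;;
        *.asm)
            # Celera Assembler: assumes $prefix.frg exists alongside the .asm.
            echo "toAmos -f $prefix.frg -a $input -o - | bank-transact -m - -b $bank -c" ;;
        *)
            echo "unsupported input: $input" >&2
            return 1 ;;
    esac
}

# Example: bank_command human.ace
```

Printing the command first makes it easy to review the file names before creating the bank; the printed line can then be pasted into the shell.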
Command Line Options
The options available are listed by specifying -h.
$ hawkeye -h
Usage: hawkeye [options] [bankname [contigid [position]]]
Options:
  -c <path>   Add a chromatogram path
  -D <DB>     Set the chromatogram DB
  -T          Enable Trace Fetch cmd
  -p <port>   Initialize Server on this port
  -K <kmer>   Load File of kmers
  -h          Display this help
A typical execution will be "hawkeye prefix.bnk" which will load the assembly from the bank named prefix.bnk.
Hawkeye can display k-mer coverage in addition to read or insert coverage in the coverage plot region. To do so, you must pre-compute the k-mer counts in your assembly. AMOS comes bundled with a tool, 'count-kmers', that can be used for this purpose. A typical workflow is to count the occurrences of k-mers (k=22) in your reads and plot those values. A sufficiently long k-mer should be unique in your genome, so the average k-mer coverage indicates the depth of read coverage, and spikes in k-mer coverage indicate repetitive regions. This is displayed as follows:
$ count-kmers -r human.bnk > human.22mers
$ hawkeye -K human.22mers human.bnk
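To see why repeated sequence shows up as a spike in k-mer coverage, the counting idea can be illustrated on plain sequence lines with awk. This is only a toy sketch: the function name kmer_counts is hypothetical, it counts forward-strand k-mers only, and neither its logic nor its output format should be taken as that of count-kmers.

```shell
#!/bin/sh
# Illustrative only (not the count-kmers implementation or output format):
# count k-mers in plain sequence lines read from stdin, one read per line,
# and print "count kmer" with the most frequent k-mers first.
kmer_counts() {
    k=$1
    awk -v k="$k" '
        {
            seq = toupper($0)
            # Slide a window of width k across the read.
            for (i = 1; i + k - 1 <= length(seq); i++)
                count[substr(seq, i, k)]++
        }
        END {
            for (kmer in count)
                print count[kmer], kmer
        }
    ' | sort -k1,1nr -k2,2
}

# Example: printf 'ACGTACGT\n' | kmer_counts 4
```

In the example, ACGT occurs twice while every other 4-mer occurs once, which is the same signal, at a much smaller scale, that marks a repetitive region in the coverage plot.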
This work was supported in part by NIH award R01-LM06845, the National Institute of Allergy and Infectious Disease under contract NIH-NIAID-DMID-04-34, HHSN266200400038C, and DHS/HSARPA award W81XWH-05-2-0051 to SLS.