InsertionMapper Wiki

Brought to you by: wenweipublish

InsertionMapper v1.1 -- 09/20/2013

Welcome to the InsertionMapper project!

Thank you for using InsertionMapper. If you find it useful for your work, please
cite our paper:

** Wenwei Xiong, Limei He, Yubin Li, Hugo K. Dooner, Chunguang Du
(2013) InsertionMapper: a pipeline tool for the identification of targeted
sequences from multidimensional high throughput sequencing data **

The latest version of InsertionMapper is freely accessible at our website:
https://sourceforge.net/p/insertionmapper

The software is open source and distributed under the GNU General Public
License, which is included in the InsertionMapper package.
See 'GNU_GENERAL_PUBLIC_LICENSE.txt' for more information.

======== Download ========

[FOR REVIEWERS]

Please click InsertionMapper to download the latest InsertionMapper, with test data included.

A bigger test dataset in our real project is available at this link.

** For normal users, a different download link will be provided shortly. **

======== Installation ========

InsertionMapper runs on the Java Virtual Machine (JVM). To install it, download
the zip file and unzip it to a local directory. No compilation is needed,
because the packaged bytecode runs on any platform that has a JVM.

======== System requirements ========

InsertionMapper requires JRE 1.6 or later. A full list of Oracle Certified
Configurations for Java SE 6 can be found at
http://www.oracle.com/technetwork/java/javase/system-configurations-135212.html.

To determine your Java version, open a system terminal and type 'java
-version' (without quotes). If you haven't installed Java yet, please go to the
official Java website http://java.com/en/download/.
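
The version check above can also be scripted. The following is a minimal Python
sketch, not part of InsertionMapper: the function names are hypothetical, and it
assumes the first line printed by 'java -version' has the usual form, e.g.
java version "1.6.0_45" for older releases or openjdk version "17.0.2" for
newer ones. Note that Java prints this banner to standard error.

```python
import re
import subprocess

def parse_java_version(banner):
    """Return the Java feature version (6 for "1.6.0_45", 17 for "17.0.2"), or None."""
    m = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if m is None:
        return None
    major = int(m.group(1))
    minor = int(m.group(2) or 0)
    # Legacy numbering: "1.x" means Java x. Newer releases start with the version itself.
    return minor if major == 1 else major

def meets_requirement(banner, minimum=6):
    """True if the banner reports at least JRE 1.6 (the InsertionMapper requirement)."""
    version = parse_java_version(banner)
    return version is not None and version >= minimum

def installed_java_banner():
    """Run 'java -version' and return its output, or None if Java is not on the PATH."""
    try:
        done = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except FileNotFoundError:
        return None
    return done.stderr or done.stdout

# Demonstration on sample banners (no Java installation needed for this part).
for sample in ('java version "1.5.0_22"', 'java version "1.6.0_45"', 'openjdk version "17.0.2"'):
    print(sample, "->", meets_requirement(sample))
```

To test the local installation, pass the result of installed_java_banner() to
meets_requirement() after checking that it is not None.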

InsertionMapper is written in Scala and compiled into Java bytecode. Users
don't need Scala to run the program, unless they want to compile it from source
code. Please refer to http://www.scala-lang.org/ for details about Scala.

======== Quick start ========

After unzipping the package, open a terminal in the InsertionMapper directory
and type
'java -jar InsertionMapper.jar'
to run the test data in the 'raw' directory. With the default configuration,
filtered reads will be written to the 'filtered' directory, and an H2 database
named 'test.db' will be created in the top folder.

======== Introduction ========

As seen in the directory structure, the "InsertionMapper.jar" file encapsulates
all the essential logic of this pipeline tool, complemented by third-party
libraries in the 'lib' folder for database manipulation and parameter
configuration.

The 'raw' and 'filtered' folders are the recommended places for raw reads and
filtered reads, and the default Grade-1 and Grade-2 thresholds are 1.5 and 3
respectively. All parameters can be changed by modifying the 'application.conf'
file, which contains explanatory comments.

===== Configuration and parameters =====

Because its configuration is flexible, InsertionMapper suits a variety of
sequence identification scenarios in multidimensional pools. All parameters and
embedded SQL statements are contained in 'application.conf', which is written
in HOCON (Human-Optimized Config Object Notation) format. Users should be able
to modify parameters for InsertionMapper without digging into the detailed
HOCON specification
(https://github.com/typesafehub/config/blob/master/HOCON.md).

A list of common parameters, with values (in parentheses) used in the Dsg project:

Dimensions
count of dimensions (3)
names of dimensions (Plate, Row, Column)
number of libraries for each dimension ([10, 8, 12])

Primers
primer name (ds17)
primer sequence ( ds17 = """CCGACCGTTTTCATCCCTA((.{8}).+)""" ) - written in Java regular expression syntax; the group (.{8}) captures the 8-bp TSD (target site duplication), which serves as the index sequence for database manipulation.

File mapping
directory of raw read files (./raw)
a list of file names of raw reads

Thresholds - definitions in the paper
grade 1 ratio, i.e. TH_G1 (1.5)
grade 2 ratio, i.e. TH_G2 (3)

Please consult the 'application.conf' file for detailed parameters and configuration.
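
To illustrate how the ds17 primer pattern listed above behaves, here is a
minimal Python sketch. It is not InsertionMapper's own code: it assumes that
Python's 're' module handles this particular pattern the same way as the Java
regex engine (the pattern only uses literals, '.', '{8}', '+' and capture
groups), the example read is made up, and searching anywhere in the read is an
assumption about how the pattern is applied.

```python
import re

# Primer pattern copied from the Dsg example: the primer itself, then two nested groups.
# Group 1 = everything after the primer; group 2 = its first 8 bp (the TSD index).
DS17 = re.compile(r"CCGACCGTTTTCATCCCTA((.{8}).+)")

def extract_index(read):
    """Return (index_seq, flanking_seq) if the read contains the primer, else None."""
    m = DS17.search(read)
    if m is None:
        return None
    return m.group(2), m.group(1)

# Hypothetical read: primer + 8-bp TSD "ACGTACGT" + further flanking sequence.
read = "CCGACCGTTTTCATCCCTA" + "ACGTACGT" + "TTGACCA"
print(extract_index(read))
```

Because the pattern ends in '.+', a read that stops right after the 8-bp index
does not match; at least one more base must follow the TSD.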

======== Result interpretation ========

For each primer configured in the 'application.conf' file, one report is
generated at the end of the run, listing all identified sequences amplified by
that particular primer. The report file is a plain CSV (comma-separated values)
text file which can be opened by any text editor or spreadsheet program (e.g.
Microsoft Excel). The number of columns varies with the number of dimensions in
the sample pools.

For typical 3-D pools, the columns in the report file are as follows:

"DIMENSION": Well coordinates in the multidimensional pools, given as the
library number in each dimension, separated by '_'.
"INDEXSEQ": Identified indexing sequences. (E.g. 8-bp TSD for transposable
elements)
"GRADE": Grade of the identified sequence.
"RANK_P"/"RANK_R"/"RANK_C": Ranks in each dimension.
"DIMENSION2"/"RANK2_P"/"RANK2_R"/"RANK2_C": The same information for the second
possible assignment, which indicates how significant the first assignment is
(e.g. no second assignment, or one with very low ranks, suggests that the first
assignment is of good quality).
"FULLSEQ": Full sequence identified.
"COUNTOFSEQSINWELL": Count of sequences assigned to the current well.
"COUNTOFWELLSFORSEQ": Count of wells the current sequence is assigned to.

Sequences identified with no ambiguity are those whose values in the last two
columns are both 1.
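
The rule above can be applied to a report with a few lines of Python. This is a
minimal sketch under stated assumptions: the report is assumed to have a header
row using the column names listed above, the sample rows (including the grade
values) are made up and show only a subset of columns, and the function names
are hypothetical.

```python
import csv
import io

def unambiguous_rows(handle):
    """Return the report rows whose last two count columns are both 1."""
    reader = csv.DictReader(handle)
    return [
        row for row in reader
        if row["COUNTOFSEQSINWELL"].strip() == "1"
        and row["COUNTOFWELLSFORSEQ"].strip() == "1"
    ]

def parse_dimension(value):
    """Split a DIMENSION value such as '3_5_11' into a tuple of library numbers."""
    return tuple(int(part) for part in value.split("_"))

# Made-up report excerpt for a 3-D pool (Plate_Row_Column); not real output.
SAMPLE = """\
DIMENSION,INDEXSEQ,GRADE,COUNTOFSEQSINWELL,COUNTOFWELLSFORSEQ
3_5_11,ACGTACGT,1,1,1
7_2_4,GGCATTCA,2,2,1
7_2_4,TTAGGCAT,2,2,2
1_8_9,TTAGGCAT,2,1,2
"""

for row in unambiguous_rows(io.StringIO(SAMPLE)):
    print(parse_dimension(row["DIMENSION"]), row["INDEXSEQ"])
```

For a real report, pass an open file instead of the in-memory sample, e.g.
open(path, newline=""), where path is the location of the CSV report.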