MaqPetToBedFormat

MaqPetToBedFormat Manual

Project Lead:

Anthony P. Fejes, afejes@bcgsc.ca, Graduate Student, University of British Columbia (funded by the Michael Smith Foundation for Health Research)

Funding

This project was developed from research performed through the BC Cancer Agency, Genome Sciences Centre, Vancouver BC, Canada. Anthony is also supported by the Michael Smith Foundation for Health Research.

Support

For support, please send mail to vancouvershortr-findpeaks@lists.sourceforge.net

Notes:

  • Please note that this software and its manual are released under the GPL. Users who modify this code are bound by the terms and conditions of the GPL. Use of this software and associated materials does not guarantee any results, and no warranty is implied by its distribution.
  • Please feel free to contact the author by email at vancouvershortr-findpeaks@lists.sourceforge.net with suggestions, comments and code modifications, all of which are gratefully accepted and will be credited.
  • If the code and the manual disagree, please contact the developers of the project or file a bug.

Introduction:

This program is a conversion tool that accepts binary output files from paired-end alignments generated with the Maq aligner. It performs two major functions: pairing the reads, and producing bed files that can be viewed in the UCSC genome browser. Pairing the reads accounts for most of the run time.


Requirements:

This application requires about 400 MB of memory for a typical run. (Memory use depends on the number of unpaired reads from the .map file being held at any given time.)
On a multi-CPU computer, a small performance gain may be observed when not using the Unix "nice" command. Runtime should be about 15 minutes for 7 million reads (~3 million pairs) on a 2.4 GHz Xeon CPU.


Work Flow

This application accepts PET Maq binary .map files and produces three files, which can be further processed for use with applications like FindPeaks, or viewed directly in the UCSC genome browser.

  • <run_name>_paired(map)-local.bed.gz: this bed file contains all of the read pairs that were assigned to the same chromosome. It can be produced in either a simple format or as a complex format (see -bridge option below). This file contains all reads for each chromosome.
  • paired(map)-SET.bed.gz: this bed file contains all reads that could not be paired - either because their mate pair was unmappable by Maq, or because their mate pair was below the minimum quality filter.
  • paired(map)-spanning.txt: this file contains all read pairs that map across chromosomes, and thus cannot be represented in bed format.
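
The split into the three outputs can be pictured with a short Python sketch. It is illustrative only: the Read tuple, the classify function and the returned labels are hypothetical names, not taken from the program's source, and the sketch assumes that a mate which is missing or below the quality filter simply leaves the other read unpaired.

```python
from typing import NamedTuple, Optional


class Read(NamedTuple):
    """Hypothetical, minimal view of one aligned read (not the Maq record layout)."""
    name: str
    chrom: str
    start: int
    quality: int


def classify(read: Read, mate: Optional[Read], min_quality: int = 0) -> str:
    """Return which of the three outputs a read would land in.

    Assumption: a mate that is missing or below the quality filter leaves
    the read unpaired; mates on different chromosomes are 'spanning'.
    """
    if mate is None or mate.quality < min_quality:
        return "SET"        # unpaired reads, bed file
    if read.chrom != mate.chrom:
        return "spanning"   # cross-chromosome pairs, text file
    return "local"          # same-chromosome pairs, bed file


a = Read("r1/1", "chr1", 1000, 40)
b = Read("r1/2", "chr1", 1180, 35)
c = Read("r2/2", "chr7", 5000, 35)

print(classify(a, b, min_quality=30))
print(classify(a, c, min_quality=30))
print(classify(a, None, min_quality=30))
```

The sketch does not model what happens when the first read itself fails the quality filter; the manual does not describe that case.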

Parameters:

-bridge

Bed files can be produced in two different formats. In the first format, each pair is shown as a solid block that starts at the start position of the first read and ends at the end position of the second read, representing the coverage of the fragment as predicted by the alignment. The second format shows the portions that were actually sequenced as blocks, with a narrow bridge connecting the two sequenced areas. This is a "tag and bridge" format, which is ideal for visualizing the actual sequence coverage. Both formats should be usable for any further downstream processing with FindPeaks or other applications.
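
The difference between the two formats can be sketched in Python using the standard UCSC BED columns. This is an assumption-laden illustration: the manual does not list the exact columns the program writes, so the function names and the score, strand and colour placeholders below are hypothetical. Only the general BED layout (blockCount, blockSizes, blockStarts) is standard.

```python
def solid_block(chrom, first_start, second_end, name):
    """Simple format: one block spanning the whole predicted fragment."""
    return f"{chrom}\t{first_start}\t{second_end}\t{name}"


def tag_and_bridge(chrom, first_start, first_len, second_start, second_len, name):
    """Bridge format as a 12-column BED line with two blocks.

    The two blocks are the sequenced tags; the browser draws the gap
    between them as a thin connecting line.
    """
    end = second_start + second_len
    block_sizes = f"{first_len},{second_len}"
    block_starts = f"0,{second_start - first_start}"  # relative to chromStart
    fields = [chrom, first_start, end, name, 0, "+",   # score and strand are placeholders
              first_start, end, "0,0,0", 2, block_sizes, block_starts]
    return "\t".join(str(f) for f in fields)


# Two 36 bp reads whose fragment covers chr1:1000-1200 (0-based, half-open).
print(solid_block("chr1", 1000, 1200, "pair1"))
print(tag_and_bridge("chr1", 1000, 36, 1164, 36, "pair1"))
```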

-hist_size <Integer>

This parameter allows the histogram to be expanded to any size desired by the user. Bin size remains constant at 1 base.

If flag is omitted: hist_size will default to 1000.
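
As a rough picture of what a histogram with a bin size of 1 looks like, the Python sketch below counts integer values into hist_size bins. It rests on assumptions: the manual does not say which quantity is binned (fragment length is a guess) or how values beyond the last bin are handled, so the function name and the separate overflow counter are hypothetical.

```python
def length_histogram(lengths, hist_size=1000):
    """Count values into hist_size bins of width 1.

    Values outside 0..hist_size-1 are tallied in a separate overflow
    counter (an assumption; the program may treat them differently).
    """
    bins = [0] * hist_size
    overflow = 0
    for n in lengths:
        if 0 <= n < hist_size:
            bins[n] += 1
        else:
            overflow += 1
    return bins, overflow


bins, overflow = length_histogram([180, 180, 200, 1500], hist_size=1000)
print(bins[180], bins[200], overflow)
```

Raising hist_size in this sketch simply moves values out of the overflow counter and into their own bins.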

-input <String>

The Maq file to read. A single file name must be provided. To use more than one Maq .map file, we suggest using maqmerge to join them together. Please note that joining files together which contain reads that have the same name will cause name collisions and incorrect pairings. (This should not be a problem if the standard convention of maintaining flow cell numbers in the name is retained.)

If flag is omitted: program will not run.

-noflag

If the read names do not end in "/1" and "/2" to differentiate the forward and reverse reads, this program can still be used in "no flag" mode. Use the "-noflag" option to enable this mode. In this case, no checks are made for possible name collisions between reads (e.g., multiple sequences from the same cluster).
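
Name-based pairing can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program's actual algorithm: mate_key and pair_reads are hypothetical names, and the sketch assumes reads are matched in a single pass while unmatched reads wait in a dictionary.

```python
def mate_key(name, noflag=False):
    """Return the key shared by both reads of a pair.

    Default mode strips a trailing "/1" or "/2"; in no-flag mode the
    full name is used as-is.
    """
    if not noflag and name.endswith(("/1", "/2")):
        return name[:-2]
    return name


def pair_reads(names, noflag=False):
    """Pair reads in a single pass, holding unmatched reads in a dict."""
    waiting = {}   # key -> first read seen, still waiting for its mate
    pairs = []
    for name in names:
        key = mate_key(name, noflag)
        if key in waiting:
            pairs.append((waiting.pop(key), name))
        else:
            waiting[key] = name
    return pairs, list(waiting.values())


pairs, unpaired = pair_reads(["A/1", "B/1", "A/2", "C/2"])
print(pairs, unpaired)
```

In this sketch the waiting dictionary grows with the number of reads whose mate has not yet been seen. In no-flag mode any two reads with the same name are paired, so a third read with that name would silently start waiting for a fourth.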

-output <String>

The output directory into which the log and processed files should be placed.

If flag is omitted: program will not run.

-maq_read_size <Integer>

This field is required to identify the maximum read size for this .map file. Unfortunately, Maq does not record its version in the .map file, and that information is needed to parse the file. If the .map file was generated with Maq version < 0.7, use "-maq_read_size 64". If the .map file was generated with Maq version 0.7 or later, use "-maq_read_size 128".
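
The version rule above can be written as a small Python helper. The function name is hypothetical, and the sketch assumes the Maq version is available as a dotted string with at least a major and a minor part (for example "0.6.8").

```python
def maq_read_size(maq_version):
    """Return the -maq_read_size value for a Maq version string such as "0.6.8"."""
    major, minor = (int(part) for part in maq_version.split(".")[:2])
    # Versions before 0.7 use 64; 0.7 and later use 128.
    return 64 if (major, minor) < (0, 7) else 128


print(maq_read_size("0.6.8"))
print(maq_read_size("0.7.1"))
```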

If flag is omitted: program will not run.

-max_pet_size <Integer>

(present as -maq_PET_length <integer> in versions below 3.3.3.2)

This parameter sets the maximum allowable paired end tag span, as indicated in the Maq .map file. The number in the .map file does not always reflect the true length of the paired end fragments, and thus we suggest setting -maq_pet_size for FindPeaks instead.

If flag is omitted: the program will attempt to pair all reads in the .map file.

-name <String>

The name of the run, used to generate the file names (output and log) produced.

If flag is omitted: program will not run.

-nobed_mode

This is a custom mode that produces files that are better suited for visualization with Circos. Turning on this flag stops the bed files from being produced, and results in an alternate file format being written out.

If flag is omitted: Bed files are produced - normal mode of operation.

-qualityfilter <Integer>

The minimum read alignment quality desired from the Maq read mapping quality field.

If flag is omitted: A default of zero (no filtering) is used.

Example:

time java -Xmx4G src/fileUtilities/MaqPetToBedFormat -output /projects/afejes/temp/ -name HS0727 -input /archive/solexa1_4/analysis/HS0727/30APCAAXX_6/maq/30APCAAXX_6.map -maq_read_size 128 -bridge -qualityfilter 30
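
When the tool is launched from a script, the command line can be assembled programmatically. The Python sketch below is one way to do that, using only the flags documented above; build_command and its keyword names are hypothetical, and the class path is copied from the example rather than verified.

```python
import shlex


def build_command(input_map, output_dir, name, maq_read_size,
                  bridge=False, qualityfilter=None, heap="4G"):
    """Assemble the argument list, starting from the four required flags."""
    required = (input_map, output_dir, name, maq_read_size)
    if any(value in (None, "") for value in required):
        raise ValueError("-input, -output, -name and -maq_read_size are required")
    cmd = ["java", f"-Xmx{heap}", "src/fileUtilities/MaqPetToBedFormat",
           "-output", output_dir, "-name", name, "-input", input_map,
           "-maq_read_size", str(maq_read_size)]
    if bridge:
        cmd.append("-bridge")
    if qualityfilter is not None:
        cmd += ["-qualityfilter", str(qualityfilter)]
    return cmd


cmd = build_command(
    "/archive/solexa1_4/analysis/HS0727/30APCAAXX_6/maq/30APCAAXX_6.map",
    "/projects/afejes/temp/", "HS0727", 128,
    bridge=True, qualityfilter=30)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to start the conversion.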
