Home
Name Modified Size InfoDownloads / Week
old_versions 2012-03-15
fastqz15.cpp 2012-03-15 34.8 kB
readme.txt 2012-03-15 3.2 kB
fapacks.cpp 2012-03-13 2.3 kB
Totals: 4 Items   40.3 kB 0
fastqz15.cpp is the source code for the latest version
of the FASTQ compressor. It compresses the common Sanger
variant. FASTQ is output by DNA sequencing machines.

fapack.cpp is a program to pack FASTA files into a format
suitable for input to fastqz as a reference genome for
better compression.
It packs 4 bases per byte and discards all but A,C,G,T.

fapacks.cpp works the same except that it does not ignore
lowercase a,c,g,t. Lowercase is used in hg19 to indicate
repeats. Generally it produces a larger reference but
gives better compression.

Other fastqz*.cpp are older versions. You don't need them.

Usage: fastqz {c[Q]|d|e[Q]|f} input output [reference]

Command c compresses input to output.fx?.zpaq (3 or 4 files)
Command d decompresses input.fx?.zpaq to output
Command e encodes input to output.fx?
Command f decodes input.fx? to output

Commands c and d are slow, require 1.5 GB memory, use 3 or
4 cores, but get very good compression. Commands e and f are much
faster, use little memory, and only one thread, but compression
ratio is not as good.

Commands cQ or eQ quantize the quality scores for lossy but
better compression. The default is c1 or e1, which is lossless.
Quality scores in the range 33..73 are rounded down to 35 plus
a multiple of Q.

You can supply a reference genome to improve compression.
If you use this, the same reference is needed to decompress.
It also increases the memory requirement to 1.2 GB for the
e command and 0.5 GB for the f command. c and d still need
1.5 GB.

You can prepare the reference genome from FASTA files like:

  fapacks hg19s *.fa

to produce the file hg19s. Then compress:

  fastqz c in.fastq arc hg19s

To decompress:

  fastqz d arc out.fastq hg19s

There are 4 compressed files:

  arc.fxh.zpaq - compressed headers
  arc.fxb.zpaq - compressed base calls
  arc.fxq.zpaq - compressed quality scores
  arc.fxa.zpaq - compressed alignments if a reference is used.

Commands e and f work the same way except the compressed
files do not have a .zpaq extension. If no reference is
used, then no .fxa or .fxa.zpaq file is produced or expected.

fastqz only works on the Sanger FASTQ variant. It assumes
that quality scores are Phred+33 (range ASCII 33 to 73).
Base calls must be A,C,G,T,N only. N must have a quality
score of 0, and all others 1 or higher. Maximum line length
is 4095. Lines must be terminated by linefeeds only (no
carriage returns). If a reference is used, it must be
smaller than 1 GB packed (4 billion bases).

To compile fastqz you will need the latest version of
libzpaq from https://sourceforge.net/projects/zpaq/
or http://mattmahoney.net/zpaq/
These programs will work in either Windows or Linux.
In Windows, you will also need Pthreads-Win32 from
http://sourceware.org/pthreads-win32/ to compile or run.
To compile (no Makefile, sorry):

  g++ -O3 -msse2 -s -lpthread fastqz.cpp libzpaq.cpp -o fastqz
  g++ -O3 -s fapack.cpp -o fapack

fastqz* and fapack* are written by Matt Mahoney, Dell Inc.
All are BSD-2 licensed. But note that libzpaq
is public domain and Pthreads-Win32 is LGPL.
Source: readme.txt, updated 2012-03-15