Informatics Research k-mer Tools Code

Brought to you by: brianwalenz, florea

Tree [r2013] / trunk / History

File	Date	Author	Commit
ESTmapper	2014-04-11	brianwalenz	[r1970] Replace u32bit (etc) types with more standard u...
ESTmapper LaTeX	2014-11-14	brianwalenz	[r1994] Rename ESTmapper documentation.
atac-driver	2015-05-11	brianwalenz	[r2001] Rewrite input fasta files to resolve 'KeyError ...
developer-doc	2014-11-14	brianwalenz	[r1994] Rename ESTmapper documentation.
leaff	2014-10-07	brianwalenz	[r1987] Template-ify the intervalList.
libbio	2014-04-11	brianwalenz	[r1970] Replace u32bit (etc) types with more standard u...
libkmer	2015-05-20	brianwalenz	[r2004] Clean up meryl/libkmer installation. Add READM...
libmeryl	2014-04-11	brianwalenz	[r1970] Replace u32bit (etc) types with more standard u...
libseq	2015-04-22	brianwalenz	[r1995] Initialize the random access flag.
libsim4	2014-10-07	brianwalenz	[r1985] Replace min() and max() with MIN() and MAX() to...
libutil	2015-04-22	brianwalenz	[r1996] Minor; typo in comment.
meryl	2015-09-03	brianwalenz	[r2013] Big rewrite of interface. Much simpler to use....
seagen	2014-04-11	brianwalenz	[r1970] Replace u32bit (etc) types with more standard u...
seatac	2014-04-11	brianwalenz	[r1970] Replace u32bit (etc) types with more standard u...
sim4db	2014-04-11	brianwalenz	[r1970] Replace u32bit (etc) types with more standard u...
sim4dbutils	2015-05-29	brianwalenz	[r2005] Increase length of matches from 10k to 1m.
snapper	2014-10-07	brianwalenz	[r1986] More max() to MAX() conversion.
tapper	2014-10-07	brianwalenz	[r1987] Template-ify the intervalList.
trie	2014-04-11	brianwalenz	[r1970] Replace u32bit (etc) types with more standard u...
ESTmapper GSAC.pdf	2014-11-14	brianwalenz	[r1994] Rename ESTmapper documentation.
ESTmapper GSAC.ppt	2014-11-14	brianwalenz	[r1994] Rename ESTmapper documentation.
Make.include	2011-03-23	brianwalenz	[r1897] Re-enable building of atac and seatac.
Make.rules	2011-12-29	brianwalenz	[r1914] Fix building of shared libraries on OS-X. Mean...
Makefile	2011-01-11	brianwalenz	[r1888] Let a simple 'gmake' actually compile without '...
Makefile.wiki	2011-01-11	brianwalenz	[r1886] Translate the LaTex Makefile doc into mediawiki.
PACKAGING	2015-06-03	brianwalenz	[r2009] Add tar, remove subversion.
README.atac	2015-05-11	brianwalenz	[r2002] Add README.atac.
README.compiling	2015-05-11	brianwalenz	[r2003] Update, ESTmapper and ATAC are now installed in...
README.leaff	2011-01-20	florea	[r1893] Removed garbage collection for splice model.
README.meryl	2015-05-20	brianwalenz	[r2004] Clean up meryl/libkmer installation. Add READM...
README.sim4db	2011-01-20	florea	[r1893] Removed garbage collection for splice model.
configure.sh	2015-09-03	brianwalenz	[r2012] Remove bzip2 libraries from linking. They're n...

Read Me

meryl - in- and out-of-core kmer counting and utilities.

Copyright (C) 2002, and GNU GPL,       PE Corporation (NY) through the Celera Genomics Group
Copyright (C) 2003-2004, and GNU GPL,  Applied Biosystems
Copyright (C) 2004-2015, and GNU GPL,  Brian Walenz

=======================================================================

Content:

I.   What is meryl?
II.  Command line usage
III. Input/Output
IV.  Affiliated tools
V.   Terms of use
VI.  Support


I.   What is meryl?

meryl computes the kmer content of genomic sequences.  Kmer content is
represented as a list of kmers and the number of times each occurs in the
input sequences.  Counting can be restricted to only the forward kmer, only
the reverse kmer, or the canonical kmer (the lexicographically smaller of the
forward and reverse kmers at each location).  meryl can report the histogram
of counts or the list of kmers and their counts, or it can perform
mathematical and set operations on the processed data files.
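
As an illustration of the three counting modes described above, here is a
minimal Python sketch.  It is not meryl's implementation.  It assumes that
"reverse kmer" means the reverse complement, that the alphabet is ACGT, and
that kmers containing any other character are skipped; the function names
are hypothetical.

```python
from collections import Counter

# Assumption: "reverse" means reverse complement over the ACGT alphabet.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of an ACGT string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def count_kmers(sequence, k, mode="canonical"):
    """Count the kmers of one sequence in 'forward', 'reverse' or 'canonical' mode."""
    if mode not in ("forward", "reverse", "canonical"):
        raise ValueError("unknown mode: " + mode)
    counts = Counter()
    sequence = sequence.upper()
    for i in range(len(sequence) - k + 1):
        forward = sequence[i:i + k]
        if set(forward) - set("ACGT"):
            continue  # assumption: skip kmers containing N or other symbols
        reverse = reverse_complement(forward)
        if mode == "forward":
            counts[forward] += 1
        elif mode == "reverse":
            counts[reverse] += 1
        else:
            # canonical: the lexicographically smaller of the two
            counts[min(forward, reverse)] += 1
    return counts
```

In canonical mode a kmer and its reverse complement share one entry, so the
result does not depend on which strand was sequenced.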

The meryl process can run in one large memory batch, in many small memory batches, or under SGE
control, all with or without using multiple CPU cores.

The maximum kmer size is effectively unlimited, but is set at compile time.  A larger limit
uses more memory and is inefficient for counting smaller kmers.  Since most applications have
involved kmers shorter than 32 bases, the default compile-time limit is 32 bases.

The output of meryl is two binary files, called a meryl database, which can be
quickly dumped to provide a histogram of counts, or the actual counts.
A C++ library is supplied for direct access to the files.

The meryl program can perform many mathematical and set operations on
multiple database files: min, minexist, max, add, sub, abs, and, nand, or,
xor, lessthan, lessthanorequal, greaterthan, greaterthanorequal, and equal.

The ATAC pipeline uses meryl to find the unique kmers in each of two sequences
('lessthanorequal 1'), then computes the 'and' of the two results to find the
unique kmers that exist in both sequences.
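
The ATAC-style combination above can be sketched in Python with plain
dictionaries standing in for meryl databases.  This only illustrates the
logic; it is not how meryl stores or processes its data.  The function names
are hypothetical, and the count kept by kmer_and (the smaller of the two) is
an assumption, since this document does not say which count 'and' writes.

```python
def less_than_or_equal(db, threshold):
    """Keep only the kmers whose count is <= threshold."""
    return {kmer: count for kmer, count in db.items() if count <= threshold}


def kmer_and(db_a, db_b):
    """Keep only the kmers present in both inputs.

    Assumption: the output count is the smaller of the two input counts.
    """
    return {
        kmer: min(count, db_b[kmer])
        for kmer, count in db_a.items()
        if kmer in db_b
    }


# Toy stand-ins for the kmer counts of two sequences.
seq1 = {"AAA": 1, "AAC": 3, "ACG": 1}
seq2 = {"AAA": 1, "ACG": 2, "TTT": 1}

# Unique kmers of each input, then the kmers unique in both.
shared_unique = kmer_and(less_than_or_equal(seq1, 1),
                         less_than_or_equal(seq2, 1))
```

Here "ACG" is dropped because it occurs twice in seq2, and "AAC" and "TTT"
are dropped because each is missing from one of the filtered sets.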


II. Command line usage

A simple invocation:

meryl -B -C -m 22 -s /data/references/ecolik12.fasta -o ecoli-22mers

The above command will build (-B) a kmer database (-o ecoli-22mers)
of the canonical (-C) 22-mers (-m 22) in the FASTA file ecolik12.fasta.
The two output files are ecoli-22mers.mcidx and ecoli-22mers.mcdat.

meryl -Dh -s ecoli-22mers > ecoli-22mers.histogram

The above command will dump a histogram of the kmers in the 'ecoli-22mers'
database.  The histogram has four columns:

  kmer-count  number-of-kmers  fraction-distinct  fraction-total
  [example]

The first line tells us that there are X kmers that occur exactly once, that
these kmers make up XX% of the distinct kmers, and that they account for
YY% of all the kmers in the input.
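
To make the four columns concrete, the following Python sketch derives them
from a table of kmer counts.  It is an illustration only.  This document
does not say whether meryl's fraction columns are per-row or cumulative, so
the sketch computes per-row fractions as an assumption; the function name is
hypothetical.

```python
from collections import Counter


def histogram(counts):
    """Return rows of (kmer-count, number-of-kmers, fraction-distinct, fraction-total).

    counts maps each distinct kmer to its number of occurrences.
    Assumption: both fractions are per-row, not cumulative.
    """
    by_count = Counter(counts.values())   # kmer-count -> number of distinct kmers
    distinct = len(counts)                # number of distinct kmers
    total = sum(counts.values())          # number of kmer occurrences in the input
    rows = []
    for kmer_count in sorted(by_count):
        n = by_count[kmer_count]
        rows.append((kmer_count, n, n / distinct, kmer_count * n / total))
    return rows


rows = histogram({"AAA": 1, "AAC": 1, "ACG": 2})
```

With this toy input, two of the three distinct kmers occur once and together
contribute two of the four kmer occurrences.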

meryl -M and -s seq1 -s seq2 -o both

The above command will report the kmers that are present in both meryl databases
'seq1' and 'seq2', writing them to a new meryl database 'both'.

Run with no options for a list of parameters.

See http://kmer.sourceforge.net/wiki/index.php/Getting_Started_with_Meryl for more.


III. Input/Output

For counting kmers, input is exactly one multi-FASTA or FASTQ file.  The
file must be uncompressed.

For processing databases, an input database is supplied by the prefix of the two
files:  the prefix of 'ecoli-ms22.mcidx' and 'ecoli-ms22.mcdat' is 'ecoli-ms22'.
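
The prefix convention can be checked from a script before meryl is run.
This Python sketch uses only the two file extensions named in this document;
the helper names are hypothetical and are not part of meryl.

```python
from pathlib import Path


def database_files(prefix):
    """Return the two file paths that make up the database with this prefix."""
    return Path(prefix + ".mcidx"), Path(prefix + ".mcdat")


def database_exists(prefix):
    """True only if both files of the database are present."""
    return all(path.is_file() for path in database_files(prefix))
```

For example, database_files("ecoli-ms22") names 'ecoli-ms22.mcidx' and
'ecoli-ms22.mcdat'.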

Output is a 'meryl database' consisting of two binary files, '*.mcidx' and '*.mcdat'.

Output of the histogram command is a single text file to stdout.

Output of the threshold dump is a multi-FASTA file, with the name of the sequence
set to the count, and the sequence set to the kmer.
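
A threshold dump in this format can be read back with a few lines of Python.
The sketch assumes that each header line holds only the count after '>' and
that each kmer occupies a single line; the function name is hypothetical.

```python
def read_kmer_dump(lines):
    """Parse a multi-FASTA kmer dump into a dict of kmer -> count.

    Assumptions: the header is '>' followed by an integer count, and the
    kmer is on the single line that follows its header.
    """
    counts = {}
    pending = None  # count from the most recent header, not yet paired with a kmer
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            pending = int(line[1:])
        elif pending is not None:
            counts[line] = pending
            pending = None
    return counts


example = read_kmer_dump([">3", "ACGTACGT", ">1", "TTTTACGA"])
```

Any iterable of lines works, including an open file object.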


IV. Affiliated tools

Several additional kmer counting and analysis programs are included in the meryl package.

simple         - The obvious array based kmer counter.  It will allocate 4 bytes per
                 kmer; k=16 will need 16 GB to run.

                 NEEDS UPDATE

mapMers        - Report stats of mapping kmers to sequences.  Three modes of operation:
                  -stats     reports mean, min and max for each sequence, along with a
                             log2 histogram of the counts
                  -regions   reports regions with kmer coverage
                  -details   reports for each kmer in the sequence, the forward and reverse
                             count in the database

mapMers-depth  - Reports, for each sequence ordinal 's' and position 'p':
                  -count     the count (c) of the single kmer that starts at position (p).
                             Format: 's p c'
                  -depth     the number (n) of kmers that span position (p).  Format: 's p n'
                  -stats     the min (m), max (M), ave (a) count of all mers that span
                             position (p).  Format: 's p m M a t n'
                             (also reports total count (t) and number of kmers (n))

kmer-mask      - Mask and filter a set of sequences (presumed to be reads) by kmer content.
                 Masking can be done to retain novel sequence not in the database, or to retain
                 confirmed sequence present in the database.  Filtering will segregate sequences
                 fully, partially or not masked.

existDB        - (installed by libkmer) Management of existDB files.
positionDB     - (installed by libkmer) Management of posDB files.
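
The 16 GB figure quoted for 'simple' follows from simple arithmetic, which
this Python sketch reproduces.  It assumes one 4-byte counter for each of the
4^k possible kmers over a four-letter alphabet; the function name is
hypothetical.

```python
def simple_counter_bytes(k, bytes_per_kmer=4):
    """Bytes needed by an array with one counter per possible kmer of size k.

    Assumption: a four-letter alphabet, so there are 4**k possible kmers.
    """
    return (4 ** k) * bytes_per_kmer


GIB = 2 ** 30
k16_gib = simple_counter_bytes(16) / GIB   # 4**16 counters * 4 bytes = 16 GiB
```

Each additional base multiplies the requirement by four, so under the same
assumptions k=17 would already need 64 GiB.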


V. Terms of use

This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received (LICENSE.txt) a copy of the GNU General
Public License along with this program; if not, you can obtain one from
http://www.gnu.org/licenses/gpl.txt or by writing to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA


VI. Support 

Brian Walenz (brianwalenz@users.sourceforge.net)

Please check the parent project's Sourceforge page at
http://kmer.sourceforge.net for details and updates.


Last updated: May 13, 2015
