From: Heng Li <lh...@sa...> - 2012-03-24 01:02:25
Following the discussion on subsampling sequences from fasta/fastq files, I think it is time to advertise my in-house tool more openly: seqtk. Currently, seqtk supports quality-based trimming with the phred algorithm, converting fastq to fasta, reverse complementing sequences, extracting or masking subsequences in regions given in a BED or name-list file, and more. I have just added a subsampling module that samples either exactly n sequences or a fraction of the sequences.

Seqtk accepts both fasta and fastq input files, which can optionally be gzip-compressed. Each module is perhaps the most efficient among tools with the same functionality. For example, I know its fastq-to-fasta conversion is 10X faster than another converter, while being more flexible.

Seqtk is implemented in a single .c file plus two header files and depends only on zlib. The source code is freely available (MIT license) here:

https://github.com/lh3/seqtk

Heng

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
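
For readers unfamiliar with the two subsampling modes mentioned above (exactly n sequences versus a fraction of sequences), here is a minimal Python sketch of one common way to do each in a single pass over the input. It is an illustration only and not seqtk's code: the function names (`fastq_records`, `sample_exact`, `sample_fraction`), the default seed, and the assumption of unwrapped 4-line fastq records are all made up for this example.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# One fastq record = its four lines (header, sequence, '+', quality).
Record = Tuple[str, ...]


def fastq_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield 4-line fastq records; assumes sequences are not line-wrapped."""
    it = iter(lines)
    while True:
        chunk = tuple(line.rstrip("\n") for line in islice(it, 4))
        if not chunk:
            return
        if len(chunk) != 4:
            raise ValueError("truncated fastq record")
        yield chunk


def sample_exact(records: Iterable[Record], n: int, seed: int = 11) -> List[Record]:
    """Reservoir sampling: keep exactly n records (or all, if fewer exist)."""
    rng = random.Random(seed)
    reservoir: List[Record] = []
    for i, rec in enumerate(records):
        if i < n:
            reservoir.append(rec)      # fill the reservoir first
        else:
            j = rng.randint(0, i)      # uniform over 0..i, inclusive
            if j < n:
                reservoir[j] = rec     # replace with probability n / (i + 1)
    return reservoir


def sample_fraction(records: Iterable[Record], frac: float, seed: int = 11) -> Iterator[Record]:
    """Keep each record independently with probability frac."""
    rng = random.Random(seed)
    for rec in records:
        if rng.random() < frac:
            yield rec


if __name__ == "__main__":
    # Hypothetical in-memory input: ten tiny reads.
    demo = []
    for k in range(10):
        demo += [f"@read{k}", "ACGT", "+", "IIII"]
    for rec in sample_exact(fastq_records(demo), 3):
        print("\n".join(rec))
```

In this sketch the exact-n mode holds only n records in memory, and the fraction mode holds one record at a time. Because every random decision depends only on the seed and the record index, running either function with the same seed on two files that have the same number of records selects the same positions in both.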