MuffinEC 2.0b
Contact: asalic@posgrado.upv.es

--------
Overview
--------

MuffinEc is an error correction tool for NGS datasets. It can handle multiple sequencing technologies because it corrects mismatches as well as insertions and deletions. It uses a greedy approach that groups reads sharing a certain number of kmers. Afterwards, MuffinEc refines each group using a custom version of the Smith-Waterman algorithm, generating multiple sequence alignment structures. Finally, it corrects each of these structures column by column: the base held by the majority of reads in a column wins.
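
The last step, majority voting on each column, can be illustrated with a minimal Python sketch. It is not MuffinEc's actual code: the function names and the row encoding ('-' for a gap, ' ' for a column the read does not cover) are assumptions made for this example. The 0.51 threshold mirrors the default of the --pmj option described under Usage.

```python
from collections import Counter


def correct_column(bases, pmj=0.51):
    """Return the majority base of one column, or None if no base reaches pmj."""
    # Positions where a read does not cover the column are passed as None.
    present = [b for b in bases if b is not None]
    if not present:
        return None
    base, count = Counter(present).most_common(1)[0]
    return base if count / len(present) >= pmj else None


def correct_alignment(rows, pmj=0.51):
    """Correct equal-length alignment rows ('-' = gap, ' ' = not covered)."""
    corrected = [list(r) for r in rows]
    for col in range(len(rows[0])):
        column = [r[col] if r[col] != ' ' else None for r in rows]
        winner = correct_column(column, pmj)
        if winner is None:
            continue  # no clear majority: leave the column untouched
        for r in corrected:
            if r[col] != ' ':
                r[col] = winner
    # Remove gap characters and uncovered padding to get the corrected reads.
    return [''.join(r).replace('-', '').strip() for r in corrected]
```

For example, correct_alignment(["ACGTAC", "ACGTAC", "ACCTAC"]) replaces the C in the third read with the G held by the other two reads.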

-------------------
System Requirements
-------------------

The software was built and tested under Debian and Ubuntu Linux x64. The program should compile and run under Mac OS X as it does under Linux. MuffinEc requires GCC 4.7.2 or above with support for OpenMP 3.0 or higher and C++11. We advise using a 64-bit system because the code makes heavy use of 64-bit integers.

-----------
Compilation
-----------

   - <version> denotes the version of the program (major followed by minor), e.g. "muffinec1_0.tar.gz"   
Linux (and Mac) Users:
   - download the archive file "muffinec<version>.tar.gz"
   - uncompress it in the current directory:
      tar -xvf muffinec<version>.tar.gz
   - enter the newly created directory:
      cd muffinec<version>/
   - run make to compile the code
      make
   - finally, an executable called "muffinec" will be created in the "build" directory
   
-----
Usage   
-----

MuffinEC allows the user to fine-tune many of its parameters. This way it can adapt to a variety of datasets without recompilation. There are only three mandatory parameters, namely the technology used, the input file path and the output file path. Below is a list of all available parameters.

MuffinEC creates the k-table in a separate step from the actual correction. As a result, the first argument must be the name of the step (either "ktbl" or "correct").

Programs:

****************
   muffinec ktbl
****************

      Mandatory parameters:

         -a, --fasta:   The full path of the FASTA file containing the reads to be corrected; Must be set if -q is not (exclusive parameter) 
                                        
         -q, --fastq:   The full path of the FASTQ file containing the reads to be corrected; Must be set if -a is not (exclusive parameter)
         
              
          
         --454:   The input data is from a Roche 454; Must be set if none of --illumina, --pac, --ion or --generic are set (exclusive parameter)
                  
         --illumina:   The input data is from Illumina technologies; Must be set if none of --454, --pac, --ion or --generic are set (exclusive parameter)
                  
         --pac:   The input data is from Pacific Biosciences; Must be set if none of --454, --illumina, --ion or --generic are set (exclusive parameter)
                     
         --ion:   The input data is from Ion Torrent; Must be set if none of --454, --illumina, --pac or --generic are set (exclusive parameter)
                     
         --generic:   Use this option if you don't want to set/don't know the source technology; Must be set if none of --454, --illumina, --pac or --ion are set (exclusive parameter)

                   
      Optional Parameters:  
         
         -k, --kmerlen:   The length of the kmer used to compute the coverage and to create the neighborhood [15]
         
         --minsn:   Minimum size of a neighborhood created by the greedy grouping method and further processed by the correction mechanism [3]

         --trimq:   Minimum accepted quality score for a base at the ends of a read; bases below this value are trimmed [43]

         --kmerq:   Minimum accepted quality score for each base in a kmer [0]

         -s, --hashk:   The hashing method used to store kmers; The possible options are: umap and vector [umap]
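
To make the role of --kmerlen and --kmerq more concrete, here is a minimal Python sketch of a k-mer table that maps each kmer to the reads containing it. It only illustrates the idea and is an assumption throughout: MuffinEc's real table layout, its file format and the names build_kmer_table and shared_kmers are not taken from the program.

```python
from collections import defaultdict


def build_kmer_table(reads, k=15, quals=None, min_q=0):
    """Map every kmer of length k to the set of read indices containing it.

    quals (optional) holds one list of per-base quality scores per read;
    a kmer is skipped when any of its bases scores below min_q.
    """
    table = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            if quals is not None and min(quals[read_id][i:i + k]) < min_q:
                continue  # low-quality kmer, analogous to the --kmerq filter
            table[seq[i:i + k]].add(read_id)
    return table


def shared_kmers(table, read_a, read_b):
    """Count the distinct kmers that two reads have in common."""
    return sum(1 for ids in table.values() if read_a in ids and read_b in ids)
```

Reads that share enough kmers according to such a table are candidates for the same neighborhood; a short k (for example k=3) is convenient when trying the sketch on toy reads.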

*******************
   muffinec correct
*******************

      Mandatory parameters:

         -a, --fasta:   The full path of the FASTA file containing the reads to be corrected; Must be set if -q is not (exclusive parameter) 
                                        
         -q, --fastq:   The full path of the FASTQ file containing the reads to be corrected; Must be set if -a is not (exclusive parameter)
                           
         -o, --output:  The name of the output file containing the corrected reads
         

                        
         --454:   The input data is from a Roche 454; Must be set if none of --illumina, --pac, --ion or --generic are set (exclusive parameter)
                  
         --illumina:   The input data is from Illumina technologies; Must be set if none of --454, --pac, --ion or --generic are set (exclusive parameter)
                  
         --pac:   The input data is from Pacific Biosciences; Must be set if none of --454, --illumina, --ion or --generic are set (exclusive parameter)
                     
         --ion:   The input data is from Ion Torrent; Must be set if none of --454, --illumina, --pac or --generic are set (exclusive parameter)
                     
         --generic:   Use this option if you don't want to set/don't know the source technology; Must be set if none of --454, --illumina, --pac or --ion are set (exclusive parameter)

                   
      Optional Parameters:

         -p, --threads:   Number of OpenMP threads used by the program [1]
         
         -m, --pmj:   The percent of reads which must have the same base on a column of the consensus [0.51]
         
         --almis:   SW Aligner mismatch score [-1]
         
         --almat:   SW Aligner match score [2]
         
         --algap:   SW Aligner gap opening score [-3]
         
         --algapext:   SW Aligner gap extending score [-1]

         --pswerr:   The maximum number of errors accepted by the Smith-Waterman algorithm when comparing a read with a subgroup; if the total number of differences between the read and the subgroup is greater than the limit set by this parameter, the read won't be added to the subgroup [0.1]

         --maxsn:   Maximum size of a neighborhood created by the grouping method and further processed by the correction mechanism [500]

         --perrov:   The percentage of errors found in the overlap region of two reads compared by the fast gapped kmer algorithm [0.02]

         --pmincomk:   Percentage of overlap between two reads, calculated relative to the shorter of the two [0.5]

         --pminovsw:   Minimum SW-determined overlap percentage of a read over the consensus for the read to be considered part of the consensus [0.0]

         --gencon:   Add this flag to generate the contigs for the consensus [false]

         --allowIndelsArg:   Add this flag to allow the SW algorithm to use indels when aligning two reads [false]
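
To show how the four SW aligner scores interact, here is a minimal Python sketch of a textbook Smith-Waterman local alignment score with affine gaps, using the defaults listed above (--almat 2, --almis -1, --algap -3, --algapext -1). It is not MuffinEc's custom aligner. The name sw_score is hypothetical, and the sketch assumes that the first position of a gap costs the opening score and every further position costs the extension score. Unlike the program's default, it always allows indels.

```python
def sw_score(a, b, match=2, mismatch=-1, gap_open=-3, gap_ext=-1):
    """Best local alignment score of a and b with affine gap penalties."""
    neg = float("-inf")
    m = len(b)
    prev_h = [0] * (m + 1)    # best score ending at (i-1, j)
    prev_f = [neg] * (m + 1)  # best score ending in a vertical gap at (i-1, j)
    best = 0
    for i in range(1, len(a) + 1):
        cur_h = [0] * (m + 1)
        cur_f = [neg] * (m + 1)
        e = neg  # best score ending in a horizontal gap at (i, j)
        for j in range(1, m + 1):
            # Either open a new gap or extend the one already running.
            e = max(cur_h[j - 1] + gap_open, e + gap_ext)
            cur_f[j] = max(prev_h[j] + gap_open, prev_f[j] + gap_ext)
            step = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell never drops below zero.
            cur_h[j] = max(0, prev_h[j - 1] + step, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best
```

With the default scores, two identical 4-base reads score 8, and a single inserted base between two matching 4-base blocks gives 8 - 3 + 8 = 13.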

-------
Example  
-------

To correct a 454 dataset, one must first generate the k-mer table and then start the correction.

./muffinec ktbl --454 -a test.fa
./muffinec correct --454 -p 6 -a test.fa -o test_corr.fa
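
The two steps can also be scripted. The Python sketch below builds both command lines from the flags documented in the Usage section, passing the technology flag to both steps because the parameter lists mark it as mandatory for each. The helper name build_commands, its arguments and the "./muffinec" path are assumptions made for this example.

```python
TECH_FLAGS = ("--454", "--illumina", "--pac", "--ion", "--generic")


def build_commands(reads_path, output_path, tech="--generic", fastq=False,
                   threads=1, binary="./muffinec"):
    """Return the argument lists for the "ktbl" and "correct" steps."""
    if tech not in TECH_FLAGS:
        raise ValueError("tech must be one of: " + ", ".join(TECH_FLAGS))
    # -a and -q are mutually exclusive: exactly one input flag is emitted.
    input_flag = "-q" if fastq else "-a"
    ktbl = [binary, "ktbl", tech, input_flag, reads_path]
    correct = [binary, "correct", tech, "-p", str(threads),
               input_flag, reads_path, "-o", output_path]
    return ktbl, correct
```

Each returned list can be passed to subprocess.run(cmd, check=True), running the "ktbl" command before the "correct" command.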

----------------------------
License and Third-Party Libs
----------------------------

MuffinEc uses TCLAP to parse the input parameters. The aforementioned library is released under The MIT License (http://opensource.org/licenses/mit-license.php). MuffinEc is licensed under LGPL3.

-----------
How to Cite
-----------

Alic, A. S., Tomas, A., Medina, I., & Blanquer, I. (2016). MuffinEc: Error correction for de Novo assembly via greedy partitioning and sequence alignment. Information Sciences, 329, 206-219.


   
Source: README.txt, updated 2016-04-12