Home / src
Name Modified Size InfoDownloads / Week
Parent folder
cope-src-v1.2.5.tgz 2015-11-11 406.2 kB
cope-src-v1.1.3.tgz 2012-10-12 400.9 kB
cope-src-v1.1.2.tgz 2012-10-08 395.1 kB
README 2012-10-08 6.4 kB
Totals: 4 Items   1.2 MB 0
Introdution
-----------------------
COPE (Connecting Overlapped Pair-End reads) is a method to align and connect the sequenced pair
reads of which the insert size is smaller than the sum of the two read length.

INSTALLATION
-----------------------
Download the package and run

tar -xzvf cope-1.1.2.tar.gz 
make (to build the executable file "cope")

In the compiled version, you can use the cope directly.

USAGE
-----------------------
./cope -a read1.fq -b read2.fq -o connect.fq -2 left1.fq -3 left2.fq -m 0 >cope.log 2>cope.error

Options:
 -a  <str>    query read1 file, (*.fq, *.fa, *.fq.gz, *.fa.gz)
 -b  <str>    query read2 file
 -o  <str>    output connected file in *.fq or *fa
 -2  <str>    output fail connected read1.fq
 -3  <str>    output fail connected read2.fq
 -l  <int>    lower bound overlap length, 8 or 10 is preferred for 100bp read with 170~180 insert 
              size, default=10
 -u  <int>    higher bound overlap length, 70 is preferred for 100 bp reads, default=70
 -c  <float>  match ratio cutoff, default=0.75
 -d  <float>  match max2_ratio/max_ratio, important for mode 0, default=0.7
 -B  <float>  ratio cut off value of Base-quality=2, default=0.9
 -N  <int>    N filter( 1 filte 0 not filter), default=1
 -T  <int>    read pair number threshold for parameter training or testing: 0
 -s  <int>    the smallest ASCII value used to represent base quality in fastq files. 33 for the 
              later Illumina platforms and Sanger reads, and 64 for the earlier Illumina platforms. 
              default=64
 -m  <int>    mode type: 0 simple connect; 1 k-mer frequency assisted connection; 2 Auxiliary reads 
              and cross connection for left reads from mode 1; 3 full mode ,default=3

 when set mode large than 0, set the following options:
 -k  <int>   Kmer size, default=17
 -t  <str>   compressed kmer frequency table
 -f  <str>   kmer frequency table len, for parallelly compressed kmer freq table
 -L  <int>   set the threshold of freq-low(Kmer with frequency lower than(<=) this threshold will 
             not be considered for base selection), default=3
 -M  <int>   set the threshold of freq-normal(Kmer whith frequency lower than(<=) this threshold 
             will not be considered for spanning Kmer selection), default=10
 -H  <int>   set the threshold of freq-high(Kmer whith frequency higher than(>=) this threshold 
             will not be considered for spanning Kmer selection, 2 times the average kmer frequency 
             is preferred), default=60
 -R  <int>   set the threshold of spanning k-mers number of extremely high frequency, threshold=3 
             is preferred for high repeat genome, set -1 for disable this threshold, default=-1
 -K  <str>   input pair-kmer file for mode=2, default=./pair.kmer.list.gz

 when set mode=2 or 3, you also need to set the following options:
 -D  <int>   set the batch number of output cross-read file(cross.read.inf*.gz), default=1
 -r  <int>   use connected reads(1) or raw reads(0) to find cross reads, default=0
 -F  <str>   input extra read-files list(each line for one read file) while need to find cross reads 
             with the help of other read-files
 -h          print help information


OUTPUT
-----------------------
##Alignment-based connection mode(mode 0): 
example: ./cope -a read1.fq -b read2.fq -o connect.fq -2 left1.fq -3 left2.fq -m 0 >cope.log 2>cope.error

creating file:
1. connect.fq: The connected reads in fastq format, the last number in the id line is the final read length.
2. left1.fq
3. left2.fq

4. cope.log 
#stat table:
Simple connection table:
total_pairs     connected_pairs connect_ratio(%)        low_quality_pairs low_quality_ratio(%)
150000  139859  93.2393 0       0

Kmer frequency based connection table:
total_pairs     connected_pairs connect_ratio(%)        low_quality_pairs       low_quality_ratio(%)
150000  139486  92.9907 0       0

Cross connect stat table:
total_pairs     no_marker       have_cross      have_cross_ratio(%)     have_cross_connect      cross_connect_ratio(%)
10514   3       10511   99.9715 10292   97.9165

5. cope.error

##k-mer frequency assisted connection(mode 1): 
example: ./cope -a read1.fq -b read2.fq -o connect.fq -2 left1.fq -3 left2.fq -m 1 -t kmer_table.cz -f kmer_table.cz.len 
         >cope.log 2>cope.error
1-5 files are the same as above.

6. pair.kmer.list.gz: pair kmers from pair end reads.
#format: 
pair_id kmer1 read1_len pos1 kmer2 read2_len pos2
1 GTTGTACGACGATAAAG 100 75 ACATAGCAATCCCTAGA 100 24

##Auxiliary reads and cross connection(mode 2):
example: ./cope -a read1.fq -b read2.fq -o connect.fq -2 left1.fq -3 left2.fq -m 2 -t kmer_table.cz -f kmer_table.cz.len 
         >cope.log 2>cope.error 
1-3 and 6 will not be re-created.

7. cross.read.inf*.gz: cross read information for each pair reads.
#format: If insert size is larger than sum of read length, the append_seq and qual will be "-", or else we will append the 
         sequence from the cross reads.
pair_id	insert_size	append_seq append_qual
917 210 TCGTTGACCG fffegdfffg

8. cross_connect.fq
9. cross_left_1.fq
10. cross_left_2.fq

When you run with full mode, the ten files will be created. The final connected files include the connect.fq 
and cross_connect.fq, the un-connected file include cross_left_1.fq and cross_left_2.fq.

PERFORMANCE
-----------------------
cope is a  fast tool for connecting the pair end reads with insert size shorter than the 
sum of their read length. For mode=0 to use simple method to connect pair reads, the memory 
will be near 1M. For mode=1 to connect with kmer frequency, the memory will be near 4G . 
For mode=3 to connect with cross reads, the memory will be related to the unconnected read 
pair number and the genome size. The running time for mode 0 and 1 is related to the read 
pair number linearly and the time consumption for simple mode is about half of that for mode 2.
For mode 2 and mode 3, all the reads will be checked to obtain the cross reads, so the 
running time will be longer than before. Because it is also related to the unconnected read 
number from step 1, so the connection ratio in step 1 will also related to the running time 
of mode 2 and mode 3.

COMMENTS/QUESTIONS/REQUESTS
-----------------------
Please send an e-mail to liubinghang@genomics.cn

Source: README, updated 2012-10-08