clumpify on paired reads
Hi Brian,
I am having trouble applying clumpify to HiSeq paired reads.
My command is:
clumpify.sh in=read1_1 in2=read1_2 out=c_read1_1 out2=c_read1_2 \
dupedist=2500 \
dedupe optical
After that, c_read1_1 appears to contain half as many reads as c_read1_2.
I understand from the manpage that unpair and repair can be applied, but I do not see how to use them:
Pairing/ordering parameters (for use with error-correction):
unpair=f    For paired reads, clump all of them rather than just
            read 1. Destroys pairing. Without this flag, for paired
            reads, only read 1 will be error-corrected.
repair=f    After clumping and error-correction, restore pairing.
            If groups>1 this will sort by name which will destroy
            clump ordering; with a single group, clumping will
            be retained.
Should I add 'unpair=t' and/or 'repair=t'?
Also, can I run error correction on both reads of a pair based on the optical duplicate cluster, and produce only one pair of consensus paired reads per cluster? (Command examples would be welcome here too.)
Could you please comment on how to clean both reads of the pair and still obtain paired output?
Thanks
Stephane
using v38.32
Weird!
In fact, the number of reads (counted with "zgrep -c '^@'") is the same in both files, but the compressed c_read1_1 file is 50% of the size of c_read1_2.
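A side note on the read count above: a FASTQ quality line can also begin with '@', so counting lines that match '^@' may overcount. A minimal sketch of a safer count follows, assuming standard 4-line FASTQ records (no wrapped sequence lines). The function name count_fastq_reads and the demo data are made up for illustration.

```shell
#!/usr/bin/env bash
# Count FASTQ records as (number of lines / 4).
# Assumption: every record is exactly 4 lines (no wrapped sequences).
count_fastq_reads() {
    local file=$1 lines
    # Read via stdin so the file does not need a .gz suffix.
    lines=$(gzip -cd < "$file" | wc -l)
    echo $(( lines / 4 ))
}

# Hypothetical demo data: the quality string of the second record
# starts with '@', which is legal in FASTQ.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n@III\n' | gzip -c > "$demo"

count_fastq_reads "$demo"               # prints 2
gzip -cd < "$demo" | grep -c '^@'       # prints 3 (overcounts)
rm -f "$demo"
```

If both files give the same count this way, the pairing is intact and only the compressed size differs.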
Hi Stephane,
This is not surprising; the reads are clumped by read 1, and read 2 is just along for the ride, written out in the same order as read 1. As such, when you have 2 files and variable insert size, file 1 will compress much better than file 2. I had not personally noticed this since I work with interleaved files.
You do not want to use "unpair" and "repair" unless you are doing error-correction. If you ARE doing error-correction, then yes, add those flags.
Honestly, I'm not entirely sure of the impact of doing certain operations in conjunction with each other, like "unpair" + "dedupe". That's probably a bad idea since then duplicates will be found based on single reads, and then some reads will be lost so they can't be re-paired, etc. If you want to do duplicate removal and error-correction, I'd run 2 passes, for example:
clumpify.sh in=reads.fq out=deduped.fq dedupe optical
clumpify.sh in=deduped.fq out=ecc.fq ecc unpair repair
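For two separate files rather than one interleaved file, the same two passes might look like the sketch below. It only reuses flags already mentioned in this thread (in, in2, out, out2, dedupe, optical, dupedist, ecc, unpair, repair). That "unpair repair" behaves the same with in2/out2 as with interleaved input is an assumption, not something tested here. The run and two_pass helpers and the dd_/ecc_ output prefixes are hypothetical. By default the script only prints the commands; set RUN=1 to execute them.

```shell
#!/usr/bin/env bash
# Two-pass sketch: remove optical duplicates first, then error-correct.
set -eu

# Print the command; execute it only when RUN=1 and clumpify.sh is on PATH.
run() {
    echo "$*"
    if [ "${RUN:-0}" = "1" ] && command -v clumpify.sh >/dev/null 2>&1; then
        "$@"
    fi
}

# Arguments are plain file names in the current directory
# (the prefixes below would break paths containing directories).
two_pass() {
    local r1=$1 r2=$2
    # Pass 1: optical duplicate removal; pairing is kept.
    run clumpify.sh in="$r1" in2="$r2" out="dd_$r1" out2="dd_$r2" \
        dedupe optical dupedist=2500
    # Pass 2: error-correct both mates (unpair), then restore pairing (repair).
    run clumpify.sh in="dd_$r1" in2="dd_$r2" out="ecc_$r1" out2="ecc_$r2" \
        ecc unpair repair
}

# File names taken from the original question.
two_pass read1_1 read1_2
```

Comparing read counts between the two output files after each pass is a cheap way to confirm that pairing survived.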
You cannot explicitly error-correct only duplicate clusters. But by default, the highest-quality pair should be retained when duplicates are found.
Hi again Brian,
Thanks for the explanations, I will try both and compare.
Best,
S