#11 clumpify on paired reads

Milestone: 1.0

Status: open

Owner: nobody

Labels: clumpify (3)

Updated: 2018-11-27

Created: 2018-11-23

Creator: Stephane Plaisance

Private: No

Hi Brian,

I have trouble applying clumpify to HiSeq paired reads.

My command is:

clumpify.sh in=read1_1 in2=read1_2 out=c_read1_1 out2=c_read1_2 \
    dupedist=2500 \
    dedupe optical

After that, c_read1_1 is only half as large as c_read1_2.

I understand from the manpage that unpair and repair can be applied, but I do not understand how:

Pairing/ordering parameters (for use with error-correction):
unpair=f            For paired reads, clump all of them rather than just
                    read 1.  Destroys pairing.  Without this flag, for paired
                    reads, only read 1 will be error-corrected.
repair=f            After clumping and error-correction, restore pairing.
                    If groups>1 this will sort by name which will destroy
                    clump ordering; with a single group, clumping will
                    be retained.

Should I add 'unpair=t' and/or 'repair=t'?

Also, can I operate error correction on both reads of a pair based on the optical duplicate cluster and produce only one pair of consensus paired-reads per cluster? (command examples would be welcome here too)

Could you please comment on how to clean both reads of a pair and still obtain paired output?

Thanks
Stephane

Discussion

Stephane Plaisance - 2018-11-23

using v38.32

Stephane Plaisance - 2018-11-23

Weird!
In fact, the number of reads (zgrep -c '^@') is the same, but the compressed file c_read1_1 is 50% of the size of c_read1_2.
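
A side note on the count above: `zgrep -c '^@'` can overcount, because a FASTQ quality line may also begin with `@`. A minimal sketch of a count based on the number of lines instead, assuming standard four-line FASTQ records (no wrapped sequences); `count_reads` is a hypothetical helper name, not part of BBMap:

```shell
# Count FASTQ records as (number of lines) / 4.
# Assumes four-line records; accepts plain or gzip-compressed files.
count_reads() {
    if gzip -t "$1" 2>/dev/null; then
        # File passes the gzip integrity test: decompress to stdout and count.
        n=$(gzip -dc "$1" | wc -l)
    else
        n=$(wc -l < "$1")
    fi
    echo $(( n / 4 ))
}
```

Calling `count_reads c_read1_1` and `count_reads c_read1_2` should then print the same number if the pairing is intact.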

Brian Bushnell - 2018-11-26

Hi Stephane,

This is not surprising; the reads are clumped by read 1, and read 2 is just along for the ride, getting put out in the same order as read 1. As such, when you have 2 files, file 1 will compress much better when you have variable insert size. I had not personally noticed this since I work with interleaved files.
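
To confirm that read 2 really is written in the same order as read 1, the read names of the two output files can be compared record by record. A minimal sketch, assuming uncompressed four-line FASTQ and names that match once a trailing `/1` or `/2` and anything after the first space are dropped; `read_names` and `check_pairing` are hypothetical helpers, not BBMap tools:

```shell
# Print the read name from each FASTQ header (lines 1, 5, 9, ...),
# dropping the leading '@', any comment after the first space,
# and a trailing /1 or /2.
read_names() {
    awk 'NR % 4 == 1 { name = substr($1, 2); sub(/\/[12]$/, "", name); print name }' "$1"
}

# Succeed only if both files list the same names in the same order.
check_pairing() {
    tmp=$(mktemp -d) || return 2
    read_names "$1" > "$tmp/1"
    read_names "$2" > "$tmp/2"
    cmp -s "$tmp/1" "$tmp/2"
    status=$?
    rm -rf "$tmp"
    return "$status"
}
```

For example, `check_pairing c_read1_1 c_read1_2 && echo "pairing intact"` (decompress first if the files are gzipped).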

You do not want to use "unpair" and "repair" unless you are doing error-correction. If you ARE doing error-correction, then yes, add those flags.

Honestly, I'm not entirely sure of the impact of doing certain operations in conjunction with each other, like "unpair" + "dedupe". That's probably a bad idea since then duplicates will be found based on single reads, and then some reads will be lost so they can't be re-paired, etc. If you want to do duplicate removal and error-correction, I'd run 2 passes, for example:

clumpify.sh in=reads.fq out=deduped.fq dedupe optical
clumpify.sh in=deduped.fq out=ecc.fq ecc unpair repair

You cannot explicitly error-correct only duplicate clusters. But by default, the highest-quality pair should be retained when duplicates are found.
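
For input split over two files, as in the original question, the same two passes might look like the sketch below. It only reuses flags that appear in this thread (`in2`, `out2`, `dedupe`, `optical`, `dupedist`, `ecc`, `unpair`, `repair`). Whether `repair` restores pairing as expected with two output files has not been verified here, so treat that as an assumption and check the result. The `run` and `dedupe_then_ecc` helpers and the output file names are hypothetical.

```shell
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: dedupe_then_ecc READ1 READ2 PREFIX
# Pass 1: remove optical duplicates while the reads are still paired.
# Pass 2: error-correct both reads (unpair), then restore pairing (repair).
dedupe_then_ecc() {
    r1=$1; r2=$2; prefix=$3
    run clumpify.sh in="$r1" in2="$r2" \
        out="${prefix}_dedup_1.fq" out2="${prefix}_dedup_2.fq" \
        dedupe optical dupedist=2500 || return 1
    run clumpify.sh in="${prefix}_dedup_1.fq" in2="${prefix}_dedup_2.fq" \
        out="${prefix}_ecc_1.fq" out2="${prefix}_ecc_2.fq" \
        ecc unpair repair
}
```

Running `DRY_RUN=1` first shows the two command lines without touching any data; keeping the deduplicated intermediate files makes it possible to compare the result of each pass separately.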

Stephane Plaisance - 2018-11-27

Hi again Brian,
Thanks for the explanations, I will try both and compare.
best
S
