Whole-Genome Shotgun Assembler / Support Requests / #23 how to figure how many subread was created

Sergey Koren - 2015-11-02

It looks like you're using an internal log file (asm.[0-9]*.log). There should be a <libraryname>.log file in your top-level directory with a header on the first line matching the description in the documentation. In your file the 4th and 5th lines state the region of the original read being corrected.

However, in the current version, there is always only one region per subread that gets corrected by consensus. This is why you see all the IDs appear only once. The consensus module can split a sequence due to low support in which case it will add the information to the header line of the file. For example if a read h_1 coming from positions 1072 to 6700 of a subread was corrected and split into mulitple segments, your corrected fasta file would have reads
h_1_1 and h_1_2 or h_1/<pos1begin><pos1end> and h_1<pos2begin><pos2end>, depending on the consensus module used.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

medhat - 2015-11-03

I am Using the right file as sugested from documentation in such case it is ecoli_assembly.log

and it was based on this command

(date ; echo -e "\n========\n"; time ~/source/wgs-8.3rc2/wgs-8.3rc2/Linux-amd64/bin/PBcR -libraryname ecoliassembly -s /home/medhat/source/wgs-8.3rc2/wgs-8.3rc2/Linux-amd64/bin/pacbio.spec -fastq /data/testdatafromservermaize/pbcr/ecoli/ecolipacbio.fa0001.fastq /data/testdatafromservermaize/pbcr/ecoli/ecoli.frg -length 200 -partitions 200 -threads 11 -genomeSize 4641652 -noclean ) > run.log.date +%F 2>&1

no assemply where done
the log file can be found here

http://textuploader.com/5rslc

just correction

and yes it have the header

INPUT_NAME OUTPUT_NAME SUBREAD START END LENGTH
S1_1217 ecoli_assembly_2015073_1 1 2 521 519
S1_1217 ecoli_assembly_2015073_2 2 689 1077 388
S1_1217 ecoli_assembly_2015073_5 5 1432 1812 380
S1_1217 ecoli_assembly_2015073_7 7 1942 2324 382
S1_1217 ecoli_assembly_2015073_8 8 2330 2754 424
S1_1217 ecoli_assembly_2015073_9 9 3016 3470 454
S1_1217 ecoli_assembly_2015073_10 10 3528 3876 348
S1_1217 ecoli_assembly_2015073_13 13 4401 5139 738
S1_1217 ecoli_assembly_2015073_14 14 5174 5492 318
S1_1217 ecoli_assembly_2015073_15 15 6037 6527 490
S1_1217 ecoli_assembly_2015073_16 16 6615 7063 448

but I can not understand it will , for my read S1_1217 now I have 11 row for the same and the third coulm which is the subread count it is different from 1 till 16.
how can I count the subread and how to know which one of those will be in my fastaq file?

so that I can say at the end that this read S1_1217 was broken to this number "n" of reads

from what you said here
"For example if a read h1 coming from positions 1072 to 6700 of a subread was corrected and split into mulitple segments, your corrected fasta file would have reads
h11 and h12 or h1/<pos1begin><pos1end> and h1<pos2begin><pos2end>, depending on the consensus module "

do my read S1_1217 was broken to 11 subreads
and there names is

OUTPUT_NAME
ecoli_assembly_2015073_1
ecoli_assembly_2015073_2
ecoli_assembly_2015073_5
ecoli_assembly_2015073_7
ecoli_assembly_2015073_8
ecoli_assembly_2015073_9
ecoli_assembly_2015073_10
ecoli_assembly_2015073_13
ecoli_assembly_2015073_14
ecoli_assembly_2015073_15
ecoli_assembly_2015073_16

am I understanding right?

and If I understand right all this reads should be in fastq file ?! but when I grep

grep ecoliassembly2015073 ecoliassembly.fastq

@ecoli_assembly_2015073_3/0_89
@ecoli_assembly_2015073_4/0_88
@ecoli_assembly_2015073_2/21_237
@ecoli_assembly_2015073_1/10_486
@ecoli_assembly_2015073_5/107_187
@ecoli_assembly_2015073_8/0_86
@ecoli_assembly_2015073_7/0_155
@ecoli_assembly_2015073_9/0_97
@ecoli_assembly_2015073_10/0_92
@ecoli_assembly_2015073_13/4_176
@ecoli_assembly_2015073_14/0_91
@ecoli_assembly_2015073_15/0_91
@ecoli_assembly_2015073_16/86_179

but

@ecoli_assembly_2015073_3/0_89
@ecoli_assembly_2015073_4/0_88

not in the log file

and now I have two versions one from log
S1_1217 ecoli_assembly_2015073_1 1 2 521 519
and another in fastq file
@ecoli_assembly_2015073_1/10_486

Thanks a lot for your time

Last edit: medhat 2015-11-23

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sergey Koren - 2015-11-16

Ah, OK I was confused since the reads were so split up which is surprising and seems like the Illumina data couldn't be mapped to a large fraction of your data. If you have sufficient coverage (>20X), I'd suggest self-correction instead of correction with Illumina data as I would expect it would split the sequences less.

There should be a minimum length filter on what gets recorded in the log file which in your run is 200bp in your run so that is why the first two are missing from the log file. It looks like the consensus module ignored that threshold and output the corrected read anyway leading to the discrepancy. The log file is the pre-consensus length of the sequence, that is, this is the part of the PacBio sequence that has good alignments that gets passed to consensus. The consensus step can futher refine that alignmenta and trim the sequence if needed. You can see that in the numbers after the slash (21_237 for sample for read 2). In this case there are alignments covering positions 689 1077 in your pacbio sequence, ideally generating a read that is 388 bp in length. However, the consensus step only had sufficient support for positions 21 to 237 of that so it really corrected 689+21 = 710 to 689 + 227 = 916 of your original read.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

medhat - 2015-11-23

Now It is clear, but I think the end should be 689 + 237 = 926 instead of 689 + 227 = 916
Right?

And now there is question. there is reads in the corrected fasta file that does not exist in the log file, how can I know there original read " from which read they where corrected ? "

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sergey Koren - 2015-11-27

Yes, it should be 237. Unfortunately if a read is not in the log you cannot easily determine which uncorrected read it came from. The easiest would be mapping those reads to your raw sequences. Another option, the read name should contain which # raw sequence it was in your input (so ecoli_assembly_2015073_1 is the 2015073rd read in the raw fastq). However, you wouldn't have the positions within the raw sequence in this case.

Last edit: Sergey Koren 2015-11-27

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

how to figure how many subread was created

Group

Searches

Help

#23 how to figure how many subread was created

Discussion