Menu

#23 how to figure how many subread was created

v1.0_(example)
open
None
5
2015-11-27
2015-11-02
medhat
No

here is the result fo correction of some read

S1_1217 ecoli_assembly_2015073_1 1 2 521 519
S1_1217 ecoli_assembly_2015073_2 2 689 1077 388
S1_1217 ecoli_assembly_2015073_5 5 1432 1812 380
S1_1217 ecoli_assembly_2015073_7 7 1942 2324 382
S1_1217 ecoli_assembly_2015073_8 8 2330 2754 424
S1_1217 ecoli_assembly_2015073_9 9 3016 3470 454
S1_1217 ecoli_assembly_2015073_10 10 3528 3876 348
S1_1217 ecoli_assembly_2015073_13 13 4401 5139 738
S1_1217 ecoli_assembly_2015073_14 14 5174 5492 318
S1_1217 ecoli_assembly_2015073_15 15 6037 6527 490
S1_1217 ecoli_assembly_2015073_16 16 6615 7063 448

according to documentation
The first column is the input name of the PacBio RS sequencing read which was corrected. The second column is the name of the output corrected sequence. The third column indicates if the input PacBio RS sequence was split into more than a single fragment

first all the first column is the same sequence is this mean it was splitted to 11 "number of rows" or shall I add all the numbers in the third column ? and why the end of second column always equals to the third column ?

Discussion

  • Sergey Koren

    Sergey Koren - 2015-11-02

    It looks like you're using an internal log file (asm.[0-9]*.log). There should be a <libraryname>.log file in your top-level directory with a header on the first line matching the description in the documentation. In your file the 4th and 5th lines state the region of the original read being corrected.

    However, in the current version, there is always only one region per subread that gets corrected by consensus. This is why you see all the IDs appear only once. The consensus module can split a sequence due to low support in which case it will add the information to the header line of the file. For example if a read h_1 coming from positions 1072 to 6700 of a subread was corrected and split into mulitple segments, your corrected fasta file would have reads
    h_1_1 and h_1_2 or h_1/<pos1begin><pos1end> and h_1<pos2begin><pos2end>, depending on the consensus module used.

     
  • medhat

    medhat - 2015-11-03

    I am Using the right file as sugested from documentation in such case it is ecoli_assembly.log

    and it was based on this command

    (date ; echo -e "\n========\n"; time ~/source/wgs-8.3rc2/wgs-8.3rc2/Linux-amd64/bin/PBcR -libraryname ecoliassembly -s /home/medhat/source/wgs-8.3rc2/wgs-8.3rc2/Linux-amd64/bin/pacbio.spec -fastq /data/testdatafromservermaize/pbcr/ecoli/ecolipacbio.fa0001.fastq /data/testdatafromservermaize/pbcr/ecoli/ecoli.frg -length 200 -partitions 200 -threads 11 -genomeSize 4641652 -noclean ) > run.log.date +%F 2>&1

    no assemply where done
    the log file can be found here

    http://textuploader.com/5rslc

    just correction

    and yes it have the header

    INPUT_NAME OUTPUT_NAME SUBREAD START END LENGTH
    S1_1217 ecoli_assembly_2015073_1 1 2 521 519
    S1_1217 ecoli_assembly_2015073_2 2 689 1077 388
    S1_1217 ecoli_assembly_2015073_5 5 1432 1812 380
    S1_1217 ecoli_assembly_2015073_7 7 1942 2324 382
    S1_1217 ecoli_assembly_2015073_8 8 2330 2754 424
    S1_1217 ecoli_assembly_2015073_9 9 3016 3470 454
    S1_1217 ecoli_assembly_2015073_10 10 3528 3876 348
    S1_1217 ecoli_assembly_2015073_13 13 4401 5139 738
    S1_1217 ecoli_assembly_2015073_14 14 5174 5492 318
    S1_1217 ecoli_assembly_2015073_15 15 6037 6527 490
    S1_1217 ecoli_assembly_2015073_16 16 6615 7063 448

    but I can not understand it will , for my read S1_1217 now I have 11 row for the same and the third coulm which is the subread count it is different from 1 till 16.
    how can I count the subread and how to know which one of those will be in my fastaq file?

    so that I can say at the end that this read S1_1217 was broken to this number "n" of reads

    from what you said here
    "For example if a read h1 coming from positions 1072 to 6700 of a subread was corrected and split into mulitple segments, your corrected fasta file would have reads
    h11 and h12 or h1/<pos1begin><pos1end> and h1<pos2begin><pos2end>, depending on the consensus module "

    do my read S1_1217 was broken to 11 subreads
    and there names is

    OUTPUT_NAME
    ecoli_assembly_2015073_1
    ecoli_assembly_2015073_2
    ecoli_assembly_2015073_5
    ecoli_assembly_2015073_7
    ecoli_assembly_2015073_8
    ecoli_assembly_2015073_9
    ecoli_assembly_2015073_10
    ecoli_assembly_2015073_13
    ecoli_assembly_2015073_14
    ecoli_assembly_2015073_15
    ecoli_assembly_2015073_16

    am I understanding right?

    and If I understand right all this reads should be in fastq file ?! but when I grep

    grep ecoliassembly2015073 ecoliassembly.fastq

    @ecoli_assembly_2015073_3/0_89
    @ecoli_assembly_2015073_4/0_88
    @ecoli_assembly_2015073_2/21_237
    @ecoli_assembly_2015073_1/10_486
    @ecoli_assembly_2015073_5/107_187
    @ecoli_assembly_2015073_8/0_86
    @ecoli_assembly_2015073_7/0_155
    @ecoli_assembly_2015073_9/0_97
    @ecoli_assembly_2015073_10/0_92
    @ecoli_assembly_2015073_13/4_176
    @ecoli_assembly_2015073_14/0_91
    @ecoli_assembly_2015073_15/0_91
    @ecoli_assembly_2015073_16/86_179

    but

    @ecoli_assembly_2015073_3/0_89
    @ecoli_assembly_2015073_4/0_88

    not in the log file

    and now I have two versions one from log
    S1_1217 ecoli_assembly_2015073_1 1 2 521 519
    and another in fastq file
    @ecoli_assembly_2015073_1/10_486

    Thanks a lot for your time

     

    Last edit: medhat 2015-11-23
  • Sergey Koren

    Sergey Koren - 2015-11-16

    Ah, OK I was confused since the reads were so split up which is surprising and seems like the Illumina data couldn't be mapped to a large fraction of your data. If you have sufficient coverage (>20X), I'd suggest self-correction instead of correction with Illumina data as I would expect it would split the sequences less.

    There should be a minimum length filter on what gets recorded in the log file which in your run is 200bp in your run so that is why the first two are missing from the log file. It looks like the consensus module ignored that threshold and output the corrected read anyway leading to the discrepancy. The log file is the pre-consensus length of the sequence, that is, this is the part of the PacBio sequence that has good alignments that gets passed to consensus. The consensus step can futher refine that alignmenta and trim the sequence if needed. You can see that in the numbers after the slash (21_237 for sample for read 2). In this case there are alignments covering positions 689 1077 in your pacbio sequence, ideally generating a read that is 388 bp in length. However, the consensus step only had sufficient support for positions 21 to 237 of that so it really corrected 689+21 = 710 to 689 + 227 = 916 of your original read.

     
  • medhat

    medhat - 2015-11-23

    Now It is clear, but I think the end should be 689 + 237 = 926 instead of 689 + 227 = 916
    Right?

    And now there is question. there is reads in the corrected fasta file that does not exist in the log file, how can I know there original read " from which read they where corrected ? "

     
  • Sergey Koren

    Sergey Koren - 2015-11-27

    Yes, it should be 237. Unfortunately if a read is not in the log you cannot easily determine which uncorrected read it came from. The easiest would be mapping those reads to your raw sequences. Another option, the read name should contain which # raw sequence it was in your input (so ecoli_assembly_2015073_1 is the 2015073rd read in the raw fastq). However, you wouldn't have the positions within the raw sequence in this case.

     

    Last edit: Sergey Koren 2015-11-27

Log in to post a comment.

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.