MateAn

2016-06-14
2016-07-02
  • Jonathan Featherston

    Hi There

    Big fan of your work. Hopefully we'll be purchasing a PacBio Sequel soon to get "near-perfect de novo assemblies".

    Right now I am working on a few plant genomes sequenced with Illumina data and I'm struggling to get Metassembler to work. I used it a while back (over a year ago) and got great results with it but now I seem to be stymied.

    It seems that my problem happens at the MateAn step, as a matean file for the secondary assembly isn't produced. I have run MateAn on its own and the same thing happens. I get no errors, but it just produces an empty *.sort.bedpe.gz file.

    This seems to happen with all my assemblies now (different species/assemblies). The report file just states "Writing Sort bedpe" but doesn't produce a result. With the metassemble pipeline, the program runs well beyond this stage but throws errors later, such as "No ce-stat data no? No can do" in the error file under the M1 folder. It has taken me a while to figure out where the problem is because the pipeline runs beyond this point (I'm combining multiple assemblies). The first assembly is with Allpaths and this works fine (there is a .matean file in the folder for that genome), but it produces no results for the secondary assemblies (Abyss and Masurca). All assemblies have been renamed to have fasta headers like ">scaffold.1", i.e. the Allpaths, Masurca and Abyss assemblies all have the same style of header, running from scaffold.1 to scaffold.N.

    I have run the sample dataset and it produced no errors and ran to completion, so I feel like there is something wrong with my data, but I can't see what. The headers are very simple, although they are the same across assemblies, i.e. there is a scaffold.1 in each assembly.

    Regards
    Jonathan
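
Since the pipeline keeps running past an empty *.sort.bedpe.gz file, a small guard can catch the problem before later steps fail. This is only a minimal sketch: the function name `count_gzip_records` is hypothetical and not part of Metassembler, and it assumes the output is ordinary gzip-compressed text with one record per line.

```python
import gzip
import os


def count_gzip_records(path):
    """Return the number of non-blank lines in a gzip-compressed text file.

    A zero-byte file (not even a gzip header) is reported as 0 records.
    """
    if os.path.getsize(path) == 0:
        return 0
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.strip())
```

Calling it on each assembly's *.sort.bedpe.gz right after the MateAn step and stopping when it returns 0 would point at the failing assembly directly, rather than at a later error in the merge folders.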

     
    • Michael Schatz

      Michael Schatz - 2016-06-15

      Hi Jonathan,

      Thanks for your comments. Very weird results: if it works on the sample
      data, then it must be something about your dataset or environment. It might
      be something mundane like running out of memory or disk. Did you check
      that MateAn didn't crash? It sometimes helps to run it inside gdb so that
      it is clear whether it exits cleanly or not. Otherwise, maybe try downsampling
      the assemblies and/or reads to a smaller test set? Sorry I can't be more
      helpful, but if you get totally stuck maybe you could send me some of the
      data to try locally.

      Good luck

      Mike

       
    • Alejandro Hernandez Wences

      Hi,

      Thanks for your interest.

      Could you check the reports from the alignment steps, just to make sure that
      the problem is not in the alignment step itself rather than in the mateAn
      step? You can take a look at the *.mtp.err files under the corresponding
      BWTaln directories of each input assembly to get an overview of the
      alignment rates.

      Wences

       
  • Jonathan Featherston

    Hi There

    Thanks for the really quick replies. I didn't get any errors from mateAn but I did "nohup" it so let me run it again with the program in the foreground. The nohup file has no error though. I have looked at both the ".mtp.bam.err" file and the ".mtp.err" file.

    The "mtp.err" file contains a lot of warnings about reads being too short, but otherwise it appears OK, although the overall mapping rate is poor.

    "
    93312431 reads; of these:
    93312431 (100.00%) were paired; of these:
    80928977 (86.73%) aligned concordantly 0 times
    10844552 (11.62%) aligned concordantly exactly 1 time
    1538902 (1.65%) aligned concordantly >1 times
    ----
    80928977 pairs aligned concordantly 0 times; of these:
    14251925 (17.61%) aligned discordantly 1 time
    ----
    66677052 pairs aligned 0 times concordantly or discordantly; of these:
    133354104 mates make up the pairs; of these:
    29190956 (21.89%) aligned 0 times
    35338098 (26.50%) aligned exactly 1 time
    68825050 (51.61%) aligned >1 times
    84.36% overall alignment rate

    "

    I wonder if it is a resource problem, although I doubt it... I ran it on a node with 65 GB of memory. The cluster just got an extra 180 TB of storage, so that isn't a problem. Could the memory be an issue? I could run it on a node with 500 GB.

    Regards
    Jonathan
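
For reference, the 84.36% overall rate in the summary quoted in the post above can be reproduced from the individual counts. The sketch below assumes the summary follows the Bowtie 2 convention (its layout matches that report): every mate of a concordantly or discordantly aligned pair counts as aligned, plus the mates of the remaining pairs that aligned on their own, divided by the total number of mates. The function name is hypothetical.

```python
def overall_alignment_rate(total_pairs, concordant_once, concordant_multi,
                           discordant_once, mates_once, mates_multi):
    """Percentage of individual mates that aligned at least once."""
    # Both mates of every concordant or discordant pair are aligned.
    mates_in_aligned_pairs = 2 * (concordant_once + concordant_multi + discordant_once)
    # Mates from the remaining pairs that aligned individually.
    mates_aligned_alone = mates_once + mates_multi
    return 100.0 * (mates_in_aligned_pairs + mates_aligned_alone) / (2 * total_pairs)


# Counts copied from the quoted summary.
rate = overall_alignment_rate(
    total_pairs=93312431,
    concordant_once=10844552,
    concordant_multi=1538902,
    discordant_once=14251925,
    mates_once=35338098,
    mates_multi=68825050,
)
print(f"{rate:.2f}% overall alignment rate")
```

The overall figure is high mainly because of mates that align individually. Only 11.62% + 1.65% = 13.27% of pairs align concordantly, and that is the number that matters for a mate-pair analysis.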

     
  • Jonathan Featherston

    Hi There

    So I ran mateAn on our 500gb fat node and it gave me a segmentation fault (core dumped).

    I'm surprised by this really! If needed I can put the data up onto a bigger cluster (one of our national facilities). But it kind of surprises me that this process would crash because of memory. Is this a highly memory intensive process?

    Kind Regards
    Jonathan

     
    • Michael Schatz

      Michael Schatz - 2016-06-15

      Can you confirm that the same data runs correctly on the big node?
      Sometimes there can be weird cross platform compatibility issues.
      Otherwise, we may need to dig into what is causing the crash. Can you next
      test if the first 1000 or first 10000 reads cause the crash?

      Thanks!
      Mike
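
One way to build the small test set suggested above is to copy the first N records from each FASTQ file. This is a minimal sketch with hypothetical helper names; it assumes the reads are in four-line FASTQ records and that paired files list mates in the same order, so taking the same N from each file keeps the pairs in sync.

```python
import gzip
from itertools import islice


def open_text(path, mode):
    """Open plain or gzip-compressed text, chosen by file extension."""
    if path.endswith(".gz"):
        return gzip.open(path, mode + "t")
    return open(path, mode)


def head_fastq(src, dst, n_reads):
    """Copy the first n_reads FASTQ records from src to dst.

    Assumes four lines per record. Returns the number of records written.
    """
    n_lines = 0
    with open_text(src, "r") as fin, open_text(dst, "w") as fout:
        for line in islice(fin, 4 * n_reads):
            fout.write(line)
            n_lines += 1
    return n_lines // 4
```

Running it with 1000 and then 10000 on both mate files, realigning, and rerunning mateAn on the result would show whether the crash depends on the amount of data or on particular reads.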

       
    • Alejandro Hernandez Wences

      Hi,

      Though it is still an option I really doubt that the problem is a lack of
      memory.

      It is possible that all reads are being filtered out because of the mapping
      quality (I noticed that only 11.62% map concordantly). Could you run:

      samtools view -q 20 *.mtp.bam | wc

      on your *.mtp.bam files just to see how many have a mapping quality of at
      least 20 (this is the minimum mapping quality by default)?

      Also, among these, please verify that the pairs do align to the same
      scaffold. You can use:

      samtools view -q 20 *.mtp.bam | less

      To get a quick glimpse.

      The code should check for these kinds of cases and I just noticed that it
      doesn't. Thanks for pointing it out (even if this might not be the issue here).

      Thanks a lot.
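
Rather than eyeballing the output in less, the two checks above can be counted in one pass over the SAM text. The sketch below is an illustration only (the function name is hypothetical and this is not mateAn's own filter). It relies on the standard SAM columns: RNAME is field 3, MAPQ is field 5, and RNEXT is field 7, where "=" means the mate is on the same reference.

```python
def summarize_pairs(sam_lines, min_mapq=20):
    """Count records passing the MAPQ cutoff, and how many of those
    have their mate on the same scaffold.

    sam_lines: iterable of SAM text lines, e.g. the output of `samtools view`.
    Returns (records_passing, records_with_mate_on_same_scaffold).
    """
    passing = 0
    same_scaffold = 0
    for line in sam_lines:
        # Skip blank lines and header lines.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        rname, mapq, rnext = fields[2], int(fields[4]), fields[6]
        if mapq < min_mapq:
            continue
        passing += 1
        if rnext == "=" or rnext == rname:
            same_scaffold += 1
    return passing, same_scaffold
```

To use it on real data, pass it the lines produced by `samtools view` for one BAM file (for example by iterating over `sys.stdin` in a small script). If the first number is large but the second is close to zero, nearly all pairs span different scaffolds.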

       
  • Jonathan Featherston

    Hi there.

    Sorry about my slow response. I think mails from SourceForge are getting filtered into my junk mail and I've been away for a little while. I will check the SAM file and try downscaling the number of reads.

    Regards
    Jonathan

     
  • Jonathan Featherston

    Hi Alejandro

    Samtools view of q20 mappings doesn't seem too bad:

    76549018 1516943537 27630563317

    Regards
    Jonathan
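
The three numbers printed by wc are lines, words and bytes, so the first one (76549018) is the number of alignment records with a mapping quality of at least 20. As a rough sketch, and assuming this BAM is the one behind the alignment summary quoted earlier in the thread (93312431 pairs) with one record per mate, the share of mates passing the filter works out as follows. The function name is hypothetical.

```python
def fraction_passing(records_passing, total_pairs):
    """Fraction of mates (two per pair) whose record passed the filter.

    Assumes one alignment record per mate, i.e. no secondary or
    supplementary records in the BAM file.
    """
    return records_passing / (2 * total_pairs)


share = fraction_passing(76549018, 93312431)
print(f"{share:.1%} of mates pass -q 20")
```

Under those assumptions, roughly four in ten mates pass the default cutoff, so the mapping-quality filter alone would not leave mateAn with an empty input.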

     
