metassembler / Discussion / General Discussion: MateAn

Jonathan Featherston - 2016-06-14

Hi There

Big fan of your work. Hopefully we'll be purchasing a PacBio Sequel soon to get "near-perfect de novo assemblies".

Right now I am working on a few plant genomes sequenced with Illumina, and I'm struggling to get Metassembler to work. I used it a while back (over a year ago) and got great results with it, but now I seem to be stymied.

It seems that my problem happens at the MateAn step, as a matean file for the secondary assembly isn't produced. I have run MateAn on its own and the same thing happens. I get no errors, but it just produces an empty *.sort.bedpe.gz file.

This seems to happen with all my assemblies now (different species/assemblies). The report file just states "Writing Sort bedpe" but doesn't produce a result. With the metassemble pipeline, the program runs well beyond this stage but throws errors later, such as "No ce-stat data no? No can do" in the error file under the M1 folder. It has taken me a while to figure out where the problem is because the pipeline runs beyond this point (I'm assembling multiple assemblies). The first assembly is with Allpaths and this works fine (there is a .matean file in the folder for that genome), but it produces no results for the secondary assemblies (Abyss and Masurca). All assemblies have been renamed to have fasta headers of the form ">scaffold.1", i.e. the Allpaths, Masurca and Abyss assemblies all have the same style of header, running from scaffold.1 to scaffold.N.

I have run the sample dataset and it produced no errors and ran to completion, so I feel like there is something wrong with my data... But I can't see what. The headers are very simple, although they are the same, i.e. there is a scaffold.1 for each assembly.

Regards
Jonathan

- Michael Schatz - 2016-06-15
  
  Hi Jonathan,
  
  Thanks for your comments. Very weird results. If it works on the sample
  data, then it must be something about your dataset or environment. It might
  be just something mundane like running out of memory or disk space. Did you
  check that MateAn didn't crash? It sometimes helps to run it inside gdb just
  so it is clear whether it exits cleanly or not. Otherwise, maybe try
  downsampling the assemblies and/or reads to a smaller test set? Sorry I
  can't be more helpful, but if you get totally stuck maybe you could send me
  some of the data to try locally.
  
  Good luck
  
  Mike
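
A minimal sketch of the two checks suggested above, for anyone chasing a similar silent failure: did the command exit cleanly, and does a gzip output actually hold any data? The function names `run_and_report` and `check_gz` are made up for illustration and are not part of Metassembler. The only convention relied on is the usual shell one that an exit status above 128 means the process was terminated by a signal (a segmentation fault is signal 11 on Linux, so it shows up as status 139).

```shell
#!/bin/sh
# Hypothetical helpers, not part of Metassembler.

# Run a command and report how it ended.
# Statuses above 128 conventionally mean "terminated by signal (status - 128)".
run_and_report() {
    "$@"
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "clean exit"
    elif [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "failed with status $status"
    fi
    return "$status"
}

# Report whether a gzip file is intact and whether it contains any data.
check_gz() {
    if ! gzip -t "$1" 2>/dev/null; then
        echo "corrupt or truncated"
    elif [ "$(gzip -cd "$1" | wc -c)" -eq 0 ]; then
        echo "valid but empty"
    else
        echo "has data"
    fi
}
```

Wrapping the usual mateAn command line in `run_and_report` in the foreground, and then calling `check_gz` on the resulting *.sort.bedpe.gz file, separates "crashed" from "ran but wrote nothing", which is hard to tell apart from a nohup log alone.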
  
- Alejandro Hernandez Wences - 2016-06-15
  
  Hi,
  
  Thanks for your interest.
  
  Could you check the reports from the alignment steps, just to make sure
  that the problem is not in the alignment step itself rather than in the
  mateAn step? You can take a look at the *.mtp.err files under the
  corresponding BWTaln directories of each input assembly to get an overview
  of the alignment rates.
  
  Wences
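
One way to get that overview in a single pass is sketched below. It assumes each *.mtp.err log contains a summary line of the form "84.36% overall alignment rate", which is the layout of the summary posted later in this thread; the function name `report_alignment_rates` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "<file>: <rate>" for each alignment log given.
# Assumes a summary line like "84.36% overall alignment rate" in each log;
# logs without such a line are flagged instead of silently skipped.
report_alignment_rates() {
    for log in "$@"; do
        rate=$(awk '/overall alignment rate/ { print $1 }' "$log" | tail -n 1)
        echo "$log: ${rate:-no rate line found}"
    done
}
```

Called as `report_alignment_rates */BWTaln/*.mtp.err` (the exact directory layout is an assumption based on the description above), it makes an assembly with a missing or unusually low rate stand out immediately.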
  

Jonathan Featherston - 2016-06-15

Hi There

Thanks for the really quick replies. I didn't get any errors from mateAn, but I did "nohup" it, so let me run it again with the program in the foreground. The nohup file has no errors though. I have looked at both the ".mtp.bam.err" file and the ".mtp.err" file.

The ".mtp.err" file contains a lot of warnings about reads being too short, but otherwise it appears OK, although the overall mapping rate is poor.

"
93312431 reads; of these:
  93312431 (100.00%) were paired; of these:
    80928977 (86.73%) aligned concordantly 0 times
    10844552 (11.62%) aligned concordantly exactly 1 time
    1538902 (1.65%) aligned concordantly >1 times
    ----
    80928977 pairs aligned concordantly 0 times; of these:
      14251925 (17.61%) aligned discordantly 1 time
    ----
    66677052 pairs aligned 0 times concordantly or discordantly; of these:
      133354104 mates make up the pairs; of these:
        29190956 (21.89%) aligned 0 times
        35338098 (26.50%) aligned exactly 1 time
        68825050 (51.61%) aligned >1 times
84.36% overall alignment rate

"

I wonder if it is a resource problem, although I doubt it. I ran it on a node with 65 GB of memory. The cluster just got an extra 180 TB of storage, so that isn't a problem. Could the memory be an issue? I could run it on a node with 500 GB.

Regards
Jonathan
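
As a worked check of the summary above, the sketch below recomputes two figures from the posted counts: the overall rate (every aligned mate, whether it aligned as part of a pair or on its own, over all mates) and the share of pairs that aligned concordantly. That the overall rate is defined this way is an assumption, though it does reproduce the reported 84.36%. The function name `alignment_fractions` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper. Arguments, in order:
#   total pairs, concordant exactly once, concordant more than once,
#   discordant once, single mates aligned once, single mates aligned more than once
alignment_fractions() {
    awk -v p="$1" -v c1="$2" -v c2="$3" -v d="$4" -v m1="$5" -v m2="$6" 'BEGIN {
        # Each aligned pair contributes two mates; m1 and m2 are already mates.
        aligned_mates = 2 * (c1 + c2 + d) + m1 + m2
        printf "overall: %.2f%%\n", 100 * aligned_mates / (2 * p)
        printf "concordant pairs: %.2f%%\n", 100 * (c1 + c2) / p
    }'
}

# Figures copied from the summary in this post.
alignment_fractions 93312431 10844552 1538902 14251925 35338098 68825050
```

The gap between the two figures is the notable part: most mates find a placement somewhere, but only about one pair in eight aligns concordantly.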


Jonathan Featherston - 2016-06-15

Hi There

So I ran mateAn on our 500 GB fat node and it gave me a segmentation fault (core dumped).

I'm really surprised by this! If needed, I can put the data onto a bigger cluster (one of our national facilities), but it surprises me that this process would crash because of memory. Is this a highly memory-intensive process?

Kind Regards
Jonathan

- Michael Schatz - 2016-06-15
  
  Can you confirm that the same data runs correctly on the big node?
  Sometimes there can be weird cross platform compatibility issues.
  Otherwise, we may need to dig into what is causing the crash. Can you next
  test if the first 1000 or first 10000 reads cause the crash?
  
  Thanks!
  Mike
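
A small sketch of how such a test subset could be cut, assuming the reads are plain FASTQ with the common four-lines-per-record layout (multi-line FASTQ would need a real parser). The function name `first_n_reads` and the file names in the note below are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: keep the first N FASTQ records (stdin -> stdout).
# Assumes exactly four lines per record.
first_n_reads() {
    head -n "$(( $1 * 4 ))"
}
```

For paired data, the same N has to be applied to both files so the mates stay in sync, for example `first_n_reads 1000 < reads_1.fastq > sub_1.fastq` and likewise for `reads_2.fastq`; gzipped input can be piped in with `gzip -cd`. The subset then has to go through the alignment step again before mateAn sees it.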
  
  - Michael Schatz - 2016-06-15
    
    Sorry, I meant: can you confirm that the sample data runs correctly?
    
    Mike
    
- Alejandro Hernandez Wences - 2016-06-15
  
  Hi,
  
  Though it is still a possibility, I really doubt that the problem is a lack
  of memory.
  
  It is possible that all reads are being filtered out because of the mapping
  quality (I noticed that only 11.62% of pairs align concordantly exactly
  once). Could you run:
  
  samtools view -q 20 *.mtp.bam | wc
  
  on your *.mtp.bam files, just to see how many have a mapping quality of at
  least 20 (this is the default minimum mapping quality)?
  
  Also, among these, please verify that the pairs do align to the same
  scaffold. You can use:
  
  samtools view -q 20 *.mtp.bam | less
  
  to get a quick glimpse.
  
  The code should check for these kinds of cases, and I just noticed that it
  doesn't. Thanks for pointing it out (even if this might not be the issue here).
  
  Thanks a lot.
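
Paging through the output with `less` works for a glimpse; the sketch below tallies the same information instead. It relies on two facts from the SAM format: column 7 (RNEXT) is "=" when the mate is placed on the same reference sequence as the read, and header lines start with "@". The function name `count_mate_placement` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read SAM text on stdin and count records whose mate is
# on the same scaffold (RNEXT, column 7, is "=") versus everything else.
count_mate_placement() {
    awk -F '\t' '
        /^@/      { next }              # skip header lines, if any
        $7 == "=" { same++; next }
                  { other++ }
        END       { printf "same_scaffold=%d other=%d\n", same, other }
    '
}

# Usage (needs samtools; one BAM per call, since samtools view takes a
# single input file and treats further arguments as regions):
#
#   for bam in *.mtp.bam; do
#       printf '%s: ' "$bam"
#       samtools view -q 20 "$bam" | count_mate_placement
#   done
```

If only the number of records is wanted, `samtools view -c -q 20 file.bam` prints it directly, which avoids having to read the three columns that `wc` emits.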
  

Jonathan Featherston - 2016-06-21

Hi there.

Sorry about my slow response. I think mails from SourceForge are getting filtered into my junk mail, and I've been away for a little while. I will check the SAM file and try downscaling the number of reads.

Regards
Jonathan


Jonathan Featherston - 2016-06-21

Hi Alejandro

The samtools view of q20 mappings doesn't seem too bad:

76549018 1516943537 27630563317

Regards
Jonathan
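
For reading that output: `wc` prints lines, words and bytes, so the first number is the count of alignment records with mapping quality of at least 20. The sketch below turns it into a share of all mates. It assumes the count comes from the same library as the summary posted earlier (93312431 pairs) and that the BAM holds one record per mate, neither of which the thread confirms. The function name `mapq_share` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: share of mates passing the MAPQ filter.
#   $1 = records counted by `samtools view -q 20 ... | wc` (first column)
#   $2 = total read pairs in the library
mapq_share() {
    awk -v records="$1" -v pairs="$2" 'BEGIN {
        printf "%.2f%%\n", 100 * records / (2 * pairs)
    }'
}

mapq_share 76549018 93312431
```

Under those assumptions, roughly two in five mates pass the default quality filter, so the filter alone would not explain an output with no records at all.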


Jonathan Featherston - 2016-06-22

Maybe the best option would be for me to forward you a SAM file, so you can see whether you can get mateAn to run.

- Alejandro Hernandez Wences - 2016-07-02
  
  Hi, yes, that would be helpful. Can you host it someplace where we could
  download it?
  
