Hi There
Big fan of your work. Hopefully we'll be purchasing a PacBio Sequel soon to get "near-perfect de novo assemblies".
Right now I am working on a few plant genomes sequenced with Illumina data and I'm struggling to get Metassembler to work. I used it a while back (over a year ago) and got great results with it, but now I seem to be stymied.
It seems that my problem happens at the MateAn step, as a .matean file for the secondary assembly isn't produced. I have run MateAn on its own and the same thing happens: I get no errors, but it just produces an empty *.sort.bedpe.gz file.
This now seems to happen with all my assemblies (different species/assemblies). The report file just states "Writing Sort bedpe" but doesn't produce a result. With the metassemble pipeline the program runs well beyond this stage but throws errors later, e.g. the error file under the M1 folder says "No ce-stat data no? No can do". It has taken me a while to figure out where the problem is because the pipeline runs beyond this point (I'm merging multiple assemblies). The first assembly is with Allpaths and this works fine (there is a .matean file in the folder for that genome), but it produces no results for the secondary assemblies (ABySS and MaSuRCA). All assemblies have been renamed so the FASTA headers read ">scaffold.1", i.e. the Allpaths, MaSuRCA and ABySS assemblies all have the same style of header, running from scaffold.1 to scaffold.N.
I have run the sample dataset and it produced no errors and ran to completion, so I feel like there is something wrong with my data... but I can't see how. The headers are very simple, although they are the same across assemblies, i.e. there is a scaffold.1 in each assembly.
Regards
Jonathan
Hi Jonathan,
Thanks for your comments. Very weird results -- if it works on the sample data, then it must be something about your dataset or environment. It might be just something mundane like running out of memory or disk? Did you check that MateAn didn't crash? It sometimes helps to run inside of gdb just so it is clear whether it exits cleanly or not. Otherwise, maybe try downsampling the assemblies and/or reads to a smaller test set? Sorry I can't be more helpful, but if you get totally stuck maybe you could send me some of the data to try locally.
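For example, running it under gdb will make it obvious whether it exits cleanly or dies with a signal (the arguments below are just a placeholder for whatever mateAn invocation you normally use):

gdb --args mateAn <your usual mateAn arguments>
(gdb) run
(gdb) backtrace   # only needed if it stops on a crash; shows where it died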
Good luck
Mike
Hi,
Thanks for your interest.
Could you check the reports from the alignment steps, just to make sure that the problem is not in the alignment step itself rather than in the mateAn step; you can take a look at the *.mtp.err files under the corresponding BWTaln directories of each input assembly to get an overview of the alignment rates.
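For a quick overview across all three assemblies, something like the following should work, assuming the usual layout where each input assembly has its own BWTaln directory (adjust the glob to your actual paths):

grep -H "overall alignment rate" */BWTaln/*.mtp.err

Comparing the Allpaths numbers against the ABySS and MaSuRCA numbers should show whether the secondary assemblies are the ones losing reads.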
Wences
Hi There
Thanks for the really quick replies. I didn't get any errors from mateAn, but I did "nohup" it, so let me run it again with the program in the foreground. The nohup file has no errors, though. I have looked at both the ".mtp.bam.err" file and the ".mtp.err" file.
The "mtp.err" file contains a lot of warnings about reads being too short, but otherwise, although the overall mapping rate is poor, it appears OK:
"
93312431 reads; of these:
  93312431 (100.00%) were paired; of these:
    80928977 (86.73%) aligned concordantly 0 times
    10844552 (11.62%) aligned concordantly exactly 1 time
    1538902 (1.65%) aligned concordantly >1 times
    ----
    80928977 pairs aligned concordantly 0 times; of these:
      14251925 (17.61%) aligned discordantly 1 time
    ----
    66677052 pairs aligned 0 times concordantly or discordantly; of these:
      133354104 mates make up the pairs; of these:
        29190956 (21.89%) aligned 0 times
        35338098 (26.50%) aligned exactly 1 time
        68825050 (51.61%) aligned >1 times
84.36% overall alignment rate
"
I wonder if it is a resource problem, although I doubt it... I ran it on a node with 65 GB of memory. The cluster just got an extra 180 TB of storage, so that isn't a problem. Could the memory be an issue? I could run it on a node with 500 GB.
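I suppose I could check the memory question directly by re-running mateAn under GNU time and looking at the peak memory it reports, something like this (the arguments are just a placeholder for my normal run):

/usr/bin/time -v mateAn <same arguments as before> 2> mateAn.time.log
grep "Maximum resident set size" mateAn.time.log   # peak RSS, reported in kilobytes

If that number gets anywhere near the node's RAM, then memory really is the issue.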
Regards
Jonathan
Hi There
So I ran mateAn on our 500 GB fat node and it gave me a segmentation fault (core dumped).
I'm surprised by this, really! If needed I can put the data up onto a bigger cluster (one of our national facilities), but it kind of surprises me that this process would crash because of memory. Is this a highly memory-intensive process?
Kind Regards
Jonathan
Can you confirm that the same data runs correctly on the big node? Sometimes there can be weird cross-platform compatibility issues. Otherwise, we may need to dig into what is causing the crash. Can you next test whether the first 1000 or first 10000 reads cause the crash?
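Something along these lines would do it; the file names are just examples, so substitute your own library files:

zcat lib_1.fastq.gz | head -n 40000 > test10k_1.fastq   # first 10,000 reads, 4 lines per read
zcat lib_2.fastq.gz | head -n 40000 > test10k_2.fastq

Then re-run the alignment and mateAn steps on just that subset.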
Thanks!
Mike
Sorry, can you confirm that the sample data runs correctly?
Mike
Hi,
Though it is still an option, I really doubt that the problem is a lack of memory.
It is possible that all reads are being filtered out because of the mapping quality (I noticed that only 11.62% map concordantly). Could you run:
samtools view -q 20 *.mtp.bam | wc
on your *.mtp.bam files just to see how many have a mapping quality of at
least 20 (this is the minimum mapping quality by default)?
Also, among these, please verify that the pairs do align to the same scaffold; you can use:
samtools view -q 20 *.mtp.bam | less
to get a quick glimpse.
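For a more direct count, something along these lines should work; column 7 of the SAM output is "=" when the mate maps to the same scaffold (the BAM file name here is just an example):

samtools view -q 20 asm.mtp.bam | awk '$7 == "=" {same++} $7 != "=" && $7 != "*" {diff++} END {print same+0, "mates on the same scaffold;", diff+0, "on a different scaffold"}'

If almost everything ends up in the "different scaffold" bucket, that could explain the empty bedpe file.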
The code should check for this kind of case and I just noticed it doesn't; thanks for pointing it out (even if this might not be the issue here).
Thanks a lot.
Hi there.
Sorry about my slow response. I think mails from SourceForge are getting filtered into my junk mail and I've been away for a little while. I will check the SAM file and try downsampling the number of reads.
Regards
Jonathan
Hi Alejandro
Samtools view of the q20 mappings doesn't seem too bad:
76549018 1516943537 27630563317
Regards
Jonathan
Maybe the best option would be for me to forward you a SAM file so you can see whether you can get mateAn to run.
Hi, yes, that would be helpful. Can you host it someplace where we could download it?