Hey there, this is the second time I have tried to run the A5-miseq pipeline on my 2 .fastq.gz files (Illumina 300 bp paired-end reads), and I got the same error both times (the second attempt was on a more powerful computer). The problem seems to occur almost at the end, at the step that detects misassemblies. I don't know if it is just because I use long Illumina reads and I don't have the long-read-capable version of A5-miseq (which I just discovered exists), or because my Java is somehow not able to allocate enough memory to execute the task. Or it could be something else I can't figure out... I copied the output from near where the error happens through to the end. Thank you for your help!
[bam_sort_core] merging from 4 files...
[a5] java -Xmx5240m -jar A5qc.jar CD211_A5.s4/CD211_A5.qc.libraw1.sam CD211_A5.crude.scaffolds.fasta CD211_A5.s4/CD211_A5.qc.libraw1.broken.fasta 1 > CD211_A5.s4/CD211_A5.qc.libraw1.qc.out
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.TreeMap.put(TreeMap.java:569)
at java.util.TreeSet.add(TreeSet.java:255)
at org.halophiles.assembly.qc.MatchPoint.addNeighbor(MatchPoint.java:57)
at org.halophiles.assembly.qc.SpatialClusterer.locateNeighbors(SpatialClusterer.java:254)
at org.halophiles.assembly.qc.SpatialClusterer.buildReadPairClusters(SpatialClusterer.java:182)
at org.halophiles.assembly.qc.MisassemblyBreaker.main(MisassemblyBreaker.java:208)
[a5] Error in detecting misassemblies.
labolcf@labolcf:~/Documents/A5Miseq/CD211$
Hi Julian, you are probably using A5-miseq if the pipeline made it to the A5qc step with 300nt reads. What kind of organism are you trying to assemble? If it is a bacterium you can probably just reduce the number of reads for assembly to get a dataset that is small enough to assemble on your machine. If you are hitting memory limits it is likely that you have far more data than necessary or useful for a bacterial genome.
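For readers who want to try the read-reduction route without touching the original files, here is a minimal sketch of random paired-end subsampling. It is not part of A5-miseq; the function names and the `fraction` and `seed` parameters are made up for illustration, and it assumes standard four-line FASTQ records with the forward and reverse files in the same order. Each pair is kept or dropped with a single random draw, so mates stay together and the selection does not depend on position in the file.

```python
import gzip
import itertools
import random


def read_fastq(handle):
    """Yield one FASTQ record (a list of 4 lines) at a time from a text handle."""
    while True:
        record = list(itertools.islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def subsample_pairs(fwd_in, rev_in, fwd_out, rev_out, fraction, seed=1):
    """Keep each read pair with probability `fraction`.

    Writes new gzipped files and leaves the inputs untouched.
    Returns (pairs_kept, pairs_seen).
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    kept = total = 0
    with gzip.open(fwd_in, "rt") as f1, gzip.open(rev_in, "rt") as f2, \
            gzip.open(fwd_out, "wt") as o1, gzip.open(rev_out, "wt") as o2:
        pairs = itertools.zip_longest(read_fastq(f1), read_fastq(f2))
        for rec1, rec2 in pairs:
            if rec1 is None or rec2 is None:
                raise ValueError("forward and reverse files differ in length")
            total += 1
            # One draw per pair: both mates are kept or both are dropped.
            if rng.random() < fraction:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept, total
```

A call such as `subsample_pairs("reads_R1.fastq.gz", "reads_R2.fastq.gz", "sub_R1.fastq.gz", "sub_R2.fastq.gz", 0.5)` (file names hypothetical) would keep roughly half of the pairs; the exact count varies with the seed.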
Thanks for the answer!
Yes, it's a bacterial genome, but I did not want to have to modify the original files or do any kind of subsampling of the sequences. I tried to allow Java to use more memory but got the same result, though I am not sure I did it the proper way. I also tried to see what is being done in the Perl script, but I am not very comfortable with that language... I was wondering whether the script sets its own memory limit, and whether it is of any use to change it through an environment variable (the Java default is set much lower than the memory available on my computer). What would you suggest if I just want to make it work without reducing the number of reads? The forward reads file contains 1,552,081 reads.
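To put that read count in perspective, a rough coverage estimate can be sketched as below. The read count and the 300 bp read length come from this thread; the 5 Mb genome size and the 100x target are assumptions chosen only for illustration, since neither is stated here. Reads are usually shorter after quality trimming, so the result is an upper bound.

```python
def estimate_coverage(read_pairs, read_length, genome_size):
    """Approximate fold coverage for paired-end data (two reads per pair)."""
    return read_pairs * 2 * read_length / genome_size


def fraction_for_target(current_coverage, target_coverage):
    """Fraction of pairs to keep to reach the target, capped at 1.0."""
    return min(1.0, target_coverage / current_coverage)


READ_PAIRS = 1552081     # forward-file read count reported in the thread
READ_LENGTH = 300        # nominal read length, before any trimming
GENOME_SIZE = 5000000    # ASSUMPTION: a typical bacterial genome, not stated in the thread
TARGET_COVERAGE = 100    # ASSUMPTION: illustrative target, not a recommendation

coverage = estimate_coverage(READ_PAIRS, READ_LENGTH, GENOME_SIZE)
fraction = fraction_for_target(coverage, TARGET_COVERAGE)
print("estimated coverage: %.1fx" % coverage)
print("fraction of pairs to keep for %dx: %.2f" % (TARGET_COVERAGE, fraction))
```

Under these assumptions the dataset works out to roughly 186x, which is why subsampling to about half of the pairs would still leave plenty of data for a genome of that size.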
OK, your best bet is to check out the latest source code (it contains important bugfixes) and edit around line 1473 to set the $mem variable to the value you want. Then build a Linux package by running the script ./build_pipeline.sh
The default behavior is to use 2/3 of available system memory for java in the qc step.
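As a sanity check on that default, the `-Xmx5240m` in the log above is what a two-thirds share of 7860 MB would give. The sketch below only illustrates that arithmetic; it is not the pipeline's actual Perl code, and the function name and the `share` parameter are hypothetical.

```python
from fractions import Fraction


def java_heap_flag(total_mb, share=Fraction(2, 3)):
    """Return a JVM -Xmx flag for a given share of total memory (in MB).

    Fraction avoids floating-point rounding; int() truncates to whole MB.
    """
    heap_mb = int(total_mb * share)
    return "-Xmx%dm" % heap_mb


# 7860 MB is back-calculated from the -Xmx5240m seen in the log,
# not a figure reported anywhere in the thread.
print(java_heap_flag(7860))
```

Whatever value ends up in $mem should stay below the physical memory of the machine; asking the JVM for more than is available tends to cause swapping rather than fix the out-of-memory error.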
Sorry to bother you so much, Mr. Darling, your hints are helping me a lot right now!
I just downloaded the most recent update: a5_miseq_macOS_20141120.tar.gz
Does this include the latest source code you just mentioned? Could you give me a hint on how to update my copy of A5-miseq with the new source code when it becomes available? (I am not sure of the procedure, and I also don't understand the ./build_pipeline.sh step or why it is needed.) I know where the lines are, but I am not sure what I should write to make sure I use as much memory as I can. (Can't I just save the .pl file and use it on the command line?) Obviously I need to learn more about working with all these languages and the good habits needed to make them work.
I also read that having a lot of reads can actually make the genome alignment much worse than with fewer reads... I don't completely understand why (I always thought that more and longer reads would make things better), so I am wondering what a good number of reads would be to get the best genome reconstruction with no memory issues in A5-miseq. How do you pick the reads so that the selection is not biased? What do you do when you have more sequence than you need?
Last edit: Anonymous 2014-11-20
Hi Julian, these are big questions about sequencing strategy. You're better off asking in a forum like seqanswers or biostars.
./build_pipeline.sh is only needed if you check out the source code with subversion. Since you downloaded an already packaged build you can proceed directly to editing the script at the aforementioned line.