wgs-assembler-users Mailing List for Whole-Genome Shotgun Assembler (Page 8)


Unfortunately, I'm on vacation at the moment, and finding little time to
spend helping you.

"Too many open files" is a limit imposed by the OS.  Can you increase
this?  We've set our large memory machines to allow 100,000 open files.
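
For reference, a minimal Python sketch of inspecting and raising the per-process open-file limit on a Unix-like OS. The helper name raise_open_file_limit and its default of 100,000 are illustrative, not part of Celera Assembler; only the standard-library resource module is used. The change applies to the calling process and anything it launches afterwards, and the soft limit can only go as high as the hard limit, which normally needs an administrator to raise. In a shell, 'ulimit -n' reports the same soft limit.

```python
import resource


def raise_open_file_limit(desired=100_000):
    """Raise the soft open-file limit toward `desired`, capped at the hard limit.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft, hard  # already unlimited, nothing to raise
    target = desired if hard == resource.RLIM_INFINITY else min(desired, hard)
    if target > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        except (ValueError, OSError):
            pass  # the OS refused the new value; keep the current limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print(raise_open_file_limit())
```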

The output file sizes -- and the problem you're suffering from -- are both
caused by the way overlaps are created.  Correction asked for only overlaps
between Illumina and PacBio reads.  All the Illumina reads are 'first' in
the store, and all the PacBio reads are at the end.  Overlap jobs will find
overlaps between 'other' reads and some subset of the store - e.g., the
first overlap job will process the first 10% of the reads, the second will
do the second 10% of the reads, etc.  Since the PacBio reads are last, the
last job found all the overlaps, so only the last file is of significant
size.  This also breaks the partitioning scheme used when sorting overlaps.
It assumes overlaps are distributed randomly, but yours are all piled up at
the end.
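
The skew described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not how overlapStoreBuild is implemented: read IDs are split into equal contiguous blocks, one per job, every PacBio read has the same number of overlaps, and each overlap is written by the job whose block contains the PacBio read. The function name and the numbers are made up for illustration.

```python
def overlaps_per_job(n_illumina, n_pacbio, n_jobs, overlaps_per_pacbio):
    """Count overlaps written by each job in a toy block-partitioned store.

    Reads 0..n_illumina-1 are Illumina; the remaining IDs are PacBio.
    """
    n_reads = n_illumina + n_pacbio
    block = -(-n_reads // n_jobs)  # ceiling division: reads per job
    counts = [0] * n_jobs
    for read_id in range(n_illumina, n_reads):  # only PacBio reads have overlaps
        counts[read_id // block] += overlaps_per_pacbio
    return counts


counts = overlaps_per_job(n_illumina=900, n_pacbio=100, n_jobs=10,
                          overlaps_per_pacbio=50)
print(counts)  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 5000]
```

In this toy case every overlap lands in the last job's output. The real files are not empty for the earlier jobs, but the same imbalance applies.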

I don't see an easy fix here, but I think I can come up with a one-off hack
to get your store built.  Are you comfortable working with C code and
compiling?  Send the output of 'gatekeeper -dumpinfo *gkpStore' so I can
see the number of reads per library.

On Tue, Jun 17, 2014 at 6:45 PM, Santiago Revale <san...@gm...>
wrote:

> Hi Brian,
>
> When using 1024, it said the OS wasn't able to handle it, and it
> recommended using 1008.
> When using 1008, CA ended up failing with "Failed to open output file...
> Too many open files".
>
> Now I'm trying with fewer parts, but I don't think this would solve the
> problem.
>
> Do you have any more ideas?
>
> Thanks again in advance.
>
> Regards,
> Santiago
>
>
> On Sun, Jun 15, 2014 at 10:10 PM, Santiago Revale <
> san...@gm...> wrote:
>
>> Hi Brian,
>>
>> Thanks for your reply. Regarding your suggestions:
>>
>> 1) the PBcR process generates OVB files without zipping them; just to be
>> sure, I've tried to unzip some of them in case the extension was
>> missing;
>>
>> 2) I've re-launched the process with the suggested parameters, but using
>> 512 instead of 1024; the result was exactly the same: same error in the
>> same step. Also, again 511 out of 512 files had a size of 2.3 GB while the
>> last file was 1.2 TB. Do you know why this happens?
>>
>> I'm trying one last time using 1024 instead.
>>
>> Thanks again for your reply. I'm open to some more suggestions.
>>
>> Regards,
>> Santiago
>>
>>
>> On Fri, Jun 13, 2014 at 4:25 PM, Brian Walenz <th...@gm...> wrote:
>>
>>> Hi-
>>>
>>> This is a flaw in gzip, where it doesn't report the uncompressed size
>>> correctly for files larger than 2 GB.  I'm not intimately familiar with
>>> this pipeline, so I don't know exactly how to implement the fixes below.
>>>
>>> Fix with either:
>>>
>>> 1) gzip -d the *gz files before building the overlap store.  The 'find'
>>> command in the log indicates the pipeline will pick up the uncompressed
>>> files.  You might need to remove the 'asm.ovlStore.list' file before
>>> restarting (this has the list of inputs to overlapStoreBuild).
>>>
>>> 2) Set ovlStoreMemory to (exactly) "0 -f 1024".  This will tell it to
>>> use 0MB memory, and instead use 1024 files regardless of the size.  512
>>> files will also work, and is a little safer (not near some Linux 'number of
>>> open files' limits).
>>>
>>> 3) Build the overlap store by hand (with either the uncompressed input,
>>> or the -f instead of -M option), outside the script, and then restart the
>>> script.  The script will notice there is an overlap store already present,
>>> and skip the build.  The command is in the log file -- make sure the final
>>> store is called 'asm.ovlStore', and not 'asm.ovlStore.BUILDING'.
>>>
>>> Option 1 should work, but option 2 is the easiest to try.  I wouldn't
>>> try option 3 until Sergey speaks up.
>>>
>>> b
>>>
>>>
>>>
>>>
>>> On Fri, Jun 13, 2014 at 12:33 PM, Santiago Revale <
>>> san...@gm...> wrote:
>>>
>>>> Dear CA community,
>>>>
>>>> I'm running the correction of some PacBio reads with high-identity
>>>> Illumina reads, on a high-memory server, for a 750 Mbp genome. I
>>>> took into account the known issues listed on the website when starting
>>>> the correction.
>>>>
>>>> When executing the pipeline, I reached the overlapStoreBuild step
>>>> with 48 ovb files of 26 GB each (totaling 1.2 TB). The ovls files had
>>>> already been deleted by the script. The error happened while executing
>>>> overlapStoreBuild:
>>>>
>>>> ...
>>>> bucketizing DONE!
>>>> overlaps skipped:
>>>>                0 OBT - low quality
>>>>                0 DUP - non-duplicate overlap
>>>>                0 DUP - different library
>>>>                0 DUP - dedup not requested
>>>> terminate called after throwing an instance of 'std::bad_alloc'
>>>>   what():  std::bad_alloc
>>>>
>>>> Failed with 'Aborted'
>>>> ...
>>>>
>>>>
>>>> I ran this step twice: the first time with ovlStoreMemory set to 8192
>>>> MB, and the second time with it set to 160000 (160 GB). The "Overlap
>>>> store failure" FAQ mentions as possible causes "Out of disk space" (which
>>>> is not my case) and "Corrupt gzip files / too many fragments". I don't
>>>> have gzip files and I have only 15 fragments. Also, the bucketizing step
>>>> finishes OK.
>>>>
>>>> Also, one odd thing I've noticed (at least odd to me) is that 14 of
>>>> the 15 temp files (tmp.sort.XXX) in the asm.ovlStore.BUILDING folder have
>>>> a size of 79 GB while the last one is 1.2 TB.
>>>>
>>>> Could anybody tell me what could be the cause of this error and how to
>>>> solve it?
>>>>
>>>> I'm attaching the asm.ovlStore.err and the pacBioToCA log files for
>>>> complete descriptions of the error and the executed commands.
>>>>
>>>> Thank you very much in advance.
>>>>
>>>> Regards,
>>>> Santiago
>>>
>>
>
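
On the gzip size issue raised in the thread: per RFC 1952, the last four bytes of a gzip member store the uncompressed length modulo 2^32, so the recorded size wraps around for large inputs. The 2 GB threshold quoted above would be consistent with a reader treating that field as a signed 32-bit value, though that is an assumption here. A minimal Python sketch of reading the field directly (the helper name gzip_trailer_size is illustrative, and it assumes a single-member gzip file):

```python
import os
import struct


def gzip_trailer_size(path):
    """Return the uncompressed size recorded in a gzip file's trailer.

    This is the ISIZE field: the true size modulo 2**32, little-endian.
    """
    with open(path, "rb") as fh:
        fh.seek(-4, os.SEEK_END)  # ISIZE is the final four bytes
        return struct.unpack("<I", fh.read(4))[0]
```

Comparing this value with the size obtained by actually decompressing the file shows whether the recorded size has wrapped.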

wgs-assembler-users — Discussion about Celera Assembler