wgs-assembler-users Mailing List for Whole-Genome Shotgun Assembler (Page 15)

Brought to you by: brianwalenz, jasonmiller9704, mcschatz, skoren

wgs-assembler-users — Discussion about Celera Assembler


[wgs-assembler-users] gatekeeper failed to add fragments

From: Paul C. <pca...@gm...> - 2012-10-19 20:01:40

Hi

I'm trying for the first time to assemble Illumina fastq reads. After
running runCA 7.0 with:

runCA -d cabogout -p SRR073769.uniq.bowtie.unmap
SRR073769.uniq.bowtie.unmap.frg

I got this output:

----------------------------------------START Fri Oct 19 15:54:42 2012
/Users/pgc92/Public/usr/local/wgs-7.0/Darwin-amd64/bin/gatekeeper  -o
/Users/pgc92/vdiscovery/analysis/Prensner2011/GSM618509/cabogout/SRR073769.uniq.bowtie.unmap.gkpStore.BUILDING
-T  -F
/Users/pgc92/vdiscovery/analysis/Prensner2011/GSM618509/SRR073769.uniq.bowtie.unmap.frg
>
/Users/pgc92/vdiscovery/analysis/Prensner2011/GSM618509/cabogout/SRR073769.uniq.bowtie.unmap.gkpStore.err
2>&1
----------------------------------------END Fri Oct 19 15:54:46 2012 (4
seconds)
numFrags = 0
================================================================================

runCA failed.

----------------------------------------
Stack trace:

 at /Users/pgc92/Public/usr/local/wgs/Darwin-i386/bin/runCA line 1237
        main::caFailure('gatekeeper failed to add fragments',
'/Users/pgc92/vdiscovery/analysis/Prensner2011/GSM618509/cabog...') called
at /Users/pgc92/Public/usr/local/wgs/Darwin-i386/bin/runCA line 1698

main::preoverlap('/Users/pgc92/vdiscovery/analysis/Prensner2011/GSM618509/SRR07...')
called at /Users/pgc92/Public/usr/local/wgs/Darwin-i386/bin/runCA line 5874
----------------------------------------
Last few lines of the relevant log file
(/Users/pgc92/vdiscovery/analysis/Prensner2011/GSM618509/cabogout/SRR073769.uniq.bowtie.unmap.gkpStore.err):

Starting file
'/Users/pgc92/vdiscovery/analysis/Prensner2011/GSM618509/SRR073769.uniq.bowtie.unmap.frg'.

Processing SINGLE-ENDED SANGER QV encoding reads from:

'/Users/pgc92/vdiscovery/analysis/Prensner2011/GSM618509/SRR073769.uniq.bowtie.unmap.fq'
GKP finished with no alerts or errors.
----------------------------------------
Failure message:

gatekeeper failed to add fragments



What am I doing wrong? My fastq file contains ~1 million 45 bp reads with
Sanger quality values. Here is the head output of the fastq file:

(57989) $ head SRR073769.uniq.bowtie.unmap.fq
@SRR073769.109 PATHBIO-SOLEXA2:2:1:3:1029 length=45
CTGCCCAGGCATAGTTCACCATCTTTCGGGTCCTAACACGTGCGC
+SRR073769.109 PATHBIO-SOLEXA2:2:1:3:1029 length=45
@@?@>@>@7@?9==@B@;@@@29>@6>3950:467>#########
@SRR073769.111 PATHBIO-SOLEXA2:2:1:3:1362 length=45
TGGTTAGTTTCTTCTCCTCCGCTGACTAATATGCTTAAATTCAGA
+SRR073769.111 PATHBIO-SOLEXA2:2:1:3:1362 length=45
CCCCCCC@CCCCCBCCBCCA@ABBCBBBCCBB8AB?6@ACB;?97
@SRR073769.113 PATHBIO-SOLEXA2:2:1:3:1458 length=45
GATCCACGGGGGCCGACCCGGTGACCCGGTTACCCGCCAGGTCCT
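
A quick way to double-check the claim that the reads use Sanger quality values is sketched below. It is an illustration only and not part of runCA or gatekeeper; the function name is made up for this example. It relies on the fact that characters below ';' (ASCII 59) cannot occur in any of the +64 encodings, so their presence implies an offset of 33. It does not explain the gatekeeper failure by itself.

```python
def guess_quality_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) used by FASTQ quality strings.

    Hypothetical helper: characters below ';' (ASCII 59) only occur in
    offset-33 (Sanger) data. Without such characters the answer is a guess.
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    if lowest < 59:
        return 33
    # No low characters seen: the sample is compatible with offset 64,
    # but a small sample of high-quality Sanger reads could look the same.
    return 64


# Quality lines copied from the head output above.
sample = [
    "@@?@>@>@7@?9==@B@;@@@29>@6>3950:467>#########",
    "CCCCCCC@CCCCCBCCBCCA@ABBCBBBCCBB8AB?6@ACB;?97",
]
print(guess_quality_offset(sample))  # 33, because '#' is ASCII 35
```

The '#' characters in the first read are enough to settle the question for this file, which agrees with the fastqQualityValues=sanger setting.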



Here is the output of the FRG file:
(57990) $ cat *frg
{VER
ver:2
}
{LIB
act:A
acc:SRR073769.uniq.bowtie.unmap
ori:U
mea:0.000
std:0.000
src:
.
nft:16
fea:
forceBOGunitigger=1
isNotRandom=0
doNotTrustHomopolymerRuns=0
doTrim_initialNone=0
doTrim_initialMerBased=1
doTrim_initialFlowBased=0
doTrim_initialQualityBased=0
doRemoveDuplicateReads=1
doTrim_finalLargestCovered=1
doTrim_finalEvidenceBased=0
doRemoveSpurReads=1
doRemoveChimericReads=1
doConsensusCorrection=0
fastqQualityValues=sanger
fastqOrientation=innie
fastqReads=/Users/pgc92/vdiscovery/analysis/Prensner2011/GSM618509/SRR073769.uniq.bowtie.unmap.fq
.
}
{VER
ver:1
}



Thank you,

Paul


Paul Cantalupo
University of Pittsburgh

Re: [wgs-assembler-users] merTrim aggressiveness

From: Ole K. T. <o.k...@bi...> - 2012-09-10 14:28:50

On 10 September 2012 12:31, Ole Kristian Tørresen
<o.k...@bi...> wrote:
> On 10 September 2012 09:31, Walenz, Brian <bw...@jc...> wrote:
>> Hi, Ole-
>>
>> The _average_ dropped to 68?  The _minimum_ allowed is 64.
>
> Yes, and this is a cause for some concern on my part. This number
> includes reads with no length (because merTrim does not remove the
> reads, it just records them with 0-length sequence and quality), but I'm
> not sure about the average length of the reads that were not deleted.
>
>>
>> In the merTrim stderr output there should be mention of the thresholds it is
>> using.  There are two thresholds:
>>
>> 'minVerified' tells what kmers can be used for correcting some other kmer.
>> By default, this is 1/4 the guessed coverage in the reads.
>>
>> 'minCorrect' tells what kmers can be corrected.  Any kmer with count at most
>> this can be corrected.  By default, this is 1/3 the guessed coverage in the
>> reads.
>>
>> After all corrections are done, read ends are trimmed if they are not
>> covered by 'trusted' kmers.
>>
>> Possibly the guessed coverage was artificially high, resulting in
>> artificially high thresholds.  You can set these thresholds manually with
>> -correct (for minCorrect) and -evidence (for minVerified).  If the values
>> are less than 1, they are interpreted as a fraction of the guessed coverage,
>> otherwise, an absolute count threshold.
>>
>> Does the guessed coverage make sense?  Does the kmer count histogram look
>> sane?

I reran with logging now, and the guessed coverage looks insane:
Guessed X coverage is 183
Use minCorrect=61 minVerified=45


I think the coverage should be around 16x, so I'll set -correct to 5
and -evidence to 4. Hopefully that should do it.
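
The arithmetic behind those numbers can be checked with a small sketch. It assumes the defaults quoted in this thread (minCorrect is 1/3 and minVerified is 1/4 of the guessed coverage) and assumes the result is truncated to an integer, which is inferred from the logged values rather than taken from the merTrim source. The function name is invented for the example.

```python
def default_mertrim_thresholds(guessed_coverage):
    """Return (minCorrect, minVerified) for a guessed coverage.

    Assumption: thresholds are coverage/3 and coverage/4, truncated
    to integers, as suggested by the log lines in this thread.
    """
    min_correct = guessed_coverage // 3
    min_verified = guessed_coverage // 4
    return min_correct, min_verified


print(default_mertrim_thresholds(183))  # (61, 45), matches the logged values
print(default_mertrim_thresholds(16))   # (5, 4), the values chosen for 16x
```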


Thank you.


Ole

>
> I forgot to redirect stderr to a file, but I am running it again
> now to check the output.
>
>>
>> You can turn on verbose mode, which dumps a picture of the corrections,
>> trusted kmer coverage with -V.  You probably don't want to do this for all
>> reads.  Maybe just a sample of 100 reads or so.  Super verbose mode (-V -V
>> -V) will dump the same picture after each step in the algorithm.
>
> I think the issue, at least with this library, is that the second read
> is really bad. Almost every second read has more than half its length
> in quality '#', which is just trash. So this is probably not a case
> where merTrim does something wrong, but one where the sequencing has gone
> wrong.
>
> Thank you.
>
> Ole
>
>>
>> b
>>
>>
>> On 9/6/12 3:08 PM, "Ole Kristian Tørresen" <o.k...@bi...> wrote:
>>
>>> Hi,
>>> I just ran merTrim on a relatively low coverage library, well, we
>>> don't really know whether it is low coverage or not since we don't
>>> know the genome size accurately yet. The original library was 16 Gbp,
>>> but after merTrim and then loading it into an assembly, only 6 Gbp
>>> survived. This might give a relatively good assembly, but I'm a bit
>>> worried that it removed too much sequence. Can I adjust how much it
>>> throws out?
>>>
>>> My reads are 150 bp, PE. I followed the preprocessing page on the CA
>>> site, and created a database of trusted kmers and used that  to
>>> correct my reads.
>>>
>>> Of 56,188,107 reads of mate 1, 18,030,094 were deleted  and 36,122,767
>>> were clean, and the average length dropped to 68 bp. I expected it to
>>> remove about 10 % of my sequences (from what I've seen on other
>>> merTrim runs), but 2/3 seems a bit much.
>>>
>>> Thank you.
>>>
>>> Ole
>>>
>>> ------------------------------------------------------------------------------
>>> Live Security Virtual Conference
>>> Exclusive live event will cover all the ways today's security and
>>> threat landscape has changed and how IT managers can respond. Discussions
>>> will include endpoint security, mobile security and the latest in malware
>>> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
>>> _______________________________________________
>>> wgs-assembler-users mailing list
>>> wgs...@li...
>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users
>>

Re: [wgs-assembler-users] merTrim aggressiveness

From: Ole K. T. <o.k...@bi...> - 2012-09-10 10:31:44

On 10 September 2012 09:31, Walenz, Brian <bw...@jc...> wrote:
> Hi, Ole-
>
> The _average_ dropped to 68?  The _minimum_ allowed is 64.

Yes, and this is a cause for some concern on my part. This number
includes reads with no length (because merTrim does not remove the
reads, it just records them with 0-length sequence and quality), but I'm
not sure about the average length of the reads that were not deleted.
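
A rough estimate of that number can be derived from the counts in the original message. The sketch below assumes that the reported 68 bp average really does count every deleted read as length zero, and it ignores rounding in the reported average; the function name is made up for this example.

```python
def mean_length_of_survivors(mean_all, total_reads, deleted_reads):
    """Mean length of non-deleted reads, given a mean over all reads
    in which each deleted read contributes length 0 (assumption)."""
    surviving = total_reads - deleted_reads
    if surviving <= 0:
        raise ValueError("no surviving reads")
    # Total bases = mean_all * total_reads; spread them over survivors only.
    return mean_all * total_reads / surviving


estimate = mean_length_of_survivors(68, 56_188_107, 18_030_094)
print(round(estimate, 1))  # 100.1
```

Under those assumptions the surviving mate 1 reads average about 100 bp, comfortably above the 64 bp minimum.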

>
> In the merTrim stderr output there should be mention of the thresholds it is
> using.  There are two thresholds:
>
> 'minVerified' tells what kmers can be used for correcting some other kmer.
> By default, this is 1/4 the guessed coverage in the reads.
>
> 'minCorrect' tells what kmers can be corrected.  Any kmer with count at most
> this can be corrected.  By default, this is 1/3 the guessed coverage in the
> reads.
>
> After all corrections are done, read ends are trimmed if they are not
> covered by 'trusted' kmers.
>
> Possibly the guessed coverage was artificially high, resulting in
> artificially high thresholds.  You can set these thresholds manually with
> -correct (for minCorrect) and -evidence (for minVerified).  If the values
> are less than 1, they are interpreted as a fraction of the guessed coverage,
> otherwise, an absolute count threshold.
>
> Does the guessed coverage make sense?  Does the kmer count histogram look
> sane?

I forgot to redirect stderr to a file, but I am running it again
now to check the output.

>
> You can turn on verbose mode, which dumps a picture of the corrections,
> trusted kmer coverage with -V.  You probably don't want to do this for all
> reads.  Maybe just a sample of 100 reads or so.  Super verbose mode (-V -V
> -V) will dump the same picture after each step in the algorithm.

I think the issue, at least with this library, is that the second read
is really bad. Almost every second read has more than half its length
in quality '#', which is just trash. So this is probably not a case
where merTrim does something wrong, but one where the sequencing has gone
wrong.

Thank you.

Ole

>
> b
>
>
> On 9/6/12 3:08 PM, "Ole Kristian Tørresen" <o.k...@bi...> wrote:
>
>> Hi,
>> I just ran merTrim on a relatively low coverage library, well, we
>> don't really know whether it is low coverage or not since we don't
>> know the genome size accurately yet. The original library was 16 Gbp,
>> but after merTrim and then loading it into an assembly, only 6 Gbp
>> survived. This might give a relatively good assembly, but I'm a bit
>> worried that it removed too much sequence. Can I adjust how much it
>> throws out?
>>
>> My reads are 150 bp, PE. I followed the preprocessing page on the CA
>> site, and created a database of trusted kmers and used that  to
>> correct my reads.
>>
>> Of 56,188,107 reads of mate 1, 18,030,094 were deleted  and 36,122,767
>> were clean, and the average length dropped to 68 bp. I expected it to
>> remove about 10 % of my sequences (from what I've seen on other
>> merTrim runs), but 2/3 seems a bit much.
>>
>> Thank you.
>>
>> Ole
>>
>

Re: [wgs-assembler-users] merTrim aggressiveness

From: Walenz, B. <bw...@jc...> - 2012-09-10 07:32:05

Hi, Ole-

The _average_ dropped to 68?  The _minimum_ allowed is 64.

In the merTrim stderr output there should be mention of the thresholds it is
using.  There are two thresholds:

'minVerified' tells what kmers can be used for correcting some other kmer.
By default, this is 1/4 the guessed coverage in the reads.

'minCorrect' tells what kmers can be corrected.  Any kmer with count at most
this can be corrected.  By default, this is 1/3 the guessed coverage in the
reads.

After all corrections are done, read ends are trimmed if they are not
covered by 'trusted' kmers.

Possibly the guessed coverage was artificially high, resulting in
artificially high thresholds.  You can set these thresholds manually with
-correct (for minCorrect) and -evidence (for minVerified).  If the values
are less than 1, they are interpreted as a fraction of the guessed coverage,
otherwise, an absolute count threshold.
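
The rule for interpreting the two options can be written down as a short sketch. This is an illustration of the description above, not code from merTrim: the function name is invented, and truncating the fractional case to an integer is an assumption.

```python
def resolve_threshold(option_value, guessed_coverage):
    """Turn a -correct / -evidence style value into a kmer count.

    Values below 1 are treated as a fraction of the guessed coverage;
    anything else is taken as an absolute count. Truncation to an
    integer is an assumption made for this sketch.
    """
    if option_value < 1:
        return int(option_value * guessed_coverage)
    return int(option_value)


print(resolve_threshold(0.25, 183))  # 45, a fraction of the guessed coverage
print(resolve_threshold(5, 183))     # 5, an absolute count
```

Passing absolute counts sidesteps a bad coverage guess entirely, since the guessed coverage is then never used for that threshold.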

Does the guessed coverage make sense?  Does the kmer count histogram look
sane?

You can turn on verbose mode, which dumps a picture of the corrections,
trusted kmer coverage with -V.  You probably don't want to do this for all
reads.  Maybe just a sample of 100 reads or so.  Super verbose mode (-V -V
-V) will dump the same picture after each step in the algorithm.

b


On 9/6/12 3:08 PM, "Ole Kristian Tørresen" <o.k...@bi...> wrote:

> Hi,
> I just ran merTrim on a relatively low coverage library, well, we
> don't really know whether it is low coverage or not since we don't
> know the genome size accurately yet. The original library was 16 Gbp,
> but after merTrim and then loading it into an assembly, only 6 Gbp
> survived. This might give a relatively good assembly, but I'm a bit
> worried that it removed too much sequence. Can I adjust how much it
> throws out?
> 
> My reads are 150 bp, PE. I followed the preprocessing page on the CA
> site, and created a database of trusted kmers and used that  to
> correct my reads.
> 
> Of 56,188,107 reads of mate 1, 18,030,094 were deleted  and 36,122,767
> were clean, and the average length dropped to 68 bp. I expected it to
> remove about 10 % of my sequences (from what I've seen on other
> merTrim runs), but 2/3 seems a bit much.
> 
> Thank you.
> 
> Ole
> 

[wgs-assembler-users] merTrim aggressiveness

From: Ole K. T. <o.k...@bi...> - 2012-09-06 19:09:08

Hi,
I just ran merTrim on a relatively low coverage library, well, we
don't really know whether it is low coverage or not since we don't
know the genome size accurately yet. The original library was 16 Gbp,
but after merTrim and then loading it into an assembly, only 6 Gbp
survived. This might give a relatively good assembly, but I'm a bit
worried that it removed too much sequence. Can I adjust how much it
throws out?

My reads are 150 bp, PE. I followed the preprocessing page on the CA
site, and created a database of trusted kmers and used that  to
correct my reads.

Of 56,188,107 reads of mate 1, 18,030,094 were deleted  and 36,122,767
were clean, and the average length dropped to 68 bp. I expected it to
remove about 10 % of my sequences (from what I've seen on other
merTrim runs), but 2/3 seems a bit much.

Thank you.

Ole

[wgs-assembler-users] number of overlapper jobs

From: Quan, X. <x....@im...> - 2012-09-03 08:46:15

Hi

I am running an assembly of a large genome. I forgot to set ovlRefBlockSize to a larger number (currently it is the default, 2000000). overlapInCore has now been running for more than three days and has not finished yet. More than 4000 jobs have finished without error. Below are the output statistics for one of the jobs:
"HASH LOADING STOPPED: strings       5420544 out of      5420544 max.
HASH LOADING STOPPED: length      700000034 out of    700000034 max.
HASH LOADING STOPPED: entries     242755113 out of    264241152 max (load 68.90).
### realloc  Extra_Ref_Space  max_extra_ref_ct = 386385931
String_Ct = 5420544  Extra_String_Ct = 13533  Extra_String_Subcount = 21
Read 12224632 kmers to mark to skip
 Kmer hits without olaps = 6269530
    Kmer hits with olaps = 2108710
  Multiple overlaps/pair = 0
 Total overlaps produced = 2107177
      Contained overlaps = 0
       Dovetail overlaps = 0
"

According to the ovljob file, there are 038952 jobs. Is this the real number of jobs to run? Is it worth killing the process and restarting the overlapper with a larger ovlRefBlockSize?

Thanks!

Dr. Xueping Quan
Research Associate in BioInformatics
Imperial College London
Tel: +44(0)207 594 17 80
email: x....@im...
Personal: http://www3.imperial.ac.uk/people/x.quan
Group: www3.imperial.ac.uk/savolainenlab

Re: [wgs-assembler-users] gatekeeper failed - too many open files

From: Paul C. <pca...@gm...> - 2012-08-27 18:57:43

Hi

runCA has moved past the gatekeeper step and is onto the consensus!
Thank you everyone for your help,

Paul

Paul Cantalupo
University of Pittsburgh


On Mon, Aug 27, 2012 at 2:53 PM, Sebastian Jaenicke
<sja...@ce...> wrote:
> Hi,
>
> On Mon, Aug 27, 2012 at 11:51:45AM -0400, Paul Cantalupo wrote:
> [..]
>> I'm trying to assemble a 4.4 Gb fastq file containing ~8.3 million 454
>> sequences. I'm using runCA version 7.0 on a Mac Server Snow Leopard
>> 10.6. I got an error that says "too many open files" (see attached
>> error.txt file). I've attached the spec file as well.
>
> Try raising the number of open files permitted by your system; I don't
> know the defaults for Snow Leopard, but 2048 should be sufficient.
> I.e., invoke
>
>     ulimit -n 2048
>
> before runCA. You can check defaults with just 'ulimit -n'.
>
> Regards,
>
> - Sebastian
>
> --
> A: Maybe because some people are too annoyed by top-posting.
> Q: Why do I not get an answer to my question(s)?
> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?

Re: [wgs-assembler-users] gatekeeper failed - too many open files

From: Sebastian J. <sja...@Ce...> - 2012-08-27 18:54:07

Hi,

On Mon, Aug 27, 2012 at 11:51:45AM -0400, Paul Cantalupo wrote:
[..]
> I'm trying to assemble a 4.4 Gb fastq file containing ~8.3 million 454
> sequences. I'm using runCA version 7.0 on a Mac Server Snow Leopard
> 10.6. I got an error that says "too many open files" (see attached
> error.txt file). I've attached the spec file as well.

Try raising the number of open files permitted by your system; I don't
know the defaults for Snow Leopard, but 2048 should be sufficient.
I.e., invoke

    ulimit -n 2048

before runCA. You can check defaults with just 'ulimit -n'.
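
For anyone launching the assembler from a Python wrapper instead of an interactive shell, the same check can be sketched with the standard `resource` module (Unix only). The helper name is made up for this example, and the limit only applies to the calling process and the children it starts, so runCA would have to be launched from the same script.

```python
import resource


def ensure_open_file_limit(wanted):
    """Raise the soft open-file limit to `wanted` if it is lower.

    Returns the soft limit in effect afterwards. The soft limit is
    never raised above the hard limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft >= wanted:
        return soft  # already high enough, nothing to do
    if hard == resource.RLIM_INFINITY:
        new_soft = wanted
    else:
        new_soft = min(wanted, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


# Equivalent in spirit to 'ulimit -n 2048' for this process and its children.
print(ensure_open_file_limit(2048))
```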

Regards,

- Sebastian

-- 
A: Maybe because some people are too annoyed by top-posting.
Q: Why do I not get an answer to my question(s)?
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?

[wgs-assembler-users] gatekeeper failed - too many open files

From: Paul C. <pca...@gm...> - 2012-08-27 15:51:56

Attachments: error.txt Turnbaugh2009all.spec

Hello all,

I'm trying to assemble a 4.4 Gb fastq file containing ~8.3 million 454
sequences. I'm using runCA version 7.0 on a Mac Server Snow Leopard
10.6. I got an error that says "too many open files" (see attached
error.txt file). I've attached the spec file as well.

Thank you for any help you can provide,

Paul


Paul Cantalupo
University of Pittsburgh

Re: [wgs-assembler-users] Correlation between long unitigs after unitigging and long contigs after scaffolding?

From: Walenz, B. <bw...@jc...> - 2012-08-23 19:14:21

Larger unitigs almost always lead to better assemblies.  (As one early
developer said: "I've never seen better assemblies from smaller unitigs".)
Unitigs can be split after they are formed (at 1x coverage areas with only
bad mates spanning them), so the stats out of unitigger aren't exactly what
is input to scaffolder.

Mate happiness (5-consensus-insert-size) probably won't show much difference
here.

We've talked about making scaffolder output size statistics periodically,
but haven't implemented anything.

Even for a running assembly, you can output size statistics for contigs
using tigStore:
 tigStore -g *gkpStore -t *tigStore V -C -d sizes

Where V == the last complete version (has ctg, utg and dat files) in the
tigStore.
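
Finding V by hand is easy to get wrong while scaffolder is still writing, so here is a small sketch that picks it from a directory listing. The file naming pattern (`seqDB.v001.ctg` and so on) is an assumption about the tigStore layout and should be checked against an actual store; the function name is invented for this example.

```python
import re
from collections import defaultdict

# Assumed naming scheme for tigStore files; adjust to match a real store.
VERSION_FILE = re.compile(r"^seqDB\.v(\d+)\.(ctg|utg|dat)$")


def last_complete_version(filenames):
    """Return the highest version that has ctg, utg and dat files,
    or None if no version is complete."""
    seen = defaultdict(set)
    for name in filenames:
        match = VERSION_FILE.match(name)
        if match:
            seen[int(match.group(1))].add(match.group(2))
    complete = [v for v, kinds in seen.items() if kinds == {"ctg", "utg", "dat"}]
    return max(complete) if complete else None


# Hypothetical listing: version 3 is still being written.
listing = [
    "seqDB.v001.ctg", "seqDB.v001.utg", "seqDB.v001.dat",
    "seqDB.v002.ctg", "seqDB.v002.utg", "seqDB.v002.dat",
    "seqDB.v003.dat",
]
print(last_complete_version(listing))  # 2
```

In practice the listing would come from `os.listdir()` on the tigStore directory.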

A bit heavyweight, but you can (in theory) run terminator using just a
checkpoint and a tigStore, even when scaffolder is running.  Some of the
labeling will be wrong (mate pairs won't be labeled as happy, etc;
contigs/unitigs probably won't be labeled either) but you can get sequence
files.

b

On 8/23/12 5:11 AM, "Ole Kristian Tørresen" <o.k...@bi...> wrote:

> Hi,
> I have several assemblies running, based on different input and
> configurations, and want to have an idea of how well they are doing.
> In the 4-unitigger folder, there is a log2 length histogram. Can I use
> that to get an idea of how well my assembly is going? For example,
> this is from one assembly (bogart):
> checkUnitigMembership()-- 13 (     8192-    16384) 2953
> checkUnitigMembership()-- 14 (    16384-    32768) 168
> checkUnitigMembership()-- 15 (    32768-    65536) 4
> checkUnitigMembership()-- 16 (    65536-   131072) 1
> 
> and this is from another (bog):
> checkUnitigMembership()-- 13 (     8192-    16384) 2302
> checkUnitigMembership()-- 14 (    16384-    32768) 74
> checkUnitigMembership()-- 15 (    32768-    65536) 1
> 
> and a third (bog):
> checkUnitigMembership()-- 13 (     8192-    16384) 2718
> checkUnitigMembership()-- 14 (    16384-    32768) 48
> 
> 
> Since there are more and longer unitigs in the first assembly, will
> that probably turn out to have longer contigs in the end, or is there
> no correlation between this? Is there other places where I can get a
> feel of my assembly? Parsing the scaffold log files in any particular
> way?
> 
> Thank you.
> 
> Ole
> 

[wgs-assembler-users] Correlation between long unitigs after unitigging and long contigs after scaffolding?

From: Ole K. T. <o.k...@bi...> - 2012-08-23 09:11:42

Hi,
I have several assemblies running, based on different input and
configurations, and want to have an idea of how well they are doing.
In the 4-unitigger folder, there is a log2 length histogram. Can I use
that to get an idea of how well my assembly is going? For example,
this is from one assembly (bogart):
checkUnitigMembership()-- 13 (     8192-    16384) 2953
checkUnitigMembership()-- 14 (    16384-    32768) 168
checkUnitigMembership()-- 15 (    32768-    65536) 4
checkUnitigMembership()-- 16 (    65536-   131072) 1

and this is from another (bog):
checkUnitigMembership()-- 13 (     8192-    16384) 2302
checkUnitigMembership()-- 14 (    16384-    32768) 74
checkUnitigMembership()-- 15 (    32768-    65536) 1

and a third (bog):
checkUnitigMembership()-- 13 (     8192-    16384) 2718
checkUnitigMembership()-- 14 (    16384-    32768) 48


Since there are more and longer unitigs in the first assembly, will
that probably turn out to have longer contigs in the end, or is there
no correlation between the two? Are there other places where I can get a
feel for my assembly, such as parsing the scaffold log files in a particular
way?
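
Comparing these histograms across runs by eye gets tedious, so a small parsing sketch follows. It only assumes the log line layout shown above; the function names are made up for this example.

```python
import re

# Matches lines such as:
# checkUnitigMembership()-- 14 (    16384-    32768) 168
HIST_LINE = re.compile(
    r"checkUnitigMembership\(\)--\s+(\d+)\s+\(\s*(\d+)-\s*(\d+)\)\s+(\d+)"
)


def parse_histogram(log_text):
    """Return {bin_lower_bound: unitig_count} for every histogram line."""
    return {int(m.group(2)): int(m.group(4)) for m in HIST_LINE.finditer(log_text)}


def unitigs_at_least(hist, min_length):
    """Count unitigs in bins whose lower bound is at least min_length."""
    return sum(count for low, count in hist.items() if low >= min_length)


bogart_log = """
checkUnitigMembership()-- 13 (     8192-    16384) 2953
checkUnitigMembership()-- 14 (    16384-    32768) 168
checkUnitigMembership()-- 15 (    32768-    65536) 4
checkUnitigMembership()-- 16 (    65536-   131072) 1
"""
bog_log = """
checkUnitigMembership()-- 13 (     8192-    16384) 2302
checkUnitigMembership()-- 14 (    16384-    32768) 74
checkUnitigMembership()-- 15 (    32768-    65536) 1
"""

print(unitigs_at_least(parse_histogram(bogart_log), 16384))  # 173
print(unitigs_at_least(parse_histogram(bog_log), 16384))     # 75
```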

Thank you.

Ole

Re: [wgs-assembler-users] Input a library with wrong orientation, can I edit gkp?

From: Ole K. T. <o.k...@bi...> - 2012-08-17 10:43:43

Hi Brian,
I have some comments/questions/answers below.

On 13 August 2012 04:34, Walenz, Brian <bw...@jc...> wrote:
> Hi, Ole-
>
> Very sorry to hear.  I've been stung by this a few times too.
>
> There is minimal support for non-innie oriented mates.  The assembler was
> developed with innie-oriented mates assumed, and there are still lots of
> places where we make that assumption.  In particular, finding evidence for
> merging two scaffolds assumes innie oriented mates; computing gap sizes
> based on mate pairs also does.  Both explicitly exclude non-innie oriented
> mates from contributing.
>
> The same issue comes up after classifyMates runs.  We're left with a pile of
> now outtie-oriented PE pairs that we can do nothing with.  We thought about
> updating the stores (reverse complementing the read), but as every overlap
> involving these reads would need to be modified, we decided this was just
> too risky.
>
> So, I'm sad to say, recomputing is the only real option.  If it makes you
> feel any better, I had to run overlaps on a big assembly three times because
> our scratch disk policy is to delete files older than a week, and I kept
> getting pulled away from it.

I reran the entire run because I was a bit too eager and deleted the
store before I was able to look into the issues you mentioned here.
I've done it now though, and have some questions about them.

>
> You might be able to learn something from this run though.  Bogart can't use
> all 3.3tb of those overlaps, so maybe you can reduce the number of overlaps.

The store is 2.5 TB, still too big I guess. I don't think I've seen
a store this big before; the largest was about 1 TB (with approximately the
same input data, about 51x coverage in Illumina reads and 26x in 454
reads). I have files numbered from 0001 to 0250, where the 0001
file is 5 GB and the 0250 file is 20 GB, so I guess that's correct, plus
an ovs and an idx file.

>
> Is the minimum overlap length too low?  You could spot check some overlaps
> to see what the longest overlap is.  You might be able to get away with,
> say, a minimum overlap length of 64 bases.

How do I do this precisely? I tried running some commands like this:
 ~/src/wgs-August2/Linux-amd64/bin/overlapStore -p 375000001
51xillumina_26x454_bac-ends_bog.ovlStore
51xillumina_26x454_bac-ends_bog.gkpStore OBTINITIAL
Output:
375000001  A:    1    0
--------------------------------------------------------------------------------------------------->
288037584  A:    0 1975 (  -1)  B:  124 2047 (  -1)   0.00%    +124> +-2048
283419705  A:    0 1976 (  -1)  B:  124 2047 (  -1)   3.45%    +124> +-2048
Bus error (core dumped)

~/src/wgs-August2/Linux-amd64/bin/overlapStore -p 250000000
51xillumina_26x454_bac-ends_bog.ovlStore
51xillumina_26x454_bac-ends_bog.gkpStore OBTINITIAL
Output:
DUMPING PICTURE for ID 250000000 in store
51xillumina_26x454_bac-ends_bog.ovlStore (gkp
51xillumina_26x454_bac-ends_bog.gkpStore clear OBTINITIAL)
250000000  A:    1    0
--------------------------------------------------------------------------------------------------->
362293433  A:    0 1912 (  -1)  B:   80 2047 (  -1)   0.00%     +80> +-2048
308391868  A:    0 1913 (  -1)  B:   60 2047 (  -1)   0.00%     +60> +-2048
117489298  A:    0 1914 (  -1)  B:   54 2047 (  -1)   0.00%     +54> +-2048
231078346  A:    0 1917 (  -1)  B:   51 2047 (  -1)   0.00%     +51> +-2048
92028512  A:    0 1922 (  -1)  B:   37 2047 (  -1)   0.00%     +37> +-2048
Bus error (core dumped)

That does not look like what I expected (from
http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=OvlStore).
I also ran with this:
 ~/src/wgs-August2/Linux-amd64/bin/overlapStore -d
51xillumina_26x454_bac-ends_bog.ovlStore -b 375000000 -e 375000000
That gave me 5596 overlaps, so I guess it was a bad choice of fragment.

If I choose another fragment (454 shotgun read):
~/src/wgs-August2/Linux-amd64/bin/overlapStore -d
51xillumina_26x454_bac-ends_bog.ovlStore -b 400000001 -e 400000001
I get 153 overlaps, and some of them:
400000001  3314999  I     59  -104  0.00  0.00
400000001  4090114  I     42  -121  0.00  0.00
400000001  5386505  I     59  -104  0.00  0.00
400000001  9054281  I    171     8  2.17  2.17
400000001 25877453  I     11  -170  0.00  0.00
400000001 34584423  I    195    32  2.94  2.94
400000001 35861704  N    221    26  2.38  2.38
400000001 36059151  N    168     5  2.11  2.11
400000001 36573727  N     10  -184  0.00  0.00
400000001 37990829  N    175    12  1.14  1.14
400000001 39350033  I     59  -104  0.00  0.00
400000001 39934425  N    218    39  4.44  4.44
400000001 41436906  I    133   -46  0.00  0.00
400000001 41776173  N    168     5  2.11  2.11
400000001 42439341  I    211    45  1.92  1.92
400000001 42876704  N    189    26  1.35  1.35
400000001 44051983  I    103   -60  0.00  0.00
400000001 47862199  N      2  -184  0.00  0.00
400000001 48017281  N    202    39  3.28  3.28
400000001 48875374  N    168     5  2.11  2.11
400000001 51196018  I    130   -58  0.00  0.00
400000001 54522446  N    205    39  3.45  3.45
400000001 56546126  I     72   -91  0.00  0.00
400000001 56653818  N     89   -74  0.00  0.00
400000001 66204913  I      5  -158  0.00  0.00
400000001 67154193  I    171     8  2.17  2.17
<snip>
400000001 367185391  I    220   225  4.65  4.65
400000001 367771473  I      0    18  0.76  0.76
400000001 371107629  N    -13    55  0.76  0.76
400000001 372107538  N      0   -31  0.00  0.00
400000001 372408998  N    152   347  0.90  0.90
400000001 377636043  N      0    54  0.38  0.38
400000001 377646282  N      0    53  0.38  0.38
400000001 377655226  N     99    72  0.61  0.61
400000001 377896151  N      0   -96  0.00  0.00
400000001 377911031  N      0    57  0.38  0.38
400000001 383651644  N    189   336  1.35  1.35
400000001 383688429  N    192   109  1.41  1.41
400000001 383754638  N    189   338  1.35  1.35
400000001 383766287  N    189   338  1.35  1.35
400000001 383932105  I     99   345  0.61  0.61
400000001 385033058  I     69   124  1.55  1.55
400000001 387893078  I      0    88  0.76  0.76
400000001 387946865  I      0    88  0.76  0.76
400000001 398390290  I    205   489  1.72  1.72
400000001 398406551  I    214   449  4.08  4.08
400000001 398998877  I     99   368  0.61  0.61
400000001 399078177  I    153   368  0.91  0.91
400000001 405995246  N    166   387  1.03  1.03
400000001 409651281  I      0    63  0.38  0.38
400000001 409656569  I     79    63  0.54  0.54
400000001 409813283  N      0    48  0.38  0.38
400000001 410422960  I      0   -95  0.00  0.00

But I can't see the overlap length from that, can I? I guess I could
get the read length for each read and test, though. It seems that the
error rate is mostly below 4 %, so setting it to 4 % might not help
much (just a little bit).
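
Given a read length, the overlap length can be worked out from the two hang columns. The sketch below assumes the usual hang convention (a positive a-hang is the number of bases of read A before read B starts, a negative b-hang is the number of bases of read A after read B ends) and that columns four and five of the dump are the a-hang and b-hang. The read length of 250 is hypothetical, since the dump does not show it.

```python
def overlap_length_on_a(a_length, a_hang, b_hang):
    """Length of the overlapping region measured on read A.

    Assumes: a_hang > 0 means A has that many bases before B starts,
    b_hang < 0 means A has that many bases after B ends.
    """
    unaligned_left = max(a_hang, 0)
    unaligned_right = max(-b_hang, 0)
    return max(a_length - unaligned_left - unaligned_right, 0)


A_LENGTH = 250  # hypothetical clear-range length for read 400000001

print(overlap_length_on_a(A_LENGTH, 59, -104))  # 87
print(overlap_length_on_a(A_LENGTH, 171, 8))    # 79
print(overlap_length_on_a(A_LENGTH, 0, 54))     # 250, A is contained in B
```

The real lengths would have to come from the gkpStore, using the same clear range that the overlaps were computed on.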

For another read (Illumina from a MP library), 66 overlaps:
~/src/wgs-August2/Linux-amd64/bin/overlapStore -d
51xillumina_26x454_bac-ends_bog.ovlStore -b 40000000 -e 40000000
40000000 18888403  N      3     7  0.00  0.00
40000000 19154662  I    -22   -37  3.39  3.39
40000000 42344478  N     15     7  0.00  0.00
40000000 53346801  I    -51   -50  4.35  4.35
40000000 61865542  N     49    53  0.00  0.00
40000000 64381669  N      0    -7  2.25  2.25
40000000 65643742  N     41    26  0.00  0.00
40000000 66612098  I    -53   -54  4.76  4.76
40000000 69287018  N     -3     0  0.00  0.00
40000000 70886837  I    -55   -51  4.44  4.44
40000000 73035047  N    -46   -42  3.70  3.70
40000000 77363047  I     48    52  0.00  0.00
40000000 83770223  N      3     7  0.00  0.00
40000000 92992529  I    -53   -49  4.26  4.26
40000000 93109879  N    -25   -21  0.00  0.00
<snip>
40000000 317454557  N   -121   -32  0.00  0.00
40000000 320220222  N     -4     6  2.11  2.11
40000000 323660284  I    -22    55  0.00  0.00
40000000 330106339  N   -128   -47  0.00  0.00
40000000 330437997  I    -48    37  2.08  2.08
40000000 333633813  I     -4    31  0.00  0.00
40000000 350457623  N     34   139  0.00  0.00
40000000 366189823  I    -10   -26  2.86  2.86
40000000 373715828  I    -61   -48  0.00  0.00
40000000 380074140  I    -51   -15  0.00  0.00
40000000 380125547  I     -9    -9  2.30  2.30
40000000 383202147  N    -18    -4  0.00  0.00
40000000 383480660  N     34    27  0.00  0.00
40000000 392470256  N    -26   -31  3.08  3.08
40000000 395518457  N     -2    -5  0.00  0.00
40000000 409319564  I    -26   -54  4.76  4.76

There seem to be a lot of 0 % error overlaps (but I don't know the length).
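
For spot checks like this, a small script can count how many dumped overlaps are error-free or fall under a candidate error limit. This is only a sketch: it assumes the seven-column layout shown above (two read IDs, orientation, two hangs, two error rates in percent) and reads the last column; the function name is made up.

```python
def summarize_overlaps(dump_lines, max_erate=4.0):
    """Count overlaps, zero-error overlaps, and overlaps at or below max_erate (percent)."""
    total = zero_error = under_limit = 0
    for line in dump_lines:
        fields = line.split()
        if len(fields) != 7:
            continue  # skip blank lines and markers such as "<snip>"
        erate = float(fields[6])  # assumption: last column is the error rate in percent
        total += 1
        if erate == 0.0:
            zero_error += 1
        if erate <= max_erate:
            under_limit += 1
    return total, zero_error, under_limit


sample = [
    "40000000 18888403  N      3     7  0.00  0.00",
    "40000000 19154662  I    -22   -37  3.39  3.39",
    "40000000 42344478  N     15     7  0.00  0.00",
    "40000000 53346801  I    -51   -50  4.35  4.35",
    "<snip>",
]
print(summarize_overlaps(sample))
```

Piping the overlapStore dump for a handful of reads through something like this gives a quick feel for how many overlaps a 4 % limit would actually discard.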

Can I see from bogart's output how many of the overlaps it would have
loaded? Would it have loaded all 2.5 TB? I have 410,962,052 reads in
total, and bogart says 21,884,230,828 overlaps. If I do choose to rerun
overlaps, will it run faster with more stringent options (4 % error,
minimum overlap length of 50 or maybe 64)?
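
As a rough sanity check on those two numbers, the average number of overlaps per read works out to a little over 50. The sketch below only does that division; whether bogart counts each overlap once or once per read is not stated here, so treat the result as an order of magnitude.

```python
reads = 410_962_052        # total reads, as quoted above
overlaps = 21_884_230_828  # overlaps reported by bogart, as quoted above

overlaps_per_read = overlaps / reads
print(f"{overlaps_per_read:.1f} overlaps per read on average")
```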

Thank you for your help, once again.

Ole

>
> Is the error rate too high?  Again spot checking, are there reads with no
> low-error overlaps?  Maybe you can get away with only 4% error.
>
> b
>
>
>
> On 8/10/12 3:18 PM, "Ole Kristian Tørresen" <o.k...@bi...> wrote:
>
>> Hi,
>> I ran classify on an Illumina mate pair library, and managed to use
>> one of the old versions of gatekeeper to dump the reads, so I guess
>> they were dumped as innie reads. I thought the library still was
>> outtie, and input that into an assembly. Now, after finishing
>> overlapper (using grid and grid version of overlapStoreBuild) I have a
>> ovlStore of 3.3 TB, so I'd rather not run that again if I can avoid
>> it.
>>
>> I see from this page:
>> http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Gatekeeper#Library
>> that there are some options in changing orientation of the library,
>> but only "innie" is supported it says. Do you have any suggestions of
>> what I can do? Would it not work changing the library to "outtie"?
>>
>> Thank you.
>>
>> Ole
>>
>

Re: [wgs-assembler-users] pacBioToCA spec parameter optimization

From: Walenz, B. <bw...@jc...> - 2012-08-16 05:50:45

I think most of the time will be spent in overlaps, and this page should
help:

http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=RunCA#OVL_Overlapper

Or just confuse.  I've been told that the page is a bit obtuse for
non-hackers, but it's all we have right now.

In general 4 cores and 8gb memory works great, for 'normal' genomes.  There
are a few more options that you can fiddle with to optimize memory loading,
but the defaults -- once you work through the page above -- should work
reasonably well (I hope).

Happy to help more if you want to get more specific.

b

On 8/14/12 9:26 AM, "Thomas Hackl" <tho...@un...> wrote:

> Hi,
> 
> I would like to use the pacBioToCA correction pipeline on data for
> bigger genomes (Gb size), so computation time is a real concern.
> Judging from the *.spec files, there are quite a few parameters to tweak.
> We have machines, 40/80 cores, 200/500Gb memory, and I was wondering if
> you could give me some advice on how to modify the spec file parameters
> to make optimal use of this potential.
> 
> Thanks
> Thomas
>

Re: [wgs-assembler-users] Time of running for overlap stage

From: Walenz, B. <bw...@jc...> - 2012-08-16 05:37:04

Hi-

Without details, I can’t really say what the expected time is.  It depends on genome properties (repeats, duplications), configuration (kmer size, threshold), read properties (number of reads, quality), and, of course, the amount of hardware you have.  I’d be unhappy with one week of wall clock too.

It is (sadly) easy to badly configure overlapper so that it runs forever.  Can you share your spec?  Hardware?  Are you willing to share details of genome and reads?

b


On 8/14/12 10:01 AM, "Quan, Xueping" <x....@im...> wrote:

Hi

I am running a hybrid assembly using Celera Assembler version 7.0 for my genome (3.5 Gb) and am wondering what the normal expected time to finish is. I started the assembly job one week ago and it is still in the overlap stage.

Thanks!

Xueping

Dr. Xueping Quan
Research Associate in BioInformatics
Imperial College London
Tel: +44(0)207 594 17 80
email:x....@im...
Personal:http://www3.imperial.ac.uk/people/x.quan
Group:  www3.imperial.ac.uk/savolainenlab <https://exchange.imperial.ac.uk/ecp/Customize/www3.imperial.ac.uk/savolainenlab>

[wgs-assembler-users] Time of running for overlap stage

From: Quan, X. <x....@im...> - 2012-08-14 14:02:28

Hi

I am running a hybrid assembly using Celera Assembler version 7.0 for my genome (3.5 Gb) and am wondering what the normal expected time to finish is. I started the assembly job one week ago and it is still in the overlap stage.

Thanks!

Xueping

Dr. Xueping Quan
Research Associate in BioInformatics
Imperial College London
Tel: +44(0)207 594 17 80
email:x....@im...
Personal:http://www3.imperial.ac.uk/people/x.quan
Group:  www3.imperial.ac.uk/savolainenlab<https://exchange.imperial.ac.uk/ecp/Customize/www3.imperial.ac.uk/savolainenlab>

[wgs-assembler-users] pacBioToCA spec parameter optimization

From: Thomas H. <tho...@un...> - 2012-08-14 13:27:09

Hi,

I would like to use the pacBioToCA correction pipeline on data for
bigger genomes (Gb size), so computation time is a real concern.
Judging from the *.spec files, there are quite a few parameters to tweak.
We have machines, 40/80 cores, 200/500Gb memory, and I was wondering if 
you could give me some advice on how to modify the spec file parameters 
to make optimal use of this potential.

Thanks
Thomas


-- 
Thomas Hackl
Julius-Maximilians-Universität
Department of Bioinformatics
97074 Würzburg, Germany
Fon:  +49 931 - 31 86883
Mail: tho...@un...

Re: [wgs-assembler-users] coverage information for scaffolds/contigs

From: Walenz, B. <bw...@jc...> - 2012-08-13 02:48:18

Hi-

I had to run one myself, and then check the code.

There is a definite uninitialized value problem here.  In effect, any gap in
the scaffold was not set to zero coverage.  I've fixed it in CVS.  If you're
not using CVS, patch AS_RUN/fragmentDepth.c by adding

memset(histogram, 0, sizeof(uint32) * histogramMax);

near the start of the computeStuff() function (line 71 - after the block of
variables are defined, before the first 'if' test works).  The patch in CVS
is slightly different, but equivalent.
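
To illustrate why the zeroing matters, here is a small sketch (not the fragmentDepth code itself) that computes mode, mean and median depth for one scaffold. The depth array starts at zero for every position, so gaps count as zero coverage. The function name and the (begin, end) interval format are assumptions made for the example.

```python
from statistics import mean, median, mode


def depth_stats(scaffold_length, fragments):
    """Return (mode, mean, median) of per-base depth for one scaffold.

    fragments is an iterable of (begin, end) intervals, end exclusive.
    """
    depth = [0] * scaffold_length  # every position starts at zero, gaps included
    for begin, end in fragments:
        for pos in range(max(begin, 0), min(end, scaffold_length)):
            depth[pos] += 1
    return mode(depth), mean(depth), median(depth)


# Toy scaffold of 10 bases with an uncovered gap at positions 4-6.
print(depth_stats(10, [(0, 4), (2, 4), (7, 10)]))
```

Without the initialisation, the positions in the gap would carry whatever happened to be in memory, which is consistent with small, gappy scaffolds being hit hardest.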

Big scaffolds (with few gaps) didn't seem to be affected too much.  Small
scaffolds - yours are two contigs joined by a fosmid? - were.

I'm stunned this has survived for so long!  Thanks for noticing.

b


On 8/9/12 10:23 AM, "Christoph Hahn" <chr...@gm...> wrote:

> Hi Brian,
> 
> thanks for your reply! The fragmentDepth utility does basically what I
> was interested in, thanks! I am a little confused with its output,
> though. If I run it in -scaffold mode like:
> fragmentDepth -scaffold < *.posmap.frgscf.sorted
> 
> In the fragmentDepth output I get the following as an example:
> uid     start   end     mode    mean    median
> 7180006953248   0       33010   40294   42.589286       2
> 7180006953249   0       31936   1518    42.845247       1
> 7180006953250   0       26539   62454   41.643727       41
> 
> A few questions there:
> What exactly is the mode (40294,1518,62454) column? According to
> *.posmap.scflen scaffold 7180006953248 is 33204 long - why does it
> calculate the coverage only until position 33010? Also, I am not sure
> how to understand the median value. To reach a value of 1 or 2 as in the
> first two scaffolds in the example about half of the positions need to
> have a coverage of 0-1 or 0-2, right? can that be correct, or am I
> misunderstanding something here?
> 
> Thanks for your help!
> 
> cheers,
> Christoph
> 
> 
> 
> On 08/09/2012 05:03 AM, Walenz, Brian wrote:
>> Hi, Christoph-
>> 
>> [Sorry, wrote this 16 hours ago and forgot to send]
>> 
>> Check out the 'fragmentDepth' utility.  It computes coverage, and outputs in
>> three different ways: coverage of each scaffold, a histogram of coverage (as
>> at the end of *.qc), and a fasta-like output of the actual depth of coverage
>> at each base in the scaffold.
>> 
>> I can't think of a reason it would fail on contigs, but I haven't tried it.
>> 
>> The posmap files should be capturing most of the important stuff from the
>> (agreed: very difficult to use) asm file.  If you can't get what you're
>> looking for out of the posmap files, we need to add to them.
>> 
>> b
>> 
>> 
>> 
>> On 8/8/12 6:20 AM, "Christoph Hahn" <chr...@gm...> wrote:
>> 
>>> Hello CA developers and experts,
>>> 
>>> I have just finished my first big 454+illumina hybrid assembly using CA7
>>> and I am about to assess the result now in comparison to purely illumina
>>> based assemblies.
>>> 
>>> One question there: What is the easiest way to get coverage information
>>> for the scaffolds, contigs, unitigs in the *.scf.fasta, *.ctg.fasta,
>>> etc. files? I figured, that it is possible to calculate it manually
>>> using the information in the *.posmap.frgscf and *.posmap.scflen files
>>> (in case of scaffolds). I guess, the information is also in the *.asm
>>> file, but I am having problems reading/parsing the file.
>>> Is there an easy way you can think about?
>>> The reason, why I want to do this is that I want to bin the
>>> scaffolds/contigs based on coverage, GC-content and length.
>>> 
>>> Any ideas are highly appreciated, thanks!
>>> 
>>> Much obliged,
>>> Christoph
>>> 
>>> University of Oslo, Norway
>>> 
>>> 
>

Re: [wgs-assembler-users] Input a library with wrong orientation, can I edit gkp?

From: Walenz, B. <bw...@jc...> - 2012-08-13 02:34:37

Hi, Ole-

Very sorry to hear.  I've been stung by this a few times too.

There is minimal support for non-innie oriented mates.  The assembler was
developed with innie-oriented mates assumed, and there are still lots of
places where we make that assumption.  In particular, finding evidence for
merging two scaffolds assumes innie oriented mates; computing gap sizes
based on mate pairs also does.  Both explicitly exclude non-innie oriented
mates from contributing.

The same issue comes up after classifyMates runs.  We're left with a pile of
now outtie-oriented PE pairs that we can do nothing with.  We thought about
updating the stores (reverse complementing the read), but as every overlap
involving these reads would need to be modified, we decided this was just
too risky.

So, I'm sad to say, recomputing is the only real option.  If it makes you
feel any better, I had to run overlaps on a big assembly three times because
our scratch disk policy is to delete files older than a week, and I kept
getting pulled away from it.

You might be able to learn something from this run though.  Bogart can't use
all 3.3tb of those overlaps, so maybe you can reduce the number of overlaps.

Is the minimum overlap length too low?  You could spot check some overlaps
to see what the longest overlap is.  You might be able to get away with,
say, a minimum overlap length of 64 bases.

Is the error rate too high?  Again spot checking, are there reads with no
low-error overlaps?  Maybe you can get away with only 4% error.

b

On 8/10/12 3:18 PM, "Ole Kristian Tørresen" <o.k...@bi...> wrote:

> Hi,
> I ran classify on an Illumina mate pair library, and managed to use
> one of the old versions of gatekeeper to dump the reads, so I guess
> they were dumped as innie reads. I thought the library still was
> outtie, and input that into an assembly. Now, after finishing
> overlapper (using grid and grid version of overlapStoreBuild) I have a
> ovlStore of 3.3 TB, so I'd rather not run that again if I can avoid
> it.
> 
> I see from this page:
> http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Gatekeeper#Library
> that there are some options in changing orientation of the library,
> but only "innie" is supported it says. Do you have any suggestions of
> what I can do? Would it not work changing the library to "outtie"?
> 
> Thank you.
> 
> Ole
> 

[wgs-assembler-users] Input a library with wrong orientation, can I edit gkp?

From: Ole K. T. <o.k...@bi...> - 2012-08-10 19:18:21

Hi,
I ran classify on an Illumina mate pair library, and managed to use
one of the old versions of gatekeeper to dump the reads, so I guess
they were dumped as innie reads. I thought the library still was
outtie, and input that into an assembly. Now, after finishing
overlapper (using grid and grid version of overlapStoreBuild) I have a
ovlStore of 3.3 TB, so I'd rather not run that again if I can avoid
it.

I see from this page:
http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Gatekeeper#Library
that there are some options in changing orientation of the library,
but only "innie" is supported it says. Do you have any suggestions of
what I can do? Would it not work changing the library to "outtie"?

Thank you.

Ole

Re: [wgs-assembler-users] coverage information for scaffolds/contigs

From: Christoph H. <chr...@gm...> - 2012-08-09 14:23:37

Hi Brian,

thanks for your reply! The fragmentDepth utility does basically what I 
was interested in, thanks! I am a little confused with its output, 
though. If I run it in -scaffold mode like:
fragmentDepth -scaffold < *.posmap.frgscf.sorted

In the fragmentDepth output I get the following as an example:
uid     start   end     mode    mean    median
7180006953248   0       33010   40294   42.589286       2
7180006953249   0       31936   1518    42.845247       1
7180006953250   0       26539   62454   41.643727       41

A few questions there:
What exactly is the mode (40294,1518,62454) column? According to 
*.posmap.scflen scaffold 7180006953248 is 33204 long - why does it 
calculate the coverage only until position 33010? Also, I am not sure 
how to understand the median value. To reach a value of 1 or 2 as in the 
first two scaffolds in the example about half of the positions need to 
have a coverage of 0-1 or 0-2, right? can that be correct, or am I 
misunderstanding something here?

Thanks for your help!

cheers,
Christoph



On 08/09/2012 05:03 AM, Walenz, Brian wrote:
> Hi, Christoph-
>
> [Sorry, wrote this 16 hours ago and forgot to send]
>
> Check out the 'fragmentDepth' utility.  It computes coverage, and outputs in
> three different ways: coverage of each scaffold, a histogram of coverage (as
> at the end of *.qc), and a fasta-like output of the actual depth of coverage
> at each base in the scaffold.
>
> I can't think of a reason it would fail on contigs, but I haven't tried it.
>
> The posmap files should be capturing most of the important stuff from the
> (agreed: very difficult to use) asm file.  If you can't get what you're
> looking for out of the posmap files, we need to add to them.
>
> b
>
>
>
> On 8/8/12 6:20 AM, "Christoph Hahn" <chr...@gm...> wrote:
>
>> Hello CA developers and experts,
>>
>> I have just finished my first big 454+illumina hybrid assembly using CA7
>> and I am about to assess the result now in comparison to purely illumina
>> based assemblies.
>>
>> One question there: What is the easiest way to get coverage information
>> for the scaffolds, contigs, unitigs in the *.scf.fasta, *.ctg.fasta,
>> etc. files? I figured, that it is possible to calculate it manually
>> using the information in the *.posmap.frgscf and *.posmap.scflen files
>> (in case of scaffolds). I guess, the information is also in the *.asm
>> file, but I am having problems reading/parsing the file.
>> Is there an easy way you can think about?
>> The reason, why I want to do this is that I want to bin the
>> scaffolds/contigs based on coverage, GC-content and length.
>>
>> Any ideas are highly appreciated, thanks!
>>
>> Much obliged,
>> Christoph
>>
>> University of Oslo, Norway
>>
>>

Re: [wgs-assembler-users] coverage information for scaffolds/contigs

From: Walenz, B. <bw...@jc...> - 2012-08-09 03:03:21

Hi, Christoph-

[Sorry, wrote this 16 hours ago and forgot to send]

Check out the 'fragmentDepth' utility.  It computes coverage, and outputs in
three different ways: coverage of each scaffold, a histogram of coverage (as
at the end of *.qc), and a fasta-like output of the actual depth of coverage
at each base in the scaffold.

I can't think of a reason it would fail on contigs, but I haven't tried it.

The posmap files should be capturing most of the important stuff from the
(agreed: very difficult to use) asm file.  If you can't get what you're
looking for out of the posmap files, we need to add to them.

b



On 8/8/12 6:20 AM, "Christoph Hahn" <chr...@gm...> wrote:

> Hello CA developers and experts,
> 
> I have just finished my first big 454+illumina hybrid assembly using CA7
> and I am about to assess the result now in comparison to purely illumina
> based assemblies.
> 
> One question there: What is the easiest way to get coverage information
> for the scaffolds, contigs, unitigs in the *.scf.fasta, *.ctg.fasta,
> etc. files? I figured, that it is possible to calculate it manually
> using the information in the *.posmap.frgscf and *.posmap.scflen files
> (in case of scaffolds). I guess, the information is also in the *.asm
> file, but I am having problems reading/parsing the file.
> Is there an easy way you can think about?
> The reason, why I want to do this is that I want to bin the
> scaffolds/contigs based on coverage, GC-content and length.
> 
> Any ideas are highly appreciated, thanks!
> 
> Much obliged,
> Christoph
> 
> University of Oslo, Norway
> 
> 

[wgs-assembler-users] coverage information for scaffolds/contigs

From: Christoph H. <chr...@gm...> - 2012-08-08 10:20:30

Hello CA developers and experts,

I have just finished my first big 454+illumina hybrid assembly using CA7 
and I am about to assess the result now in comparison to purely illumina 
based assemblies.

One question there: What is the easiest way to get coverage information 
for the scaffolds, contigs, unitigs in the *.scf.fasta, *.ctg.fasta, 
etc. files? I figured, that it is possible to calculate it manually 
using the information in the *.posmap.frgscf and *.posmap.scflen files 
(in case of scaffolds). I guess, the information is also in the *.asm 
file, but I am having problems reading/parsing the file.
Is there an easy way you can think about?
The reason, why I want to do this is that I want to bin the 
scaffolds/contigs based on coverage, GC-content and length.
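
The manual calculation could look roughly like the sketch below: sum the fragment spans placed in each scaffold and divide by the scaffold length. It assumes that *.posmap.frgscf lines hold fragment UID, scaffold UID, begin, end and orientation, and that *.posmap.scflen lines hold scaffold UID and length; check the column layout of your own files first. The function name and the sample rows are made up.

```python
from collections import defaultdict


def mean_coverage(frgscf_lines, scflen_lines):
    """Return {scaffold UID: mean depth}, i.e. placed fragment bases / scaffold length."""
    lengths = {}
    for line in scflen_lines:
        fields = line.split()
        if len(fields) >= 2:
            lengths[fields[0]] = int(fields[1])  # assumed columns: UID, length

    covered = defaultdict(int)
    for line in frgscf_lines:
        fields = line.split()
        if len(fields) >= 4:
            # assumed columns: fragment UID, scaffold UID, begin, end, orientation
            covered[fields[1]] += abs(int(fields[3]) - int(fields[2]))

    return {uid: covered[uid] / length for uid, length in lengths.items() if length > 0}


scflen = ["scf1 100", "scf2 50"]
frgscf = ["f1 scf1 0 60 f", "f2 scf1 40 100 r", "f3 scf2 0 25 f"]
print(mean_coverage(frgscf, scflen))
```

The resulting per-scaffold values can then be combined with length and GC content for binning.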

Any ideas are highly appreciated, thanks!

Much obliged,
Christoph

University of Oslo, Norway

Re: [wgs-assembler-users] Problem with Hybrid Assembly using both Illumina and 454 reads

From: Walenz, B. <bw...@jc...> - 2012-07-31 16:13:56

hi-

It is these two lines:

ovlThreads = 2
ovlConcurrency = 24

The first says that each process will use 2 threads, and the second says to run 24 processes at the same time, for a total of 48 cores used, which is more than your 32-core limit. Dropping ovlConcurrency to 16 (2 x 16 = 32) should work.
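
The arithmetic can be sketched as follows; the helper names are made up, and only the two spec values and the 32-core limit come from this thread.

```python
def cores_requested(threads, concurrency):
    """Cores used when `concurrency` processes each run `threads` threads."""
    return threads * concurrency


def max_concurrency(core_limit, threads):
    """Largest number of simultaneous processes that stays within core_limit."""
    return core_limit // threads


print(cores_requested(2, 24))   # ovlThreads = 2, ovlConcurrency = 24
print(max_concurrency(32, 2))   # 32-core limit with ovlThreads = 2
```

The same check applies to the other thread/concurrency pairs in a spec file.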

b
--
Brian Walenz
Sr. Software Engineer
J. Craig Venter Institute

On 7/31/12 9:18 AM, "Quan, Xueping" <x....@im...> wrote:

I am working on a large plant genome (genome size about 3.5 Gb), with about 131 Gb of Illumina paired-end and 1.8 Gb of 454 mate-pair reads. The assembly is running on a shared-memory HPC system where the most I can use is 800 GB of memory and 32 cores.
However, the assembly job was killed by the system in the overlapInCore stage because "ncpus 33.30 exceeded limit 32". Below is my spec file; could you please have a look to see what is wrong and how to optimize it:

"
#
# Expected rate of sequencing error. Allow pairwise alignments up to this rate.
# Sanger reads can use values less than one. Titanium reads require 3% in unitig.
#
utgErrorRate=0.03
utgErrorLimit=2.5 # Allow mismatches over and above the utgErrorRate. This helps with Illumina reads.
ovlErrorRate=0.06 # Larger than utg to allow for correction.
cnsErrorRate=0.10 # Larger than utg to avoid occasional consensus failures
cgwErrorRate=0.10 # Larger than utg to allow contig merges across high-error ends
#
merSize = 22 # default=22; use lower to combine across heterozygosity, higher to separate near-identical repeat copies
overlapper=ovl # the mer overlapper for 454-like data is insensitive to homopolymer problems but requires more RAM and disk
#
unitigger = bog
utgBubblePopping = 1
# utgGenomeSize = 3.5gb
#
# MERYL calculates K-mer seeds
merylMemory = 512000
merylThreads = 24
#
# OVERLAPPER calculates overlaps
ovlHashBits=25
ovlHashBlockLength=180000000
ovlThreads = 2
ovlConcurrency = 24
ovlRefBlockSize = 32000000
#
# OVERLAP STORE build the database
#ovlStoreMemory = 8GB # Oops! That doesn't work. See correction below.
ovlStoreMemory = 8192 # Mbp
#
# ERROR CORRECTION applied to overlaps
frgCorrThreads = 10
frgCorrConcurrency = 3
ovlCorrBatchSize = 1000000
ovlCorrConcurrency = 25
#
# UNITIGGER configuration
#
# CONSENSUS configuration
cnsConcurrency = 16
"

Thanks very much!

Xueping

Dr. Xueping Quan
Research Associate in BioInformatics
Imperial College London
Tel: +44(0)207 594 17 80
email:x....@im...
Personal:http://www3.imperial.ac.uk/people/x.quan
Group: www3.imperial.ac.uk/savolainenlab <https://exchange.imperial.ac.uk/ecp/Customize/www3.imperial.ac.uk/savolainenlab>

[wgs-assembler-users] Problem with Hybrid Assembly using both Illumina and 454 reads

From: Quan, X. <x....@im...> - 2012-07-31 13:18:46

Thanks very much!

Xueping

Dr. Xueping Quan
Research Associate in BioInformatics
Imperial College London
Tel: +44(0)207 594 17 80
email:x....@im...
Personal:http://www3.imperial.ac.uk/people/x.quan
Group: www3.imperial.ac.uk/savolainenlab<https://exchange.imperial.ac.uk/ecp/Customize/www3.imperial.ac.uk/savolainenlab>

Re: [wgs-assembler-users] huge assembly with thousands of MergeScaffoldsAggressive iterations

From: kuhl <ku...@mo...> - 2012-07-31 08:13:56

Hello Brian,

thanks for the help. Fortunately, in step 7-4 cgw successfully finished
MergeScaffoldsAggressive after iteration 564.

Best wishes, Heiner


On Mon, 30 Jul 2012 12:44:16 -0400, "Walenz, Brian" <bw...@jc...> wrote:
> Hi, Heiner-
> 
> Working backwards through your email:
> 
> We've also noticed the 'large scaffold gets lots of little contigs added'
> problem.  This seems to be dominating our run time.  I'm working on this
> problem at the moment.  Our previous solution was basically what you did:
> let it run until we get impatient, then kill it and restart from the next
> checkpoint label.
> 
> The CVS tip has a slight improvement in cgw, committed around the 20th.  I
> hope to have much more within the next week.
> 
> You can ignore the mates in the library, but not the reads.  To ignore the
> mates, simply delete the mate link from gkpStore.  At the very bottom of the
> 'gatekeeper' page on the wiki is 'allfragsunmated' which will remove the
> mate link from all reads in a single library.  This is a destructive
> operation!  Save a backup of gkpStore/fnm and gkpStore/fpk if you want to
> revert.  (these two files store metadata for long and short fragments
> resp.)
> 
> http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Gatekeeper
> 
> FYI- The 5-consensus-insert-size directory has a plot of the insert size
> histogram for each library.  These are based on unitigs, and so the 20k
> library might not be represented well.  tigStore (the command) can also
> analyze mate pairs for contigs/unitigs in the store with -d matepair.
> 
> b
> 
> 
> On 7/26/12 5:39 PM, "kuhl" <ku...@mo...> wrote:
> 
>> Hi Brian et al.,
>> 
>> I am currently running a huge assembly with CA7 (2.5Gb 30x Illumina + 454,
>> cgw takes 150-300Gb RAM). It is now in step 7-2 and I have just stopped cgw
>> at MergeScaffoldsAggressive iteration 1641 and restarted it at ckp08-2SM. I
>> did this also in 7-0 at iteration 2xxx. Now I am not sure, if I should
>> maybe rerun scaffolding without 20 kb mate pairs, which I think are
>> responsible for this mess. So I have two questions:
>> 
>> How can I convince cgw to ignore a certain library without doing steps 0-5
>> again?
>> 
>> Is there a rule of thumb, when MergeScaffoldsAggressive should be stopped?
>> 
>> In my case it looks like cgw is only very slightly progressing with each
>> iteration and there is one large scaffold that is growing more and more...
>> 
>> ExamineUsableSEdges()- maxWeightEdge from 0 to 32 at idx 3355 out of 60498
>> ExamineUsableSEdges()- maxWeightEdge from 0 to 19 at idx 8774 out of 60498
>> ExamineUsableSEdges()- maxWeightEdge from 0 to 55 at idx 286 out of 60500
>> ExamineUsableSEdges()- maxWeightEdge from 0 to 32 at idx 3355 out of 60500
>> ExamineUsableSEdges()- maxWeightEdge from 0 to 16 at idx 10594 out of 60500
>> ExamineUsableSEdges()- maxWeightEdge from 0 to 7 at idx 20348 out of 60500
>> ExamineUsableSEdges()- maxWeightEdge from 0 to 55 at idx 286 out of 60489
>> ExamineUsableSEdges()- maxWeightEdge from 0 to 32 at idx 3355 out of 60489
>> ExamineUsableSEdges()- maxWeightEdge from 0 to 19 at idx 8773 out of 60489
>> ExamineUsableSEdges()- maxWeightEdge from 0 to 9 at idx 16854 out of 60489
>> ExamineUsableSEdges()- maxWeightEdge from 0 to 55 at idx 286 out of 60486
>> ExamineUsableSEdges()- maxWeightEdge from 0 to 32 at idx 3355 out of 60486
>> ExamineUsableSEdges()- maxWeightEdge from 0 to 16 at idx 10593 out of 60486
>> ExamineUsableSEdges()- maxWeightEdge from 0 to 7 at idx 20428 out of 60486
>> 
>> Regards, Heiner

-- 
---------------------------------------------------------------
Dr. Heiner Kuhl
MPI Molecular Genetics            Tel:   + 49 + 30 / 8413 1551
Next Generation Sequencing        
Ihnestrasse 73                    email: ku...@mo...
D-14195 Berlin                    http://www.molgen.mpg.de
---------------------------------------------------------------

9 messages have been excluded from this view by a project administrator.
