From: Nava W. <ne...@sg...> - 2009-04-20 13:50:36
> Seg-fault:
> ==========
>
> Here is our gcc version info:
>
> -bash-3.00$ gcc -v
> Reading specs from /usr/lib/gcc/x86_64-redhat-linux/3.4.6/specs
> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
> --infodir=/usr/share/info --enable-shared --enable-threads=posix
> --disable-checking --with-system-zlib --enable-__cxa_atexit
> --disable-libunwind-exceptions --enable-java-awt=gtk
> --host=x86_64-redhat-linux
> Thread model: posix
> gcc version 3.4.6 20060404 (Red Hat 3.4.6-9)

Ok, I've not been able to replicate the segfault on 4.2.3, but it also only
finds one read on this cluster, so I'll dig further.

> Image offsets:
> ==============
>
> I think there are a few issues with the runreport files:
>
> The XML tags are somewhat obvious:
>
> Base = A=0, C=1, G=2, T=3
>
> Cycle = cycles of the Solexa run = length of sequence (Problem with the
> cycle numbers: for me they go 0-9, 0-10, 0-10, 0-6 = 39 cycles, but we
> only use 36!!!)

Ah ok, I think this is an issue with load_cycle. The pipeline loads reads in
batches of 10 by default. During each batch it also pulls in the reference
cycle images, and that's putting the numbering out. I'll fix this, but as a
temporary workaround you should be able to change load_cycle to 37 and the
offsets should be correct.

> X,Y = subtiles (Which corner is x=0, y=0 (upper left?)?)

It should be top left, yes.

> For me it seems to be a 30x30 square of sub-images (see your previous
> answer below).
>
> There is no overlap/gap between the tiles. In the case the divisions
> (image_width/subimages) don't end up in an integer value, what happens
> (round/ceil)?

IIRC it's floored.

> On a different note, in the paper you describe the reason for subimage
> registration as: incorrect focusing, warping of the flowcell due to
> temperature variation.
>
> What kind of improvements did you see with this approach?

A similar approach is taken in the GAPipeline. In the GAPipeline they
calculate offsets for 125x125 pixel regions.
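As an aside on the subtile arithmetic above (floored division of image_width by the subimage count, origin at top left), here is a minimal sketch. The dimensions, the edge-pixel clamping, and the function name are illustrative only, not Swift's actual code:

```python
def subtile_index(x, y, image_width, image_height, subimages=30):
    """Return the (column, row) subtile containing pixel (x, y),
    with (0, 0) at the top-left corner of the tile."""
    sub_w = image_width // subimages   # floored, per the answer above
    sub_h = image_height // subimages
    # Clamping is an assumption here: with floored subtile sizes, the last
    # few pixel rows/columns would otherwise index past the 30x30 grid.
    col = min(x // sub_w, subimages - 1)
    row = min(y // sub_h, subimages - 1)
    return col, row
```

For example, on a hypothetical 2048x1600 tile, `subtile_index(1024, 512, 2048, 1600)` gives `(15, 9)`.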
They then place a linear regression through these points to calculate a
scaling factor. I opted for simple X/Y offsets because the offset variation
across the tile didn't look linear. As it appears you can calculate offsets
accurately for 50x50 subregions, it seemed to me that scaling wouldn't buy
you much (we are only talking about a variation of 1 or 2 pixels across the
image). So, I'd hope that Swift is able to align clusters more accurately
than the GAPipeline.

> Why did you choose 30x30 sub-images as the standard?

Experimentation. This has suited the datasets I've run well, however it may
be dependent on cluster density. From the tile you sent the cluster density
appears to be a little lower than I've seen before; you might get some
benefit from using a smaller number of subimages.

> I believe this information (and derived statistics of the variations)
> could be very well used for QC purposes. Are you doing something like
> that?

I think it would be interesting to develop Swift into a QC tool. Right now
my main focus is on the analysis algorithms; I'm writing the report data as
XML, which hopefully makes it easy for others to parse.

> I guess that is it for now.
>
> Thanks for your help.
>
> Bernd
>
> -----Original Message-----
> From: Nava Whiteford [mailto:ne...@sg...]
> Sent: Monday, April 20, 2009 12:10 AM
> To: Bernd Jagla
> Cc: 'Tom Skelly'
> Subject: Re: segmentation fault in swift program
>
> > Thanks, received the download. I've run it through on my laptop using
> > the Intel C compiler and didn't get a segfault, however Swift only
> > found 1 cluster on the tile. I'll check the tile against gcc and let
> > you know what happens.
> >
> > > > If you look in the fastq files the last 2 values in the
> > > > description field give the X and Y coordinates of the cluster.
> > > > This is the X/Y position on the reference image (usually cycle 0,
> > > > A image).
> > >
> > > Hmm... I guess I am missing the obvious...
> > > Aren't the other images registered to that reference image, and
> > > don't I need to transform those images to have the same
> > > coordinates? Where do I get this information?
> >
> > Ah ok, I understand now. Yes, if you want to match the cluster
> > positions back to one of the other images you'll need to apply a
> > transformation.
> >
> > Swift uses a simple x/y offset. However, the offset is not constant
> > over the image. Each subregion (set by the parameters
> > --correlation_subimages * --correlation_cc_subimage_multiplier, and
> > by default 30) gets a different offset.
> >
> > The offsets are available in two places, firstly in the standard
> > output after:
> >
> > Cycle: 0 base: 2 offsetmap:
> > X MAP:
> >
> > The following matrix gives the X offsets for the first cycle G image
> > (bases are A=0, C=1, G=2, T=3). So if you wanted to find the correct
> > position in the G image you'd need something like:
> >
> > g_image_position_x = g_image_position_x +
> >     offset_matrix[cluster_x_position/(image_width/subimages)]
> >                  [cluster_y_position/(image_height/subimages)]
> >
> > Similarly for the cluster's y coordinate.
> >
> > In addition to being in the standard output, the offsets are also
> > available in the XML runreports under <offsetmaps>; hopefully the
> > layout makes sense.
> >
> > Btw, do you mind if I CC these mailings to the swift mailing list?
> > It might be useful to others.
> >
> > > > > PPS. Any further suggestions on how to compare the two methods?
> > > >
> > > > Have you performed an alignment to determine the error rate? This
> > > > would probably be quite a good idea, just to make sure that real
> > > > sequences are being generated. The preprint describes the methods
> > > > we used.
> > >
> > > Good point! Even though I don't think this is the optimal way to
> > > verify it, it certainly makes sense as a first approximation. And
> > > this information will be useful to identify potentially problematic
> > > clusters.
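The offset lookup quoted above could be sketched like this (in Python rather than Swift's C++; the offset values and image dimensions are invented for illustration, and only the indexing expression comes from the email):

```python
def corrected_x(cluster_x, cluster_y, offset_matrix,
                image_width, image_height, subimages=30):
    # Index into the per-subtile X offset map, following the expression
    # quoted above. Integer division is assumed, since the subtile sizes
    # are floored.
    sub_w = image_width // subimages
    sub_h = image_height // subimages
    return cluster_x + offset_matrix[cluster_x // sub_w][cluster_y // sub_h]

# Toy 2x2 offset map (subimages=2); real values come from the
# "X MAP:" block on stdout or <offsetmaps> in the XML runreport.
toy_map = [[1, 2],
           [2, 1]]
```

With this toy map, `corrected_x(300, 700, toy_map, 1000, 1000, subimages=2)` returns 302: the cluster falls in subtile (0, 1), which carries an X offset of 2. The same lookup, applied to a Y offset map, corrects the y coordinate.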
> > > (And my boss also suggested this over lunch ;) )
> > >
> > > Thanks a lot, and sorry if some of the answers are in your paper,
> > > as I haven't had time to read it yet. We'll do so now ;)
> > >
> > > Thanks,
> > >
> > > Bernd
> > >
> > > -----Original Message-----
> > > From: Nava Whiteford [mailto:ne...@sg...]
> > > Sent: Thursday, April 16, 2009 10:20 PM
> > > To: Bernd Jagla
> > > Cc: 'Tom Skelly'
> > > Subject: Re: segmentation fault in swift program
> > >
> > > > Hi Bernd,
> > > >
> > > > Thanks for trying out Swift; we're keen on working with the
> > > > community to develop Swift, so feedback is always useful.
> > > >
> > > > 30x yield increases:
> > > >
> > > > A 30x increase in yield seems too good to be true. :) What are
> > > > the raw numbers? If the GAPipeline is producing a few thousand
> > > > and Swift is producing 100k+ then this sounds reasonable.
> > > > However, if Swift is producing 1000k+ then there's probably a
> > > > problem somewhere.
> > > >
> > > > Both the GAPipeline and Swift apply what is known as purity
> > > > filtering. Purity filtering is a relatively coarse metric for
> > > > throwing out bad data. They have both been parameterised to
> > > > result in an error rate of around 1% on the total dataset (both
> > > > GAPipeline and Swift). It may be that the GAPipeline is
> > > > experiencing a catastrophic failure on some tiles, which Swift is
> > > > able to recover from.
> > > >
> > > > Quality scores:
> > > >
> > > > Neither the GAPipeline's nor Swift's quality scores are very good
> > > > without calibration. If you look in the Swift QualityCalibration
> > > > directory there is a very simple score-shuffling calibrator I
> > > > suggest you use if aligning data using MAQ.
> > > > I've attached a preprint of the Swift paper which describes our
> > > > quality calibration and may be of general interest.
> > > >
> > > > 4 hour run times:
> > > >
> > > > It may be that Swift is identifying many "optical duplicates" and
> > > > then filtering them out. You can try tweaking "threshold" (make
> > > > it lower) and "threshold_window" (make it higher). Also reduce
> > > > "segment_cycles".
> > > >
> > > > If you have the output for one of these runs I can possibly give
> > > > you some other suggestions.
> > > >
> > > > Segmentation fault bug:
> > > >
> > > > Hmm yes, as Tom said it seems to be a bug in the crosstalk
> > > > correction. I should make it fail more gracefully! :)
> > > >
> > > > If you have a copy of this image set you can send me, it would be
> > > > very useful in debugging. There's a script to grab a set of tile
> > > > images here:
> > > >
> > > > http://linuxjunk.blogspot.com/2008/09/grab-set-of-tile-images-from.html
> > > >
> > > > Comparisons:
> > > >
> > > > Yes, I'm extremely interested in seeing the results of your
> > > > comparisons. Also check out the attached preprint, where we have
> > > > some basic comparisons of Swift against the GAPipeline. If you
> > > > find datasets where Swift performs badly, or have features you'd
> > > > like added, please let me know.
> > > >
> > > > On Thu, Apr 16, 2009 at 06:19:29PM +0200, Bernd Jagla wrote:
> > > > >
> > > > > Thanks for the answer. What can I do to avoid this problem? I
> > > > > guess changing the code to account for this situation would be
> > > > > the best solution. Unfortunately I am not fluent in C++...
> > > > > ;)
> > > > >
> > > > > On a different note:
> > > > > I am currently comparing the output from Swift and Firecrest
> > > > > and find that Swift sometimes detects 30x more unique sequences
> > > > > than Firecrest... This makes me wonder about the quality scores
> > > > > and how to really compare the results. Maybe those sequences
> > > > > have been discarded by Firecrest for a reason??? I would like
> > > > > to see the clusters with my own eyes, hence my previous
> > > > > question about how to locate the clusters given a fastq
> > > > > file... From the documentation I don't really understand what
> > > > > to do in order to compare them.
> > > > >
> > > > > Have you done similar experiments?
> > > > > Do you have a more detailed description of how those scores are
> > > > > calculated?
> > > > > Do you have any suggestions on how to compare the two methods?
> > > > >
> > > > > Also, what are ways to speed up the image analysis? What
> > > > > parameters should I tweak?
> > > > > Sometimes the analysis for one tile takes more than 4 hours,
> > > > > which is too much for our environment...
> > > > >
> > > > > Thanks so much for your kind support.
> > > > >
> > > > > Best,
> > > > >
> > > > > Bernd
> > > > >
> > > > > PS. Please let me know if you are interested in the results of
> > > > > my comparisons...
> > > > >
> > > > > -----Original Message-----
> > > > > From: Tom Skelly [mailto:ts...@sa...]
> > > > > Sent: Thursday, April 16, 2009 4:06 PM
> > > > > To: Bernd Jagla
> > > > > Cc: ne...@sg...
> > > > > Subject: Re: segmentation fault in swift program
> > > > >
> > > > > I can see a lot of "Bin size: nan" in the output. There's a
> > > > > loop in CrossTalkCorrection that counts down current_num_bins,
> > > > > and divides by it to get the bin size. I'm guessing it's being
> > > > > counted down to zero, hence the nan.
> > > > >
> > > > > That's as far as I can take it, however, as I'm not familiar
> > > > > with that area of the code. I'm hoping Nava can take it from
> > > > > there.
> > > > >
> > > > > --TS
> > > > >
> > > > > Bernd Jagla wrote:
> > > > > > Hi Nava and Tom,
> > > > > >
> > > > > > First off, thanks for your swift program!!! It seems to be
> > > > > > working much better than the Illumina image analysis.
> > > > > >
> > > > > > I just discovered a potential problem where you might be
> > > > > > able to help: occasionally I get a segmentation fault (see
> > > > > > attached files). The files were created using the following
> > > > > > command:
> > > > > >
> > > > > > runswifttile /pasteur/solexa2/solexa_depot/090320_HWI-EAS285_0003/ 7 63 L7-63 > 63.out 2> 63.err
> > > > > >
> > > > > > If you need access to the images, please let me know where I
> > > > > > can drop them. I only get two such seg-faults within the
> > > > > > current experiment (8 lanes).
> > > > > >
> > > > > > Please let me know if you know what I can do to solve this
> > > > > > problem.
> > > > > > Thanks a lot for your kind help.
> > > > > >
> > > > > > Bernd
> > > > > >
> > > > > > Bernd Jagla
> > > > > > Bioinformatician
> > > > > > Institute Pasteur
> > > > > > Plate-forme puces a ADN
> > > > > > Genopole / Institut Pasteur
> > > > > > 28 rue du Docteur Roux
> > > > > > 75724 Paris Cedex 15
> > > > > > France
> > > > > >
> > > > > > ber...@pa...
> > > > > > tel: +33 (0) 140 61 35 13
> > > > > --
> > > > > The Wellcome Trust Sanger Institute is operated by Genome
> > > > > Research Limited, a charity registered in England with number
> > > > > 1021457 and a company registered in England with number
> > > > > 2742969, whose registered office is 215 Euston Road, London,
> > > > > NW1 2BE.

--
Nav

Work: 01865 854873
Mob : 07518-358405
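[A footnote on the "Bin size: nan" symptom Tom describes further up the thread: in C++, dividing a float by a bin count that has been decremented to zero silently yields inf or nan rather than failing, which matches the output he saw. A hypothetical guard, sketched in Python; the function and its arguments are made up for illustration and are not taken from CrossTalkCorrection:]

```python
def bin_size(intensity_range, current_num_bins):
    # Fail loudly instead of propagating a nan "bin size" downstream,
    # as happens when the loop counts current_num_bins down to zero.
    if current_num_bins <= 0:
        raise ValueError("current_num_bins counted down to zero")
    return intensity_range / current_num_bins
```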