From: Heng Li <lh...@sa...> - 2011-07-26 01:54:03
|
On Jul 25, 2011, at 9:36 PM, Fred Long wrote: > Java is not C. C is compiled, Java is interpreted (mostly). C does not use 2-byte Unicode strings, Java does. C does not do garbage collection in the background, Java does. Etc., etc. I was not arguing C is better. My intention is to show TabixReader may be optimized further - Java is not frequently twice as slow in my experience. Nonetheless, probably the difference in speed here does not matter in practice. Heng > > FL >> I tried several tabix implementations: >> >> 0) gzip >> 1) bgzip >> 2) tabix, the standard C implementation (BGZF aware readLine()) >> 3) tabix-slow-readline, C, simple readLine() (buffered but unaware of BGZF) >> 4) OldTabixReader, Java, naive readLine() >> 5) MatthiasTabixReader, Java, BGZF aware readLine() >> 6) FredTabixReader, Java, buffered readLine(), BGZF unaware >> >> My summary: >> >> 1) For BGZF files, bgzip seems to be faster than gzip. >> 2) BGZF aware readLine() should be faster than buffered readLine(). >> 3) The buffered C implementation (4.6 sec) is twice as fast as Java (10.8 sec). >> 4) I do not know why Java spent as many as ~4 sec on system CPUs. >> >> Heng >> >> >> hengli@gsa3 Tabix-test$ time gzip -dc input3.gz> /dev/null >> >> real 0m3.931s >> user 0m3.687s >> sys 0m0.243s >> hengli@gsa3 Tabix-test$ time bgzip -dc input3.gz> /dev/null >> >> real 0m2.287s >> user 0m1.909s >> sys 0m0.352s >> hengli@gsa3 Tabix-test$ time tabix input3.gz .> /dev/null >> >> real 0m3.407s >> user 0m3.051s >> sys 0m0.354s >> hengli@gsa3 Tabix-test$ time tabix-slow-readline input3.gz .> /dev/null >> >> real 0m4.687s >> user 0m4.207s >> sys 0m0.448s >> hengli@gsa3 Tabix-test$ time java -classpath .:sam-1.49.jar OldTabixReader input3.gz> /dev/null >> >> real 0m12.483s >> user 0m10.559s >> sys 0m4.765s >> hengli@gsa3 Tabix-test$ time java -classpath .:sam-1.49.jar MatthiasTabixReader input3.gz> /dev/null >> >> real 0m20.624s >> user 0m16.224s >> sys 0m5.827s >> hengli@gsa3 Tabix-test$ time java -classpath .:sam-1.49.jar FredTabixReader input3.gz> /dev/null >> >> real 0m8.741s >> user 0m7.159s >> sys 0m3.687s >> >> On Jul 25, 2011, at 7:46 PM, Fred Long wrote: >> >>> I have put my testing stuff here: http://dh.sdibr.org/fred/tabix-test.zip >>> >>> Just unzip it to a Linux machine and run "./doall". With the first two >>> smaller input files, the times jump around too much to be reliable, but >>> with the larger input file (input3) the timings are more consistent. >>> >>> Yes it is easy to add "\r\n" handling, but that's not the point. Unless >>> it can be shown that it is impossible to get decent performance without >>> adding readLine() to BlockCompressedInputStream, then I would not change >>> BlockCompressedInputStream at all. InputStreams in general cannot >>> implement readLine() correctly because they are not required to support >>> marks. >>> >>> FL >>>> Hi Alec, >>>> >>>> the concerns raised sound significant and I'm looking into it now as I just came back >>>> from holidays - hope that explains the previous silence. >>>> >>>> Fred, I'll try to replicate the concerns, but it would be helpful for me if you could >>>> provide me with your data sets for: >>>> - the 3 skipped output lines >>>> - where the readline method performs slower >>>> >>>> Looking at the code, the line terminator support (\r\n) should only be a small change >>>> - but I'll write some test cases first to make sure, I'm not introducing errors. I'll >>>> come back to you once my brain is back from holidays and produced a working solution + >>>> test cases. >>>> >>>> Best, >>>> Matthias >>>> >>>> On 07/25/2011 02:57 PM, Alec Wysoker wrote: >>>>> Hi Matthias, >>>>> >>>>> As the submitter of this patch, could you weigh in? I must confess I >>>>> didn't spend a lot of time looking at it, other than to address what I >>>>> thought was a bug (terminating a line at 0xff). Fred's concerns sound >>>>> significant, and if I don't hear from you (or someone else in support of >>>>> this change) I'm inclined to remove this method. >>>>> >>>>> Thanks, Alec >>>>> >>>>> On 7/20/11 11:18 AM, Fred Long wrote: >>>>>> I don't think BlockCompressedInputStream is the right place to put a >>>>>> readLine() method. To implement readLine() properly, you need to allow >>>>>> "\r", "\n", or "\r\n" as a line terminator, as Java's >>>>>> BufferedReader.readLine() does. But to allow a lone "\r" to be a line >>>>>> terminator, the stream must support marks, because you need to read the >>>>>> character after '\r' to see if it is a '\n', and if it's not a '\n' you >>>>>> need to put it back. Since the InputStream underneath >>>>>> BlockCompressedInputStream may not support marks, it's impossible to >>>>>> implement readLine() properly in BlockCompressedInputStream. >>>>>> >>>>>> The best and easiest solution is to wrap the BlockCompressedInputStream >>>>>> in TabixReader.java in a BufferedReader. >>>>>> >>>>>> I ran my own tests, and using BufferedReader.readLine() in >>>>>> TabixReader.java is slightly faster than using the >>>>>> BlockCompressedInputStream.readLine(). And depending on the input file, >>>>>> sometimes the original "slow" TabixReader.readLine() is faster than the >>>>>> readLine() in BlockCompressedInputStream or BufferedReader. >>>>>> >>>>>> Lastly, I think there may be a bug in >>>>>> BlockCompressedInputStream.readLine(), because 3 output lines were >>>>>> skipped when I used that version. >>>>>> >>>>>> FL >>>>>> >>>>>> ------------------------------------------------------------------------------ >>>>>> >>>>>> 10 Tips for Better Web Security >>>>>> Learn 10 ways to better secure your business today. Topics covered >>>>>> include: >>>>>> Web security, SSL, hacker attacks& Denial of Service (DoS), private >>>>>> keys, >>>>>> security Microsoft Exchange, secure Instant Messaging, and much more. >>>>>> http://www.accelacomm.com/jaw/sfnl/114/51426210/ >>>>>> _______________________________________________ >>>>>> Samtools-devel mailing list >>>>>> Sam...@li... >>>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel >>> >>> ------------------------------------------------------------------------------ >>> Storage Efficiency Calculator >>> This modeling tool is based on patent-pending intellectual property that >>> has been used successfully in hundreds of IBM storage optimization engage- >>> ments, worldwide. Store less, Store more with what you own, Move data to >>> the right place. Try It Now! http://www.accelacomm.com/jaw/sfnl/114/51427378/ >>> _______________________________________________ >>> Samtools-devel mailing list >>> Sam...@li... >>> https://lists.sourceforge.net/lists/listinfo/samtools-devel >> >> > -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. |