From: James B. <jk...@sa...> - 2012-06-29 10:36:40
Hello Fabien,

I took a look at the source to see how it worked, and have a couple of questions. I see in ReadCodecImpl.java:

    private FastArithmeticCoder qualityScoreCoder;
    ...
    private void compressQuality(final ByteString qualityScores,
                                 final OutputBitStream out,
                                 int readLength) throws IOException {
        qualityScoreCoder.reset();
        for (int i = 0; i < readLength; i++) {
            final byte x = qualityScores.byteAt(i);
            qualityScoreCoder.encode(x, out);
        }
        qualityScoreCoder.flush(out);
    }

Is this just an O(0) (order-0) encoding? It's using an arithmetic coder, I see. I took a quick look and it seems to be a mix of an order-0 model (the Fenwick tree) and the coder together.

Have you tried replacing this with something using an array of, say, 256 interleaved coders? It would need the entropy encoding step split apart from the modelling step, but something like this in theory:

    int last = 0;
    for (int i = 0; i < readLength; i++) {
        final byte x = qualityScores.byteAt(i);
        qualityScoreCoder[last].encode(x, out);
        last = x & 0xFF;  // bytes are signed in Java; keep the index in 0..255
    }

This means you're compressing symbol Q(i) in the context of Q(i-1), which I found dramatically improves compression with very little extra CPU. You could also make the FastArithmeticCoder class itself higher order and put the array handling there instead. Or perhaps there's just an off-the-shelf O(1) coder out there as a drop-in replacement.

Of course this then needs care to spot the orientation of the read, so you don't mix "gradually reducing quality" models with "gradually increasing quality" ones.

Similarly for the sequence data, although given the smaller alphabet you can go to higher-order modelling without using much more memory.

James

--
James Bonfield (jk...@sa...)    | Hora aderat briligi. Nunc et Slythia Tova
                                | Plurima gyrabant gymbolitare vabo;
  A Staden Package developer:   | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.
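P.S. To make the Q(i)-given-Q(i-1) idea concrete without touching the coder itself, here is a rough sketch that only estimates the ideal code length (in bits) an adaptive arithmetic coder would approach with one model versus 256 models selected by the previous symbol. Everything in it is an assumption for illustration: the class QualityContextSketch, its methods, the add-one frequency counts, and the made-up quality string are hypothetical and are not taken from ReadCodecImpl or FastArithmeticCoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical illustration only; not part of ReadCodecImpl or FastArithmeticCoder. */
final class QualityContextSketch {

    private static double log2(double v) {
        return Math.log(v) / Math.log(2.0);
    }

    /**
     * Ideal adaptive code length in bits. With useContext == false a single
     * frequency table is used (order-0); with useContext == true the table is
     * chosen by the previous symbol (order-1), mirroring an array of 256 models.
     */
    private static double idealBits(byte[] q, boolean useContext) {
        int[][] counts = new int[256][256];
        int[] totals = new int[256];
        for (int c = 0; c < 256; c++) {
            Arrays.fill(counts[c], 1);   // add-one start so no symbol has probability 0
            totals[c] = 256;
        }
        double bits = 0.0;
        int last = 0;                    // context for the first symbol (assumed 0)
        for (byte b : q) {
            int x = b & 0xFF;            // bytes are signed; map to 0..255
            bits -= log2((double) counts[last][x] / totals[last]);
            counts[last][x]++;           // adapt the model after coding the symbol
            totals[last]++;
            if (useContext) {
                last = x;
            }
        }
        return bits;
    }

    static double order0Bits(byte[] q) {
        return idealBits(q, false);
    }

    static double order1Bits(byte[] q) {
        return idealBits(q, true);
    }

    public static void main(String[] args) {
        // Made-up "gradually reducing quality" string, repeated to give the models time to adapt.
        StringBuilder sb = new StringBuilder();
        for (int r = 0; r < 200; r++) {
            sb.append("IIIIIHHHHGGGFFEDCB#");
        }
        byte[] q = sb.toString().getBytes(StandardCharsets.US_ASCII);
        System.out.printf("order-0: %.0f bits, order-1: %.0f bits%n",
                order0Bits(q), order1Bits(q));
    }
}
```

Two caveats on the sketch: with 256 contexts each model has to learn its own statistics, so on very short inputs the order-1 estimate can come out worse than order-0; and it ignores read orientation, so reversed reads would need their qualities reversed (or a separate set of models) before being fed in.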
--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.