From: Donal K. F. <dk...@us...> - 2012-03-30 14:20:01
|
 TIP #400: SETTING THE COMPRESSION DICTIONARY
==============================================
 Version:      $Revision: 1.1 $
 Author:       Donal K. Fellows <dkf_at_users.sf.net>
 State:        Draft
 Type:         Project
 Tcl-Version:  8.6
 Vote:         Pending
 Created:      Friday, 30 March 2012
 URL:          http://purl.org/tcl/tip/400.html
 WebEdit:      http://purl.org/tcl/tip/edit/400
 Post-History:

-------------------------------------------------------------------------

 ABSTRACT
==========

Sometimes it is necessary to set the compression dictionary so that a
sequence of bytes may be compressed more efficiently (and decompressed
as well). This TIP exposes that functionality.

 RATIONALE
===========

The SPDY protocol extensions to HTTP require the seeding of the zlib
compression dictionary (which greatly improves the performance of
compression on small amounts of data, such as HTTP headers). In order
to allow a pure Tcl implementation of the SPDY protocol, it is
therefore necessary to provide a mechanism whereby the compression
dictionary (a byte-array up to 262 bytes long, according to the zlib
documentation).

There is to be no mechanism for retrieving the compression dictionary
generated by the compression engine; there is no API for doing that.

 PROPOSED CHANGES: TCL
=======================

The *zlib push* command will gain an extra option:

     *-dictionary* /bytes/

This option will provide a compression dictionary to be used, which
will be supplied to the zlib compression engine at the correct moment
during compression or provided on request of the compression engine on
decompression. The /bytes/ argument will be interpreted as a Tcl
bytearray; it must be non-empty if given.

In addition, the *zlib stream* command will gain some complexity. All
the subcommands will gain the ability to take an extra *-dictionary*
/bytes/ pair of options (same interpretation as above), the *zlib
stream gzip* variety will also gain the ability to take *-header*
/dict/ (where /dict/ is a Tcl dictionary such as is passed to the
*-header* option to *zlib gzip*, not a compression dictionary), and the
*zlib stream gunzip* variety will also gain the ability to take
*-headerVar* /name/ (so that a Tcl dictionary describing the contents
of the gzip header can be reported). The omission of the last two was
an oversight in [TIP #234].

 PROPOSED CHANGE: C
====================

At the C level, one additional function will be provided:

     void * *Tcl_ZlibStreamGetZstreamp*(Tcl_ZlibStream /zshandle/)

This returns the /z_streamp/ associated with the given Tcl_ZlibStream
structure, which can then be used to directly call appropriate zlib
functions not directly exposed through Tcl's interface, notably
including deflateSetDictionary and inflateSetDictionary. Note that if a
function /is/ exposed through a public interface (e.g., deflate and
inflate) then it should not be called via this route or inconsistent
things may happen. The return type of Tcl_ZlibStreamGetZstreamp is
/void*/ so that there is no need for the zlib public types to form part
of Tcl's public API.

 COPYRIGHT
===========

This document has been placed in the public domain.

-------------------------------------------------------------------------

 TIP AutoGenerator - written by Donal K. Fellows
|
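A minimal sketch of how the proposed *-dictionary* option might be used from a script, assuming it is accepted by [zlib stream compress] and [zlib stream decompress] as the TIP describes. The seed and header strings are made-up illustration data (not the SPDY dictionary), and the variable names are hypothetical.

```tcl
# Illustration data only: a tiny seed dictionary and a header block to send.
set seed    [encoding convertto utf-8 "content-type: text/html\r\naccept-encoding: gzip\r\n"]
set headers [encoding convertto utf-8 "content-type: text/html\r\n"]

# Compressing side: create a stream seeded with the dictionary, feed the
# data, and finalize so the whole compressed block can be fetched.
set zc [zlib stream compress -dictionary $seed]
$zc put -finalize $headers
set packed [$zc get]
$zc close

# Decompressing side: must be given exactly the same dictionary bytes.
set zd [zlib stream decompress -dictionary $seed]
$zd put $packed
set unpacked [$zd get]
$zd close

puts [expr {$unpacked eq $headers ? "round trip ok" : "mismatch"}]
```

Both ends need byte-identical dictionaries, which is why the option takes a bytearray rather than a string.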
From: Lars H. <Lar...@re...> - 2012-03-30 14:40:18
|
Donal K. Fellows wrote 2012-03-30 16.19:
>
> TIP #400: SETTING THE COMPRESSION DICTIONARY
[snip]
> to allow a pure Tcl implementation of the SPDY protocol, it is
> therefore necessary to provide a mechanism whereby the compression
> dictionary (a byte-array up to 262 bytes long, according to the zlib
> documentation).

That sentence looks like it was abruptly cut off. And are you sure the
dictionary can be at most 262 bytes long? Zlib compression works with a
window up to 32k in length, so I would expect to be able to seed that much
data. Or does the 262 rather refer to the length of the SPDY seed data?

> There is to be no mechanism for retrieving the compression dictionary
> generated by the compression engine; there is no API for doing that.
>
> PROPOSED CHANGES: TCL
> =======================
>
> The *zlib push* command will gain an extra option:
>
> *-dictionary* /bytes/
>
> This option will provide a compression dictionary to be used, which
> will be supplied to the zlib compression engine at the correct moment
> during compression or provided on request of the compression engine on
> decompression. The /bytes/ argument will be interpreted as a Tcl
> bytearray; it must be non-empty if given.
>
> In addition, the *zlib stream* command will gain some complexity. All
> the subcommands will gain the ability to take an extra *-dictionary*
> /bytes/ pair of options (same interpretation as above),

How does a decompress stream signal that it needs the dictionary to be seeded?

> the *zlib
> stream gzip* variety will also gain the ability to take *-header*
> /dict/ (where /dict/ is a Tcl dictionary such as is passed to the
> *-header* option to *zlib gzip*, not a compression dictionary), and the
> *zlib stream gunzip* variety will also gain the ability to take
> *-headerVar* /name/ (so that a Tcl dictionary describing the contents
> of the gzip header can be reported). The omission of the last two was
> an oversight in [TIP #234].

Since the last two points are unrelated to the -dictionary option, it may be
clearer to put them in a paragraph of their own.

Lars Hellström
|
From: Donal K. F. <don...@ma...> - 2012-03-30 14:55:26
|
On 30/03/2012 15:40, Lars Hellström wrote:
> Donal K. Fellows wrote 2012-03-30 16.19:
>> to allow a pure Tcl implementation of the SPDY protocol, it is
>> therefore necessary to provide a mechanism whereby the compression
>> dictionary (a byte-array up to 262 bytes long, according to the zlib
>> documentation).
>
> That sentence looks like it was abruptly cut off. And are you sure the
> dictionary can be at most 262 bytes long? Zlib compression works with a
> window up to 32k in length, so I would expect to be able to seed that much
> data. Or does the 262 rather refer to the length of the SPDY seed data?

Misread the docs (http://zlib.net/manual.html). The max used is the
window size *minus* 262 bytes. It doesn't change the fact that I
wouldn't validate it. :-) I'm more worried about the minimum size.

And yes, that sentence was cut off in its prime. :-(

Donal.
|
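To put a number on that limit: taking the "window size minus 262 bytes" figure from the message above at face value, and assuming a window of 2**windowBits bytes (15 bits being the 32k window mentioned earlier), the usable dictionary length works out as below. The proc name is hypothetical.

```tcl
# Hypothetical helper: how many dictionary bytes deflate would make use of,
# given the "window size minus 262 bytes" rule quoted in the thread.
proc maxUsableDictionary {windowBits} {
    return [expr {(1 << $windowBits) - 262}]
}

puts [maxUsableDictionary 15]   ;# 32768 - 262 = 32506
```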
From: Jeff R. <dv...@di...> - 2012-03-30 16:13:32
|
Lars Hellström wrote:
>>
>> In addition, the *zlib stream* command will gain some complexity. All
>> the subcommands will gain the ability to take an extra *-dictionary*
>> /bytes/ pair of options (same interpretation as above),
>
> How does a decompress stream signal that it needs the dictionary to be seeded?

The zlib specification (rfc1950) specifies a mechanism to indicate a
seeded dictionary is to be used on a compressed stream, as well as which
dictionary is needed (the FDICT flag and the checksum from the
dictionary, respectively). I think the zlib library will return a
Z_NEED_DICT error if this flag is set on a stream and the appropriate
dictionary has not been set.

Of course, to make things more complicated, the SPDY spec doesn't say
that it actually uses the mechanism defined in zlib to use a
precompressed dictionary, it just says that a dictionary is used. This
may be intentional, to completely disallow streams that aren't using the
dictionary, but it strikes me as short-sighted.

-J
|
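The FDICT mechanism described above can be observed from a script. The sketch below assumes the proposed -dictionary option on [zlib stream compress] is available; it relies on the RFC 1950 header layout (CMF byte, FLG byte with FDICT as bit 5, then a 4-byte big-endian DICTID holding the Adler-32 of the dictionary). The seed and payload are made-up illustration data.

```tcl
# Illustration data only.
set seed    [encoding convertto utf-8 "content-type: text/html\r\n"]
set payload [encoding convertto utf-8 "content-type: text/html\r\n"]

# Produce a zlib-format stream that was seeded with the dictionary.
set zc [zlib stream compress -dictionary $seed]
$zc put -finalize $payload
set packed [$zc get]
$zc close

# RFC 1950 header: CMF (1 byte), FLG (1 byte), then DICTID (4 bytes,
# big-endian) when the FDICT bit (0x20) of FLG is set.
binary scan $packed cucuIu cmf flg dictid
set fdict    [expr {($flg & 0x20) != 0}]
set expected [expr {[zlib adler32 $seed] & 0xffffffff}]

puts "FDICT flag set: $fdict"
puts "DICTID matches Adler-32 of seed: [expr {($dictid & 0xffffffff) == $expected}]"
```

A receiver holding several candidate dictionaries could use the DICTID value as a lookup key, which is what the checksum is there for.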
From: Lars H. <Lar...@re...> - 2012-03-31 16:34:09
|
Jeff Rogers wrote 2012-03-30 18.13:
> Lars Hellström wrote:
>
>>>
>>> In addition, the *zlib stream* command will gain some complexity. All
>>> the subcommands will gain the ability to take an extra *-dictionary*
>>> /bytes/ pair of options (same interpretation as above),
>>
>> How does a decompress stream signal that it needs the dictionary to be seeded?
>
> The zlib specification (rfc1950) specifies a mechanism to indicate a
> seeded dictionary is to be used on a compressed stream, as well as which
> dictionary is needed (the FDICT flag and the checksum from the
> dictionary, respectively). I think the zlib library will return a
> Z_NEED_DICT error if this flag is set on a stream and the appropriate
> dictionary has not been set.

Yes, I know that! I meant: How does the proposed /Tcl interface/ communicate
this fact back to the script level? In other words, by "decompress stream" I
meant the stream command created by a [zlib stream decompress] call.

This is not an issue for SPDY, where the same dictionary should always be
used, but these [zlib] command additions presumably aim to be more general.

Lars Hellström
|
From: Donal K. F. <don...@ma...> - 2012-05-10 12:53:15
|
On 31/03/2012 18:34, Lars Hellström wrote:
> Yes, I know that! I meant: How does the proposed /Tcl interface/ communicate
> this fact back to the script level? In other words, by "decompress stream" I
> meant the stream command created by a [zlib stream decompress] call.

I'm proposing to have it so that, if that's true, you get a Tcl error
with an error code of {TCL ZLIB NEED_DICT}. At that point, you'll be
able to use [$strm checksum] to find what the checksum of the required
dictionary is, and some API (not yet specified, alas) to supply the
compression dictionary.

Remember, the TIP is not yet finished...

Donal.
|
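A sketch of how a script might react to that error, assuming the error code is reported as proposed above and that the -dictionary option on the stream constructors exists. Because the API for supplying a dictionary to an existing stream is not yet specified at this point in the thread, this version simply retries with a fresh, seeded stream. The proc name and test data are hypothetical.

```tcl
# Hypothetical helper: try to decompress without a dictionary first and
# fall back to a seeded stream if zlib reports that one is needed.
proc inflateSeeded {packed seed} {
    set zd [zlib stream decompress]
    try {
        $zd put $packed
        return [$zd get]
    } trap {TCL ZLIB NEED_DICT} {} {
        # The trap pattern is a prefix match, so extra error code words
        # (if any are added) would not prevent this handler from running.
        set retry [zlib stream decompress -dictionary $seed]
        try {
            $retry put $packed
            return [$retry get]
        } finally {
            $retry close
        }
    } finally {
        $zd close
    }
}

# Illustration data only.
set seed    [encoding convertto utf-8 "content-type: text/html\r\n"]
set headers [encoding convertto utf-8 "content-type: text/html\r\n"]

set zc [zlib stream compress -dictionary $seed]
$zc put -finalize $headers
set packed [$zc get]
$zc close

puts [expr {[inflateSeeded $packed $seed] eq $headers ? "recovered" : "mismatch"}]
```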
From: Lars H. <Lar...@re...> - 2012-05-10 13:10:52
|
Donal K. Fellows wrote 2012-05-10 14.53:
> On 31/03/2012 18:34, Lars Hellström wrote:
>> Yes, I know that! I meant: How does the proposed /Tcl interface/ communicate
>> this fact back to the script level? In other words, by "decompress stream" I
>> meant the stream command created by a [zlib stream decompress] call.
>
> I'm proposing to have it so that, if that's true, you get a Tcl error
> with an error code of {TCL ZLIB NEED_DICT}.

When one tries to read decompressed data from the stream, I hope? (Perhaps
[$stream get 0] can be useful as a special case to test whether
decompression is possible, without actually getting any data.)

> At that point, you'll be
> able to use [$strm checksum] to find what the checksum of the required
> dictionary is, and some API (not yet specified, alas) to supply the
> compression dictionary.

Seems a reasonable API that a -dictionary option to [$stream get] sets the
dictionary first, then tries to decompress. In case you're looking for ideas.

> Remember, the TIP is not yet finished...

Just as long as the matter isn't forgotten.

Lars Hellström
|
From: Donal K. F. <don...@ma...> - 2012-05-15 10:33:30
|
On 10/05/2012 14:11, Lars Hellström wrote:
> Donal K. Fellows wrote 2012-05-10 14.53:
>> I'm proposing to have it so that, if that's true, you get a Tcl error
>> with an error code of {TCL ZLIB NEED_DICT}.
>
> When one tries to read decompressed data from the stream, I hope? (Perhaps
> [$stream get 0] can be useful as a special case to test whether
> decompression is possible, without actually getting any data.)

Yes. Internally, the stream data is stored in compressed form in both
directions. I'm not sure what getting zero-length content would do
though; it might end up returning with no action taken.

>> At that point, you'll be
>> able to use [$strm checksum] to find what the checksum of the required
>> dictionary is, and some API (not yet specified, alas) to supply the
>> compression dictionary.
>
> Seems a reasonable API that a -dictionary option to [$stream get] sets the
> dictionary first, then tries to decompress. In case you're looking for ideas.

That sounds very reasonable.

Donal.
|
From: Lars H. <Lar...@re...> - 2012-05-16 10:31:24
|
Donal K. Fellows wrote 2012-05-15 12.33:
> On 10/05/2012 14:11, Lars Hellström wrote:
>> Donal K. Fellows wrote 2012-05-10 14.53:
>>> I'm proposing to have it so that, if that's true, you get a Tcl error
>>> with an error code of {TCL ZLIB NEED_DICT}.
>>
>> When one tries to read decompressed data from the stream, I hope? (Perhaps
>> [$stream get 0] can be useful as a special case to test whether
>> decompression is possible, without actually getting any data.)

I was thinking that it might be good to be able to sort out the dictionary
business /before/ having to worry about pieces of decompressed data. Concretely,

   set stream [zlib stream decompress]
   $stream put $data
   try {$stream get 0} trap {TCL ZLIB NEED_DICT} {} {
      $stream get -dictionary $DictTable([$stream checksum]) 0
   }
   while {![$stream eof]} {
      set block [$stream get 128]
      ...
   }

seems less convoluted than

   set stream [zlib stream decompress]
   $stream put $data
   try {
      set block [$stream get 128]
   } trap {TCL ZLIB NEED_DICT} {} {
      set block [$stream get -dictionary $DictTable([$stream checksum]) 128]
   }
   while 1 {
      ...
      if {[$stream eof]} break
      set block [$stream get 128]
   }

> Yes. Internally, the stream data is stored in compressed form in both
> directions. I'm not sure what getting zero-length content would do
> though; it might end up returning with no action taken.

If the zlib library is not documented to do something useful in that case, I
can imagine having a special case in the implementation of [$stream get] for
it (checking an integer argument for being 0 is quite straightforward). On
the other hand, the cost for an extra [$stream ready] subcommand is quite
small too.

>>> At that point, you'll be
>>> able to use [$strm checksum] to find what the checksum of the required
>>> dictionary is, and some API (not yet specified, alas) to supply the
>>> compression dictionary.
>>
>> Seems a reasonable API that a -dictionary option to [$stream get] sets the
>> dictionary first, then tries to decompress. In case you're looking for ideas.
>
> That sounds very reasonable.

A possible problem that occurs to me: what happens if one supplies a
-dictionary several times? Are they concatenated, or does
Tcl_ZlibStreamSetCompressionDictionary clear the old dictionary data first?
I'm curious as to whether it would work to go

   set stream [zlib stream decompress]
   $stream put $data -dictionary $UsualDict ; # Default to most common dict
   try {$stream get 0} trap {TCL ZLIB NEED_DICT} {} {
      # It wasn't that one, look up which one should be used instead.
      $stream get -dictionary $DictTable([$stream checksum]) 0
   }

(On the other hand, it's not clear that there would be a problem even if
concatenating, since [$stream checksum] reports the checksum of the
/expected/ dictionary, and the data being decompressed should not refer to
anything before the dictionary it knows about.)

Lars Hellström
|
From: Jan N. <nij...@us...> - 2012-05-06 21:17:37
|
2012/3/30 Donal K. Fellows <dk...@us...>:
> PROPOSED CHANGE: C
> ====================
>
> At the C level, one additional function will be provided:
>
> void * *Tcl_ZlibStreamGetZstreamp*(Tcl_ZlibStream /zshandle/)
>
> This returns the /z_streamp/ associated with the given Tcl_ZlibStream
> structure, which can then be used to directly call appropriate zlib
> functions not directly exposed through Tcl's interface, notably
> including deflateSetDictionary and inflateSetDictionary. Note that if a
> function /is/ exposed through a public interface (e.g., deflate and
> inflate) then it should not be called via this route or inconsistent
> things may happen. The return type of Tcl_ZlibStreamGetZstreamp is
> /void*/ so that there is no need for the zlib public types to form part
> of Tcl's public API.

This Stub function is not a good idea. Here follows the
explanation why.

Tcl can be compiled either by compiling zlib in (e.g. when
the system doesn't have a suitable zlib library) or by
linking to the system zlib library. Stub entries are
meant to be used by extensions. Any extension using this
stub function will after that use some function in zlib
to operate on it. The problem is, how should an
extension do that?

If zlib is compiled in, deflateSetDictionary is not
exported, so the only way is that the extension itself
links to some zlib. How do we know that the zlib
used by the extension is the same version as
the zlib compiled-in by Tcl? Maybe the internal
format of zlib changed, and it wouldn't work at all!

If Tcl is linked with an external zlib, and the
extension is as well, this would be fine. But
then Tcl can never switch to zlib 2.0, because
an extension linked with zlib 1.2.x would
be incompatible with that.

So, please, add stub functions for Tcl_ZlibDeflateSetDictionary and
Tcl_ZlibInflateSetDictionary, in the same form as the other stub
functions, to prevent this problem. So, an extension using zlib
doesn't need to link with zlib directly.

Regards,
Jan Nijtmans
|
From: Donal K. F. <don...@ma...> - 2012-05-08 11:36:20
|
On 06/05/2012 23:17, Jan Nijtmans wrote:
> This Stub function is not a good idea. Here follows the
> explanation why.
[...]
> So, please, add stub functions for Tcl_ZlibDeflateSetDictionary and
> Tcl_ZlibInflateSetDictionary, in the same form as the other stub
> functions, to prevent this problem. So, an extension using zlib
> doesn't need to link with zlib directly.

The problem with that is that every time there's some functionality that
was missed, we have to extend Tcl to fix it. That said, it's not hard to
make the C API work like that; I'm just concerned about the general
commitment of maintenance effort (particularly since it is likely to be
me that does the work :-)).

Donal.
|
From: Donal K. F. <don...@ma...> - 2012-05-10 09:31:49
|
On 06/05/2012 23:17, Jan Nijtmans wrote:
> So, please, add stub functions for Tcl_ZlibDeflateSetDictionary and
> Tcl_ZlibInflateSetDictionary, in the same form as the other stub
> functions, to prevent this problem. So, an extension using zlib
> doesn't need to link with zlib directly.

I've added Tcl_ZlibStreamSetCompressionDictionary for this. There's no
need to have a separate operation for inflating and deflating streams
(an individual stream can only do one of them) and the operation would
actually be the same in the two cases; it just stores the compression
dictionary for later use...

TIP not yet updated.

Donal.
|