Thread: [TCLCORE] TIP #400: Setting the Compression Dictionary

[TCLCORE] TIP #400: Setting the Compression Dictionary

From: Donal K. F. <dk...@us...> - 2012-03-30 14:20:01

 TIP #400: SETTING THE COMPRESSION DICTIONARY 
==============================================
 Version:      $Revision: 1.1 $
 Author:       Donal K. Fellows <dkf_at_users.sf.net>
 State:        Draft
 Type:         Project
 Tcl-Version:  8.6
 Vote:         Pending
 Created:      Friday, 30 March 2012
 URL:          http://purl.org/tcl/tip/400.html
 WebEdit:      http://purl.org/tcl/tip/edit/400
 Post-History: 

-------------------------------------------------------------------------

 ABSTRACT 
==========

 Sometimes it is necessary to set the compression dictionary so that a 
 sequence of bytes may be compressed more efficiently (and decompressed 
 as well). This TIP exposes that functionality. 

 RATIONALE 
===========

 The SPDY protocol extensions to HTTP require the seeding of the zlib 
 compression dictionary (which greatly improves the performance of 
 compression on small amounts of data, such as HTTP headers). In order 
 to allow a pure Tcl implementation of the SPDY protocol, it is 
 therefore necessary to provide a mechanism whereby the compression 
 dictionary (a byte-array up to 262 bytes long, according to the zlib 
 documentation). 

 There is to be no mechanism for retrieving the compression dictionary 
 generated by the compression engine; there is no API for doing that. 
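
As an illustration of the rationale (not part of the TIP text), here is a minimal sketch of the effect of seeding on a small input. It assumes Python's zlib module as a stand-in for the zlib library; the SEED contents, the sample headers, and the helper names compress and decompress are invented for the example and are not SPDY's real dictionary.

```python
import zlib

# Hypothetical seed dictionary: strings we expect to see in the headers.
SEED = (b"content-type: text/html; charset=utf-8\r\n"
        b"content-length: \r\n"
        b"accept-encoding: gzip, deflate\r\n")


def compress(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress data as a zlib stream, optionally seeding the dictionary."""
    comp = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return comp.compress(data) + comp.flush()


def decompress(blob: bytes, zdict: bytes = b"") -> bytes:
    """Decompress a zlib stream, supplying the dictionary if one is given."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


headers = b"content-type: text/html; charset=utf-8\r\ncontent-length: 1234\r\n"
plain = compress(headers)          # no dictionary
seeded = compress(headers, SEED)   # same data, seeded dictionary

# Print the raw, unseeded and seeded sizes for comparison.
print(len(headers), len(plain), len(seeded))
```

Both sides have to agree on the dictionary: the seeded output only decompresses when the same bytes are supplied again.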

 PROPOSED CHANGES: TCL 
=======================

 The *zlib push* command will gain an extra option: 

       *-dictionary* /bytes/ 

 This option will provide a compression dictionary to be used, which 
 will be supplied to the zlib compression engine at the correct moment 
 during compression or provided on request of the compression engine on 
 decompression. The /bytes/ argument will be interpreted as a Tcl 
 bytearray; it must be non-empty if given. 

 In addition, the *zlib stream* command will gain some complexity. All 
 the subcommands will gain the ability to take an extra *-dictionary* 
 /bytes/ pair of options (same interpretation as above), the *zlib 
 stream gzip* variety will also gain the ability to take *-header* 
 /dict/ (where /dict/ is a Tcl dictionary such as is passed to the 
 *-header* option to *zlib gzip*, not a compression dictionary), and the 
 *zlib stream gunzip* variety will also gain the ability to take 
 *-headerVar* /name/ (so that a Tcl dictionary describing the contents 
 of the gzip header can be reported). The omission of the last two was 
 an oversight in [TIP #234]. 

 PROPOSED CHANGE: C 
====================

 At the C level, one additional function will be provided: 

       void * *Tcl_ZlibStreamGetZstreamp*(Tcl_ZlibStream /zshandle/) 

 This returns the /z_streamp/ associated with the given Tcl_ZlibStream 
 structure, which can then be used to directly call appropriate zlib 
 functions not directly exposed through Tcl's interface, notably 
 including deflateSetDictionary and inflateSetDictionary. Note that if a 
 function /is/ exposed through a public interface (e.g., deflate and 
 inflate) then it should not be called via this route or inconsistent 
 things may happen. The return type of Tcl_ZlibStreamGetZstreamp is 
 /void*/ so that there is no need for the zlib public types to form part 
 of Tcl's public API. 

 COPYRIGHT 
===========

 This document has been placed in the public domain. 

-------------------------------------------------------------------------

 TIP AutoGenerator - written by Donal K. Fellows

Re: [TCLCORE] TIP #400: Setting the Compression Dictionary

From: Lars H. <Lar...@re...> - 2012-03-30 14:40:18

Donal K. Fellows wrote 2012-03-30 16.19:
>
>   TIP #400: SETTING THE COMPRESSION DICTIONARY
[snip]
>   to allow a pure Tcl implementation of the SPDY protocol, it is
>   therefore necessary to provide a mechanism whereby the compression
>   dictionary (a byte-array up to 262 bytes long, according to the zlib
>   documentation).

That sentence looks like it was abruptly cut off. And are you sure the 
dictionary can be at most 262 bytes long? Zlib compression works with a 
window up to 32k in length, so I would expect to be able to seed that much 
data. Or does the 262 rather refer to the length of the SPDY seed data?

>   There is to be no mechanism for retrieving the compression dictionary
>   generated by the compression engine; there is no API for doing that.
>
>   PROPOSED CHANGES: TCL
> =======================
>
>   The *zlib push* command will gain an extra option:
>
>         *-dictionary* /bytes/
>
>   This option will provide a compression dictionary to be used, which
>   will be supplied to the zlib compression engine at the correct moment
>   during compression or provided on request of the compression engine on
>   decompression. The /bytes/ argument will be interpreted as a Tcl
>   bytearray; it must be non-empty if given.
>
>   In addition, the *zlib stream* command will gain some complexity. All
>   the subcommands will gain the ability to take an extra *-dictionary*
>   /bytes/ pair of options (same interpretation as above),

How does a decompress stream signal that it needs the dictionary to be seeded?

> the *zlib
>   stream gzip* variety will also gain the ability to take *-header*
>   /dict/ (where /dict/ is a Tcl dictionary such as is passed to the
>   *-header* option to *zlib gzip*, not a compression dictionary), and the
>   *zlib stream gunzip* variety will also gain the ability to take
>   *-headerVar* /name/ (so that a Tcl dictionary describing the contents
>   of the gzip header can be reported). The omission of the last two was
>   an oversight in [TIP #234].

Since the last two points are unrelated to the -dictionary option, it may be 
clearer to put them in a paragraph of their own.

Lars Hellström

Re: [TCLCORE] TIP #400: Setting the Compression Dictionary

From: Donal K. F. <don...@ma...> - 2012-03-30 14:55:26

On 30/03/2012 15:40, Lars Hellström wrote:
> Donal K. Fellows wrote 2012-03-30 16.19:
>>    to allow a pure Tcl implementation of the SPDY protocol, it is
>>    therefore necessary to provide a mechanism whereby the compression
>>    dictionary (a byte-array up to 262 bytes long, according to the zlib
>>    documentation).
>
> That sentence looks like it was abruptly cut off. And are you sure the
> dictionary can be at most 262 bytes long? Zlib compression works with a
> window up to 32k in length, so I would expect to be able to seed that much
> data. Or does the 262 rather refer to the length of the SPDY seed data?

Misread the docs (http://zlib.net/manual.html). The max used is the 
window size *minus* 262 bytes. It doesn't change the fact that I 
wouldn't validate it. :-) I'm more worried about the minimum size.
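
The limit described in the zlib manual can be observed from outside. This is a minimal sketch, assuming Python's zlib module with its default 32 KiB window; the helpers noise and seeded_size are invented for the example. With a dictionary longer than the window, data that matches the end of the dictionary compresses to almost nothing, while data that matches the start gains nothing.

```python
import hashlib
import zlib


def noise(n: int) -> bytes:
    """Deterministic, incompressible-looking bytes (hypothetical helper)."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(out[:n])


def seeded_size(data: bytes, zdict: bytes) -> int:
    """Size of data compressed as a zlib stream seeded with zdict."""
    comp = zlib.compressobj(zdict=zdict)
    return len(comp.compress(data) + comp.flush())


big_dict = noise(40_000)   # longer than the default 32 KiB window
head = big_dict[:200]      # start of the dictionary, outside the usable tail
tail = big_dict[-200:]     # end of the dictionary, inside the usable tail

head_size = seeded_size(head, big_dict)
tail_size = seeded_size(tail, big_dict)
print(head_size, tail_size)
```

This is why the zlib documentation advises putting the most useful strings at the end of the dictionary.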

And yes, that sentence was cut off in its prime. :-(

Donal.

Re: [TCLCORE] TIP #400: Setting the Compression Dictionary

From: Jeff R. <dv...@di...> - 2012-03-30 16:13:32

Lars Hellström wrote:

>>
>>    In addition, the *zlib stream* command will gain some complexity. All
>>    the subcommands will gain the ability to take an extra *-dictionary*
>>    /bytes/ pair of options (same interpretation as above),
>
> How does a decompress stream signal that it needs the dictionary to be seeded?

The zlib specification (RFC 1950) specifies a mechanism to indicate a 
seeded dictionary is to be used on a compressed stream, as well as which 
dictionary is needed (the FDICT flag and the checksum from the 
dictionary, respectively).  I think the zlib library will return a 
Z_NEED_DICT error if this flag is set on a stream and the appropriate 
dictionary has not been set.
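
The header mechanism can be inspected directly. This is a minimal sketch, assuming Python's zlib module as a stand-in for the C library; the DICTIONARY contents and the variable names are invented for the example. It checks the FDICT bit and the DICTID field as laid out in RFC 1950, and shows that decompression fails when no dictionary is supplied.

```python
import zlib

DICTIONARY = b"hypothetical preset dictionary: hello world"
MESSAGE = b"hello world, hello world"

comp = zlib.compressobj(zdict=DICTIONARY)
stream = comp.compress(MESSAGE) + comp.flush()

# RFC 1950: byte 0 is CMF, byte 1 is FLG; bit 5 of FLG is FDICT.
fdict_set = bool(stream[1] & 0x20)

# When FDICT is set, the next four bytes are DICTID: the Adler-32 checksum
# of the dictionary, most significant byte first.
dictid = int.from_bytes(stream[2:6], "big")

try:
    zlib.decompress(stream)   # no dictionary supplied
    failed = False
except zlib.error:
    failed = True             # zlib could not proceed without the dictionary

# Supplying the same dictionary makes the stream readable again.
restored = zlib.decompressobj(zdict=DICTIONARY).decompress(stream)
```

The DICTID value is what lets a receiver tell which dictionary the sender used.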

Of course, to make things more complicated, the SPDY spec doesn't say 
that it actually uses the mechanism defined in zlib to use a 
preset dictionary; it just says that a dictionary is used.  This 
may be intentional, to completely disallow streams that aren't using the 
dictionary, but it strikes me as short-sighted.

-J

Re: [TCLCORE] TIP #400: Setting the Compression Dictionary

From: Lars H. <Lar...@re...> - 2012-03-31 16:34:09

Jeff Rogers wrote 2012-03-30 18.13:
> Lars Hellström wrote:
>
>>>
>>>     In addition, the *zlib stream* command will gain some complexity. All
>>>     the subcommands will gain the ability to take an extra *-dictionary*
>>>     /bytes/ pair of options (same interpretation as above),
>>
>> How does a decompress stream signal that it needs the dictionary to be seeded?
>
> The zlib specification (RFC 1950) specifies a mechanism to indicate a
> seeded dictionary is to be used on a compressed stream, as well as which
> dictionary is needed (the FDICT flag and the checksum from the
> dictionary, respectively).  I think the zlib library will return a
> Z_NEED_DICT error if this flag is set on a stream and the appropriate
> dictionary has not been set.

Yes, I know that! I meant: How does the proposed /Tcl interface/ communicate 
this fact back to the script level? In other words, by "decompress stream" I 
meant the stream command created by a [zlib stream decompress] call.

This is not an issue for SPDY, where the same dictionary should always be 
used, but these [zlib] command additions presumably aim to be more general.

Lars Hellström

Re: [TCLCORE] TIP #400: Setting the Compression Dictionary

From: Donal K. F. <don...@ma...> - 2012-05-10 12:53:15

On 31/03/2012 18:34, Lars Hellström wrote:
> Yes, I know that! I meant: How does the proposed /Tcl interface/ communicate
> this fact back to the script level? In other words, by "decompress stream" I
> meant the stream command created by a [zlib stream decompress] call.

I'm proposing to have it so that, if that's true, you get a Tcl error
with an error code of {TCL ZLIB NEED_DICT}. At that point, you'll be
able to use [$strm checksum] to find what the checksum of the required
dictionary is, and some API (not yet specified, alas) to supply the
compression dictionary. Remember, the TIP is not yet finished...

Donal.

Re: [TCLCORE] TIP #400: Setting the Compression Dictionary

From: Lars H. <Lar...@re...> - 2012-05-10 13:10:52

Donal K. Fellows wrote 2012-05-10 14.53:
> On 31/03/2012 18:34, Lars Hellström wrote:
>> Yes, I know that! I meant: How does the proposed /Tcl interface/ communicate
>> this fact back to the script level? In other words, by "decompress stream" I
>> meant the stream command created by a [zlib stream decompress] call.
>
> I'm proposing to have it so that, if that's true, you get a Tcl error
> with an error code of {TCL ZLIB NEED_DICT}.

When one tries to read decompressed data from the stream, I hope? (Perhaps 
[$stream get 0] can be useful as a special case to test whether 
decompression is possible, without actually getting any data.)

> At that point, you'll be
> able to use [$strm checksum] to find what the checksum of the required
> dictionary is, and some API (not yet specified, alas) to supply the
> compression dictionary.

Seems a reasonable API that a -dictionary option to [$stream get] sets the 
dictionary first, then tries to decompress. In case you're looking for ideas.

> Remember, the TIP is not yet finished...

Just as long as the matter isn't forgotten.

Lars Hellström

Re: [TCLCORE] TIP #400: Setting the Compression Dictionary

From: Donal K. F. <don...@ma...> - 2012-05-15 10:33:30

On 10/05/2012 14:11, Lars Hellström wrote:
> Donal K. Fellows wrote 2012-05-10 14.53:
>> I'm proposing to have it so that, if that's true, you get a Tcl error
>> with an error code of {TCL ZLIB NEED_DICT}.
>
> When one tries to read decompressed data from the stream, I hope? (Perhaps
> [$stream get 0] can be useful as a special case to test whether
> decompression is possible, without actually getting any data.)

Yes. Internally, the stream data is stored in compressed form in both
directions. I'm not sure what getting zero-length content would do
though; it might end up returning with no action taken.

>> At that point, you'll be
>> able to use [$strm checksum] to find what the checksum of the required
>> dictionary is, and some API (not yet specified, alas) to supply the
>> compression dictionary.
>
> Seems a reasonable API that a -dictionary option to [$stream get] sets the
> dictionary first, then tries to decompress. In case you're looking for ideas.

That sounds very reasonable.

Donal.

Re: [TCLCORE] TIP #400: Setting the Compression Dictionary

From: Lars H. <Lar...@re...> - 2012-05-16 10:31:24

Donal K. Fellows wrote 2012-05-15 12.33:
> On 10/05/2012 14:11, Lars Hellström wrote:
>> Donal K. Fellows wrote 2012-05-10 14.53:
>>> I'm proposing to have it so that, if that's true, you get a Tcl error
>>> with an error code of {TCL ZLIB NEED_DICT}.
>>
>> When one tries to read decompressed data from the stream, I hope? (Perhaps
>> [$stream get 0] can be useful as a special case to test whether
>> decompression is possible, without actually getting any data.)

I was thinking that it might be good to be able to sort out the dictionary 
business /before/ having to worry about pieces of decompressed data. Concretely,

   set stream [zlib stream decompress]
   $stream put $data
   try {$stream get 0} trap {TCL ZLIB NEED_DICT} {} {
      $stream get -dictionary $DictTable([$stream checksum]) 0
   }
   while {![$stream eof]} {
      set block [$stream get 128]
      ...
   }

seems less convoluted than

   set stream [zlib stream decompress]
   $stream put $data
   try {
      set block [$stream get 128]
   } trap {TCL ZLIB NEED_DICT} {} {
      set block [$stream get -dictionary $DictTable([$stream checksum]) 128]
   }
   while 1 {
      ...
      if {[$stream eof]} break
      set block [$stream get 128]
   }


> Yes. Internally, the stream data is stored in compressed form in both
> directions. I'm not sure what getting zero-length content would do
> though; it might end up returning with no action taken.

If the zlib library is not documented to do something useful in that case, I 
can imagine having a special case in the implementation of [$stream get] for 
it (checking an integer argument for being 0 is quite straightforward). On 
the other hand, the cost for an extra [$stream ready] subcommand is quite 
small too.

>>> At that point, you'll be
>>> able to use [$strm checksum] to find what the checksum of the required
>>> dictionary is, and some API (not yet specified, alas) to supply the
>>> compression dictionary.
>>
>> Seems a reasonable API that a -dictionary option to [$stream get] sets the
>> dictionary first, then tries to decompress. In case you're looking for ideas.
>
> That sounds very reasonable.

A possible problem that occurs to me: what happens if one supplies a 
-dictionary several times? Are they concatenated, or does 
Tcl_ZlibStreamSetCompressionDictionary clear the old dictionary data first? 
I'm curious as to whether it would work to go

   set stream [zlib stream decompress]
   $stream put $data -dictionary $UsualDict ; # Default to most common dict
   try {$stream get 0} trap {TCL ZLIB NEED_DICT} {} {
      # It wasn't that one, look up which one should be used instead.
      $stream get -dictionary $DictTable([$stream checksum]) 0
   }

(On the other hand, it's not clear that there would be a problem even if 
concatenating, since [$stream checksum] reports the checksum of the 
/expected/ dictionary, and the data being decompressed should not refer to 
anything before the dictionary it knows about.)
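
The lookup-by-checksum idea from the Tcl sketches above can be tried against zlib itself. This is a minimal sketch, assuming Python's zlib module rather than the proposed Tcl interface; KNOWN, DICT_TABLE, required_dictid and decompress_auto are invented names, with DICT_TABLE playing the role of DictTable. It reads the DICTID from the stream header instead of waiting for an error.

```python
import zlib

# Hypothetical table of known dictionaries, keyed by Adler-32 checksum.
KNOWN = [b"GET POST HEAD host: accept: ",
         b"content-type: content-length: "]
DICT_TABLE = {zlib.adler32(d): d for d in KNOWN}


def required_dictid(stream: bytes):
    """Return the DICTID from a zlib header, or None if FDICT is not set."""
    if len(stream) >= 6 and stream[1] & 0x20:
        return int.from_bytes(stream[2:6], "big")
    return None


def decompress_auto(stream: bytes) -> bytes:
    """Pick the dictionary named by the stream header, then decompress."""
    dictid = required_dictid(stream)
    if dictid is None:
        return zlib.decompress(stream)
    zdict = DICT_TABLE[dictid]   # KeyError if the dictionary is unknown
    return zlib.decompressobj(zdict=zdict).decompress(stream)


# Usage: compress with the second dictionary, then let the header choose.
comp = zlib.compressobj(zdict=KNOWN[1])
packed = comp.compress(b"content-type: text/plain") + comp.flush()
print(decompress_auto(packed))
```

Because the choice is made from the header before any dictionary is set, the question of what happens when a second dictionary replaces a first never arises in this variant.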

Lars Hellström

Re: [TCLCORE] TIP #400: Setting the Compression Dictionary

From: Jan N. <nij...@us...> - 2012-05-06 21:17:37

2012/3/30 Donal K. Fellows <dk...@us...>:
>  PROPOSED CHANGE: C
> ====================
>
>  At the C level, one additional function will be provided:
>
>       void * *Tcl_ZlibStreamGetZstreamp*(Tcl_ZlibStream /zshandle/)
>
>  This returns the /z_streamp/ associated with the given Tcl_ZlibStream
>  structure, which can then be used to directly call appropriate zlib
>  functions not directly exposed through Tcl's interface, notably
>  including deflateSetDictionary and inflateSetDictionary. Note that if a
>  function /is/ exposed through a public interface (e.g., deflate and
>  inflate) then it should not be called via this route or inconsistent
>  things may happen. The return type of Tcl_ZlibStreamGetZstreamp is
>  /void*/ so that there is no need for the zlib public types to form part
>  of Tcl's public API.

This Stub function is not a good idea. Here follows the
explanation why.

Tcl can be compiled either by compiling zlib in (e.g. when
the system doesn't have a suitable zlib library) or by
linking to the system zlib library. Stub entries
are meant to be used by extensions. Any extension
using this stub function will after that use some function
in zlib to operate on it.

The problem is, how should an extension do that? If
zlib is compiled in, deflateSetDictionary is not exported,
so the only way is that the extension itself links to some
zlib. How do we know that the zlib used by the extension
is the same version as the zlib compiled-in by Tcl?
Maybe the internal format of zlib changed, and it
wouldn't work at all! If Tcl is linked with an external
zlib, and the extension is as well, this would be
fine. But then Tcl can never switch to zlib 2.0,
because an extension linked with zlib 1.2.x
would be incompatible with that.

So, please, add stub functions for
Tcl_ZlibDeflateSetDictionary and
Tcl_ZlibInflateSetDictionary, in the same
form as the other stub functions, to prevent
this problem. That way, an extension using
zlib doesn't need to link with zlib directly.

Regards,
           Jan Nijtmans

Re: [TCLCORE] TIP #400: Setting the Compression Dictionary

From: Donal K. F. <don...@ma...> - 2012-05-08 11:36:20

On 06/05/2012 23:17, Jan Nijtmans wrote:
> This Stub function is not a good idea. Here follows the
> explanation why.
[...]
> So, please, add stub functions for Tcl_ZlibDeflateSetDictionary and
> Tcl_ZlibInflateSetDictionary, in the same form as the other stub
> functions, to prevent this problem. That way, an extension using zlib
> doesn't need to link with zlib directly.

The problem with that is that every time there's some functionality that
was missed, we have to extend Tcl to fix it. That said, it's not hard to
make the C API work like that; I'm just concerned about the general
commitment of maintenance effort (particularly since it is likely to be
me that does the work :-)).

Donal.

Re: [TCLCORE] TIP #400: Setting the Compression Dictionary

From: Donal K. F. <don...@ma...> - 2012-05-10 09:31:49

On 06/05/2012 23:17, Jan Nijtmans wrote:
> So, please, add stub functions for Tcl_ZlibDeflateSetDictionary and
> Tcl_ZlibInflateSetDictionary, in the same form as the other stub
> functions, to prevent this problem. That way, an extension using zlib
> doesn't need to link with zlib directly.

I've added Tcl_ZlibStreamSetCompressionDictionary for this. There's no
need to have a separate operation for inflating and deflating streams
(an individual stream can only do one of them) and the operation would
actually be the same in the two cases; it just stores the compression
dictionary for later use...

TIP not yet updated.

Donal.