#1791 Decoding broken for iso-2022-jp

obsolete: 8.3.4
closed-fixed
5
2002-03-02
2002-02-28
thomas park
No

Tcl 8.3.4 does not decode iso-2022-jp data properly
under certain conditions.

Attached is a file "iso2022.sample". This file is
also located at
http://www.nevernever.org/iso2022.sample

When this file is read line-by-line using "gets", only
the first 62 bytes of each line are decoded properly.
The remainder of the line comes out as rubbish, e.g. a
literal string such as:

%e%"%&%H$N:]$N

A sample of the garbled output (reencoded in
iso-2022-jp) can be found at
http://www.nevernever.org/iso2022.out

Here is a code segment which will produce output with
mangled lines in $msg:

set f [ open $filename r ]
fconfigure $f -encoding iso2022-jp
set msg ""

while { ![eof $f] } {
    set line [ gets $f ]
    append msg "$line\n"
}

close $f

Interestingly, this problem is not exhibited when the
file is read using "read", as in:

set f [ open $filename r ]
fconfigure $f -encoding iso2022-jp
set msg [ read $f ]
close $f
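
Since "read" decodes the sample correctly, one possible workaround on affected versions is to read the whole file in one call and split it into lines afterwards. The following is only a minimal sketch under that assumption; the proc name readLinesIso2022 is hypothetical and is not part of Tcl or of this report:

```tcl
# Hypothetical helper (not part of Tcl): return the lines of an
# iso2022-jp encoded file as a list, without calling gets.
proc readLinesIso2022 { filename } {
    set f [ open $filename r ]
    fconfigure $f -encoding iso2022-jp
    # -nonewline drops the final newline, so split does not
    # produce an empty last element
    set data [ read -nonewline $f ]
    close $f
    return [ split $data "\n" ]
}
```

Called as "readLinesIso2022 $filename", it returns a list that can be walked with "foreach" in place of the "gets" loop. The whole file is held in memory, so this may not suit very large inputs.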

A Usenet thread discussing this issue (subject
was "trouble decoding iso2022-jp") can be found at
http://groups.google.com/groups?hl=en&frame=right&th=b7c438757333dab2

Jeff Hobbs posted: "OK, on this one Thomas keyed me in
by pointing out that only his strict example fails
(using gets - read is OK). So I honed in on that, and
was further keyed in that he noted only X chars ever
get translated. Poking around, I found the problem
in tclIO.c:FilterInputBytes. It has something to do
with the value of the ENCODING_LINESIZE #define
(currently 30). If I bump that up to 60, I can read
Thomas' sample just fine, and if I drop it to 20, it
stops the correct encoding translation even earlier
per line. That's obviously not a correct fix, but it
does indicate that FilterInputBytes isn't encoding
right. I'll have to look into this more when it's not
past midnight ..."

Discussion

  • thomas park

    thomas park - 2002-02-28

    Sample iso-2022-jp text

     
  • thomas park

    thomas park - 2002-02-28

    Garbled output using "gets"

     
  • thomas park

    thomas park - 2002-02-28

    I would like to mention that lines longer than 62 bytes in
    other encodings (such as Shift-JIS) do not pose any
    problems when read using "gets".

    A 235-byte line of Shift-JIS text can be found at
    http://www.nevernever.org/shiftjis.sample

     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2002-03-02
    • assigned_to: nijtmans --> hobbs
    • status: open --> closed-fixed
     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2002-03-02
     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2002-03-02

    I figured out the problem ... the ChannelState's
    inputEncodingFlags was not getting the TCL_ENCODING_START
    flag ever turned off. 'read' would do this, but
    not 'gets'. This meant that 'gets' would see the initial
    escape and jump to jis0208 mode, but after reading in a
    small buffer's worth of data (somewhat related to
    ENCODING_LINESIZE), it reset to the default encoding in the
    table again (iso8859-1). Increasing the ENCODING_LINESIZE
    parameter just extended the amount of data that fit in that
    initial buffer.

    In the patch I actually lower the ENCODING_LINESIZE value
    slightly, as it speeds up 'gets' somewhat.

    Committed to 8.3.4+ and 8.4a4cvs. Added tests based on the
    sample data to encoding.test
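
The effect of a decoder that falls back to its initial state in mid-line can be illustrated outside the channel code. The sketch below does not reproduce the bug in tclIO.c; it only assumes that the payload bytes of a JIS X 0208 character, decoded without the preceding escape sequence, pass through as single-byte characters. That would be consistent with the "$N" fragments in the garbled literal quoted in the report. The variable names are illustrative only:

```tcl
# Illustration only: decode the same two payload bytes with and without
# the ISO-2022-JP escape sequence that selects JIS X 0208.
set esc "\x1b"
set payload "\$N"    ;# bytes 0x24 0x4E, the JIS X 0208 code for HIRAGANA NO

# ESC $ B switches to JIS X 0208, ESC ( B switches back to ASCII
set framed "${esc}\$B${payload}${esc}(B"

set good [ encoding convertfrom iso2022-jp $framed ]
set bad  [ encoding convertfrom iso2022-jp $payload ]

# good holds the single character U+306E; bad holds the two characters $N
puts [ string length $good ]
puts $bad
```

With the escape sequence in place the two bytes decode to one Japanese character; without it they are taken as the ASCII characters "$" and "N", which is the kind of output "gets" produced once the decoder state had been reset.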

     
