[Tcl-bugs] [ tcl-Bugs-219283 ] iso2022-jp encoding is broken

Bugs item #219283, was opened at 2000-10-25 22:09
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=110894&aid=219283&group_id=10894

Category: 10. Objects
Group: = 8.1b3
>Status: Pending
Resolution: None
Priority: 5
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Jeffrey Hobbs (hobbs)
Summary: iso2022-jp encoding is broken

Initial Comment:
OriginalBugID: 1933 Bug
Version: 8.1b3
SubmitDate: '1999-04-26'
LastModified: '1999-04-26'
Severity: CRIT
Status: UnAssn
Submitter: stanton
ChangedBy: stanton
OS: All
OSVersion: NA
Machine: NA
FixedDate: '2000-10-25'
FixedInVersion: NA
ClosedDate: '2000-10-25'

It does not work well unfortunately.  I used this script with the cvs 
version of core8.1 on Digital-Unix 4.0B and Linux 2.0.36.  I'll 
describe the details below. 

=====
#!/usr/new/bin/tclsh8.1
# eucjis
# k.furukawa, apr.24.1999. 
# based on message from scott stanton

set usage "eucjis (-eucjis|-jiseuc) (-read|-gets) {input-file} {output-file}"
foreach {eucjis read input output} $argv {}
if {$output == ""} { puts stderr $usage ; exit 1 }
set infile [open $input r]
set outfile [open $output w]
switch -- $eucjis {
    -eucjis {
        fconfigure $infile -encoding euc-jp
        fconfigure $outfile -encoding iso2022-jp
    } -jiseuc {
        fconfigure $infile -encoding iso2022-jp
        fconfigure $outfile -encoding euc-jp
    } default { puts stderr $usage ; exit 1 }
}
switch -- $read {
    -read {
        puts -nonewline $outfile [read $infile]
    } -gets {
        while {[gets $infile line] != -1} {
            puts $outfile $line
        }
    } default { puts stderr $usage ; exit 1 }
}
close $infile ; close $outfile
=====

(1) Old character set (not serious)

Text generated by Tcl's iso2022-jp encoding starts with "<ESC> $ @",
which is an old sequence.  RFC 1468 and RFC 1554 say:
      reg#  character set      ESC sequence                designated to
      ------------------------------------------------------------------
      6     ASCII              ESC 2/8 4/2      ESC ( B    G0
      42    JIS X 0208-1978    ESC 2/4 4/0      ESC $ @    G0
      87    JIS X 0208-1983    ESC 2/4 4/2      ESC $ B    G0
      14    JIS X 0201-Roman   ESC 2/8 4/10     ESC ( J    G0
The 0208-1978 and 0208-1983 character sets are slightly different.
If we don't want to specify the old character set explicitly, we
should use the new character set, 0208-1983, which is introduced by
"<ESC> $ B".

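To check which designation a given output file actually uses, the raw
bytes can be scanned for the four escape sequences in the table above.
This is only a minimal sketch: the proc name listDesignations is
hypothetical, it needs Tcl 8.5 or later for "dict", and it knows only
the four sequences quoted from the RFCs.

```tcl
# Hypothetical helper: report the ISO-2022-JP designation sequences
# found in a byte string, in order of appearance.  The table mirrors
# the four entries quoted above; anything else is reported as "unknown".
proc listDesignations {bytes} {
    set names [dict create \
        "\033(B"  "ASCII" \
        "\033\$@" "JIS X 0208-1978" \
        "\033\$B" "JIS X 0208-1983" \
        "\033(J"  "JIS X 0201-Roman"]
    set found {}
    set pos 0
    # Each designation is ESC plus two more bytes.
    while {[set pos [string first "\033" $bytes $pos]] >= 0} {
        set seq [string range $bytes $pos [expr {$pos + 2}]]
        if {[dict exists $names $seq]} {
            lappend found [dict get $names $seq]
        } else {
            lappend found "unknown"
        }
        incr pos 3
    }
    return $found
}
```

To inspect a file, open it, set "fconfigure $f -translation binary",
and pass the result of "read $f" to the proc.
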
(2) Unnecessary sequences (not serious)

If I write each line with puts using the "-gets" option above, each
line is preceded by "<ESC> ( B" and followed by "<ESC> ( B", even if
the line consists entirely of ASCII characters.  These sequences are
simply not necessary.  Only Japanese (0208-1983) text should be
surrounded by "<ESC> $ B" and "<ESC> ( B".

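As an illustration of which sequences count as unnecessary, the sketch
below drops any designation that re-selects the character set already
in effect.  It is post-processing on the raw bytes, not a fix for the
encoder, and the proc name dropRedundantDesignations is hypothetical.
It assumes the text starts in ASCII, as RFC 1468 requires, and it does
not collapse two different designations that follow each other with no
characters in between.

```tcl
# Hypothetical helper: remove designation escapes that select the
# character set that is already active.  Initial state is ASCII.
proc dropRedundantDesignations {bytes} {
    set known [list "\033(B" "\033(J" "\033\$@" "\033\$B"]
    set current "\033(B"
    set out ""
    set i 0
    set n [string length $bytes]
    while {$i < $n} {
        set seq [string range $bytes $i [expr {$i + 2}]]
        if {[lsearch -exact $known $seq] >= 0} {
            # Keep the escape only when it changes the active set.
            if {$seq ne $current} {
                append out $seq
                set current $seq
            }
            incr i 3
        } else {
            append out [string index $bytes $i]
            incr i
        }
    }
    return $out
}
```

With this sketch, an all-ASCII line wrapped in "<ESC> ( B" on both
sides comes back with no escape sequences at all, while a line that
switches to 0208-1983 and back is left unchanged.
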
(3) Not converted correctly from iso2022-jp to euc-jp (serious)

I'll attach an example file, jis1.txt.  The file is not converted
correctly by the command
"tclsh8.1 eucjis.tcl -jiseuc -gets jis1.txt euc1.txt".
The last character should be hexadecimal <A5><F3>, while tcl outputs
ASCII "%s".  I found many such examples.  Another example is jis2.txt,
which has 5 characters that are not converted correctly.
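
The two byte values are related: a JIS X 0208 character in EUC-JP is
its 7-bit JIS form with the high bit set on both bytes, so <A5><F3>
corresponds to <25><73>, which read as ASCII is "%s".  The reported
output is therefore consistent with the two bytes being passed through
as ASCII instead of being decoded as one double-byte character.  That
is an inference from the byte values, not a confirmed diagnosis.  A
minimal sketch of the arithmetic, with a hypothetical proc name:

```tcl
# Hypothetical helper: clear the high bit of every byte, which maps an
# EUC-JP double-byte character to its 7-bit JIS X 0208 byte pair.
proc stripHighBit {bytes} {
    set out ""
    # c* yields signed 8-bit integers; masking with 0x7f still gives
    # the intended 7-bit value.
    binary scan $bytes c* codes
    foreach code $codes {
        append out [format %c [expr {$code & 0x7f}]]
    }
    return $out
}

puts [stripHighBit [binary format H* a5f3]]   ;# prints %s
```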

(4) Infinite loop in conversion from euc-jp to iso2022-jp (serious)

If I convert line by line with the "-gets" option above, the script
converts euc-jp text into iso2022-jp, although the output contains many
unnecessary escape sequences as described in (2).

However, if I use "-read", it sometimes goes into an infinite loop in
"puts", and outputs "<ESC> ( B" repeatedly.  It does not occur with
short texts.  I'll attach one of the example input files, "euc.txt".

Sorry, I don't have time now to debug Tcl's encoding system.  But
if you have more questions, please ask me.

jis1.txt and jis2.txt are encoded in iso2022-jp, euc.txt in euc-jp. 

Regards, Kazuro. 
-----
Kazuro FURUKAWA <kaz...@ke...>  (or <fur...@ke...>)
 Linac,  High Energy Accelerator Research Organization (KEK), Japan

This looks like there may be a problem with the 2022 encoding.  We need to verify the sub-tables and possibly change the order of the prefixes in the file to give preference to the newer sub-encodings.

----------------------------------------------------------------------

>Comment By: Jeffrey Hobbs (hobbs)
Date: 2001-10-04 13:33

Message:
Logged In: YES 
user_id=72656

This is a rather old bug report, and the encodings have 
been updated since then, so it would be nice to reverify 
this.  Also, note in this thread:

http://groups.google.com/groups?th=3521768f2de838da&seekm=3B698242.1F2365D2%40crd.ge.com

that not all iso2022 encodings are created equal...

----------------------------------------------------------------------
