From: <no...@so...> - 2001-10-04 20:33:59
Bugs item #219283, was opened at 2000-10-25 22:09
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=110894&aid=219283&group_id=10894

Category: 10. Objects
Group: = 8.1b3
>Status: Pending
Resolution: None
Priority: 5
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Jeffrey Hobbs (hobbs)
Summary: iso2022-jp encoding is broken

Initial Comment:
OriginalBugID: 1933 Bug
Version: 8.1b3
SubmitDate: '1999-04-26'
LastModified: '1999-04-26'
Severity: CRIT
Status: UnAssn
Submitter: stanton
ChangedBy: stanton
OS: All
OSVersion: NA
Machine: NA
FixedDate: '2000-10-25'
FixedInVersion: NA
ClosedDate: '2000-10-25'

Unfortunately, it does not work well. I used the script below with the
CVS version of core8.1 on Digital-Unix 4.0B and Linux 2.0.36. The
details follow the script.

=====
#!/usr/new/bin/tclsh8.1
# eucjis
# k.furukawa, apr.24.1999.
# based on message from scott stanton

set usage "eucjis (-eucjis|-jiseuc) (-read|-gets) {input-file} {output-file}"
foreach {eucjis read input output} $argv {}
if {$output == ""} { puts stderr $usage ; exit 1 }

set infile [open $input r]
set outfile [open $output w]

# select the conversion direction
switch -- $eucjis {
    -eucjis {
        fconfigure $infile -encoding euc-jp
        fconfigure $outfile -encoding iso2022-jp
    }
    -jiseuc {
        fconfigure $infile -encoding iso2022-jp
        fconfigure $outfile -encoding euc-jp
    }
    default { puts stderr $usage ; exit 1 }
}

# copy the whole file at once (-read) or line by line (-gets)
switch -- $read {
    -read { puts -nonewline $outfile [read $infile] }
    -gets {
        while {[gets $infile line] != -1} { puts $outfile $line }
    }
    default { puts stderr $usage ; exit 1 }
}

close $infile ; close $outfile
=====

(1) Old character set (not serious)

Texts generated by Tcl's iso2022-jp begin with "<ESC> $ @", which is an
old sequence.
RFC 1468 and RFC 1554 say:

reg#  character set      ESC sequence                designated to
------------------------------------------------------------------
6     ASCII              ESC 2/8 4/2    ESC ( B      G0
42    JIS X 0208-1978    ESC 2/4 4/0    ESC $ @      G0
87    JIS X 0208-1983    ESC 2/4 4/2    ESC $ B      G0
14    JIS X 0201-Roman   ESC 2/8 4/10   ESC ( J      G0

The character sets of 0208-1978 and 0208-1983 are slightly different.
Unless we want to specify the old character set explicitly, we should
use the new character set, 0208-1983, which is introduced by
"<ESC> $ B".

(2) Unnecessary sequences (not serious)

If I write each line with puts, using the "-gets" option above, each
line is preceded by "<ESC> ( B" and followed by "<ESC> ( B", even if the
line consists entirely of ASCII characters. These sequences are simply
not necessary. Only Japanese (0208-1983) text should be surrounded by
"<ESC> $ B" and "<ESC> ( B".

(3) Not converted correctly from iso2022-jp to euc-jp (serious)

I'll attach an example file, jis1.txt. The file is not converted
correctly by the command
"tclsh8.1 eucjis.tcl -jiseuc -gets jis1.txt euc1.txt". The last
character should be hexadecimal <A5><F3>, while Tcl outputs ASCII "%s".
I could find many such examples. Another example is jis2.txt, in which
5 characters are not converted correctly.

(4) Infinite loop in conversion from euc-jp to iso2022-jp (serious)

If I convert line by line with the "-gets" option above, the script
converts euc-jp text into iso2022-jp, although the output contains many
unnecessary escape sequences, as described in (2). However, if I use
"-read", it sometimes goes into an infinite loop in "puts" and outputs
"<ESC> ( B" repeatedly. This does not occur with short texts. I'll
attach one of the example input files, "euc.txt".

Sorry, I don't have time now to debug Tcl's encoding system. But if you
have more questions, please ask me.

jis1.txt and jis2.txt are encoded in iso2022-jp, euc.txt in euc-jp.

Regards, Kazuro.
-----
Kazuro FURUKAWA <kaz...@ke...> (or <fur...@ke...>)
Linac, High Energy Accelerator Research Organization (KEK), Japan

This looks like there may be a problem with the 2022 encoding. We need
to verify the sub-tables and possibly change the order of the prefixes
in the file to give preference to the newer sub-encodings.

----------------------------------------------------------------------

>Comment By: Jeffrey Hobbs (hobbs)
Date: 2001-10-04 13:33

Message:
Logged In: YES
user_id=72656

This is a rather old bug report, and the encodings have been updated
since then, so it would be nice to reverify this. Also, note in this
thread:
http://groups.google.com/groups?th=3521768f2de838da&seekm=3B698242.1F2365D2%40crd.ge.com
that not all iso2022 encodings are created equal...

----------------------------------------------------------------------
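For anyone reverifying points (1) to (3), a minimal sketch along these lines might help. It avoids files entirely and uses the built-in "encoding convertfrom" and "encoding convertto" commands. The helper name hexdump is hypothetical, and the sample input is an assumption reconstructed from the report (the JIS bytes "%s" between "<ESC> $ B" and "<ESC> ( B"), not the contents of the attached jis1.txt.

```tcl
# Hypothetical helper (not part of Tcl): show a byte string as hex pairs.
proc hexdump {bytes} {
    binary scan $bytes H* hex
    return [regexp -all -inline {..} $hex]
}

# Assumed sample: one JIS X 0208 character, bytes 25 73 ("%s"),
# wrapped in ESC $ B ... ESC ( B as described in the report.
set jis "\u001b\$B%s\u001b(B"

# Point (3): iso2022-jp -> euc-jp. The report expects a5 f3 here.
set text [encoding convertfrom iso2022-jp $jis]
puts "euc-jp:        [hexdump [encoding convertto euc-jp $text]]"

# Point (1): check which designator is emitted for Japanese text,
# ESC $ @ (1b 24 40) or ESC $ B (1b 24 42).
puts "iso2022-jp:    [hexdump [encoding convertto iso2022-jp "\u30f3"]]"

# Point (2): check whether pure ASCII picks up ESC ( B (1b 28 42).
puts "ascii as 2022: [hexdump [encoding convertto iso2022-jp "abc"]]"
```

The in-memory conversions do not exercise the channel code path, so the infinite loop in (4) would still need a test with "fconfigure -encoding" and "read" on a longer file such as euc.txt.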