#1800 iso2022-jp conversion problems

obsolete: 8.4a4
closed-fixed
5
2002-04-18
2002-03-06
No

tcl8.4a4 addressed several problems with the iso2022-jp encoding.
For example, the bugs that I submitted in the past were mostly fixed:
[ BugID: 218099 ] iso2022-jp encoding does not work.
[ BugID: 219283 ] iso2022-jp encoding is broken

However, it still has problems when I convert relatively long (longer
than several kilobytes) Japanese texts (e.g. Unix Japanese manual
pages) into iso2022-jp. I'll attach a script to reproduce that.

Some details follow.

(1) euc-jp to iso2022-jp gets-puts conversion

When I convert a text with "tclsh8.4 eucjis.tcl -eucjis -gets infile outfile",
sometimes "esc ( B" is missing and sometimes an extra "esc ( B" appears.
While an extra "esc ( B" does not matter, a missing "esc ( B" causes
missing characters on reading. The error is reproducible if I use the
same file, but I don't know how and when it happens.
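
One way to locate a missing "esc ( B" in a converted file is to scan the bytes while tracking the shift state. The sketch below is in Python rather than Tcl, and the function name find_suspect_offsets is made up for illustration. It assumes the file uses only the two escapes discussed here ("esc ( B" for ASCII and "esc $ B" for JIS X 0208-1983), and it reports any offset where a space, newline, or other byte outside 0x21..0x7E shows up while the stream is still in two-byte mode.

```python
TO_ASCII = b"\x1b(B"  # "esc ( B": designate ASCII
TO_JIS = b"\x1b$B"    # "esc $ B": designate JIS X 0208-1983


def find_suspect_offsets(data):
    """Return offsets of bytes that cannot belong to a two-byte character.

    Assumption: only the two escapes above occur in the data. Any other
    escape met in two-byte mode is reported as a suspect byte.
    """
    suspects = []
    two_byte = False
    i = 0
    while i < len(data):
        if data.startswith(TO_ASCII, i):
            two_byte = False
            i += 3
        elif data.startswith(TO_JIS, i):
            two_byte = True
            i += 3
        elif two_byte:
            if 0x21 <= data[i] <= 0x7E:
                i += 2  # one JIS X 0208 character occupies two bytes
            else:
                suspects.append(i)  # e.g. space or newline in two-byte mode
                i += 1
        else:
            i += 1  # plain ASCII byte
    return suspects


# Byte sequences modelled on the dumps in this report.
good = b"\x1b$B%=!<%H!J\x1b(B-t \x1b$B$H;H\x1b(B\n"
bad = b"\x1b$B%=!<%H!J-t \x1b$B$H;H\x1b(B\n"
print(find_suspect_offsets(good), find_suspect_offsets(bad))
```

In the bad sequence the bytes "- t" are consumed as if they were one two-byte character, so the first byte flagged is the space that follows them.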

"od -x -a" of an example error is below. If I extract
the erroneous line and convert it on its own, the error does not occur.
Thus the error depends not on the character codes but on the context.

[ output from eucjis.tcl -eucjis -gets euc.txt jis-n3.txt ]

            % H $ 7 $ ^ esc ( B nl sp sp sp sp sp sp
  0007760   241b 2442 2139 1b23 4228 0a0a 2020 752d
            esc $ B $ 9 ! # esc ( B nl nl sp sp - u
! 0010000   2020 241b 2542 213d 253c 2148 2d4a 2074
!           sp sp esc $ B % = ! < % H ! J - t sp
! 0010020   241b 2442 3b48 4d48 2451 2439 246b 2448
!           esc $ B $ H ; H M Q $ 9 $ k $ H $

[ correct output produced from a software called nkf ]

            % H $ 7 $ ^ esc ( B nl sp sp sp sp sp sp
  0007760   241b 2442 2139 1b23 4228 0a0a 2020 752d
            esc $ B $ 9 ! # esc ( B nl nl sp sp - u
! 0010000   2020 241b 2542 213d 253c 2148 1b4a 4228
!           sp sp esc $ B % = ! < % H ! J esc ( B
! 0010020   742d 1b20 4224 4824 483b 514d 3924 6b24
!           - t sp esc $ B $ H ; H M Q $ 9 $ k
! 0010040   4824 2d24 4b21 5e24 3f24 4f24 3d49 283c
!           $ H $ - ! K $ ^ $ ? $ O I = < (
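
When reading the dumps above, note that the hex words appear byte-swapped relative to the character rows, which suggests "od -x" was run on a little-endian machine (an inference from the dump, not something stated in the report). The following Python sketch, with a made-up helper name od_words_to_bytes, turns one row of hex words back into the byte sequence so it can be compared with the character row.

```python
def od_words_to_bytes(words):
    """Convert little-endian 16-bit hex words from `od -x` into bytes."""
    out = bytearray()
    for word in words.split():
        value = int(word, 16)
        out.append(value & 0xFF)         # low byte comes first in the file
        out.append((value >> 8) & 0xFF)  # high byte comes second
    return bytes(out)


# Row at offset 0007760, identical in both dumps.
row = "241b 2442 2139 1b23 4228 0a0a 2020 752d"
print(od_words_to_bytes(row))
```

Applied to the row at offset 0010000, the same helper shows the difference between the two dumps: the nkf output has "esc ( B" after "! J", while the Tcl output runs straight into "- t sp".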

(2) euc-jp to iso2022-jp read-puts conversion

When I convert a text with "tclsh8.4 eucjis.tcl -eucjis -read infile outfile",
sometimes an extra "esc $ B" appears in the middle of the output.
It seems to always appear at around character number 4096, 8192, etc.
(That is the character count, not the byte count.) Thus, if the Tcl internal
buffer for Unicode storage is 8192 bytes long (4096 characters), I suspect
a bug in the boundary handling at the beginning of each internal buffer.
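
The suspected mechanism can be illustrated outside Tcl. The sketch below uses Python's incremental iso2022_jp encoder and a made-up helper encode_in_chunks. It is not Tcl's implementation, and the chunk size of 4 characters merely stands in for the 4096-character buffer guessed at above. It only shows that a converter which forgets its shift state at each chunk boundary emits a redundant "esc $ B" at the start of every chunk.

```python
import codecs

TO_JIS = b"\x1b$B"  # "esc $ B": designate JIS X 0208-1983


def encode_in_chunks(text, size, keep_state):
    """Encode text to ISO-2022-JP in chunks of `size` characters."""
    make_encoder = codecs.getincrementalencoder("iso2022_jp")
    encoder = make_encoder()
    out = bytearray()
    for start in range(0, len(text), size):
        if not keep_state:
            # Hypothetical bug: the shift state is lost at each boundary.
            encoder = make_encoder()
        out += encoder.encode(text[start:start + size])
    # Flush once at the end so the stream returns to ASCII.
    out += encoder.encode("", final=True)
    return bytes(out)


text = "\u3042" * 10  # ten copies of HIRAGANA LETTER A
stateful = encode_in_chunks(text, 4, keep_state=True)
stateless = encode_in_chunks(text, 4, keep_state=False)
print(stateful.count(TO_JIS), stateless.count(TO_JIS))
```

With state carried across chunks the designation appears once; with state reset it appears once per chunk, which matches the symptom of extra "esc $ B" at regular character intervals.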

(3) font selection mechanism

Under tk8.4a4 some characters are not displayed correctly with a font
like "*-jisx0208.1983-1". It is a minor problem, since we normally use
"*-jisx0208.1983-0".

Discussion

  • kazuro furukawa

    kazuro furukawa - 2002-03-06

    A script to convert Japanese texts between euc-jp and iso2022-jp encodings

     
  • kazuro furukawa

    kazuro furukawa - 2002-03-06
    • assigned_to: nijtmans --> hobbs
     
  • kazuro furukawa

    kazuro furukawa - 2002-03-08

    Logged In: YES
    user_id=49637

    Problems (1) and (2) were found to be fixed by a patch by
    Koichi Yamamoto (private communication). He may submit the
    patch after he refines it.

     
  • Koichi Yamamoto

    Koichi Yamamoto - 2002-03-12

    Logged In: YES
    user_id=475117

    Hi,
    I sent Mr. Furukawa an additional patch to fix this
    problem, and he replied that problems (1) and (2)
    were solved.

    My additional patch is available from:
    http://www3.ocn.ne.jp/~yamako/tcl/iso2022-jp.tcl84a4.2002mar12.patch

     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2002-04-18
     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2002-04-18

    Logged In: YES
    user_id=72656

    Applied patch to 8.4 head on 2002-04-17. Attached patch
    for posterity.

     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2002-04-18
    • status: open --> closed-fixed