Suggestion: Support for East Asian code pages

Anonymous
2014-08-21
2014-11-09
  • Anonymous

    Anonymous - 2014-08-21

    Currently PC-BASIC seems to support single-byte code pages such as 437, 850 and 866. However, it still lacks support for double-byte code pages such as 936 and 950 (Simplified and Traditional Chinese), 932 (Japanese) and 949 (Korean). As a result, characters in these languages cannot be displayed correctly in PC-BASIC at all. I would really like PC-BASIC to support these code pages so that it can properly run programs that display these languages as well. Thanks!

     
  • Rob Hagemans

    Rob Hagemans - 2014-08-21

    Hi, thanks for your interest! I agree that this is a desirable feature, and I do intend to implement it. I also want to manage your expectations as to a quick implementation - this is actually quite a difficult task as I see it, due to the large number of glyphs in these codepages. Moreover (and perhaps the real reason this appears so hard to me) unfortunately I don't read any of these languages so it's more difficult for me to check whether the implementation is correct.

    Do I infer correctly that you are a speaker of one of these languages? If so, perhaps you would be able to send me some programs that make use of these codepages and screenshots of how they are supposed to look? That would be extremely useful to me!

    As to sourcing the typefaces, do you know if e.g. FreeDOS or DOSBox support these codepages and pack rom fonts for them?

     

    Last edit: Rob Hagemans 2014-08-21
  • Anonymous

    Anonymous - 2014-08-21

    Thanks for your quick reply! Indeed, I am a native speaker of Chinese, and I can also read some Japanese. I do not think FreeDOS or DOSBox support these code pages natively, but both are able to display these languages correctly with the help of third-party DOS tools. To display Chinese (either Simplified or Traditional) correctly in DOS, you first need to run a TSR that loads the Chinese font (usually a third-party one, but there is also one from Microsoft). The Japanese version of MS-DOS, on the other hand, automatically loads the font driver for displaying Japanese from CONFIG.SYS. For demonstration I have uploaded a simple GW-BASIC program named ZHISHU.BAS (within the ZIP package) that checks whether a user-entered natural number is prime or composite. The display language of this program is Simplified Chinese. The same program with English as the display language has also been uploaded (named ZHISHUEN.BAS) for comparison. To run the Chinese version of the program, please first load the attached TSR named TW.EXE (TechWay SCS) for displaying Chinese in DOS, then launch GW-BASIC and run the file. A screenshot of how it should look is also attached. Thanks!

     
  • Anonymous

    Anonymous - 2014-08-21

    P.S. There are certainly many alternatives to the TSR above that can display Chinese in DOS. I have now attached another one called FreeCDOS (Free Chinese DOS), a free third-party Chinese font loader intended to provide Chinese support under FreeDOS (but it also works in MS-DOS, of course). It includes a Chinese font file named HZ16.FCZ, which may be useful for PC-BASIC. To launch FreeCDOS, simply run FCDOS.BAT under DOS.

     
  • Rob Hagemans

    Rob Hagemans - 2014-08-21

    That's great, thanks a lot!

     
  • Rob Hagemans

    Rob Hagemans - 2014-08-23

    Check it out in git!

     

    Last edit: Rob Hagemans 2014-08-23
  • Anonymous

    Anonymous - 2014-08-23

    Thanks a lot for the update. Since there is no binary available for the update yet, I tried to run the source under Python 2.7.8 with the following command lines:

    python pcbasic --codepage gb2312
    python pcbasic --codepage big5eten

    However, both returned the same error:

    File "C:\Program Files (x86)\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 10: invalid start byte

    I am not sure why this happens (it works fine with code pages such as 437, though). Also, could you please add support for the numeric code page names "936" and "950" (similar to "437", "850" etc.) in the --codepage option as well? They are shorter and easier to remember than "gb2312" and "big5eten". Thanks!

     
  • Rob Hagemans

    Rob Hagemans - 2014-08-23

    Hi, it's still early days for the DBCS implementation (the cursor doesn't work correctly either) so you're likely to run into issues; please consider these fonts to be very much in beta testing. That said, I don't run into your particular issue; it may be Windows-specific. One potential problem is that the default font family is freedos, and it doesn't have Chinese support.

    I've fixed some bugs and updated git, can you try the following?
    python pcbasic --codepage=gb2312 --font-family=fcdos
    python pcbasic --codepage=big5-eten --font-family=njstar

    As for numeric codepages, I've tried to use unambiguous names since there are many different implementations of the same codepage number (this situation is different from 437 and 850). As far as I can see, the codepages that are in there currently are not exactly the same as cp 936 and 950, so I'd like to avoid the confusion of calling them that. I've also not settled on names yet (as you can see I've changed big5eten to big5-eten in the last update for no good reason). I'm of a mind to implement the font system quite differently, and hopefully more simply, by the next release.

    If you run pcbasic -h it will tell you the names of the available code pages; if you prefer different names, just rename those files in the font and encoding directories and pcbasic will use your names.

     
  • Rob Hagemans

    Rob Hagemans - 2014-08-23

    I think I know what the issue is with that error - I have used symbolic links in the git source and Windows git converts those to text files with a path rather than to links or copies of the file. I'll think of some more elegant solution, but for now I think you can solve it as follows:
    cd encoding
    del big5-eten.utf8 gb2312.utf8
    copy 437.utf8 big5-eten.utf8
    copy 437.utf8 gb2312.utf8

    There are more symlinks in the font directory but I think this way you might at least get gb2312 working.
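
The manual copy steps above could also be scripted. The following is only a rough Python sketch, not part of PC-BASIC: it assumes that each broken link shows up as a small text file whose entire content is the name of another file in the same directory. The function name `restore_link_placeholders` and the 260-byte size limit are made up for illustration.

```python
import os
import shutil


def restore_link_placeholders(directory, max_size=260):
    """Replace small files that only contain the name of another file in
    `directory` with a copy of that file. Returns the names that were fixed.

    Assumption: a checked-out symlink on Windows is a short text file
    holding the link target. The size limit is an arbitrary guess.
    """
    fixed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.normpath(os.path.join(directory, name))
        if not os.path.isfile(path) or os.path.getsize(path) > max_size:
            continue
        with open(path, "rb") as f:
            raw = f.read()
        try:
            target_name = raw.decode("ascii").strip()
        except UnicodeDecodeError:
            continue  # binary content, so not a placeholder
        if not target_name or "\0" in target_name:
            continue
        target = os.path.normpath(os.path.join(directory, target_name))
        # Only act when the content names a different, existing file.
        if target != path and os.path.isfile(target):
            shutil.copyfile(target, path)
            fixed.append(name)
    return fixed
```

Calling `restore_link_placeholders("encoding")` from the source directory would then play the role of the `del` and `copy` commands, under the assumptions stated above.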

     
  • Anonymous

    Anonymous - 2014-08-23

    Thanks! With the above fix the error that I mentioned earlier indeed went away, but it seems that another error occurs as below:

    File "D:\test\pcbasic-code\backend_pygame.py", line 637, in build_glyph
    line = ord(face[yy*(glyph_width//8)+half])
    IndexError: string index out of range

    Also, I can understand your point concerning different implementations of cp 936 and 950. This also makes me think about the issue of GB2312 versus GBK. While most of the TSRs that provide Simplified Chinese display support under DOS only support GB2312, Windows 95 and later do support GBK, and so do at least two such TSRs under DOS. I have uploaded one such DOS TSR called CJKDOS, which contains a GBK font file named CJK-GB.F16. A GBK font file is certainly more useful than a GB2312 one when used with PC-BASIC, especially since there is no 640KB conventional memory limit as in DOS. Once these code pages are more or less exactly the same as the "real" cp 936 and 950, I still hope they could be called by simpler names, or at least have such names as aliases in a default installation of PC-BASIC.

     
  • Anonymous

    Anonymous - 2014-08-23

    Attachment CJKDOS.ZIP as mentioned above.

     
  • Rob Hagemans

    Rob Hagemans - 2014-08-23

    Thanks for the GBK file!

    It's getting quite complicated with all these different encodings and font sets. My intention is to work with full unicode fonts instead (like GNU unifont) so that all that's needed for a new encoding is a unicode mapping file. That will make it easier to maintain a set of slightly different code pages mapping large numbers of characters.

    I suspect your remaining error relates to the symlinks in the font directory, try this:
    cd font
    del fcdos_gb2312_08 fcdos_gb2312_14 njstar_big5-eten_08 njstar_big5-eten_14 njstar_big5-eten_16
    copy freedos_437_08 njstar_big5-eten_08
    copy freedos_437_14 njstar_big5-eten_14
    copy freedos_437_16 njstar_big5-eten_16
    copy freedos_437_08 fcdos_gb2312_08
    copy freedos_437_14 fcdos_gb2312_14

     
  • Anonymous

    Anonymous - 2014-08-23

    WOW! It really works this time. Amazing! Thanks a lot for your work! I also agree it is a good idea to use something like GNU unifont instead of all these different font sets. There is probably one more thing to take note of, though: ideally PC-BASIC should not only support CJK characters encoded in GB/BIG5/SJIS etc., but also the same characters encoded in Unicode, or at least in UTF-8, which has become increasingly common in recent years. So UTF-8 (or 65001) should probably be added as an available codepage as well.

     
  • Rob Hagemans

    Rob Hagemans - 2014-08-24

    I'm glad it works now, happy testing!

    As for UTF8, I don't see how it can be supported at all - BASIC has one-byte codes corresponding to half-width characters and two-byte codes corresponding to full-width characters. UTF8 breaks that correspondence and encodes many characters in three bytes. It's a great encoding, but the GW-BASIC framework wasn't made for it. My priority for now is implementing features that existed when GW-BASIC was used and that are needed by legacy programs, not extending the language with modern features.
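
The byte-count mismatch can be checked with a few lines of Python. This is only an illustrative sketch using Python's built-in codecs, not PC-BASIC code, and the sample character is an arbitrary choice.

```python
# Compare how many bytes one full-width character needs in each encoding.
text = "中"  # a single full-width character (two text columns on screen)

gb_bytes = text.encode("gb2312")    # legacy double-byte code page
utf8_bytes = text.encode("utf-8")   # variable-length Unicode encoding

print(len(gb_bytes), gb_bytes.hex())      # 2 d6d0
print(len(utf8_bytes), utf8_bytes.hex())  # 3 e4b8ad
```

In the double-byte code page the number of bytes equals the number of screen columns, which is the correspondence BASIC string functions rely on. In UTF-8 the same character takes three bytes for two columns.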

     
  • Anonymous

    Anonymous - 2014-08-24

    I see. Since I converted my GB/BIG5 encoded GW-BASIC programs to UTF-8 with wxMEdit and ran them in PC-BASIC without any "real" issues apart from character display problems, I assumed it would not be very hard to support UTF-8 encoded programs as well. But if there are other problems too and the GW-BASIC framework wasn't made for it anyway, then sure, features that existed when GW-BASIC was used and that are needed by legacy programs should come first. You are absolutely right about this. One example may be the very useful "ON ERROR GOTO" handling (which does not seem to work properly now)?

     
  • Rob Hagemans

    Rob Hagemans - 2014-08-24

    I see what you mean, reading programs in utf8 might be possible, in fact this was possible under Linux in an earlier release by piping in the program through pcbasic < PROGRAM.BAS, but I notice this doesn't work anymore. I'll see if I can fix that and make it work on Windows one of these days. Meanwhile I can only suggest converting your files to gb2312 or big5 before using them...

    What problems do you see with ON ERROR GOTO?

     
  • Anonymous

    Anonymous - 2014-08-24

    The ON ERROR GOTO statement runs, but does not seem to actually function, at least in my testing. Below is a sample GW-BASIC program:

    10 CLS:ON ERROR GOTO 40
    20 PRINT "Try an invalid screen mode:":SCREEN -1
    30 END
    40 PRINT "An error occurred with code";ERR
    50 RESUME 30

    The result under GW-BASIC will be:

    Try an invalid screen mode:
    An error occurred with code 5

    But under PC-BASIC the result will be:

    Try an invalid screen mode:
    Illegal function call in 20

    Clearly, in PC-BASIC the program never reaches line number 40, even though jumping there is the intended effect of the ON ERROR GOTO statement.

     
  • Rob Hagemans

    Rob Hagemans - 2014-08-24

    Ah, that's a regression (it worked in 14.04). Fixed in git.

     
  • Anonymous

    Anonymous - 2014-08-24

    During my testing I have found an "old" issue that now exists in PC-BASIC's DBCS codepage handling. The problem is that the byte range of the DBCS characters overlaps with that of the CP437 box-drawing characters (e.g. ┐ and ─), so when DBCS code pages are used PC-BASIC is not able to display box-drawing characters correctly. Most (if not all) TSRs that provide Chinese support under DOS (including all three TSRs that I uploaded earlier in this thread) are able to differentiate between box-drawing characters and real DBCS characters correctly. For demonstration I have uploaded a sample GW-BASIC program that I made in the past, named MENU.BAS, which displays a menu for the user to select among four options (the option texts are in Simplified Chinese and translate to just "Option 1" to "Option 4"). The border of the menu looks pretty bad when gb2312 (or big5-eten) is selected as the codepage, because PC-BASIC currently cannot handle box-drawing characters correctly in DBCS code pages. It looks fine in GW-BASIC when any of the TSRs that I uploaded earlier in this thread is loaded to provide Chinese support under DOS. I hope PC-BASIC will be able to handle box-drawing characters too when using these code pages, thanks!

     
  • Rob Hagemans

    Rob Hagemans - 2014-08-25

    Hm, that's interesting. I wonder how these TSRs distinguish between a byte sequence intended as box drawing and the same byte sequence intended as a DBCS character.
    What would people do if they wanted to write "谀哪"?
    print chr$(218)+chr$(196)+chr$(196)+chr$(196) returns box drawing.
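
The ambiguity can be reproduced outside BASIC. As a small Python sketch using the standard codecs (not PC-BASIC's own decoding), the same four bytes read as box drawing under code page 437 and as two hanzi under GB2312:

```python
# The four bytes produced by the PRINT statement above.
data = bytes([218, 196, 196, 196])

as_cp437 = data.decode("cp437")    # single-byte reading: box drawing
as_gb2312 = data.decode("gb2312")  # double-byte reading: two hanzi

print(as_cp437)   # ┌───
print(as_gb2312)  # 谀哪
```

Nothing in the byte stream itself says which reading is intended, so any display layer has to guess from context.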

     
  • Anonymous

    Anonymous - 2014-08-25

    I guess that is an interesting point too. Since I never wrote such a TSR myself, I do not know exactly how these TSRs will interpret such byte sequences. However, by observing the results of the display of such byte sequences under these TSRs, I have found the following:

    These TSRs do not necessarily all interpret each such byte sequence identically. Most (if not all) of them seem to distinguish between a byte sequence intended as box drawing and the same byte sequence intended as DBCS characters by looking at the length of the sequence. In other words, if they find a certain number (e.g. 6 or more) of consecutive bytes that fall in the range of box-drawing characters (and probably also that these box-drawing characters match well; for example, ─ followed by ┐ matches and produces ─┐, but not the other way around, since ┐─ is not a sensible box-drawing shape), they interpret the sequence as box-drawing characters instead of DBCS characters. Note that DBCS characters such as the "谀" you mentioned above are very rarely used, so interpreting such byte sequences of length 6 or more, for example, as box-drawing characters will generally cause no problems. But since it is never 100% safe to automatically assign such byte sequences to one category or the other, many TSRs that provide Chinese display support under DOS offer an option to turn auto-recognition of box-drawing characters on or off at run time, and by default this feature is turned on, of course. PC-BASIC could probably do this too in order to be 100% safe.
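
The run-length idea described above might look roughly like the following Python sketch. It is a guess at the general approach, not the algorithm of any particular TSR or of PC-BASIC. The names `BOX_BYTES`, `MIN_RUN` and `decode_mixed` are made up, the threshold of three bytes is an assumption, and the check that neighbouring box-drawing shapes actually "match" is left out.

```python
# Hypothetical sketch of run-length based box-drawing detection.

BOX_BYTES = set(range(0xB3, 0xDB))  # CP437 line-drawing bytes 0xB3..0xDA
MIN_RUN = 3                         # assumed threshold; real TSRs may differ


def is_dbcs_byte(b):
    """GB2312 lead and trail bytes both lie in 0xA1..0xFE."""
    return 0xA1 <= b <= 0xFE


def decode_mixed(data, box_detect=True):
    """Decode bytes as GB2312 text, but show long runs of box-drawing
    bytes as CP437 box drawing when box_detect is on."""
    out = []
    i = 0
    while i < len(data):
        # Measure the run of box-drawing bytes starting at position i.
        j = i
        while j < len(data) and data[j] in BOX_BYTES:
            j += 1
        if box_detect and j - i >= MIN_RUN:
            # Long enough run: treat every byte as a box-drawing glyph.
            out.append(data[i:j].decode("cp437"))
            i = j
        elif (is_dbcs_byte(data[i]) and i + 1 < len(data)
                and is_dbcs_byte(data[i + 1])):
            # Otherwise pair the bytes up as one double-byte character.
            out.append(data[i:i + 2].decode("gb2312", errors="replace"))
            i += 2
        else:
            # Plain single-byte character.
            out.append(data[i:i + 1].decode("cp437"))
            i += 1
    return "".join(out)


print(decode_mixed(bytes([218, 196, 196, 196])))                    # ┌───
print(decode_mixed(bytes([218, 196, 196, 196]), box_detect=False))  # 谀哪
```

The `box_detect` switch corresponds to the on/off option that many TSRs offer, so text that really contains the rare hanzi can still be shown correctly.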

     
  • Rob Hagemans

    Rob Hagemans - 2014-08-26

    I've uploaded a new version to git that uses unicode fonts. I've also added several new codepage mappings including GBK. According to GNU libiconv sources, that's the same as CP 936 as it was used in 1999, but Microsoft has since expanded 936 with more code points.

    You no longer have to specify the font, just use e.g. pcbasic --codepage=gbk and it should work.

    As for box-drawing recognition, I think it is as you say - TechWay seems to show three consecutive 'matching' box-drawing chars as box drawing, otherwise as DBCS char. I'll need to give a bit of thought to a good algorithm. I'd like to avoid adding key-combinations for switching things at runtime, as it would interfere with other programs that don't use the feature. A command-line parameter is feasible, however.

     

    Last edit: Rob Hagemans 2014-08-26
  • Rob Hagemans

    Rob Hagemans - 2014-08-29

    I've uploaded a version to git that has basic box drawing recognition. I'll also add a command-line option to turn it off.

     
  • Anonymous

    Anonymous - 2014-08-29

    Excellent, although I have found a display glitch when trying to run a GW-BASIC program in graphical mode. This is an 8-puzzle game that runs fine in GW-BASIC, but it currently does not run directly in PC-BASIC; instead it gives the following error:

    File "D:\test\pcbasic-code\backend_pygame.py", line 485, in update_pos
    attr = state.console_state.apage.row[state.console_state.row-1].buf[state.console_state.col-1][1] & 0xf
    IndexError: list index out of range

    I can nevertheless force it to run in PC-BASIC by issuing a "KEY OFF" command and then running it directly from line number 370, but there is an obvious display issue in all of the grids when a code page such as GBK is used (it does not occur with code pages such as 437). I have uploaded the program, named 8PUZZLE.BAS.

     
  • Rob Hagemans

    Rob Hagemans - 2014-08-30

    Wow, that program opened up a heap of problems with PC-BASIC. I've fixed the crashes (we couldn't handle LOCATE x,41:SCREEN 1, now we can).

    There are still display issues remaining that I'll look at - (1) the numbers on the puzzle pieces have a trailing blank in DBCS code pages and (2) it picks up some bizarre box-drawing sequences.

    You can work around (2) with the --nobox option.

     