PC-BASIC - a GW-BASIC emulator / Discussion / [CLOSED] General Discussion: Suggestion: Support for East Asian code pages

Rob Hagemans - 2014-08-31

The latest git version solves (1) and (2) and is able to run 8PUZZLE.BAS correctly.

Anonymous - 2014-08-31

Thanks, although certain DBCS characters still don't show up correctly in certain cases. For example, running the following line with the GBK codepage does not display correctly:

10 PRINT "原谅你中断举行了"

The three DBCS characters above (谅, 中, 行) are shown as box-drawing chars instead of DBCS chars unless the --nobox option is given.

Rob Hagemans - 2014-08-31

That sequence in GBK is going to be interpreted as box drawing no matter what, I'm afraid, since the box drawing characters all join up left and right. You'll have to turn off box drawing if you need that sentence.

Anonymous - 2014-08-31

I think the issue may be actually simpler. For example, even with two DBCS chars such as "中断" (which is a relatively common word in GBK by the way), PC-BASIC will display it as "╓╨断" instead of either "中断" or "╓╨╢╧" when box drawing recognition is enabled, but two consecutive chars (i.e. ╓╨) should never be displayed/interpreted as box-drawing chars. There may be three consecutive "matching" box-drawing chars in "中断" (i.e. "╓╨╢╧"), but the later two chars ("╢╧") are already interpreted as a DBCS character (i.e. "断"), so the former two chars ("╓╨") should be interpreted as a DBCS character (i.e. "中") as well.
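
To see where the box-drawing reading comes from, the bytes can be inspected outside PC-BASIC. The following is a minimal Python sketch (not part of PC-BASIC) that encodes the word as GBK and then reads the same bytes as code page 437, using only Python's built-in `gbk` and `cp437` codecs. The variable names are just for illustration.

```python
# Encode the word as GBK, then read the very same bytes as code page 437.
word = "中断"
gbk_bytes = word.encode("gbk")
as_cp437 = gbk_bytes.decode("cp437")

print(gbk_bytes.hex(" "))  # d6 d0 b6 cf
print(as_cp437)            # ╓╨╢╧
```

The four bytes are identical in both readings; only the interpretation (two double-byte characters or four single-byte glyphs) differs.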

Rob Hagemans - 2014-08-31

Here there are three consecutive connecting characters (╓╨╢) which is why the algorithm displays box-drawing. There seems to be an error in the algorithm so that it displays the second double-byte as DBCS.

According to the 'three connecting glyphs' algorithm, it should really be drawing the second character as box drawing as well, so ╓╨╢╧ or perhaps taking ╧ as the lead byte for the next DBCS character. As long as it's looking for three connecting glyphs, these characters will be displayed as box drawing as soon as the third glyph is printed; two chars only (PRINT "中") will print as DBCS.
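
The 'three connecting glyphs' idea can be sketched roughly as follows. This is a minimal Python illustration of the rule as described in this thread, not PC-BASIC's actual code: the connectivity tables, the function names and the `min_run` parameter are all assumptions made for the example, and it assumes that two glyphs only count as connecting when the line style (single or double) matches at the joint.

```python
# Horizontal connectivity of the CP437 box-drawing glyphs, by line style.
# RIGHT[s] holds glyphs with a line of style s leaving to the right,
# LEFT[s] holds glyphs with a line of style s entering from the left.
RIGHT = {
    "single": set("└┴┬├─┼╟╨╥╙╓╫┌"),
    "double": set("╞╚╔╩╦╠═╬╧╤╘╒╪"),
}
LEFT = {
    "single": set("┤╢╖╜┐┴┬─┼╨╥╫┘"),
    "double": set("╡╕╣╗╝╛╩╦═╬╧╤╪"),
}


def connects(a, b):
    """True if glyph a joins glyph b on its right with the same line style."""
    return any(a in RIGHT[s] and b in LEFT[s] for s in ("single", "double"))


def longest_connected_run(glyphs):
    """Length of the longest run of consecutive, horizontally connecting glyphs."""
    if not glyphs:
        return 0
    best = run = 1
    for a, b in zip(glyphs, glyphs[1:]):
        run = run + 1 if connects(a, b) else 1
        best = max(best, run)
    return best


def shows_as_box_drawing(text, codepage="gbk", min_run=3):
    """Apply the 'min_run connecting glyphs' rule to text stored in a DBCS codepage."""
    glyphs = text.encode(codepage).decode("cp437")
    return longest_connected_run(glyphs) >= min_run


print(shows_as_box_drawing("中断"))  # True: ╓╨╢ is a run of three
print(shows_as_box_drawing("中"))    # False: ╓╨ is a run of only two
```

Under these assumptions ╓ joins ╨ and ╨ joins ╢, but ╢ has no line leaving to the right, so the run stops at three glyphs.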

Clearly the TSRs like TW do something different than my interpretation, but I don't know what it is they do. Maybe they just look for single-line characters or do something more complicated.

Anonymous - 2014-08-31

Before we find out how TSRs like TW will handle them exactly, it will probably help to change the syntax of the command line option "--nobox" to something like "--as_box_char NUMBER", where "NUMBER" is the minimal number of consecutive connecting glyphs that will be recognized as box-drawing chars instead of DBCS chars, with 3 or 5 for example as default. On the other hand, setting NUMBER to 0 will disable box-drawing recognition. This will clearly be more flexible than the simple ON and OFF options.

Rob Hagemans - 2014-08-31

That's not just a matter of adding a command line option, it would actually need to be implemented in the code and that's quite a lot of work. Three chars is just about doable in terms of code complexity and running time, but unless I get a really good idea or someone shows me a good algorithm that will give satisfactory results, I'm not going to implement that, sorry.

Anonymous - 2014-08-31

OK, thanks a lot for the explanation. Now I see that implementing this would be a lot harder than I previously thought. In that case it is no doubt a good idea to find another way instead. I will try to see if I can discover something new about how TSRs like TW handle box-drawing chars in such cases, but otherwise DBCS support in PC-BASIC is already pretty good. There is another important thing regarding DBCS support in PC-BASIC though (besides the possibility of supporting UTF-8 encoded programs): PC-BASIC should be able to accept input from the system IME, so that we can actually enter DBCS characters into PC-BASIC. Applications such as Notepad and wxMEdit can do this, but PC-BASIC cannot yet. I am not sure whether this will be easy or difficult though.

Rob Hagemans - 2014-09-01

Hi, thanks! Let me know if you find out what TW does. It may be as simple as only checking the single-to-single and double-to-double line box drawing (the ones causing trouble in your example are all single-to-double).

Supporting input methods would be useful, but I don't know a single thing about system IME. If the input system sends utf-8 sequences to the keyboard buffer it should already work, I think - similar to the sterling sign £ on my UK keyboard which is sent as UTF-8 and interpreted by PC-BASIC as the corresponding codepage-437 point. If, on the other hand, it doesn't use the keyboard buffer it will be very difficult; PC-BASIC depends on PyGame for keyboard handling and if PyGame doesn't support it, well... But I may be missing something. What happens if you try to enter a DBCS character?
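
The sterling sign case can be illustrated with a small Python sketch. It only shows the byte-level conversion with Python's built-in codecs and is not PC-BASIC's keyboard handling; the variable names are assumptions for the example.

```python
# Bytes as a UTF-8 input layer might deliver them for the sterling sign.
utf8_input = b"\xc2\xa3"

# Decode to a Unicode character, then look up its code page 437 byte.
char = utf8_input.decode("utf-8")
cp437_byte = char.encode("cp437")

print(char)                # £
print(hex(cp437_byte[0]))  # 0x9c
```

The same two steps would apply to a DBCS codepage, except that the final lookup may yield two bytes instead of one.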

Anonymous - 2014-09-01

Currently nothing will happen when trying to enter any DBCS character in PC-BASIC. After searching info about PyGame and IME in Google I found that the official build of PyGame 1.9.1 indeed does not support input from System IME. However, there are unofficial patches available for making it compatible with IME input. Below are some sites for IME support patch in PyGame (both sites are in Chinese):

http://www.shiftsky.com/article/36/
http://tib.tw/tBoard/index.py?m=pl&t=698

Rob Hagemans - 2014-09-01

Ok, thanks! I had come to the same conclusion but couldn't find that patch on English-language websites.

There also seems to be the suggestion that SDL 2 does support IME and I think the current source version of PyGame is based on SDL 2, in which case the patches wouldn't be necessary and running the pc-basic source against a freshly compiled PyGame should give some results.

That said, I am currently using pygame compiled from current sources on Ubuntu and I can't enter any Chinese characters through ibus. But it might just work on Windows, where I have only tried the release version of PyGame.

So, the things we need to try on Windows are:
- run PC-BASIC source with PyGame compiled from current sources (1.9.2a)
- run PC-BASIC source with PyGame compiled from a patched release 1.9.1
and see if anything happens when entering a Chinese character. I wouldn't expect it to work flawlessly but any sort of input coming through, even some random ascii letters, would give a hook as to how to proceed. If nothing comes through, I have no idea how to make it work. I'll try this if/when I have the time (compiling stuff on Windows is a bit of a pain) but if you have time & feel up to it, have a go and let me know the results...

Anonymous - 2014-09-01

After actually trying that pygame 1.9.1 patch package, I find that the file PyGame-IME.py included in the package is a Python test program that accepts input from the system IME (and it does work on my Win8.1 x64 system). The package also includes two .dll files named SDL.dll and imm32.dll, but the test program works even with just the SDL.dll file included with PC-BASIC 14.08 and the imm32.dll shipped with Windows (at least on my system). So I think what is really needed to support IME input is to modify some file in the PC-BASIC source code so that it accepts input from the system IME in the same way as that test program does, and that will probably be all.

Rob Hagemans - 2014-09-01

[EDIT - this was posted before the post above, messages crossed but are basically saying the same thing]

I had a quick look at those links (using google translate) - the top one seems to give a way to link into Windows IME; unfortunately that's Windows-only but that still covers most people using it. The patch linking into SDL's IME functionality seems to have disappeared, all the links to it are dead. Also, it appears current PyGame still links to SDL 1.2, so recompiling may not really have any effect at all.

I can try to see if I can work along the lines of the first link, it looks doable and would make it work for Windows, though I'll have to find out how to set up Windows IME and generally spend more time on Windows than I usually care to ;)

Once PyGame moves to SDL 2 this can then be revisited for a proper implementation. On Linux you can also enter UTF-8 characters through the command-line and text-mode interfaces (which don't work on Windows) so that would be a way of working around it.

The UTF-8 input is on the to-do list too; it would be an option to load a text file as UTF-8 on the command line. The interface and things like PRINT and INPUT to screen and text files would still use the specified codepage (because it's limited to two-byte sequences, as discussed before).

I'm also, at some point, looking into implementing cut & paste for the pygame interface, perhaps using the Windows key to avoid taking up key combinations that BASIC programs might use. That would then allow you to paste in UTF-8 characters copied from your text editor, let's say with a key combination like Win+V.

Last edit: Rob Hagemans 2014-09-01

Rob Hagemans - 2014-09-01

Latest git can read & write UTF-8 files with the --utf8 option. You will still need to provide a code page that includes the unicode code points you want to input, so e.g. pcbasic PROGRAM.BAS --codepage=gbk --utf8 will work if PROGRAM.BAS has Chinese characters stored in UTF-8. In this mode, SAVE "PROGRAM",A will save to a UTF-8 text file.
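
As a rough illustration of what reading and writing UTF-8 program files against a codepage involves, here is a minimal Python sketch. It uses Python's built-in `gbk` codec as a stand-in for PC-BASIC's own codepage tables, which may differ in detail; the function names `utf8_to_codepage` and `codepage_to_utf8` are hypothetical and not part of PC-BASIC.

```python
def utf8_to_codepage(data, codepage="gbk"):
    """Convert UTF-8 program text to codepage bytes (as on loading).

    Raises UnicodeEncodeError if a character is not in the codepage,
    which is why a codepage covering the file's characters is needed.
    """
    return data.decode("utf-8").encode(codepage)


def codepage_to_utf8(data, codepage="gbk"):
    """Convert codepage bytes back to UTF-8 text (as on saving)."""
    return data.decode(codepage).encode("utf-8")


program = '10 PRINT "中断"\r\n'.encode("utf-8")
in_memory = utf8_to_codepage(program)
print(in_memory)                               # two bytes per Chinese character
print(codepage_to_utf8(in_memory) == program)  # True: the round trip is lossless
```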

Anonymous - 2014-09-02

Great! This feature will be very useful! One thing I still wonder, though: is it possible to provide a "fallback" codepage in case the primary codepage does not contain certain Unicode code points in the file?

Anonymous - 2014-09-02

Now I realize that support for a "fallback" codepage in UTF-8 mode may be very difficult to implement. If this is the case, you can simply discard it and focus on more important issues instead. Thanks for your great work!

Rob Hagemans - 2014-09-02

Yeah, a "fallback" codepage isn't going to work, it would be hugely confusing if they defined different characters on the same 2-byte sequence. Just use an extended codepage if available or create your own by merging the bits you want from existing codepages.

Code pages are supplied with PC-BASIC as text files in the encoding/ directory, with each codepoint represented on one line. For example, in gbk.ucp, the line 8140:4E02 specifies that in GBK, Unicode 4E02 (丂) is mapped to the byte sequence chr$(&h81)+chr$(&h40). You can just cut & paste such lines to your own .ucp text file and put it in encoding/.
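
For anyone who prefers to script the merge instead of cutting and pasting, here is a minimal Python sketch. It assumes only the `HEX:HEX` line format shown in the example above; skipping blank lines and lines starting with `#` is an assumption, and `parse_ucp` and `merge_codepages` are hypothetical helper names, not PC-BASIC functions.

```python
def parse_ucp(lines):
    """Map codepage byte sequences to Unicode characters from 'HEX:HEX' lines."""
    table = {}
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and '#' comment lines carry no mapping.
        if not line or line.startswith("#"):
            continue
        cp_hex, uni_hex = line.split(":", 1)
        table[bytes.fromhex(cp_hex)] = chr(int(uni_hex, 16))
    return table


def merge_codepages(base, extra):
    """Add entries from extra only for byte sequences that base leaves undefined."""
    merged = dict(base)
    for sequence, char in extra.items():
        merged.setdefault(sequence, char)
    return merged


gbk_part = parse_ucp(["8140:4E02"])
print(gbk_part)  # {b'\x81@': '丂'}
```

Letting the base codepage win on conflicts avoids the situation where two codepages define different characters on the same byte sequence.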

Rob Hagemans - 2014-09-02

I've tried a new box-drawing algorithm that only looks at horizontal single or double lines. It's not the same as TWAY but it does work the right way for all of your example files, so it may be an improvement. It's also a lot faster which is a clear advantage as PC-BASIC gets a bit sluggish with DBCS codepages.

Anonymous - 2014-09-03

It is good that most programs will display characters correctly in DBCS codepages now. But I also have a GW-BASIC Go game which used to work fine in earlier builds of PC-BASIC, but now no longer displays box-drawing characters correctly when DBCS codepages such as GBK and Big5-eten are used (the screen is a complete mess then). I have uploaded the program, named GO.BAS.

GO.BAS

Rob Hagemans - 2014-09-03

Hi, thanks for that, I'll have a look. I appreciate your help in testing.

Do you have a large collection of BASIC programs around? If so, could I ask you to send them all over as a zip file so I can have a look through when I make changes? The files you have sent so far are literally the only real-life testing I have for Chinese language support so it may make it all a bit easier. I don't know if there's a file size limit for attachments (though BAS files are tiny when zipped); if so I could pick it up from Dropbox or something similar. Thanks!

Rob Hagemans - 2014-09-03

I just checked GO.BAS in TWAY (it's indeed a mess at present in PC-BASIC) and I realised something I hadn't noticed before. TWAY does not display DBCS characters in WIDTH 40 modes at all. It seems GO.BAS makes use of that, because if you change the WIDTH statement in line 10 it displays as a mess in TWAY as well. The same mess, indeed. So, the easy fix here is to disable DBCS when the screen width is less than 80. It also makes sense that it would be turned off, as the characters become really stretched.
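
The width rule can be sketched as follows. This is a minimal Python illustration, not PC-BASIC's implementation: `split_cells` is a hypothetical helper, and the lead and trail byte ranges are those of GBK (lead 0x81 to 0xFE, trail 0x40 to 0xFE except 0x7F); other DBCS codepages use different ranges.

```python
def split_cells(row, width):
    """Split a screen row into glyph cells of one or two bytes each.

    Byte pairs are only combined into a double-byte glyph when the
    screen is at least 80 columns wide; otherwise every byte stands alone.
    """
    dbcs_enabled = width >= 80
    cells = []
    i = 0
    while i < len(row):
        is_pair = (
            dbcs_enabled
            and i + 1 < len(row)
            and 0x81 <= row[i] <= 0xFE                          # GBK lead byte
            and 0x40 <= row[i + 1] <= 0xFE and row[i + 1] != 0x7F  # GBK trail byte
        )
        step = 2 if is_pair else 1
        cells.append(row[i:i + step])
        i += step
    return cells


row = "中A".encode("gbk")
print(split_cells(row, 80))  # the two bytes of 中 stay together
print(split_cells(row, 40))  # every byte becomes its own single-byte glyph
```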

Anonymous - 2014-09-04

Thanks! Disabling DBCS by default when the screen width is less than 80 is probably a good idea, as most TSRs that support Chinese seem to do the same thing. But I think it may also help to add an option to enable DBCS in such modes; this option would be turned off by default of course. I don't really have a large collection of GW-BASIC programs, but I do have a GW-BASIC program which consists of many smaller programs, some more or less real games and some just for fun, with a menu to select which one to run. This program is currently somewhat badly organized, however, so I think I need to make some changes before uploading it. Thanks.

Anonymous - 2014-09-04

In the meantime I have uploaded another GW-BASIC program that I wrote previously, a User Management System, which has a bilingual user interface (English & Simplified Chinese); the user can switch language by pressing the Tab key in the menu. See the attachment named USRMAN25.BAS.

USRMAN25.BAS

Rob Hagemans - 2014-09-04

Hi, thanks for the upload! I don't mind badly organised by the way, that's fairly standard for GW-BASIC ;). I'm just hoping to have some more testing programs. Plus it's nice to toy around with old games of course.

Rob Hagemans - 2014-09-05

The latest git version supports cut & paste - select with WinKey+left and right arrows, copy with Win+C and paste with Win+V (the usual shortcuts won't work as they would conflict with BASIC shortcuts, in particular ctrl+c). You can use this to copy & paste in DBCS characters from a text editor. Let me know if it works on Windows; the machine I use for Windows is down with hard drive issues so I can't test there at the moment.
