Re: [Seed7-users] Using UTF-8 characters in identifiers
From: Thomas M. <tho...@gm...> - 2015-01-06 21:06:41
On 2014-Dec-29, 23:24, "Кулешов Аркадий" <ark...@ya...> wrote:
> Dear Seed7 Users,
>
> Attached is an experimental patch that I used to enable UTF-8 multibyte characters in program identifiers.

Thank you very much for your patch. I would like to add your improvements to the Seed7 release, but there are several steps to reach that goal.

> I think it may be useful for educational purposes/schools as an example.

Yes, of course. When I taught my children programming they used German identifiers for variables and functions. Obviously people who use other alphabets do not have this possibility. I consider it important that, for educational purposes, everybody can use their mother tongue. On the other hand I have found program code on the internet with (for me) unreadable variable and function identifiers. When I decided to restrict Seed7 to ASCII identifiers I wanted to force people to write code that is readable by professional software developers all over the world.

The Seed7 interpreter could work in two identifier modes:
- Identifiers with ASCII letters (default mode).
- Identifiers with Unicode letters.

There could be an interpreter option or a declaration statement in the program that selects the mode. Operation in ASCII identifier mode should not be slowed down by the existence of a Unicode letter mode.

> Please let me know if you also see other potential uses of this feature.
>
> Most changes are in the scanner.c file. The identifier symbols continue to be stored as multibyte C strings.
> I also added portable wcwidth.c and c_ident.c files from "libutf8" library with reference to the original author.

Under which license did the original author release the "libutf8" library? The Seed7 runtime library uses the LGPL. To use code from "libutf8" it must be permissible to relicense it under the LGPL.

> All makefiles were modified to include these new files. Btw, is it the correct way?

Yes, but it might not be necessary to introduce new files.
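
The two identifier modes proposed above could be arranged so that the ASCII path does not pay for the Unicode one. The following is only a minimal sketch under stated assumptions: the names identModeType, isAsciiNameChar, isUtf8MultibyteByte and scanName are hypothetical and are not taken from scanner.c, and the Unicode branch accepts every non-ASCII byte instead of decoding UTF-8 and consulting a real letter table. The sketch also does not treat the first character of an identifier specially.

```c
#include <stddef.h>

/* Hypothetical names for illustration; none of them come from scanner.c. */
typedef enum { IDENT_ASCII, IDENT_UNICODE } identModeType;

/* ASCII letters, digits and underscore. */
static int isAsciiNameChar (unsigned char ch)
{
  return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') ||
         (ch >= '0' && ch <= '9') || ch == '_';
}

/* Placeholder (assumption): accepts any byte of a UTF-8 multibyte
   sequence. A real scanner would decode the sequence and check
   whether the resulting code point is a letter. */
static int isUtf8MultibyteByte (unsigned char ch)
{
  return ch >= 0x80;
}

/* Return the length in bytes of the name at the start of src.
   The mode is tested once, outside the loops, so the ASCII loop
   is the same loop it would be without a Unicode mode. */
static size_t scanName (const char *src, identModeType mode)
{
  size_t pos = 0;

  if (mode == IDENT_ASCII) {
    while (isAsciiNameChar((unsigned char) src[pos])) {
      pos++;
    }
  } else {
    while (isAsciiNameChar((unsigned char) src[pos]) ||
           isUtf8MultibyteByte((unsigned char) src[pos])) {
      pos++;
    }
  }
  return pos;
}
```

For the UTF-8 encoded input `größe := 1`, scanName would stop after `gr` (2 bytes) in ASCII mode and after `größe` (7 bytes) in Unicode mode.
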
Concerning the functions in wcwidth.c and c_ident.c I have some questions. Does the function is_c_identifier_part() return TRUE for a letter and FALSE otherwise? In that case it would probably make sense to make is_c_identifier_part() available to Seed7 programs. The same applies to the function wcwidth(). A Seed7 program might be interested to know whether a character is non-spacing or double-width.

I saw that both is_c_identifier_part() and wcwidth() work only for UTF-16. Seed7 uses UTF-32, so these functions will probably fail for characters beyond U+ffff. If is_c_identifier_part() and wcwidth() are also useful for Seed7 programs, they should be added to the interpreter. I suggest adding them to chr_rtl.c. For is_c_identifier_part() I suggest the C function name chrIsLetter and the action "CHR_IS_LETTER". For wcwidth() I suggest the C function name chrWidth and the action "CHR_WIDTH". The functions need to work for UTF-32 and should have charType parameters. You can take a look at the functions toLower() and toUpper() in str_rtl.c. These functions are used by functions in str_rtl.c and chr_rtl.c. I consider the addition of these functions the first step.

There are other open questions with Unicode identifiers. Does is_c_identifier_part() allow variable names with Chinese characters, hieroglyphs, cuneiform or other scripts?

I had only a brief look at your changes in scanner.c and literal.c, but they seem to go in the right direction.

> For me this was easiest portable solution to use wcwidth and to quickly check if a unicode character can be part of identifier.
>
> The patch also includes a feature to automatically use utf8 files STD_UTF8_IN and STD_UTF8_OUT for IN and OUT variables.

I consider this a different thing that should be in a separate patch. I was considering using STD_CONSOLE and maybe KEYBOARD for that purpose.

> It introduces new primitive action "UT8_MODE_ON" for that and a new function in ut8lib.c to check if current locale uses UTF-8.
> http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate was used as one of the guides.
> I checked the changes and ran tests under Ubuntu 14 and Windows 8.
> Windows locale detection and console code page selection are not yet implemented.
>
> However one change in utf8.s7i causes the compiler s7c to fail.
> The change is in lines 242 - 259 and is commented out.
>
> ../lib/utf8.s7i: In function ‘o_3135_SEL_STD_FILE_FROM’:
> ../lib/utf8.s7i:248:45: error: expected expression before ‘;’ token
> isUTF8 := utf8_mode_on;
>
> Please advise what could be the reason.

You did not add code for the action UT8_MODE_ON to the compiler. The file seed7/lib/comp/ut8_act.s7i handles UT8 actions. There is also the file seed7/lib/comp/action.s7i, which contains code that calls functions in ut8_act.s7i and in other files. If you start s7c with the option -g-debug_c, the intermediate *.c file is not removed and the C compiler's error messages refer to line numbers in the intermediate *.c file.

Regards,
Thomas Mertes

--
Seed7 Homepage: http://seed7.sourceforge.net
Seed7 - The extensible programming language: User defined statements and operators, abstract data types, templates without special syntax, OO with interfaces and multiple dispatch, statically typed, interpreted or compiled, portable, runs under linux/unix/windows.
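
As an illustration of the suggestion above that a width function should take a UTF-32 parameter, here is a minimal sketch of a range-table lookup. It is not the code from wcwidth.c and not an existing Seed7 function. Assumptions: charType is modelled as a 32-bit unsigned integer, the name chrWidthSketch is hypothetical, the two tables are tiny excerpts rather than complete Unicode data, and control characters are not handled.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t charType;  /* assumption: a 32-bit code point type */

typedef struct {
  charType first;
  charType last;
} rangeType;

/* Tiny illustrative excerpts, sorted by code point.
   These are NOT complete Unicode tables. */
static const rangeType nonSpacing[] = {
  { 0x0300, 0x036F }     /* Combining Diacritical Marks */
};

static const rangeType doubleWidth[] = {
  { 0x4E00, 0x9FFF },    /* CJK Unified Ideographs */
  { 0xAC00, 0xD7A3 },    /* Hangul Syllables */
  { 0x20000, 0x2FFFD }   /* Plane 2, beyond U+FFFF */
};

/* Binary search: return 1 if ch lies in one of the sorted ranges. */
static int inRanges (charType ch, const rangeType *table, size_t count)
{
  size_t low = 0;
  size_t high = count;

  while (low < high) {
    size_t mid = low + (high - low) / 2;
    if (ch < table[mid].first) {
      high = mid;
    } else if (ch > table[mid].last) {
      low = mid + 1;
    } else {
      return 1;
    }
  }
  return 0;
}

/* Hypothetical sketch: 0 for non-spacing, 2 for double-width,
   1 for everything else covered by neither excerpt table. */
int chrWidthSketch (charType ch)
{
  if (inRanges(ch, nonSpacing, sizeof nonSpacing / sizeof nonSpacing[0])) {
    return 0;
  }
  if (inRanges(ch, doubleWidth, sizeof doubleWidth / sizeof doubleWidth[0])) {
    return 2;
  }
  return 1;
}
```

Because the parameter is a full 32-bit code point, a character such as U+20000 is looked up directly and no surrogate pair handling is needed, which is the point where a UTF-16 based function would fail.
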