Re: [Seed7-users] Using UTF-8 characters in identifiers

On 2014-Dec-29, 23:24, "Кулешов Аркадий" <ark...@ya...> wrote:
> Dear Seed7 Users,
> 
> Attached is an experimental patch that I used to enable UTF-8 multibyte characters in program identifiers.

Thank you very much for your patch.
I would like to add your improvements to the Seed7 release,
but there are several steps to reach that goal.

> I think it may be useful for educational purposes/schools as an example.

Yes, of course. When I taught my children programming they used
German identifiers for variables and functions. Obviously people who
use other alphabets do not have this possibility. I consider it
important that, for educational purposes, everybody can use their
mother tongue.

On the other hand I have found program code on the Internet with
(for me) unreadable variable and function identifiers. When I decided
to restrict Seed7 to ASCII identifiers I wanted to force people to
write code that is readable by professional software developers
all over the world.

The Seed7 interpreter could work in two identifier modes:
- Identifiers with ASCII letters (default mode).
- Identifiers with Unicode letters.
There could be an interpreter option or a declaration statement in the
program that selects the mode. The operation in ASCII identifier
mode should not be slowed down by the fact that a Unicode letter
mode exists.
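
The two modes can be sketched in C as follows. This is only a minimal
illustration, not the code in scanner.c: the names identLength,
isAsciiLetter, isAsciiDigit and isUnicodeLetter are hypothetical, the
rule that an identifier may start with a letter or underscore is an
assumption of the sketch, and isUnicodeLetter is a crude placeholder
for a real table lookup. The ASCII test comes first, so in ASCII mode
the only extra cost is one flag check at the character that ends the
identifier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t charType;  /* assumption: stands in for a UTF-32 character type */

static bool isAsciiLetter(charType ch) {
    return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || ch == '_';
}

static bool isAsciiDigit(charType ch) {
    return ch >= '0' && ch <= '9';
}

/* Placeholder (hypothetical): accepts every non-ASCII code point that is
   not a surrogate. A real check would consult Unicode letter tables. */
static bool isUnicodeLetter(charType ch) {
    return ch >= 0x80 && ch <= 0x10FFFF && !(ch >= 0xD800 && ch <= 0xDFFF);
}

/* Returns the number of characters at the start of src that form an
   identifier. Digits are accepted everywhere except at position 0. */
size_t identLength(const charType *src, size_t len, bool unicodeMode) {
    size_t pos = 0;

    assert(src != NULL);
    while (pos < len) {
        charType ch = src[pos];
        if (isAsciiLetter(ch) || (pos > 0 && isAsciiDigit(ch))) {
            pos++;  /* fast path, identical in both modes */
        } else if (unicodeMode && isUnicodeLetter(ch)) {
            pos++;  /* only reached for non-ASCII characters in Unicode mode */
        } else {
            break;
        }
    }
    return pos;
}
```

With unicodeMode set to false a Cyrillic name yields length 0 and is
rejected as an identifier, while the same input is accepted when the
flag is true.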

> Please let me know if you also see other potential uses of this feature.
> 
> Most changes are in the scanner.c file. The identifier symbols continue to be stored as multibyte C strings.
> I also added portable wcwidth.c and c_ident.c files from "libutf8" library with reference to the original author.

Under which license did the original author release the "libutf8"
library? The Seed7 runtime library uses the LGPL. To use code from
"libutf8" it must be permitted to relicense it under the LGPL.

> All makefiles were modified to include these new files. Btw, is it the correct way?

Yes, but it might not be necessary to introduce new files.
Concerning the functions in wcwidth.c and c_ident.c I have some
questions. Does the function is_c_identifier_part() return TRUE for a
letter and FALSE otherwise? In this case it would probably make sense
to make is_c_identifier_part available for Seed7 programs. The same
applies to the function wcwidth(). A Seed7 program might be
interested to know if a character is non-spacing or double-width.

I saw that both is_c_identifier_part() and wcwidth() work only
for UTF-16. Seed7 uses UTF-32, so these functions will probably
fail for characters beyond U+ffff. If is_c_identifier_part() and
wcwidth() are also useful for Seed7 programs they should be added
to the interpreter. I suggest adding them to chr_rtl.c. For
is_c_identifier_part() I suggest the C function name chrIsLetter and
the action "CHR_IS_LETTER". For wcwidth() I suggest the C function
name chrWidth and the action "CHR_WIDTH". The functions need to
work for UTF-32 and should have charType parameters. You can take a
look at the functions toLower() and toUpper() in str_rtl.c. These
functions are used by functions in str_rtl.c and chr_rtl.c.
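
A rough sketch of what UTF-32 versions of the two functions could look
like is given below. It is not existing Seed7 code: only the names
chrIsLetter, chrWidth and charType come from the suggestion above,
while rangeType, inRanges and COUNT are hypothetical helpers. The
range tables are short illustrative excerpts and far from complete,
and control characters (for which wcwidth() returns -1) are not
handled. The point is that the lookup takes a full 32-bit code point,
so ranges beyond U+ffff work the same way as those below it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t charType;  /* assumption: stands in for Seed7's UTF-32 charType */

typedef struct {
    charType first;
    charType last;
} rangeType;

/* Illustrative excerpts only, sorted by code point. Real tables would be
   generated from the Unicode character database. */
static const rangeType letterRanges[] = {
    {0x0041, 0x005A},    /* Latin capital letters */
    {0x0061, 0x007A},    /* Latin small letters */
    {0x03B1, 0x03C9},    /* Greek small letters */
    {0x0410, 0x044F},    /* Cyrillic letters */
    {0x4E00, 0x9FA5},    /* CJK unified ideographs */
    {0x12000, 0x1236E},  /* Cuneiform */
    {0x13000, 0x1342E},  /* Egyptian hieroglyphs */
    {0x20000, 0x2A6D6}   /* CJK unified ideographs extension B */
};

static const rangeType zeroWidthRanges[] = {
    {0x0300, 0x036F}     /* combining diacritical marks */
};

static const rangeType wideRanges[] = {
    {0x1100, 0x115F},    /* Hangul Jamo initial consonants */
    {0x4E00, 0x9FFF},    /* CJK unified ideographs */
    {0xAC00, 0xD7A3},    /* Hangul syllables */
    {0x20000, 0x2FFFD}   /* supplementary ideographic plane */
};

#define COUNT(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Binary search in a sorted table of non-overlapping ranges. */
static bool inRanges(const rangeType *ranges, size_t count, charType ch) {
    size_t low = 0;
    size_t high = count;

    assert(ranges != NULL);
    while (low < high) {
        size_t mid = low + (high - low) / 2;
        if (ch < ranges[mid].first) {
            high = mid;
        } else if (ch > ranges[mid].last) {
            low = mid + 1;
        } else {
            return true;
        }
    }
    return false;
}

bool chrIsLetter(charType ch) {
    return inRanges(letterRanges, COUNT(letterRanges), ch);
}

/* Returns 0 for non-spacing, 2 for double-width and 1 for all other
   characters (simplified: control characters are not treated specially). */
int chrWidth(charType ch) {
    if (inRanges(zeroWidthRanges, COUNT(zeroWidthRanges), ch)) {
        return 0;
    }
    if (inRanges(wideRanges, COUNT(wideRanges), ch)) {
        return 2;
    }
    return 1;
}
```

Sorted range tables with binary search are only one possible design.
They keep the data compact for sparse properties, whereas a two-stage
lookup table would trade memory for speed.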

I consider the addition of these functions as a first step.
There are other open questions with Unicode identifiers.
Does is_c_identifier_part allow variable names with Chinese
characters, hieroglyphs, cuneiform or other scripts?

I had only a brief look at your changes in scanner.c and literal.c,
but they seem to go in the right direction.

> For me this was easiest portable solution to use wcwidth and to quickly check if a unicode character can be part of identifier.
> 
> The patch also includes a feature to automatically use utf8 files STD_UTF8_IN and STD_UTF8_OUT for IN and OUT variables.

I consider this a separate matter that should be in a different
patch. I was considering using STD_CONSOLE and maybe KEYBOARD
for that purpose.

> It introduces new primitive action "UT8_MODE_ON" for that and a new function in ut8lib.c to check if current locale uses UTF-8.
> http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate was used as one of the guides.
> I checked the changes and ran tests under Ubuntu 14 and Windows 8.
> Windows locale detection and console code page selection are not yet implemented.
> 
> However one change in utf8.s7i causes the compiler s7c to fail.
> The change is in lines 242 - 259 and is commented out.
> 
> ../lib/utf8.s7i: In function ‘o_3135_SEL_STD_FILE_FROM’:
> ../lib/utf8.s7i:248:45: error: expected expression before ‘;’ token
>      isUTF8 := utf8_mode_on;
> 
> Please advise what could be the reason.

You did not add code for the action UT8_MODE_ON to the compiler. The
file seed7/lib/comp/ut8_act.s7i handles UT8 actions. There is also
the file seed7/lib/comp/action.s7i which contains code that calls
functions in ut8_act.s7i and in other files. If you start s7c with
the option -g-debug_c the intermediate *.c file is not removed and
the C compiler reports its error messages with line numbers from the
intermediate *.c file.

Regards,
Thomas Mertes

-- 
Seed7 Homepage:  http://seed7.sourceforge.net
Seed7 - The extensible programming language: User defined statements
and operators, abstract data types, templates without special
syntax, OO with interfaces and multiple dispatch, statically typed,
interpreted or compiled, portable, runs under linux/unix/windows.
