Multibyte support in the compiler means counting "by character" rather than "by byte" (which is what we currently do) in several places; national/utf8 literals may additionally need an internal conversion of the byte buffer.
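To illustrate the difference between the two ways of counting, here is a minimal C sketch. It is not taken from the GnuCOBOL sources; the function name utf8_char_count is made up for this example, and the input is assumed to be valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper, not part of GnuCOBOL.
   Counts code points in a NUL-terminated UTF-8 string by skipping
   continuation bytes (bit pattern 10xxxxxx).
   Assumes valid UTF-8; no validation is performed. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        /* Every byte that is NOT a continuation byte starts a character. */
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

A byte count (what strlen() returns) and this character count agree only for pure ASCII text; any limit that is checked in bytes is reached earlier than a reader counting characters would expect.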
There is a GSOC proposal on handling this (see the contributions discussion board). We will see if this student work can be sponsored by Google.
There is also a GSOC project idea for handling this in the runtime, which is much more complicated, especially for NATIONAL items: these include NUL bytes and therefore rule out any standard C string function. Note that this is a large project (~350h) and no student has written a proposal for it yet.
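The NUL-byte problem can be shown with a short C sketch. The byte order (UTF-16LE) and all names here are assumptions chosen for illustration; nothing below describes how the GnuCOBOL runtime actually stores NATIONAL items.

```c
#include <stddef.h>
#include <string.h>

/* "AB" as UTF-16LE (an assumed byte order for this sketch):
   every ASCII character carries a 0x00 byte. */
static const char national_ab[] = { 0x41, 0x00, 0x42, 0x00 };

/* What a standard C string function sees: strlen() stops at the
   first 0x00 byte, which is in the middle of the first character. */
size_t length_seen_by_strlen(void)
{
    return strlen(national_ab);
}

/* Length-aware alternative: the size must be carried separately,
   here as a byte count that is converted to 16-bit code units. */
size_t utf16_code_units(size_t size_in_bytes)
{
    return size_in_bytes / 2;
}
```

With this data strlen() reports one byte although the buffer holds two characters in four bytes, which is why such data has to be handled with explicit lengths (memcpy/memcmp style) instead of the str* family.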
"Someday" this will all be available with GnuCOBOL, but the amount of time people have is always less than what would be needed to do everything "soon". And many COBOL projects work fine without it and benefit more from tweaking what we already have or from adding missing extensions.
Anonymous - 2024-03-28
When we store a string with UTF-8 or UTF-16, depending on the language we need a... guess!
PIC N stores two bytes per character which should suffice.
UTF-8 may need 1-4 bytes
UTF-16 may need 2 or 4 bytes (4 when a surrogate pair is needed)
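The byte counts above follow directly from the code point value. A small C sketch (the function names are made up for this example; the input is assumed to be a valid Unicode scalar value, i.e. at most 0x10FFFF and not itself a surrogate):

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one code point in UTF-8. */
size_t utf8_encoded_len(uint32_t cp)
{
    if (cp < 0x80)    return 1;   /* ASCII */
    if (cp < 0x800)   return 2;   /* e.g. most accented Latin letters */
    if (cp < 0x10000) return 3;   /* rest of the Basic Multilingual Plane */
    return 4;                     /* supplementary planes */
}

/* Bytes needed to encode one code point in UTF-16:
   one 16-bit unit inside the BMP, a surrogate pair above it. */
size_t utf16_encoded_len(uint32_t cp)
{
    return (cp < 0x10000) ? 2 : 4;
}
```

So two bytes per character are enough only as long as every character is inside the Basic Multilingual Plane.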
Comparison of Unicode encodings - Wikipedia
One can use UTF-8 for display while using another encoding internally! (and vice-versa)
Windows command prompt (even XP) is smart enough already! (with a tweak)
Are you ready to go to an adventure... Simon?
Don't be a chicken please!
By the way, where's Brian?
I'm fed up with his lame jokes... lately!
Is he Ok?
:-)
Please stay with a discussion time that you would use in a business meeting (and don't spam-post, I have much better things to do than moderating them).
Concerning the question here: this may already work as a user-defined name (just give it a try), but be aware that it can push the line past column 72, as we currently only count bytes for that limit.
If you want this in a literal, either place it in an alphanumeric one / PIC X (all operations then work on the byte level, so this may give strange results with ref-mod/STRING/UNSTRING/INSPECT/MOVE) or use a utf8 literal u"中国が好きです" / PIC U, where operations should work on the character level (they currently don't in GnuCOBOL).
The same applies to national literals / PIC N.
If you don't use utf8 for your COBOL source encoding, then the above changes: PIC U will not work at all.
Note that using these with any type of screenio is a different thing, because that depends on having set up the terminal correctly and, for extended screenio, on the library doing it.
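The "strange results" of byte-level operations on UTF-8 data in PIC X can be sketched in C. This is only an analogy for what a byte-based reference modification does, not GnuCOBOL code; the function name is made up and the data is assumed to be valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if a slice may start or end at
   byte_offset without splitting a character, 0 otherwise.
   s is a UTF-8 buffer of len bytes (assumed valid). */
int utf8_is_char_boundary(const char *s, size_t len, size_t byte_offset)
{
    if (byte_offset > len) {
        return 0;                 /* outside the buffer */
    }
    if (byte_offset == len) {
        return 1;                 /* the end is always a boundary */
    }
    /* Continuation bytes (10xxxxxx) are in the middle of a character. */
    return ((unsigned char)s[byte_offset] & 0xC0) != 0x80;
}
```

For the first two characters of the literal above (six bytes, three per character), a byte-based slice of length 3 ends on a boundary, while a slice of length 4 ends inside the second character and leaves an invalid byte sequence behind.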
Are there reasons not to use it?
Why?
I can do it in Windows easily with UTF-8, UTF-16, UTF-32, etc. even with PIC X (with math and a good guess)
PIC N is not, even... mandatory!
Using a Database without multibyte is like using «something» with «nothing»
What can we do?
Better, yet...
...what should we do?
Just saying (asking?)
MM
I've already read Ahmed Maher's proposal!
It is fascinating!
Nevertheless, LIBICU is... complex!
It's... big!
Too much... complex!
I have "nothing" against it!
But...
...GnuCOBOL has... also... "simple" alternatives to... LIBICU
One of them is... "libiconv"...
...the most basic and... simple I have already... found!
And works! (even in Windows)
Fascinating!
Note:
Windows does not need this! :-)
:-)
So how do you convert from iso-8859-15 to utf8 using libiconv? How do you do the same with ebcdic source (also 8bit encodings)…?
«So how do you convert from iso-8859-15 to utf8 using libiconv? How do you
do the same with ebcdic source (also 8bit encodings)…?»
https://www.lemoda.net/c/iconv-example/iconv-example.html
Examples:
iconv_t iconvDesc;
iconvDesc = iconv_open("UTF-8//TRANSLIT//IGNORE", "ISO-8859-15");
https://man7.org/linux/man-pages/man3/iconv_open.3.html
...
size = iconv(iconvDesc, inbuf, inbytesleft, outbuf, outbytesleft);
https://man7.org/linux/man-pages/man3/iconv.3.html
...
status = iconv_close(iconvDesc);
https://man7.org/linux/man-pages/man3/iconv_close.3.html
...
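Put together, the three calls above could look like the following sketch. The helper name latin9_to_utf8 and its return convention are made up for this example, and it assumes an iconv implementation that knows the "ISO-8859-15" and "UTF-8" names (glibc and GNU libiconv do); on some systems it has to be linked with -liconv.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: converts a NUL-terminated ISO-8859-15 string
   to UTF-8. Returns the number of bytes written (without the
   terminating NUL) or (size_t)-1 on any failure. */
size_t latin9_to_utf8(const char *in, char *out, size_t out_size)
{
    iconv_t cd;
    char *inbuf = (char *)in;         /* iconv() wants a non-const pointer */
    char *outbuf = out;
    size_t inbytesleft = strlen(in);
    size_t outbytesleft;
    size_t rc;

    if (out_size == 0) {
        return (size_t)-1;
    }
    outbytesleft = out_size - 1;      /* keep room for the terminator */

    cd = iconv_open("UTF-8", "ISO-8859-15");   /* (to, from) */
    if (cd == (iconv_t)-1) {
        return (size_t)-1;            /* conversion not supported */
    }

    /* iconv() advances the pointers and decrements the counters. */
    rc = iconv(cd, &inbuf, &inbytesleft, &outbuf, &outbytesleft);
    iconv_close(cd);
    if (rc == (size_t)-1) {
        return (size_t)-1;            /* e.g. output buffer too small */
    }

    *outbuf = '\0';
    return (size_t)(outbuf - out);
}
```

Both encodings are stateless, so this sketch does not flush a shift state; a general-purpose version would also loop on E2BIG to grow the output buffer.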
There are some forks of libiconv with «EBCDIC» support
(must configure --enable-extra-encodings)
The principle is the same
https://lists.gnu.org/archive/html/bug-gnu-libiconv/2022-01/msg00002.html
https://github.com/pffang/libiconv-for-Windows (version 1.17 with MSVC)
On Fri, Mar 29, 2024 at 6:58 AM Simon Sobisch sf-mensch@users.sourceforge.net wrote:
Thank you for this useful post.
It is likely useful to check with the students if this would be a reasonable and simpler approach.
On 29 March 2024 00:27:32 CET, "Mário Matos" matosma@users.sourceforge.net wrote:
Last edit: Simon Sobisch 2024-03-29