From: <je...@fo...> - 2025-07-08 12:58:09
On 2025-07-08 03:53, Roland Hughes via Foxgui-users wrote:
> Please define "Does not work."
>
> Do you get a compilation error?
>
> Just not see the character?
>
> What OS are you on?
>
> I haven't coded with Fox in years, but . . . when it comes to Unicode, the first thing you have to do is ensure the font you are using actually has the character represented. Most fonts only have a tiny subset.
>
> Here is an ancient discussion about finding which fonts have what character
>
> https://graphicdesign.stackexchange.com/questions/63283/how-to-find-browse-fonts-that-include-certain-rare-characters-unicode-internat
>
> and a 4 year old discussion
>
> https://www.reddit.com/r/Unicode/comments/l3a3t8/what_font_renders_all_unicode_characters/
>
> A bit of barefoot in the snow for you:
>
> We should have forced all countries to use American English just so software developers would have an easier life. Internationalization is where it all went to Hell. Those who are long in the tooth (or now toothless) will remember wide characters.
>
> https://www.geeksforgeeks.org/cpp/wide-char-and-library-functions-in-c/
>
> This was it!!! Instead of 256 ASCII values we could now have 65536. That would rule the world! Please read point 2 at the top of that: wchar_t could be 2 or 4 bytes DEPENDING ON THE COMPILER USED. Data exchange was basically impossible.
>
> Microsoft, in its infinite wisdom, cough cough hack hack, basically got trapped here. They are still trapped here today. Under the hood they went with the first cut of UTF-16 to avoid having to do multi-unit characters like UTF-8 forced. In theory it was faster. Keep in mind Windows 3.10 was running on 286 computers, so 16-bit at the time.
>
> https://www.betaarchive.com/forum/viewtopic.php?t=38718
>
> Still, we could not get the population to engage in global nuclear warfare and force it to use the one true language, American English, where we could make do with good ole ASCII and those wonderful code pages. Especially since IBM still thwarts the universe today with EBCDIC.
>
> https://en.wikipedia.org/wiki/Code_page
>
> Guess what?
>
> Instead of subjugating all others via global warfare, they chose to promote peace and love, forming a committee churning out an ever larger elephant when the world wanted a mouse. Like all committees, it lacked any real industry knowledge. All they ever had was an x86, so that must be all that exists.
>
> Read up on surrogates:
>
> https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF_(surrogates)
>
> Pay attention to the BE (Big Endian) and LE (Little Endian) columns. IBM and Amdahl are Big Endian. Despite Unisys switching to Intel processors, they are still ones' complement.
>
> Now we had a fine, fine pickle brine.
>
> The x86 and ARM world needed to support itty bitty embedded systems having 512MB or less of RAM (think universal remote control for your TV)
>
> __AND__
>
> we now had to be able to indicate the width of a constant.
>
> The one true world where everything fit into a single 16-bit box was gone!
>
> There are oceans of documentation and legacy code examples out there where \u is always used for Unicode.
>
> So now C programmers, who've never touched a shift key in their life, had to use \U.
>
> Just wait for the hack they come up with when the benevolent committee lacking industry knowledge bloats UTF past 32.
>
> UTF-64 is already taken.
>
> https://utf64.moreplease.com/

Thanks for this wonderful background.
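A small aside on the quoted \u versus \U point, as a plain standard-C++ sketch (nothing FOX-specific, purely illustrative): \u takes exactly four hex digits, \U takes exactly eight, and sizeof(wchar_t) really is compiler-dependent, which is exactly why wchar_t never worked for data exchange.

    #include <cstdio>

    int main() {
      // \u names a code point with exactly 4 hex digits, \U with exactly 8.
      char32_t small = U'\u00E9';      // U+00E9, LATIN SMALL LETTER E WITH ACUTE
      char32_t big   = U'\U0001F600';  // U+1F600 does not fit in 16 bits, so \U is required
      // wchar_t is 2 bytes with MSVC but 4 bytes with gcc/clang on Linux,
      // so its size cannot be relied on across platforms.
      std::printf("U+%04X U+%06X sizeof(wchar_t)=%zu\n",
                  (unsigned)small, (unsigned)big, sizeof(wchar_t));
      return 0;
    }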
32-bit wide characters would indeed incur enormous bloat, but thankfully UTF-8 encoding brilliantly keeps most European languages very close to 1 byte per character. Even for Korean, Japanese, and Chinese text, UTF-8 never takes more than a 32-bit wide character would, and all punctuation, digits, and the like are mercifully just 1 byte.

RAM and disk space are cheaper than ever now, but the biggest problem was always software. UTF-8 also lets software deal with wide characters *mostly* without undue pain and suffering. UTF-8 is very clever: you can start a character walk from any point in a string, because the beginning of a character is always recognizable as such; thus you can also walk backwards through UTF-8 very easily. Various other encodings of 32-bit wide characters are not nearly as clever.

So UTF-8 is winning, and all those who gambled on 16-bit characters now have the worst of both worlds: not as compact as UTF-8, yet still stuck with variable-sized characters. UTF-8 is the way to go.

If you don't interpret the characters but just store them, you'll never need to know anything other than 8-bit-safe strings of bytes. In the few cases where you need to traverse a character, not a byte, at a time, you can look for the magic lead byte:

    (ch & 0xC0) != 0x80

This test takes only a couple of clock cycles! (A short sketch is in the P.S. below.)

-- 
JVZ
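P.S. Here is a minimal sketch of walking a UTF-8 string a character at a time with nothing but that lead-byte test; the helper names are my own for illustration, not part of the FOX API:

    #include <cstddef>
    #include <cstdio>
    #include <string>

    // A byte begins a UTF-8 character if it is NOT a continuation byte (10xxxxxx).
    static bool isLead(unsigned char ch) {
      return (ch & 0xC0) != 0x80;
    }

    // Advance to the start of the next character (p must point at a character start).
    static std::size_t next(const std::string& s, std::size_t p) {
      do { ++p; } while (p < s.size() && !isLead((unsigned char)s[p]));
      return p;
    }

    // Back up to the start of the previous character (p must be > 0).
    static std::size_t prev(const std::string& s, std::size_t p) {
      do { --p; } while (p > 0 && !isLead((unsigned char)s[p]));
      return p;
    }

    int main() {
      // "aé中!" spelled out as raw UTF-8 bytes: 1 + 2 + 3 + 1 = 7 bytes, 4 characters.
      std::string s = "a\xC3\xA9\xE4\xB8\xAD!";
      std::size_t chars = 0;
      for (std::size_t p = 0; p < s.size(); p = next(s, p))   // forward walk
        ++chars;
      std::printf("%zu bytes, %zu characters\n", s.size(), chars);
      for (std::size_t p = s.size(); p > 0; ) {               // backward walk
        p = prev(s, p);
        std::printf("character starts at byte offset %zu\n", p);
      }
      return 0;
    }

On a well-formed UTF-8 string this counts characters and finds their starting offsets in either direction without ever decoding a single code point.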