From: Bruce & B. R. <br...@dc...> - 2016-02-10 14:05:52
Good morning to all,

I am making available to anyone who so desires a copy of the UTF-8 Unicon classes with some documentation. The basic code has been completed and its functionality has been tested, but exhaustive testing has NOT been performed as yet. I still have to fix the comments in the header of the file; this will be done over the next few weeks. However, the code is available for use and/or testing for those interested.

Unlike in my previous non-class-based code, I have made analogues to all the built-in string processing functions and provide a means of using the class methods within the string scanning environment. There is also a small PDF file that describes the various methods and gives a simple example of how to use the classes.

If you are interested, please feel free to contact me either directly or via this list.

This is my take on processing UTF-8. It is an interim measure until the UTF-8 implementation changes are made to the Unicon/Icon runtime system. It may or may not suit your purposes.

There is at least one possible problem: the code will recognise some multi-byte sequences as UTF-8 even though they are specifically not valid UTF-8. This particular problem will only arise where someone has been malicious in crafting the encoding of a codepoint, by inserting an extra continuation byte whose lower 6 bits are all 0. The standard specifically states that this is not allowed. My code, at this point, doesn't always catch this condition and in one place has the potential to generate these specific continuation bytes. In normal processing, this should not arise.

regards
Bruce Rennie
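
For readers unfamiliar with the malformed sequences described above, the following is a minimal sketch in Python of how a strict decoder can reject them. It is not taken from the Unicon classes and does not describe how they work. It assumes the case meant is what the Unicode standard calls a non-shortest-form ("overlong") encoding, where padding bytes carrying only zero payload bits make a sequence longer than it needs to be. The function name decode_one and its return shape are hypothetical, chosen only for this illustration.

```python
def decode_one(data: bytes, pos: int = 0):
    """Decode one UTF-8 sequence starting at pos.

    Returns (codepoint, length). Raises ValueError on malformed input,
    including overlong (non-shortest-form) encodings.
    Hypothetical helper for illustration only.
    """
    b0 = data[pos]
    if b0 < 0x80:
        return b0, 1  # single-byte (ASCII) case

    # For each sequence length: payload bits of the lead byte and the
    # smallest codepoint that legitimately needs that many bytes.
    if 0xC0 <= b0 < 0xE0:
        length, cp, minimum = 2, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 < 0xF0:
        length, cp, minimum = 3, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 < 0xF8:
        length, cp, minimum = 4, b0 & 0x07, 0x10000
    else:
        raise ValueError("invalid lead byte")

    if pos + length > len(data):
        raise ValueError("truncated sequence")

    for b in data[pos + 1:pos + length]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)  # append 6 payload bits

    # The check in question: a codepoint that would fit in a shorter
    # sequence means zero-payload padding was inserted.
    if cp < minimum:
        raise ValueError("overlong encoding")
    if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
        raise ValueError("codepoint not allowed in UTF-8")
    return cp, length
```

As an example, U+00E9 is correctly encoded as the two bytes C3 A9. The four-byte sequence F0 80 83 A9 carries the same payload bits but includes a continuation byte (80) whose lower 6 bits are all zero; the sketch decodes it to 0xE9, sees that this is below the four-byte minimum of 0x10000, and rejects it.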