UTF8 extension in 6.30

Help
vranoch
2009-05-20
2012-11-23
  • vranoch
    2009-05-20

    Hello all,

    While testing version 6.30 I ran into a cluster of bugs related to the UTF8 extension. In short, most of the new code in strngfun.c and utility.c does not handle real string sizes correctly. For example, the function SubStringFunction dumps core on an empty string because of this statement:

    end = UTF8Offset(tempString,end + 1) - 1;

    The previously correct value end=0 becomes end=-1 (or end=MAXINT). That does not cause a crash by itself, but the subsequent copying of characters in the for loop does.
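
    A minimal sketch of the failure mode just described, not the actual CLIPS code: utf8_offset is a hypothetical stand-in for UTF8Offset, and last_byte_of_char is an invented helper. For an empty string the offset of character "end + 1" is 0, so subtracting 1 wraps around; the guard below reports that case instead of producing an index.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for UTF8Offset: byte offset of the character
   with index char_index, never stepping past the terminating NUL. */
static size_t utf8_offset(const char *s, size_t char_index)
{
    size_t off = 0;

    assert(s != NULL);
    while (char_index > 0 && s[off] != '\0') {
        off++;                                   /* consume the lead byte */
        while (((unsigned char)s[off] & 0xC0) == 0x80)
            off++;                               /* consume continuation bytes */
        char_index--;
    }
    return off;
}

/* Stores the index of the last byte of character end_char in *last and
   returns 1. Returns 0 (leaving *last untouched) when the string has no
   bytes to select, which is the case where "offset - 1" would wrap. */
static int last_byte_of_char(const char *s, size_t end_char, size_t *last)
{
    size_t next = utf8_offset(s, end_char + 1);  /* offset just past the character */

    if (next == 0)
        return 0;                                /* empty string: nothing to copy */
    *last = next - 1;
    return 1;
}
```

    With this guard, a caller can skip the copy loop entirely when the function returns 0, rather than looping up to a wrapped-around end index.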

    Also, most of the UTF8xxx functions do not account for the fact that the string can be shorter than they expect. They simply read the next 4 bytes, regardless of whether those bytes still belong to the string.
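
    One way to avoid reading past the end, sketched under assumptions: the caller passes the real byte length, and a truncated or malformed sequence is treated as a single byte so that scanning still makes progress. Both function names are invented for illustration and are not part of CLIPS.

```c
#include <assert.h>
#include <stddef.h>

/* Sequence length announced by a lead byte, or 0 if the byte cannot
   start a UTF-8 sequence (for example a stray continuation byte). */
static size_t utf8_expected_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Number of bytes occupied by the character at s[pos], reading only
   bytes with index below len. A truncated or malformed sequence is
   reported as 1 byte (an assumption of this sketch). */
static size_t utf8_char_len_bounded(const char *s, size_t len, size_t pos)
{
    size_t need, i;

    assert(s != NULL && pos < len);
    need = utf8_expected_len((unsigned char)s[pos]);
    if (need == 0 || need > len - pos)
        return 1;                       /* sequence would run past the end */
    for (i = 1; i < need; i++) {
        if (((unsigned char)s[pos + i] & 0xC0) != 0x80)
            return 1;                   /* missing continuation byte */
    }
    return need;
}
```

    The length check happens before any continuation byte is touched, so a lead byte announcing 4 bytes at the very end of a buffer never triggers a read outside it.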

    Does anybody plan to revise this code?

    Thanks a lot   Vranoch

     
    • Gary Riley
      2009-05-24

      There's a fix for substring in strngfun.c checked into svn.

       
      • vranoch
        2009-05-24

        That's great, thanks. What about the other string functions? Did you also check whether they suffer from the same problem? And did you look at the UTF8xxx functions in utility.c? Do you find them safe against various mixed strings or oddly sized strings?

         
    • Gary Riley
      2009-05-24

      substring is the only CLIPS function that seems to have an issue. I haven't decided yet on what type of C API I want to use to allow access to the characters of a UTF8 string, so the UTF8xxx functions may remain for internal use only.

       
    • vranoch
      2009-05-25

      The issue is not that the UTF8xxx functions are only used internally so far. They are called frequently from the public string functions, and their problem is that they access memory without regard to the real length of the string. Calling a public string function with a suitably crafted string argument can easily cause memory corruption.

       
    • Gary Riley
      2009-05-26

      The problem was with substring, not the UTF8xxx function. The UTF8Offset function doesn't access memory beyond the valid length of a properly formed string.
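
      The two positions in this exchange differ on whether input is always properly formed. A minimal sketch, not taken from CLIPS, of how to tell the cases apart: the invented helper trusted_stride_end computes where a decoder that trusts every lead byte would stop, while itself reading only bytes inside the buffer. For a properly formed string the result equals the length; for a string ending in a truncated sequence it is larger.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Offset at which a decoder that trusts each lead byte would stop when
   walking a buffer of len bytes. Only bytes inside the buffer are read
   here; a result greater than len means such a decoder would have
   stepped past the end. */
static size_t trusted_stride_end(const char *s, size_t len)
{
    size_t pos = 0;

    assert(s != NULL);
    while (pos < len) {
        unsigned char lead = (unsigned char)s[pos];

        if      (lead >= 0xF0) pos += 4;   /* announces a 4-byte sequence */
        else if (lead >= 0xE0) pos += 3;   /* announces a 3-byte sequence */
        else if (lead >= 0xC0) pos += 2;   /* announces a 2-byte sequence */
        else                   pos += 1;   /* ASCII or stray continuation byte */
    }
    return pos;
}

/* 1 if a lead-byte-trusting walk over the NUL-terminated string s would
   go beyond its last byte, 0 otherwise. */
static int would_overrun(const char *s)
{
    size_t len = strlen(s);

    return trusted_stride_end(s, len) > len;
}
```

      A check of this kind could run once on strings coming from outside, so that internal routines may keep assuming properly formed input.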

       