SourceForge has been redesigned. Learn more.
Close

#446 Fix utf8len_of_string for ANSI build.

closed-rejected
None
5
2003-12-15
2003-11-22
No

Fix utf8len_of_string for ANSI build.

Use Laurent's singe-pass technique (precompute UTF-8
lengths for all characters in relevant codepage). Cache
result locally.

This is a bugfix for bug#847155 (Utf8len_of_string is
wrong for 8-bit).

Changes are all in unicoder module. (I forgot to have
MergeDoc pass in the relevant codepage, but that is a
trivial thing to do, and also, doesn't yet affect us,
as it is always default still).

Discussion

  • Anonymous

    Anonymous - 2003-11-22

    Altered and original unicoder.* (12k)

     
  • ganier

    ganier - 2003-11-23

    Logged In: YES
    user_id=804270

    I didn't thought about this question, sharing the array.
    Anyway, what is wrong with a simple solution like create the
    array in Utf8len_of_string, and no static array ?
    That is not a solution for Utf8len_fromChar, but this function
    is not used anywhere, isn't it ?
    I am really not fontd of semaphore or lock. I get always afraid
    of locks because of debugging issues.

    Detail : do we want to go for files above 2 Go (and use
    __int64). I never take care for that (I don't really imagine
    diffing such files), but I don't care either about using __int64
    in my own patch. So is it the way we shall go ?

     
  • Anonymous

    Anonymous - 2003-11-23

    Logged In: YES
    user_id=60964

    I cached the array because it seemed silly to me to keep
    recomputing this same array - because 99.9% of the time we
    compute the same array over and over, I think. But, if I
    cache it, and if this gets called in multiple threads, there
    is a race condition when it first gets created, leading to
    the need for the semaphore.

    Anyway, removing the caching will remove the semaphore, so,
    I'm okay with doing that -- I think computing this array
    every time will be neglible really anyway.

    I think we may as well try to handle large files, although
    we're not yet really ready I bet.

     
  • ganier

    ganier - 2003-11-24

    Logged In: YES
    user_id=804270

    I wonder why -1 in this case ?

    int nchars = MultiByteToWideChar(codepage, 0, &ch, 1,
    wbuff, 3);
    wbuff[nchars]=0; // (MultiByteToWideChar doesn't zero
    terminate)
    int len = 0;
    if (!nchars)
    {
    len = -1;

     
  • Kimmo Varis

    Kimmo Varis - 2003-12-08

    Logged In: YES
    user_id=631874

    So.. What's happening here?

     
  • Anonymous

    Anonymous - 2003-12-08

    Logged In: YES
    user_id=60964

    I need to find time and inspiration to go back and look at
    the handling of invalid characters -- that is what that -1
    involves (that Laurent asked about).

    I think this is probably read to go, subject to possible fix
    if the -1 isn't being handled correctly.

     
  • Anonymous

    Anonymous - 2003-12-09

    Logged In: YES
    user_id=60964

    Oh, -1 is to signal that the character can't be represented.

    This is in case the caller needs to distinguish this case --
    actually I think we should be sometimes checking for this in
    WinMerge, to notice when conversion is lossy. What shall we
    do if user tries to save buffer as some codepage, and it is
    a lossy save ? I don't know, but optimally we would at least
    inform them.

    This patch is not ready -- I need to do a new simpler
    version without map caching, as requested by Laurent in
    commentary.

     
  • ganier

    ganier - 2003-12-09

    Logged In: YES
    user_id=804270

    I doubt that -1 is a good value (that will shrink the buffer !).

    This is just a function to compute the buffer length, probably
    the check for lossy characters will happen in the real
    conversion function, not in this one.
    However this function shall work if the user accepts lossy
    conversions, the lossy character is probably '?' and a value of
    1 seems appropriate.

     
  • Anonymous

    Anonymous - 2003-12-09

    Logged In: YES
    user_id=60964

    The caller has to check for -1, not use it to shrink the buffer.

    Probably our only calling path is via Utf8len_of_string,
    which has this line:

        if \(chlen < 1\) chlen = 1;
    

    so I think that will take care of it ?

    (In the patch code, there is one unused potential caller,
    Utf8len_fromChar, but you've requested that I remove that in
    the next patch version, so it doesn't matter.)

    You may be correct that we could ignore the problem (of
    lossy conversion) here and pick it up elsewhere -- I don't
    recall -- but I think right now we are recording the
    information and handling the case, so why throw away the
    information ? I guess the reason why is to simplify the use
    of map (protect from bad users). In general, though, I don't
    like ignoring errors.

    This is one of my peeves about using Microsoft codepage
    conversion (MultiByteToWideChar etc) functions -- they are
    not very good for lossy conversions -- you can't iterate
    through a string both doing conversion and also finding all
    the points of loss, AFAIK. iconv is much better for this.
    So, because that is something that annoys me, I write this
    map not to have that problem :)

    I can check later before doing the next version of this
    patch that we can handle the problem later, and do a simpler
    map version that just stores 1 for lossy character length.
    (I need to document that this map assumes you will use a
    single byte character for losses.)

    After all, if I want the map with more info (that includes
    -1 for losses), I can always do another version for that.

    So, I'll fix this in next version.

     
  • Anonymous

    Anonymous - 2003-12-09

    Logged In: YES
    user_id=60964

    PS:

    The -1 isn't documented in the function making the map,
    which is bad.

     
  • Anonymous

    Anonymous - 2003-12-10

    Logged In: YES
    user_id=60964

    Uhoh, this entire approach fails utterly when it encounters
    CP936 (Chinese Simplified multibyte codepage), I believe.

    :(

     
  • Anonymous

    Anonymous - 2003-12-10

    Logged In: YES
    user_id=60964

    We could do a fancy multibyte codepage, but it will be
    larger, and then I think I'd want to cache the maps as in
    this patch, but Kimmo just posted a crash involving the
    Unicode build which is yet another problem with the
    memory-mapped file saving, and I'm wondering if we shouldn't
    drop this entire approach and just write out the data in
    streams instead of all this hacking about trying to figure
    out the #bytes before writing.

     
  • Kimmo Varis

    Kimmo Varis - 2003-12-10

    Logged In: YES
    user_id=631874

    Like I commented to bug: converting to using streams for
    saving is OK for me. We are talking about only saving code
    here, right?

     
  • ganier

    ganier - 2003-12-11

    Logged In: YES
    user_id=804270

    Streams are OK for me too.
    Plugins call some functions from unicoder, I hope they won't
    interfere. I am going to check and report to you.

     
  • ganier

    ganier - 2003-12-11

    Logged In: YES
    user_id=804270

    The codepage functions are not used in plugins, but the Utf8
    functions are.

    Plugins (the new code) employs :
    - getDefaultCodepage
    - writeBom
    and five functions for UCS2-LE<->UTF-8 conversion :
    - Utf8len_of_string (argumentss changed for new plugin code)
    - to_utf8_advance
    - stringlen_of_utf8 (for new plugin code)
    - GetUtf8Char
    - Utf8len_fromLeadByte

    Did you begin to write your code ? I just applied the new
    plugin code, it adds some small functions to unicoder.cpp and
    you need to update your current work (:

    For the new plugin code, streams are fine too. That requires
    small changes in the file multiformatText.cpp. If you release
    something, it is probably easier that you just comment out
    the calls in multiformatText, and let me update them.

     
  • Anonymous

    Anonymous - 2003-12-14

    Logged In: YES
    user_id=60964

    My new version of UniFile/unicoder was written directly from
    today's CVS versions, so it should include all your new code :)

    I'm talking about my just submitted, not very tested,
    patch#[ 860054 ]: Reimplement SaveToFile using stream output.

     
  • Anonymous

    Anonymous - 2003-12-15

    Logged In: YES
    user_id=60964

    I'm rejecting this patch, in favor of my new patch#860054
    (Reimplement SaveToFile using stream output).

     
  • Anonymous

    Anonymous - 2003-12-15
    • assigned_to: nobody --> puddle
    • status: open --> closed-rejected
     

Log in to post a comment.