#366 Inserting out-of-range characters needs to be handled


I do not believe we are handling the problem of what to
do if the user inserts characters into the editor that
can't be represented in the original source file's encoding.

The simplest case: you are using WinMergeU.exe to edit
8-bit source files, and you add some Greek, some Telugu,
and some Arabic -- they can't possibly all be converted
into the source file's codepage.

What do we want to do in this case?

Then, what about plugin unpackers and packers -- do
they report whether they did a lossy conversion? How do
we handle that?
What if the unpacker wasn't lossy, but the packer is?

(I don't have a proposal for all this, so I'm hoping
someone else will.)


  • ganier

    Unpack/pack is now quite separate from the Unicode handling,
    and separating them early was a good idea.
    This means:
    * If you have a file with a BOM and the unpacker wants ANSI
    files, it does not work.
    * If you have a file without a BOM and the unpacker wants a
    Unicode file with a BOM, it does not work.
    * An unpacker may produce a file containing NUL bytes. For
    example, an 'unzip' unpacker will work with ANSI text, Unicode
    text, .exe files... (for inputs like a zipped .exe); unpacking
    succeeds, but loading fails.
    * If you are able to save, even with some lossy conversions,
    you will most probably be able to pack. The packer may still
    fail if the syntax is no longer correct (for example, the
    packer may check for missing closing brackets).
    * Unpacking may fail too; in that case we just consider it is
    not the right unpacker.

    Unpack/pack looks much like decompress/compress -- really like
    that one compressed format on UNIX (I don't remember the
    extension) where an archive holds only one file.

    Lossy conversions do not exist in pack/unpack: either it
    succeeds or it fails. (This will probably change, but I don't
    know when or how.)

    Lossy conversions are really a question for Unicode/encoding.
    You may find inspiration in other editors. If I remember
    correctly, Notepad refuses to save when the conversion would
    be lossy and proposes saving in another format instead.
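
    For reference, the Win32 API makes a Notepad-style check
    possible: WideCharToMultiByte reports, through its
    lpUsedDefaultChar argument, whether any character had to be
    replaced by the default character. A minimal sketch, assuming
    an ANSI target codepage (lpUsedDefaultChar must be NULL for
    CP_UTF8); the helper name is hypothetical, not existing
    WinMerge code:

        #include <windows.h>
        #include <string>
        #include <vector>

        // Returns true and fills 'out' if every character of 'text'
        // is representable in 'codepage'; returns false if the
        // conversion would be lossy, so the caller can warn the
        // user instead of silently writing '?' into the file.
        bool ConvertLossless(const std::wstring & text, UINT codepage,
                             std::vector<char> & out)
        {
            out.clear();
            if (text.empty())
                return true;
            BOOL usedDefault = FALSE;
            int cb = WideCharToMultiByte(codepage, 0, text.c_str(),
                                         (int)text.length(), NULL, 0,
                                         NULL, &usedDefault);
            if (cb <= 0)
                return false;
            out.resize(cb);
            WideCharToMultiByte(codepage, 0, text.c_str(),
                                (int)text.length(), &out[0], cb,
                                NULL, &usedDefault);
            return usedDefault == FALSE;
        }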

  • ganier

    > I just filed a bug about lossy conversion in the packing
    > phase, because I don't believe we're handling this problem
    > (although, I've not tested), at least with WinMergeU.exe
    > when it does UNICODE -> TCHAR conversion.
    There is no conversion during packing (pack/unpack handles
    byte arrays, not character strings).

    One event does use conversion: PREDIFFING. There we must
    convert the buffer first to WCHAR for the plugin, then to
    UTF-8 for diffutils.
    I will be happy to just copy the conversion code used for
    saving once that code handles lossy conversions -- if you
    don't mind...
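
    For the diffutils step, a minimal sketch of the WCHAR-to-UTF-8
    leg with the Win32 API (a hypothetical helper, not the actual
    prediffing code); this direction is never lossy for valid
    input, since UTF-8 covers all of Unicode:

        #include <windows.h>
        #include <string>

        // WCHAR buffer (as seen by the plugin) -> UTF-8 bytes
        // for diffutils.
        std::string WideToUtf8(const std::wstring & in)
        {
            if (in.empty())
                return std::string();
            int cb = WideCharToMultiByte(CP_UTF8, 0, in.c_str(),
                                         (int)in.length(), NULL, 0,
                                         NULL, NULL);
            std::string out(cb, '\0');
            WideCharToMultiByte(CP_UTF8, 0, in.c_str(),
                                (int)in.length(), &out[0], cb,
                                NULL, NULL);
            return out;
        }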

  • Perry

    Ok, I like your point about Notepad.

    You said the packer may fail -- then what do you do?

    Maybe this weekend I'll look into at least doing something
    when this happens, instead of silently writing question marks
    into the user's file (which is what I guess we do now).

    I suppose a message box will be the easiest first cut.

  • Perry

    Oops, the bug is a bit larger than this.

    CMergeDoc::CDiffTextBuffer::SaveToFile doesn't actually take
    a codepage parameter, so we never convert back to the
    original codeset.

    Because the user cannot currently give a codepage to WinMerge,
    this only affects WinMergeU.exe reading 8-bit source files.

    It reads them & converts them using CP_ACP (or CP_THREAD_ACP
    if available), but when it saves, it doesn't convert back.

    So the bug is that we don't convert back, not (as I said
    above) that we convert back and ignore the losses.

  • ganier

    > It reads them & converts them using CP_ACP (or
    > CP_THREAD_ACP if available), but when it saves,
    > it doesn't convert back
    Could you explain a bit more, for my information: how does the
    code know, when loading, that it must convert? (Are codes >=
    0x80 automatically converted from the local codepage?)

    And this line, called when saving:
    ucr::convertToBuffer(text, fileData.pMapBase, ucr::NONE);
    -- doesn't it do a conversion?

  • Perry

    Ok, the relevant case is this: the source file has no BOM, so
    we assume it is 8-bit, and WinMergeU.exe is running, so it
    wants to convert it into a wchar_t CString. So it has 8-bit
    input and wants wchar_t, and it calls MultiByteToWideChar.
    That requires a codepage, and it uses CP_ACP or CP_THREAD_ACP,
    which means the user's default codepage.
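
    Sketched in code (a hypothetical helper, not the actual
    unicoder.cpp functions), the load path looks like this:

        #include <windows.h>
        #include <string>

        // Load path: 8-bit file bytes -> wchar_t string, using the
        // user's default codepage because nothing better is known
        // (no BOM, and no way for the user to specify a codepage).
        std::wstring LoadAsWide(const char * bytes, int nbytes)
        {
            std::wstring out;
            int cch = MultiByteToWideChar(CP_ACP, 0, bytes, nbytes,
                                          NULL, 0);
            if (cch > 0)
            {
                out.resize(cch);
                MultiByteToWideChar(CP_ACP, 0, bytes, nbytes,
                                    &out[0], cch);
            }
            return out;
        }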

    We have to pick an example besides Latin-1 to see the effect
    (because Unicode 0x7F-0xFF is identical to Latin-1).

    Suppose the user's codepage is CP 1250 (the Windows codepage
    for Latin-script Eastern European languages, such as Polish).

    When we find the byte 0xA3 in the source file, we interpret it
    as "LATIN CAPITAL LETTER L WITH STROKE" (a letter in the
    Polish alphabet, and not used in any other alphabet that I
    know of), so we map it to the Unicode character 0x141. The
    8-bit character 0xA3 in the source becomes the wchar_t 0x141
    in the CString, and it displays correctly in the crystal
    editor -- if the source file really was meant to be in CP-1250
    (our assumption).

    But when the user saves the file, we forget to convert back,
    so what do we do with that 0x141? Hm, I'm not sure; I guess we
    write it back as 0x01 followed by 0x41. In any case, we don't
    do the correct thing, which is to convert it back to codepage
    1250, i.e. to a 0xA3.
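
    A minimal round trip demonstrating the fix, assuming codepage
    1250 is installed on the system (an illustration, not WinMerge
    code):

        #include <windows.h>

        void RoundTripExample()
        {
            // LATIN CAPITAL LETTER L WITH STROKE is 0xA3 in CP-1250
            char in = '\xA3';
            wchar_t wide = 0;
            MultiByteToWideChar(1250, 0, &in, 1, &wide, 1);
            // wide is now 0x141

            char back = 0;
            WideCharToMultiByte(1250, 0, &wide, 1, &back, 1, NULL, NULL);
            // back is 0xA3 again -- but only because we converted
            // with the same codepage we loaded with
        }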


    I don't know if I can set my default codepage to 1250 to do
    tests like this. My default codepage is 1252, which isn't very
    good for these tests, because it is mostly isomorphic to
    Unicode -- except in the Windows extension range, 0x80 to
    0x9F, and I'm not familiar with this stuff.

    So, in sum, to fix this when writing back, we need to know
    which codepage we are supposed to go back to (1250 in the
    example above). So we need the codepage number (m_codepage, in
    some object).

    > ucr::convertToBuffer(text, fileData.pMapBase, ucr::NONE);

    Ok, there are two issues.

    #1) Does it convert back at all?
    Answer: no. Maybe this is a bug in convertToBuffer. Should it
    convert back using the default codepage CP_ACP/CP_THREAD_ACP?
    I'm inclined to say yes; that would fix this for now. But see
    below:

    #2) Does it remember exactly what conversion it did?
    It can't, because we didn't tell it. I mean, when we first
    converted to fill the buffer, we ought to have recorded
    exactly which codepage we assumed, and then use the same one
    going back. Right now we always use CP_ACP/CP_THREAD_ACP,
    because we lack UI to pick another one, but I am planning for
    the future, when we might use a different codepage, so I say
    we should record which one we used (in fact I made a variable
    for that, m_codepage in some object, but I never filled it in).

    So, what I mean is, we should explicitly set the codepage up
    in the file loading code (to CP_ACP for now in all cases),
    and then explicitly pass that to the saving code, and revise
    it to use it.
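
    Sketched (all names hypothetical; m_codepage is the variable
    already mentioned above, moved into a little per-file
    structure), the proposal amounts to:

        #include <windows.h>
        #include <string>
        #include <vector>

        // Per-file encoding info recorded at load time.
        struct FileTextEncoding
        {
            UINT m_codepage;
            FileTextEncoding() : m_codepage(CP_ACP) { }
        };

        // Loading: record the codepage we assume instead of
        // leaving the assumption implicit in the conversion call.
        std::wstring LoadFile(const std::vector<char> & bytes,
                              FileTextEncoding & enc)
        {
            enc.m_codepage = CP_ACP; // the only choice until there is UI
            std::wstring text;
            if (bytes.empty())
                return text;
            int cch = MultiByteToWideChar(enc.m_codepage, 0, &bytes[0],
                                          (int)bytes.size(), NULL, 0);
            if (cch > 0)
            {
                text.resize(cch);
                MultiByteToWideChar(enc.m_codepage, 0, &bytes[0],
                                    (int)bytes.size(), &text[0], cch);
            }
            return text;
        }

        // Saving: take the recorded codepage and convert back with
        // it, so the file round-trips byte for byte. (The lossy
        // check sketched earlier belongs here too.)
        std::vector<char> SaveFile(const std::wstring & text,
                                   const FileTextEncoding & enc)
        {
            std::vector<char> bytes;
            if (text.empty())
                return bytes;
            int cb = WideCharToMultiByte(enc.m_codepage, 0, text.c_str(),
                                         (int)text.length(), NULL, 0,
                                         NULL, NULL);
            if (cb > 0)
            {
                bytes.resize(cb);
                WideCharToMultiByte(enc.m_codepage, 0, text.c_str(),
                                    (int)text.length(), &bytes[0], cb,
                                    NULL, NULL);
            }
            return bytes;
        }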

  • ganier

    > So, what I mean is, we should explicitly set the codepage
    > up in the file loading code (to CP_ACP for now in all cases),
    > and then explicitly pass that to the saving code, and revise
    > it to use it
    That looks a little bit like pack/unpack. I had to introduce a
    structure too.

    During save, I first copy this structure into a temporary
    copy. SaveToFile is called with the filename and this copy.
    If it succeeds, then we are finished.
    If packing fails during SaveToFile, a custom error code is
    returned. Then TrySaveAs does the job: it copies a default
    packer (which doesn't pack and always succeeds) into the
    temporary copy, displays an appropriate message, and calls
    SaveToFile again. This time, packing can no longer fail.
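
    As a sketch, the flow described above looks roughly like this
    (every name here is a hypothetical stand-in, not the literal
    CMergeDoc code):

        // Result codes and types, invented for the sketch.
        enum SaveResult { SAVE_DONE, SAVE_PACK_FAILED };
        struct PackingInfo { };  // default-constructed = the default
                                 // packer: no packing, always succeeds
        SaveResult SaveToFile(const char * path, const PackingInfo & packer)
        { return SAVE_DONE; }    // stub standing in for the real save
        void ShowPackFailedMessage() { }  // stub for the warning dialog

        SaveResult TrySave(const char * path, const PackingInfo & docPacker)
        {
            PackingInfo infoTemp = docPacker; // temporary copy; the
                                              // document's own packer
                                              // info stays untouched
            SaveResult result = SaveToFile(path, infoTemp);
            if (result == SAVE_PACK_FAILED)
            {
                infoTemp = PackingInfo();  // fall back to the default packer
                ShowPackFailedMessage();   // tell the user what happened
                result = SaveToFile(path, infoTemp);
                // with the default packer, packing cannot fail again
            }
            return result;
        }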

    There is one difference between the codepage and the packer:
    there is only one packer for both files, so that structure is
    stored in CMergeDoc (search for m_pInfoUnpacker), but you want
    one codepage per file.

    I hope this helps.

  • Perry

    I'm forking the convertToBuffer bug into a separate bug,
    because it is really a separate, simpler, and much more
    important issue (because it is causing repeated corruption
    of the Czech and Slovak translations, at least).

  • elsapo

    The convertToBuffer bug & patch mentioned above is PATCH [
    859995 ] "convertToBuffer assumes ISO-8859-1 codepage",
    which was applied 2004-12-29 (and which removed the no longer
    used convertToBuffer method).


    Checking in Src/Common/unicoder.cpp;
    /cvsroot/winmerge/WinMerge/Src/Common/unicoder.cpp,v <--
    new revision: 1.24; previous revision: 1.23