I do not believe we are handling the problem of what to
do if the user inserts characters into the editor that
can't be represented in the original source file.
The simplest case is if you are using WinMergeU.exe and
editing 8-bit source files, and you add some Greek,
some Telugu, and some Arabic -- they can't possibly be
converted into the source file codepage.
What do we want to do in this case?
Then, what about plugin unpackers and packers -- do
they report whether they did a lossy conversion? How do we
handle that?
What if the unpacker wasn't lossy, but the packer is?
(I don't have a proposal for all this, so I'm hoping
someone else will.)
Logged In: YES
user_id=804270
Unpack/pack is now handled very differently from Unicode, and
separating them early was a good idea.
This means:
* if you have a file with a BOM, and the unpacker wants Ansi
files, it does not work.
* if you have a file without BOM, and the unpacker wants a
Unicode file with a BOM, it does not work.
* an unpacker may unpack a file with zero bytes inside. For
example, an 'unzip' unpacker will work with ANSI text, Unicode
text, executables... For files like zipped executables,
unpacking succeeds but loading fails.
* if you are able to save, even with some lossy conversions,
most probably you'll be able to pack. The packer may still fail
if the syntax is no longer correct (for example, the packer may
check for missing closing brackets).
* unpacking may fail too; we then just consider it was not the
right unpacker.
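The failure cases above can be sketched as a simple fallback: try the unpacker, and if it fails, treat it as the wrong unpacker and load the raw bytes as-is. This is an illustrative sketch, not WinMerge's actual API; `loadWithUnpacker` and `demoUnpack` are invented names.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical sketch: an unpacker either yields a transformed byte
// buffer or fails, in which case we conclude it was not the right
// unpacker and load the original bytes unchanged.
using Buffer = std::vector<unsigned char>;
using Unpacker = std::optional<Buffer> (*)(const Buffer&);

Buffer loadWithUnpacker(const Buffer& raw, Unpacker unpack) {
    if (unpack) {
        if (std::optional<Buffer> unpacked = unpack(raw))
            return *unpacked;   // unpacking succeeded
        // fall through: not the right unpacker, use the raw bytes
    }
    return raw;
}

// Demo unpacker: only accepts buffers starting with a magic byte 0x1F,
// and "unpacks" by stripping it.
std::optional<Buffer> demoUnpack(const Buffer& b) {
    if (!b.empty() && b[0] == 0x1F)
        return Buffer(b.begin() + 1, b.end());
    return std::nullopt;
}
```

Note that the unpacked buffer is just bytes, with no character-set interpretation, which is exactly why pack/unpack and Unicode conversion stay separate concerns.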
Unpack/pack looks much like decompress/compress -- really
like that one compressed format in UNIX (I don't remember the
extension) where an archive holds only one file.
Lossy conversions do not exist in pack/unpack: either it
succeeds or it fails. (This will probably change, but I don't
know when and how.)
Lossy conversions are really a question for Unicode/encoding.
I suppose you may find inspiration in some other editors. If I
remember correctly, Notepad refuses to save with lossy
conversions and proposes using another format.
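A Notepad-style check boils down to: before saving, verify that every character of the wide buffer is representable in the target codepage, and refuse (or warn) if any is not. Below is a minimal portable sketch of that idea; `canSaveLosslessly` and the one-entry `kCp1250Fragment` table are invented stand-ins for a real Unicode-to-codepage mapping.

```cpp
#include <cassert>
#include <map>
#include <string>

// Illustrative fragment of a Unicode -> CP-1250 reverse table:
// U+0141 (LATIN CAPITAL LETTER L WITH STROKE) maps to byte 0xA3.
const std::map<wchar_t, unsigned char> kCp1250Fragment = {
    { 0x0141, 0xA3 },
};

// Sketch, not WinMerge code: true if every character of the buffer can
// be represented in the target codepage, i.e. the save would be lossless.
bool canSaveLosslessly(const std::wstring& text,
                       const std::map<wchar_t, unsigned char>& toCodepage) {
    for (wchar_t ch : text) {
        if (ch < 0x80)
            continue;                         // ASCII maps to itself
        if (toCodepage.find(ch) == toCodepage.end())
            return false;                     // would need a substitution char
    }
    return true;
}
```

If this returns false, the editor can then offer to save in another format (UTF-8, UTF-16 with BOM) instead of silently substituting characters.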
Logged In: YES
user_id=804270
> I just filed a bug about lossy conversion in the packing
> phase, because I don't believe we're handling this problem
> (although, I've not tested), at least with WinMergeU.exe
> when it does UNICODE -> TCHAR conversion.
There is no conversion during packing (pack/unpack handles
byte arrays, not character strings).
One event does use conversion: PREDIFFING.
There we must convert the buffer first to WCHAR for the
plugin, then to UTF-8 for diffutils.
I will be happy to just copy the conversion code used for
saving, once that code handles lossy conversions. If you
don't mind...
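That two-step chain (8-bit bytes -> WCHAR for the plugin -> UTF-8 for diffutils) can be sketched portably like this. The names and the one-entry codepage table are illustrative, not WinMerge's real conversion code.

```cpp
#include <cassert>
#include <map>
#include <string>

// One-entry stand-in for a real codepage table (CP-1250: 0xA3 -> U+0141).
const std::map<unsigned char, wchar_t> kCp1250Load = { {0xA3, 0x0141} };

// Step 1: 8-bit source bytes -> wide characters for the plugin.
std::wstring bytesToWide(const std::string& bytes,
                         const std::map<unsigned char, wchar_t>& cp) {
    std::wstring out;
    for (unsigned char b : bytes) {
        if (b < 0x80) { out += (wchar_t)b; continue; }   // ASCII
        auto it = cp.find(b);
        out += (it != cp.end()) ? it->second : L'?';
    }
    return out;
}

// Step 2: wide characters -> UTF-8 for diffutils (BMP code points only).
std::string wideToUtf8(const std::wstring& ws) {
    std::string out;
    for (wchar_t wc : ws) {
        unsigned c = (unsigned)wc;
        if (c < 0x80) {
            out += (char)c;
        } else if (c < 0x800) {
            out += (char)(0xC0 | (c >> 6));
            out += (char)(0x80 | (c & 0x3F));
        } else {
            out += (char)(0xE0 | (c >> 12));
            out += (char)(0x80 | ((c >> 6) & 0x3F));
            out += (char)(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```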
Logged In: YES
user_id=60964
Ok, I like your point about Notepad.
You said the packer may fail -- then what do you do?
Maybe this weekend I'll look into at least doing something
when this happens, instead of silently writing question
marks to the user's file (which is what I guess we do now).
I suppose a message box will be the easiest first cut.
Logged In: YES
user_id=60964
Oops, the bug is a bit larger than this.
CMergeDoc::CDiffTextBuffer::SaveToFile doesn't actually take
a codepage parameter, so we never convert back to the
original codeset.
Because the user cannot currently give a codepage to
WinMerge, this only affects WinMergeU.exe using 8-bit source
files.
It reads them & converts them using CP_ACP (or CP_THREAD_ACP
if available), but when it saves, it doesn't convert back.
So the bug is that we don't convert back, not (as I said
above) that we convert back and ignore the losses.
Logged In: YES
user_id=804270
> It reads them & converts them using CP_ACP (or
> CP_THREAD_ACP if available), but when it saves,
> it doesn't convert back
Could you explain a bit more, for my information: how does the
code know when loading that it must convert (are bytes >=
0x80 automatically converted from the local codepage?).
And this line, called when saving:
ucr::convertToBuffer(text, fileData.pMapBase, ucr::NONE);
-- doesn't it do a conversion?
Logged In: YES
user_id=60964
Ok, relevant case is that the source file has no BOM, so we
assume it is 8-bit, and WinMergeU.exe is running, so it
wants to convert it into wchar_t CString. So, it has 8-bit
input and it wants wchar_t. It calls MultiByteToWideChar.
That requires a codepage. It uses CP_ACP or CP_THREAD_ACP,
which means the user's default codepage.
We have to pick an example besides Latin-1 to see the effect
(because Unicode 0x7F-0xFF is identical to Latin-1).
Suppose the user's codepage is CP 1250 (Windows Eastern
European codepage for Latin script Eastern European languages).
When we find the byte 0xA3 in the source file, we interpret
that as a "LATIN CAPITAL LETTER L WITH STROKE" (it is a
letter in the Polish alphabet, and not used in any other
alphabet that I know of). So, we make that into Unicode
character x141. So, the 8-bit character in the source 0xA3
becomes the wchar_t of 0x141 in the CString, so it displays
correctly in the crystal editor, if the source file was
really meant to be in CP-1250 (our assumption).
But, when the user saves the file, we forgot to convert
back, so what do we do with that 0x141? Hmm, I'm not sure; I
guess we write it back as 0x01 followed by 0x41. Anyway, we
don't do the correct thing, which is to convert it back to
codepage 1250, to a 0xA3.
Oops.
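The broken round trip described above can be demonstrated with a one-entry stand-in for the CP-1250 tables (byte 0xA3 <-> U+0141). Loading converts bytes to wide characters; a correct save must apply the reverse table, otherwise the 0x141 never becomes 0xA3 again. This is an illustrative sketch, not the actual WinMerge code path.

```cpp
#include <cassert>
#include <map>
#include <string>

// One-entry stand-ins for the CP-1250 forward and reverse tables.
const std::map<unsigned char, wchar_t> kLoad1250 = { {0xA3, 0x0141} };
const std::map<wchar_t, unsigned char> kSave1250 = { {0x0141, 0xA3} };

// Load: interpret 8-bit source bytes using the assumed codepage.
std::wstring load(const std::string& bytes) {
    std::wstring out;
    for (unsigned char b : bytes) {
        if (b < 0x80) { out += (wchar_t)b; continue; }
        auto it = kLoad1250.find(b);
        out += (it != kLoad1250.end()) ? it->second : L'?';
    }
    return out;
}

// Save done correctly: convert back with the *same* codepage's reverse
// table, so U+0141 becomes 0xA3 again. (The bug is that this reverse
// step was simply missing.)
std::string save(const std::wstring& text) {
    std::string out;
    for (wchar_t ch : text) {
        if (ch < 0x80) { out += (char)ch; continue; }
        auto it = kSave1250.find(ch);
        out += (it != kSave1250.end()) ? (char)it->second : '?';
    }
    return out;
}
```

With the reverse step in place, load followed by save reproduces the original bytes.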
I don't know if I can set my default codepage to 1250 to do
tests like this. My default codepage is 1252, which isn't
very good for these tests, because it is mostly isomorphic
to Unicode -- except I guess in the Windows extension range,
which I think is 0x7F to 0x9F, and I'm not familiar with
this stuff.
So, in sum, to fix this when writing back, we need to know
that we are supposed to go back to 1252. So we need the
codepage number (m_codepage == 1252, in some object).
> ucr::convertToBuffer(text, fileData.pMapBase, ucr::NONE);
Ok, there are two issues.
#1) Does it convert back at all?
Answer: no. Maybe this is a bug in convertToBuffer. Should it
convert back using the default codepage CP_ACP/CP_THREAD_ACP?
Maybe so -- I'm inclined to say yes. That would fix this for
now. But, see below:
#2) Does it remember exactly what conversion it did?
It can't, because we didn't tell it. I mean, when we first
converted to get to the buffer, we ought to record exactly
what codepage we assumed, and use the same one going back.
Right now, we always use CP_ACP/CP_THREAD_ACP because we
lack the UI to pick another one, but I'm planning for the
future, when we might use a different codepage, so I say we
should record which one we used (in fact I made a variable
for that, m_codepage in some object, but I never filled it in).
So, what I mean is, we should explicitly set the codepage up
in the file loading code (to CP_ACP for now in all cases),
and then explicitly pass that to the saving code, and revise
it to use it.
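The proposal above -- record the codepage at load time and hand the same value to the save path -- can be sketched as follows. The names (`DiffTextBuffer`, `loadFile`, `saveFile`) are invented for illustration; only `m_codepage` echoes the variable mentioned in the thread, and 0 plays the role of CP_ACP here.

```cpp
#include <cassert>
#include <string>

// Sketch: the buffer object remembers which codepage was assumed when
// converting 8-bit bytes to wide characters, so saving can reuse it
// instead of each side silently defaulting to CP_ACP.
struct DiffTextBuffer {
    int m_codepage = 0;        // 0 stands in for "CP_ACP, not yet resolved"
    std::wstring text;
};

const int kDefaultCodepage = 1252;   // stand-in for resolving CP_ACP

void loadFile(DiffTextBuffer& buf, int requestedCodepage) {
    // Resolve and *remember* the codepage used for the bytes -> wide step.
    buf.m_codepage = requestedCodepage ? requestedCodepage : kDefaultCodepage;
    // ... decode the file's bytes with buf.m_codepage ...
}

int saveFile(const DiffTextBuffer& buf) {
    // The save path receives the recorded codepage instead of guessing.
    return buf.m_codepage;   // ... encode wide -> bytes with this codepage ...
}

// Round trip: the codepage chosen at load time is the one used at save time.
int roundTripCodepage(int requested) {
    DiffTextBuffer buf;
    loadFile(buf, requested);
    return saveFile(buf);
}
```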
Logged In: YES
user_id=804270
> So, what I mean is, we should explicitly set the codepage
> up in the file loading code (to CP_ACP for now in all cases),
> and then explicitly pass that to the saving code, and revise
> it to use it
That looks a little like pack/unpack. I had to introduce a
structure there too.
During save, I first copy this structure into a temporary copy.
SaveToFile is called with the filename and this copy.
If it succeeds, then it is finished.
If packing fails during SaveToFile, a custom error code is
returned.
Then TrySaveAs does the job: it copies a default packer
(which doesn't pack and always succeeds) into the temporary
copy, displays an appropriate message, and calls SaveToFile
again. This time, packing cannot fail any more.
There is a difference between codepage and packer. There is
only one packer for both files, so the structure is stored in
CMergeDoc (search for m_pInfoUnpacker). But you want to
have one codepage per file.
I hope this helps.
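The retry flow described above can be sketched like this: `SaveToFile` returns a custom error code when packing fails, and `TrySaveAs` then swaps in an always-succeeding default packer and tries again. Names and the error-code enum are illustrative, not WinMerge's actual signatures.

```cpp
#include <cassert>
#include <string>

// Illustrative sketch of the save fallback: on a packing failure,
// retry with a no-op packer that always succeeds.
enum SaveResult { SAVE_OK = 0, SAVE_PACK_FAILED = 1 };
using Packer = SaveResult (*)(const std::string&);

SaveResult defaultPacker(const std::string&) { return SAVE_OK; } // no-op pack

SaveResult saveToFile(const std::string& text, Packer pack) {
    return pack(text);    // real code would also write the packed bytes out
}

SaveResult trySaveAs(const std::string& text, Packer pack) {
    SaveResult r = saveToFile(text, pack);
    if (r == SAVE_PACK_FAILED) {
        // ... display an appropriate message to the user here ...
        // then retry with the default packer, which cannot fail
        r = saveToFile(text, defaultPacker);
    }
    return r;
}

// A packer that always fails, to exercise the fallback path.
SaveResult failingPacker(const std::string&) { return SAVE_PACK_FAILED; }
```

The same shape would work for the codepage case, except that (as noted) the fallback state lives per file rather than once per CMergeDoc.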
Logged In: YES
user_id=60964
I'm forking the convertToBuffer bug into a separate bug,
because it is really a separate, simpler, and much more
important issue (because it is causing repeated corruption
of the Czech and Slovak translations, at least).
Logged In: YES
user_id=1195173
The convertToBuffer bug & patch mentioned below is PATCH [
859995 ] "convertToBuffer assumes ISO-8859-1 codepage",
which was applied 2004-12-29 (and which removed the no longer
used convertToBuffer method).
**
Checking in Src/Common/unicoder.cpp;
/cvsroot/winmerge/WinMerge/Src/Common/unicoder.cpp,v <--
unicoder.cpp
new revision: 1.24; previous revision: 1.23
done