WinMerge / Patches / #2766 unicodefile to compare

Matthias - 2008-12-30

convert to UTF8

unicode.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Kimmo Varis - 2008-12-30

More description wouldn't hurt...

+ if ( ufile.IsUnicode() )
+ {
+ // Finished with examing file contents
+ ufile.Close();
+ return TRUE; // not unicode file, nothing to do
+ }

As Jochen reminded me in another patch item, ufile.Close() isn't really needed since UniMemFile destructor calls Close().

+ * @bug Fails for files larger than 2gigs

This is BAD. Its usual use case to compare e.g. DVD ISO images. But since this was comment for AUtf8len_of_string() method I'm expecting this 1) won't be called for binary files and 2) takes the buffer so, it is not FILE size limitation, but BUFFER size limitation. Two very different things.

+ UCS2LE, /**< UCS-2 UTF-16 little endian.
I've read somewhat confusing documents about this UCS-2 vs. UTF-16 matching. If I remember correctly UCS-2 was "Unicode" that W2K supported, and it is subset (?) of UTF-16? And WXP and later have real UTF-16 Unicode support? In this case I'm only wondering that we still support W2K and that there isn't danger we assume its always UTF-16 only?

--- diffutils/src/UTIL.C (Revision 6258)
+++ diffutils/src/UTIL.C (Arbeitskopie)
@@ -957,3 +957,40 @@
return 0;
}
#endif
+isunicode(unsigned char *pBuffer, int size)

Oh, I was missing this function in another patch. Here it is now.. :) I'd much rather add this to new file, to make influence to diffutils code as small as possible. It is after all external code we want to update some day...

After quick look this patch looks good. And now that I see the code I better understand your mails too. However I need more time to look more at this.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Matthias - 2008-12-30

+ * @bug Fails for files larger than 2gigs
I just checked it's also called with binaryfiles!

Means we have to overjump these lines in RunFileDiff() with binary files!

// Comparing UTF-8 files seems to require this conversion? // I'm still very confused about why, as what the functions // document doing is UCS2 to UTF-8 conversion, nothing else. // Something is wrong here. - Kimmo FileTransform\_ToUTF8\(strFile1Temp, m\_bPathsAreTemp\); FileTransform\_ToUTF8\(strFile2Temp, m\_bPathsAreTemp\);
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Kimmo Varis - 2008-12-31

No need to, big binary files are automatically compared by quick contents compare method in folder compare.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Matthias - 2009-01-06

>This is BAD. Its usual use case to compare e.g. DVD ISO images. But since
>this was comment for AUtf8len_of_string() method I'm expecting this 1)
>won't be called for binary files and 2) takes the buffer so, it is not FILE
>size limitation, but BUFFER size limitation. Two very different things.

Yes I was thinking on buffer, here it's never check. I'd read anywhere else in bugreport it will break by files 120 MB, that's why I was asking.

As normally we are doing such things partly with 8-16kB only.:=)

>No need to, big binary files are automatically compared by quick contents
In folder compare yes, but in doc-view.:=(

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Kimmo Varis - 2009-01-07

Full contents compare needs at least two copied of whole files into memory. Quick contents compare uses small buffer.

File compare is always done with full contents.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Matthias - 2009-01-07

back to the roots.
here we are discussing about unicode compare engine first!

As original only UTF16(UCS2LE) had been converted!
see what we are doing in OlecharToUTF8. which file will be converted?
only if (unicoding != ucr::UCS2LE)!!!
Thats why result is wrong, if the coding of source are not same!
Take the samples from [ winmerge-Patches-2477657 ] unicode file will be shown as textfiles
can be used here also. You may compare one to one, ok go across it fails.
files are same as bugreport Can't identify see difference - ID: 1826452 .

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Kimmo Varis - 2009-01-07

Can we start by having test files in SVN first? Do you have files with different encodings we could use. Not only fixing this for a first time but later checking it still works.

And can you attach a patch without full contents compare parts that were already committed in the another patch item?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Matthias - 2009-01-08

Take the samples from [ winmerge-Patches-2477657 ] unicode file will be
shown as textfiles
can be used here also. You may compare one to one, ok go across it fails.
files are same as bugreport Can't identify see difference - ID: 1826452 .

patch I will do, sorry i just have some problem with my sytem. hope to solve today also.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Matthias - 2009-01-08

unicode2.patch

unicode2.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Matthias - 2009-01-08

new patch added.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Kimmo Varis - 2009-01-08

>500 lines patch still without proper description. And comments I wrote earlier are not addressed either. And with quick scan there is now additional change that is not in any way related to this. To add one line of comment!??

What is going on? I'm not blindly applying patches somebody submits.

Quoting bug item #845994:
File1.txt is an ansi file (french codepage)

I don't see any codepage handling? Please remember that file's codepage may be different than system codepage, or even thread codepage.

Why you need to use memory mapped file for simple file write?

What is the A in AUtf8len_of_string?

Also, there are too many things this patch tries to do. I don't want UTF-32 changes in this patch. That is new feature and separate patch.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Matthias - 2009-01-09

see my comment from Date: 2009-01-08 16:33 , Date: 2009-01-07 22:52
and headline:
WM was only converting UTF16 to UTF8.
Now all will be converted to UTF8. This maybe is reason we do not have much problem with the bug files greater 2G, see later. As this codeing is not used so often with such big files.:)

starting problem was:
>FileTransform_UCS2ToUTF8(strFile1Temp, m_bPathsAreTemp);
means only UCS2 had been converted
>if (unicoding != ucr::UCS2LE)
as this line shows.

now I changed
FileTransform_UCS2ToUTF8(strFile1Temp, m_bPathsAreTemp);
to
FileTransform_ToUTF8(strFile1Temp, m_bPathsAreTemp);
as all kinds of codeing will be converted to UTF8 now!

so if you take the samples from bug: Can't identify see difference - ID: 1826452

and my samples from [ winmerge-Patches-2477657 ] unicode file will be shown as textfiles.
You will see the difference.
Take from my sample from folder1 a file and an otherone from folder2, nut not same name!

So you will see in Dir -view differnce are shown, evern there is none!
Reason is different codeing only. With this patch, you can now compare across the folders, and there will be no difference, thats all.
Codeing checked here is not related to Doc-View codeing! It's only for the comparengie!
In bugreport he compared an ANSI and a Unicoded file.
see BOM:
File1: 4C 65 73 20 <= ANSI
File2: FF FE 4C 00 65 00 73 20 00 <= UTF-16, little endian

the codeing is shown correct in doc-view, without patch.
But line 1 is maked as diff!

AUtf8len_of_string(const char* text, int size)
same as
Utf8len_of_string(const wchar_t* text, int size)

That i told in on of my emails, how should I name it.
The function return the size to get the file compared. Recognize the parameter.
Original is for unicode 1.parameter( Word == 2 Bytes)
new for Ascii 1.parameter(byte).
We may need one more in future if we go for the newest versions of unicode also, than we need 'DDword'= 4 bytes.

The comment standing there:
/**
* @brief How many bytes will it take to write string as UTF-8 ?
*
* @param size size argument as filemapping are not 0 terminated
*
* @bug Fails for files larger than 2gigs
*/
thats why I asked: Date: 2008-12-31 00:27 and before per email.:=)

Aslong you have same coding on both side, patch not needed.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Kimmo Varis - 2009-01-10

> This maybe is reason we do not have
> much problem with the bug files greater 2G, see later.

This only shows you don't even understand how WinMerge works and don't even listen me when I've explained it earlier.

Why do we need to convert ANSI files to UTF-8? How can you do this without codepage info? Why not use Win32 functions?

Answers to rest of questions?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Matthias - 2009-01-10

Sorry but now you didn't follow me what I explained. I only use GUI no commandline.
1.)
> This maybe is reason we do not have
> much problem with the bug files greater 2G, see later.
You only come here in case of UTF16 codied input files! otherwise never.
So take here a textfile greater than 2G you will come here, but only haveing a UTF16 LE file!
2.)
>WM was only converting UTF16 to UTF8.
thats why, means Ansi and UTF8 aren't same, that's bugreport.
3.)
>How can you do this without codepage info?
WM doesn't do it! first it's checked. than converted. see 'OlecharToUTF8=>OlecharToUTF8'.
That's original. WM was converting only UTF16 to UTF8. only!!!
'if (unicoding != ucr::UCS2LE)' That's the real bug.
To solve this problem I done the patch.
4.)
>Why not use Win32 functions?
All convert functions had been there, some calls had been mist, those I added.

Didnt you tell us, you want to have an open system, not only Microsoft and MFC!
Now opposit? And again, these functions have been there allready! Why should I touch and change a running system. Even there is no need?

5.)
>This only shows you don't even understand how WinMerge works
take the samples (of bugreports no others) and debug WM. You will see you come exact to that point! Result: as buf report. Soultion like I had done. Take original system!
Than take my patch and do it again.
You will see I know meanwhile a little more than last year, not even all.
With the compareengine (analyse.c) the seperat lines are compared and decition done, if
real diff or trivial. Codepage any! Only UTF16 => UTF8. So in case of char > 0x80H differ.
So these lines will be marked like that. Not more is done at that place. Result only is as lotterie aslong codepage on input are not same! Only UTF16 with UTF8, or same codepage on both side, are allways correct!
Means what is differ in lines itself is done in stringdiff, etc, but here allways UTF8!

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Kimmo Varis - 2009-01-11

> > How can you do this without codepage info?
> WM doesn't do it!
WinMerge damn sure handles codepages for ANSI files! You can take and compare two ANSI files with different codepages and people do that all the time.

Of course the existing conversion code doesn't handle codepage since codepages are only needed for ANSI files and we weren't converting ANSI files before. When you convert ANSI files you must handle codepage.

So your patch is dataloss for the ANSI file. Think about having Russian ANSI file and put it through your conversion code...

And handling codepage may be the trickiest part solving this bug. So this is far from a solution. I'm afraid there might not be a good way to solve this.

> Didnt you tell us, you want to have an open system,
> not only Microsoft and MFC!

Cross-platform would be a target in future, yes. But a lot more important is to have working code in Windows! Once we really get into cross-platform planning we can think about ways to handle these codepage issues etc. That will be a very hard problem to solve we don't need to solve yet. But MFC you should avoid now if possible. Win32 methods aren't MFC. OleChar is but there isn't currently way around it.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Matthias - 2009-01-11

>WinMerge damn sure handles codepages for ANSI files! You can take and
>compare two ANSI files with different codepages and people do that all the
>time.
For them nothing will change. If codepage in these ansi files are also diff, I agree with you.

>So your patch is dataloss for the ANSI file.
not more as now.
>And handling codepage may be the trickiest part solving this bug. So this is far from a solution.
agree, but one step after the other.
So fare I'd seen in Doc-view we have this info already. But I think to do it with next generation will be ok. Normaly nobody is useing two ansi files with diff codepage, maybe 1 : 1 000 000 or so.
If you want to solve all, this patch is growing up and will be delayed also. Nobody will be abels to follow. ex. How you will get the codepage of both files? That need much more controles, handling etc. So make small steps.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Kimmo Varis - 2009-01-12

> Normaly nobody is useing two ansi files
> with diff codepage, maybe 1 : 1 000 000 or so.

I was using this pretty often earlier. And I definitely need this feature. Hint: language files.

And this language file merging is one use case I'm worried about with this patch. As you still haven't clearly said what the change will be to current behavior.

One more time: what exactly is the problem? How are you fixing it and why so? How we can test this change? How can I be assured the fix doesn't break some other use case?

Only somewhat understandable description so far was this:
> WM was only converting UTF16 to UTF8. Now all will be converted to UTF8.

But that does not really explain the problem or the solution at all. Table for conversions:
#1 ANSI and ANSI: no conversion
#2 ANSI and UTF-8: ??
#3 ANSI and UTF-16: UTF-16 converted to UTF-8??
#4 UTF-8 and UTF-8: no conversion
#5 UTF-8 and UTF-16: UTF-16 converted to UTF-8

Is this correct? And now you are altering #2 so that ANSI file would be converted to UTF-8?

> So make small steps.
So why is the patch containing totally unrelated changes and changes I've already asked you to remove. Please submit a patch that is related only to this issue you are fixing. Not random comment changes to other files. No UTF-32 additions etc.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Matthias - 2009-01-12

unidode test files

test.zip

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Matthias - 2009-01-12

Sorry you missunderstand! See bugreport Can't identify see difference - ID: 1826452
there is also a sample , also you can take mine, same. I will post again even same as for an otherpatch as I told. Your samples for unicode are not good for this patch, as you see no difference before and afterwards.
Try without patch please:
First you will do folder compare, result they are same. Rearly they are same! It correct.
Now you take Ansi left, and a unicode file right => one line is show as diff, but that's wrong. Taken unicode left and UTF8 right => result ok.
Why you can see next. But now please install patch and repeat, Result,now you can compare any file from folder 1 with any of folder 2, all are same. Thats all, that's the bug.
>I was using this pretty often earlier. And I definitely need this feature.
You never used this as it was never working.
As this line shows' unicoding != ucr::UCS2LE)' only UCS2LE == UTF16LE had been converted.
That has nothing to do with the text in doc-view!
This created temp-files are only used to select lines for diff und trivial diff coloring. Not more! Not on word or characterlevel! That's a different block 'stringdiff'.

I think your problem is actually you are matching unicode with codepage. Aslong you are not comparing USA codepage with china or so, you won't have any problems.

You need the complete patch to solve this problem. As I told before in a next step we can
also take care as you want with Win32 methods codepage. But this behavior will be complete new, it had nerver there!!!! Actually we have only one codepage selecter in our option. To make it correctly, we need two. One for left and one for right file!

I'm not talking about the texteditor!

>And now you are altering #2 so that ANSI file would be converted to UTF-8?
Yes, all should be UTF8 to have same coding left and right.
>No UTF-32 additions etc.
Wrong, we have to check. Actually we cannot convert, but we must detect and do.. nothing!

> So make small steps.
As in doc-view, I told you before we allready know the coding!
in Diffwrapper I want to overjump these lines incase the coding is same.
// Comparing UTF-8 files seems to require this conversion?
// I'm still very confused about why, as what the functions
// document doing is UCS2 to UTF-8 conversion, nothing else.
// Something is wrong here. - Kimmo
FileTransform_ToUTF8(strFile1Temp, m_bPathsAreTemp);
FileTransform_ToUTF8(strFile2Temp, m_bPathsAreTemp);
Cause these lines are only needed if we have any difference in unicoding or codepage!

That mease some options can be doneto speed up WM.:=) That I means with small steps.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Kimmo Varis - 2009-01-12

I just don't understand this at all anymore. I asked for a simple table and didn't get even that.

I want to understand what is going on and why. But I get only some technical details and instructions how I should apply patches and run in debugger. I need simple description.

This is not how you convince me to interest about this or any other patch.

> > I was using this pretty often earlier. And I definitely
> > need this feature.
> You never used this as it was never working.

Why are you once again mixing things? I was talking about comparing files with two different codepages. And it works as it is what I routinely do.

> > No UTF-32 additions etc.
> Wrong, we have to check. Actually we cannot convert,
> but we must detect and do.. nothing!

No we definitely do not. There is no UTF-32 support right now. And we aren't adding it in this patch.

Look, if you keep arguing like this this won't lead anywhere. I won't get tired and apply patch. If I get tired I reject the patch. Please understand I'm busy and I don't even have time to read all this. So simple and understandable description is still missing.

I don't know what to say anymore.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Matthias - 2009-01-12

>Why are you once again mixing things? I was talking about comparing files
>with two different codepages. And it works as it is what I routinely do.
UTF16 in different codepages, yes, thats works!

Compare 2 ANSI files please codepage USA and German, how??
In compare option you have only one codepage.
But I'm not talking abaout that.

>I was talking about comparing files with two different codepages.
This patch is for different in unicode and ansi, not codepages! Thats why i say you are matching.

>There is no UTF-32 support right now.
Right, so if can detect and don't know what to do, do just nothing!
In your mind I should use the files as ansi? No, thats wrong.

#1 ANSI and ANSI: no conversion =>ok, aslong same codepage!!
#2 ANSI and UTF-8: =fault
#3 ANSI and UTF-16: UTF-16 converted to UTF-8 =>fault
#4 UTF-8 and UTF-8: no conversion =>ok
#5 UTF-8 and UTF-16: UTF-16 converted to UTF-8 =>ok

this patch solves #2 and #3 not more.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Kimmo Varis - 2009-01-15

> Compare 2 ANSI files please codepage USA and German, how??
> In compare option you have only one codepage.

WinMerge can detect codepage from XML and RC files.

> > There is no UTF-32 support right now.
> Right, so if can detect and don't know what to do, do just nothing!

No, you don't need to detect. Ignore, forget such files exist. Not supporting UTF-32 means we just don't know such files exist in this patch. If somebody loads UTF-32 file to WinMerge our behavior don't change because of this patch. That is the point. If we randomly change behavior in patches we never get anything fixed.

Thanks a lot for the table, now I there is finally some kind of definition for the problem. Its not anymore some vague bug report, but clear table of conversions missed. Now we progress.

So the effect of the patch is we add conversion to cases #2 and #3. #2 is very common and adding new conversion there certainly slows WinMerge for lots of people. But I think there is no way around it. But the conversion should use Win32 function for the conversion as it handles codepages and for sure is more optimized than homemade implementations.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Matthias - 2009-01-15

>WinMerge can detect codepage from XML and RC files.
I know, but is it needed here.? Our WM UTF8 has an codepage not related to the used file codepage.
That the function 'to_utf8_advance' itself take care. Same is done for Ansi.

>If somebody loads UTF-32 file to WinMerge our behavior don't change because
>of this patch
Yes, that's correct. See we check for all kinds and if none found we convert it as an ansi file, that is def. wrong.
I was looking in web for UTF32. It seams our routine is doing that also. I will see to get a samplefile in UTF32.

So far as I could checked, a convert is only needed in case of difference in file codeing or codepage! So we should do nothing in that case with 2 Ansi files!.
> Win32 function for the conversion as it handles codepages
we have to check if Win32 is going faster, maybe it's also easyier and can solve all these problems with codeing. so just 'to_utf8_advance' better 'unicoder.cpp' must be rewritten.

Much more time we can spare , while storeing the diff to the tmp file and rescan same.
Also only important working with very big files, so level 6.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Kimmo Varis - 2009-01-15

I say this for absolutely last time. I don't want UTF-32 changes into the patch. As long as those changes are in the patch I won't apply it. It is so simple. I'm the one who controls WinMerge development and who decides which patches are accepted. If I'm not happy with the patch, it won't get into WinMerge.

If you want this patch into WinMerge you submit a patch that has the changes I've been requesting.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

unicodefile to compare

Windows visual diff and merge for files and directories

Milestone

Searches

Help

#2766 unicodefile to compare

Discussion