#681 WinMergeU.exe treats multibyte text file as binary file

closed-rejected
5
2004-08-01
2004-05-24
No

When I open a multibyte text file with
WinMergeU.exe(2.1.7.2),
I get the error message "%1 %2 is different binary
files. Binary files cannot be visually compared."

I tried to fix this problem and found that the
byteToUnicode() function in unicoder.cpp can't map
multibyte chars to Unicode chars because it always
processes only one byte.

I added a new function, multiByteToUnicode(), and changed
the code to use it instead of byteToUnicode().

To reproduce this problem with the attached file, you may
have to hardcode codepage 932 (Japanese SJIS), like
this:

UINT
byteToUnicode (unsigned char ch, UINT codepage)
{
    if (ch < 0x80)
        return ch;

    DWORD flags = 0;
    wchar_t wbuff;
    int n = MultiByteToWideChar(932/*codepage*/, flags,
                                (LPCSTR)&ch, 1, &wbuff, 1);
    if (n>0)
        return wbuff;
    else
        return '?';
}
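
The single-byte limitation described above can be illustrated without the Windows API. The following is only a minimal sketch, not WinMerge code: isCp932LeadByte() and splitCp932Chars() are hypothetical helpers, and the lead-byte ranges used are those of code page 932 (0x81-0x9F and 0xE0-0xFC). It shows that a double-byte character only makes sense when the lead byte and its trail byte are handed to the converter together, which a one-byte interface like byteToUnicode() cannot do.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Lead-byte ranges for code page 932 (Shift-JIS): 0x81-0x9F and 0xE0-0xFC.
inline bool isCp932LeadByte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical helper (not part of WinMerge): split a CP932 byte string into
// per-character byte runs. A lead byte stays attached to the byte that follows
// it; a converter that receives only one byte per call never sees both halves.
inline std::vector<std::string> splitCp932Chars(const std::string& bytes)
{
    std::vector<std::string> chars;
    for (std::size_t i = 0; i < bytes.size(); ) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        // A dangling lead byte at the end of the buffer is kept as one byte.
        const std::size_t len =
            (isCp932LeadByte(b) && i + 1 < bytes.size()) ? 2 : 1;
        chars.push_back(bytes.substr(i, len));
        i += len;
    }
    return chars;
}
```

For example, the three-character string made of "A", the two bytes 0x82 0xA0, and "B" is four bytes long; feeding those bytes one at a time to a single-byte converter would present 0x82 and 0xA0 as two separate, undecodable inputs.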

Discussion

  • Anonymous - 2004-05-24

    Logged In: YES
    user_id=60964

    What is the call chain for this problem? (I guess plugin
    Transform code, but that is just a guess.)

    I remember altering UniFile to not call byteToUnicode just
    for this reason; instead it calls ucr::maketstring, which
    handles an entire line at one time.

     
  • Takashi Sawanaka

    Logged In: YES
    user_id=954028

    Hello.

    >What is the call chain for this problem? (I guess plugin
    >Transform code, but that is just a guess.)

    ucr::byteToUnicode
    ucr::get_unicode_char(unsigned char * 0x01630000,
    ucr::UNICODESET NONE, unsigned int * 0x0012edc8, int 0) line
    629 + 17 bytes
    UniMemFile::ReadString(CString & {""}, CString & {""}) line
    481 + 45 bytes
    CMergeDoc::CDiffTextBuffer::LoadFromFile(const unsigned
    short * 0x010cfab4, PackingInfo * 0x010cb408, CString {"C"},
    int & 0, int -1, int 0) line 1457 + 31 bytes
    CMergeDoc::LoadFile(CString {"C"}, int 1, int & 0, int 0)
    line 2567 + 79 bytes
    CMergeDoc::OpenDocs(CString {"C"}, CString {"C"}, int 0, int
    0, int 0, int 0) line 2650 + 39 bytes
    CMainFrame::ShowMergeDoc(CDirDoc * 0x010c4720 {CDirDoc},
    const unsigned short * 0x010ca664, const unsigned short *
    0x010cb46c, int 0, int 0, int 0, int 0, PackingInfo *
    0x0012f9a0) line 523 + 80 bytes
    CMainFrame::DoFileOpen(const unsigned short * 0x00000000,
    const unsigned short * 0x00000000, unsigned long 0, unsigned
    long 0, int 1) line 1525
    CMainFrame::OnFileOpen() line 502
    _AfxDispatchCmdMsg(CCmdTarget * 0x00b0a428 {CMainFrame},
    unsigned int 57601, int 0, void (void)* 0x00402eb9
    CMainFrame::OnFileOpen(void), void * 0x00000000, unsigned
    int 12, AFX_CMDHANDLERINFO * 0x00000000) line 88

    >I remember altering UniFile to not call byteToUnicode just
    >for this reason; instead it calls ucr::maketstring, which
    >handles an entire line at one time.

    ucr::maketstring() didn't run because m_codepage == 0.

    When I force m_codepage = 932, it works correctly.
    My patch is not needed :)

    Thank you.

     
  • Anonymous - 2004-05-24

    Logged In: YES
    user_id=60964

    So, shall we close this patch (as invalid I guess) and open
    a bug?

    I mean, from what you're saying, the problem is that the
    codepage is 0. I definitely believe we have outstanding
    problems with codepage, most likely because it's getting set
    to 0.

    This, I think, goes back to the problem that we really need
    the user to tell WinMerge that it is a cp-932 (or whatever)
    file for things to work out correctly. WinMerge can't figure
    that out (except for some rc and html files). And,
    currently, we lack the user interface for the user to tell
    WinMerge. :(

    Although, one might ask, should we ever call byteToUnicode?
    Maybe the real problem here is that I treated codepage=0 as
    unknown codepage, and used byteToUnicode.

    Maybe the real answer is to call the maketstring line in all
    cases. I think codepage==0 may be CP_ACP or something -- a
    magic number saying use the user's default codepage. The
    user's default codepage may well be multibyte, so maybe the
    real bug is that we ought NEVER to call byteToUnicode--what
    do you think?

     
  • Takashi Sawanaka

    Logged In: YES
    user_id=954028

    > So, shall we close this patch (as invalid I guess) and open
    > a bug?
    Yes, please close it. It is an invalid patch.

    >Maybe the real answer is to call the maketstring line in all
    >cases. I think codepage==0 may be CP_ACP or something -- a
    >magic number saying use the user's default codepage. The
    >user's default codepage may well be multibyte, so maybe the
    >real bug is that we ought NEVER to call byteToUnicode.

    I think so, too. We should probably always call maketstring
    when m_unicoding==ucr::NONE.

     
  • Anonymous - 2004-05-26

    Logged In: YES
    user_id=60964

    I'm going to leave this open for now, because I intend to
    look into what to do about byteToUnicode (i.e., deleting it
    entirely).

     
  • Anonymous - 2004-05-29

    Logged In: YES
    user_id=60964

    I just posted PATCH#962544 (UniMemFile::ReadString calls
    maketstring not get_unicode_cha), which calls maketstring
    even when m_codepage is 0, as discussed below.

     
  • Anonymous - 2004-08-01
    • assigned_to: nobody --> puddle
    • status: open --> closed-rejected
     
  • Anonymous - 2004-08-01

    Logged In: YES
    user_id=60964

    This patch was rejected in favor of my patch

    PATCH [ 962544 ] UniMemFile::ReadString calls maketstring
    not get_unicode_cha

    which was in turn rejected in favor of Laurent's patch

    PATCH [ 972108 ] use getDefaultCodepage for document with
    no codepage

    which was applied 2004-06-17.

    The only remaining issue is that get_unicode_char still calls
    byteToUnicode. The function byteToUnicode should be removed,
    as it does not handle MBCS codepages correctly.

    In fact get_unicode_char is only called from
    UniMemFile::ReadString,
    and only in the case that we have a Unicode encoding, so
    actually it never does call byteToUnicode.

    I will post this final issue as a new patch, and now close
    this patch as Rejected.
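
The direction agreed in the discussion above (treat codepage 0 as "use the default codepage" and convert whole lines whenever the file is not in a Unicode encoding) might be sketched roughly as follows. This is an illustration under stated assumptions, not the code of either patch: resolveCodepage(), choosePath(), and both enums are hypothetical names, and the system default codepage is passed in as a parameter so the sketch stays portable. The only outside fact relied on is that the Windows API defines CP_ACP as 0.

```cpp
// Mirrors the Windows convention that code page 0 (CP_ACP) means
// "use the system default ANSI code page".
constexpr unsigned kCpAcp = 0;

// Hypothetical helper: replace the placeholder value 0 with a concrete
// code page supplied by the caller (for example 932 on a Japanese system).
inline unsigned resolveCodepage(unsigned codepage, unsigned systemDefault)
{
    return codepage == kCpAcp ? systemDefault : codepage;
}

// Hypothetical stand-in for the file's encoding flag; only "None" is taken
// from the thread (ucr::NONE), the other values are placeholders.
enum class Unicoding { None, Ucs2Le, Ucs2Be, Utf8 };

// Which conversion routine a line reader would use.
enum class Path { WholeLine, PerCodeUnit };

// The proposal: for non-Unicode files, always convert a whole line at once,
// whatever the codepage value is, so multibyte sequences are never split.
inline Path choosePath(Unicoding unicoding)
{
    return unicoding == Unicoding::None ? Path::WholeLine : Path::PerCodeUnit;
}
```

The point of the design is that the choice of conversion path depends only on whether the file is Unicode, never on whether the codepage happens to be 0.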

     
