#227 basename() truncates filenames with variable-width encoding

Status: open
Owner: nobody
Labels: crt (86)
Priority: 5
Updated: 2023-03-22
Created: 2011-05-22
Private: No

Hi,

This is forwarded from http://bugs.debian.org/625918

The attached program computes the basename of a 3-byte filename (which denotes 2 characters in some encodings). Everything works fine if a single-byte character set is used:

$ LC_ALL=pl_PL.utf8 ./test.exe
basename("\312\253\172") = "\312\253\172"

However, in the Chinese locale the last byte is truncated:

$ LC_ALL=zh_CN.utf8 ./test.exe
basename("\312\253\172") = "\312\253"

The original reporter believes the culprit is the following fragment of mingwex/basename.c:

if( (len = wcstombs( path, refcopy, len )) != (size_t)(-1) )
  path[ len ] = '\0';

where len was previously initialized to the number of _characters_ of the input string.

Looking at the implementation of dirname(), it might be affected by a similar bug as well.
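To illustrate the suspected cause: the third argument of wcstombs() is a limit in bytes, so passing a character count (2 for the filename above) cuts off a 3-byte result. The following is a minimal sketch of a conversion that asks for the byte count first. The helper name wide_to_multibyte is hypothetical and not part of mingwex, and the sketch assumes that wcstombs( NULL, src, 0 ) returns the required size, as it does on POSIX systems and in the Microsoft runtime.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper (not from mingwex): convert a wide string to a newly
 * allocated multibyte string in the current LC_CTYPE locale.
 * Returns NULL on failure; the caller frees the result. */
char *wide_to_multibyte(const wchar_t *src)
{
    /* Ask for the number of BYTES needed, excluding the terminator.
     * Assumption: a NULL destination is accepted for size queries. */
    size_t bytes = wcstombs(NULL, src, 0);
    char *dst;

    if (bytes == (size_t)-1)
        return NULL;            /* a character cannot be represented */

    dst = malloc(bytes + 1);
    if (dst == NULL)
        return NULL;

    /* The third argument is a byte limit, so pass the byte count
     * (plus one for the terminator), not the character count. */
    if (wcstombs(dst, src, bytes + 1) == (size_t)-1) {
        free(dst);
        return NULL;
    }
    return dst;
}
```

In the "C" locale the byte count and the character count coincide, which would explain why the truncation only appears with a multibyte code page.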

Discussion

  • Stephen Kitt

    Stephen Kitt - 2011-05-22

    Test program

     
  • Jonathan Yong

    Jonathan Yong - 2011-05-23

    Is the text encoded as CP936 or UTF8?

     
  • Stephen Kitt

    Stephen Kitt - 2011-05-23

    According to the comment in the attached file, it's CP936.

     
  • Ozkan Sezer

    Ozkan Sezer - 2011-05-23

    The code is from mingw.org. Do you know whether the problem also shows up with mingw? (Do they know about this?)

     
  • Stephen Kitt

    Stephen Kitt - 2011-05-24

    The problem also shows itself with mingw (runtime 3.18 with w32api 3.17); I'll file a bug with them too.

     
  • 张天师

    张天师 - 2023-03-21
    setlocale (LC_CTYPE, "");
    

    This call in dirname.c and basename.c sets the current locale to the value returned by GetACP(). Removing this line is a temporary workaround for the bug: dirname() and basename() work well in the "C" locale.

     

    Last edit: 张天师 2023-03-21
  • 张天师

    张天师 - 2023-03-22

    The len returned on line 51 by len = mbstowcs (NULL, path, 0) is different from the len parameter passed on line 57 to mbstowcs( refpath, path, len): the former is the number of wide characters needed, the latter the number of multibyte characters needed. The original programmer confused them in many places (in the "C" locale they have the same value), which causes the truncation.

     

    Last edit: 张天师 2023-03-22
  • 张天师

    张天师 - 2023-03-22

    Processing the path as wide characters is not a good idea. I have now fixed it by rewriting the function entirely, without converting to wide characters. It is in the attachment, have a try :)

     

    Last edit: 张天师 2023-03-22
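For readers without access to the attachment, one way to find the last path component without converting to wide characters is to step through the string with mblen(). This is only a sketch of that idea and not the attached rewrite: the name last_component is hypothetical, and trailing separators and drive letters are deliberately not handled. Stepping by whole characters matters because in a double-byte code page such as CP936 a trail byte can have the same value as '\\'.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not the attached patch): return a pointer to the
 * text after the last path separator in `path`, without modifying it.
 * Simplification: trailing separators and drive letters are not handled. */
const char *last_component(const char *path)
{
    const char *base = path;
    const char *p = path;
    size_t remaining = strlen(path);

    (void)mblen(NULL, 0);       /* reset any conversion state */
    while (remaining > 0) {
        /* Length in bytes of the multibyte character starting at p. */
        int n = mblen(p, remaining);
        if (n <= 0)
            n = 1;              /* invalid sequence: skip a single byte */

        /* Only a single-byte character can be a real separator. */
        if (n == 1 && (*p == '/' || *p == '\\'))
            base = p + 1;

        p += n;
        remaining -= (size_t)n;
    }
    return base;
}
```

Because no length is ever converted between wide characters and bytes, the mismatch described in this ticket cannot occur in this approach. The result still depends on the LC_CTYPE locale that is active when the function is called.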
