#32 utf-8 vs codepage

Milestone: v1.2.10
Status: closed-fixed
Owner: Aleksey
Labels: Other (6)
Priority: 5
Updated: 2012-12-26
Created: 2012-08-23
Creator: Liviu
Private: No

The attachment holds the test .txt file and the result of "rhash --md5 --sha1 --bsd *.txt >checksum.bsd". For some reason, the checksums don't verify back. It looks like a filename encoding issue on the reading/checking side, since as far as I can tell the name is properly UTF-8 encoded in the .bsd file itself.

Also, the output of --check differs depending on whether --ansi is used or not, in case that matters.

>>>
C:\tmp\rhash>rhash -c checksum.bsd

--( Verifying checksum.bsd )----------------------------------------------------
Itrtoalzt Llzt .txt No such file or directory
Itrtoalzt Llzt .txt No such file or directory
--------------------------------------------------------------------------------
Errors Occurred: Errors:0 Miss:2 Success:0 Total:2

C:\tmp\rhash>rhash -c --ansi checksum.bsd

--( Verifying checksum.bsd )----------------------------------------------------
« Iñtërñàtíoñalîzâtìöñ · Lòçålîzätìóñ ».txt No such file or directory
« Iñtërñàtíoñalîzâtìöñ · Lòçålîzätìóñ ».txt No such file or directory
--------------------------------------------------------------------------------
Errors Occurred: Errors:0 Miss:2 Success:0 Total:2

C:\tmp\rhash>chcp 65001
Active code page: 65001

C:\tmp\rhash>type checksum.bsd
MD5 (« Iñtërñàtíoñalîzâtìöñ · Lòçålîzätìóñ ».txt) = a7c0f8a90a6998486c116a7b118918de
SHA1 (« Iñtërñàtíoñalîzâtìöñ · Lòçålîzätìóñ ».txt) = 4442e17b42c949a8302611826ebbdbcc2cae310c
<<<

Thanks,
Liviu

Discussion

  • Liviu

    Liviu - 2012-08-23
     
    Attachments
  • Aleksey

    Aleksey - 2012-08-23

    The bug is reproduced. Thanks for reporting it! :)

     
  • Liviu

    Liviu - 2012-09-01

    P.S. I got around to building rhash today and took a closer look at the code. I think I found a couple of clues. They don't rise to the level of a formal patch (for one thing, I only ran some quick tests with the 32-bit build under XP), but you may find the following useful.

    The "no such file" error during verification comes from the 'isspace' calls in hash_check_find_str. 'isspace' takes an int argument, and VS 2010 treats the 'char' type as signed by default, so high-bit characters (such as the bytes of a UTF-8 multi-byte sequence) get sign-extended to negative int values, which are outside the range 'isspace' accepts. The following is one way to fix the problem: check that the character is in the 0-127 ASCII range before calling 'isspace'.

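    /* test bounds first, then pass only ASCII bytes (as unsigned char) to isspace */
    if(backward) for(; begin < end && (unsigned char)end[-1] <= 0x7F && isspace((unsigned char)end[-1]); end--, len++);
    else for(; begin < end && (unsigned char)*begin <= 0x7F && isspace((unsigned char)*begin); begin++, len++);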
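
    The same idea can be sketched in isolation. This is a minimal illustration, not the rhash code: is_ascii_space and trim_ascii_space are hypothetical helper names, and the sketch assumes the input is a byte range that may contain UTF-8 sequences.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns nonzero only for ASCII whitespace.
 * The conversion to unsigned char keeps the argument of isspace()
 * inside the range the C standard allows (0..UCHAR_MAX or EOF),
 * even on compilers where plain char is signed. */
static int is_ascii_space(char c)
{
    unsigned char uc = (unsigned char)c;
    return uc <= 0x7F && isspace(uc);
}

/* Hypothetical helper: trims ASCII whitespace from both ends of the
 * byte range [s, s + *len), stores the new length in *len and returns
 * a pointer to the first byte kept. Bytes >= 0x80, which make up
 * multi-byte UTF-8 sequences, are never treated as whitespace. */
static const char *trim_ascii_space(const char *s, size_t *len)
{
    const char *end;
    assert(s != NULL && len != NULL);
    end = s + *len;
    while (s < end && is_ascii_space(*s)) s++;          /* bounds check comes first */
    while (s < end && is_ascii_space(end[-1])) end--;
    *len = (size_t)(end - s);
    return s;
}
```

    Testing the bounds before dereferencing, and converting to unsigned char before calling isspace, keeps the behavior defined regardless of whether the compiler makes char signed.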

    The above fixes the verification itself, which then finds the file and works OK. However, the display of the filename is still wrong. The root cause is a VS quirk: the fprintf family of functions does an implicit codepage conversion for text files if the current locale includes an explicit codepage. The i18n_initialize function calls setlocale with an empty locale string (""), which sets the codepage to ANSI as far as the CRT is concerned. That extra conversion is not needed in the rhash case: the string is already UTF-8 encoded and the console output CP is set to UTF-8, so the CRT simply needs to pass the given string through. This happens when the current locale is set to plain "C". One way to do it would be to insert the following call in setup_console, right after SetConsoleOutputCP(cp);

    setlocale(LC_CTYPE, (opt.flags & OPT_UTF8) ? "C" : (opt.flags & OPT_ANSI) ? ".ACP" : ".OCP");

    For some reason this works, and it is only needed when the output is _not_ redirected to a file. The other way would be to comment out the existing setlocale call made with the empty locale string, at least on Windows, which would leave the locale at the built-in "C" default. That could of course have other side effects since it's a global setting, but it's not obvious to me whether rhash does in fact rely on localization (such as number, date, and time formats).
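
    The narrower of the two options (resetting only LC_CTYPE while leaving the other categories alone) can be sketched with standard C alone. This is an illustration under assumptions, not the actual rhash change: use_c_ctype_only is a hypothetical helper name, the sketch does not touch the Windows console APIs, and whether a "C" LC_CTYPE really makes the VS CRT skip its codepage conversion is the observation reported above, not something this code verifies.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: adopt the environment's locale for all
 * categories, then force only LC_CTYPE back to "C" so character
 * handling stays locale-neutral. Returns 1 on success, 0 on failure. */
static int use_c_ctype_only(void)
{
    const char *ctype;

    /* May return NULL if the environment names an unknown locale;
     * the program then simply stays in the default "C" locale. */
    setlocale(LC_ALL, "");

    /* Override just the character-classification/conversion category. */
    if (setlocale(LC_CTYPE, "C") == NULL)
        return 0;

    /* Passing NULL queries the current setting without changing it. */
    ctype = setlocale(LC_CTYPE, NULL);
    assert(ctype != NULL);
    return strcmp(ctype, "C") == 0;
}
```

    Because only LC_CTYPE is overridden, categories such as LC_NUMERIC and LC_TIME keep whatever the environment selected, which limits the global side effects mentioned above.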

     
  • Aleksey

    Aleksey - 2012-09-02

    Thanks for finding the solution. The fix is now on GitHub.

    I hope it will not break the output of localized error messages on a localized Windows.

     
  • Aleksey

    Aleksey - 2012-12-26
    • status: open-accepted --> closed-fixed
    • milestone: --> v1.2.10
     
  • Aleksey

    Aleksey - 2012-12-26

    Fixed in 1.2.10

     
