#21 UTF-8 is necessary

Next_Release
closed
None
7
2021-09-04
2021-07-30
Kaiser
No

Please add UTF-8 support (with or without a BOM) by default for hashlist files. This is critical for platform-independent software. At the moment Jacksum uses the local Windows code page, at least on Windows (in my case Windows-1251).
So I created a test folder; here is its content:

Filename Size
!!!ĀāĂ㥹.jpg 531,96 KB
Em—dash.jpg 288,64 KB
Latin.jpg 203,07 KB
Ελληνικά.jpg 269,00 KB
Кириллица.jpg 397,73 KB
العربية.jpg 299,19 KB
देवनागरी.jpg 334,38 KB
かな カナ.jpg 397,11 KB
汉字 漢字.jpg 348,61 KB
한글 조선글.jpg 234,47 KB

Now I produce a hashlist with jacksum -a adler32 -E base64 -m -O hash.txt -r NamesTest
Here is the output:

Jacksum: Meta-Info: version=1.7.0;algorithm=adler32;filesep=\;flags=r;encoding=base64;
Jacksum: Comment: created with Jacksum 1.7.0, http://jacksum.sourceforge.net
Jacksum: Comment: created on Fri Jul 30 20:07:15 MSK 2021
Jacksum: Comment: os name=Windows 10;os version=10.0;os arch=amd64
Jacksum: Comment: jvm vendor=Oracle Corporation;jvm version=25.301-b09
Jacksum: Comment: user dir=C:\Users\Kaiser\Pictures
Jacksum: Comment: param dir=NamesTest

NamesTest:
7L3pgw==    544726  !!!??????.jpg
LYvdpw==    295569  Emdash.jpg
nnjwbA==    207941  Latin.jpg
CDhu2Q==    275456  ????????.jpg
O9xXbQ==    407278  Кириллица.jpg
tSWrSQ==    306373  ???????.jpg
CEgN2w==    342402  ????????.jpg
JxwXlA==    406643  ?? ??.jpg
fwCwVQ==    356975  ?? ??.jpg
2dXtaQ==    240099  ?? ???.jpg

So this hashlist is simply useless. Jacksum on Windows cannot handle some filenames. I insist that UTF-8 by default is necessary (of course, there could be a parameter to select another code page).
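The question marks in the listing above are consistent with what Java does when text is encoded in a charset that cannot represent a character: the character is replaced by the charset's replacement byte, which is `?` for windows-1251. The following is a minimal sketch of that effect, not Jacksum's actual code; the class and method names are hypothetical, and the filenames are written as Unicode escapes so the source compiles regardless of its own file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CodepageDemo {
    // Encode a name to bytes in the given charset and decode it again,
    // mimicking text that passes through a writer using that charset.
    static String roundTrip(String name, Charset charset) {
        return new String(name.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        // Greek and Cyrillic filenames from the test folder
        String greek = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac.jpg";
        String cyrillic = "\u041a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430.jpg";

        // Greek letters are not part of windows-1251, so they are lost
        System.out.println(roundTrip(greek, cp1251).equals(greek));       // false
        System.out.println(roundTrip(greek, cp1251));                     // ????????.jpg
        // Cyrillic letters are part of windows-1251, so they survive
        System.out.println(roundTrip(cyrillic, cp1251).equals(cyrillic)); // true
        // UTF-8 can represent every Unicode character
        System.out.println(roundTrip(greek, StandardCharsets.UTF_8).equals(greek)); // true
    }
}
```

This matches the listing: only the Latin and Cyrillic names come through unchanged under Windows-1251.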

Thanks for your attention.

Discussion

  • Johann N. Löfflmann

    • assigned_to: Johann N. Löfflmann
     
  • Johann N. Löfflmann

    Ticket moved from /p/jacksum/support-requests/15/

     
  • Johann N. Löfflmann

    Workaround for Jacksum 1.7.0 in order to use UTF-8 for both input and output:

    java -Dfile.encoding=utf8 -jar jacksum.jar ...
    

    Some environments don't support UTF-8, so setting it as default could lead to unexpected behavior on those systems. The byte order has no meaning in UTF-8. The Unicode Standard permits the byte order mark (BOM) in UTF-8, but does not require or recommend its use. See also https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8

    Anyway, support for non-default charsets (including UTF-8) comes with the next major release.
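The `-Dfile.encoding` workaround changes the JVM-wide default charset. A program can avoid depending on that default by passing a charset explicitly wherever text is turned into bytes. The sketch below illustrates the idea with standard `java.io` classes; it is an assumption about how such output could be written, not a description of Jacksum's implementation, and `ExplicitCharsetWriter` is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetWriter {
    // Write one line to the stream using the given charset,
    // independent of the JVM default charset.
    static void writeLine(OutputStream out, String line, Charset charset) {
        try {
            Writer writer = new OutputStreamWriter(out, charset);
            writer.write(line);
            writer.write('\n');
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience helper: return the bytes that writeLine would produce.
    static byte[] encodeLine(String line, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeLine(buffer, line, charset);
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Depends on the platform, the JDK version, and -Dfile.encoding
        System.out.println("JVM default charset: " + Charset.defaultCharset());

        String entry = "CDhu2Q==    275456  \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac.jpg";
        byte[] bytes = encodeLine(entry, StandardCharsets.UTF_8);
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.equals(entry + "\n")); // true
    }
}
```

With an explicit charset the result is the same whatever default charset the JVM reports.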

     
    • Kaiser

      Kaiser - 2021-07-31

      Thanks! The workaround is very useful (it might be worth adding to the FAQ).
      But CLI output (not file) continues to replace chars with '?'. Any way to fix that?

      I already know what a BOM is. Some editors add a BOM, and it confuses the parser, which then cannot read the hashlist file correctly. Read-only BOM support would be great, wouldn't it?

       

      Last edit: Kaiser 2021-07-31
  • Johann N. Löfflmann

    • status: open --> accepted
    • Group: v1.7.0 --> Next_Release
     
  • Johann N. Löfflmann

    You're welcome!

    But CLI output (not file) continues to replace chars with '?'. Any way to fix that?

    AFAIK that is a limitation of the command prompt program. You could use Windows Terminal from Microsoft, which comes with full UTF-8 character support. See also https://www.microsoft.com/store/productId/9N0DX20HK701
    Once it is installed, you can open the shell you want. If you open the command prompt in Windows Terminal and change the code page to 65001 by typing chcp 65001, you should see all UTF-8 characters.

    Some editors add a BOM, and it confuses the parser, which then cannot read the hashlist file correctly.

    Thanks for explaining the BOM use case. That makes sense to me. Could you please open a new Feature Request called "Add BOM support" so we can track progress with respect to BOM there?
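For illustration, read-only BOM tolerance could look roughly like the sketch below. Java's UTF-8 decoder keeps a leading BOM as the character U+FEFF, so a parser has to drop it itself before interpreting the first line. This is a hedged example with hypothetical names (`BomStripper`, `stripBom`), not the way Jacksum handles it.

```java
import java.nio.charset.StandardCharsets;

public class BomStripper {
    // The byte order mark as it appears after UTF-8 decoding
    static final char BOM = '\uFEFF';

    // Return the line without a leading BOM; other lines pass through unchanged.
    static String stripBom(String firstLine) {
        if (!firstLine.isEmpty() && firstLine.charAt(0) == BOM) {
            return firstLine.substring(1);
        }
        return firstLine;
    }

    public static void main(String[] args) {
        // Simulate a hashlist line saved by an editor that prepends a BOM
        byte[] raw = "\uFEFFnnjwbA==    207941  Latin.jpg".getBytes(StandardCharsets.UTF_8);
        String decoded = new String(raw, StandardCharsets.UTF_8);

        System.out.println(decoded.startsWith("nnjwbA=="));           // false, BOM is in the way
        System.out.println(stripBom(decoded).startsWith("nnjwbA==")); // true
    }
}
```

The check is applied only to the first line of a file, since a BOM is only meaningful at the very start of the stream.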

     
  • Kaiser

    Kaiser - 2021-07-31

    Sure thing! I'll open a request on BOM.

    Actually, these are two different situations. With plain CMD the issue is purely one of presentation (I suppose it has to do with the font): one can still copy and paste any Unicode output (from tools like RHash, for example).
    But Jacksum actually replaces unavailable characters in its CLI output with U+003F (question marks). I suppose there is room for improvement.

     
  • Kaiser

    Kaiser - 2021-07-31

    There are also problems when passing filenames that contain Unicode characters on the CLI; I found no workaround for that.

     
  • Johann N. Löfflmann

    • status: accepted --> closed
     
  • Johann N. Löfflmann

    I am pleased to announce that Jacksum 3 is on the web!
    See also https://github.com/jonelo/jacksum
    Download and release notes: https://github.com/jonelo/jacksum/releases/tag/v3.0.0

    Jacksum 3 comes with charset support for both input and output and new options are available:

    --charset-check-file <charset>
    --charset-file-list <charset>
    --charset-error-file <charset>
    --charset-output-file <charset>
    

    which give maximum flexibility for files. From now on, UTF-8 is the default if no charset has been specified explicitly.

    Both stdout and stderr still use the default charset of the terminal, because the user's terminal default might not be UTF-8. In that case you usually have to tweak the terminal. You can also tell Jacksum to try UTF-8 for stdout and stderr with the following options:

     --charset-stdout UTF-8
     --charset-stderr UTF-8
    

    or simply

    --utf8
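As a rough idea of what forcing UTF-8 on stdout means inside a Java program, the sketch below wraps an output stream in a `PrintStream` with an explicit charset. It assumes Java 10 or later for the `PrintStream(OutputStream, boolean, Charset)` constructor, the class name `Utf8Stdout` is hypothetical, and it does not claim to reflect how Jacksum implements `--charset-stdout`.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class Utf8Stdout {
    // Wrap any output stream so that printed text is encoded as UTF-8.
    static PrintStream utf8Stream(OutputStream target) {
        return new PrintStream(target, true, StandardCharsets.UTF_8);
    }

    // Helper for inspection: the bytes that would be sent to the terminal.
    static byte[] render(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = utf8Stream(buffer);
        out.print(text);
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Two Hangul syllables, three UTF-8 bytes each
        String hangul = "\ud55c\uae00";
        System.out.println(render(hangul).length); // 6

        // Whether this displays correctly still depends on the terminal
        // (for example, code page 65001 in cmd).
        PrintStream out = utf8Stream(System.out);
        out.println(hangul + ".jpg");
    }
}
```

The program only controls the bytes it emits; the terminal decides how to interpret and render them.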
    

    There are also problems when passing filenames that contain Unicode characters on the CLI; I found no workaround for that.

    That was a really tricky one! The solution is to pass filenames through a pipe rather than as program arguments. Here is an example that demonstrates how this works with Jacksum 3 in Windows' cmd:

    chcp 65001 & echo "a filename that contains unicode chars" |
                    jacksum --utf8 --file-list - --file-list-format ssv
    

    You first have to set the UTF-8 code page, which is 65001 in cmd. Then you pass a filename that contains Unicode characters through the pipe with echo. You have to tell Jacksum to read the file list from the pipe and set the file-list-format to ssv, which stands for space-separated values. I am going to use that pattern for the file browser integrations as well.
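The receiving side of such a pipe can be sketched as follows: the program reads bytes from standard input and decodes them as UTF-8 itself, so the names never pass through command-line argument conversion. This is a simplified illustration that treats each non-empty line as one filename; it does not reproduce Jacksum's ssv parsing, and `StdinFileList` and `readNames` are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

public class StdinFileList {
    // Decode the stream as UTF-8 and return one filename per non-empty line.
    static List<String> readNames(InputStream in) {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        return reader.lines()
                .map(String::trim)
                .filter(name -> !name.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Simulated pipe content; a real program would pass System.in instead
        String name = "\u6c49\u5b57 \u6f22\u5b57.jpg";
        byte[] piped = (name + "\n").getBytes(StandardCharsets.UTF_8);

        List<String> names = readNames(new ByteArrayInputStream(piped));
        System.out.println(names.size());              // 1
        System.out.println(names.get(0).equals(name)); // true
    }
}
```

For the bytes on the pipe to be UTF-8 in the first place, the sending shell must use a UTF-8 code page, which is what chcp 65001 does in the example above.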

    Closing this issue since all issues have been resolved.

    May I ask you to file new feature requests on GitHub?

    Thanks & Regards,
    Johann

     
