
#12 UTF8 doesn't work "as advertised"

v1.0_(example)
open
None
5
2017-11-25
2016-04-17
Chris Young
No

Hi Olaf

I was trying out the new release v1.102 of smbfs and noticed it had a UTF-8 option. Figuring it would solve my problem with accented characters not showing up properly on either the Amiga or server side I thought I would try it.

My initial experience was that enabling or disabling the switch made no difference: any directories containing accented characters would not display. I've now created a test directory on my server and put a directory and a file inside it, called tést and testíng.txt respectively. With UTF-8 enabled, neither of these shows up in smbfs, despite the characters being present in ISO-8859-15 (my default charset under OS4). With UTF-8 disabled they both show up, but predictably with the accented characters incorrect. So it looks like the UTF-8 option is just filtering out the items with non-ASCII characters rather than translating them?
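The "UTF-8 disabled" symptom can be reproduced outside smbfs. The sketch below assumes the server stores the names as UTF-8 and that the client shows the raw bytes as ISO-8859-15 without conversion. Both are assumptions about this setup, not confirmed behaviour of smbfs, and `misread_as_latin9` is a hypothetical helper name.

```python
# Names as created on the server; assumed here to be stored as UTF-8.
names = ["tést", "testíng.txt"]


def misread_as_latin9(name: str) -> str:
    """Show how a UTF-8 encoded name looks if its bytes are read as ISO-8859-15."""
    return name.encode("utf-8").decode("iso8859-15")


for name in names:
    # Each accented letter is two bytes in UTF-8, so it turns into two characters.
    print(repr(name), "->", repr(misread_as_latin9(name)))
```

Each accented letter becomes two unrelated characters, which matches "show up but with the accented characters incorrect". It does not explain why the entries vanish when the UTF-8 option is enabled.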

The server is running Samba under Raspbian Jessie.
$ smbd --version
Version 4.1.17-Debian

Discussion

  • Olaf Barthel

    Olaf Barthel - 2016-05-12

    I just learned that UTF-8 encoding for file names may not be portable among different operating systems and their file systems.

    smbfs encodes Amiga file names, which use the ISO 8859-1 encoding, using a process which maps each letter to a single Unicode code point and writes it out as UTF-8. This is as simple as it should get, but in Mac OS X, for example, the 'c' with the cedilla in "français" will not be encoded as the single code point 'ç' (U+00E7), but as two code points, 'c' followed by a combining cedilla (U+0327), using decomposition. What the respective operating system prefers, and its file system layer (which may be of a different mind), is difficult to gauge. Samba has a built-in translation engine just for this kind of quirk.

    The straightforward and simple translation approach which I picked clearly will not do. I suspect that I might have to go into adapting a more sophisticated process which, as a by-product, will cause the file system code to balloon :-(
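The mismatch Olaf describes can be shown with Python's standard `unicodedata` module. This is a minimal sketch, not smbfs code: it assumes the one-to-one mapping produces precomposed code points and uses NFD as a stand-in for the decomposed form a file system might store.

```python
import unicodedata

amiga_name = "français"                      # every character fits in ISO 8859-1
latin1_bytes = amiga_name.encode("iso8859-1")

# One-to-one mapping: each ISO 8859-1 byte becomes the code point of the same value.
composed = latin1_bytes.decode("iso8859-1")

# Decomposed form: 'ç' is split into 'c' plus the combining cedilla U+0327.
decomposed = unicodedata.normalize("NFD", composed)

print([hex(ord(c)) for c in composed])
print([hex(ord(c)) for c in decomposed])
print(composed.encode("utf-8"))
print(decomposed.encode("utf-8"))

# The two spellings look identical on screen but do not compare equal.
print(composed == decomposed)
# Normalising both sides to the same form makes them comparable again.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

A name lookup that compares the UTF-8 bytes directly would therefore miss a file whose name the server stores in the other form.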

     
  • Chris Young

    Chris Young - 2016-05-12

    I came across the same thing when implementing IDN support in NetSurf. The answer is Unicode Normalisation. I used libutf8proc to do this - I had to modify it slightly, but IIRC that's because I needed to normalise full Unicode rather than UTF-8. It should do exactly what you need - normalise, convert to ISO-8859-xx. On the way back it shouldn't matter so much.
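The "normalise, then convert" step suggested above could look roughly like this. The sketch uses Python's standard `unicodedata` module in place of utf8proc (which is a C library), and `server_name_to_amiga` is a hypothetical helper name. Replacing unrepresentable characters with `?` is an assumed policy, not something decided in this thread.

```python
import unicodedata


def server_name_to_amiga(name: str, charset: str = "iso8859-15") -> bytes:
    """Normalise a Unicode file name to composed form, then encode it for the Amiga side."""
    # NFC merges 'c' + combining cedilla into the single code point 'ç',
    # which the 8-bit charset can represent.
    composed = unicodedata.normalize("NFC", name)
    # Characters with no equivalent in the target charset become '?' (assumed policy).
    return composed.encode(charset, errors="replace")


# Decomposed and precomposed spellings end up as the same 8-bit name.
print(server_name_to_amiga("franc\u0327ais"))
print(server_name_to_amiga("fran\u00e7ais"))
```

Without the normalisation step, the decomposed spelling would lose its accent on conversion, because the combining cedilla has no ISO-8859-15 equivalent of its own.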

     
  • Chris Handley

    Chris Handley - 2016-05-16

    @Olaf
    If you don't think you'll have time to fix UTF-8 support any time soon, could you please add a "transparent" or "dumb" mode that just passes the SMB host's characters through without any "code page" or Unicode conversion or processing?

    My reason: I've added basic UTF-8 support to FolderSync2 (not yet in public version but maybe soon), which works fine with FTP Mount (which doesn't do any character conversion!), but with SMBFS v1.102 it doesn't work. I suspect this is because SMBFS is trying to do "code page" conversion on UTF-8 character sequences, and obviously the "UTF8" mode doesn't work any better.
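The difference between the requested pass-through mode and a converting mode can be sketched as follows. Everything here is hypothetical: the function, the mode names, and the assumption that the name arrives from the server as UTF-8 bytes are illustrative only and do not describe how smbfs is structured.

```python
def to_application(raw_name: bytes, mode: str) -> bytes:
    """Hypothetical helper: return the name bytes an application would see in each mode."""
    if mode == "passthrough":
        # No code page or Unicode processing: the application gets the bytes as-is.
        return raw_name
    if mode == "convert":
        # Assumes the server sent UTF-8; convert to the local 8-bit charset.
        return raw_name.decode("utf-8").encode("iso8859-15", errors="replace")
    raise ValueError(f"unknown mode: {mode}")


from_server = "tést".encode("utf-8")

# A UTF-8 aware application can decode pass-through names itself.
print(to_application(from_server, "passthrough").decode("utf-8"))
# A converting mode hands over a name in the local charset instead.
print(to_application(from_server, "convert"))
```

In pass-through mode the burden of interpreting the bytes moves entirely to the application, which is what a program with its own UTF-8 mode expects.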

     
  • Chris Handley

    Chris Handley - 2017-11-25

    @Olaf
    Is there any chance of you adding a simple UTF-8 "pass-through" option (i.e. no translation)? That would allow me to use SMBFS instead of FTP Mount, which is proving to be incredibly unreliable for me and so prevents me from keeping my OS4 machine synced to my PC.

    I know it's not the "elegant" or "right" way to solve the problem, and is instead merely a workaround, but it should be very easy to implement, and it would allow my FolderSync2 to work correctly (using its own UTF-8 mode, as it does with FTP Mount).

     
