Re: [fuse-devel] In what character coding is the path argument sent in fuse callbacks?


On 08/09/2013 01:13 AM, Hans Beckérus wrote:
>>> But, my worries are this. In what format should I actually store the
>>> filename entries in the cache?
>>
>> That's up to you. Storing them as byte sequences might be easier,
>> because otherwise you'd have to decode the directory entry before you
>> can look it up.
>>
> Exactly. But herein lies my confusion! If I choose (and I must) a
> certain encoding in my cache, then will this not break if the OS on
> the local host is not sending the same encoding through fuse to my fs? I
> did a very simple test, and on Linux, irrespective of what locale I
> set, path entries look like they are encoded in UTF-8. But
> what about other UNIX systems, such as OS X, BSD, Solaris?

I think you are confused because you still think of path names as being
composed of characters rather than bytes.

Under Unix, path names are byte sequences, not necessarily encoded
character sequences. They may happen to be encoded characters in some
situations, but you do not know when this is the case, and you do not
know what encoding to use.

This is different from Windows, where path names are character
sequences. A Windows FUSE would receive path names as wchars. But under
Unix, path names are byte sequences, and you cannot and must not assume
that they are encoded characters.
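
To make the distinction concrete, here is a minimal Python sketch (not
FUSE code; the variable and function names are only for illustration).
It shows that the same character sequence becomes different byte
sequences under different encodings, and that a byte sequence need not
be valid UTF-8 at all:

```python
# The same three characters, encoded two different ways.
name = "åäö"

utf8_bytes = name.encode("utf-8")      # two bytes per character here
latin1_bytes = name.encode("latin-1")  # one byte per character

# A Unix kernel only ever sees the bytes, so to it these are two
# different file names.
names_differ = utf8_bytes != latin1_bytes


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw happens to decode as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

A file system that receives latin1_bytes has no reliable way to tell
which encoding the caller had in mind; it can only observe that the
bytes are not valid UTF-8.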

Maybe it helps you if you design your file system as follows:

1) You convert all names in your database to byte sequences using any
encoding you like.

2) You write a FUSE file system that provides the data from this
converted database, without ever thinking about characters or encoding.

You probably do not want to do this in the actual code, but this is the
correct mental model to use.
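
The two steps above can be sketched as follows. This is only an
illustration of the mental model under my own assumptions: NameCache,
add_from_database and lookup are hypothetical names, the attribute
dictionaries stand in for whatever your cache stores, and UTF-8 is just
one possible choice for step 1.

```python
from typing import Dict, Optional


class NameCache:
    """Hypothetical cache keyed by raw path bytes, never by characters."""

    def __init__(self, encoding: str = "utf-8") -> None:
        # Assumption: one encoding is picked once for the whole database.
        self._encoding = encoding
        self._entries: Dict[bytes, dict] = {}

    def add_from_database(self, name: str, attrs: dict) -> None:
        # Step 1: turn the character name into bytes when loading it.
        self._entries[name.encode(self._encoding)] = attrs

    def lookup(self, raw_path: bytes) -> Optional[dict]:
        # Step 2: compare bytes with bytes; nothing is decoded here.
        return self._entries.get(raw_path)


cache = NameCache()
cache.add_from_database("/åäö", {"size": 1})

hit = cache.lookup("/åäö".encode("utf-8"))     # same bytes: found
miss = cache.lookup("/åäö".encode("latin-1"))  # other bytes: no entry
```

The lookup with Latin-1 bytes returns nothing, and that is the correct
answer: the file system contains no entry with that byte sequence.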

> Is this the
> reason for the iconv fuse module? I played with it a bit but it will
> cause some overhead. Or is it a better option to add support for
> something like -o iocharset= and internally use iconv?

In my opinion, the iconv fuse module is a bit of a hack and serves
mostly to confuse people (as proven here). If I understand correctly, it
sits between your file system and the FUSE kernel module, assumes that
both your file system and the user space are using a specific encoding
(which is generally wrong) and then converts between those encodings
(which is error-prone). I would recommend that you implement such a
conversion yourself if needed -- at least then you'll know exactly why
and where to look when things break :-).
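
If you do implement the conversion yourself, it could look roughly like
the sketch below. Everything in it is an assumption made for
illustration: the io_charset parameter stands for whatever a user would
pass through a mount option, INTERNAL_ENCODING is the encoding you
picked for your cache, and the function names are made up. It is not
how the iconv module works.

```python
from typing import Optional

# Assumption: the encoding chosen for names stored in the cache.
INTERNAL_ENCODING = "utf-8"


def to_internal(raw_name: bytes, io_charset: str) -> Optional[bytes]:
    """Re-encode a name received from a caller into the cache encoding.

    Returns None if the bytes are not valid in io_charset, so the
    caller can answer "no such entry" instead of guessing.
    """
    try:
        return raw_name.decode(io_charset).encode(INTERNAL_ENCODING)
    except UnicodeDecodeError:
        return None


def to_external(stored_name: bytes, io_charset: str) -> Optional[bytes]:
    """Re-encode a stored name for a caller that expects io_charset.

    Returns None if the name cannot be represented in io_charset.
    """
    try:
        return stored_name.decode(INTERNAL_ENCODING).encode(io_charset)
    except (UnicodeDecodeError, UnicodeEncodeError):
        return None
```

Both directions can fail, which is exactly the error-prone part: you
have to decide what a lookup or a directory listing should do with a
name that does not survive the conversion.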

>>> I look up in the cache using the path argument in e.g. getattr() calls.
>>> If someone does e.g. 'cat åäöåäö', is there not a chance that I will miss
>>> in the cache because the byte stream is not going to be UTF-8 and the
>>> hash I use will be wrong?
>>
>> You mean if the 'åäöåäö' is not encoded in UTF-8? Yes, then you will
>> "miss" any UTF-8 encoded 'åäöåäö' in your cache. But this is correct
>> behavior. Your file system does not contain this entry, because you
>> are encoding everything in UTF-8.
>>
> Or maybe rather "expected behavior"; whether it is correct is what I am
> trying to pin down here.
> But, and please correct me if I am wrong here, UTF-8 simply looks
> like a byte sequence to me. Or can I convert UTF-8 to something even
> less?

UTF-8 is an encoding. A UTF-8 encoded character sequence is a byte
sequence.

> But in that case I would lose information and face the risk of
> getting duplicates, will I not? Or at least I need to store both the
> reduced name and the original in UTF-8 to resolve collisions.

Whenever you save a character sequence, you have to pick an encoding for
it. So your second sentence doesn't make sense.

HTH,
Nikolaus