Re: [V9fs-developer] Fun with DIOD.

On 04/21/2011 12:25 PM, Eric Van Hensbergen wrote:
> On Wed, Apr 20, 2011 at 11:59 AM, Rob Landley <rla...@pa...> wrote:
>> On 04/19/2011 03:21 PM, Eric Van Hensbergen wrote:
>>> On Tue, Apr 19, 2011 at 3:06 PM, Rob Landley <rla...@pa...> wrote:
>>
>> But you pretty much always specify a path for the mount, so having that
>> be -o is silly, you shouldn't need -o for the default case.
>>
> 
> FWIW that's not really the Plan 9 model.

But it is the Linux model.

> Under Plan 9 you mount the
> whole thing somewhere and then bind what you really want.

Linux tends to have two interesting levels of security: server level and
client level.  At the server level the question is "can this machine
mount this share, and if so is it read only or what?".  That's security
enforced by the server, and the server gives the machine root access to
the share.  At the client level, the client is responsible for
distinguishing UIDs and honoring user/group/all access permission
distinctions, and it's the client's choice not to just do everything as
root.

I realize plan 9 doesn't work that way, but what I'm really looking for
here is a good Linux filesystem that can provide a sane non-FUSE
alternative to NFS and Samba.

>  That being
> said, I have no problem with supporting more traditional Linux
> distributed file system semantics as well, particularly since they
> don't directly conflict with the historical model.

Document 'em as part of 9p2000.L :)

Speaking of which, I'm likely to create a Documentation/filesystems/9p
subdirectory, move 9p.txt into it, add some more docs (see attached very
much unfinished thing), and clone the diod 9p2000.L wiki to create a
9p2000.L file in the Linux source describing the darn protocol it
implements.

You have been warned.

>>
>> When you say "URLs" I'm guessing you're talking about some way to
>> specify virtio tags?  Replace -o transport= as well as -o aname=?  Hmmmm...
>>
>> According to 9p.txt, right now you can have transport=tcp,
>> transport=virtio, transport=fd, transport=unix, and transport=rdma.
>> There are only two interesting _servers_ right now, one for virtio and
>> one for tcp, but presumably you've been testing them somehow...
>
> the transport=fd and transport=unix ones have been around for a while,
> just not necessarily where most folks would look for them.  They are
> there to support things like plan9ports, etc. -- but nothing stopping
> them from being used for other purposes.

I'm still reading through the source, but trans_fd.c says the "socket
layer" is deprecated.  (I think I asked about this a while back.)

> If you actually take a look
> at wmii (which is pretty widely available in the Linux world), you can
> use transport=[unix|fd] to mount its namespace (or at least you used
> to be able to).

transport=fd I can understand.  If the mount syscall can drill down
through the file handle to get the underlying struct and increment its
own reference to it so it can continue to use it when the mount process
goes away, then that's very plan 9: everything is a file.

Thanks to Bill Joy being insane (even before he became a luddite), Unix
networking semantics are one of the few things that are NOT a file, so
being able to specify a network address is necessary precisely because
it does not exist in the filesystem.

But most unix AF_UNIX sockets live in the filesystem.  (Yes, there's a
non-filesystem namespace, which is more Bill Joy brain damage as you can
tell by the name "sun_path".  And once again very anti-plan-9.  I take
it people actually use this, and you want to encourage it?)

Sigh.  The problem is distinguishing AF_UNIX and virtio namespaces
without an explicit verbose type:// marker.  Probably isn't a non-ugly
way to do that.

Sure, you can say something random like [name]:/path, where square
brackets around the name mean virtio, but the first time somebody sees
that they're not immediately going to grok what it means, nor are they
going to work it out for themselves without looking it up...

> In this mode, 9p is more like FUSE than a distributed
> file system, but the nice thing is that it can function as both.

I vaguely recall there is a FUSE implementation of plan 9 client on the
"big list of implementations that didn't do what I wanted".  But if I
was happy with FUSE I'd be using sshfs.  Crank up the memory pressure
and fuse based filesystems lead to deadlocks and OOM killer triggers.

>> When you say "URL", what comes to mind is checking for known transport
>> types followed by two slashes, ala:
>>
>>  virtio://127.0.0.1/path/to/file
>>
>> Couple problems here:
>>
>> 1) TCP is currently the default transport type, this implies you'd have
>> to specify transport type even for tcp, ala "tcp://1.2.3.4/blah".
>> Nobody's ever going to guess that syntax without reading it in the docs,
>> so there's more of a learning curve for people who are trying to get it
>> to work for the first time, translates to slower adoption.
>>
> 
> I could imagine parsing code which accepts:
> tcp://1.2.3.4/blah
> 1.2.3.4/blah
> 1.2.3.4
> 
> It's pretty easy to tokenize and select the right one.

Except both samba and nfs stick a colon in there.  (Maybe it's fallout
from DOS or OS/2, dunno why.)

I'm trying to imagine what's easiest for zillions of existing
battle-scarred amateur sysadmins to remember, to ease migration.

I assume everybody using this stuff is massively sleep deprived, in the
middle of some crisis du jour where something _else_ is on fire and they
have no time or attention to spare for this part, and hasn't touched it
in three months so they only vaguely remember what they once read, so
they have to guess right and get good error messages.

>> 2) URLs don't do a colon, they do first slash.
> 
> They do colon for port actually.  But that's not what you are going
> after, so moving on...
> 
>>  But how are you
>> specifying unix named pipes?
> 
> unix://path-to-pipe
> 
>> Don't they live in the filesystem,
>> potentially in a subdirectory?  Even sticking with a colon separator
>> wouldn't help there because a path can have a colon in it.
> 
> Good point. maybe you'd have to do unix://localhost/path/to/pipe
> Although there really isn't a role for a port per se, so path on its own
> would be fine in this case.

As I said, there is a default port for 9p assigned by IANA, so I'm ok
with having to go "-o port=1234" to specify a non-default port.  The
important thing is that there is a common case that does NOT require a
-o option, not to eliminate all use of -o.
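
A minimal sketch of that rule in C, under stated assumptions:
p9_port_from_opts() is a made-up helper name, and it only looks at a
comma-separated option string of the kind mount passes down.  The common
case carries no port= at all and falls back to 564, which is the port
IANA lists for 9pfs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define P9_DEFAULT_PORT 564   /* IANA-assigned 9pfs port */

/*
 * Hypothetical helper: return the port named by "port=N" in a
 * comma-separated option string, the default port if no such option is
 * present, or -1 if the value is malformed.
 */
int p9_port_from_opts(const char *opts)
{
    const char *p = opts;

    assert(opts != NULL);
    while (*p) {
        /* p always points at the start of one option here. */
        if (strncmp(p, "port=", 5) == 0) {
            char *end;
            long v = strtol(p + 5, &end, 10);

            if (end != p + 5 && (*end == ',' || *end == '\0') &&
                v > 0 && v <= 65535)
                return (int)v;
            return -1;
        }
        p = strchr(p, ',');
        if (!p)
            break;
        p++;
    }
    return P9_DEFAULT_PORT;
}
```

So "mount -t 9p 1.2.3.4:/blah /mnt" would need no -o at all, and only
the oddball server on a non-default port pays for the extra typing.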

>> would be a bad idea anyway: being similar to but not QUITE compatible
>> with a technology people are familiar with is worse than being nothing
>> like it.  False similarity encourages mistakes, don't go there: if
>> you're gonna do a URL, do a URL.  That means
>> "transport://servername/path", and the server name can't have a / in it.)
>>
> 
> agreed.
> 
>> I also have no idea what an RDMA name looks like, and for once wikipedia
>> doesn't have an opinion.  (I need to fluff out 9p.txt quite a bit once
>> I've got some good working examples people can try.)
> 
> I could be mistaken (I didn't have the hardware to test it myself, it was done
> externally), but I think it looks like IP.

Documentation/filesystems/9p needs test cases in it.  We may not have
the hardware, but we need to document "if you have the hardware, do this
to use it".

>> I _do_ note that transport=fd is essentially one of these two:
>>
>>  1,2:/path/to
>>  fd://1,2/path/to
>>
>> I.E. decimal read fd, write fd, presumably within mount's process
>> context but outliving the mount process?
>>
> 
>> How does "unix named mount point" differ from just having _one_
>> filehandle that you both read and write to?  (Your mount helper opening
>> the sucker for you.)  Is there a reason this is a separate transport
>> type rather than:
>>
>>  mount 0,1:/path blah < /path/to/pipe > /path/to/pipe
> 
> it was really just trying to make things easy, I didn't want to have
> to rely on mount helpers and the code was simple enough.

"Easy to code" and "deeply ugly at a design level" are not mutually
exclusive.

Let me cut and paste from man 7 unix:

*  abstract: an abstract socket address is distinguished  by  the  fact
   that  sun_path[0] is a null byte ('\0').  All of the remaining bytes
   in sun_path define the "name" of the socket.   (Null  bytes  in  the
   name have no special significance.)  The name has no connection with
   file system pathnames.  The socket's address in  this  namespace  is
   given  by the rest of the bytes in sun_path.  When the address of an
   abstract socket is returned by getsockname(2),  getpeername(2),  and
   accept(2),  its  length  is sizeof(struct sockaddr_un), and sun_path
   contains the abstract name.  The abstract socket namespace is a non-
   portable Linux extension.

So if you initialize it like this:

  struct sockaddr_un fred = {
    .sun_family = AF_UNIX,
    /* leading NUL byte selects the abstract namespace */
    .sun_path = "\000127.0.0.1"
  };

Then A) there's no way to autodetect the transport type based on the
transport name, and B) if you fill the structure in by hand without a
memset() first you have a bug, because all 108 bytes of the sun_path
field are significant, so any uninitialized binary crap after the name
is part of the name.
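
For what it's worth, here is a rough sketch of filling in an abstract
address without tripping over B.  abstract_addr() is a made-up name, not
anything in the 9p code.  It assumes the kernel takes the name length
from the address length handed to bind(2) or connect(2), which is how
current unix(7) describes it; zeroing the structure first means nothing
stray ends up in the name either way.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Hypothetical helper: fill in an abstract-namespace address for `name`
 * and return the address length to pass to bind(2) or connect(2), or 0
 * if the name is empty or does not fit.
 */
socklen_t abstract_addr(struct sockaddr_un *addr, const char *name)
{
    size_t n;

    assert(addr != NULL && name != NULL);
    n = strlen(name);
    if (n == 0 || n > sizeof(addr->sun_path) - 1)
        return 0;

    memset(addr, 0, sizeof(*addr));        /* no stray bytes after the name */
    addr->sun_family = AF_UNIX;
    memcpy(addr->sun_path + 1, name, n);   /* sun_path[0] stays '\0' */

    /* Count only the leading NUL plus the name, not all of sun_path. */
    return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + n);
}
```

Both ends have to agree on the length convention, which is one more
thing to get wrong.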

Do you really want to get any of this non-portable Linux extension on
you?  (Especially since you _can_ have a userspace mount helper add
support for this and feed it to the kernel via the fd mechanism.)

Who is _currently_ using it?

>> I do know that if you need DNS resolution, you need a mount.p9 helper to
>> do it for you because that's NOT the kernel's job.
>>
> 
> yup.  unless you are NFS and get to cheat because they incorporate you
> into the default mount tool :P

The kernel has a callback to do DNS resolution.  It is an unholy layering
violation, and is one small part of the reason I hope 9p obsoletes NFS.

>>> Since supporting a richer dev_name can happen while still supporting
>>> option arguments, I would not be opposed to patches to the transports
>>> which allowed for richer dev names along the lines you describe.
>>
>> I'm happy to provide a patch, but we'd need to agree on the design first.
>>
>> I could introduce a colon based syntax that checked for (in order):
>>
>>  server:/path
>>
>> 1) valid ipv4 address (transport = tcp)
>> 2) valid ipv6 address (transport = tcp)
>> 3) decimal,decimal fd numbers (transport = fd)
>>
>> And would assume anything else it got was a virtio channel name.  I have
>> no idea what rdma looks like.
>>
> 
> That's sort of the catch, since rdma would look the same, but we can
> keep the option syntax for things like that and just provide sensible
> defaults based on dev_name parsing.  It assumes you don't do something
> like name your virtio channel with something that looks like an IP
> addr.

We could just say that name:/path is only supported for ipv4, ipv6, and
fd pairs, and anything else use the url:// syntax.  (Solving 80% of the
problem cleanly is better than solving 100% of the problem badly,
especially if the remaining cases can be somebody else's problem.)
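
A rough sketch of that check order in C, to make the proposal concrete.
Everything named here (p9_classify(), the P9_DEV_* values) is invented
for illustration and is not existing v9fs code.  It splits on the first
":/", treats "://" as the URL form, and otherwise tries ipv4, ipv6, and
a decimal fd pair in that order.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>

enum p9_dev_kind {
    P9_DEV_TCP4,     /* 1.2.3.4:/path */
    P9_DEV_TCP6,     /* ::1:/path */
    P9_DEV_FD,       /* rfd,wfd:/path */
    P9_DEV_URL,      /* transport://server/path */
    P9_DEV_UNKNOWN   /* anything else: caller should demand the URL form */
};

/* Hypothetical helper: guess the transport from the shape of dev_name. */
enum p9_dev_kind p9_classify(const char *dev_name)
{
    char name[128];
    unsigned char addr[sizeof(struct in6_addr)];
    int rfd, wfd, used = 0;
    const char *sep;
    size_t len;

    assert(dev_name != NULL);
    sep = strstr(dev_name, ":/");
    if (!sep)
        return P9_DEV_UNKNOWN;
    if (sep[2] == '/')
        return P9_DEV_URL;

    /* Copy out the part before ":/" so it can be parsed on its own. */
    len = (size_t)(sep - dev_name);
    if (len == 0 || len >= sizeof(name))
        return P9_DEV_UNKNOWN;
    memcpy(name, dev_name, len);
    name[len] = '\0';

    if (inet_pton(AF_INET, name, addr) == 1)
        return P9_DEV_TCP4;
    if (inet_pton(AF_INET6, name, addr) == 1)
        return P9_DEV_TCP6;
    /* %n records how much was consumed, so trailing junk is rejected. */
    if (sscanf(name, "%d,%d%n", &rfd, &wfd, &used) == 2 &&
        name[used] == '\0' && rfd >= 0 && wfd >= 0)
        return P9_DEV_FD;
    return P9_DEV_UNKNOWN;
}
```

Note the ipv6 case only works because an address can't contain a slash,
so the first ":/" is always the separator.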

This thing's config symbol is under "network filesystems" for a reason...

>> I could also introduce a more verbose/explicit URL syntax:
>>
>>  transport://server/path
>>
> 
> It seems to me we can probably do both, but for the patch you can feel
> free to do your favorite and let someone else handle providing the
> support they want to provide.

Oh I can do all of it, this is fairly simple programming, I just want to
do it _right_.

>>
>> Doing this would remove the need for the "aname=" option, instead
>> parsing it from the "block" field.  Should I remove the aname option
>> when I do so?  (Deprecate it?)
>>
> 
> Leave the code in the client, but we can drop it from the common howto
> documentation.

Is it a big issue, deprecating experimental code?

  “Perfection is achieved, not when there is nothing more to add, but
  when there is nothing left to take away.” Antoine de Saint Exupéry

>> It seems to me that backwards compatibility's less of an issue with
>> 9p2000.L being up in the air.  Right now v9fs is fairly experimental.
>> I'd like to get the UI right before people start seriously using it, and
>> this change would mean "easy to use from busybox mount without touching
>> the existing busybox code".
> 
> Well, there is a substantial silent user base, so I want to try and
> preserve backwards compatibility as much as possible.

feature-removal-schedule.txt

>>
>>>>
>>>> Why is there no way to probe for what
>>>> version of the protocol the server supports instead of having to specify
>>>> it in a surprisingly case-sensitive way involving a lower-case p?
>>>>
>>>
>>> The client does negotiate, but 9p2000.L is not currently the default
>>> since it is still under active development and until diod was
>>> available there wasn't even a TCP server for it.  At some point in the
>>> future I wouldn't have a problem with 9p2000.L being the default, but
>>> not until we've had an opportunity to make sure that legacy servers (xcpu
>>> is my principal concern since it's probably the main production 9p
>>> server out there in use by end-users) don't blow up when the client
>>> sends them a 9p2000.L negotiation.  Once the 9p2000.L servers are more
>>> fully baked (as opposed to just released), I'd be happy to update the
>>> default for Linux.
>>
>> The diod drop we got is explicitly an -rc, but right now "Plan 9
>> Resource Sharing Support" in the network filesystems menu is marked as
>> experimental too.  Presumably things can still change until the
>> experimental tag gets removed.
>>
>> That said, I would like to work towards removing that tag ASAP, patches,
>> documentation, testing, etc.  This is cool technology and I would like
>> to help it <strike>drive out the horror of NFS</strike> mature and
>> become ubiquitous.
>>
> 
> I'm all for it.  Now that we are feature complete, perhaps we'll try and
> remove the EXPERIMENTAL tag from the kernel code next merge window.
> 
>      -eric

Eh, let's get the UI issues and testing happy first.  The experimental
tag means that we can still fix stuff with less pushback.

The problem with a silent user base is they don't report even obvious
breakage.  I'd like to get some feedback from people that it does what
they want.  Maybe write an LWN article...

And keep in mind there's an unresolved performance issue that may
require some sort of new readdir_stat() or something...  (And it has to
be readdir_many() rather than stat_many() because readdir() is a search
function that naturally returns multiple results for a single query, and
stat() is a query function that returns one result per query so you'd
have to feed IN long lists of things to query which turns into a gigantic
racy mess really fast.)

Let's focus on improving it first, freezing it and declaring victory
afterwards.

Rob