Re: [Relfs-devel] Some thought's.... - News about relfs

On Tuesday 04 January 2005 22:41, Peter Schrammel wrote:
> Hi
>
> I'm new to this list but I want to give my 2c for this gorgeous
> project!

Welcome! This list has been "a little" quiet since relfs was born, and I
have had less spare time to code (though I worked on relfs a lot anyway)
since my job became full-time.

I am going to start my Ph.D. now, so I guess I will have more free time.
In the meantime, there is some news:

=== 1. OCaml port of RelFS ===

I faced hard problems with C++, related to memory management and passing
of dynamically allocated memory between threads, and realized that I am
not a C++ expert, and also that C++ is not the ideal language for
quickly writing prototypes that implement new ideas. In a word, I found
that it would take more time for me to become good at coding in C++
than for any C++ coder to become good at coding in a simpler language
like OCaml.

I made up my mind and did the port to OCaml. The hardest part was getting
a multithreaded (in the sense that one handles multiple fs callbacks at
the same time) fuse binding for this language. There was one, but it was
not designed for multithreading and was not up to date. I wrote another
ocaml fuse binding, available at

http://www.sourceforge.net/projects/ocamlfuse

Even though it is up to date and designed for multithreading, it is not
stable enough in multithreaded mode: after a few million requests it
crashes for still unknown reasons, while in single-threaded mode it
works well.

This is not a serious problem at this time (requests usually interleave 
well with each other) but a production system can't afford to block 
while waiting for the cdrom tray to close - so I will have to make 
multithreading work better in the future - I have several alternatives 
and am sure that I will find my way, but I can keep on working on relfs 
in the meantime. Also, I will have to improve speed (right now filesystem
data is copied twice and this makes it slow - about 10Mb per second on
my centrino laptop).

I made a port of the relfs core, and I'm satisfied: the source is now 
about 300 lines of code, which is exactly its value :) Moreover, apart 
from functional programming and its very good type system, ocaml has 
superb multithreading primitives and libraries and I hope this will pay 
off during development. 

That said, I have not committed any changes to CVS, since I had no time
to write installation instructions (postgresql-ocaml is needed). However,
I think I am going to make a branch next week so that everybody can see
it.

=== 2. RelFS design, and goals for the first release ===

Here we come to your e-mail :) In the last few months I also redesigned
the DB schema, which now looks a little like what you propose. I will put
it in CVS next week.

>
> First I want to structure my thought's about  FS, Files and
> Databases....
>

1.
> Filesystems hold Objects with some properties:

2.
> Files hold Objects with some properties:

3.
> Databases hold Objects with some properties.

Ok for 1. and 3., while for 2. I am still unsure how to generalize it.
What a file holds is not so much properties, which could be seen as
extended attributes (fuse supports those), but a membership relation:
does it make sense to represent an mbox file as a directory of text
messages? Of course, but what should each message look like? The text of
the message without headers (because they can be shown as extended
attributes of the file) or the plain text of the message? And how does
one deal with message attachments? Each message should be a directory,
but that could be inconvenient from the user's point of view.

>
> ----
>
> 1. It would be nice to have the Objects of the FS/File in a DB.

This will be the first feature to be implemented, using index plugins.

> 2. It would be nice to have Objects of a DB queryable in the FS/File

This will not be completely done in the first release, because it 
requires a complex storage architecture: for each file or directory, it 
should be decided if its contents are provided by a raw file on the 
underlying filesystem, or by a plugin which performs a db query, or 
parses and reassembles a file on the fly and so on, and this "storage 
plugins" architecture is completely orthogonal to an "index plugins" 
architecture:

- Index plugins can't fail, they can be run in batch queues and there 
can be more than one index plugin for any file. 

- Storage plugins can fail, are used interactively and if more than one
plugin has to provide contents for a file, they should be stacked on
top of each other - e.g. a plugin which turns email messages into
directories, stacked on top of a plugin which reads email messages
from an ftp site or a pop3 server.

Even if storage plugins will not be in the first release, I absolutely
want directories representing queries on the db, like "my punk/rock
mp3s". Those will be implemented ad-hoc and then replaced by a storage
plugin.
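
A minimal sketch of the stacking idea described above, under stated
assumptions: the class names, the `read`/`listdir` methods and the
in-memory message source are all hypothetical stand-ins, not part of
relfs. The lower plugin plays the role of the pop3/ftp reader and may
fail; the upper plugin turns one message into a directory-like mapping.

```python
class StoragePluginError(Exception):
    """Raised when a storage plugin cannot provide contents."""


class InMemoryMailSource:
    """Hypothetical stand-in for a plugin reading messages from pop3 or ftp."""

    def __init__(self, messages):
        self.messages = messages

    def read(self, name):
        # Storage plugins can fail, so the failure is reported explicitly.
        if name not in self.messages:
            raise StoragePluginError("no such message: " + name)
        return self.messages[name]


class MessageAsDirectory:
    """Stacked on a lower plugin: exposes a raw message as a directory."""

    def __init__(self, lower):
        self.lower = lower

    def listdir(self, name):
        raw = self.lower.read(name)
        # Headers and body are separated by the first blank line.
        headers, _, body = raw.partition("\n\n")
        entries = {"body": body}
        for line in headers.splitlines():
            key, _, value = line.partition(": ")
            entries[key.lower()] = value
        return entries


source = InMemoryMailSource({"1": "Subject: hi\nFrom: a@example.org\n\nHello"})
stack = MessageAsDirectory(source)
```

Here `stack.listdir("1")` yields one entry per header plus a "body"
entry, while asking for a missing message raises the error coming from
the lower plugin.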

> 3. It would be nice to modify the DB in the FS/File

There are various implementations of this, and we can surely get one (if 
you refer to seeing db objects, like procedures, as files in the 
filesystem)

> 4. It would be nice to modify the FS/File in the DB
>

This will also be the purpose of index plugins but not in the first
release, where we won't do this bidirectional communication (I think it
can be done by extending the SQL server and using triggers).

> The first two aren't that hard I think but you should keep one thing
> in mind: a file is not it's filename it's an abstract concept. I
> would give files a UUID. A UUID can be given a name like
> "/archive/coolsong.mp3" or another name
> "/archive/genre/rock/coolsong.mp3" (some call this a link).
>

Yes, I now have a table which holds pairs of object ids and names, which
are not _paths_ but only the result of the "basename" function. There
is also a "membership" relationship between objects which gives rise to
paths. This way there is no logical difference between a file contained
in a zip archive and a file contained in a directory. However, I
will have to find a convenient syntax to allow the coexistence of a zip
file seen as a file and as a directory, in the same parent directory,
something like "file.zip" and "file.zip#" to represent the directory,
but where "#" is a character which is not allowed in a unix filename.
Unfortunately it seems that the only disallowed characters are "/" and
the NUL byte, neither of which would work in many applications.
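
The names-plus-membership model described above can be sketched as
follows. This is only an illustration: the integer ids, the dictionary
layout and the `paths_of` helper are assumptions, and the membership
relation is assumed to be acyclic. It shows how one object ends up with
two paths, as in the "coolsong.mp3" example.

```python
# Hypothetical data: names are basenames only, never full paths.
names = {1: "archive", 2: "coolsong.mp3", 3: "genre", 4: "rock"}

# (container_id, member_id) pairs; object 2 is a member of two containers.
membership = {(1, 2), (1, 3), (3, 4), (4, 2)}

# Objects that sit directly under the filesystem root.
ROOT_MEMBERS = {1}


def paths_of(obj_id):
    """Return every path under which obj_id is reachable, sorted."""
    result = []
    if obj_id in ROOT_MEMBERS:
        result.append("/" + names[obj_id])
    for container, member in membership:
        if member == obj_id:
            # Paths arise by walking the membership relation upwards.
            for prefix in paths_of(container):
                result.append(prefix + "/" + names[obj_id])
    return sorted(result)


print(paths_of(2))
```

Nothing in this sketch distinguishes a directory from a zip archive: both
would simply be containers in the membership relation.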

What are UUIDs? Is this a standard of some sort? Are UUIDs a function of 
file data? I am using just integers right now.

> The second problem is that you have to access the properties with the
> filename. Be realistic nobody will use special tools to query your FS.
>

I know :)

> A good approach would be what the guys from reiserFSv4 did
> (http://www.namespace.com):
> every file has it's properties accessed if it was a directory and the
>
> property a filename in the directory:
> :cat "/archive/coolsong.mp3/genre"
> :rock
>

I don't like this because a shell or a userspace program could perform
a "dirname" operation on the file name to get its path, find that the
path is not a directory and complain to the user - I would prefer a
different character than "/" e.g. /archive/coolsong.mp3#genre, but I am
still unsure.
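
The "dirname" concern above can be shown with a small sketch. The
`split_property` helper and the choice of "#" as separator are
assumptions made for illustration only. Since "#" is itself legal in
filenames, a real scheme would still need a rule for files whose names
contain it.

```python
import posixpath


def split_property(path, sep="#"):
    """Split 'file#property' into (file, property); property is None if absent."""
    head, found, prop = path.rpartition(sep)
    if not found:
        return path, None
    return head, prop


slash_style = "/archive/coolsong.mp3/genre"
hash_style = "/archive/coolsong.mp3#genre"

# With "/", dirname returns the mp3 itself, which is a file and not a directory.
slash_parent = posixpath.dirname(slash_style)

# With "#", dirname still returns the real containing directory.
hash_parent = posixpath.dirname(hash_style)

print(slash_parent, hash_parent, split_property(hash_style))
```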

> Question are there any properties in the filesystem that 
> are attached to the filename? Yes! A comment on the filename
> "/etc/shadows" is not a comment to it's content.

This is _exactly_ the motivating example for distinguishing properties
attached to paths from properties attached to files (objects) - it
seems that we are going to use relfs for the same reasons :)

> So usually a filename indicates a file as above. But you can do
> something like:
> echo "This file is a Security risk" > /etc/shadow/filename/comment
> you could even do comment's on comment ;-)
>
>
> What about directories? It's the same but it means some loss to the
> filenamespace ( a special filename e.g. .rfs indicates the
> properties)

This is another trouble with using "/" as the final separator.

>
> :echo "my brother's files! don't remove" > /archive/.rfs/comment
>
> So most of the 4 quest are solved:
>
> 1. putting the Objects of the FS/Files into a DB (I think you call it
> proxy) I thought of FAM (File alternation monitor) and udev: Just
> tell other programs, that something has changed and they'll do the
> rest. So your FS would just send a message e.g. on the DBUS that a
> file has been altered/ created. Daemons (even user-daemons) could
> listen on the dbus and do some caching of the information (like
> extracting mp3 tags, do checksums...). Don't force a specific DB
> schema on them...they'll hack around it.

In fact I am going to leave the db schema unspecified, just like the 
filesystem hierarchy in a linux distribution; I plan to use more than 
one communication protocol for indexing applications, e.g. dbus but 
also xmlrpc or just dynamic loading and linking of shared libraries, or 
a command line (a la CGI) interface for simple shell scripts.

> Here a simple aproach for a DB modell:
>
> FS Objects:
> UUID char(33)
>

The fact that you write "char(33)" makes me think that there is a
written specification. There is this problem with the identity of
objects: it is going to be lost if you copy the file "outside" the
filesystem, so you lose any information attached to this uuid, unless
it is like an md5sum, but then it is going to change when the file is
modified, and so we would need cascading updates of primary keys
everywhere. There is also the problem of open-endedness: how do I join
two different filesystems on different machines if object ids can
collide?
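
The trade-off above can be sketched as follows. UUIDs are a standardized
128-bit identifier format; a random (version 4) UUID is not a function of
file data, whereas an md5sum-style id is. The two helper names are
hypothetical, and this is a sketch of the two options rather than a
statement of what relfs does.

```python
import hashlib
import uuid


def content_id(data: bytes) -> str:
    """Id derived from content: survives copying, changes on every edit."""
    return hashlib.md5(data).hexdigest()


def new_object_id() -> str:
    """Random id: stable across edits, lost when the file leaves the filesystem."""
    return uuid.uuid4().hex


before = content_id(b"first version")
after = content_id(b"second version")
object_id = new_object_id()
print(before != after, len(object_id))
```

Random ids generated independently on two machines are extremely
unlikely to collide, which is what would make joining two filesystems
practical; small integers handed out by each machine would collide
almost immediately.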

> FS Attribnames:
> ID  integer e.g. 9088
> value  integer e.g. 78 (specific to the plugin)
> plugin  char(16) e.g. mp3tag
> name  char(16) e.g. artist

Hmmm, even if I realize that it could be useful to reify attribute 
names, I would prefer to keep a rich relational structure, like an 
"mp3" table with author, size etc as columns, where e.g. author is a 
foreign key into another table. I know there are many problems, but it 
would allow us to exploit the full power of a relational database in 
user applications.
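
A minimal sketch of such a rich relational structure, with assumptions
stated loudly: relfs uses PostgreSQL, while this example swaps in
Python's built-in sqlite3 only so that it runs without a server, and the
table names, column names and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE mp3 (
    object_id INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    size      INTEGER NOT NULL,
    author_id INTEGER NOT NULL REFERENCES author(id)
);
INSERT INTO author (id, name) VALUES (1, 'Some Band');
INSERT INTO mp3 VALUES (10, 'Cool Song', 4200000, 1);
INSERT INTO mp3 VALUES (11, 'Other Song', 3100000, 1);
""")


def titles_by_author(name):
    """Join mp3 to author through the foreign key and list the titles."""
    rows = conn.execute(
        "SELECT mp3.title FROM mp3 "
        "JOIN author ON mp3.author_id = author.id "
        "WHERE author.name = ? ORDER BY mp3.title",
        (name,),
    )
    return [title for (title,) in rows]


print(titles_by_author("Some Band"))
```

With a generic attribute-name table, the same question would need one
self-join per attribute; with typed tables it is an ordinary join.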

> Sorry my C++ is not the best but with this aproach I could even use
> haskell as filters FS->DB ;-).
>

You see, I switched from C++ to ocaml because mine was not the best
either :) However, language independence for index plugins is an
important requirement for relfs - I would like to be able to even use
shell scripts taking the modified file as argument, and outputting an
sql script.
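
The "file in, sql script out" contract described above might look like
this. It is a sketch under assumptions: the `file_size` table, the
helper names and the simplified quoting rule are all hypothetical, and a
real plugin would have to follow whatever quoting the target database
expects.

```python
import os


def sql_quote(text):
    """Quote a string as an SQL literal by doubling single quotes (simplified)."""
    return "'" + text.replace("'", "''") + "'"


def index_file(path):
    """Return an SQL script recording the current size of the modified file."""
    size = os.path.getsize(path)
    return (
        "DELETE FROM file_size WHERE path = {p};\n"
        "INSERT INTO file_size (path, size) VALUES ({p}, {s});\n"
    ).format(p=sql_quote(path), s=size)


def main(argv):
    """Entry point: argv[1] is the modified file, the script goes to stdout."""
    if len(argv) != 2:
        return 2
    print(index_file(argv[1]), end="")
    return 0
```

A plugin of this shape knows nothing about relfs itself; it only reads
one file and prints statements, which is what would make it easy to
write in any language.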

V.