From: Goswin v. B. <gos...@we...> - 2008-05-31 10:04:42
Hi,

I've decided on a design for my own filesystem, so now it is time to ask for some feedback in case I've overlooked any obvious flaws. The filesystem is for archiving data, lots of it. The most important goal is reliability. Everything else is secondary.

Features:
- the filesystem is COW (blocks are never changed but copied to a new location on every change)
- always consistent at all times - a crash might lose the last few seconds of writes, but never once fsync() or close() has completed
- all metadata and data is checksummed
- metadata is stored with mirroring for simplicity
- data is stored striped with parity (raid5) to maximize space
- checksums are checked on read and errors are corrected
- files can be restriped on the fly (increase/decrease stripe count)
- disks can be added, vacated and removed on the fly
- online defragmenter (also does restriping and vacating); this might actually do a little defragmenting for every write

So much for what I want to achieve. Now here is my design:

The basic building block for everything is 4K for now. I might try bigger blocks later, but a page seems like a nice round size. Every object (file, directory) is stored in a B-tree keyed by the SHA1 checksum of its full path. Blocks of a file are also stored in a B-tree and use an 8 bit disk number and 56 bit block number.

Every metadata block starts with a 48 byte header:

  class DiskBlock {
      BlockType type;
      CheckSum  checksum;
      CheckSum  parent;
  };

BlockType is a 64 bit magic for Data Pointer, Data Node, File, Dir, Node, Root or Super (see below). checksum is the SHA1 checksum of the block with checksum set to zero. parent is the checksum of the file (or the directory, for files) the block belongs to. This is useful for recovery if things blow up, or for fsck.

Every disk starts with an array of Superblocks identifying the filesystem and disk:

  class DiskSuper : public DiskBlock {
      UUID      fs_uuid;
      UUID      disk_uuid;
      uint64_t  super_serial;
      uint64_t  version;
      BlockPair root1;
      BlockPair root2;
      BlockPair root3;
      int       num_devices;
      UUID      device_uuids[MAX_DEVICES];
  };

On startup the array is scanned to find the superblock with the highest serial. If it is not present on all disks (crashed during a superblock update) a lower serial can be used. The version is used to allow future changes. BlockPair is just a pair of device+block references which are mirrors of each other. On every sync() the next root* pair is overwritten in a round-robin fashion. At any time there are at least 2 valid root pairs while one might currently be in transit. Those 3 (6 with mirroring) blocks are the only blocks ever overwritten (instead of copied) during normal operation.

The 3 root* pairs point to Root blocks:

  class DiskRoot : public DiskBlock {
      uint64_t       serial;
      struct statvfs statvfs_buf;
      BlockPair      root_dir;
      Block          free[MAX_DEVICES];
  };

serial is used to detect the newest block. statvfs_buf keeps track of total size, free, used, ... root_dir points to the B-tree (Node Block) for all objects (files, directories). free points to a file block that is made up of all the free blocks for each device, i.e. blocks not used in the active or backup root.

As said above, all objects are stored in a B-tree keyed by the SHA1 checksum of their full path:

  class DiskNode : public DiskBlock {
      int       used;
      CheckSum  checksums[NUM_KEYS-1];
      BlockPair pairs[NUM_KEYS];
  };

The top node can have 0 entries (empty filesystem). All other nodes in the tree are at least half full, as usual in a B-tree.
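A quick aside on the checksumming before continuing with the tree. Verifying a block would look roughly like this; this is only a sketch, assuming OpenSSL's SHA1(), a packed header where the checksum sits right after the 64 bit type magic (8 + 20 + 20 = 48 bytes), and made-up names:

  #include <openssl/sha.h>
  #include <cstddef>
  #include <cstring>

  struct CheckSum { unsigned char digest[20]; };      // SHA1 digest

  static const size_t BLOCK_SIZE      = 4096;
  static const size_t CHECKSUM_OFFSET = 8;            // assumed: right after the type magic

  // Recompute the SHA1 of the block with the checksum field zeroed and
  // compare it against the stored checksum.
  bool verify_block(const unsigned char* block)
  {
      unsigned char copy[BLOCK_SIZE];
      std::memcpy(copy, block, BLOCK_SIZE);

      CheckSum stored;
      std::memcpy(stored.digest, copy + CHECKSUM_OFFSET, sizeof(stored.digest));
      std::memset(copy + CHECKSUM_OFFSET, 0, sizeof(stored.digest));

      CheckSum computed;
      SHA1(copy, BLOCK_SIZE, computed.digest);

      return std::memcmp(stored.digest, computed.digest, sizeof(stored.digest)) == 0;
  }

On a mismatch the block would be re-read from its mirror or rebuilt from the stripe's parity, which is what "checksums are checked on read and errors are corrected" boils down to.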
The leaves of the object B-tree are detected by their type, either File Block or Directory Block (actually anything that is not a Node Block). File lookup, creation and deletion are all O(log n).

Currently File and Directory Blocks use the same data structure:

  class DiskObject : public DiskBlock {
      struct stat stat_buf;
      BlockPair   data_node_or_pointer;
      int         name_len;
      char        name[MOOSE_PATH_MAX];
  };

stat_buf contains the attributes of the file or directory. data_node_or_pointer is either a DataNode or a DataPointer Block: a DataNode is another B-tree with DataPointers as leaves and has at least 2 leaves; otherwise the DataPointer is stored directly. name is the full path of the object. [Should I have a "BlockPair next" to handle SHA1 checksum collisions?]

For files, data_node_or_pointer handles the actual data in the file. For directories, the data is an unsorted array of SHA1 checksums of the objects in that directory, for use in readdir(). This means that each readdir() will be O(log n) and ls will be O(log n * number of files in the directory). While not ideal, I think that is acceptable.

A DataNode block is also a B-tree containing:

  class DiskDataNode : public DiskBlock {
      int       used;
      uint64_t  offsets[NUM_KEYS-1];
      BlockPair pairs[NUM_KEYS];
  };

This time the keys are offsets in the file, and the pairs are either DataNode blocks or DataPointer blocks containing:

  class DiskDataPointerBlock : public DiskBlock {
      int      blocks_used;
      CheckSum checksums[MAX_BLOCKS];
      uint64_t offsets[MAX_BLOCKS];
      Block    blocks[MAX_BLOCKS];
  };

Now this needs some more explaining. The DataPointer block contains a series of stripes of data blocks. The end of a stripe is indicated by offsets[i] == 0 (except that the first block is never the end of a stripe). Here is an example (numbers shortened):

  blocks_used: 9
  checksums:   xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
  offsets:        0    0    1    2    0    3    6    9    0
  blocks:      1010 2020 1011 2035 3032 4017 1014 3011 2017

Here the file consists of 3 stripes: a mirror stripe (1010 + 2020) and 2 XOR stripes (1011 + 2035, parity 3032) and (4017 + 1014 + 3011, parity 2017). The file is also sparse.

Now, how will this all work together? I think it is best if I describe what happens in two examples.

What happens on write()? The filesystem is COW everywhere, so a new block is allocated. If the block already exists in the file and the write is partial, the old block is read, the changes are copied in, and the result is written to the new location. The mirror/parity block of the stripe the block belongs to is recomputed and also stored in a new location. If the block is new to the file, an existing stripe can be grown or split into 2.

The DataPointer block containing the change is copied to a new location and the stripe information is updated. If necessary the DataPointer block is split in 2. As the DataPointer block is mirrored this actually allocates 2 blocks on different disks; this holds true for all metadata and won't be mentioned again.

The DataNode block containing the DataPointer block is copied to a new location and updated to point to the new block (BlockPair actually). If the DataPointer was split in 2 then the DataNode might have to split as well or, if there was no DataNode yet, a new one has to be created. This can cause a domino effect in the B-tree, splitting the node on each level and creating a new root in the end.

The DiskObject block containing the DataPointer or DataNode block gets copied to a new location and updated to point to the new block. The stat_buf gets updated with the new atime/mtime/ctime/size/etc. That is all for the file itself.
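To make the parity handling concrete: the parity of a stripe is just the XOR of its data blocks, so recomputing it is something like this (sketch only, function name and container are made up):

  #include <cstdint>
  #include <cstring>
  #include <vector>

  static const size_t BLOCK_SIZE = 4096;

  // Recompute the parity block of a raid5 stripe as the XOR of all its
  // data blocks. A mirror stripe (one data block) degenerates to a copy.
  void recompute_parity(const std::vector<const uint8_t*>& data_blocks,
                        uint8_t* parity /* BLOCK_SIZE bytes */)
  {
      std::memset(parity, 0, BLOCK_SIZE);
      for (const uint8_t* block : data_blocks)
          for (size_t i = 0; i < BLOCK_SIZE; ++i)
              parity[i] ^= block[i];
  }

When only one block of a stripe changes, the new parity could also be computed as old parity XOR old data XOR new data, so the rest of the stripe does not have to be read.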
But we have actually created a new file that shares most data with the old file. Now the object B-tree must be updated to point to the new file instead of the old. So the DiskNode block containing the DiskObject gets copied to a new location and the BlockPair updated. While no splitting can occur here, the change still needs to propagate up the tree until it reaches the root. Each DiskNode is copied in turn.

Last, the DiskRoot block has to be updated. We have 3 of them, used in round-robin fashion. So a new DiskRoot block is created in memory with the next serial, the new root_dir, the new free array and an updated statvfs_buf. Next, all blocks are sync()ed to disk. Only then is the new DiskRoot written, causing the write to the file to atomically appear on disk. If the system crashes before this, the write is lost.

What happens when a file is created? A new DiskObject is allocated, initially empty. The DiskNode where the new DiskObject should be inserted is found and the new BlockPair is inserted into a copy of it. This might cause the DiskNode to split. Again the new location must be propagated to the parent, up to the root. Also the DiskObject for the directory containing the file is found and the new SHA1 checksum is added to it. This is basically the same as a normal write to a file; see above for that.

As you can see, a lot of blocks are copied for a single write. A write will always cause O(log n) copies. To get any speed out of the filesystem, writes must be combined as much as possible. At a minimum a new DiskRoot must be written to disk on every fsync() and flush() [close()]; software expects data to be on disk after those calls. A new DiskRoot must also be written when I run out of free blocks, since blocks can only be marked free and reused once the old DiskRoot referencing them is overwritten. Both conditions can take a long time to occur (and let's hope they do, so this becomes efficient), so a DiskRoot should also be written after a certain time (if anything has changed).

Note that writes can be combined from any block and file. As long as the DiskRoot has not been written yet, other blocks can be marked dirty when they are first copied and then changed in place until the DiskRoot referencing them is written. The nearer the changed blocks are to each other, the earlier on the way to the root they will hit the same block. If the kernel's block cache still has the block unwritten, the next change will replace it, saving all the writes for the remaining path to the root. So n writes may cause only O(n) copies. The blocks can also be cached in moosefs for a while before being written at all. This would be most valuable to avoid recomputing the parity block over and over when blocks of the same stripe are written.

As a bonus, here is a simplified diagram of the filesystem structure:

               Superblock
                   |
                  Root
                 /    \
              Node    Node
              /  \
         Object  Object
          /  \
    DataNode  DataNode
         /  \
  DataPointer DataPointer
   ____/ / \  \ \____
  /     /   \  \     \
 [Block Block] [Block Block Block]

Unsolved!
=========

One thing I'm undecided about is the actual structure to manage the free blocks. I can maintain a B-tree of all free blocks, which would be like a file consisting of the free blocks, as mentioned above. Updating that from DiskRoot to DiskRoot is a bit complicated though, and I might even need a 4th DiskRoot for it to work at all. The advantage would be that the space needed to manage this shrinks the fewer blocks are free, i.e. I get more data on my disks.

On the other hand I could do reference counting for every block.
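Concretely, I picture reference counting as nothing more than one small counter per block and device (again just a sketch with made-up names):

  #include <cstdint>
  #include <vector>
  #include <cassert>

  // One counter per 4K block on a device. With uint8_t the count tops out
  // at 255, which is where the "up to 250 snapshots" figure mentioned below
  // comes from; uint16_t would allow ~65k references at twice the space.
  class BlockRefCounts {
  public:
      explicit BlockRefCounts(uint64_t num_blocks) : counts_(num_blocks, 0) {}

      void ref(uint64_t block)   { assert(counts_[block] < 255); ++counts_[block]; }

      // Returns true when the block became free and may be reused
      // (once no old DiskRoot references it any more).
      bool unref(uint64_t block) { assert(counts_[block] > 0); return --counts_[block] == 0; }

      bool is_free(uint64_t block) const { return counts_[block] == 0; }

  private:
      std::vector<uint8_t> counts_;
  };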
This would require a fixed amount of space no matter how much is used. On the other hand this makes maintaining it much simpler and would trivially allow snapshotting. Allowing for up to 250 snapshots, this would require 256 MiB for a 1 TiB disk (one byte per 4K block, and 1 TiB / 4K = 256 Mi blocks), or 512 MiB when allowing for 65530 snapshots (two bytes per block). I tend towards reference counting for simplicity's sake and because snapshots would be cool. The cost is ~0.05% of the total disk space. Negligible.

So what do you think?

MfG
        Goswin
From: Valient G. <va...@gm...> - 2008-06-08 02:49:30
On Sat, May 31, 2008 at 3:04 AM, Goswin von Brederlow <gos...@we...> wrote:
> Unsolved!
> =========
>
> One thing I'm undecided about is the actual structure to manage the
> free blocks. I can maintain a B-tree of all free blocks, which would
> be like a file consisting of the free blocks, as mentioned
> above. Updating that from DiskRoot to DiskRoot is a bit complicated
> though, and I might even need a 4th DiskRoot for it to work at all. The
> advantage would be that the space needed to manage this shrinks the
> fewer blocks are free, i.e. I get more data on my disks.

I wrote a proof of concept that used a similar scheme. However, mine translated the other way: it provided a block device using a filesystem as the backing store. Reference counting was handled by using hard links in the underlying filesystem. A non-leaf node in the tree contains nothing but links to other blocks. In my case, the name of the reference was its SHA hash. The block was stored as both a file and a directory of hard links to child nodes.

As you say, using reference counts makes it easy to take snapshots - just add a reference to a root node.

The idea of a hash tree goes back a long way - it's basically a Merkle tree, which has applications in distributed systems protocols. ZFS works in a similar way, so you may consider looking at ZFS for ideas, or even consider using the ZFS fuse port itself.

cheers,
Valient
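P.S.: Very roughly, the hard link trick looked like this - heavily simplified, with made-up paths and names, not the actual proof of concept code:

  #include <string>
  #include <unistd.h>    // link(), unlink()

  // Every block lives in a content store as a file named after its SHA hash.
  // Each reference from a parent node is one more hard link, so the
  // filesystem's own link count acts as the reference count.
  static const std::string STORE = "/srv/blocks";    // made-up location

  // Add a reference from a parent node's directory to a block.
  int add_reference(const std::string& sha_hex, const std::string& parent_dir)
  {
      return link((STORE + "/" + sha_hex).c_str(),
                  (parent_dir + "/" + sha_hex).c_str());
  }

  // Drop a reference; the kernel frees the block's data once the last
  // link is gone.
  int drop_reference(const std::string& sha_hex, const std::string& parent_dir)
  {
      return unlink((parent_dir + "/" + sha_hex).c_str());
  }

A snapshot then really is just one extra link to a root node.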