From: Goswin v. B. <gos...@we...> - 2008-05-31 10:04:42
Hi,

I've decided on a design for my own filesystem, so now it is time to ask for some feedback in case I've overlooked any obvious flaws. The filesystem is for archiving data, lots of it. The most important goal is reliability. Everything else is secondary.

Features:
- the filesystem is COW (blocks are never changed but copied to a new location on every change)
- always consistent at all times - a crash might lose the last few seconds of writes, but never once fsync() or close() has completed
- all metadata and data is checksummed
- metadata is stored with mirroring for simplicity
- data is stored striped with parity (raid5) to maximize space
- checksums are checked on read and errors are corrected
- files can be restriped on the fly (increase/decrease stripe count)
- disks can be added, vacated and removed on the fly
- online defragmenter (also does restriping and vacating); this might actually do a little defragmenting for every write

So much for what I want to achieve. Now here is my design:

The basic building block for everything is 4K for now. I might try bigger blocks later, but a page seems like a nice round size. Every object (file, directory) is stored in a B-tree keyed by the SHA1 checksum of its full path. Blocks of a file are also stored in a B-tree and use an 8 bit disk number and 56 bit block number.

Every metadata block starts with a 48 byte header:

  class DiskBlock {
      BlockType type;
      CheckSum  checksum;
      CheckSum  parent;
  };

BlockType is a 64 bit magic for Data Pointer, Data Node, File, Dir, Node, Root or Super (see below). checksum is the SHA1 checksum of the block with checksum set to zero. parent is the checksum of the file (or the directory, for files) the block belongs to. This is useful for recovery if things blow up, or for fsck.

Every disk starts with an array of Superblocks identifying the filesystem and disk:

  class DiskSuper : public DiskBlock {
      UUID      fs_uuid;
      UUID      disk_uuid;
      uint64_t  super_serial;
      uint64_t  version;
      BlockPair root1;
      BlockPair root2;
      BlockPair root3;
      int       num_devices;
      UUID      device_uuids[MAX_DEVICES];
  };

On startup the array is scanned to find the superblock with the highest serial. If it is not present on all disks (crashed during a superblock update) a lower serial can be used. The version is used to allow future changes. BlockPair is just a pair of device+block references which are mirrors of each other. On every sync() the next root* pair is overwritten in a round-robin fashion. At any time there are at least 2 valid root pairs while one might currently be in transit. Those 3 (6 with mirroring) blocks are the only blocks ever overwritten (instead of copied) during normal operation.

The 3 root* pairs point to Root blocks:

  class DiskRoot : public DiskBlock {
      uint64_t       serial;
      struct statvfs statvfs_buf;
      BlockPair      root_dir;
      Block          free[MAX_DEVICES];
  };

serial is used to detect the newest block. statvfs_buf keeps track of total size, free, used, ... root_dir points to the B-tree (Node Block) for all objects (files, directories). free points to a file block that is made up of all the free blocks for each device, i.e. blocks not used in the active or backup root.

As said above, all objects are stored in a B-tree keyed by the SHA1 checksum of their full path:

  class DiskNode : public DiskBlock {
      int       used;
      CheckSum  checksums[NUM_KEYS-1];
      BlockPair pairs[NUM_KEYS];
  };

The top node can have 0 entries (empty filesystem). All other nodes in the tree are at least half full, as usual in a B-tree.
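A quick aside on the checksumming before continuing with the tree. Verifying a block would look roughly like this; this is only a sketch, assuming OpenSSL's SHA1(), a packed header where the checksum sits right after the 64 bit type magic (8 + 20 + 20 = 48 bytes), and made-up names:

  #include <openssl/sha.h>
  #include <cstddef>
  #include <cstring>

  struct CheckSum { unsigned char digest[20]; };      // SHA1 digest

  static const size_t BLOCK_SIZE      = 4096;
  static const size_t CHECKSUM_OFFSET = 8;            // assumed: right after the type magic

  // Recompute the SHA1 of the block with the checksum field zeroed and
  // compare it against the stored checksum.
  bool verify_block(const unsigned char* block)
  {
      unsigned char copy[BLOCK_SIZE];
      std::memcpy(copy, block, BLOCK_SIZE);

      CheckSum stored;
      std::memcpy(stored.digest, copy + CHECKSUM_OFFSET, sizeof(stored.digest));
      std::memset(copy + CHECKSUM_OFFSET, 0, sizeof(stored.digest));

      CheckSum computed;
      SHA1(copy, BLOCK_SIZE, computed.digest);

      return std::memcmp(stored.digest, computed.digest, sizeof(stored.digest)) == 0;
  }

On a mismatch the block would be re-read from its mirror or rebuilt from the stripe's parity, which is what "checksums are checked on read and errors are corrected" boils down to.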
The leaves of the object B-tree are detected by their type, either File Block or Directory Block (actually anything that is not a Node Block). File lookup, creation and deletion are all O(log n).

Currently File and Directory Blocks use the same data structure:

  class DiskObject : public DiskBlock {
      struct stat stat_buf;
      BlockPair   data_node_or_pointer;
      int         name_len;
      char        name[MOOSE_PATH_MAX];
  };

stat_buf contains the attributes of the file or directory. data_node_or_pointer is either a DataNode or a DataPointer Block: a DataNode is another B-tree with DataPointers as leaves and has at least 2 leaves; otherwise the DataPointer is stored directly. name is the full path of the object. [Should I have a "BlockPair next" to handle SHA1 checksum collisions?]

For files, data_node_or_pointer handles the actual data in the file. For directories, the data is an unsorted array of SHA1 checksums of the objects in that directory, for use in readdir(). This means that each readdir() will be O(log n) and ls will be O(log n * number of files in the directory). While not ideal, I think that is acceptable.

A DataNode block is also a B-tree containing:

  class DiskDataNode : public DiskBlock {
      int       used;
      uint64_t  offsets[NUM_KEYS-1];
      BlockPair pairs[NUM_KEYS];
  };

This time the keys are offsets in the file, and the pairs are either DataNode blocks or DataPointer blocks containing:

  class DiskDataPointerBlock : public DiskBlock {
      int      blocks_used;
      CheckSum checksums[MAX_BLOCKS];
      uint64_t offsets[MAX_BLOCKS];
      Block    blocks[MAX_BLOCKS];
  };

Now this needs some more explaining. The DataPointer block contains a series of stripes of data blocks. The end of a stripe is indicated by offsets[i] == 0 (except that the first block is never the end of a stripe). Here is an example (numbers shortened):

  blocks_used: 9
  checksums:   xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
  offsets:        0    0    1    2    0    3    6    9    0
  blocks:      1010 2020 1011 2035 3032 4017 1014 3011 2017

Here the file consists of 3 stripes: a mirror stripe (1010 + 2020) and 2 XOR stripes (1011 + 2035, parity 3032) and (4017 + 1014 + 3011, parity 2017). The file is also sparse.

Now, how will this all work together? I think it is best if I describe what happens in two examples.

What happens on write()? The filesystem is COW everywhere, so a new block is allocated. If the block already exists in the file and the write is partial, the old block is read, the changes are copied in, and the result is written to the new location. The mirror/parity block of the stripe the block belongs to is recomputed and also stored in a new location. If the block is new to the file, an existing stripe can be grown or split into 2.

The DataPointer block containing the change is copied to a new location and the stripe information is updated. If necessary the DataPointer block is split in 2. As the DataPointer block is mirrored this actually allocates 2 blocks on different disks; this holds true for all metadata and won't be mentioned again.

The DataNode block containing the DataPointer block is copied to a new location and updated to point to the new block (BlockPair actually). If the DataPointer was split in 2 then the DataNode might have to split as well or, if there was no DataNode yet, a new one has to be created. This can cause a domino effect in the B-tree, splitting the node on each level and creating a new root in the end.

The DiskObject block containing the DataPointer or DataNode block gets copied to a new location and updated to point to the new block. The stat_buf gets updated with the new atime/mtime/ctime/size/etc. That is all for the file itself.
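To make the parity handling concrete: the parity of a stripe is just the XOR of its data blocks, so recomputing it is something like this (sketch only, function name and container are made up):

  #include <cstdint>
  #include <cstring>
  #include <vector>

  static const size_t BLOCK_SIZE = 4096;

  // Recompute the parity block of a raid5 stripe as the XOR of all its
  // data blocks. A mirror stripe (one data block) degenerates to a copy.
  void recompute_parity(const std::vector<const uint8_t*>& data_blocks,
                        uint8_t* parity /* BLOCK_SIZE bytes */)
  {
      std::memset(parity, 0, BLOCK_SIZE);
      for (const uint8_t* block : data_blocks)
          for (size_t i = 0; i < BLOCK_SIZE; ++i)
              parity[i] ^= block[i];
  }

When only one block of a stripe changes, the new parity could also be computed as old parity XOR old data XOR new data, so the rest of the stripe does not have to be read.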
But we have actually created a new file that shares most data with the old file. Now the object B-tree must be updated to point to the new file instead of the old. So the DiskNode block containing the DiskObject gets copied to a new location and the BlockPair updated. While no splitting can occur here, the change still needs to propagate up the tree until it reaches the root. Each DiskNode is copied in turn.

Last, the DiskRoot block has to be updated. We have 3 of them, used in round-robin fashion. So a new DiskRoot block is created in memory with the next serial, the new root_dir, the new free array and an updated statvfs_buf. Next, all blocks are sync()ed to disk. Only then is the new DiskRoot written, causing the write to the file to atomically appear on disk. If the system crashes before this, the write is lost.

What happens when a file is created? A new DiskObject is allocated, initially empty. The DiskNode where the new DiskObject should be inserted is found and the new BlockPair is inserted into a copy of it. This might cause the DiskNode to split. Again the new location must be propagated to the parent, up to the root. Also the DiskObject for the directory containing the file is found and the new SHA1 checksum is added to it. This is basically the same as a normal write to a file; see above for that.

As you can see, a lot of blocks are copied for a single write. A write will always cause O(log n) copies. To get any speed out of the filesystem, writes must be combined as much as possible. At a minimum a new DiskRoot must be written to disk on every fsync() and flush() [close()]; software expects data to be on disk after those calls. A new DiskRoot must also be written when I run out of free blocks, since blocks can only be marked free and reused once the old DiskRoot referencing them is overwritten. Both conditions can take a long time to occur (and let's hope they do, so this becomes efficient), so a DiskRoot should also be written after a certain time (if anything has changed).

Note that writes can be combined from any block and file. As long as the DiskRoot has not been written yet, other blocks can be marked dirty when they are first copied and then changed in place until the DiskRoot referencing them is written. The nearer the changed blocks are to each other, the earlier on the way to the root they will hit the same block. If the kernel's block cache still has the block unwritten, the next change will replace it, saving all the writes for the remaining path to the root. So n writes may cause only O(n) copies. The blocks can also be cached in moosefs for a while before being written at all. This would be most valuable to avoid recomputing the parity block over and over when blocks of the same stripe are written.

As a bonus, here is a simplified diagram of the filesystem structure:

               Superblock
                   |
                  Root
                 /    \
              Node    Node
              /  \
         Object  Object
          /  \
    DataNode  DataNode
         /  \
  DataPointer DataPointer
   ____/ / \  \ \____
  /     /   \  \     \
 [Block Block] [Block Block Block]

Unsolved!
=========

One thing I'm undecided about is the actual structure to manage the free blocks. I can maintain a B-tree of all free blocks, which would be like a file consisting of the free blocks, as mentioned above. Updating that from DiskRoot to DiskRoot is a bit complicated though, and I might even need a 4th DiskRoot for it to work at all. The advantage would be that the space needed to manage this shrinks the fewer blocks are free, i.e. I get more data on my disks.

On the other hand I could do reference counting for every block.
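Concretely, I picture reference counting as nothing more than one small counter per block and device (again just a sketch with made-up names):

  #include <cstdint>
  #include <vector>
  #include <cassert>

  // One counter per 4K block on a device. With uint8_t the count tops out
  // at 255, which is where the "up to 250 snapshots" figure mentioned below
  // comes from; uint16_t would allow ~65k references at twice the space.
  class BlockRefCounts {
  public:
      explicit BlockRefCounts(uint64_t num_blocks) : counts_(num_blocks, 0) {}

      void ref(uint64_t block)   { assert(counts_[block] < 255); ++counts_[block]; }

      // Returns true when the block became free and may be reused
      // (once no old DiskRoot references it any more).
      bool unref(uint64_t block) { assert(counts_[block] > 0); return --counts_[block] == 0; }

      bool is_free(uint64_t block) const { return counts_[block] == 0; }

  private:
      std::vector<uint8_t> counts_;
  };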
This would require a fixed amount of space no matter how much is used. On the other hand this makes maintaining it much simpler and would trivially allow snapshotting. Allowing for up to 250 snapshots, this would require 256 MiB for a 1 TiB disk (one byte per 4K block, and 1 TiB / 4K = 256 Mi blocks), or 512 MiB when allowing for 65530 snapshots (two bytes per block). I tend towards reference counting for simplicity's sake and because snapshots would be cool. The cost is ~0.05% of the total disk space. Negligible.

So what do you think?

MfG
        Goswin
From: Valient G. <va...@gm...> - 2008-06-08 02:49:30
On Sat, May 31, 2008 at 3:04 AM, Goswin von Brederlow <gos...@we...> wrote:
> Unsolved!
> =========
>
> One thing I'm undecided about is the actual structure to manage the
> free blocks. I can maintain a B-tree of all free blocks, which would
> be like a file consisting of the free blocks, as mentioned
> above. Updating that from DiskRoot to DiskRoot is a bit complicated
> though, and I might even need a 4th DiskRoot for it to work at all. The
> advantage would be that the space needed to manage this shrinks the
> fewer blocks are free, i.e. I get more data on my disks.

I wrote a proof of concept that used a similar scheme. However, mine translated the other way: it provided a block device using a filesystem as the backing store. Reference counting was handled by using hard links in the underlying filesystem. A non-leaf node in the tree contains nothing but links to other blocks. In my case, the name of the reference was its SHA hash. The block was stored as both a file and a directory of hard links to child nodes.

As you say, using reference counts makes it easy to take snapshots - just add a reference to a root node.

The idea of a hash tree goes back a long way - it's basically a Merkle tree, which has applications in distributed systems protocols. ZFS works in a similar way, so you may consider looking at ZFS for ideas, or even consider using the ZFS fuse port itself.

cheers,
Valient
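P.S.: Very roughly, the hard link trick looked like this - heavily simplified, with made-up paths and names, not the actual proof of concept code:

  #include <string>
  #include <unistd.h>    // link(), unlink()

  // Every block lives in a content store as a file named after its SHA hash.
  // Each reference from a parent node is one more hard link, so the
  // filesystem's own link count acts as the reference count.
  static const std::string STORE = "/srv/blocks";    // made-up location

  // Add a reference from a parent node's directory to a block.
  int add_reference(const std::string& sha_hex, const std::string& parent_dir)
  {
      return link((STORE + "/" + sha_hex).c_str(),
                  (parent_dir + "/" + sha_hex).c_str());
  }

  // Drop a reference; the kernel frees the block's data once the last
  // link is gone.
  int drop_reference(const std::string& sha_hex, const std::string& parent_dir)
  {
      return unlink((parent_dir + "/" + sha_hex).c_str());
  }

A snapshot then really is just one extra link to a root node.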