[Bsdbook-commit] cvs commit: bsdbook/src filesys.tex,NONE,1.1
From: <dr...@us...> - 2002-08-28 11:02:39
Update of /cvsroot/bsdbook/bsdbook/src
In directory usw-pr-cvs1:/tmp/cvs-serv11100

Added Files:
	filesys.tex
Log Message:
prototype of filesystem chapter

--- NEW FILE: filesys.tex ---
% Contents: Filesystem
% $Id: filesys.tex,v 1.1 2002/08/28 11:02:28 drozd Exp $

CHAPTER: Filesystem architecture

SECTION: Overview

Most of the data that UNIX handles is stored in files. Files are
organized in a tree and kept on some data store, usually one or more
hard disks.

The first UNIX filesystem was s5fs (System V FileSystem), designed for
System V UNIX. The FFS filesystem, designed at Berkeley, was introduced
later and brought a number of serious improvements in reliability and
performance. The modern filesystems used in many UNIX flavours exploit
the same basic ideas and concepts pioneered by s5fs and FFS.

In earlier times, the UNIX kernel supported only one filesystem. UNIX
vendors therefore had to choose among the several filesystems available
at the time. To solve this problem, a new architecture was developed,
called the Virtual File System, which allows different filesystems to
be used at the same time. Several vendors attempted this, but Sun's
implementation proved the most flexible and clean. It was later ported,
with few changes, to BSD UNIX (McKusick et al).

This chapter first describes the s5fs architecture. It is quite
outdated nowadays, but it is worth describing because of its
simplicity. The chapter then discusses the VFS (Virtual File System)
interface used in FreeBSD, and gives an overview of how files are
handled in a user process and of the associated data structures.

SECTION: s5fs architecture

It is worth noting that UNIX represents the medium on which a
filesystem resides as a flat array of blocks. A block is a chunk of
data of fixed size. The block size is defined by the underlying driver
and is the minimal amount of data that the driver can read from or
write to the medium. Usually, the block size reflects the size of a
physical sector of the underlying medium.
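
The flat-array view of a device can be sketched in a few lines of C.
This is only an illustration: the names block_offset and read_block,
the 512-byte BLOCK_SIZE, and the byte array standing in for a device
are assumptions of this example and are not taken from any real driver
interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 512  /* ASSUMPTION: block size in bytes */

/* Byte offset of block number blkno within a flat array of blocks. */
size_t block_offset(size_t blkno)
{
    return blkno * BLOCK_SIZE;
}

/*
 * Copy one whole block out of a "device" modelled as a byte array
 * (hypothetical stand-in for a real driver).
 * Returns 0 on success, -1 if blkno lies beyond the end of the device.
 */
int read_block(const unsigned char *dev, size_t nblocks,
               size_t blkno, unsigned char *buf)
{
    assert(dev != NULL && buf != NULL);
    if (blkno >= nblocks)
        return -1;
    memcpy(buf, dev + block_offset(blkno), BLOCK_SIZE);
    return 0;
}
```

The point of the sketch is that a caller never asks for less than one
block: every transfer is a whole block identified only by its number.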
s5fs resides in one disk partition and consists of three major parts:

o superblock
    Contains general information about the filesystem.
o ilist
    An array of disk inodes. Contains the metadata for all files in
    the filesystem. One of the inodes is the root inode, and every
    other file is looked up through it. The array has a fixed size,
    i.e. a fixed number of inodes is available in the filesystem.
o data blocks
    Data storage for files. Each block is 512 bytes long.

SUBSECTION: The inode

The key element of file representation in s5fs is the 'inode', which
stands for 'index node'. An inode is a data structure that resides on
disk and contains the file's metadata. Here is a list of the most
important fields of the disk inode (struct dinode):

o di_mode
    File type (IFREG, IFDIR, IFBLK etc)
o di_uid, di_gid
    UID and GID of the file's owner
o di_size
    File size in bytes
o di_atime, di_mtime, di_ctime
    Time of last access, last modification, and last metadata
    modification, respectively
o di_nlinks
    Number of hard links
o di_addr[13]
    Data block addresses

The prefix `di_' stands for ''Disk Inode''.

When a process requests a file, the kernel reads the associated inode
into memory. The memory copy (represented by struct inode) is somewhat
different from the disk inode. Some fields of the disk inode are
unnecessary to keep in memory, for instance the modification times. On
the other hand, the memory copy requires some additional information,
for example inode locks, which synchronize file access among several
processes. Therefore, there are two structures: the on-disk inode and
the in-memory inode, also called the in-core inode. Data fields of the
in-core inode carry the prefix `i_'.

An inode must record where the file's data is stored. Since a file's
data blocks may not be contiguous, the inode must keep the block
addresses explicitly. This is the role of the di_addr field.
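
One way to picture the role of di_addr is as a lookup from a logical
block number within a file to a slot in the array. The sketch below is
a hedged illustration, not kernel code. It assumes the slot layout used
in this chapter: the first ten slots hold direct block addresses, the
next two point to single-indirect blocks, and the last one points to a
double-indirect block. The names map_block and struct addr_path are
hypothetical, and NADDR (the number of addresses that fit in one
indirect block) is an assumed value, since the chapter does not state
the on-disk address size.

```c
#include <assert.h>

#define NDIRECT  10   /* di_addr[0..9]: direct block addresses */
#define NSINGLE   2   /* di_addr[10..11]: single-indirect blocks */
#define NADDR   128   /* ASSUMPTION: 512-byte block / 4-byte address */

enum addr_level { ADDR_DIRECT, ADDR_SINGLE, ADDR_DOUBLE, ADDR_TOO_BIG };

/* Where a logical block of a file is found (hypothetical helper type). */
struct addr_path {
    enum addr_level level;
    int  slot;   /* index into di_addr[], -1 if out of range */
    long idx1;   /* index inside the first indirect block, -1 if unused */
    long idx2;   /* index inside the second-level block, -1 if unused */
};

/* Map logical block number lbn (0-based) to its di_addr path. */
struct addr_path map_block(long lbn)
{
    struct addr_path p = { ADDR_TOO_BIG, -1, -1, -1 };

    assert(lbn >= 0);
    if (lbn < NDIRECT) {                    /* direct slot */
        p.level = ADDR_DIRECT;
        p.slot = (int)lbn;
        return p;
    }
    lbn -= NDIRECT;
    if (lbn < (long)NSINGLE * NADDR) {      /* one level of indirection */
        p.level = ADDR_SINGLE;
        p.slot = NDIRECT + (int)(lbn / NADDR);
        p.idx1 = lbn % NADDR;
        return p;
    }
    lbn -= (long)NSINGLE * NADDR;
    if (lbn < (long)NADDR * NADDR) {        /* two levels of indirection */
        p.level = ADDR_DOUBLE;
        p.slot = NDIRECT + NSINGLE;         /* di_addr[12] */
        p.idx1 = lbn / NADDR;
        p.idx2 = lbn % NADDR;
        return p;
    }
    return p;                               /* beyond what the inode maps */
}
```

With these assumed constants, logical block 10 is the first entry of
the indirect block named by di_addr[10], and logical block 266 is the
first block reached through di_addr[12].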
The addressing scheme used in s5fs is somewhat sophisticated, but it
allows relatively large files to be handled with a small, fixed-size
inode. The simplest way to address data blocks is to keep an array of
their addresses. If the inode size is kept fixed and, say, 13 elements
are used to store block addresses, the maximum file size is
(13 * block size) bytes. This is a very tight restriction.

This problem is solved by indirect addressing. The first 10 elements
of the di_addr field are used as direct block addresses. If the file
is larger than (10 * block size) bytes, the 11th and 12th elements are
used. They are interpreted as addresses of blocks that contain not
file data but the addresses of data blocks. This is one-level
indirection. The 13th element uses two-level indirection, i.e. it
points to a block containing the addresses of blocks that in turn
contain the addresses of data blocks. With this addressing mechanism
it is possible to handle files of up to 16MB with a small, fixed-size
inode. It should be mentioned that this solution is also used in some
modern filesystems, such as Linux's ext2 filesystem.

SUBSECTION: The superblock

The superblock contains the information needed to mount the
filesystem, and general information such as the number of inodes, the
number of free blocks etc. Each filesystem has just one superblock,
which resides at the beginning of the disk partition. The superblock
is read into memory at mount time and stays resident until the
filesystem is unmounted. The superblock spans more than one block and
holds the following data:

o s_fsize
    Filesystem size in blocks, including the superblock itself, the
    array of inodes and the data blocks.
o s_isize
    Size of the ilist array
o s_tfree
    Number of free blocks
o s_tinode
    Number of free inodes

When a user process asks to create a new file, an inode must be
assigned to it. For efficiency, it is undesirable to scan the ilist
every time a new file is created just to find a free inode, because
the ilist array may span several blocks.
It is also inefficient to keep the full array of free inodes in
memory, since it may be quite big. Therefore, a partial list of free
inodes is kept in the superblock. When the number of free inodes in
the partial list reaches 0, the kernel scans the ilist array and
refills the list. While scanning the ilist array, the kernel can
distinguish a free inode from one in use by the `di_mode' field, which
is always zero for free inodes.

When a user process writes to a file, the file grows and a new block
may be needed from time to time to store the arriving data. Once
again, the kernel must maintain a list of free blocks so that it can
assign one quickly on every request. Since the list of free blocks may
be large, it is inefficient to keep all of it in memory. The
superblock holds just one block's worth of free block addresses. The
first element in that block is used as the address of a block that
contains the continuation of the list, forming a kind of on-disk
linked list. The 'head' of this linked list therefore resides in the
superblock and stays resident in memory. When the kernel is asked for
a free block, it hands out addresses starting from the end of the
list. When the number of free block addresses drops to 1, i.e. only
the first element is left, the kernel 'follows the list' and reads the
block addressed by this element, refilling the in-memory list of free
blocks.

SUBSECTION: Disk layout

+---------+
| DIAGRAM |
+---------+

SUBSECTION: Directory structure

One important feature of the s5fs filesystem is that the inode does
not contain the file name. File names are stored in special files,
directories, as ordinary file data. This approach allows any file to
have many names. The file data of a directory is an array of 16-byte
entries:

    struct dentry {
        short d_ino;
        char d_name[14];
    };

The d_ino field holds the inode number of a file, and d_name holds the
file name. File names in s5fs are limited to 14 bytes.
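
A name lookup over such an array can be sketched as a linear scan. The
code below is a minimal illustration built on the struct dentry shown
in this subsection. The function name dir_lookup and the constant
DIRSIZ are assumptions of this example. It also assumes that names
shorter than 14 bytes are padded with NUL bytes, and it skips entries
whose d_ino is 0, which is how this chapter says deleted entries are
marked.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DIRSIZ 14   /* maximum file name length in bytes */

struct dentry {
    short d_ino;            /* inode number; 0 marks an unused entry */
    char  d_name[DIRSIZ];   /* name, NUL-padded; no NUL if 14 bytes long */
};

/*
 * Scan a directory's entries for name (hypothetical helper).
 * Returns the inode number, or 0 if the name is not present.
 */
short dir_lookup(const struct dentry *dir, size_t nentries,
                 const char *name)
{
    size_t i;

    assert(dir != NULL && name != NULL);
    if (strlen(name) > DIRSIZ)
        return 0;               /* cannot be stored in 14 bytes */
    for (i = 0; i < nentries; i++) {
        if (dir[i].d_ino == 0)
            continue;           /* unused or deleted entry */
        /* Compare at most DIRSIZ bytes: d_name may lack a NUL. */
        if (strncmp(dir[i].d_name, name, DIRSIZ) == 0)
            return dir[i].d_ino;
    }
    return 0;
}
```

Because inode number 0 doubles as the "not found" result, this sketch
relies on 0 never being a valid inode number for a live entry.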
The first two entries in a directory are fixed: the first entry refers
to the directory itself and has the special name ".", and the second
entry refers to the parent directory and has the special name "..".

When a file is deleted from a directory, the only thing the kernel
does is set the d_ino field to 0. It does not actually remove the file
name or shrink the directory file. As a result, directory sizes are
never reduced.

SUBSECTION: Weaknesses of s5fs filesystem

SECTION: Virtual File System

In order to make several filesystems live on one UNIX machine, it is
necessary to keep a unified filesystem interface. This unified
interface may be considered an abstract class in terms of OO
programming. Every 'real' filesystem is an instance of the class. The
VFS (Virtual File System) provides a unified interface, a set of
'virtual functions'. Those functions must be implemented by the
'instances', the 'real' filesystems.

VFS presents two 'classes':

o virtual filesystem (struct vfs, associated struct vfsops)
o virtual node (struct vnode, associated struct vnodeops)

The first is an abstraction of a filesystem. It has an associated list
of methods used to handle general filesystem operations such as
mount/unmount etc. The second is an abstraction of the inode. Its
associated methods represent all the actions that may be performed on
a file object.
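
In C, such 'virtual functions' are usually expressed as a structure of
function pointers. The sketch below illustrates the idea for the
filesystem 'class' only. It is not copied from the FreeBSD headers:
the field names, the two-operation table, the toyfs filesystem and the
vfs_do_mount/vfs_do_unmount wrappers are all assumptions made for this
example.

```c
#include <assert.h>
#include <stddef.h>

struct vfs;   /* forward declaration */

/* Table of 'virtual functions' each real filesystem must supply. */
struct vfsops {
    int (*vfs_mount)(struct vfs *vfsp);
    int (*vfs_unmount)(struct vfs *vfsp);
};

/* One filesystem instance: generic state plus its method table. */
struct vfs {
    const struct vfsops *vfs_op;   /* methods of the real filesystem */
    void *vfs_data;                /* filesystem-private data */
    int   mounted;                 /* illustrative state flag */
};

/* A hypothetical 'real' filesystem implementing the interface. */
static int toyfs_mount(struct vfs *vfsp)   { vfsp->mounted = 1; return 0; }
static int toyfs_unmount(struct vfs *vfsp) { vfsp->mounted = 0; return 0; }

const struct vfsops toyfs_vfsops = { toyfs_mount, toyfs_unmount };

/* Generic layer: dispatches without knowing which filesystem it has. */
int vfs_do_mount(struct vfs *vfsp)
{
    assert(vfsp != NULL && vfsp->vfs_op != NULL);
    return vfsp->vfs_op->vfs_mount(vfsp);
}

int vfs_do_unmount(struct vfs *vfsp)
{
    assert(vfsp != NULL && vfsp->vfs_op != NULL);
    return vfsp->vfs_op->vfs_unmount(vfsp);
}
```

The generic code only ever calls through vfs_op, so supporting another
filesystem means supplying another struct vfsops. A struct vnodeops
table for per-file operations would follow the same pattern.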