[Bsdbook-commit] cvs commit: bsdbook/src filesys.tex,NONE,1.1


Update of /cvsroot/bsdbook/bsdbook/src
In directory usw-pr-cvs1:/tmp/cvs-serv11100

Added Files:
	filesys.tex 
Log Message:
prototype of filesystem chapter

--- NEW FILE: filesys.tex ---
% Contents: Filesystem
% $Id: filesys.tex,v 1.1 2002/08/28 11:02:28 drozd Exp $

CHAPTER: Filesystem architecture

SECTION: Overview

Most of the data UNIX handles is stored in files. Files are organized
in a tree and kept on some data store, usually one or more hard disks.
The first UNIX filesystem was s5fs (System V FileSystem), designed for System V
UNIX. FFS (Fast File System), designed at Berkeley, was introduced later and
brought a number of serious improvements in reliability and performance.
The modern filesystems used in many UNIX flavours exploit the same basic
ideas and concepts, pioneered by s5fs and FFS.

In earlier times, the UNIX kernel supported only one filesystem.
UNIX vendors therefore had to choose among the filesystems available at
that moment. To solve this problem, a new architecture called the
Virtual File System was developed, which allows different filesystems to be
used at the same time. Several vendors made such attempts,
but Sun's implementation proved the most flexible and clean. Later, it was
ported with few changes to BSD UNIX (McKusick et al).

This chapter first describes the s5fs architecture. It is fairly outdated
nowadays, but its simplicity makes it a good starting point. Then the
VFS (Virtual File System) interface used in FreeBSD is discussed, followed
by an overview of how files are handled in a user process and of the
associated data structures.

SECTION: s5fs architecture

UNIX represents the media on which a filesystem may reside as a flat array
of blocks. A block is a chunk of data of fixed size. The block size is
defined by the underlying driver and is the minimal amount of data that
the driver can read from or write to the media. Usually, the block size
reflects the size of a physical sector of the underlying media.

s5fs resides in one disk partition and consists of three major parts:

	o superblock
		Contains general information about the filesystem.
	o ilist
		An array of disk inodes. Contains metadata for all files in
		the filesystem. One of the inodes is the root inode, and the
		lookup of every other file is done through it. The array has
		a fixed size, i.e. a fixed number of inodes is available in
		the filesystem.
	o data blocks
		Data storage for files. Each block is 512 bytes long.

SUBSECTION: The inode
The key element of file representation in s5fs is the 'inode', which stands
for 'index node'. An inode is a data structure that resides on disk and
contains the file's metadata. Here is a list of the most important fields
of the disk inode (struct dinode):
	o di_mode
		File type (IFREG, IFDIR, IFBLK etc)
	o di_uid, di_gid
		UID and GID of the file's owner
	o di_size
		File size in bytes
	o di_atime, di_mtime, di_ctime
		Time of last access, last
		modification, and last metadata modification, respectively
	o di_nlinks
		Number of hard links
	o di_addr[13]
		Data block addresses

The prefix `di_' stands for ''Disk Inode''. When a process requests a file,
the kernel reads the associated inode into memory. The memory copy
(represented by struct inode) differs somewhat from the disk inode. Some
fields of the disk inode are unnecessary in memory, for instance
modification times. On the other hand, the memory copy requires some
additional information, for example inode locks that synchronize file access
among several processes. Therefore there are two structures: the on-disk
inode and the in-memory inode, also called the in-core inode. Data fields of
the in-core inode carry the prefix `i_'.
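
The split between the two structures can be sketched in C as follows. This
is a simplified illustration, not the historical System V declaration: the
field types, the members i_number, i_count and i_locked, and the helper
inode_load() are assumptions made for the example. Following the text, the
timestamps are left on disk and are not copied.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NADDR 13                /* slots in di_addr[], as listed above */

/* On-disk inode. Field names follow the text; the types are assumed. */
struct dinode {
	uint16_t di_mode;
	uint16_t di_nlinks;
	uint16_t di_uid, di_gid;
	uint32_t di_size;
	uint32_t di_atime, di_mtime, di_ctime;
	uint32_t di_addr[NADDR];
};

/* In-core inode: selected disk fields plus run-time state. */
struct inode {
	uint16_t i_mode;
	uint16_t i_nlinks;
	uint16_t i_uid, i_gid;
	uint32_t i_size;
	uint32_t i_addr[NADDR];
	uint32_t i_number;      /* hypothetical: position in the ilist */
	int      i_count;       /* hypothetical: number of active users */
	int      i_locked;      /* hypothetical: simple lock flag */
};

/* Fill an in-core inode from its disk image (hypothetical helper). */
void inode_load(struct inode *ip, const struct dinode *dp, uint32_t inumber)
{
	assert(ip != NULL && dp != NULL);
	ip->i_mode = dp->di_mode;
	ip->i_nlinks = dp->di_nlinks;
	ip->i_uid = dp->di_uid;
	ip->i_gid = dp->di_gid;
	ip->i_size = dp->di_size;
	memcpy(ip->i_addr, dp->di_addr, sizeof ip->i_addr);
	ip->i_number = inumber;     /* run-time state starts here */
	ip->i_count = 1;
	ip->i_locked = 0;
}
```

The in-core copy also remembers which inode it came from (i_number), because
the disk inode itself carries no such field: its number is implied by its
position in the ilist.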

The inode must record where the file data is stored. Since file data
blocks may not be contiguous, the inode must keep the block addresses
explicitly. This is the role of the di_addr field.

The addressing scheme used in s5fs is somewhat sophisticated, but it handles
relatively large files with a small, fixed inode size. The simplest
way to address data blocks is to keep an array of their addresses.
If the inode size is fixed and, say, 13 elements store the block
addresses, the maximum file size is only (13 * block size) bytes.
This is a very tough restriction. The problem is solved by indirect
addressing. The first 10 elements of the di_addr field are used as direct
block addresses. If the file is larger than (10 * block size) bytes, the
11th element is used. It holds the address of a block that contains not
file data but addresses of data blocks. This is one-level indirection. The
12th element uses two-level indirection, i.e. it points to a block
containing the addresses of blocks that contain the addresses of data
blocks. The 13th element adds one more level and uses three-level
indirection.

Using this addressing mechanism, it is possible to handle large files with
a small, fixed inode size: assuming 512-byte blocks and 4-byte block
addresses (128 addresses per indirect block), up to
10 + 128 + 128^2 + 128^3 blocks, or about 1 GB. The same solution
is used in some modern filesystems, such as Linux's ext2.
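
The capacity of such a scheme follows directly from the number of slots at
each level of indirection. The sketch below is an illustration only: the
function names are hypothetical, and the mix of slots, the block size and
the address size are parameters rather than facts about any particular
filesystem.

```c
#include <assert.h>
#include <stdint.h>

/* Data blocks reachable through one slot with `level` levels of
 * indirection; level 0 is a direct address. */
uint64_t blocks_per_slot(unsigned level, uint64_t addrs_per_block)
{
	uint64_t n = 1;
	while (level-- > 0)
		n *= addrs_per_block;
	return n;
}

/* Maximum file size in bytes. nslots[k] is the number of address slots
 * in the inode that use k levels of indirection. */
uint64_t max_file_size(const unsigned nslots[4], uint64_t block_size,
    uint64_t addr_size)
{
	assert(addr_size > 0 && block_size >= addr_size);
	uint64_t apb = block_size / addr_size;  /* addresses per block */
	uint64_t blocks = 0;
	for (unsigned level = 0; level < 4; level++)
		blocks += nslots[level] * blocks_per_slot(level, apb);
	return blocks * block_size;
}

/* Levels of indirection needed to reach logical block `lbn` of a file,
 * or -1 if the block lies beyond the maximum file size. */
int indirection_level(uint64_t lbn, const unsigned nslots[4], uint64_t apb)
{
	for (unsigned level = 0; level < 4; level++) {
		uint64_t span = nslots[level] * blocks_per_slot(level, apb);
		if (lbn < span)
			return (int)level;
		lbn -= span;    /* skip the range covered by this level */
	}
	return -1;
}
```

For instance, with 512-byte blocks and an assumed 4-byte address (128
addresses per block), ten direct slots contribute 5 KB, each one-level
indirect slot 64 KB, each two-level slot 8 MB, and a three-level slot 1 GB.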

SUBSECTION: The superblock

The superblock contains the information needed to mount the filesystem, and
general information such as the number of inodes, free blocks etc. Each
filesystem has just one superblock, which resides at the beginning of the
disk partition. The superblock is read into memory at mount time and stays
resident until the filesystem is unmounted. The superblock spans more than
one block and contains the following data:
	o s_fsize
		Filesystem size in blocks, including the superblock
		itself, the array of inodes and the data blocks
	o s_isize
		Size of the ilist array
	o s_tfree
		Number of free blocks
	o s_tinode
		Number of free inodes

When a user process asks to create a new file, an inode must be assigned to
it. For efficiency, it is undesirable to scan the ilist every time a new
file is created just to find a free inode, because the ilist array may span
many blocks. It is also inefficient to keep the full list of free inodes in
memory, since it may be quite big. Therefore, a partial list of free inodes
is kept in the superblock. When the number of free inodes in the partial
list reaches 0, the kernel scans the ilist array and refills the list. While
scanning the ilist array, the kernel distinguishes a free inode from one in
use by the `di_mode' field, which is always zero for free inodes.
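
The allocation path described above can be sketched as follows. This is a
minimal in-memory model, not kernel code: the ilist is a plain array, the
sizes NINODES and NICACHE are arbitrary, and the names s_ninode, s_inode,
refill() and ialloc() are assumptions of the example. Inode numbers are
taken to start at 1, so that 0 can mean "none".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NINODES 64      /* hypothetical size of the ilist */
#define NICACHE 4       /* hypothetical size of the partial free list */

struct dinode {
	uint16_t di_mode;       /* 0 means the inode is free */
};

struct superblock {
	uint16_t s_tinode;              /* total number of free inodes */
	uint16_t s_ninode;              /* valid entries in s_inode[] */
	uint16_t s_inode[NICACHE];      /* partial list of free inodes */
};

/* Scan the ilist and collect free inode numbers until the list is full. */
void refill(struct superblock *sb, const struct dinode *ilist)
{
	for (size_t i = 0; i < NINODES && sb->s_ninode < NICACHE; i++)
		if (ilist[i].di_mode == 0)
			sb->s_inode[sb->s_ninode++] = (uint16_t)(i + 1);
}

/* Allocate an inode and mark it used by storing a non-zero mode.
 * Returns the inode number, or 0 if no free inode is left. */
uint16_t ialloc(struct superblock *sb, struct dinode *ilist, uint16_t mode)
{
	assert(sb != NULL && ilist != NULL && mode != 0);
	if (sb->s_ninode == 0)
		refill(sb, ilist);      /* partial list exhausted: rescan */
	if (sb->s_ninode == 0)
		return 0;               /* the scan found nothing */
	uint16_t ino = sb->s_inode[--sb->s_ninode];
	ilist[ino - 1].di_mode = mode;
	sb->s_tinode--;
	return ino;
}
```

With these sizes the ilist is scanned only on every fourth allocation; the
other three are served from the superblock alone.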

When a user process writes to a file, the file data grows and from time to
time a new block is needed to store the arriving data. Once again, the
kernel must maintain a list of free blocks so that it can assign one quickly
on each request. Since the list of free blocks may be large, it is
inefficient to keep all of it in memory. The superblock holds only the first
portion of it. The first element of that portion is used as the address of
a block that contains the continuation of the list, forming a kind of
on-disk linked list. The 'head' of this linked list therefore resides in the
superblock and stays resident in memory. When the kernel is asked for a free
block, it hands out blocks starting from the end of the list held in the
superblock. When the number of entries drops to 1, i.e. only the first
element is left, the kernel 'follows the list' and reads the block addressed
by this element, refilling the in-memory list of free blocks.
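
A small model makes the chained free list easier to follow. The sketch
below is an illustration under several assumptions: the disk is an array in
memory, NFREE and NBLOCKS are arbitrary, block number 0 serves as the
end-of-chain marker, and the names struct freelist, balloc() and bfree()
are chosen for the example. When the link element is consumed, the sketch
loads the continuation from the linked block and then hands out that block
itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NFREE   4       /* hypothetical entries per list portion */
#define NBLOCKS 32      /* hypothetical disk size in blocks */

/* One portion of the free list. free[0] is the link: the number of the
 * block holding the next portion, or 0 at the end of the chain. */
struct freelist {
	uint32_t nfree;
	uint32_t free[NFREE];
};

/* Simulated disk: only the free-list content of a block is modelled. */
struct freelist disk[NBLOCKS];

/* Take one free block; returns 0 when no free block is left. */
uint32_t balloc(struct freelist *head)
{
	assert(head != NULL && head->nfree >= 1 && head->nfree <= NFREE);
	uint32_t bno = head->free[--head->nfree];
	if (head->nfree == 0) {         /* that was the link element */
		if (bno == 0) {         /* end of chain: keep the marker */
			head->nfree = 1;
			return 0;
		}
		*head = disk[bno];      /* follow the list */
	}
	return bno;
}

/* Return a block to the free list. */
void bfree(struct freelist *head, uint32_t bno)
{
	assert(head != NULL && bno != 0 && bno < NBLOCKS);
	if (head->nfree == NFREE) {     /* head is full: spill it to disk */
		disk[bno] = *head;
		head->nfree = 0;        /* bno becomes the new link */
	}
	head->free[head->nfree++] = bno;
}
```

An empty list is represented as { 1, { 0 } }, that is, a single link
element that points nowhere. Only one portion of the list is ever held in
memory, however many blocks are free.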

SUBSECTION: Disk layout

	+---------+
	| DIAGRAM |
	+---------+

SUBSECTION: Directory structure

One important feature of s5fs is that the inode does not contain
the file name. File names are stored in special files, directories, as
ordinary file data. This approach allows any file to have many names.
The file data of a directory is an array of 16-byte entries:

struct dentry {
	short d_ino;		/* inode number of the file */
	char d_name[14];	/* file name, at most 14 bytes */
};

The d_ino field holds the inode number of a file, and d_name holds the file
name. A file name in s5fs is limited to 14 bytes. The first two entries in a
directory are fixed: the first refers to the directory itself and has the
special name ".", and the second refers to the parent directory and has the
special name "..".

When a file is deleted from a directory, the kernel only sets the
d_ino field to 0. It does not actually remove the file name or shrink the
directory file. As a result, directory sizes are never reduced.
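
Lookup and deletion over such an array of entries can be sketched as
follows. The function names dir_lookup() and dir_remove() are hypothetical,
and the sketch assumes that names shorter than 14 bytes are padded with NUL
bytes, so a name of exactly 14 bytes has no terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DIRSIZ 14       /* maximum file name length, as in the text */

struct dentry {
	short d_ino;            /* 0 marks an unused entry */
	char d_name[DIRSIZ];
};

/* Return the inode number stored for `name`, or 0 if there is none. */
short dir_lookup(const struct dentry *dir, size_t nentries, const char *name)
{
	assert(dir != NULL && name != NULL);
	for (size_t i = 0; i < nentries; i++)
		if (dir[i].d_ino != 0 &&
		    strncmp(dir[i].d_name, name, DIRSIZ) == 0)
			return dir[i].d_ino;
	return 0;
}

/* Delete `name`: clear d_ino and leave the slot and its name in place.
 * Returns 0 on success, -1 if the name is not present. */
int dir_remove(struct dentry *dir, size_t nentries, const char *name)
{
	assert(dir != NULL && name != NULL);
	for (size_t i = 0; i < nentries; i++)
		if (dir[i].d_ino != 0 &&
		    strncmp(dir[i].d_name, name, DIRSIZ) == 0) {
			dir[i].d_ino = 0;
			return 0;
		}
	return -1;
}
```

Because strncmp() compares at most 14 bytes, a longer name matches an entry
that holds its first 14 bytes. The array never gets shorter after
dir_remove(), which mirrors the behaviour described above.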

SUBSECTION: Weaknesses of s5fs filesystem

SECTION: Virtual File System

To let several filesystems coexist on one UNIX machine, a unified
filesystem interface is necessary. This unified interface may be
considered an abstract class in terms of OO programming. Every 'real'
filesystem acts as a concrete implementation of that class. The VFS (Virtual
File System) provides the unified interface as a set of 'virtual functions'.
Those functions must be implemented by the 'real' filesystems.
VFS defines two 'classes':

	o virtual filesystem (struct vfs, associated struct vfsops)
	o virtual node (struct vnode, associated struct vnodeops)

The first is an abstraction of a filesystem. It has an associated list of
methods that handle general filesystem operations such as mount, unmount etc.
The second is an abstraction of an inode. Its associated methods represent
all the actions that can be performed on a file object.
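
In C, such 'virtual functions' are usually expressed as a table of function
pointers. The sketch below shows the idea with a single operation. It is
heavily reduced and is not the FreeBSD definition of these structures: the
member vop_read, the wrapper vnode_read() and the toy filesystem toyfs are
all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct vnode;

/* Table of 'virtual functions'; one table per filesystem type. */
struct vnodeops {
	int (*vop_read)(struct vnode *vp, char *buf, size_t len);
};

struct vnode {
	const struct vnodeops *v_op;    /* methods of the 'real' filesystem */
	void *v_data;                   /* filesystem-private data */
};

/* Filesystem-independent entry point: dispatch through the table. */
int vnode_read(struct vnode *vp, char *buf, size_t len)
{
	assert(vp != NULL && vp->v_op != NULL && buf != NULL);
	return vp->v_op->vop_read(vp, buf, len);
}

/* A toy 'real' filesystem whose private data is a C string.
 * Copies at most len bytes and returns the number copied. */
int toyfs_read(struct vnode *vp, char *buf, size_t len)
{
	const char *src = vp->v_data;
	size_t n = 0;
	while (n < len && src[n] != '\0') {
		buf[n] = src[n];
		n++;
	}
	return (int)n;
}

const struct vnodeops toyfs_vnodeops = { toyfs_read };
```

The caller of vnode_read() never needs to know which filesystem stands
behind the vnode; supporting another filesystem means supplying another
table of functions.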