File                    Date        Author        Commit
.hgtags                 2009-08-05  Kurt Garloff  [e8dfa8] Added tag VERSION_0_1 for changeset da38f01d691c
AUTHORS                 2009-08-05  Kurt Garloff  [fcb6f8] Add .spec file.
COPYING                 2009-08-05  Kurt Garloff  [fcb6f8] Add .spec file.
Makefile                2009-08-05  Kurt Garloff  [5bd748] Fix install-lib.
README                  2009-08-10  Kurt Garloff  [5c0a0d] Recommend FALLOC_FL_KEEP_SIZE.
TODO                    2009-08-05  Kurt Garloff  [829ab2] Fix Groups.
autogen.sh              2009-08-04  Kurt Garloff  [3b3182] autoconf stuff.
configure.in            2009-08-04  Kurt Garloff  [3b3182] autoconf stuff.
falloc.h                2009-08-05  Kurt Garloff  [8841cb] Fix compile error for asm-generic.
falloc_sparse.c         2009-08-05  Kurt Garloff  [6281f2] Split headers.
falloc_xfs_ocfs2.c      2009-08-05  Kurt Garloff  [6281f2] Split headers.
fallocate.h             2009-08-05  Kurt Garloff  [6281f2] Split headers.
fallocate_internal.h    2009-08-05  Kurt Garloff  [6281f2] Split headers.
libfallocate.version    2009-08-03  Kurt Garloff  [19cb83] Add version script to not export all symbols.
libfallocate0.changes   2009-08-05  Kurt Garloff  [da38f0] Track changelog in hg as well.
libfallocate0.spec      2009-08-06  Kurt Garloff  [1b3b9d] Fix LIcense in RPM to be LGPL v2.1+
linux_fallocate.c       2009-08-05  Kurt Garloff  [80fae5] fallocate on pipes does not make sense either.
sys_fallocate.c         2009-08-05  Kurt Garloff  [0c9e93] Still implement fallocate64 if HAVE_FALLOCATE i...
test_falloc.c           2009-08-04  Kurt Garloff  [94ec8d] C++ fixes (signed vs unsigned).

README for libfallocate
=======================

Why this library?
-----------------
Modern hard disks deliver enormous throughput if you contrast their 
performance with drives of a decade ago. However, one parameter has
not been improved much: Seek time. So whenever you need to move a
head over a rotating disk, you'll incur milliseconds of wait time.
This means that for a filesystem, keeping files contiguously stored,
i.e. avoiding fragmentation, is one of the most important things it
can do to deliver good performance. However, when users write files
to the hard disk, that's not an easy task -- often the writes do not
happen with a single large write() system call, but with a number of
small ones. The filesystem does not know how large the file may get
and thus does not know where to best allocate space.

Sometimes the application that writes a file knows at least
approximately how large the file will get -- if it could tell the
filesystem, the filesystem could do a better job of allocating the right
contiguous space on the disk.[*]

Linux offers an interface for applications to do that:
The fallocate() system call. So applications should just use that.
Problem solved?
Not quite:
- The fallocate() system call only got introduced with kernel 2.6.23.
- Only the filesystems ext4 and btrfs support fallocate() as of
  summer 2009.
- The fallocate() wrapper only got added in glibc-2.10.
- There is the posix_fallocate() wrapper in glibc since 2.8 which
  indeed calls the fallocate() system call; if that fails, it will
  zero fill the file -- which can take a serious amount of time
  and can easily turn out to destroy the anticipated benefits of
  fallocate().
- There are interfaces in the kernel to preallocate space; XFS and
  ocfs2 support an ioctl to achieve this.
- The ioctl has slightly different semantics; it does not extend the
  file to the preallocated size unlike fallocate().

This dire situation is described by Ted Ts'o on
http://www.linux-mag.com/id/7272/2/
Application developers cannot be expected to deal with this mess.
Thus this library.

[*] There's another strategy that can help to allocate exactly the right
    amount of space: Delay the allocation of sectors to a file for a
    while -- meanwhile the application may have finished writing and the
    filesystem then knows how large the file is and can make the optimal
    decision. This is currently implemented in ext4.

What does the library provide?
------------------------------
This library provides three things:

(1) The syscall wrapper fallocate() for systems whose glibc does
    not have it (yet).

    int fallocate64(int fd, int mode, __off64_t start, __off64_t len);
    int fallocate(int fd, int mode, off_t start, off_t len);

    The routine tells the filesystem to preallocate len bytes starting
    at offset start in the file with file descriptor fd. The mode
    flag should be zero (FALLOC_FL_ADJUST_SIZE) or FALLOC_FL_KEEP_SIZE.
    If it's zero, the file size will be extended if needed (and, as with
    a sparse file, reading the bytes beyond the prior end of the file
    will return zeros). Otherwise, the file size won't be affected.

    If the preallocation fails, the function will return -1 and 
    errno will be set. Typical errno values are EPERM, EINVAL,
    EOPNOTSUPP (==ENOTSUP), ENOSYS, EFBIG, EBADF, EROFS, ESPIPE,
    and ENOSPC.

    The syscall wrapper is implemented as a weak symbol -- if glibc 
    provides the symbol, glibc will take precedence.
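
    As a rough illustration of the calling convention described above,
    the following sketch wraps fallocate() and hands back the errno value
    on failure. The helper name try_prealloc() is made up for this example
    and is not part of the library; the code calls the glibc wrapper
    directly, so it assumes a glibc that already provides fallocate().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes fallocate() in <fcntl.h> on glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

#ifndef FALLOC_FL_KEEP_SIZE
#define FALLOC_FL_KEEP_SIZE 0x01   /* value used by the Linux kernel headers */
#endif

/* Hypothetical helper: preallocate the range [0, len) of fd.
 * mode is 0 (extend the file if needed) or FALLOC_FL_KEEP_SIZE.
 * Returns 0 on success, otherwise the errno value set by fallocate(). */
static int try_prealloc(int fd, int mode, off_t len)
{
    if (fallocate(fd, mode, 0, len) == 0)
        return 0;
    return errno;   /* e.g. EOPNOTSUPP, ENOSYS, ENOSPC, EBADF, ... */
}
```

    With FALLOC_FL_KEEP_SIZE the file size stays the same whether or not
    the call succeeds, so a caller that only wants to give a hint can
    ignore the return value.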

(2) int linux_fallocate64(int fd, int mode, __off64_t start, __off64_t len);
    int linux_fallocate(int fd, int mode, off_t start, off_t len);
     
    This is a slightly more intelligent interface; if fallocate() fails,
    the ioctl interface will be tried for XFS and OCFS2 filesystems.
    Depending on mode, the file will be extended like fallocate() would
    have done. The ioctl also supports FALLOC_FL_UNRESV to remove a
    reservation.

    Error return values are the same as for fallocate().
    (The functions translate ENOTTY to EOPNOTSUPP.)
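
    The ioctl path described above might look roughly like the sketch
    below. This is not the library's code: resv_via_ioctl(), struct
    resv_request and RESVSP64_IOCTL are names invented for this example,
    and the structure layout and request number are assumed to match
    struct xfs_flock64 and XFS_IOC_RESVSP64 from the XFS headers. Check
    them against your headers before relying on this.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Local copy of the reservation request; ASSUMED to mirror
 * struct xfs_flock64 from the XFS headers. */
struct resv_request {
    int16_t  l_type;
    int16_t  l_whence;    /* 0 == SEEK_SET */
    int64_t  l_start;
    int64_t  l_len;
    int32_t  l_sysid;
    uint32_t l_pid;
    int32_t  l_pad[4];
};

/* ASSUMED to be the same request number as XFS_IOC_RESVSP64. */
#define RESVSP64_IOCTL _IOW('X', 42, struct resv_request)

/* Hypothetical helper: reserve [start, start+len) via the ioctl.
 * Returns 0 on success, -1 with errno set on failure. */
static int resv_via_ioctl(int fd, int mode, off_t start, off_t len)
{
    struct resv_request req = {0};
    struct stat st;

    req.l_whence = 0;
    req.l_start = start;
    req.l_len = len;
    if (ioctl(fd, RESVSP64_IOCTL, &req) != 0) {
        if (errno == ENOTTY)        /* filesystem has no such ioctl */
            errno = EOPNOTSUPP;
        return -1;
    }
    if (mode != 0)                  /* e.g. FALLOC_FL_KEEP_SIZE: done */
        return 0;
    /* The ioctl does not change the file size, so grow the file here
     * to get the same visible result as fallocate() with mode 0. */
    if (fstat(fd, &st) != 0)
        return -1;
    if (st.st_size < start + len)
        return ftruncate(fd, start + len);
    return 0;
}
```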

(3) int fallocate64_with_fallback(int fd, int mode, __off64_t start,
                                  __off64_t len, char fb_posix, char fb_sparse);
    int fallocate_with_fallback(int fd, int mode, off_t start,
                                off_t len, char fb_posix, char fb_sparse);

    This interface acts like linux_fallocate64(); however, there are
    two fallback flags that control what happens if neither fallocate()
    nor the ioctl succeeded and no fatal error (such as EPERM, EBADF
    or EROFS) occurred.
    If fb_posix is set, the function will next call posix_fallocate(),
    which will end up writing lots of zeros to the file. Note that
    this won't happen for FALLOC_FL_KEEP_SIZE. If posix_fallocate()
    is not implemented in glibc, this library will provide its own
    zero-writing routine.
    If fb_sparse is set, the function will next try a sparse write at the
    end of the desired range, hoping that the filesystem takes it as a
    hint for the final size of the file. (With FALLOC_FL_KEEP_SIZE, it
    will write 0 bytes, thus not extending the file, otherwise it will
    write one 0 byte -- but only if the file was previously smaller,
    so no data will get overwritten.)

    If the fallback flags are set, the errno values EOPNOTSUPP and 
    ENOSYS should not be observed, unless you request a mode != 0.
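
    The order of fallbacks described above can be sketched as follows.
    This is a simplified illustration, not the library's implementation:
    the names prealloc_with_fallback() and sparse_hint() are invented
    here, the ioctl step is left out, and the glibc fallocate() and
    posix_fallocate() wrappers are assumed to be available.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes fallocate() in <fcntl.h> on glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Sparse hint: write one zero byte at the end of the range, but only
 * for mode 0 and only if the file is currently shorter, so existing
 * data is never overwritten. Otherwise write 0 bytes. */
static int sparse_hint(int fd, int mode, off_t start, off_t len)
{
    struct stat st;
    char zero = 0;
    size_t n;

    if (fstat(fd, &st) != 0)
        return -1;
    n = (mode == 0 && st.st_size < start + len) ? 1 : 0;
    return pwrite(fd, &zero, n, start + len - (off_t)n) == (ssize_t)n ? 0 : -1;
}

/* Hypothetical stand-in for fallocate_with_fallback(), minus the ioctl. */
static int prealloc_with_fallback(int fd, int mode, off_t start, off_t len,
                                  char fb_posix, char fb_sparse)
{
    int err;

    if (fallocate(fd, mode, start, len) == 0)
        return 0;
    if (errno == EPERM || errno == EBADF || errno == EROFS)
        return -1;                       /* fatal: no point in retrying */
    if (fb_posix && mode == 0) {         /* skipped for FALLOC_FL_KEEP_SIZE */
        err = posix_fallocate(fd, start, len);   /* returns an errno value */
        if (err == 0)
            return 0;
        errno = err;
    }
    if (fb_sparse)
        return sparse_hint(fd, mode, start, len);
    return -1;                           /* errno from the last attempt */
}
```

    Called as prealloc_with_fallback(fd, 0, 0, size, 0, 1), this leaves
    the file at least size bytes long whenever the descriptor is
    writable, either through a real preallocation or through the sparse
    write.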

The versions without the 64 are just macros to call the functions with
64bit offsets. So don't assume any argument will be truncated to 32bit,
except for fallocate on 32bit systems with _FILE_OFFSET_BITS == 32
(for compatibility with glibc).

Recommended use by applications
-------------------------------
To use this library, just
#include <fallocate.h>
in your application and link with
-lfallocate
You can do this irrespective of your glibc version. (TODO: Check with
glibc-2.10.)

Applications that write files in several chunks, but have at least
an idea how large the files will end up being, should tell the filesystem
about it. As soon as possible, ideally directly after creating a file with
	int fd = open(FILENAME, O_RDWR | O_CREAT | O_TRUNC, perm);
the app should tell the kernel about the estimated size by calling
	linux_fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, size);
If the function returns an error, it can be safely ignored.
The file size won't be changed by this call, whether or not it succeeds.
If the app has been written to use posix_fallocate() and thus expects
that the file size will be changed to at least have the preallocated
size now, the app can call 
	linux_fallocate(fd, 0, 0, size);
instead. Note that this can fail, in which case the file size won't be
affected. If the app does not want to deal with that, it can call
	fallocate_with_fallback(fd, 0, 0, size, 0, 1);
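
Put together, the hint-only pattern (FALLOC_FL_KEEP_SIZE, error ignored)
could look like the sketch below. The function write_chunks_with_hint() is
a made-up name for this example, and glibc's fallocate() stands in for
linux_fallocate() so that the code builds without this library; with the
library installed, the call can be swapped for the one shown above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes fallocate() in <fcntl.h> on glibc */
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef FALLOC_FL_KEEP_SIZE
#define FALLOC_FL_KEEP_SIZE 0x01   /* value used by the Linux kernel headers */
#endif

/* Hypothetical example: create path, announce the expected size, then
 * write nchunks chunks of chunk_len bytes each.
 * Returns the final file size, or -1 on error. */
static off_t write_chunks_with_hint(const char *path, const void *chunk,
                                    size_t chunk_len, int nchunks)
{
    struct stat st;
    int i;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    /* Hint only: the result is ignored on purpose, and the file size
     * is not changed by this call. */
    (void)fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, (off_t)chunk_len * nchunks);

    for (i = 0; i < nchunks; i++) {
        /* A short write is treated as an error to keep the sketch small. */
        if (write(fd, chunk, chunk_len) != (ssize_t)chunk_len) {
            close(fd);
            return -1;
        }
    }
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    close(fd);
    return st.st_size;
}
```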

In some cases, the overhead of writing zeros out can be justified; only
testing can tell. In that case, the _with_fallback() interface can be
used and called with fb_posix == 1.
The author of this code does not know whether there are use cases where
the implementation of a filesystem and the usage pattern make it
beneficial to extend files sparsely. However, the app may expect that
the reservation really is reflected in the file size, and fb_sparse == 1
will make it very unlikely that this would fail.