libfallocate Code
Status: Beta
Brought to you by: garloff
README for libfallocate
=======================
Why this library?
-----------------
Modern hard disks deliver enormous throughput if you contrast their
performance with drives of a decade ago. However, one parameter has
not improved much: seek time. So whenever you need to move a
head over a rotating disk, you'll incur milliseconds of wait time.
This means that for a filesystem, keeping files contiguously stored,
i.e. avoiding fragmentation, is one of the most important things it
can do to deliver good performance. However, when users write files
to the hard disk, that's not an easy task -- often the writes do not
happen with a single large write() system call, but with a number of
small ones. The filesystem does not know how large the file may get
and thus does not know where to best allocate space.
Sometimes the application that writes a file knows at least
approximately how large the file will get -- if it could tell the
filesystem, the filesystem could do a better job of allocating the
right contiguous space on the disk.[*]
Linux offers an interface for applications to do that:
The fallocate() system call. So applications should just use that.
Problem solved?
Not quite:
- The fallocate() system call only got introduced with kernel 2.6.23.
- Only the filesystems ext4 and btrfs support fallocate() as of
summer 2009.
- The fallocate() wrapper only got added to glibc in version 2.10.
- There is the posix_fallocate() wrapper in glibc since 2.8 which
indeed calls the fallocate() system call; if that fails, it will
zero fill the file -- which can take a serious amount of time
and can easily destroy the anticipated benefits of fallocate().
- There are interfaces in the kernel to preallocate space; XFS and
ocfs2 support an ioctl to achieve this.
- The ioctl has slightly different semantics; it does not extend the
file to the preallocated size unlike fallocate().
This dire situation is described by Ted Ts'o on
http://www.linux-mag.com/id/7272/2/
Application developers cannot be expected to deal with this mess.
Thus this library.
[*] There's another strategy that can help to allocate exactly the right
amount of space: Delay the allocation of sectors to a file for a
while -- meanwhile the application may have finished writing and the
filesystem then knows how large the file is and can make the optimal
decision. This is implemented in ext4, currently.
What does the library provide?
------------------------------
This library provides three things:
(1) The syscall wrapper fallocate() for systems whose glibc does
not have it (yet).
    int fallocate64(int fd, int mode, __off64_t start, __off64_t len);
    int fallocate(int fd, int mode, off_t start, off_t len);
The routine tells the filesystem to preallocate len bytes starting
at offset start in the file with file descriptor fd. The mode
flag should be zero (FALLOC_FL_ADJUST_SIZE) or FALLOC_FL_KEEP_SIZE.
If it's zero, the file size will be extended if needed (and, as with
a sparse file, the bytes beyond the prior end of the file will read
back as zeros). Otherwise, the file size won't be affected.
If the preallocation fails, the function will return -1 and
errno will be set. Typical errno values are EPERM, EINVAL,
EOPNOTSUPP (==ENOTSUP), ENOSYS, EFBIG, EBADF, EROFS, ESPIPE,
and ENOSPC.
The syscall wrapper is implemented as a weak symbol -- if glibc
provides the symbol, glibc will take precedence.
(2) int linux_fallocate64(int fd, int mode, __off64_t start, __off64_t len);
int linux_fallocate(int fd, int mode, off_t start, off_t len);
This is a slightly more intelligent interface; if fallocate() fails,
the ioctl interface will be tried for XFS and OCFS2 filesystems.
Depending on mode, the file will be extended like fallocate() would
have done. The ioctl also supports FALLOC_FL_UNRESV to remove a
reservation.
Error return values are the same as for fallocate().
(The functions translate ENOTTY to EOPNOTSUPP.)
(3) int fallocate64_with_fallback(int fd, int mode, __off64_t start,
        __off64_t len, char fb_posix, char fb_sparse);
    int fallocate_with_fallback(int fd, int mode, off_t start,
        off_t len, char fb_posix, char fb_sparse);
This interface acts like linux_fallocate64(); however, there are
two fallback flags that control what happens if neither fallocate()
nor the ioctl succeeded and no fatal error (such as EPERM, EBADF
or EROFS) occurred.
If fb_posix is set, the function will next do posix_fallocate(),
which will end up writing lots of zeros to the file. Note that
this won't happen for FALLOC_FL_KEEP_SIZE. If posix_fallocate()
is not implemented in glibc, this library will provide its own
zero writing routine.
If fb_sparse is set, the function will next try a sparse write at the
end of the desired range, hoping that the filesystem takes it as a
hint for the final size of the file. (With FALLOC_FL_KEEP_SIZE, it
will write 0 bytes, thus not extending the file, otherwise it will
write one 0 byte -- but only if the file was previously smaller,
so no data will get overwritten.)
If the fallback flags are set, the errno values EOPNOTSUPP and
ENOSYS should not be observed, unless you request a mode != 0.
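
The fallback order described in (3) can be sketched as follows. This is
only an illustration, not the library's implementation: the names
sketch_fallocate_with_fallback() and sketch_sparse_hint() are
hypothetical, glibc's own fallocate() stands in for the library wrapper,
and the XFS/OCFS2 ioctl step is left out. It assumes a Linux system with
a glibc that provides fallocate().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes fallocate() in <fcntl.h> */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>       /* FALLOC_FL_KEEP_SIZE */
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: hint the final size with a sparse write. */
static int sketch_sparse_hint(int fd, int mode, off_t start, off_t len)
{
    struct stat st;
    off_t end = start + len;

    if (mode & FALLOC_FL_KEEP_SIZE)
        /* Zero-length write at the end of the range: size is untouched. */
        return pwrite(fd, "", 0, end) < 0 ? -1 : 0;
    if (fstat(fd, &st) < 0)
        return -1;
    if (st.st_size >= end)
        return 0;               /* file is large enough; overwrite nothing */
    /* One zero byte at the last offset extends the file to 'end'. */
    return pwrite(fd, "", 1, end - 1) == 1 ? 0 : -1;
}

/* Hypothetical sketch of the fallback order; returns 0 or -1 with errno. */
int sketch_fallocate_with_fallback(int fd, int mode, off_t start, off_t len,
                                   char fb_posix, char fb_sparse)
{
    assert(start >= 0 && len > 0);      /* caller error otherwise */

    if (fallocate(fd, mode, start, len) == 0)
        return 0;
    /* Only "not supported" errors are worth a fallback. */
    if (errno != EOPNOTSUPP && errno != ENOSYS)
        return -1;
    /* (The XFS/OCFS2 ioctl attempt would go here; omitted in this sketch.) */
    if (fb_posix && mode == 0) {
        int err = posix_fallocate(fd, start, len);
        if (err == 0)
            return 0;
        errno = err;            /* posix_fallocate() returns the error code */
    }
    if (fb_sparse)
        return sketch_sparse_hint(fd, mode, start, len);
    return -1;
}
```

Calling sketch_fallocate_with_fallback(fd, 0, 0, size, 0, 1) on a fresh
file should leave it with a size of at least size bytes, whether the
filesystem supports fallocate() or only the sparse write took effect.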
The versions without the 64 are just macros to call the functions with
64bit offsets. So don't assume any argument will be truncated to 32bit,
except for fallocate on 32bit systems with _FILE_OFFSET_BITS == 32
(for compatibility with glibc).
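
An application can check the offset width it is compiled with. A minimal
sketch, assuming a glibc system where _FILE_OFFSET_BITS selects the width
of off_t; off_t_bits() is a hypothetical helper and not part of this
library.

```c
#ifndef _FILE_OFFSET_BITS
#define _FILE_OFFSET_BITS 64    /* must come before any system header */
#endif
#include <assert.h>             /* static_assert */
#include <limits.h>             /* CHAR_BIT */
#include <sys/types.h>          /* off_t */

/* Fail the build early if offsets would be truncated to 32 bits. */
static_assert(sizeof(off_t) == 8, "off_t is expected to be 64 bits wide");

/* Hypothetical helper: width of off_t in bits in this translation unit. */
int off_t_bits(void)
{
    return (int)(sizeof(off_t) * CHAR_BIT);
}
```

With the define in place, the plain off_t prototypes and the explicit
64-bit ones take offsets of the same width.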
Recommended use by applications
-------------------------------
To use this library, just
    #include <fallocate.h>
in your application and link with
    -lfallocate
You can do this irrespective of your glibc version. (TODO: Check with
glibc-2.10.)
Applications that write files in several chunks, but have at least
an idea how large the files will end up being, should tell the filesystem
about it. As soon as possible, ideally directly after creating a file with
    int fd = open(FILENAME, O_RDWR | O_CREAT | O_TRUNC, perm);
the app should tell the kernel about the estimated size by calling
    linux_fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, size);
If the function returns an error, it can be safely ignored.
The file size won't be changed by this call, whether or not it succeeds.
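
Put together, the create-then-hint pattern might look like the sketch
below. It is an illustration under assumptions: create_with_size_hint()
is a hypothetical helper, and glibc's fallocate() is used in place of
linux_fallocate() so that the example builds without this library. With
the library installed, the fallocate() call would be replaced by
linux_fallocate() with the same arguments.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes fallocate() in <fcntl.h> */
#endif
#include <assert.h>
#include <fcntl.h>
#include <linux/falloc.h>       /* FALLOC_FL_KEEP_SIZE */
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a file and hint its expected size.
 * Returns the open file descriptor, or -1 if open() failed. */
int create_with_size_hint(const char *path, off_t estimate, mode_t perm)
{
    assert(path != NULL && estimate > 0);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, perm);
    if (fd < 0)
        return -1;
    /* Best effort: a failure only means there is no preallocation. */
    (void)fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, estimate);
    return fd;
}
```

The caller then writes to the returned descriptor as usual; the reported
file size stays at zero until data is actually written.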
If the app has been written to use posix_fallocate() and thus expects
that the file size will be changed to at least the preallocated
size, the app can call
    linux_fallocate(fd, 0, 0, size);
instead. Note that this can fail, in which case the file size won't be
affected. If the app does not want to deal with that failure, it can call
    fallocate_with_fallback(fd, 0, 0, size, 0, 1);
which sets fb_posix == 0 and fb_sparse == 1.
In some cases, the overhead of writing zeros out can be justified; only
testing can tell. In that case, the _with_fallback() interface can be
used and called with fb_posix == 1.
The author of this code does not know if there are use cases where the
implementation of a filesystem and the usage pattern make it beneficial
to use the sparse extending of files. However, the app may expect that
the reservation really is reflected in the file size, and fb_sparse == 1
will make it very unlikely that this would fail.
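
Whether a call changed the file size, reserved blocks, or both can be
observed with stat(). The sketch below uses two hypothetical helpers,
apparent_size() and allocated_bytes(), which are not part of this
library. It relies on st_blocks counting units of 512 bytes, as it does
on Linux.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: file size as reported to readers, -1 on error. */
off_t apparent_size(const char *path)
{
    struct stat st;

    assert(path != NULL);
    return stat(path, &st) == 0 ? st.st_size : (off_t)-1;
}

/* Hypothetical helper: bytes backed by allocated blocks, -1 on error. */
off_t allocated_bytes(const char *path)
{
    struct stat st;

    assert(path != NULL);
    return stat(path, &st) == 0 ? (off_t)st.st_blocks * 512 : (off_t)-1;
}
```

After a successful FALLOC_FL_KEEP_SIZE preallocation one would expect
allocated_bytes() to grow while apparent_size() stays the same; after a
sparse extension it is the other way around.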