Fpart - README

What is fpart ?
***************

Fpart is a tool that helps you sort file trees and pack them into bags (called
"partitions"). It is developed in C and available under the BSD license.

It splits a list of directories and file trees into a certain number of
partitions, trying to produce partitions with the same size and number of files.
It can also produce partitions with a given number of files or a limited size.

Once generated, partitions are either printed as file lists to stdout (default)
or to files. Those lists can then be used by third party programs.
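As an illustration of handing a list to a third party program, here is a
minimal sketch that archives every path listed in one partition file with
tar(1). The function name and both file names are hypothetical; it assumes one
path per line and no newline characters inside file names.

```shell
# Archive every path listed in a partition file (one path per line).
# $1 = partition file produced by fpart, $2 = tar archive to create.
# Both arguments are placeholders chosen for this sketch.
archive_partition() {
    tar -cf "$2" -T "$1"
}

# Hypothetical usage:
#   archive_partition var-parts.0 var-parts.0.tar
```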

Fpart also includes a live mode, which allows it to crawl very large filesystems
and produce partitions on the fly. Hooks are available to act on those
partitions (e.g. immediately start a transfer using rsync(1)) without having to
wait for the filesystem traversal job to be finished. Used this way, fpart can
be seen as a powerful data migration tool.

Compatibility :
***************

Fpart has been successfully tested on :

* FreeBSD 7.x, 8.x, 9.x (i386, amd64)
* GNU/Linux (x86_64, arm)
* Solaris 9, 10 (Sparc, i386)
* NetBSD (amd64, alpha)
* Mac OS X (10.6, 10.8)

and will probably work on other operating systems too (*).

(*) fpart built as a static binary within a Debian (armel) chroot will give you
a powerful tool for backing-up your Android (arm) phone ;-)

Examples - common usage :
*************************

The following will produce 3 partitions, with (approximately) the same size
and number of files. Three files, "var-parts.[0-2]", are generated as output :

$ fpart -n 3 -o var-parts /var

$ ls var-parts*
var-parts.0 var-parts.1 var-parts.2

$ head -n 2 var-parts.0
/var/some/file1
/var/some/file2
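
To check how well balanced the generated lists are, the sizes of the files they
reference can be added up. The helper below is only a sketch (its name is
hypothetical); it assumes one path per line and silently skips entries that are
not regular files.

```shell
# Print the total size, in bytes, of the regular files listed in a
# partition file ($1). Missing or non-regular entries are skipped.
part_size() {
    total=0
    while IFS= read -r path; do
        [ -f "$path" ] || continue
        size=$(wc -c < "$path")
        total=$((total + size))
    done < "$1"
    echo "$total"
}

# Hypothetical usage:
#   for p in var-parts.*; do echo "$p: $(part_size "$p") bytes"; done
```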

The following will produce partitions of at most 4.3 GiB (4617089843 bytes),
containing music files ready to be burnt to a DVD (for example). Files
"music-parts.[0-n]" are generated as output :

$ fpart -s 4617089843 -o music-parts /path/to/my/music
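
The byte count passed to -s above is 4.3 x 1024^3, rounded down. A small helper
(hypothetical name) can compute such values; it takes the size in tenths of a
GiB to stay within integer arithmetic and assumes a shell with 64-bit
arithmetic.

```shell
# Convert a size given in tenths of a GiB to bytes (43 means 4.3 GiB).
gib_tenths_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 / 10 ))
}

gib_tenths_to_bytes 43   # prints 4617089843
```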

The following will produce partitions containing 10000 files each by examining
/usr first and then /home and display only partition 0 on stdout :

$ find /usr ! -type d | fpart -f 10000 -i - /home | grep '^0:'

The following will produce two partitions by re-using du(1) output. Fpart will
not examine the filesystem but instead re-use arbitrary values printed by du(1)
and sort them :

$ du * | fpart -n 2 -a

Examples - live mode :
**********************

By default, fpart will wait for FS crawling to terminate before generating and
displaying partitions. If you use the live mode (option -L), fpart will display
each partition as soon as it is complete. You can combine this option with
hooks; they will be triggered just before (pre-part hook, option -w) or after
(post-part hook, option -W) partitions' completion.

Hooks provide several environment variables (see fpart(1)); they are a
convenient way of getting information about the current state of fpart and of
the partition. For example, ${FPART_PARTFILENAME} will contain the name of the
output file of the partition that has just been generated; using that variable
within a post-part hook lets you start processing the files right after the
partition has been generated.

See the following example :

$ mkdir foo && touch foo/{bar,baz}
$ fpart -L -f 1 -o /tmp/part.out -W \
    'echo == ${FPART_PARTFILENAME} == ; cat ${FPART_PARTFILENAME}' foo/
== /tmp/part.out.0 ==
foo/bar
== /tmp/part.out.1 ==
foo/baz
2 file(s) found.

This example crawls foo/ in live mode (option -L). For each file (option -f,
1 file per partition), it generates a partition into /tmp/part.out.<n>
(option -o; <n> is the partition index and will be automatically added by fpart)
and executes the following post-part hook (option -W) :

echo == ${FPART_PARTFILENAME} == ; cat ${FPART_PARTFILENAME}

This hook will display the name of the current partition's output file, followed
by its contents.
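
A longer hook is easier to maintain as a shell function or script than as a
quoted one-liner. The sketch below (function name hypothetical) relies only on
the ${FPART_PARTFILENAME} variable described above; it reports how many entries
the partition file contains and aborts if the variable is not set.

```shell
# Post-part hook sketch: report the partition file name and entry count.
# FPART_PARTFILENAME is expected to be set by fpart when the hook runs.
post_part_hook() {
    part="${FPART_PARTFILENAME:?FPART_PARTFILENAME is not set}"
    count=$(wc -l < "$part")
    # $((count)) strips the padding some wc implementations add
    echo "== $part: $((count)) file(s) =="
}
```

Saved as an executable script that calls the function, it could be passed to
option -W in place of the inline command shown above.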

Examples - migrating data :
***************************

Here is a more complex example that will show you how to use fpart, GNU Parallel
and Rsync to split up a directory and immediately schedule data synchronization
of smaller lists of files, while FS crawling goes on. We will be synchronizing
data from /data/src to /data/dest.

First, go to the source directory (as rsync's --files-from option takes a file
list relative to its source directory) :

$ cd /data/src

Then, run fpart from here :

$ fpart -L -f 10000 -x '.snapshot' -x '.zfs' -Z -o /tmp/part.out -W \
  '/usr/local/bin/sem -j 3 \
    "/usr/local/bin/rsync -av --files-from=${FPART_PARTFILENAME} \
      /data/src/ /data/dest/"' .

This command will start fpart in live mode (option -L), making it generate
partitions during FS crawling. Fpart will produce partitions containing at most
10000 files each (option -f), will skip files and folders named '.snapshot' or
'.zfs' (option -x) and will list empty and non-accessible directories (option
-Z; that option is necessary when working with rsync to make sure the whole file
tree will be re-created within the destination directory). Last but not least,
each partition will be written to /tmp/part.out.<n> (option -o) and used within
the post-part hook (option -W), run immediately by fpart once the partition is
complete :

/usr/local/bin/sem -j 3
    "/usr/local/bin/rsync -av --files-from=${FPART_PARTFILENAME} /data/src/ /data/dest/"

This hook is itself a nested command. It will run GNU Parallel's sem scheduler
(any other scheduler will do) to run at most 3 rsync processes in parallel.

The scheduler will finally trigger the following command :

/usr/local/bin/rsync -av --files-from=${FPART_PARTFILENAME} /data/src/ /data/dest/

where ${FPART_PARTFILENAME} will be part of rsync's environment when it runs
and contain the file name of the partition that has just been generated.

That's all, folks ! Pretty simple, isn't it ?

In this example, FS crawling and data transfer are run from the same -local-
machine, but you can use it as the basis of a much more sophisticated solution :
at work, by using a cluster of machines connected to our filers through NFS and
running Open Grid Scheduler, we successfully migrated over 400 TB of data.

Note : you can launch several fpart runs using the above example to perform
incremental synchronisations. However, files deleted from the source directory
will not be removed from the destination unless rsync's --delete option is
used. Unfortunately, this option cannot be used with a list of files (files that
do not appear in the list are just ignored). To use the --delete option in
conjunction with fpart, you *have* to give rsync's --files-from option a list
of directories (only). Fpart's -d option can do the trick but it will probably
not generate partitions having the same size (if you choose to use that method
anyway, do not forget to use rsync's -r option too). Another - simpler - method
if you plan to perform several synchronisations is to use a single, final rsync
to delete extra files from the destination using the --delete option. This final
rsync pass will not transfer data but only delete files from the destination
directory.
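
The final clean-up pass described above could look like the sketch below. The
function name is hypothetical and the paths are the ones used in this example;
running rsync with --dry-run first is a sensible precaution, since --delete
removes files from the destination.

```shell
# Final pass: remove from the destination whatever no longer exists in
# the source. $1 = source directory, $2 = destination directory.
final_delete_pass() {
    src=$1
    dest=$2
    rsync -a --delete "$src/" "$dest/"
}

# Hypothetical usage, once all fpart-driven transfers have completed:
#   final_delete_pass /data/src /data/dest
```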

Limitations :
*************

Fpart will *NOT* modify data, it will *NOT* split your files !

As a consequence, if you have a directory containing several small files and a
huge one, it will be unable to produce partitions with the same size. Fpart does
magic, but not that much ;-)
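
To make this limitation concrete, here is a toy greedy balancer written with
sort(1) and awk(1). It is *NOT* fpart's actual algorithm, only an illustrative
sketch (hypothetical function name) : it reads "size name" lines, handles the
largest entries first and always adds the next entry to the lightest partition.

```shell
# Toy balancer: $1 = number of partitions, stdin = "size name" lines.
# Prints "<index>: <total size>" for each partition.
greedy_totals() {
    sort -rn | awk -v n="$1" '
        {
            best = 0
            for (i = 1; i < n; i++)
                if (total[i] < total[best]) best = i
            total[best] += $1
        }
        END { for (i = 0; i < n; i++) printf "%d: %d\n", i, total[i] }'
}

# One huge file and three small ones, split into 2 partitions:
printf '%s\n' '10 a' '1000 big' '10 b' '10 c' | greedy_totals 2
# prints:
# 0: 1000
# 1: 30
```

Since no file is ever split, the partition holding the huge file cannot be
balanced by the others.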

If you provide several paths to fpart, it will examine all of them. If those
paths overlap or if the same path is specified more than once, the same files
will appear more than once within generated partitions. This is not a bug :
fpart does not deduplicate FS crawling results.

Installing :
************

For FreeBSD users, fpart is already available in ports, see sysutils/fpart. You
can install fpart from there or from sources if you really need the latest
revision.

Installing from sources is simple. First, if there is no 'configure' script in
the main directory, run :

$ autoreconf -i

(autoreconf comes from the GNU autotools), then run :

$ ./configure
$ make

to configure and build fpart.
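
The steps above can be wrapped in a small function, shown here as a sketch
(hypothetical name). It assumes the GNU autotools and make(1) are installed and
takes the path of the source tree as its only argument.

```shell
# Configure and build from a source tree ($1). autoreconf is only run
# when no configure script is present. The body runs in a subshell so
# that the caller's working directory is left untouched.
build_from_source() (
    cd "$1" || exit 1
    [ -f configure ] || autoreconf -i || exit 1
    ./configure && make
)
```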

Finally, install fpart (as root) :

# make install

See also :
**********

See fpart(1) for more details.

The partition problem is detailed here :
http://en.wikipedia.org/wiki/Partition_problem

and here :
http://en.wikipedia.org/wiki/Bin_packing_problem

I am sure you will also be interested in :
https://github.com/jbd/packo

which was developed by Jean-Baptiste Denis as the original proof of concept.

Author / Licence :
******************

Fpart has been written by Ganaël LAPLANCHE <ganael.laplanche@martymac.org>
and is available under the BSD license (see COPYING for details).

Thanks to Jean-Baptiste Denis for having given me the idea of this program !

Contributions :
***************

FTS code comes from FreeBSD :
    lib/libc/gen/fts.c -> fts.c
    include/fts.h -> fts.h
and is available under the BSD license.