UBUMIRROR - an 'intelligent' APT repository mirror
--------------------------------------------------
Written by Jeff MacLoue <jeff@macloue.com>, released under the terms of the GPL.
==============
0. Quick Start
==============
!!! Please don't run this script as root: you may damage your system, your
files or your karma if you do.
To get a mirror of 32-bit Ubuntu Lucid Lynx which can be used for installation
and regular updates, put something like this in your crontab (note the shell
brace expansion):
    /usr/local/bin/ubumirror.pl -C lucid{,-updates,-security}/{main,restricted}
This will make a partial mirror of http://archive.ubuntu.com/ubuntu/ under
/var/www/htdocs/ubuntu/ so http://your-server/ubuntu/ can serve everything to
the network installer. The /var/spool/ubumirror/ directory is used for
temporary files during the process.
A more advanced example:
    ubumirror.pl -u http://ua.archive.ubuntu.com/ubuntu/ -C {lucid,natty}{,-updates,-security}/{main,restricted}
Note the trailing slash in the mirror URL; it is mandatory.
This will mirror both Lucid Lynx and Natty Narwhal from the Ukrainian Ubuntu
archive mirror.
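Brace expansion is done by the shell (bash, zsh) before ubumirror.pl starts, so
the script only ever sees a plain list of repositories. The sketch below, in
Python rather than shell, reproduces what the two command lines above expand
to; the helper name expand is made up for illustration. Keep in mind that cron
usually runs jobs through /bin/sh, which on some systems does not do brace
expansion, so in a crontab it may be safer to write the expanded list out or to
make sure the job runs under bash.

```python
from itertools import product

def expand(dists, suffixes, components):
    """Mimic bash expansion of {dists}{suffixes}/{components}, in bash's order."""
    return [f"{d}{s}/{c}" for d, s, c in product(dists, suffixes, components)]

# lucid{,-updates,-security}/{main,restricted}
quick_start = expand(["lucid"], ["", "-updates", "-security"],
                     ["main", "restricted"])

# {lucid,natty}{,-updates,-security}/{main,restricted}
advanced = expand(["lucid", "natty"], ["", "-updates", "-security"],
                  ["main", "restricted"])

print(" ".join(quick_start))
# prints: lucid/main lucid/restricted lucid-updates/main
#         lucid-updates/restricted lucid-security/main
#         lucid-security/restricted   (all on one line)
print(len(advanced), "repositories in the advanced example")
```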
============
1. The Task
============
The good thing about an APT repository is its package pool. Only a single copy
of each package file is stored at any time - this reduces the size of the
distribution site and makes the "continuous release" process easier.
The bad thing about an APT repository is, well, its package pool. Mirroring an
archive with several Ubuntu versions and all the universe/multiverse stuff can
be a real pain if you are low on (or, okay, greedy with) disk space or have a
slow Internet connection.
What's the point of rsyncing the entire archive if you have only Lucid Lynx in
your office? What's the point of mirroring the whole multiverse swamp if you
only need one or two packages from it? So the task is rather simple: get only
the files you absolutely need.
This sounds like an easy task: retrieve the package index files for the
distributions and repositories you need, parse them for references to the
package pool, and retrieve the referenced packages. That's roughly all the
ubumirror.pl script does, plus removing package files that are not referenced
in the indices.
Of course, there is little to no point in doing this for one or two computers;
it is perfectly possible to live a good life without a local repository mirror
at all. But in a remote office with cheap ADSL and ten or fifteen Ubuntu
workstations, distribution updates and new workstation installations get
complicated. A laptop with a partial archive mirror is a real life-saver there.
===================
1a. Why ubumirror?
===================
There is a ready-made (and maybe even official) solution, apt-mirror
(http://apt-mirror.sourceforge.net). I tried it, but it's not what I need: by
default it tends to invoke multiple wget instances (which, er, is rather nice,
but I don't like it), and it doesn't mirror the .udeb files used for system
installation. So ubumirror.pl tries to be simpler and smarter at the same
time.
To sum up, ubumirror is simple and targeted specifically at Ubuntu, while
apt-mirror is a more advanced solution targeted at Debian.
=============
2. The Means
=============
The script uses wget for all network operations. There is little use for
rsync, as you need individual files rather than complete directories, and
there is no point in doing massive network operations in Perl.
The Getopt::Std, File::Find, URI and IO::Zlib modules are used. They may or
may not come with your distribution (they do with the Slackware 13.37 I use,
and they are available for CentOS 5 as separate packages).
cp from GNU coreutils is used for recursive file copying.
In theory it is possible to port all this to a non-UNIX platform; patches are
welcome.
========================
3. Command-Line Options
========================
    ubumirror.pl [-OPTIONS [-MORE_OPTIONS]] [--] REPOSITORY ...
The standard --help and --version options are recognized.
The following single-character options are accepted:
    -a <arch>   Architecture (default i386)
    -u <url>    Base repository URL (default http://archive.ubuntu.com/ubuntu/)
    -s <path>   Spool directory to store work files (default /var/spool/ubumirror)
    -d <path>   Directory to store repository mirror (default /var/www/htdocs/ubuntu)
    -w <cmd>    How to invoke wget (default /usr/bin/wget)
    -C          Don't do pool cleanup
    -v          Verbose output
    -D          Debug output
    --          Stop option processing
You need to specify at least one repository to proceed.
The repositories are specified as <distribution>/<repo>, e.g., lucid/main. As
mentioned above, it's best to use shell expansion capabilities to keep the
command line shorter.
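To make the option handling concrete, here is a minimal sketch of an equivalent
parser. It is written in Python with the standard getopt module, not taken from
the Perl script (which uses Getopt::Std); the function name parse_args and the
error messages are made up, and the explicit checks for the repository format
and the trailing slash are assumptions about what a careful parser would do.

```python
import getopt

# Defaults as listed in the option table above.
DEFAULTS = {
    "-a": "i386",
    "-u": "http://archive.ubuntu.com/ubuntu/",
    "-s": "/var/spool/ubumirror",
    "-d": "/var/www/htdocs/ubuntu",
    "-w": "/usr/bin/wget",
}

def parse_args(argv):
    """Return (settings, flags, repositories) for a ubumirror-style command line."""
    # A colon after a letter means the option takes a value; "--" ends options.
    opts, repos = getopt.getopt(argv, "a:u:s:d:w:CvD")
    settings = dict(DEFAULTS)
    flags = set()
    for opt, value in opts:
        if opt in settings:
            settings[opt] = value
        else:
            flags.add(opt)  # value-less switches: -C, -v, -D
    if not repos:
        raise SystemExit("at least one repository is required")
    for repo in repos:
        dist, sep, component = repo.partition("/")
        if not (dist and sep and component):
            raise SystemExit(f"bad repository '{repo}', expected <distribution>/<repo>")
    if not settings["-u"].endswith("/"):
        raise SystemExit("the base URL must end with a slash")
    return settings, flags, repos

settings, flags, repos = parse_args(
    ["-a", "amd64", "-v", "--", "lucid/main", "lucid-updates/main"]
)
print(settings["-a"], sorted(flags), repos)
```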
=================
4. The Operation
=================
ubumirror starts by changing to the spool directory. It then invokes wget to
retrieve the package indices and the Release/Release.gpg files that APT uses.
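As an illustration of which files this first wget pass is after, the sketch
below builds the URLs for a list of <distribution>/<repo> arguments. It assumes
the usual APT archive layout (dists/<distribution>/Release and
dists/<distribution>/<repo>/binary-<arch>/Packages.gz); whether ubumirror.pl
builds its URLs exactly this way is an assumption, the function name index_urls
is made up, and the installer (.udeb) indices are left out. With plain string
concatenation like this, a base URL without the trailing slash would produce
broken URLs.

```python
def index_urls(base_url, arch, repos):
    """List Release files and Packages.gz indices for <distribution>/<repo> pairs."""
    urls = []
    seen_dists = set()
    for repo in repos:
        dist, component = repo.split("/", 1)
        if dist not in seen_dists:
            # Release and Release.gpg exist once per distribution.
            seen_dists.add(dist)
            urls.append(f"{base_url}dists/{dist}/Release")
            urls.append(f"{base_url}dists/{dist}/Release.gpg")
        urls.append(f"{base_url}dists/{dist}/{component}/binary-{arch}/Packages.gz")
    return urls

urls = index_urls("http://archive.ubuntu.com/ubuntu/", "i386",
                  ["lucid/main", "lucid/restricted"])
for url in urls:
    print(url)
```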
For the package indices, Packages.gz is used, as it is considerably smaller
than the uncompressed Packages file and still readable with the standard
IO::Zlib Perl module.
The retrieved Packages.gz files are then parsed to get all the Filename: and
Size: line pairs. A hash, %Pool, is built from this information.
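The parsing step is simple enough to sketch. The example below is Python, not
the script's Perl, and stands in for the %Pool hash with a dict that maps each
Filename: value to its Size: value. The function names and the two sample
stanzas (package versions and sizes included) are made up for illustration.

```python
import gzip
import io

def parse_packages(stream):
    """Map each Filename: value to its Size: value from a Packages text stream."""
    pool = {}
    filename = size = None
    for line in stream:
        line = line.rstrip("\n")
        if line.startswith("Filename:"):
            filename = line.split(":", 1)[1].strip()
        elif line.startswith("Size:"):
            # "Installed-Size:" does not match, since it starts differently.
            size = int(line.split(":", 1)[1])
        elif line == "":
            # A blank line ends a package stanza.
            if filename is not None and size is not None:
                pool[filename] = size
            filename = size = None
    if filename is not None and size is not None:
        pool[filename] = size  # the last stanza may lack a trailing blank line
    return pool

def parse_packages_gz(path_or_file):
    """Read a gzip-compressed Packages index from a path or a binary file object."""
    with gzip.open(path_or_file, "rt", encoding="utf-8") as stream:
        return parse_packages(stream)

# Made-up sample data, compressed in memory so the example needs no network.
sample = (
    "Package: hello\n"
    "Installed-Size: 100\n"
    "Filename: pool/main/h/hello/hello_2.4-3_i386.deb\n"
    "Size: 27634\n"
    "\n"
    "Package: tree\n"
    "Filename: pool/main/t/tree/tree_1.5.3-1_i386.deb\n"
    "Size: 30208\n"
)
pool = parse_packages_gz(io.BytesIO(gzip.compress(sample.encode("utf-8"))))
for name, size in pool.items():
    print(size, name)
```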
Next, the contents of the /var/www/htdocs/ubuntu/pool/ directory are compared
with %Pool: if a package isn't there or has a different size, it is marked for
retrieval.
wget is then invoked again to retrieve the marked packages.
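The comparison that decides what has to be fetched can be sketched as follows,
again in Python and with a plain dict standing in for %Pool. The function name
files_to_fetch and the file names in the demo are hypothetical; the demo runs
in a temporary directory so it touches nothing real.

```python
import os
import tempfile

def files_to_fetch(mirror_dir, pool):
    """Return pool paths that are missing under mirror_dir or differ in size."""
    wanted = []
    for relpath, size in sorted(pool.items()):
        local = os.path.join(mirror_dir, relpath)
        if not os.path.isfile(local) or os.path.getsize(local) != size:
            wanted.append(relpath)
    return wanted

with tempfile.TemporaryDirectory() as mirror:
    os.makedirs(os.path.join(mirror, "pool", "main", "a"))
    # One file with the expected size, one truncated file.
    with open(os.path.join(mirror, "pool", "main", "a", "ok.deb"), "wb") as f:
        f.write(b"x" * 10)
    with open(os.path.join(mirror, "pool", "main", "a", "short.deb"), "wb") as f:
        f.write(b"x" * 3)
    pool = {
        "pool/main/a/ok.deb": 10,
        "pool/main/a/short.deb": 10,
        "pool/main/a/missing.deb": 10,
    }
    todo = files_to_fetch(mirror, pool)

print(todo)
# prints: ['pool/main/a/missing.deb', 'pool/main/a/short.deb']
```

Comparing sizes only is cheap but will not notice a corrupted file of the right
length; checking the checksums from the index would catch that at the cost of
reading every file.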
Finally, once all the files have been retrieved, the indices and other files
from the spool directory are copied over /var/www/htdocs/ubuntu/dists/ so that
the mirror ends up consistent. For the sake of simplicity, and unlike
apt-mirror, ubumirror is not very concerned with keeping the mirror consistent
at every moment.
Optionally, if -C is specified on the command line, the pool/ directory is
scanned again and everything not referenced in the package indices is deleted
to save space. Please use with caution.
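A minimal sketch of such a cleanup pass, assuming the same dict of referenced
pool paths as a stand-in for %Pool. The function names and the dry_run switch
are additions for illustration and are not options of ubumirror.pl; listing the
candidates before deleting anything is one way to follow the "use with caution"
advice.

```python
import os
import tempfile

def unreferenced_files(mirror_dir, pool):
    """Return files under mirror_dir/pool that no package index references."""
    stale = []
    for dirpath, _dirnames, filenames in os.walk(os.path.join(mirror_dir, "pool")):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Index entries look like "pool/main/h/hello/hello.deb".
            rel = os.path.relpath(full, mirror_dir).replace(os.sep, "/")
            if rel not in pool:
                stale.append(full)
    return sorted(stale)

def clean_pool(mirror_dir, pool, dry_run=True):
    """List stale files and, unless dry_run is set, delete them."""
    stale = unreferenced_files(mirror_dir, pool)
    if not dry_run:
        for path in stale:
            os.remove(path)
    return stale

with tempfile.TemporaryDirectory() as mirror:
    pkg_dir = os.path.join(mirror, "pool", "main", "a")
    os.makedirs(pkg_dir)
    for name in ("keep.deb", "old.deb"):
        with open(os.path.join(pkg_dir, name), "wb") as f:
            f.write(b"x")
    pool = {"pool/main/a/keep.deb": 1}
    removed = clean_pool(mirror, pool, dry_run=False)
    left = sorted(os.listdir(pkg_dir))

print(left)
# prints: ['keep.deb']
```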
============================
5. BUGS, TODOs, Suggestions
============================
Probably over 9000. This is an initial public release; its quality should be
considered alpha.
Bug reports and patches can be submitted at SourceForge or sent to me at
jeff@macloue.com.