apt-got builds and supervises a partial (or full) Debian mirror that is filled on the fly by apt-get requests. But there's more! Its modular mirror engine is ready for customized mirroring algorithms, so you can easily write your own module (e.g. for apt4rpm).
WebSPHINX is a Java class library for web crawlers (robots, spiders), originally developed by Robert Miller of Carnegie Mellon University. Features include multithreading, tolerant HTML parsing, URL filtering and page classification, pattern matching, mirroring, and more.
JoBo is a web site mirroring tool. It has a graphical UI, but there is also a command-line version. It supports the robot exclusion protocol (though this can be disabled).
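
For readers unfamiliar with the robot exclusion protocol mentioned above, here is a minimal sketch of the kind of check a mirroring tool performs before fetching a page. It is not JoBo's code: it uses Python's standard urllib.robotparser module, and the robots.txt content, the user agent name "ExampleMirror", the is_allowed helper and its honor_robots switch are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real mirroring tool would download
# this from the target site's /robots.txt instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(url, user_agent="ExampleMirror", honor_robots=True):
    """Return True if `url` may be fetched under the rules in ROBOTS_TXT."""
    if not honor_robots:
        # Exclusion checking switched off: every URL is treated as allowed.
        return True
    parser = RobotFileParser()
    # parse() accepts the robots.txt file as an iterable of lines.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for candidate in (
        "http://example.com/index.html",
        "http://example.com/private/data.html",
    ):
        print(candidate, is_allowed(candidate))
```

Passing honor_robots=False mimics the "can be disabled" option: the rules are skipped entirely and every URL is fetched.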