From: raffaele m. <raf...@at...> - 2011-10-26 13:47:04
I learned yesterday of the existence of wget-warc (great job by @archiveteam):

http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
https://github.com/alard/wget-warc/

The git trunk doesn't compile for me, but this tarball works fine:

https://github.com/downloads/alard/wget-warc/wget-warc-20111017.tar.bz2

Has anyone here tested it? I'm doing a small crawl right now, and I'll test it in Wayback soon.

USAGE:

$ /opt/wget-warc/bin/wget --help | grep warc
  --warc-file=FILENAME        save request/response data to a .warc.gz file.
  --warc-header=STRING        insert STRING into the warcinfo record.
  --warc-max-size=NUMBER      set maximum size of WARC files to NUMBER.
  --warc-cdx                  write CDX index files.
  --warc-dedup=FILENAME       do not store records listed in this CDX file.
  --no-warc-compression       do not compress WARC files with GZIP.
  --no-warc-digests           do not calculate SHA1 digests.
  --no-warc-keep-log          do not store the log file in a WARC record.
  --warc-tempdir=DIRECTORY    location for temporary files created by the

greets
-- 
raf...@at...