nutch-content-exporter Code

Exporting crawled content from Nutch

Brought to you by: habernal

File	Date	Author	Commit
src	2015-03-11	habernal	[6bd1a9] gz -> bz2 (better compression)
LICENSE.txt	2015-01-12	habernal	[fa22e1] Info, small changes, licence, shade jar
README.txt	2015-01-12	habernal	[0c0659] Max file name lenght to 255
pom.xml	2015-03-10	habernal	[1edce4] Exporting Nutch segment into gzipped WARC

Read Me

Nutch Content Exporter - a simple command-line Java program for exporting HTML pages crawled by
Apache Nutch to the file system.

Copyright (c) 2015 Ivan Habernal

Usage:
$ mvn package
$ java -jar target/nutchcontentexporter-1.0-SNAPSHOT.jar segment-dir output-dir

For example:

$ java -jar target/nutchcontentexporter-1.0-SNAPSHOT.jar /tmp/crawl/20150109134429/ /tmp/outhtml

where the input folder is a Nutch segment directory:
$ tree /tmp/crawl/
/tmp/crawl/
└── 20150109134429
    ├── content
    │   └── part-00000
    │       ├── data
    │       └── index
    ├── crawl_fetch
    │   └── part-00000
    │       ├── data
    │       └── index
    ├── crawl_generate
    │   └── part-00000
    ├── crawl_parse
    │   └── part-00000
    ├── parse_data
    │   └── part-00000
    │       ├── data
    │       └── index
    └── parse_text
        └── part-00000
            ├── data
            └── index


Each output file is named after the original URL, with the scheme prefix ("http://") dropped and
every slash ("/") replaced by three underscores ("___"), e.g.

http://www.example.com/test.html -> www.example.com___test.html
http://www.example.com/test/ -> www.example.com___test___
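
The naming rule above can be sketched in a few lines of Java. This is only an illustration
inferred from the two examples, not the exporter's actual source: the class name FileNameSketch
and the method toFileName are hypothetical, and stripping any scheme (not just "http://") is an
assumption.

```java
// Hypothetical sketch of the URL-to-file-name rule; not taken from the project's code.
class FileNameSketch {

    // Drops the scheme prefix, then replaces every "/" with "___".
    static String toFileName(String url) {
        // Assumption: any leading scheme such as "http://" or "https://" is removed,
        // as the README examples suggest for "http://".
        String withoutScheme = url.replaceFirst("^[a-zA-Z][a-zA-Z0-9+.-]*://", "");
        // String.replace substitutes every occurrence of the literal "/".
        return withoutScheme.replace("/", "___");
    }

    public static void main(String[] args) {
        System.out.println(toFileName("http://www.example.com/test.html"));
        System.out.println(toFileName("http://www.example.com/test/"));
    }
}
```

Running main prints the two file names listed in the examples above. Other details of the real
tool, such as any limit on file name length, are not covered by this sketch.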
