File         Date        Author    Commit
src          2015-03-11  habernal  [6bd1a9] gz -> bz2 (better compression)
LICENSE.txt  2015-01-12  habernal  [fa22e1] Info, small changes, licence, shade jar
README.txt   2015-01-12  habernal  [0c0659] Max file name lenght to 255
pom.xml      2015-03-10  habernal  [1edce4] Exporting Nutch segment into gzipped WARC

Read Me

Nutch Content Exporter - a simple command-line Java program that exports HTML pages crawled by
Apache Nutch to the file system.

Copyright (c) 2015 Ivan Habernal

Usage:
$ mvn package
$ java -jar target/nutchcontentexporter-1.0-SNAPSHOT.jar segment-dir output-dir

For example:

$ java -jar target/nutchcontentexporter-1.0-SNAPSHOT.jar /tmp/crawl/20150109134429/ /tmp/outhtml

where the input folder (here /tmp/crawl/20150109134429/) is a Nutch segment:
$ tree /tmp/crawl/
/tmp/crawl/
└── 20150109134429
    ├── content
    │   └── part-00000
    │       ├── data
    │       └── index
    ├── crawl_fetch
    │   └── part-00000
    │       ├── data
    │       └── index
    ├── crawl_generate
    │   └── part-00000
    ├── crawl_parse
    │   └── part-00000
    ├── parse_data
    │   └── part-00000
    │       ├── data
    │       └── index
    └── parse_text
        └── part-00000
            ├── data
            └── index


Each output file is named after the original URL, with the protocol prefix ("http://") dropped
and every slash ("/") replaced by three underscores ("___"), e.g.

http://www.example.com/test.html -> www.example.com___test.html
http://www.example.com/test/ -> www.example.com___test___
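
The naming rule above can be sketched in a few lines of Java. This is only an illustration
and not the exporter's actual code: the class name UrlFileNames, the method toFileName and the
truncation to 255 characters are assumptions (the limit is only suggested by the
"Max file name lenght to 255" commit message; how the tool really enforces it is not documented here).

```java
// Hypothetical helper, not taken from the project sources.
final class UrlFileNames {

    // Assumed limit, suggested only by the "Max file name lenght to 255" commit message.
    private static final int MAX_LENGTH = 255;

    private UrlFileNames() {
    }

    // Maps a URL to a flat file name following the rule described in this README.
    static String toFileName(String url) {
        // Drop the scheme prefix such as "http://" or "https://"
        String withoutScheme = url.replaceFirst("^[a-zA-Z][a-zA-Z0-9+.-]*://", "");
        // Replace every slash with three underscores
        String name = withoutScheme.replace("/", "___");
        // Assumption: over-long names are simply cut off at the limit
        return name.length() > MAX_LENGTH ? name.substring(0, MAX_LENGTH) : name;
    }

    public static void main(String[] args) {
        System.out.println(toFileName("http://www.example.com/test.html"));
        System.out.println(toFileName("http://www.example.com/test/"));
    }
}
```

Running the sketch prints the two file names from the examples above. Note that plain truncation
could map two long URLs to the same name; a real implementation may need to handle that case.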