crawlzilla Wiki

Brought to you by: goldjay1231, jazzwang, shunfa, waue0920

Usage

Crawlzilla has two management modes:

Dialog management: offers low-level node management, such as (1) checking cluster status, (2) datanode & tasktracker node management, (3) datanode & tasktracker management, (4) Tomcat management, and (5) changing the Tomcat port.
Web interface management: offers (1) crawl setup, (2) search engine management, and (3) index pool management.

Crawlzilla usage procedure:
[Figure: crawlzilla_usage]

$ /home/crawler/crawlzilla/system/crawlzilla

[Figure: cluster_status]

Enable all nodes to run datanode & tasktracker.

[Figure: cluster_setup]

[Figure: choose_node_part]

[Figure: cluster_status]

When you first log in to the web interface, you need to change the administrator password.

[Figure: crawlzilla_usage]

Go to the "crawl page" and enter three parameters:

Index Pool name: a name that identifies this search engine and its index pool
Crawl URLs: the URLs you want to crawl (e.g. https://sourceforge.net/p/crawlzilla/wiki)
Crawl depth: how many link levels to follow from these URLs
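To make the three parameters more concrete, here is a minimal Python sketch of how a crawl depth limits a breadth-first crawl from the seed URLs. It is not Crawlzilla's code: the `LINKS` graph, the `validate_params` and `crawl` functions, and the validation rules are all hypothetical. It also assumes that depth 1 means "fetch only the seed URLs"; Crawlzilla may count depth differently.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory link graph standing in for real web pages.
LINKS = {
    "https://example.org/": ["https://example.org/a", "https://example.org/b"],
    "https://example.org/a": ["https://example.org/c"],
    "https://example.org/b": [],
    "https://example.org/c": ["https://example.org/d"],
    "https://example.org/d": [],
}


def validate_params(pool_name, urls, depth):
    """Check the three crawl-page inputs (illustrative rules only)."""
    if not pool_name.strip():
        raise ValueError("index pool name must not be empty")
    for url in urls:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url}")
    if depth < 1:
        raise ValueError("crawl depth must be at least 1")


def crawl(seed_urls, depth, links=LINKS):
    """Breadth-first walk: depth 1 visits only the seeds, depth 2 adds their links, and so on."""
    seen = set(seed_urls)
    queue = deque((url, 1) for url in seed_urls)
    order = []
    while queue:
        url, level = queue.popleft()
        order.append(url)
        if level == depth:
            continue  # do not follow links beyond the requested depth
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, level + 1))
    return order


if __name__ == "__main__":
    seeds = ["https://example.org/"]
    validate_params("demo_pool", seeds, 2)
    for page in crawl(seeds, 2):
        print(page)
```

In this sketch, each extra level of depth adds the pages linked from the previous level, so the number of pages visited can grow quickly. Starting with a small depth is a reasonable way to test a new index pool.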

[Figure: crawlzilla_usage]