Usage

Crawlzilla has two management modes:
1. Dialog management: offers low-level node management, such as (1) checking cluster status (2) datanode & tasktracker node management (3) datanode & tasktracker management (4) Tomcat management (5) changing the Tomcat port.
2. Web interface management: offers (1) crawl setup (2) search engine management (3) index pool management.

Crawlzilla usage procedure:
[Figure: crawlzilla_usage]


1. First Usage

1.1 Run crawlzilla to update hosts information

$ /home/crawler/crawlzilla/system/crawlzilla

1.2 Check cluster status

[Figure: cluster_status]

1.3 Setup cluster

Enable all nodes to run datanode & tasktracker.

[Figure: cluster_setup]

[Figure: choose_node_part]

1.4 Check the datanode & tasktracker status of all nodes

[Figure: cluster_status]
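
The dialog reports this status for you. As a rough command-line cross-check, the sketch below assumes that each node runs the standard Hadoop daemons under the JVM process names DataNode and TaskTracker, and that the JDK's `jps` tool is available on the node. The function name `check_daemons` is hypothetical and is not part of Crawlzilla.

```shell
#!/bin/sh
# Hypothetical helper, not part of Crawlzilla.
# Reads `jps`-style output ("<pid> <ClassName>" per line) on stdin and
# reports whether the DataNode and TaskTracker daemons appear in it.
# Returns 0 only if both daemons are listed.
check_daemons() {
    input=$(cat)
    rc=0
    for daemon in DataNode TaskTracker; do
        # Compare the second column (the class name) as a whole line.
        if printf '%s\n' "$input" | awk '{print $2}' | grep -qx "$daemon"; then
            echo "$daemon: running"
        else
            echo "$daemon: not running"
            rc=1
        fi
    done
    return "$rc"
}
```

On a node, this could be run as `jps | check_daemons`, using the same user account that started the daemons.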


2. Crawlzilla Web Usage

2.1 Go to http://{master_IP}:8080/crawlzilla/

The first time you log in to the web interface, you must change the administrator password.
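
Before opening a browser, it can help to confirm that the page answers. The sketch below builds the URL from step 2.1 and probes it with `curl`. The function names are hypothetical, and the optional port argument is an assumption added for installations where the Tomcat port was changed through the dialog.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Crawlzilla.

# Prints the web interface URL for a master IP or host name.
# $1 = master IP or host name, $2 = Tomcat port (defaults to 8080).
crawlzilla_url() {
    printf 'http://%s:%s/crawlzilla/\n' "$1" "${2:-8080}"
}

# Succeeds if the page returns a non-error HTTP response within 5 seconds.
# Requires curl to be installed.
crawlzilla_reachable() {
    curl --silent --fail --max-time 5 --output /dev/null \
        "$(crawlzilla_url "$1" "$2")"
}
```

For example, `crawlzilla_reachable 192.168.1.10 && echo up` prints `up` only if the interface responds at that address, where `192.168.1.10` stands in for your own master IP.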

2.2 Crawl setup

[Figure: crawlzilla_usage]

Go to the "crawl page" and enter three parameters:

1. Index Pool name: identifies this search engine and its index pool

2. Crawl URLs: the URLs you want to crawl (e.g. https://sourceforge.net/p/crawlzilla/wiki)

3. Crawl depth: the depth to crawl from these URLs
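
The three parameters are entered through the web form. As an illustration only, the sketch below checks a set of values before you type them in. The function `validate_crawl` is hypothetical, and its rules (non-empty pool name, positive integer depth, http or https URLs) are assumptions rather than Crawlzilla's documented constraints.

```shell
#!/bin/sh
# Hypothetical pre-check, not part of Crawlzilla. The rules below are
# assumptions chosen for illustration.
# Usage: validate_crawl <pool_name> <depth> <url>...
validate_crawl() {
    if [ "$#" -lt 3 ]; then
        echo "usage: validate_crawl <pool_name> <depth> <url>..." >&2
        return 1
    fi
    name=$1
    depth=$2
    shift 2   # the remaining arguments are the crawl URLs

    if [ -z "$name" ]; then
        echo "index pool name must not be empty" >&2
        return 1
    fi
    # Reject empty or non-numeric depth first, then require depth > 0.
    case $depth in
        ''|*[!0-9]*) echo "crawl depth must be a positive integer" >&2; return 1 ;;
    esac
    if [ "$depth" -le 0 ]; then
        echo "crawl depth must be a positive integer" >&2
        return 1
    fi
    for url in "$@"; do
        case $url in
            http://?*|https://?*) ;;
            *) echo "not an http(s) URL: $url" >&2; return 1 ;;
        esac
    done
    echo "ok: pool=$name depth=$depth urls=$#"
}
```

Calling `validate_crawl mypool 3 https://sourceforge.net/p/crawlzilla/wiki` prints a one-line summary, while an invalid depth or URL produces an error message and a non-zero exit status.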

2.3 Check crawl status

[Figure: crawlzilla_usage]

2.4 Use the search engine

2.5 Embed the search engine in another page