Usage

Crawlzilla has two management modes:
1. Dialog management: offers low-level node management, such as: (1) check cluster status (2) datanode & tasktracker node management (3) datanode & tasktracker management (4) Tomcat management (5) change Tomcat port.
2. Web interface management: offers (1) crawl setup (2) search engine management (3) index pool management.

Crawlzilla usage procedure:
[Image: crawlzilla_usage]


1. First Usage

1.1 Run crawlzilla to update the hosts information

$ /home/crawler/crawlzilla/system/crawlzilla

1.2 Check cluster status

[Image: cluster_status]

1.3 Set up the cluster

Enable all nodes to run datanode and tasktracker.

[Image: cluster_setup]

[Image: choose_node_part]

1.4 Check the datanode and tasktracker status of all nodes

[Image: cluster_status]


2. Crawlzilla Web Usage

2.1 Go to http://{master_IP}:8080/crawlzilla/

The first time you log in to the web interface, you must change the administrator password.
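
Before opening a browser, it can help to assemble the address from step 2.1 and probe it from a shell. The following is a minimal sketch, not part of Crawlzilla: `crawlzilla_url` and `check_crawlzilla` are hypothetical helper names, `192.168.1.10` is a placeholder address, and the optional port argument assumes the Tomcat port may have been changed from 8080 through dialog management.

```shell
#!/bin/sh
# Sketch only: crawlzilla_url and check_crawlzilla are hypothetical helpers,
# not commands shipped with Crawlzilla.

# Build the web interface URL from the master IP and an optional port
# (defaults to 8080, the port shown in step 2.1).
crawlzilla_url() {
  master_ip=$1
  port=${2:-8080}
  printf 'http://%s:%s/crawlzilla/\n' "$master_ip" "$port"
}

# Return success if the page answers without an HTTP error status.
# Requires curl and network access to the master node.
check_crawlzilla() {
  curl --silent --fail --max-time 5 --output /dev/null "$(crawlzilla_url "$@")"
}

# Example (192.168.1.10 is a placeholder address):
#   crawlzilla_url 192.168.1.10
#   check_crawlzilla 192.168.1.10 && echo "web interface is up"
```

If the probe fails, checking the Tomcat status in dialog management is a reasonable next step.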

2.2 Crawl setup

[Image: crawlzilla_usage]

Go to the "crawl page" and enter 3 parameters:

1. Index pool name: identifies this search engine and its index pool

2. Crawl URLs: the URLs you want to crawl (e.g. https://sourceforge.net/p/crawlzilla/wiki)

3. Crawl depth: the depth to crawl from these URLs
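
The three parameters above can be sanity-checked in a shell before they are typed into the form. This is a sketch under stated assumptions, not Crawlzilla's own validation: `validate_crawl_params` is a hypothetical helper, and the rules it applies (non-empty name, depth as a positive integer, URLs starting with `http://` or `https://`) are guesses about what the crawl page expects.

```shell
#!/bin/sh
# Sketch only: validate_crawl_params is a hypothetical helper; its checks are
# assumptions, not rules taken from Crawlzilla.
# Usage: validate_crawl_params POOL_NAME DEPTH URL [URL...]

validate_crawl_params() {
  [ "$#" -ge 2 ] || { echo "usage: validate_crawl_params NAME DEPTH URL..." >&2; return 1; }
  pool_name=$1
  depth=$2
  shift 2

  # Index pool name: must not be empty.
  [ -n "$pool_name" ] || { echo "index pool name is empty" >&2; return 1; }

  # Crawl depth: assumed to be a positive integer.
  case $depth in
    ''|*[!0-9]*) echo "depth must be a positive integer" >&2; return 1 ;;
  esac
  [ "$depth" -ge 1 ] || { echo "depth must be at least 1" >&2; return 1; }

  # Crawl URLs: at least one, each starting with http:// or https://.
  [ "$#" -ge 1 ] || { echo "no URLs given" >&2; return 1; }
  for url in "$@"; do
    case $url in
      http://?*|https://?*) ;;
      *) echo "not an http(s) URL: $url" >&2; return 1 ;;
    esac
  done
  return 0
}

# Example:
#   validate_crawl_params wiki 3 https://sourceforge.net/p/crawlzilla/wiki && echo "ok"
```

The function only reports the first problem it finds and exits with a non-zero status, which keeps it easy to use in a script that prepares several crawls.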

2.3 Check crawl status

[Image: crawlzilla_usage]

2.4 Use the search engine

2.5 Embed the search engine in another page

