///////////////-------------------------------------------------\\\\\\\\\\\\\\\\
<<<<<<<<<<<<<<	WIRE-Nic                                         >>>>>>>>>>>>>>>
\\\\\\\\\\\\\\\-------------------------------------------------////////////////

WIRE-Nic is a fork of the open source project WIRE (Web Information Retrieval
Environment), which was developed by the Center for Web Research at the
University of Chile. More information about the original project can be found at
http://www.cwr.cl/projects/WIRE/

This fork brings some bug fixes and modifications to the original system.


<<<<<<<<<<<<<<	Modifications                                    >>>>>>>>>>>>>>>

The main modifications that were made are:

	1 - The way it handles redirects. In the original project, all redirects
	were handled like any other link. In this fork, the redirected pages are
	downloaded even if they do not belong to the list of domains configured
	for the execution.

	2 - The way the pages are stored. In the original project, parsed pages
	were stored in big files indexed by the software. In this fork, the pages
	are also stored directly on the file system in a hierarchical folder
	structure that mirrors the structure of the sites. The folder is called
	"sites" and is placed in the execution workspace folder.
	This modification added the variable "config/collection/sites-per-folder"
	to the configuration file. It sets the maximum number of site folders
	that can be placed in the same folder of the hierarchical structure.

	3 - The way the list of domains to download is read. In the original
	project, the list of accepted domains was placed directly in the
	variable "seeder/accept/domain-suffixes" in the configuration file. In
	this fork, the list is placed in an external file whose path is
	configured in this same variable.
	
	4 - Added support for HTTP/1.1 chunked transfer encoding.

	5 - In this fork, the character encoding declared in the HTML file is
	loaded.

	6 - In this fork, all links are normalized following the RFC 3986 rules
	before being registered.

	7 - The exported files were modified. In this fork, the names of the page
	and site are included as fields in the exported CSV files.

	8 - The way the www prefix is handled. In the original project, if a site
	URL was found both with and without the www prefix, only the first form
	found was registered. In this fork, the URL is tested before being
	registered, and if one of the forms isn't accessible, the other form is
	registered.
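
The chunked transfer encoding mentioned in item 4 frames the response body as
a series of chunks, each preceded by its size in hexadecimal, and ends with a
zero-size chunk. The Python sketch below only illustrates that framing. It is
not the code used by WIRE-Nic; the function name decode_chunked is
hypothetical, and the sketch assumes the whole body is already in memory and
ignores any trailer headers.

```python
def decode_chunked(raw: bytes) -> bytes:
    """Decode a complete chunked HTTP/1.1 body into the plain payload."""
    body = bytearray()
    pos = 0
    while True:
        # Each chunk starts with "<hex size>[;extensions]\r\n".
        line_end = raw.index(b"\r\n", pos)
        size_field = raw[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = line_end + 2
        if size == 0:
            # A zero-size chunk marks the end of the body.
            break
        body += raw[pos:pos + size]
        # Skip the chunk data and the CRLF that follows it.
        pos += size + 2
    return bytes(body)
```

For example, decode_chunked(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n") returns
b"Wikipedia".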

<<<<<<<<<<<<<<	Installation Guide                               >>>>>>>>>>>>>>>

This installation guide is based on the original documentation of the WIRE
project, which can be found at: http://www.cwr.cl/projects/WIRE/doc/
However, its configuration file must be replaced by the one provided by this
project.

Below is a simple installation guide that was tested on an
Ubuntu Server 10.04 LTS machine:


---> REQUIREMENTS <---
gcc: 
	$ apt-get install gcc

adns:
	download the latest version from:
	http://www.chiark.greenend.org.uk/~ian/adns/

	installation:
	$ tar -xzf adns.tar.gz
	$ cd adns-1.2
	$ ./configure
	$ make
	$ make install
	
xml2:
	$ apt-get install xml2

swish-e:
	$ apt-get install swish-e

LaTeX:
	$ apt-get install texlive-latex-base texlive-latex-extra

gnuplot:
	$ apt-get install gnuplot

Perl5:
	$ apt-get install perl
	install the XML::LibXML Perl module:
	$ cpan
	if asked about configuration, just confirm the automatic configuration
	at the cpan prompt:
	> install XML::LibXML
	> quit

djbdns:
	$ apt-get install djbdns

docbook-xsl:
	$ apt-get install docbook-xsl

libxml2-dev:
	$ apt-get install libxml2-dev

g++:
	$ apt-get install g++

xmlto:
	$ apt-get install xmlto

gawk:
	$ apt-get install gawk

curl:
	$ apt-get install libcurl4-openssl-dev curl libcurl4-gnutls-dev
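
After installing the packages, a quick check can confirm that the command-line
tools are on the PATH. The Python sketch below is only a convenience and is
not part of WIRE-Nic. The list of executable names is an assumption based on
the packages above, and libraries and modules (adns, libxml2-dev, libcurl,
docbook-xsl, XML::LibXML) are not checked.

```python
import shutil

# Executable names assumed to be provided by the packages listed above.
REQUIRED_TOOLS = ["gcc", "g++", "perl", "gawk", "gnuplot",
                  "curl", "xmlto", "swish-e", "latex"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from the list that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling missing_tools() returns an empty list when every tool was found;
otherwise it lists the names that still need to be installed.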


---> WIRE <---

	$ tar -xzf WIRE-Nic-1.x.tar.gz
	$ cd WIRE-Nic
	$ ./configure
	$ make
	$ make install

	commands to increase WIRE's thread limit (they raise the maximum
	number of open file descriptors):
	$ ulimit -n 32000
	$ echo "* soft nofile 32000" >> /etc/security/limits.conf
	$ echo "* hard nofile 32000" >> /etc/security/limits.conf
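
To confirm that the new limit is in effect for the current session, the
open-file limit can be read back. The Python sketch below uses the standard
resource module (Unix only). The function names are hypothetical, and the
default of 32000 simply mirrors the value used in the commands above.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limit_is_sufficient(required=32000):
    """Report whether the soft open-file limit is at least `required`."""
    soft, _hard = nofile_limits()
    return soft == resource.RLIM_INFINITY or soft >= required
```

If limit_is_sufficient() returns False in a new session, the changes to
/etc/security/limits.conf have probably not been picked up yet.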


Source: README.txt, updated 2011-06-20