///////////////-------------------------------------------------\\\\\\\\\\\\\\\\
<<<<<<<<<<<<<< WIRE-Nic >>>>>>>>>>>>>>>
\\\\\\\\\\\\\\\-------------------------------------------------////////////////
WIRE-Nic is a fork of the open source project WIRE (Web Information Retrieval
Environment), which was developed by the Center for Web Research at the
University of Chile. More information about it can be found at
http://www.cwr.cl/projects/WIRE/
This fork brings some bug fixes and modifications to the original system.
<<<<<<<<<<<<<< Modifications >>>>>>>>>>>>>>>
The main modifications that were made are:
1 - The way it handles redirects. In the original project all redirects
were handled like any other link. In this fork, the redirected pages are
downloaded even if they don't belong to the domain list configured
for the execution.
2 - The way the pages are stored. In the original project parsed pages
were stored in big files indexed by the software. In this fork the pages
are also stored directly on the file system, in a hierarchical folder that
represents the structure of the sites. The folder is called "sites" and
is placed in the execution workspace folder.
This modification added the variable "config/collection/sites-per-folder"
to the configuration file. It sets the maximum number of site folders
that can be placed in the same folder of the hierarchical structure.
3 - The way the list of domains to download is read. In the original
project the list of accepted domains was placed directly in the
variable "seeder/accept/domain-suffixes" in the configuration file. In
this fork, the list is placed in an external file whose path is
configured in this same variable.
4 - Added support for HTTP/1.1 chunked transfer encoding.
5 - In this fork the character encoding declared in the HTML file is loaded.
6 - In this fork all links are normalized following the RFC 3986 rules
before being registered.
7 - The exported files were modified. In this fork the names of the page
and site are included as fields in the exported CSV files.
8 - The way the www prefix is handled. In the original project, if a site
URL was found both with and without the www prefix, only the first form
found was registered. In this fork, the URL is tested before being
registered and, if one of the forms isn't accessible, the other form is
included.
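Modification 2 does not spell out the exact layout of the "sites" folder, so
the following is only a sketch of one way a sites-per-folder limit could be
honored. The function name, the numeric site index and the zero-padded bucket
names are assumptions made for illustration, not the fork's actual scheme.

```python
from pathlib import PurePosixPath


def site_folder(site_index: int, site_name: str, sites_per_folder: int,
                root: str = "sites") -> PurePosixPath:
    """Return a hypothetical folder for a site inside the workspace.

    Sites are grouped into numbered buckets so that no bucket holds more
    than `sites_per_folder` site folders.
    """
    if sites_per_folder < 1:
        raise ValueError("sites_per_folder must be at least 1")
    if site_index < 0:
        raise ValueError("site_index must not be negative")
    # Sites 0..N-1 go to bucket 0, sites N..2N-1 to bucket 1, and so on.
    bucket = site_index // sites_per_folder
    return PurePosixPath(root) / f"{bucket:06d}" / site_name
```

With a limit of 100, site number 250 would land in bucket 2, that is, in
sites/000002/ followed by the site name.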
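To illustrate what modification 4 involves, here is a minimal sketch of
decoding a chunked HTTP/1.1 message body that has already been read into
memory. It is not the fork's code; the function name is an assumption, and
trailer fields after the last chunk are simply ignored.

```python
def decode_chunked(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body into the raw payload bytes."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.find(b"\r\n", pos)
        if line_end == -1:
            raise ValueError("chunk-size line is not terminated by CRLF")
        # The size is hexadecimal; anything after ';' is a chunk extension.
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        if size < 0:
            raise ValueError("negative chunk size")
        pos = line_end + 2
        if size == 0:
            break  # last chunk; optional trailer fields are ignored here
        chunk = body[pos:pos + size]
        if len(chunk) != size:
            raise ValueError("chunk is shorter than its declared size")
        out += chunk
        pos += size
        if body[pos:pos + 2] != b"\r\n":
            raise ValueError("chunk data is not followed by CRLF")
        pos += 2
    return bytes(out)
```

For example, the body b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n" decodes to
b"Wikipedia".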
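The normalization mentioned in modification 6 can be sketched as follows.
This covers the main syntax-based steps from RFC 3986 (lowercase scheme and
host, uppercase percent-escapes, decoding of escaped unreserved characters,
dot-segment removal, default port and empty path handling). The function
names are assumptions, userinfo and IPv6 literals are not handled, and the
fork may apply a different subset of rules.

```python
import re
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
UNRESERVED = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              "abcdefghijklmnopqrstuvwxyz0123456789-._~")


def _fix_escape(match):
    # Decode escapes of unreserved characters, uppercase all other escapes.
    char = chr(int(match.group(1), 16))
    return char if char in UNRESERVED else "%" + match.group(1).upper()


def normalize_escapes(text: str) -> str:
    return re.sub(r"%([0-9A-Fa-f]{2})", _fix_escape, text)


def remove_dot_segments(path: str) -> str:
    """Resolve "." and ".." segments of an absolute path."""
    segments = path.split("/")
    out = []
    for seg in segments:
        if seg == "..":
            if len(out) > 1:  # never pop the leading empty segment
                out.pop()
        elif seg != ".":
            out.append(seg)
    if segments[-1] in (".", ".."):
        out.append("")  # keep the trailing slash
    return "/".join(out)


def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""  # urlsplit already lowercases the host
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    path = remove_dot_segments(normalize_escapes(parts.path)) or "/"
    query = normalize_escapes(parts.query)
    return urlunsplit((scheme, netloc, path, query, parts.fragment))
```

With these rules, "HTTP://WWW.Example.COM:80/a/./b/../c/%7Euser?q=%3f"
becomes "http://www.example.com/a/c/~user?q=%3F", so both spellings would be
registered as the same link.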
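The www handling in modification 8 could look roughly like the sketch below.
The function names and the injected is_accessible check are assumptions (in
practice the check might be an HTTP request to the site), and what happens
when neither form answers is not stated above, so the fallback here is a
guess.

```python
from typing import Callable


def toggle_www(host: str) -> str:
    """Return the other form of a host name: with or without "www."."""
    return host[4:] if host.startswith("www.") else "www." + host


def form_to_register(host: str, is_accessible: Callable[[str], bool]) -> str:
    """Pick which form of a site name to register.

    `is_accessible` is a caller-supplied test, kept as a parameter so the
    sketch does not depend on any particular network library.
    """
    if is_accessible(host):
        return host
    alternative = toggle_www(host)
    if is_accessible(alternative):
        return alternative
    # Assumption: if neither form responds, keep the form that was found.
    return host
```

For instance, if only "www.example.cl" responds, a link found as
"example.cl" would be registered as "www.example.cl".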
<<<<<<<<<<<<<< Installation Guide >>>>>>>>>>>>>>>
This installation guide is based on the original documentation of the WIRE
project, which can be found at: http://www.cwr.cl/projects/WIRE/doc/
However, the original configuration file must be replaced by the one provided
by this project.
Below is a simple installation guide that was tested on an
Ubuntu Server 10.04 LTS machine:
---> REQUIREMENTS <---
gcc:
$ apt-get install gcc
adns:
download the latest version from:
http://www.chiark.greenend.org.uk/~ian/adns/
installation:
$ tar -xzf adns.tar.gz
$ cd adns-1.2
$ ./configure
$ make
$ make install
xml2:
$ apt-get install xml2
swish-e:
$ apt-get install swish-e
LaTeX:
$ apt-get install texlive-latex-base texlive-latex-extra
gnuplot:
$ apt-get install gnuplot
Perl5:
$ apt-get install perl
install the XML::LibXML Perl module:
$ cpan
if asked about configuration, just accept the automatic configuration
at the cpan prompt:
> install XML::LibXML
> quit
djbdns:
$ apt-get install djbdns
docbook-xsl:
$ apt-get install docbook-xsl
libxml2-devel:
$ apt-get install libxml2-dev
g++:
$ apt-get install g++
xmlto:
$ apt-get install xmlto
gawk:
$ apt-get install gawk
curl:
$ apt-get install libcurl4-openssl-dev curl libcurl4-gnutls-dev
---> WIRE <---
$ tar -xzf WIRE-Nic-1.x.tar.gz
$ cd WIRE-Nic
$ ./configure
$ make
$ make install
commands to raise the WIRE thread limit (the limit on open file descriptors):
$ ulimit -n 32000
$ echo "* soft nofile 32000" >> /etc/security/limits.conf
$ echo "* hard nofile 32000" >> /etc/security/limits.conf
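After applying the commands above, the limit seen by a new process can be
checked. The sketch below uses Python's standard resource module, which is
available on Unix systems only. The function names and the default of 32000,
taken from the commands above, are illustrative assumptions.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limit_is_sufficient(required: int = 32000) -> bool:
    """Check whether the soft limit allows at least `required` descriptors."""
    soft, _hard = nofile_limits()
    return soft == resource.RLIM_INFINITY or soft >= required


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("soft limit:", soft, "hard limit:", hard)
    print("enough for WIRE:", limit_is_sufficient())
```

Note that the limits.conf entries normally take effect only for sessions
started after the change, so the check is best run from a fresh login.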