Name | Modified | Size | Downloads / Week |
---|---|---|---|
sample-domains.txt | 2011-06-20 | 17 Bytes | |
README.txt | 2011-06-20 | 4.1 kB | |
sample.conf | 2011-06-20 | 33.5 kB | |
Totals: 3 Items | | 37.7 kB | 0 |

///////////////-------------------------------------------------\\\\\\\\\\\\\\\\
<<<<<<<<<<<<<< WIRE-Nic >>>>>>>>>>>>>>>
\\\\\\\\\\\\\\\-------------------------------------------------////////////////

WIRE-Nic is a fork of the open source project WIRE (Web Information Retrieval
Environment), which was developed by the Center for Web Research at the
University of Chile. More information about it can be found at
http://www.cwr.cl/projects/WIRE/

This fork brings some bug fixes and modifications to the original system.

<<<<<<<<<<<<<< Modifications >>>>>>>>>>>>>>>

The main modifications are:

1 - The way it handles redirects.
    In the original project, all redirects were handled like any other link.
    In this fork, redirected pages are downloaded even if they don't belong to
    the domain list configured for the execution.

2 - The way the pages are stored.
    In the original project, parsed pages were stored in big files indexed by
    the software. In this fork, the pages are also stored directly on the file
    system, in a hierarchical folder that represents the structure of the
    sites. The folder is called "sites" and is placed in the execution
    workspace folder. This modification added the variable
    "config/collection/sites-per-folder" to the configuration file; it sets
    the maximum number of site folders that can be placed in a single folder
    of the hierarchical structure.

3 - The way the list of domains to download is read.
    In the original project, the list of accepted domains was placed directly
    in the variable "seeder/accept/domain-suffixes" in the configuration file.
    In this fork, the list is placed in an external file whose path is
    configured in that same variable.

4 - Added support for HTTP 1.1 chunked transfer encoding.

5 - In this fork, the character encoding declared in the HTML file is read.

6 - In this fork, all links are normalized following the RFC 3986 rules
    before being registered.

7 - The exported files were modified.
    In this fork, the page name and the site name are included as fields in
    the exported CSV files.

8 - The way the www prefix is handled.
    In the original project, if a site URL was found both with and without the
    www prefix, only the first form found was registered. In this fork, the
    URL is tested before being registered and, if one of the forms isn't
    accessible, the other form is included.

<<<<<<<<<<<<<< Installation Guide >>>>>>>>>>>>>>>

This installation guide is based on the original documentation of the WIRE
project, which can be found at:
http://www.cwr.cl/projects/WIRE/doc/

However, their configuration file must be replaced by the one provided by
this project.

Below is a simple installation guide that was tested on an Ubuntu Server
10.04 LTS machine:

---> REQUIREMENTS <---

gcc:
    $ apt-get install gcc

adns:
    download the latest version from:
    http://www.chiark.greenend.org.uk/~ian/adns/
    installation:
    $ tar -xzf adns.tar.gz
    $ cd adns-1.2
    $ ./configure
    $ make
    $ make install

xml2:
    $ apt-get install xml2

swish-e:
    $ apt-get install swish-e

LaTeX:
    $ apt-get install texlive-latex-base texlive-latex-extra

gnuplot:
    $ apt-get install gnuplot

Perl5:
    $ apt-get install perl
    install the XML::LibXML Perl module:
    $ cpan
    if asked about configuration, just confirm the automatic configuration
    in the cpan shell:
    > install XML::LibXML
    > quit

djbdns:
    $ apt-get install djbdns

docbook-xsl:
    $ apt-get install docbook-xsl

libxml2-devel:
    $ apt-get install libxml2-dev

g++:
    $ apt-get install g++

xmlto:
    $ apt-get install xmlto

gawk:
    $ apt-get install gawk

curl:
    $ apt-get install libcurl4-openssl-dev curl libcurl4-gnutls-dev

---> WIRE <---

    $ tar -xzf WIRE-Nic-1.x.tar.gz
    $ cd WIRE-Nic
    $ ./configure
    $ make
    $ make install

Commands to raise WIRE's thread limit (they raise the open-file limit,
"nofile"):
    $ ulimit -n 32000
    $ echo "* soft nofile 32000" >> /etc/security/limits.conf
    $ echo "* hard nofile 32000" >> /etc/security/limits.conf
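
Modification 6 says that links are normalized following RFC 3986 before being registered. The Python sketch below illustrates what that kind of normalization usually involves: lowercasing the scheme and host, dropping the default port, fixing percent-encoding and removing dot segments. It is not WIRE-Nic's own code; the function names are hypothetical, dropping the fragment is an assumption about crawler behaviour, and userinfo and IPv6 literals are ignored for brevity.

```python
import re
import string
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
UNRESERVED = string.ascii_letters + string.digits + "-._~"


def fix_percent_encoding(text):
    """Decode escapes of unreserved characters; uppercase the other escapes."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else "%" + match.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, text)


def remove_dot_segments(path):
    """Resolve "." and ".." segments of an absolute path."""
    segments = path.split("/")
    output = []
    for index, segment in enumerate(segments):
        if segment in (".", ".."):
            if segment == ".." and len(output) > 1:
                output.pop()
            if index == len(segments) - 1:
                output.append("")  # keep the trailing slash
        else:
            output.append(segment)
    return "/".join(output)


def normalize_url(url):
    """Return a normalized form of an http(s) URL (illustrative sketch)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = (parts.hostname or "").lower()
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{netloc}:{parts.port}"
    path = remove_dot_segments(fix_percent_encoding(parts.path)) or "/"
    query = fix_percent_encoding(parts.query)
    # Assumption: the fragment is dropped, since it never reaches the server.
    return urlunsplit((scheme, netloc, path, query, ""))
```

For example, normalize_url("HTTP://WWW.Example.COM:80/a/b/../c/./d.html?x=%7efoo#frag") returns "http://www.example.com/a/c/d.html?x=~foo", so two spellings of the same link are registered only once.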
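
Modification 4 adds support for HTTP 1.1 chunked transfer encoding. As a rough illustration of what a client has to do with such a response, the Python sketch below reassembles a chunked body that has already been read in full. It is not taken from the WIRE-Nic sources; the function name is hypothetical, and chunk extensions and trailer headers are simply ignored.

```python
def decode_chunked(body: bytes) -> bytes:
    """Reassemble an HTTP/1.1 chunked body held entirely in memory."""
    decoded = bytearray()
    position = 0
    while True:
        # Each chunk starts with a hexadecimal size line ending in CRLF.
        line_end = body.index(b"\r\n", position)
        size_field = body[position:line_end].split(b";", 1)[0].strip()
        chunk_size = int(size_field.decode("ascii"), 16)
        position = line_end + 2
        if chunk_size == 0:
            break  # last chunk; any trailer headers are ignored here
        decoded += body[position:position + chunk_size]
        position += chunk_size + 2  # skip the data and its trailing CRLF
    return bytes(decoded)
```

Calling decode_chunked(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n") returns b"Wikipedia".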
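
Modification 8 describes testing a site URL before registering it and falling back to the other www form when one is not accessible. One possible reading of that rule is sketched below in Python. The function name and the is_accessible callback are hypothetical; how WIRE-Nic actually probes a host is not described in this README, so the check is passed in by the caller.

```python
def choose_site_form(host, is_accessible):
    """Return the host form to register, or None if neither form responds.

    is_accessible is a caller-supplied function (hypothetical) that takes a
    host name and returns True when the site answers.
    """
    if host.startswith("www."):
        alternate = host[len("www."):]
    else:
        alternate = "www." + host
    # Prefer the form that was found first; fall back to the other one.
    for candidate in (host, alternate):
        if is_accessible(candidate):
            return candidate
    return None
```

With a callback that only accepts "example.com", choose_site_form("www.example.com", ...) returns "example.com" instead of registering the unreachable www form.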