Home


Introduction

The Strigil project was started at the Faculty of Mathematics and Physics of Charles University in Prague with the purpose of creating an easily extendable and usable web scraping tool that enables one to retrieve data from textual or weakly structured documents, e.g. HTML, spreadsheet documents, etc.

Additionally, we propose a scraping language inspired by XSL transformations and designed to extract data from different kinds of documents. The scraping language is designed to work with an ontology so that scraped data can be mapped directly to classes and attributes.

The output of the scraper is a set of RDF triples, which can then be inserted into a triple store. The tool is based on Java, so it is easily portable to all widely used platforms.
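To give an idea of what this output looks like, the following sketch builds a few RDF triples in Java and serializes them as Turtle. It uses Apache Jena as an example RDF library; the namespace, class and property names (Product, hasName, hasPrice) are purely illustrative assumptions and do not come from the project's actual ontology or code.

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Property;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.vocabulary.RDF;

    public class RdfOutputSketch {
        public static void main(String[] args) {
            // Illustrative namespace; real IRIs would come from the loaded ontology.
            String ns = "http://example.org/ontology#";

            Model model = ModelFactory.createDefaultModel();
            model.setNsPrefix("ex", ns);

            // Suppose the scraper extracted a product name and price from an HTML page.
            Resource product = model.createResource("http://example.org/data/product/42");
            Property hasName  = model.createProperty(ns, "hasName");
            Property hasPrice = model.createProperty(ns, "hasPrice");

            product.addProperty(RDF.type, model.createResource(ns + "Product"));
            product.addProperty(hasName, "Example product");
            product.addProperty(hasPrice, model.createTypedLiteral(19.99));

            // The serialized triples can then be inserted into a triple store.
            model.write(System.out, "TURTLE");
        }
    }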

Team

The project leader is Mgr. Jakub Stárka, Ph.D. Team members:

  • Peter Zvirinsky
  • Tomáš Pošepný
  • Rastislav Kadleček
  • Jonáš Klimeš
  • Martin Major

Download

You can download the complete Strigil package or get the most recent sources from SVN.


Documentation


Framework Architecture

Main Requirements

When we started the project, we wanted to create a specialized web crawler with RDF output that could be connected to a database for storing all the downloaded data. The main requirements for this system were:

  • open
  • pluggable and easily extendable
  • cross-platform
  • distributed
  • scalable

Strigil Web Crawler

The whole system is divided into two logical parts: the Data Application (DA) and the Download System (DS). The Data Application is responsible for the data processing, creates the RDF output, and contains all major data structures (UST, Frontier). The Download System, on the other hand, is responsible for the whole downloading process, which means distributing the downloads evenly over the source servers while respecting the politeness rules.

The DA produces prioritized download requests which are sent to the DS. The DS then tries to complete all the requests with as much respect to the priorities as possible. After a successful download, the files are sent to the DA to be processed.
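The exact shape of a download request is not specified here, but a minimal sketch of what the DA might hand to the DS could look as follows; the fields and the priority convention (higher value = more urgent) are assumptions made purely for illustration.

    import java.net.URI;
    import java.time.Instant;
    import java.util.Comparator;
    import java.util.PriorityQueue;

    // Hypothetical shape of a prioritized download request exchanged between DA and DS.
    final class DownloadRequest {
        final URI url;          // address of the document to download
        final int priority;     // higher value = more urgent (assumed convention)
        final Instant created;  // used to break ties between equal priorities

        DownloadRequest(URI url, int priority) {
            this.url = url;
            this.priority = priority;
            this.created = Instant.now();
        }
    }

    class RequestBufferDemo {
        public static void main(String[] args) {
            // The DS side can keep arriving requests ordered by priority, oldest first on ties.
            Comparator<DownloadRequest> byPriority =
                    Comparator.comparingInt((DownloadRequest r) -> -r.priority)
                              .thenComparing((DownloadRequest r) -> r.created);
            PriorityQueue<DownloadRequest> buffer = new PriorityQueue<>(byPriority);

            buffer.add(new DownloadRequest(URI.create("http://example.org/page1.html"), 1));
            buffer.add(new DownloadRequest(URI.create("http://example.org/page2.html"), 5));

            System.out.println(buffer.poll().url); // page2.html is served first
        }
    }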

Data Application

The Data Application can again be split into two major parts: the data processing part mentioned above, and the control part.

The control part is formed by a Web Application, which provides a simple and easy-to-use control interface, so the whole system can be managed via a web browser. This is the place where scraping scripts are created, uploaded, and managed. The Web Application is connected to resources such as the Ontology Repository, which are necessary for creating the scraping scripts.

All the data processing and the usual crawler data structures (UST, Frontier) are included in the Scraper Engine component. Each running script is represented by a Scraping Activity, which is a process managed by the Scraper Application. The Scraper Application is responsible for starting each scraping script, managing its run, and putting it to sleep when it is done or when it is waiting for data provided by the Download System. All download requests created by a Scraping Activity are passed to the URQ Controller, where they are prioritized and afterwards sent to the Download System to be downloaded.
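The lifecycle of a Scraping Activity described above can be pictured roughly as a small state machine; the state names in the sketch below are illustrative guesses, not the component's actual API.

    // Rough sketch of the lifecycle a Scraping Activity might go through while it is
    // managed by the Scraper Application. State names are illustrative assumptions.
    enum ActivityState {
        CREATED,    // script uploaded, activity not yet started
        RUNNING,    // actively processing downloaded documents
        SLEEPING,   // waiting for data requested from the Download System
        FINISHED    // script has processed all of its inputs
    }

    class ScrapingActivitySketch {
        private ActivityState state = ActivityState.CREATED;

        void start()            { state = ActivityState.RUNNING; }
        void awaitDownloads()   { state = ActivityState.SLEEPING; } // put to sleep until the DS delivers files
        void onFilesDelivered() { state = ActivityState.RUNNING; }  // woken up by the Scraper Application
        void finish()           { state = ActivityState.FINISHED; }

        ActivityState state()   { return state; }
    }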

Download System

The DS consists of four major components: the Download Manager, the Downloader, the DNS Resolver, and the Proxy. Every component can run on a different machine and can be replicated to increase download performance.

Download Manager

The Download Manager is the "brain" of the whole Download System. It is the entry point where the prioritized download requests from the Data Application arrive. All these requests are stored in the Download Requests Buffer and then processed by the Scheduler, which is responsible for abiding by all the politeness rules and download source restrictions. The Scheduler continuously distributes the download requests over the available Downloaders: every download request is passed to one particular Downloader and gets assigned one Proxy through which the download is redirected. After a successful download, the Data Application receives a notification about which files were downloaded and where they can be found.
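The sketch below illustrates the kind of per-source politeness check the Scheduler has to perform before dispatching a request; the single fixed delay per host is an assumed simplification of the real politeness rules and source restrictions.

    import java.net.URI;
    import java.time.Duration;
    import java.time.Instant;
    import java.util.HashMap;
    import java.util.Map;

    // Minimal politeness check: allow at most one request per host within a fixed
    // delay window. The delay value is an assumed example, not the Download
    // Manager's real configuration.
    class PolitenessScheduler {
        private final Duration minDelayPerHost;
        private final Map<String, Instant> lastRequestPerHost = new HashMap<>();

        PolitenessScheduler(Duration minDelayPerHost) {
            this.minDelayPerHost = minDelayPerHost;
        }

        /** Returns true if the request may be dispatched to a Downloader now. */
        synchronized boolean tryAcquire(URI url) {
            String host = url.getHost();
            Instant now = Instant.now();
            Instant last = lastRequestPerHost.get(host);
            if (last != null && Duration.between(last, now).compareTo(minDelayPerHost) < 0) {
                return false; // too soon; the request stays buffered or is returned to the DA
            }
            lastRequestPerHost.put(host, now);
            return true;
        }
    }

A request that keeps failing such a check for too long is exactly the kind of request described in the next paragraph: it is handed back to the Data Application to be re-prioritized.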

Of course, there is always a possibility that a download request cannot be executed for some time, e.g. because the number of downloads would exceed the number allowed for a given source and would result in the Proxy IP address being blocked. These requests are returned to the Data Application, which must re-prioritize them, taking into account the time for which they cannot be downloaded.

Downloader

Each download request from the Scheduler is sent to one particular Downloader. These requests are stored in the Source Queues component, where they wait to be executed. Every download request has one Proxy assigned, through which the download will be redirected.

Every Downloader needs to maintain open connections to all accessible Proxies; this is handled by the Connection Controller component. It receives a request for a connection to a particular Proxy, which is then provided to the Downloader.
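As an illustration, a Connection Controller could keep one HTTP client per Proxy, for example with the standard java.net.http client shown below; the proxy addresses are placeholders and the real component may manage its connections quite differently, so this is only a sketch.

    import java.net.InetSocketAddress;
    import java.net.ProxySelector;
    import java.net.http.HttpClient;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Sketch of a Connection Controller that keeps one HTTP client per Proxy,
    // so a Downloader can route each request through the Proxy it was assigned.
    class ConnectionController {
        private final ConcurrentMap<InetSocketAddress, HttpClient> clients = new ConcurrentHashMap<>();

        /** Returns a client whose traffic is redirected through the given Proxy. */
        HttpClient connectionFor(InetSocketAddress proxyAddress) {
            return clients.computeIfAbsent(proxyAddress, addr ->
                    HttpClient.newBuilder()
                              .proxy(ProxySelector.of(addr))
                              .build());
        }
    }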

Proxy

Proxy servers are used for a couple of reasons:

  • for security reasons, to hide all the other parts of the system from the internet
  • to provide higher throughput for one download source
  • to easily redirect downloads between Proxies if one gets blocked

DNS Resolver

The last component is the DNS Resolver. It is a very simple component that caches DNS translations and thereby minimizes the number of DNS requests. Before a download request is sent to a Downloader, the domain name in the address of the downloaded file is translated to an IP address.
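A caching resolver of this kind can be sketched in a few lines; the sketch below ignores TTLs and cache expiry, which a production resolver would also have to handle.

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Sketch of a caching DNS resolver: each host name is translated at most once
    // and the result is reused for subsequent download requests.
    class CachingDnsResolver {
        private final ConcurrentMap<String, InetAddress> cache = new ConcurrentHashMap<>();

        InetAddress resolve(String hostName) throws UnknownHostException {
            InetAddress cached = cache.get(hostName);
            if (cached != null) {
                return cached;
            }
            InetAddress resolved = InetAddress.getByName(hostName); // one real DNS lookup
            cache.put(hostName, resolved);
            return resolved;
        }
    }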

Deployment

The deployment model shows again that every component mentioned above can run on a different machine and can be replicated, which provides a high level of scalability. The throughput for one source can be increased by adding new Proxy servers to the system, the overall download performance can be increased by adding more Downloaders, and when all of this is not enough, even the number of Download Managers can be increased.


Publications