
What is WebSource ?

WebSource is a Perl module that abstracts application access to data available on the Web through a set of dynamically generated web pages. In practice this means posting a query to an HTTP server, fetching the resulting set of pages, and extracting the relevant pieces of information.

It defines a set of generic operators which, when correctly parameterized and combined, define a specific web extraction task. The task can be described in an XML language called WetDL (Web Extraction Task Description Language).

Given a WetDL description (a wsd file), the Perl script ws-query uses the WebSource library to execute tasks such as querying Google, extracting country information from the CIA World Factbook, etc.

How does it work ?

There are different elements of an extraction process which are specific to the task :

  • the parameters of the query and the access method to use
  • the part of a result page pointing to the next result page
  • the parts of the page containing the actual results
  • the structure of the results
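
As a rough illustration of how these elements fit together, the following Perl sketch simulates the fetch, extract and follow-the-next-link cycle on an in-memory set of pages. It does not use WebSource or the network: the %pages table, its "results" and "next" keys, and the extract_all function are all hypothetical names made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical in-memory "site": each page holds some results and,
# optionally, the URL of the next result page. A real run would fetch
# these pages over HTTP instead.
my %pages = (
    '/search?q=perl'     => { results => ['r1', 'r2'], next => '/search?q=perl&p=2' },
    '/search?q=perl&p=2' => { results => ['r3'],       next => '/search?q=perl&p=3' },
    '/search?q=perl&p=3' => { results => ['r4', 'r5'] },
);

# Start from the first page and follow "next" links, collecting results,
# until a page has no next link, the link leads nowhere, or it loops back.
sub extract_all {
    my ($site, $url) = @_;
    my (@results, %seen);
    while (defined $url && !$seen{$url}++) {
        my $page = $site->{$url} or last;       # "fetch" the page
        push @results, @{ $page->{results} };   # "extract" the results
        $url = $page->{next};                   # "extract" the next link
    }
    return @results;
}

print join(", ", extract_all(\%pages, '/search?q=perl')), "\n";
# prints: r1, r2, r3, r4, r5
```

The %seen guard is an addition of this sketch, so that a page whose next link points back to an earlier page cannot cause an endless loop.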

WebSource takes a description of these elements and does the rest. Given such a description and a query to the source (with the source-specific parameters), it sends the query to the HTTP server, retrieves the first result page, extracts the results and the link to the next result page, fetches that page, extracts the new results and the new link, and so on until it cannot find a link to another page of results.

Where can I get it ?

It is available either from the source repository or from the project page on SourceForge.

New releases are made available from the project page : http://sourceforge.net/projects/WebSource/

Different parts of WebSource are also available via svn on SourceForge. To check out a given module <module>, type the following :

svn co https://WebSource.svn.sourceforge.net/svnroot/WebSource/trunk/<module> WebSource

The available modules are the following :

  • main : contains the WebSource Perl modules and scripts
  • www : the WebSource.sourceforge.net web site pages (i.e. this site)
  • desc : a set of example description files

How do I install the module ?

Get and decompress the archive, then run the usual :

perl Makefile.PL
make
make install

You can add a make test between the last two lines, but for the moment it only does a 'use' test.

How do I use a WebSource description ?

First you need to make such a description accessible from your system. The -s option of ws-query can take either a local filename or a URL.

Then you need to see which parameters the description takes (take a look at the <ws:options> element in the wsd file). You can then execute the task with your parameters using ws-query.

Where can I find wsd files ?

Some example wsd files are available from the WebSource source repository, in the desc module (see the "Where can I get it ?" section).

How do I make a Perl script that uses the library ?

The easiest way to get started is to type in the following example Perl script, which uses the module to query Google (by means of the google.wsd file) :

#!/usr/bin/perl
use WebSource;
use strict;
use warnings;

# Give a WebSource description URI
my $wsd = "/path/to/example.wsd";

# Build a WebSource instance
my $ws = WebSource->new(wsd => $wsd);

# Set up your query
my $query = join("+", @ARGV);

$ws->set_query(q => $query);
# Note : the format of the query is source dependent (for Google the
# only element we want to be user-defined is the actual Google query "q"
# element)

# Retrieve the results and print them out
while (my $res = $ws->next_result()) {
   print $res, "\n";
}

# Note : all of the results could have been fetched into an array by
# doing
#    @res = $ws->query(q => $query)
# but this would mean waiting for all the result pages to be downloaded
# and all the results to be extracted.
# set_query and next_result use a lazy method, so a page is downloaded
# and its results extracted only when needed.
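
The difference between the lazy and the eager approach can be illustrated without WebSource at all. The sketch below is only an analogy: the @pages data, the $fetched counter and the make_lazy_results function are hypothetical, and the closure merely imitates the behaviour described for next_result by loading a page only once the previous one has been used up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the network: three result pages, two results each.
my @pages = (['a1', 'a2'], ['b1', 'b2'], ['c1', 'c2']);
my $fetched = 0;    # how many pages have been "downloaded" so far

# Return a closure that hands out one result per call and "fetches"
# the next page only once the current one is exhausted.
sub make_lazy_results {
    my @remaining = @_;
    my @buffer;
    return sub {
        while (!@buffer && @remaining) {
            @buffer = @{ shift @remaining };
            $fetched++;
        }
        return shift @buffer;    # undef once everything is consumed
    };
}

my $next_result = make_lazy_results(@pages);
my $first = $next_result->();
print "got $first after fetching $fetched page(s)\n";
# prints: got a1 after fetching 1 page(s)
```

An eager version would load all three pages before returning the first result, which is the waiting time the note above refers to.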
      

Where do I get more information ?

For more information you can consult the man pages of the project, starting with the WebSource man page. Each type of operator is implemented as a Perl module with its own man page. For example, the fetch operator is implemented by the WebSource::Fetcher module, whose man page can be accessed via man WebSource::Fetcher. For ongoing news about the evolution of the project, a SourceForge mailing list named WebSource-info is available (see the WebSource mailing list page : http://sourceforge.net/mail/?group_id=126716 ).

Who initiated the project ?

The project was initiated by Benjamin Habegger during his PhD in Computer Science at the University of Nantes, in the LINA CS Lab. More information about Benjamin can be found at http://www.habegger.fr/.

How can I contribute ?

You can contribute to this small project by sending me ideas, code, and new description files by email.

Contact Information

You may contact me at the following email address : habegger@…