
Asynchronous scraping: What do you think?

Alex Wajda
2010-12-30
2012-09-04
  • Alex Wajda

    Alex Wajda - 2010-12-30

    There are two common ways to crawl paginated web sites: either by navigating
    them page by page via the 'Next' button, or by calculating the total page
    count and requesting every page in a loop, modifying the URL each time by
    increasing the page number in it from 1 to the end. The latter approach is
    not only more robust, it also technically allows you to crawl the site
    asynchronously using several threads in parallel. Unfortunately, WH does not
    support concurrency.
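
    To illustrate the point outside of WH: once pages are addressed by number,
    the fetches are independent and can be pooled. A minimal Java sketch (the
    URL, page count, and pluggable fetcher are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.function.IntFunction;

public class ParallelPager {
    // Build the URL for a given page number (this base URL is hypothetical).
    static String pageUrl(int page) {
        return "http://some.site.com/some-cool-stuff/page=" + page;
    }

    // Fetch pages 1..totalPages using a fixed-size thread pool.
    // The fetcher is injected so the download mechanism stays pluggable.
    static List<String> fetchAll(int totalPages, int poolSize,
                                 IntFunction<String> fetcher)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int p = 1; p <= totalPages; p++) {
                final int page = p;
                futures.add(pool.submit(() -> fetcher.apply(page)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // blocks; preserves page order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

    The only shared state here is the URL template, which is exactly why the
    page-number approach parallelizes while the 'Next button' approach does not.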

    So, here's an idea - we can create a dedicated WH processor (e.g.
    <asynch-exec> or something) which would execute the enclosed commands in a
    separate thread. Surely we would have to have a purely immutable scraper
    object model and to clone the dynamic context before running a new thread.
    All of this requires some work, but it seems pretty doable to me.

    Just want to hear your opinions on whether this kind of processor would be
    useful (well, to me it is for sure) and what I should keep in mind while
    implementing it. Any heads-up?

    Suggested example of usage:

    <loop item="page_num">
        <list>...</list>
        <body>
    
            <asynch-exec thread-pool-size="10">
    
                <file action="write" path="${page_num}.xml">
                    <html-to-xml>
                <http url="http://some.site.com/some-cool-stuff/page=${page_num}"/>
                    </html-to-xml>
                </file>
    
            </asynch-exec>
    
        </body>
    </loop>
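
    Internally, such a processor could boil down to something like the sketch
    below - not WH code; the Processor interface and the map-based context are
    simplified stand-ins for the real WH classes, but they show the two points
    above: immutable processors plus a per-task clone of the dynamic context.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;

public class AsynchExec {
    // Hypothetical processor: takes a (cloned) context, returns a result.
    interface Processor {
        Object execute(Map<String, Object> context);
    }

    // Stand-in for WH's dynamic scraper context (just a map here).
    // Snapshotting it per task is what makes parallel execution safe,
    // provided the processors themselves are immutable.
    static Map<String, Object> cloneContext(Map<String, Object> ctx) {
        return new HashMap<>(ctx);
    }

    private final ExecutorService pool;
    private final List<Future<Object>> pending = new ArrayList<>();

    AsynchExec(int threadPoolSize) {
        this.pool = Executors.newFixedThreadPool(threadPoolSize);
    }

    // Run the enclosed body on a snapshot of the current context,
    // so later loop iterations can't mutate it under the task's feet.
    void submit(Processor body, Map<String, Object> context) {
        Map<String, Object> snapshot = cloneContext(context);
        pending.add(pool.submit(() -> body.execute(snapshot)));
    }

    // Wait for every submitted body to finish (e.g. at the end of the loop).
    List<Object> awaitAll() throws InterruptedException, ExecutionException {
        List<Object> results = new ArrayList<>();
        for (Future<Object> f : pending) {
            results.add(f.get());
        }
        pool.shutdown();
        return results;
    }
}
```

    The awaitAll() step matters: without an explicit join point, the <loop>
    would finish while file writes are still in flight.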
    
     
  • newbee

    newbee - 2010-12-30

    I think it is a good idea, but I'm not sure if it's worth it... is there an
    efficiency constraint that requires modifying the WH model? I personally
    have written close to 100 WH scripts, and some of them dynamically request
    other pages embedded in the initial page. Generally, if one needs to request
    multiple sub-links from the initial page, time is not an issue and the
    "loop" method should work just fine.

    Accomplishing what you have in mind will require significant changes to WH,
    as it is (or was) not thread-safe. Not sure you want to undertake that with
    no clear benefit.

     
  • Alex Wajda

    Alex Wajda - 2011-01-02

    the time is not an issue

    That's my point. Sometimes the time is an issue. Let's say I scrape an
    e-shop which has around 50 thousand items and claims its catalogue is
    updated every day. Crawling this e-shop in single-thread mode takes from 3
    days to one week (depending on the site workload), which means I miss on
    average 3-5 updates per item. Being able to crawl it in 3-5 threads may
    potentially cut the overall time by 2-3 times, which is (almost) what I
    need.

    Don't you think it will work?
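
    As a rough sanity check of that arithmetic (the seconds-per-item figure is
    back-derived from the quoted 3-day single-thread run, and the formula
    assumes the work parallelizes perfectly, so treat both as illustrative):

```java
public class CrawlEstimate {
    // Ideal wall-clock days for a crawl, assuming the work splits evenly
    // across threads. Real speedup will be lower because of site throttling
    // and shared bottlenecks, which matches the "2-3 times" estimate above.
    static double days(int items, double secondsPerItem, int threads) {
        return items * secondsPerItem / threads / 86400.0;
    }

    public static void main(String[] args) {
        // 50,000 items at ~5.2 s each is roughly the quoted 3-day run.
        double single = days(50_000, 5.2, 1);
        double four   = days(50_000, 5.2, 4);
        System.out.printf("1 thread: %.1f days, 4 threads: %.1f days%n",
                single, four);
    }
}
```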

     
  • Taras Emelyanenko

    I think a good variant would be to fork execution of a function in a new
    thread. Something like <call name="function_name" thread="yes">.
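
    That variant - forking a single call into its own thread rather than a pool
    - could look roughly like this in Java (the names are made up, and WH would
    also need a join point before the result is used):

```java
import java.util.concurrent.atomic.AtomicReference;

public class ThreadedCall {
    // Hypothetical "function" body: computes a result from an argument.
    interface Fn {
        Object apply(Object arg);
    }

    // Fork the call in its own thread; the caller keeps the Thread so it
    // can join() before the result in 'out' is actually needed.
    static Thread callAsync(Fn fn, Object arg, AtomicReference<Object> out) {
        Thread t = new Thread(() -> out.set(fn.apply(arg)));
        t.start();
        return t;
    }
}
```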

     
