
Asynchronous scraping: What do you think?

Alex Wajda
2010-12-30
2012-09-04
  • Alex Wajda

    Alex Wajda - 2010-12-30

    There are two common ways to crawl paginated web sites: either by navigating
    them page by page via the 'Next' button, or by calculating the total page
    count and requesting every page in a loop, modifying the URL each time by
    increasing the page number in it from 1 to the end. The latter approach is
    not only more robust, it also technically allows you to crawl the site
    asynchronously using several threads in parallel. Unfortunately, WH does not
    support concurrency.
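
    To illustrate the point outside of WH: once pages are addressed by number,
    the fetches are independent and can be pooled. A minimal Java sketch (the
    URL, page count, and pluggable fetcher are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.function.IntFunction;

public class ParallelPager {
    // Build the URL for a given page number (this base URL is hypothetical).
    static String pageUrl(int page) {
        return "http://some.site.com/some-cool-stuff/page=" + page;
    }

    // Fetch pages 1..totalPages using a fixed-size thread pool.
    // The fetcher is injected so the download mechanism stays pluggable.
    static List<String> fetchAll(int totalPages, int poolSize,
                                 IntFunction<String> fetcher)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int p = 1; p <= totalPages; p++) {
                final int page = p;
                futures.add(pool.submit(() -> fetcher.apply(page)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // blocks; preserves page order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

    The only shared state here is the URL template, which is exactly why the
    page-number approach parallelizes while the 'Next button' approach does not.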

    So, here's an idea - we can create a dedicated WH processor (e.g.
    <asynch-exec> or something) which would execute the enclosed commands in a
    separate thread. Surely we would have to have a purely immutable scraper
    object model and to clone the dynamic context before running a new thread.
    All of this requires some work, but it seems pretty doable to me.

    Just want to hear your opinions on whether this kind of processor would be
    useful (well, to me it is for sure) and what I should keep in mind while
    implementing it. Any heads-up?

    Suggested example of usage:

    <loop item="page_num">
        <list>...</list>
        <body>
    
            <asynch-exec thread-pool-size="10">
    
                <file action="write" path="${page_num}.xml">
                    <html-to-xml>
                <http url="http://some.site.com/some-cool-stuff/page=${page_num}"/>
                    </html-to-xml>
                </file>
    
            </asynch-exec>
    
        </body>
    </loop>
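
    Internally, such a processor could boil down to something like the sketch
    below - not WH code; the Processor interface and the map-based context are
    simplified stand-ins for the real WH classes, but they show the two points
    above: immutable processors plus a per-task clone of the dynamic context.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;

public class AsynchExec {
    // Hypothetical processor: takes a (cloned) context, returns a result.
    interface Processor {
        Object execute(Map<String, Object> context);
    }

    // Stand-in for WH's dynamic scraper context (just a map here).
    // Snapshotting it per task is what makes parallel execution safe,
    // provided the processors themselves are immutable.
    static Map<String, Object> cloneContext(Map<String, Object> ctx) {
        return new HashMap<>(ctx);
    }

    private final ExecutorService pool;
    private final List<Future<Object>> pending = new ArrayList<>();

    AsynchExec(int threadPoolSize) {
        this.pool = Executors.newFixedThreadPool(threadPoolSize);
    }

    // Run the enclosed body on a snapshot of the current context,
    // so later loop iterations can't mutate it under the task's feet.
    void submit(Processor body, Map<String, Object> context) {
        Map<String, Object> snapshot = cloneContext(context);
        pending.add(pool.submit(() -> body.execute(snapshot)));
    }

    // Wait for every submitted body to finish (e.g. at the end of the loop).
    List<Object> awaitAll() throws InterruptedException, ExecutionException {
        List<Object> results = new ArrayList<>();
        for (Future<Object> f : pending) {
            results.add(f.get());
        }
        pool.shutdown();
        return results;
    }
}
```

    The awaitAll() step matters: without an explicit join point, the <loop>
    would finish while file writes are still in flight.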
    
     
  • newbee

    newbee - 2010-12-30

    I think it is a good idea, but I'm not sure if it's worth it... is there an
    efficiency constraint that requires modifying the WH model? I personally
    have written close to 100 WH scripts, and some of them dynamically request
    other pages embedded in the initial page. Generally, if one needs to request
    multiple sub-links from the initial page, time is not an issue and the
    "loop" method should work just fine.

    Accomplishing what you have in mind will require significant changes to WH,
    as it is (or was) not thread-safe. Not sure you want to undertake that with
    no clear benefit.

     
  • Alex Wajda

    Alex Wajda - 2011-01-02

    the time is not an issue

    That's my point. Sometimes the time is an issue. Let's say I scrape an
    e-shop which has around 50 thousand items and claims its catalogue is
    updated every day. Crawling this e-shop in single-thread mode takes from 3
    days to one week (depending on the site workload), which means I miss on
    average 3-5 updates per item. Being able to crawl it in 3-5 threads may
    potentially cut the overall time by 2-3 times, which is (almost) what I
    need.

    Don't you think it will work?
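
    As a rough sanity check of that arithmetic (the seconds-per-item figure is
    back-derived from the quoted 3-day single-thread run, and the formula
    assumes the work parallelizes perfectly, so treat both as illustrative):

```java
public class CrawlEstimate {
    // Ideal wall-clock days for a crawl, assuming the work splits evenly
    // across threads. Real speedup will be lower because of site throttling
    // and shared bottlenecks, which matches the "2-3 times" estimate above.
    static double days(int items, double secondsPerItem, int threads) {
        return items * secondsPerItem / threads / 86400.0;
    }

    public static void main(String[] args) {
        // 50,000 items at ~5.2 s each is roughly the quoted 3-day run.
        double single = days(50_000, 5.2, 1);
        double four   = days(50_000, 5.2, 4);
        System.out.printf("1 thread: %.1f days, 4 threads: %.1f days%n",
                single, four);
    }
}
```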

     
  • Taras Emelyanenko

    I think a good variant would be to fork execution of a function in a new
    thread. Something like <call name="function_name" thread="yes">.
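
    That variant - forking a single call into its own thread rather than a pool
    - could look roughly like this in Java (the names are made up, and WH would
    also need a join point before the result is used):

```java
import java.util.concurrent.atomic.AtomicReference;

public class ThreadedCall {
    // Hypothetical "function" body: computes a result from an argument.
    interface Fn {
        Object apply(Object arg);
    }

    // Fork the call in its own thread; the caller keeps the Thread so it
    // can join() before the result in 'out' is actually needed.
    static Thread callAsync(Fn fn, Object arg, AtomicReference<Object> out) {
        Thread t = new Thread(() -> out.set(fn.apply(arg)));
        t.start();
        return t;
    }
}
```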

     
