Re: [Archive-access-discuss] Crawling Flash and Javascript

I've been working on an external browser processor that plugs into the Heritrix 3 processor chain. It is at a pretty early stage of development, but it is functional.

The extractor processor is here: https://github.com/adam-miller/ExternalBrowserExtractorHTML
I've worked with two different headless browsers, PhantomJS and Zombie.js. So far, PhantomJS has performed best for me. My PhantomJS script is here:
https://github.com/adam-miller/phantomBrowserExtractor

It will not run Flash, but it will run JavaScript and log all asynchronous requests so they can be queued in H3. So far, the main limitation with PhantomJS is that it requests all of the content needed to render the page. This causes duplicate requests, since Heritrix downloads the content on its own. I've been working on customizing PhantomJS to prevent these duplicate requests, but I don't have any code for that online yet.
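
To illustrate the duplicate-request issue, here is a minimal Python sketch (not the code in either repository above) of filtering the URLs a headless browser logged against the URLs the crawler already knows about, so that only new ones get queued. The function name and the fragment-stripping step are assumptions made for this example; Heritrix has its own URI canonicalization and already-seen handling.

```python
from urllib.parse import urldefrag


def urls_to_queue(browser_requests, already_seen):
    """Return browser-logged URLs the crawler has not seen, in first-seen order.

    Hypothetical helper for illustration only; it is not part of Heritrix
    or of the PhantomJS script.
    """
    seen = {urldefrag(u).url for u in already_seen}
    fresh = []
    for raw in browser_requests:
        url = urldefrag(raw).url  # drop "#fragment"; it is never sent to the server
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


if __name__ == "__main__":
    # Made-up example data.
    logged = [
        "http://example.com/app.js",
        "http://example.com/api/items?page=1",
        "http://example.com/api/items?page=1#top",
        "http://example.com/style.css",
    ]
    fetched = {"http://example.com/app.js", "http://example.com/style.css"}
    print(urls_to_queue(logged, fetched))
```

Note that this only avoids queueing the same URL twice. It does not stop the browser itself from fetching content that Heritrix also downloads, which is the part that needs changes to PhantomJS.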

~Adam Miller

> 
>> From: Jon Walton <jon...@gm...>
>> Date: June 28, 2012 11:51:11 AM PDT
>> To: Erik Hetzner <eri...@uc...>
>> Cc: "arc...@li..." <arc...@li...>
>> Subject: Re: [Archive-access-discuss] Crawling Flash and Javascript
>> 
>> 
>> 
>> Hi Anne,
>> 
>> You might try the archive-crawler mailing list as well.
>> 
>> All of us have encountered these issues. Capturing JavaScript & Flash
>> content is difficult. Replaying this content is even harder.
>> 
>> Whether it is a Heritrix or a Wayback issue depends: it’s probably
>> both. If you can figure out what content needs to be captured in order
>> for a site to work, you can then check your Heritrix crawl.log files
>> to see if that content was captured. Heritrix is highly configurable
>> and if you discover that Heritrix is not capturing the content you
>> want, you may be able to change the configuration to make it capture
>> what you want.
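
The crawl.log check suggested above could be scripted roughly as follows. This is a sketch, assuming the usual whitespace-separated crawl.log layout in which the second field is the fetch status code and the fourth is the URI. The function names are made up for the example.

```python
def captured_uris(crawl_log_lines):
    """Collect URIs logged with a 2xx fetch status.

    Assumes the second whitespace-separated field is the fetch status
    code and the fourth is the URI.
    """
    captured = set()
    for line in crawl_log_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        status, uri = fields[1], fields[3]
        # Negative codes (crawler-internal failures) fail isdigit() and are skipped.
        if status.isdigit() and 200 <= int(status) < 300:
            captured.add(uri)
    return captured


def missing_uris(required, crawl_log_lines):
    """Return the required URIs that have no successful entry in the log."""
    return sorted(set(required) - captured_uris(crawl_log_lines))
```

To use it, pass the list of URIs the site needs and an open crawl.log file (any iterable of lines works), for example `missing_uris(required, open("crawl.log"))`. Whatever comes back is content that was never fetched successfully, which points at the crawl rather than at Wayback.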
>> 
>> After you have ensured that you are capturing the content, you can
>> begin to evaluate whether Wayback is properly replaying the content.
>> Whether Wayback can or is properly replaying the content depends on
>> your Wayback configuration. For example, proxy mode can probably
>> replay most content correctly, while I doubt that client-side
>> rewriting will ever work very well.
>> 
>> Finally, the only real way to test if this is fixed is to try out the
>> new versions of Heritrix & Wayback and evaluate the results.
>> 
>> 
>> I am guessing, but it seems to me that not all web objects are being stored during the Heritrix crawl, because Heritrix (any version) does not execute JavaScript.
>> 
>> Has anyone ever considered replacing the core Heritrix 3 web fetcher with something like HtmlUnit, which would execute JavaScript via Rhino? One way to implement this would be to create an optional web client, configured via Spring, that executes JavaScript to better render a page at crawl time, so that these objects are included in the crawl.
>> 
>> As you mentioned, this is probably something that has come up on the crawler list.
>> 
>> Jon
>> 
>> 
>>  
>> _______________________________________________
>> Archive-access-discuss mailing list
>> Arc...@li...
>> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
>