Re: [Archive-access-discuss] Crawling Flash and Javascript

>
> Hi Anne,
>
> You might try the archive-crawler mailing list as well.
>
> All of us have encountered these issues. Capturing JavaScript and Flash
> content is difficult. Replaying this content is even harder.
>
> Whether it is a Heritrix or a Wayback issue depends: it’s probably
> both. If you can figure out what content needs to be captured in order
> for a site to work, you can then check your Heritrix crawl.log files
> to see if that content was captured. Heritrix is highly configurable
> and if you discover that Heritrix is not capturing the content you
> want, you may be able to change the configuration to make it capture
> what you want.
>
> After you have ensured that you are capturing the content, you can
> begin to evaluate whether Wayback is replaying it properly. Whether
> Wayback can replay the content properly depends on your Wayback
> configuration. For example, proxy mode can probably replay most
> content correctly, while I doubt that client-side rewriting will
> ever work very well.
>
> Finally, the only real way to test if this is fixed is to try out the
> new versions of Heritrix & Wayback and evaluate the results.
>
>
I am guessing, but it seems to me that not all web objects are being stored
during the Heritrix crawl, because Heritrix (any version) does not execute
JavaScript.
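
One way to test that guess is to list the URLs a page needs in the browser
and compare them against crawl.log. Below is a minimal sketch, assuming the
usual whitespace-separated crawl.log layout in which the second field is the
fetch status code and the fourth is the URI. The function names, the sample
log lines, and the rule that only 2xx counts as "captured" are my own
assumptions for illustration, not anything Heritrix provides.

```python
from typing import Iterable, List, Set


def captured_uris(crawl_log_lines: Iterable[str]) -> Set[str]:
    """Return the URIs that crawl.log reports with a 2xx fetch status.

    Assumes whitespace-separated fields: timestamp, status, size, URI, ...
    Negative status codes (Heritrix-internal failures) are not captures.
    """
    captured = set()
    for line in crawl_log_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        try:
            status = int(fields[1])
        except ValueError:
            continue  # skip lines that do not look like crawl.log entries
        if 200 <= status < 300:
            captured.add(fields[3])
    return captured


def missing_uris(crawl_log_lines: Iterable[str],
                 required_uris: Iterable[str]) -> List[str]:
    """Return the required URIs that were not captured, in input order."""
    captured = captured_uris(crawl_log_lines)
    return [uri for uri in required_uris if uri not in captured]


# Hypothetical log lines and URLs, made up for this example.
SAMPLE_LOG = [
    "2011-06-23T17:12:08.802Z   200       5120 http://example.org/ - - "
    "text/html #001 20110623171208500+300 sha1:AAAA - -",
    "2011-06-23T17:12:09.114Z   404        312 http://example.org/player.swf E "
    "http://example.org/ text/html #002 20110623171209000+100 sha1:BBBB - -",
    "2011-06-23T17:12:09.530Z    -6          - http://example.org/menu.js E "
    "http://example.org/ unknown #003 - - - -",
]

REQUIRED = [
    "http://example.org/",
    "http://example.org/player.swf",
    "http://example.org/data.xml",  # e.g. something only requested by script
]

if __name__ == "__main__":
    for uri in missing_uris(SAMPLE_LOG, REQUIRED):
        print("not captured:", uri)
```

For a real crawl you would pass an open crawl.log file handle instead of
SAMPLE_LOG. A URL that never appears in the log at all (like data.xml above)
would suggest the crawler never discovered it, which is what I would expect
for objects that are only requested by script.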

Has anyone ever considered replacing the core Heritrix 3 web fetcher with
something like HtmlUnit, which would execute JavaScript via Rhino? One
way to implement this would be to create an optional web client, configured
via Spring, which would execute JavaScript to render a page more fully at
crawl time, so that the objects it loads are included in the crawl.

As you mentioned, this is probably something that has come up on the
crawler list.

Jon