From: Erik H. <eri...@uc...> - 2012-06-28 16:55:56
At Wed, 27 Jun 2012 15:23:58 -0400, Leon, Anne wrote:
>
> Hi All,
>
> I have a question regarding crawling Flash and Javascript.
> Currently, I am utilizing Heritrix 1.14.4 and Wayback 1.4.2 and I
> have had issues capturing fully functioning websites. Websites that
> utilize javascript heavily have banners missing or empty widget
> boxes, and Flash content is virtually nonexistent. Within the next
> few months we will be upgrading to the newest versions of both
> programs, but I'm concerned that these problems will still exist.
>
> So, I'm wondering if any of you have encountered this issues and
> what have you done to remedy them? Is this a Heritrix issue or a
> Wayback issue? And lastly, did upgrading the software fix the
> problems? Thank you all in advance.

Hi Anne,

You might try the archive-crawler mailing list as well.

All of us have encountered these issues. Capturing JavaScript and Flash content is difficult. Replaying this content is even harder.

Whether it is a Heritrix or a Wayback issue depends: it’s probably both.

If you can figure out what content needs to be captured in order for a site to work, you can then check your Heritrix crawl.log files to see if that content was captured. Heritrix is highly configurable, so if you discover that it is not capturing the content you want, you may be able to change the configuration to make it do so.

After you have ensured that you are capturing the content, you can begin to evaluate whether Wayback is replaying it properly. Whether Wayback can replay the content properly depends on your Wayback configuration. For example, proxy mode can probably replay most content correctly, while I doubt that client-side rewriting will ever work very well.

Finally, the only real way to test whether this is fixed is to try out the new versions of Heritrix and Wayback and evaluate the results.

Hope that helps!

best,
Erik
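
The crawl.log check described above could be scripted roughly as follows. This is only a sketch: it assumes the usual whitespace-separated crawl.log layout in which the second field is the fetch status code and the fourth is the URI, so verify that against your own logs first. The sample log lines, the example.org URLs, and the function names are made up for illustration.

```python
# Hypothetical sample lines; real crawl.log entries come from your own crawl.
SAMPLE_LOG = """
2012-06-27T19:23:58.123Z 200 5120 http://example.org/js/banner.js LE http://example.org/ application/x-javascript #001 20120627192357900+200 sha1:AAAA - -
2012-06-27T19:24:01.456Z 404 312 http://example.org/flash/widget.swf LE http://example.org/ text/html #002 20120627192401200+150 sha1:BBBB - -
"""


def captured_uris(log_lines):
    """Return the set of URIs that were fetched with HTTP status 200.

    Assumes field 2 is the fetch status and field 4 is the URI.
    """
    captured = set()
    for line in log_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or truncated line
        try:
            status = int(fields[1])
        except ValueError:
            continue  # not a status code; skip the line
        if status == 200:
            captured.add(fields[3])
    return captured


def missing_resources(log_lines, needed_uris):
    """List the needed URIs that have no successful fetch in the log."""
    captured = captured_uris(log_lines)
    return [uri for uri in needed_uris if uri not in captured]


# URIs you have determined the site needs in order to work (hypothetical).
needed = [
    "http://example.org/js/banner.js",
    "http://example.org/flash/widget.swf",
    "http://example.org/flash/player.swf",
]

for uri in missing_resources(SAMPLE_LOG.splitlines(), needed):
    print("not captured:", uri)
```

For a real crawl, pass the lines of the actual file instead of SAMPLE_LOG, for example `missing_resources(open("crawl.log"), needed)`. Anything reported as not captured points to a Heritrix configuration question; anything that was captured but still fails on playback points to Wayback.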