From: stack <st...@ar...> - 2006-02-23 06:37:27
stack wrote:
> (Forwarded discussion from the Heritrix list)
>
> ------------------------------------------------------------------------
> ...
> Generally, the pages are shown fine, with the exception of
> javascripts that are retrieved from the live site instead of our arc
> files. Also, WERA is unable to dynamically replace the links inside
> the javascripts.

This leaking to the live web is a difficult problem. Perhaps this
particular JS can be fixed in WERA, but given the variety of ways in which
JS can be conjured, it's unlikely all permutations will be guarded against.

> Sample case:
> http://nwa.nb.no/wera/result.php?time=&url=http%3A%2F%2Fwww.nla.gov.au%2F&query=nla
>
> Check the properties of the webpage to verify that you are still
> within "http://nwa.nb.no". Click on the "Exquisite Watercolors" link and
> verify that we are still viewing the arcfiles. Go back a page and try
> any of the menu links. When you view the properties, it will show that
> you are now browsing the live site instead of the arcfiles.

Have you considered setting your browser to go to your collection via a
proxy? (I don't think this mode is supported yet in WERA; I think it's
possible to set the wayback into a proxy mode.) The proxy could ensure you
never stray off your ARC collection, returning errors if a resource is not
found.

> What I wanted to accomplish is the following:
> 1) Help WERA load the javascripts from our arcfiles instead of the
>    live sites by modifying the loading of the scripts from the html.
>    Instead of the relative /js/xxxx.js, we will change it to
>    http://localhost/wera/......../js/xxxx.js.

(Sverre or Brad: Does the JS inserted at the end of the page by WERA, adding
a base to the page, not affect such JS URLs?)

> 2) Modify the relative links inside javascript files if WERA is not
>    capable of dynamically modifying them also.
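To make item 1 concrete, here is a minimal sketch of that kind of rewrite, assuming the HTML has already been read out of the ARC as a string. It only handles root-relative `src` attributes on `<script>` tags. The class name `ScriptSrcRewriter` and the prefix `http://archive.example/replay` are hypothetical placeholders, not WERA or Heritrix names, and as noted in the thread, URLs built at runtime by the scripts themselves would not be caught by anything like this.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of WERA, wayback or Heritrix.
class ScriptSrcRewriter {
    // Matches src="/path" or src='/path' inside a <script> tag.
    // Only root-relative paths are matched; protocol-relative "//host/..." is skipped.
    private static final Pattern SCRIPT_SRC = Pattern.compile(
        "(<script\\b[^>]*?\\bsrc\\s*=\\s*)([\"'])(/(?!/)[^\"']*)\\2",
        Pattern.CASE_INSENSITIVE);

    // Prefixes every root-relative script src with archivePrefix (an assumed replay URL).
    static String rewrite(String html, String archivePrefix) {
        Matcher m = SCRIPT_SRC.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement =
                m.group(1) + m.group(2) + archivePrefix + m.group(3) + m.group(2);
            // quoteReplacement keeps '$' and '\' in URLs from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<script src=\"/js/menu.js\"></script>";
        // Placeholder prefix; the real replay URL depends on the local setup.
        System.out.println(rewrite(html, "http://archive.example/replay"));
    }
}
```

Absolute `src` values (for example ones that already start with `http://`) are left untouched, so a page that mixes both forms would still leak for the absolute ones unless those are handled separately.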
Would be sweet if any modifications you'd make in rewriting the ARC
files were instead done for you by WERA (or wayback).

St.Ack

> I am planning to use the dk.netarkivet.ArcUtils for this task.
>
> I know that my problem is a little bit off topic but I hope you could
> give additional tips.
>
> Thanks again in advance.
>
> --- In arc...@ya..., stack <stack@...> wrote:
> >
> > alxartes wrote:
> > > St. Ack thanks again for the reply.
> > >
> > > Most of the pages are not displayed the way they should be when
> > > viewed from the source.
> >
> > At this time, are you viewing the pages with WERA? Or how are they
> > being viewed?
> >
> > > I guess it is because the css and javascript files
> > > are not being fetched properly at the loading of the html from the
> > > arcfile. We arrived at this conclusion since we can directly
> > > retrieve the css and js through WERA.
> >
> > So pages are showing fine when viewed with WERA (generally)?
> >
> > > I am planning on modifying the htmls inside the arcfiles to
> > > correct this problem.
> >
> > I'm trying to understand. You want to rewrite ARC files, changing all
> > links so they point back into ARCs (or back to a disk populated with
> > the documents from a set of ARCs)? You do not want to use WERA for
> > viewing pages?
> >
> > > What tool can I use to expand the arcfiles so that I
> > > can modify the files inside? And a tool that will bring the arcfile
> > > together once again? I think this is somewhat off topic but I am a
> > > little bit short of time and would greatly appreciate any input.
> >
> > This section from the developer manual might be of use:
> > http://crawler.archive.org/articles/developer_manual.html#arcs
> > It talks about tools for reading ARCs.
> >
> > One approach would be to subclass ARCReader. This will get you a
> > stream onto ARCs
> > (http://crawler.archive.org/apidocs/org/archive/io/arc/ARCReader.html).
> > Use the adjacent ARCWriter to write new ARCs. To modify the links in
> > pages, you'll first have to find them. You could start with the
> > Extractors that are in Heritrix, subclassing them to add link-rewrite
> > functionality. Such a tool has been asked for on this list in the
> > past, but it's a bit of a job and, in the end, you'll never
> > successfully be able to rewrite all links (think of URLs produced by
> > JS in the page).
> >
> > Will WERA (or the coming wayback,
> > http://archive-access.sourceforge.net/projects/wayback/) not suffice?
> >
> > Yours,
> > St.Ack
> >
> > > Thanks again.
> > >
> > > --- In arc...@ya..., stack <stack@> wrote:
> > > >
> > > > alxartes wrote:
> > > > > Thanks St. Ack.
> > > > >
> > > > > It is really worrisome to see those errors, especially when we
> > > > > are not viewing the arcfiles properly in Wera.
> > > >
> > > > Can you say more about what 'not viewing the arcfiles properly in
> > > > Wera' means? Are pages not being found, or are they missing
> > > > images/stylesheets?
> > > >
> > > > Regarding the local-errors.log, I've upped the priority on an RFE
> > > > that proposes cleaning this log (and added your experience to the
> > > > issue):
> > > > http://sourceforge.net/tracker/index.php?func=detail&aid=1091580&group_id=73833&atid=539099
> > > > > Here is an excerpt from the crawl.log:
> > > > >
> > > > > 84046144 http://www.hdb.gov.sg/hdbwww/ownkvb.mpeg
> > > > > 84046144 http://www5.hdb.gov.sg/hdbwww/ownkvb.mpeg
> > > > > 84046144 http://www7.hdb.gov.sg/hdbwww/ownkvb.mpeg
> > > > > 84046144 https://www5.hdb.gov.sg/hdbwww/ownkvb.mpeg
> > > > > 47097784 http://www.hdb.gov.sg/hdbwww/ownkvs.mpeg
> > > > > 47097784 http://www5.hdb.gov.sg/hdbwww/ownkvs.mpeg
> > > > > 47097784 http://www7.hdb.gov.sg/hdbwww/ownkvs.mpeg
> > > > > 47097784 https://www5.hdb.gov.sg/hdbwww/ownkvs.mpeg
> > > > > 22292823 http://www.hdb.gov.sg/hdbwww/fallingwindow.wmv
> > > > > 22292823 http://www5.hdb.gov.sg/hdbwww/fallingwindow.wmv
> > > > > 22292823 http://www7.hdb.gov.sg/hdbwww/fallingwindow.wmv
> > > > > 22292823 https://www5.hdb.gov.sg/hdbwww/fallingwindow.wmv
> > > > >
> > > > > As you can see, each file is crawled 4 times. I did this
> > > > > crawl using domain scope. Would pathscope with a seed of
> > > > > http://www.hdb.gov.sg prevent the other sites from being
> > > > > crawled? If not, are there other ways to prevent it from
> > > > > happening?
> > > >
> > > > Yeah, the domain scope warns: "It will however reach subdomains of
> > > > the seeds' original domains. www[#].host is considered to be the
> > > > same as host." Reading the code, explicitly stating
> > > > 'www.hdb.gov.sg' doesn't look like it will avoid the problem
> > > > either.
> > > >
> > > > FYI, we're moving away from *scope scopes -- i.e. domainscope,
> > > > pathscope, etc. -- toward decidingscope. The latter gives you
> > > > "more rope" for designing scopes.
> > > >
> > > > It looks like the On*DecideRule has the same issue with 'www',
> > > > though. Looks like you can write a SURT form, something like
> > > > '(sg,gov,hdb,www)', that will only include URIs with a host of
> > > > 'www.hdb.gov.sg' (though it looks like http and https are
> > > > flattened to be the same scheme).
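As a rough illustration of why a host-level form such as '(sg,gov,hdb,www)' separates the four hosts in the crawl.log excerpt, the sketch below reverses the dot-separated host labels and ignores the scheme. This is an assumption-laden toy, not Heritrix's SURT code: the class `SurtHostSketch` and the method `hostToSurtForm` are made up for this example, and the form Heritrix actually expects in a scope may differ in punctuation.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; not the Heritrix SURT implementation.
class SurtHostSketch {
    // Turns the host of a URL into a reversed, comma-separated form,
    // e.g. www.hdb.gov.sg becomes (sg,gov,hdb,www). The scheme is ignored.
    static String hostToSurtForm(String url) {
        String host = URI.create(url).getHost().toLowerCase(Locale.ROOT);
        List<String> labels = new ArrayList<>(Arrays.asList(host.split("\\.")));
        Collections.reverse(labels);
        return "(" + String.join(",", labels) + ")";
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.hdb.gov.sg/hdbwww/ownkvb.mpeg",
            "http://www5.hdb.gov.sg/hdbwww/ownkvb.mpeg",
            "http://www7.hdb.gov.sg/hdbwww/ownkvb.mpeg",
            "https://www5.hdb.gov.sg/hdbwww/ownkvb.mpeg",
        };
        String wanted = "(sg,gov,hdb,www)";
        for (String url : urls) {
            String form = hostToSurtForm(url);
            // Only the first URL has a host form equal to the wanted one.
            System.out.println(form + " " + form.equals(wanted) + " " + url);
        }
    }
}
```

In this sketch the http and https variants of www5 produce the same form, which mirrors the remark above that the two schemes appear to be flattened together.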
> > > > I'll let others -- Igor or Gordon? -- respond. They can give a
> > > > better quality answer than I.
> > > >
> > > > Good stuff,
> > > > St.Ack
> > > >
> > > > > Thank you so much for your time.
> > > > >
> > > > > --- In arc...@ya..., stack <stack@> wrote:
> > > > > >
> > > > > > alxartes wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I am investigating the log files of my crawls and found the
> > > > > > > error below. I hope someone could explain what this means,
> > > > > > > because the other javascripts are crawled fine.
> > > > > > >
> > > > > > > 2006-02-15T03:35:21.747Z
> > > > > > > http://www.macromedia.com/uber/js/omniture_s_code.js
> > > > > > > "Unsupported scheme: javascript"
> > > > > > > javascript:,macromedia,dreamweaver,flash,shockwave,sdc,markme,sdc.shockwave,infopoll,developerlocator.macromedia
> > > > > >
> > > > > > In short, the above is just stating that Heritrix does not
> > > > > > support fetching the 'URI'
> > > > > > "javascript:,macromedia,dreamweaver...". It's not an 'error'.
> > > > > >
> > > > > > Heritrix is regexing over the content of
> > > > > > 'http://www.macromedia.com/uber/js/omniture_s_code.js' looking
> > > > > > for URIs. It found the string
> > > > > >
> > > > > > "javascript:,macromedia,dreamweaver,flash,shockwave,sdc,markme,sdc.shockwave,infopoll,developerlocator.macromedia"
> > > > > >
> > > > > > To the Heritrix regex, the above string looks like a likely
> > > > > > URI. It's inside quotes and starts with what could be a URI
> > > > > > scheme (i.e. 'javascript:').
> > > > > > So, the candidate URI is passed to our URI parser class,
> > > > > > org.archive.net.UURIFactory. This class takes configuration in
> > > > > > heritrix.properties about which URI schemes Heritrix will
> > > > > > accept. Here's the relevant extract:
> > > > > >
> > > > > > ##############################################################################
> > > > > > # U U R I                                                                    #
> > > > > > ##############################################################################
> > > > > > # Any scheme not listed in the below will generate an UnsupportedUriScheme
> > > > > > # exception. Make the list empty to support all schemes.
> > > > > > org.archive.net.UURIFactory.schemes = http, https, dns, invalid
> > > > > >
> > > > > > (We don't currently have an 'UnsupportedUriScheme' exception.
> > > > > > We should add one.)
> > > > > >
> > > > > > Here is where the test is done:
> > > > > > http://crawler.archive.org/xref/org/archive/net/UURIFactory.html#443
> > > > > >
> > > > > > Because the 'javascript' scheme is not in the above list of
> > > > > > supported schemes (nor in the list of schemes to ignore, which
> > > > > > appears later in heritrix.properties), it generates a
> > > > > > URIException with an 'unsupported scheme' message.
> > > > > >
> > > > > > We could do with some clean up in here. Currently all URI
> > > > > > exceptions are lumped into URIException. We could add
> > > > > > subclasses of URIException so the non-errors get logged at a
> > > > > > different level: e.g. FINE for unsupported scheme exceptions.
> > > > > >
> > > > > > St.Ack
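The scheme check described in the oldest message of the thread can be sketched roughly as follows. This is not the UURIFactory source: `SchemeCheckSketch`, `parseSchemes` and `isSupported` are hypothetical names, and treating a candidate with no scheme as acceptable is an assumption made here for simplicity. The only behaviour taken from the thread is that the list comes from a comma-separated property value and that an empty list supports all schemes.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of the check; not Heritrix's UURIFactory.
class SchemeCheckSketch {
    // Parses a value such as "http, https, dns, invalid" into a set of schemes.
    static Set<String> parseSchemes(String propertyValue) {
        Set<String> schemes = new LinkedHashSet<>();
        for (String part : propertyValue.split(",")) {
            String scheme = part.trim().toLowerCase(Locale.ROOT);
            if (!scheme.isEmpty()) {
                schemes.add(scheme);
            }
        }
        return schemes;
    }

    // Returns true when the candidate's scheme is in the supported set.
    // An empty set means every scheme is supported, as the properties comment says.
    static boolean isSupported(String candidate, Set<String> supported) {
        if (supported.isEmpty()) {
            return true;
        }
        int colon = candidate.indexOf(':');
        if (colon <= 0) {
            return true; // assumption: no scheme present, so nothing to reject here
        }
        String scheme = candidate.substring(0, colon).toLowerCase(Locale.ROOT);
        return supported.contains(scheme);
    }

    public static void main(String[] args) {
        Set<String> supported = parseSchemes("http, https, dns, invalid");
        String fromScript = "javascript:,macromedia,dreamweaver";
        String page = "http://www.macromedia.com/uber/js/omniture_s_code.js";
        System.out.println(isSupported(fromScript, supported)); // false
        System.out.println(isSupported(page, supported));       // true
    }
}
```

A rejected candidate in this sketch simply yields `false`; in Heritrix, per the message above, the equivalent outcome is a URIException carrying an 'unsupported scheme' message, which is what ends up in the log.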