|
From: stack <st...@ar...> - 2006-02-23 05:27:50
Attachments:
[archive-crawler] Re: Help with Uri-errors.log
|
(Forwarded discussion from the Heritrix list) |
|
From: stack <st...@ar...> - 2006-02-23 06:37:27
|
stack wrote:
> (Forwarded discussion from the Heritrix list)
>
> ------------------------------------------------------------------------
> ...
> Generally, the pages are shown fine with the exception of
> javascripts that are retrieved from the live site instead of our arc
> files. Also, WERA is unable to dynamically replace the links inside
> the javascripts.

This leaking to the live web is a difficult problem. Perhaps this particular JS can be fixed in WERA, but given the variety of ways in which JS can be conjured, it's unlikely all permutations will be guarded against.

> Sample case:
> http://nwa.nb.no/wera/result.php?time=&url=http%3A%2F%2Fwww.nla.gov.au%2F&query=nla
>
> Check the properties of the webpage to verify that you are still
> within "http://nwa.nb.no". Click on the "Exquisite Watercolors" link and
> verify that we are still viewing the arcfiles. Go back a page and try
> any menu links. When you view the properties, it will show that you
> are indeed browsing the live site instead of the arcfiles.

Have you considered setting your browser to go to your collection via a proxy? (I don't think this mode is supported yet in WERA. I think it's possible to set the wayback into a proxy mode.) The proxy could ensure you never strayed off your ARC collection, returning errors if a resource is not found.

> What I wanted to accomplish are the following:
> 1) Help WERA load the javascripts from our arcfiles instead of the
> live sites by modifying the loading of the scripts from the html.
> Instead of the relative /js/xxxx.js, we will change it to
> http://localhost/wera/......../js/xxxx.js.

(Sverre or Brad: Does the JS inserted at the end of the page by WERA, adding a base to the page, not affect such JS URLs?)

> 2) Modify the relative links inside javascript files if WERA is not
> capable of dynamically modifying them also.

Would be sweet if any modifications you'd make in rewriting the ARC files were instead done for you by WERA (or wayback).

St.Ack

> I am planning to use the dk.netarkivet.ArcUtils for this task.
>
> I know that my problem is a little bit off topic but I hope you could
> give additional tips.
>
> Thanks again in advance.
>
> --- In arc...@ya..., stack <stack@...> wrote:
> >
> > alxartes wrote:
> > > St. Ack thanks again for the reply.
> > >
> > > Most of the pages are not displayed the way they should be when
> > > viewed from the source.
> >
> > At this time, are you viewing the pages with WERA? Or how are they
> > being viewed?
> >
> > > I guess it is because the css and javascript files
> > > are not being fetched properly at the loading of the html from the
> > > arcfile. We arrived at this conclusion since we can directly
> > > retrieve the css and js through WERA.
> >
> > So pages are showing fine when viewed with WERA (generally)?
> >
> > > I am planning on modifying the htmls inside the arcfiles to
> > > correct this problem.
> >
> > I'm trying to understand. You want to rewrite ARC files changing all
> > links so they point back into ARCs (or back to a disk populated with
> > the documents from a set of ARCs)? You do not want to use WERA for
> > viewing pages?
> >
> > > What tool can I use to expand the arcfiles so that I
> > > can modify the files inside? And a tool that will bring the arcfile
> > > together once again? I think this is somewhat off topic but I am a
> > > little bit out of time and would greatly appreciate any input.
> >
> > This section from the dev. manual might be of use:
> > http://crawler.archive.org/articles/developer_manual.html#arcs. It
> > talks about tools for reading ARCs.
> >
> > One approach would subclass ARCReader. This will get you a stream
> > onto ARCs
> > (http://crawler.archive.org/apidocs/org/archive/io/arc/ARCReader.html).
> > Use the adjacent ARCWriter to write new ARCs. To modify the links in
> > pages, you'll first have to find them. You could start with the
> > Extractors that are in Heritrix, subclassing them to add link-rewrite
> > functionality. Such a tool has been asked for on this list in the past
> > but it's a bit of a job and in the end, you'll never successfully be
> > able to rewrite all links (think URLs produced by JS in the page).
> >
> > Will WERA (or the coming wayback,
> > http://archive-access.sourceforge.net/projects/wayback/) not
> > suffice?
> >
> > Yours,
> > St.Ack
> >
> > > Thanks again.
> > >
> > > --- In arc...@ya..., stack <stack@> wrote:
> > > >
> > > > alxartes wrote:
> > > > > Thanks St. Ack.
> > > > >
> > > > > It is really worrisome to see those errors especially when we are
> > > > > not viewing the arcfiles properly in Wera.
> > > >
> > > > Can you say more about what 'not viewing the arcfiles properly in
> > > > Wera' means? Are pages not being found or are they missing
> > > > images/stylesheets?
> > > >
> > > > Regards the local-errors.log, I've upped priority on an RFE that
> > > > proposes cleaning this log (and added your experience to the issue):
> > > > http://sourceforge.net/tracker/index.php?func=detail&aid=1091580&group_id=73833&atid=539099.
> > > >
> > > > > Here is an excerpt from the crawl.log:
> > > > >
> > > > > 84046144 http://www.hdb.gov.sg/hdbwww/ownkvb.mpeg
> > > > > 84046144 http://www5.hdb.gov.sg/hdbwww/ownkvb.mpeg
> > > > > 84046144 http://www7.hdb.gov.sg/hdbwww/ownkvb.mpeg
> > > > > 84046144 https://www5.hdb.gov.sg/hdbwww/ownkvb.mpeg
> > > > > 47097784 http://www.hdb.gov.sg/hdbwww/ownkvs.mpeg
> > > > > 47097784 http://www5.hdb.gov.sg/hdbwww/ownkvs.mpeg
> > > > > 47097784 http://www7.hdb.gov.sg/hdbwww/ownkvs.mpeg
> > > > > 47097784 https://www5.hdb.gov.sg/hdbwww/ownkvs.mpeg
> > > > > 22292823 http://www.hdb.gov.sg/hdbwww/fallingwindow.wmv
> > > > > 22292823 http://www5.hdb.gov.sg/hdbwww/fallingwindow.wmv
> > > > > 22292823 http://www7.hdb.gov.sg/hdbwww/fallingwindow.wmv
> > > > > 22292823 https://www5.hdb.gov.sg/hdbwww/fallingwindow.wmv
> > > > >
> > > > > As you can see, a certain file is crawled 4 times. I have done this
> > > > > crawl using domain scope. Would pathscope with a seed of
> > > > > http://www.hdb.gov.sg prevent the other sites from being crawled? If
> > > > > not, are there other ways to prevent it from happening?
> > > >
> > > > Yeah, the domain scope warns: "It will however reach subdomains of the
> > > > seeds' original domains. www[#].host is considered to be the same as
> > > > host." Explicitly stating 'www.hdb.gov.sg' doesn't look like it will
> > > > avoid the problem either, reading the code.
> > > >
> > > > FYI, we're moving away from *scope scopes -- i.e. domainscope,
> > > > pathscope, etc. -- toward decidingscope. The latter gives you "more
> > > > rope" designing scopes.
> > > >
> > > > It looks like the On*DecideRule though has the same issue with 'www'.
> > > > Looks like you can write a SURT form, something like '(sg,gov,hdb,www)',
> > > > that will only include URIs with a host of 'www.hdb.gov.sg' (though it
> > > > looks like http and https are flattened to be the same scheme).
> > > >
> > > > I'll let others -- Igor or Gordon? -- respond. They can give a
> > > > better quality answer than I.
> > > >
> > > > Good stuff,
> > > > St.Ack
> > > >
> > > > > Thank you so much for your time.
> > > > >
> > > > > --- In arc...@ya..., stack <stack@> wrote:
> > > > > >
> > > > > > alxartes wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I am investigating the log files of my crawls and found the error
> > > > > > > below. I hope someone could explain what this means because the
> > > > > > > other javascripts are crawled fine.
> > > > > > >
> > > > > > > 2006-02-15T03:35:21.747Z
> > > > > > > http://www.macromedia.com/uber/js/omniture_s_code.js
> > > > > > > "Unsupported scheme: javascript"
> > > > > > > javascript:,macromedia,dreamweaver,flash,shockwave,sdc,markme,sdc.shockwave,infopoll,developerlocator.macromedia
> > > > > >
> > > > > > In short, the above is just stating that Heritrix does not support
> > > > > > fetching the 'URI' "javascript:,macromedia,dreamweaver...". It's not
> > > > > > an 'error'.
> > > > > >
> > > > > > Heritrix is regexing over the content of
> > > > > > 'http://www.macromedia.com/uber/js/omniture_s_code.js' looking for
> > > > > > URIs. It found the string
> > > > > >
> > > > > > "javascript:,macromedia,dreamweaver,flash,shockwave,sdc,markme,sdc.shockwave,infopoll,developerlocator.macromedia"
> > > > > >
> > > > > > To the Heritrix regex, the above string looks like a likely URI. It's
> > > > > > inside quotes and starts with what could be a URI scheme
> > > > > > (i.e. 'javascript:').
> > > > > >
> > > > > > So, the candidate URI is passed to our URI parser class,
> > > > > > org.archive.net.UURIFactory. This class takes configuration in
> > > > > > heritrix.properties about which URI schemes Heritrix will accept.
> > > > > > Here's the relevant extract:
> > > > > >
> > > > > > ##############################################################################
> > > > > > # U U R I                                                                    #
> > > > > > ##############################################################################
> > > > > > # Any scheme not listed in the below will generate an UnsupportedUriScheme
> > > > > > # exception. Make the list empty to support all schemes.
> > > > > > org.archive.net.UURIFactory.schemes = http, https, dns, invalid
> > > > > >
> > > > > > (We don't currently have an 'UnsupportedUriScheme' exception. We should
> > > > > > add one.)
> > > > > >
> > > > > > Here is where the test is done:
> > > > > > http://crawler.archive.org/xref/org/archive/net/UURIFactory.html#443
> > > > > >
> > > > > > Because the 'javascript' scheme is not in the above supported schemes
> > > > > > list (nor in the list of schemes to ignore which appears later in
> > > > > > heritrix.properties), it generates a URIException with an 'unsupported
> > > > > > scheme' message.
> > > > > >
> > > > > > We could do with some clean up in here. Currently all URI exceptions
> > > > > > are lumped into URIException. We could add subclasses of URIE so the
> > > > > > non-errors get logged at a different level: e.g. FINE for unsupported
> > > > > > scheme exceptions.
> > > > > >
> > > > > > St.Ack |
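For readers following the scheme check described at the end of the message above, here is a minimal sketch of the idea in Python. It is not Heritrix code: the function names `parse_schemes` and `check_scheme` are made up for illustration, the separate "schemes to ignore" list is not modelled, and a plain `ValueError` stands in for the URIException mentioned in the thread.

```python
# Illustrative only: a simplified stand-in for the behaviour described for
# org.archive.net.UURIFactory. Names here are hypothetical, not Heritrix API.

def parse_schemes(property_value):
    """Split a comma-separated property value such as 'http, https, dns'."""
    return {s.strip().lower() for s in property_value.split(",") if s.strip()}


def check_scheme(candidate, supported):
    """Return the candidate's scheme, or raise ValueError if unsupported.

    An empty 'supported' set accepts every scheme, as the comment in the
    properties extract describes. Candidates without a colon are treated
    as having no scheme (for example a relative path) and return None.
    """
    scheme, sep, _rest = candidate.partition(":")
    if not sep:
        return None
    scheme = scheme.lower()
    if supported and scheme not in supported:
        raise ValueError("Unsupported scheme: " + scheme)
    return scheme


# The list quoted from heritrix.properties in the thread.
SUPPORTED = parse_schemes("http, https, dns, invalid")
```

Calling `check_scheme("javascript:,macromedia,dreamweaver", SUPPORTED)` raises the "Unsupported scheme" error, which matches the log entry discussed: the string merely looked like a URI to the extractor, so the entry is noise rather than a crawl failure.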
|
From: stack <st...@ar...> - 2006-02-23 18:31:23
|
stack wrote:
> stack wrote:
>> (Forwarded discussion from the Heritrix list)
>>
>> ------------------------------------------------------------------------
>> ...
>> Generally, the pages are shown fine with the exceptions of
>> javascripts that are retrieved from the live site instead of our arc
>> files. Also, WERA is unable to dynamically replace the links inside
>> the javascipts.

Related, here are all current issues regards JS (the first originally reported by Charles of LU):

"[ 1312214 ] [wera/wayback js] More redirects to llive web (look at it)."
https://sourceforge.net/tracker/index.php?func=detail&aid=1312214&group_id=118427&atid=681137

"[ 1280447 ] [wera/wayback js] Link rewritng not working well for frames"
https://sourceforge.net/tracker/index.php?func=detail&aid=1280447&group_id=118427&atid=681137

"[ 1421112 ] WERA web page display Menus in JS"
https://sourceforge.net/tracker/index.php?func=detail&aid=1421112&group_id=118427&atid=681137

St.Ack |
|
From: Sverre B. <sve...@nb...> - 2006-03-14 15:08:30
|
Hi there.

Sorry for not responding earlier to all the issues discussed in recent weeks. I'm working on adding URL canonicalization and proxy mode support in WERA and the results so far are promising. Some comments below.

I'll prepare a new release for this week. I'll even try to convince the people maintaining our web servers to add the proxy setup on the wera demo site.

Please ask more questions, this time I promise to get back to you a bit sooner.

Regards
Sverre

On Wed, 2006-02-22 at 22:36 -0800, stack wrote:
> stack wrote:
> > (Forwarded discussion from the Heritrix list)
> >
> > ------------------------------------------------------------------------
> > ...
> > Generally, the pages are shown fine with the exception of
> > javascripts that are retrieved from the live site instead of our arc
> > files. Also, WERA is unable to dynamically replace the links inside
> > the javascripts.
>
> This leaking to the live web is a difficult problem. Perhaps this
> particular JS can be fixed in WERA, but given the variety of ways in
> which JS can be conjured, it's unlikely all permutations will be
> guarded against.

No way we're gonna catch all JS-generated URLs by using Javascript or server-side parsing. At least I'm not going to invest a lot of time in bullet-proofing the JS, others are welcome though ;-) I'd prefer a combination of JS and/or server-side parsing and a proxy solution that catches "the rest", i.e. the leakages out to the internet.

> > Sample case:
> > http://nwa.nb.no/wera/result.php?time=&url=http%3A%2F%2Fwww.nla.gov.au%2F&query=nla
> >
> > Check the properties of the webpage to verify that you are still
> > within "http://nwa.nb.no". Click on the "Exquisite Watercolors" link and
> > verify that we are still viewing the arcfiles. Go back a page and try
> > any menu links. When you view the properties, it will show that you
> > are indeed browsing the live site instead of the arcfiles.

The proxy mode I'm working on does handle the above case.

> Have you considered setting your browser to go to your collection via a
> proxy (I don't think this mode is supported yet in WERA. I think it's
> possible to set the wayback into a proxy mode). The proxy could ensure
> you never strayed off your ARC collection, returning errors if a
> resource is not found.
>
> > What I wanted to accomplish are the following:
> > 1) Help WERA load the javascripts from our arcfiles instead of the
> > live sites by modifying the loading of the scripts from the html.
> > Instead of the relative /js/xxxx.js, we will change it to
> > http://localhost/wera/......../js/xxxx.js.
>
> (Sverre or Brad: Does the JS inserted at the end of the page by WERA,
> adding a base to the page, not affect such JS URLs?)

The JS injected by wera should take care of this (eh, I'm not an expert on javascript - I did modify IA's original Wayback JS to fit WERA, but I do not have a thorough understanding of it). A big problem with WERA as it is now is that you have no easy way of telling what is fetched from the internet and what is fetched through WERA. Cutting the browser off from the internet by using a proxy that redirects the leaking links back to WERA makes it a lot easier to debug and improve the JS rewriting.

> > 2) Modify the relative links inside javascript files if WERA is not
> > capable of dynamically modifying them also.

If you really need to change the javascript files before feeding them to the client, I would recommend that you implement this in WERA rather than start messing with the ARC files. If you look in the Wera config file you'll see that there are different handlers for different mime types ($conf_document_handler). The text/html handler injects the JS for rewriting links. Any other mime type is handled by a passthrough handler. If the javascripts are stored in the archive with one (or more) distinguished mime types, you could write a handler especially for this/these.
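
The per-mime-type handler idea can be sketched roughly as follows. This is Python for illustration only, not WERA's PHP code and not its actual handler interface: `WERA_PREFIX`, the handler signatures, and the two JavaScript mime types chosen are all assumptions, and the regex only catches quoted root-relative paths such as "/js/xxxx.js", so URLs assembled at runtime would still leak.

```python
import re

# Hypothetical replay prefix; the real WERA URL layout is not shown in this thread.
WERA_PREFIX = "http://localhost/wera/archive"

# Quoted string that starts with a single "/" (not "//"), e.g. "/js/menu.js".
ROOT_RELATIVE = re.compile(r"""(["'])(/(?!/)[^"'\s]*)\1""")


def passthrough_handler(body, site):
    """Default handler: return the archived document unchanged."""
    return body


def javascript_handler(body, site):
    """Rewrite quoted root-relative paths so they point back into the archive."""
    def rewrite(match):
        quote, path = match.group(1), match.group(2)
        return quote + WERA_PREFIX + "/" + site + path + quote
    return ROOT_RELATIVE.sub(rewrite, body)


# Mime types assumed to identify JavaScript in the archive.
HANDLERS = {
    "text/javascript": javascript_handler,
    "application/x-javascript": javascript_handler,
}


def handle(mime_type, body, site):
    """Dispatch on mime type, falling back to the passthrough handler."""
    return HANDLERS.get(mime_type, passthrough_handler)(body, site)
```

With these assumptions, `handle("text/javascript", 'load("/js/menu.js");', "www.nla.gov.au")` returns the same statement with the path prefixed by the replay URL, while any other mime type passes through untouched.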

> Would be sweet if any modifications you'd make in rewriting the ARC
> files were instead done for you by WERA (or wayback).
>
> St.Ack
>
> > I am planning to use the dk.netarkivet.ArcUtils for this task.
> >
> > I know that my problem is a little bit off topic but I hope you could
> > give additional tips.
> >
> > Thanks again in advance.
> >
> > --- In arc...@ya..., stack <stack@...> wrote:
> > >
> > > alxartes wrote:
> > > > St. Ack thanks again for the reply.
> > > >
> > > > Most of the pages are not displayed the way they should be when
> > > > viewed from the source.

When you view the source of the web page displayed in the lower frame of the wera timeline view, you will not see the links rewritten. The source you see is the source before the JS "kicks in". This can be a bit annoying when it comes to debugging the JS ;-) |
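The proxy approach discussed in this thread comes down to one rule: answer only from the archive, and return an error instead of falling through to the live web. A minimal sketch of that rule in Python, assuming a hypothetical in-memory index (`ARCHIVE_INDEX`) in place of a real ARC lookup; it is not how WERA or wayback implement proxy mode:

```python
# Hypothetical stand-in for an index over ARC files: URL -> archived bytes.
ARCHIVE_INDEX = {
    "http://www.nla.gov.au/": b"<html>archived home page</html>",
}


def proxy_fetch(url, index):
    """Return (status, body) from the archive only; never contact the live site."""
    body = index.get(url)
    if body is None:
        # A missing capture becomes a visible error rather than a silent leak.
        return 404, b"Not in archive: " + url.encode("utf-8")
    return 200, body
```

Because every request either hits the archive or fails loudly, leaks that JS rewriting misses show up as 404s, which is what makes the rewriting easier to debug.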