From: Brad T. <br...@ar...> - 2006-04-06 17:53:32
|
cc'd archive-access-discuss so others can comment, etc.

Lukas Matejka wrote:
> On Thu, 6 April 2006 at 05:06, you wrote:
>
>>> You can visit
>>> http://raptor.webarchiv.cz:8080/wayback/20060214122556/www.agromanual.cz/cz/
>>> There are question marks instead of the special Czech characters (but I'm
>>> not sure whether your browser is able to show any Czech characters).
>>>
>>> I think there may be a bug in sending characters to the output.
>>> What exactly is done with a page when it is restored from the archive?
>>
>> I'm pretty sure this was caused by a character encoding bug (aka a
>> total lack of proper handling of non-ASCII documents...) which I'm
>> hoping is now fixed in HEAD.
>>
>> If you get a chance to try out the fix and let me know if you're still
>> seeing the problem, that would be great.
>
> I've just applied the patch and it seems to work well!
>
> Good work!

Great! Glad to hear it's working better, and thanks! I'm hoping to release an 0.4.1 today which includes this fix.

> l.
>
> p.s.
> What exactly does the description "--> (redirect) (new version)" in the
> results mean?

Each line in the index indicates, among other things:

* the HTTP redirect (Location) URL
* the document digest

If the page is thought to redirect to another URL, this is indicated in the search result page with a "(redirect)".

If the digest has changed from the value for the previous version of the same URL, then "(new version)" is indicated.

(At least this is how it's supposed to be working.)

We haven't yet spent much time making the rendering JSPs very pretty. If you have suggestions, etc., please forward them on.

>> Thanks!
>>
>> Brad
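The "(new version)" logic Brad describes can be sketched roughly like this (a hypothetical illustration, not the actual Wayback code): walk the index entries for one URL in date order and flag a capture whenever its digest differs from the previous capture's digest.

```python
# Hypothetical sketch of the "(new version)" annotation described above:
# flag a capture when its content digest differs from the previous
# capture of the same URL. Not the actual Wayback implementation.

def annotate_versions(captures):
    """captures: list of (timestamp, digest) for one URL, oldest first."""
    annotated = []
    prev_digest = None
    for timestamp, digest in captures:
        new_version = digest != prev_digest
        annotated.append((timestamp, digest, new_version))
        prev_digest = digest
    return annotated

captures = [
    ("20060214122556", "AAYVISCY"),
    ("20060301080000", "AAYVISCY"),   # same digest: not a new version
    ("20060315090000", "CB44IW2L"),   # digest changed: "(new version)"
]
```

The timestamp and digest values here are made up for illustration; only the comparison rule follows the description above.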
|
From: Oskar G. <osk...@kb...> - 2006-04-06 14:19:26
|
Hi everyone!

Let me first introduce myself to those of you who don't know me already. My name is Oskar Grenholm and I work as a programmer at the National Library of Sweden, mainly on things related to our web archive here.

Lately I have made some minor improvements to the way proxy mode works in the Open Wayback Machine. Those changes make it possible to surf not only the most recent copy of a page in the web archive, but any available copy. This can be done with just the Wayback Machine, but to aid (and perhaps simplify) the surfing I have also started working on a Firefox extension that helps the user with common tasks encountered when surfing a web archive. Among other things, this WAX Toolbar provides a search field for searching the Wayback Machine for different URLs, or for doing a full-text search against a NutchWAX index (if one is available, of course). You can also use the toolbar to switch between proxy mode and the regular Internet, and, when in proxy mode, easily go back and forth in time.

The changes made to the Wayback are not many. The main idea is that you have a BDB index that holds mappings between ids (a unique id if the toolbar was used, otherwise the IP address the request was made from) and a preferred time to surf at. This timestamp is set either when you choose a page to visit from the search interface in the WB or by the WAX Toolbar. For each request made to the proxy, the WB then looks up this timestamp and returns the page that is closest in time.

Patches for these changes are attached to this e-mail. Four of the files are existing files that have been modified somewhat, and two of them are new (BDBMapper.java and Redirect.jsp). Also attached is a tar file containing the source for the Firefox extension. If you untar it and enter the directory, you can just run 'ant' and a file named WaxToolbar.xpi will be built. That is the actual Firefox extension, and it can be installed like any other extension (i.e., by double-clicking it from within Firefox). When the extension is installed (and after a restart of Firefox) a new toolbar will be there. In the Tools menu there will also be a WAX Toolbar Configuration option; using this you can set the proxy to use (the WB) and a server running NutchWAX.

Finally, I have attached an example of a web.xml that can be used when running the WB with these new changes and the WAX Toolbar. Some new things have been added to it, namely a parameter specifying the redirect path (the Redirect.jsp mentioned above) and a servlet called xmlquery that runs in parallel with the normal query interface and is used by the extension to find the times at which a page has been archived.

So, let the feedback begin!

Regards, Oskar.
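The "closest in time" lookup Oskar describes could look roughly like this (an illustrative sketch, not the code from the attached patches): given the client's preferred 14-digit timestamp and the list of timestamps at which a URL was captured, pick the capture with the smallest absolute time distance.

```python
# Illustrative sketch of the proxy's "closest in time" selection
# described above; not the code from the attached patches. Timestamps
# are 14-digit YYYYMMDDHHMMSS strings as used in Wayback URLs.
from datetime import datetime

def closest_capture(preferred, capture_timestamps):
    fmt = "%Y%m%d%H%M%S"
    target = datetime.strptime(preferred, fmt)
    # Choose the capture whose parsed time is nearest the preferred time.
    return min(
        capture_timestamps,
        key=lambda ts: abs(datetime.strptime(ts, fmt) - target),
    )

# A client surfing "at" 2006-02-20 gets the 2006-02-14 capture,
# since it is nearer than the 2005 or April 2006 ones.
print(closest_capture("20060220000000",
                      ["20051101120000", "20060214122556", "20060406175332"]))
```

The capture timestamps here are invented for the example; the point is only the nearest-in-time selection rule.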
|
From: Brad T. <br...@ar...> - 2006-04-03 21:49:54
|
Hello archive-access!

I wanted to take a few minutes to introduce myself and the new Wayback project, which has been mentioned on this list but never formally announced.

This project is designed to eventually be the Internet Archive's standard tool for querying and replaying archived content. The current production Wayback Machine (web.archive.org) software allows Internet users to view archived documents from the Internet Archive's web collection, which contains over 60 billion resources. This new Wayback project seeks to replace the classic Wayback Machine's functionality in an open-source, extensible and redistributable Java package.

There are dramatic variations in the ways that people want to use this software. At one end of the spectrum is the user who simply wants to look at content they've just crawled with the Heritrix web crawler on their personal workstation. At the other end is the Internet Archive, needing to serve hundreds of requests per second against its 20-million-ARC-file collection. In between is everything from users experimenting with full-text search technologies to others trying out new methods of replaying archived content using browser extensions. To address these varying requirements, a good deal of the project's focus is on modularity and extensibility, so that components can be swapped out and combined to satisfy diverse installation needs.

The very early (and unannounced) 0.2.0 release enabled two methods of replaying content in ARC format: the "standard" archival URL mode, and also a new proxy mode, where a user configures their browser to proxy requests through a Wayback server. This proxy mode addresses many, if not most, of the problems reported with the production Wayback Machine's archival URL replay mechanism. The 0.2.0 version operated only in a standalone mode, requiring that all ARC files be located on the same machine running the Wayback software.

We have just released a new version, 0.4.0, of the Wayback software, which you can read about in more detail at the project's home page: http://archive-access.sourceforge.net/projects/wayback/

This version has solidified some of the internal workings of the software, addressed the usual set of bugs found in new codebases, and also includes some major new capabilities. The first major feature is the ability to access documents from ARC files stored on remote servers, which has significant scaling ramifications. There have also been substantial improvements in both the query UI capabilities and in replaying documents. Also, the Wayback software can now be queried using an OpenSearch API, and preliminary development has been completed to allow requests to be satisfied using a NutchWAX full-text index.

We plan to release 0.6.0 in the next couple of months, which will include better packaging and substantial UI improvements, to make the Wayback software feature-comparable with the WERA application. The major features currently present in WERA that have not yet been developed in the Wayback are:

* clickable "timeline" view in replay mode
* very slick install application
* vastly better documentation
* better support/testing for international character sets

This is my first Java project, so I'm very appreciative of coaching and suggestions on coding style and things I'm doing wrong. Please let me know if you have problems, suggestions, or questions, and thanks in advance for the feedback!

Brad Tofel
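The "archival URL" replay mode mentioned above embeds a 14-digit capture timestamp into the request path (e.g. .../wayback/20060214122556/www.example.org/). A minimal sketch of composing and splitting such URLs (illustrative only; the prefix and helpers are hypothetical, not the Wayback source):

```python
# Illustrative sketch of the archival-URL scheme: a 14-digit capture
# timestamp embedded in the request path. The prefix and function names
# are hypothetical; this is not code from the Wayback project.

def make_archival_url(prefix, timestamp, original_url):
    return f"{prefix}/{timestamp}/{original_url}"

def parse_archival_url(prefix, archival_url):
    """Split an archival URL back into (timestamp, original URL)."""
    rest = archival_url[len(prefix) + 1:]
    timestamp, original = rest.split("/", 1)
    return timestamp, original

url = make_archival_url("http://archive.example/wayback",
                        "20060214122556", "www.example.org/cz/")
print(parse_archival_url("http://archive.example/wayback", url))
```

One design consequence of this scheme, and a reason the proxy mode exists, is that any absolute link the replayed page emits without the timestamp prefix escapes to the live web.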
|
From: alexis a. <alx...@ya...> - 2006-03-30 03:48:00
|
Hi,

I encountered a problem while trying to search WERA using Chinese characters as my criteria: the search result will always contain a WMV file. Initially I thought Nutch was indexing WMV files, which Stack clarified it does not. Stack suggested that I check anchors.jsp to see if there are anchors to that file; however, the anchors.jsp of the file does not have anything in it.

I consulted the crawl.log file and found these lines:

2006-03-18T04:10:33.604Z 200 1056898 http://info.channelnewsasia.com/rovingdv/boundaries/shthong.wmv LE http://www.channelnewsasia.com/boundaries/videos.htm text/plain #030 20060318041033072+335 AAYVISCYYWBMBZWKBSFSB7Z6XXAKWC2F 3t
2006-03-18T04:10:36.251Z 200 2700490 http://info.channelnewsasia.com/rovingdv/boundaries/asha.wmv LE http://www.channelnewsasia.com/boundaries/videos.htm text/plain #008 20060318041034946+792 CB44IW2LSWLMD6AWIYWOXZFK3HZ4AABU -
2006-03-18T04:10:38.954Z 200 1393066 http://info.channelnewsasia.com/rovingdv/boundaries/drtan.wmv LE http://www.channelnewsasia.com/boundaries/videos.htm text/plain #001 20060318041038254+436 INDK5OOERVNJXO464GLIIT2EPUDIOXZP -
2006-03-18T04:10:41.199Z 200 891220 http://info.channelnewsasia.com/rovingdv/boundaries/chiam.wmv LE http://www.channelnewsasia.com/boundaries/videos.htm text/plain #042 20060318041040700+334 5T3Q2FSH2BSOCI7WSUZKYBYZ6RXLF74H -

When I looked at the HTML page that contains the file, I found this markup for one of the files:

<object id="MediaPlayer" classid="CLSID:22d6f312-b0f6-11d0-94ab-0080c74c7e95" codebase="http://activex.microsoft.com/activex/controls/mplayer/en/nsmp2inf.cab#Version=5,1,52,701" standby="Loading Microsoft Windows Media Player components..." type="application/x-oleobject" width="200" height="190" border="0" vspace="0" hspace="0" align="top">
  <param name="FileName" value="http://info.channelnewsasia.com/rovingdv/boundaries/chiam.wmv">
  <param name="AnimationatStart" value="true">
  <param name="TransparentatStart" value="true">
  <param name="AutoStart" value="false">
  <param name="ShowControls" value="1">
  <embed type="application/x-mplayer2" pluginspage="http://www.microsoft.com/windows95/downloads/contents/wurecommended/s_wufeatured/mediaplayer/default.asp" showcontrols=1 width=200 height=190 src="http://info.channelnewsasia.com/rovingdv/boundaries/chiam.wmv" border="0" vspace="0" hspace="0" align="top">
  </embed>
</object>

It seems to me that Heritrix was not able to categorize the said file correctly, which is why it was indexed. However, Stack mentioned Nutch doing some more tests to check whether the file is indeed text/html. I hope you guys can double-check my problem.

Best Regards,
Alexis Artes
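The crawl.log excerpt above records each capture's reported MIME type (here text/plain) next to its URL, so a mismatch like a .wmv served as text/* can be spotted mechanically. A rough sketch (the field positions follow the excerpt above; this is not an official Heritrix log tool):

```python
# Rough sketch: flag crawl.log entries whose URL extension suggests a
# media file but whose reported MIME type is text/*, as in the excerpt
# above. Field positions follow that excerpt (timestamp, status, size,
# URL, discovery path, via, mimetype, ...); not an official Heritrix tool.

MEDIA_EXTENSIONS = (".wmv", ".mpeg", ".mpg", ".avi")

def suspicious_entries(log_lines):
    flagged = []
    for line in log_lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip malformed or truncated lines
        url, mimetype = fields[3], fields[6]
        if url.lower().endswith(MEDIA_EXTENSIONS) and mimetype.startswith("text/"):
            flagged.append(url)
    return flagged
```

Running this over the four lines quoted above would flag all four .wmv URLs, which matches the symptom Alexis describes: the files were treated as text and therefore indexed.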
|
From: Brad T. <br...@ar...> - 2006-03-20 21:20:20
|
The old Wayback Machine rewrites these link tags on the server side, before transmitting to the clients. I believe this is because JS modification of these during page load has no effect. Not positive, but it should be easy to test.

I just added a feature about two weeks ago to the new Java Wayback Machine to do this rewriting on the server side. (It also does FRAMEs, and a couple of other tag types, too.)

Sverre Bang wrote:
> On Thu, 2006-03-16 at 23:23 +0800, Boon Ling Aw wrote:
>
>> Hello,
>>
>> When using WERA to view a website archive, a JS script is inserted by
>> WERA to ensure that links point to WERA rather than out to the Internet.
>>
>> However, this redirection of links is not being applied to CSS style
>> sheets used in the web page:
>>
>> <link rel="stylesheet" href="http://... .../styles.css" type="text/css" />
>>
>> The CSS page does exist within the archive. However, it refers to the
>> Internet's copy when viewed using WERA.
>>
>> Any reason for this?
>
> No, must be a bug in the JS responsible for rewriting links. We'd
> appreciate any contributions in improving the JS rewriter from anyone
> having the necessary JS skills - i know i don't ;-).
>
> Sverre
>
>> Thanks in advance for any replies...
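The server-side rewriting Brad describes can be approximated with a substitution that prefixes absolute URLs in href/src attributes with an archival-URL prefix. A hypothetical sketch (the prefix is invented, and a real rewriter such as Wayback's handles many more cases: relative URLs, BASE tags, FRAMEs, scripts):

```python
# Hypothetical sketch of server-side rewriting of <link>/<frame>
# references, in the spirit of what is described above. The archive
# prefix is made up for illustration; a real replay tool covers far
# more cases (relative URLs, BASE tags, embedded scripts, etc.).
import re

def rewrite_links(html, prefix, timestamp):
    def repl(match):
        attr, quote, url = match.group(1), match.group(2), match.group(3)
        # Point the attribute back into the archive at this capture time.
        return f'{attr}={quote}{prefix}/{timestamp}/{url}{quote}'
    # Rewrite absolute http:// URLs appearing in href= and src= attributes.
    return re.sub(r'(href|src)=(["\'])(http://[^"\']+)\2', repl, html)

html = '<link rel="stylesheet" href="http://example.org/styles.css" type="text/css" />'
print(rewrite_links(html, "http://archive.example/wayback", "20060214122556"))
# The stylesheet href now points into the archive instead of the live site.
```

This is exactly the class of rewriting that the client-side JS approach misses for stylesheets, which is the bug reported in the quoted message.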
|
From: Michael S. <st...@du...> - 2006-03-20 18:51:07
|
NutchWAX-0.4.3 fixes the following bugs (hope my little ASCII table makes it across):

+---------+------+----------------------------------------------+------------------+----------+----------+
| ID      | Type | Summary                                      | Open Date        | By       | Filer    |
+---------+------+----------------------------------------------+------------------+----------+----------+
| 1454710 | Fix  | Index '.arc' (as well as '.arc.gz')          | 2006-03-20 08:54 | stack-sf | stack-sf |
| 1454714 | Fix  | Null mimetype stops indexing                 | 2006-03-20 09:00 | stack-sf | stack-sf |
| 1429788 | Fix  | xml output destroyed by html entity encoding | 2006-03-20 08:59 | stack-sf | stack-sf |
+---------+------+----------------------------------------------+------------------+----------+----------+

For sure this will be the last release before the 0.6.0 move up onto the Nutch MapReduce platform.

Yours,
St.Ack
|
From: Sverre B. <sve...@nb...> - 2006-03-20 15:21:57
|
This note is to announce a new release candidate, 0.4.2-RC1, of WERA (WEb aRchive Access), the web archive collection search and navigation tool. Release 0.4.2-RC1 includes URL canonicalization and proxy support. See the release notes and manual for details.

Release notes: http://sourceforge.net/project/shownotes.php?release_id=403083&group_id=118427
Online manual: http://nwa.nb.no/wera/articles/manual.html
Download: http://sourceforge.net/project/showfiles.php?group_id=118427&package_id=167210&release_id=403083
The WERA home page: http://archive-access.sourceforge.net/projects/wera/

A demo of WERA 0.4.2-RC1 is available at http://nwa.nb.no/wera/ (proxy setup is not supported by the web server at nwa.nb.no - I'll let you know when that is ready).

Yours,
Sverre Bang
|
From: Sverre B. <sve...@nb...> - 2006-03-17 11:43:28
|
On Thu, 2006-03-16 at 23:23 +0800, Boon Ling Aw wrote:
> Hello,
>
> When using WERA to view a website archive, a JS script is inserted by
> WERA to ensure that links point to WERA rather than out to the Internet.
>
> However, this redirection of links is not being applied to CSS style
> sheets used in the web page:
>
> <link rel="stylesheet" href="http://... .../styles.css" type="text/css" />
>
> The CSS page does exist within the archive. However, it refers to the
> Internet's copy when viewed using WERA.
>
> Any reason for this?

No, must be a bug in the JS responsible for rewriting links. We'd appreciate any contributions in improving the JS rewriter from anyone having the necessary JS skills - i know i don't ;-).

Sverre

> Thanks in advance for any replies...
|
From: Michael S. <st...@ar...> - 2006-03-16 16:46:59
|
(Forwarding to the list because likely of general interest)

Sverre Bang wrote:
> ...
>
> On Thu, 2006-03-16 at 05:39 +0000, Banu Gandhi wrote:
>
>> Hi Sverre,
>>
>> We are using NutchWAX as an indexing tool for our web archive (WERA
>> and HERITRIX).
>>
>> We wish to implement multiple indexing, as well as incremental
>> indexing; we keep our crawl files on a separate server.
>>
>> 1. I have some questions when I try to do incremental indexing.
>> I mount the ARC files from another server. I have created a queue.
>> When I segment it, it shows an error message that there is no such
>> file in the queue folder, even though the ARC files are linked
>> properly in the arcs folder.

Please paste in the error, Banu, and the commands you run (you're not using the indexarcs wrapper script?). That'll help with diagnosis.

>> When I try to implement the update statements from new segments, I
>> get the message "FS not specified default LOCAL". How can I specify
>> this as not local? The update message does show that the update
>> finished successfully.

Be careful here. You probably want LOCAL for your case, at least for the moment. The alternative is NDFS, the Nutch distributed file system that has since evolved in later versions of Nutch -- NutchWAX is based on Nutch 0.7 -- to become DFS, part of the new Hadoop Apache project. Its phrasing is ominous, as though you've left out some important specification, but it's just a message telling you which FS it's about to use.

>> The same message is shown when I update segments from the db. After
>> that, if I check the arcs folder of old segments, I can't see the new
>> ARC files.

The arcs folder or the queue folder?

>> Can you explain to me where I made the mistake?
>>
>> 2. Can we maintain multiple indexed folders, meaning multiple ARC
>> file folders on the same machine, each indexed under a different
>> folder? Can WERA access all the indexed folders for search results?

I think WERA passes the ARCRetriever a full path, so multiple folders should be possible (Sverre?). Do you have an idea of how many ARC files you'll be dealing with? There'll be upper limits to how many ARCs you can keep on a single machine... so a means of keeping them distributed over multiple machines is needed. The open source Wayback will have such a facility, and we'll slot it into place when ready, in place of ARCRetriever.

>> 3. Regarding the scalability of NutchWAX: if I don't want to index
>> the image files for full-text searching, and wish to have just the
>> URL link to the images, how can we do that?

That's what currently happens. image/* and their like are passed to the default parser. All it does is add meta info such as URL, type, etc. to the index. These resource types are not 'indexed' in the way text/* are.

>> Also, please let me know where I can find the functionality part of
>> all the folders, as well as the scripts of NutchWAX, other than the
>> FAQ.

I'm not clear what you're asking above. Please retry.

Thanks Banu,
St.Ack
|
From: Boon L. A. <aw...@ho...> - 2006-03-16 15:23:19
|
Hello,

When using WERA to view a website archive, a JS script is inserted by WERA to ensure that links point to WERA rather than out to the Internet.

However, this redirection of links is not being applied to CSS style sheets used in the web page:

<link rel="stylesheet" href="http://... .../styles.css" type="text/css" />

The CSS page does exist within the archive. However, it refers to the Internet's copy when viewed using WERA.

Any reason for this?

Thanks in advance for any replies...
|
From: Sverre B. <sve...@nb...> - 2006-03-14 15:08:30
|
Hi there. Sorry for not responding earlier to all the issues discussed in recent weeks. I'm working on adding URL canonicalization and proxy mode support in WERA, and the results so far are promising. Some comments below. I'll prepare a new release this week. I'll even try to convince the people maintaining our web servers to add the proxy setup on the WERA demo site. Please ask more questions; this time I promise to get back to you a bit sooner.

Regards,
Sverre

On Wed, 2006-02-22 at 22:36 -0800, stack wrote:
> stack wrote:
>> (Forwarded discussion from the Heritrix list)
>> ...
>> Generally, the pages are shown fine, with the exception of javascripts
>> that are retrieved from the live site instead of our ARC files. Also,
>> WERA is unable to dynamically replace the links inside the javascripts.
>
> This leaking to the live web is a difficult problem. Perhaps this
> particular JS can be fixed in WERA, but given the variety of ways in
> which JS can be conjured, it's unlikely all permutations will be
> guarded against.

No way we're gonna catch all JS-generated URLs by using JavaScript or server-side parsing. At least I'm not going to invest a lot of time in bullet-proofing the JS; others are welcome though ;-) I'd prefer a combination of JS and/or server-side parsing and a proxy solution that catches "the rest", i.e. the leakages out to the Internet.

>> Sample case:
>> http://nwa.nb.no/wera/result.php?time=&url=http%3A%2F%2Fwww.nla.gov.au%2F&query=nla
>>
>> Check the properties of the web page to verify that you are still
>> within "http://nwa.nb.no". Click on the "Exquisite Watercolors" link
>> and verify that we are still viewing the ARC files. Go back a page and
>> try any menu links. When you view the properties, it will show that
>> you are indeed browsing the live site instead of the ARC files.

The proxy mode I'm working on does handle the above case.

> Have you considered setting your browser to go to your collection via a
> proxy? (I don't think this mode is supported yet in WERA. I think it's
> possible to set the Wayback into a proxy mode.) The proxy could ensure
> you never strayed off your ARC collection, returning errors if a
> resource is not found.
>
>> What I wanted to accomplish are the following:
>>
>> 1) Help WERA load the javascripts from our ARC files instead of the
>> live sites by modifying the loading of the scripts from the HTML.
>> Instead of the relative /js/xxxx.js, we will change it to
>> http://localhost/wera/......../js/xxxx.js.
>
> (Sverre or Brad: Does the JS inserted at the end of the page by WERA,
> adding a base to the page, not affect such JS URLs?)

The JS injected by WERA should take care of this (eh, I'm not an expert on JavaScript - I did modify IA's original Wayback JS to fit WERA, but I do not have a thorough understanding of it). A big problem with WERA as it is now is that you have no easy way of telling what is fetched from the Internet and what is fetched through WERA. Cutting the browser off from the Internet by using a proxy that redirects the leaking links back to WERA makes it a lot easier to debug and improve the JS rewriting.

>> 2) Modify the relative links inside javascript files if WERA is not
>> capable of dynamically modifying them also.

If you really need to change the javascript files before feeding them to the client, I would recommend implementing this in WERA rather than start messing with the ARC files. If you look in the WERA config file you'll see that there are different handlers for different mime types ($conf_document_handler). The text/html handler injects the JS for rewriting links. Any other mime type is handled by a passthrough handler. If the javascripts are stored in the archive with one (or more) distinguished mime types, you could write a handler especially for this/these.

> Would be sweet if any modifications you'd do in the rewriting of the
> ARC files were instead done for you by WERA (or Wayback).
>
> St.Ack
>
>> I am planning to use dk.netarkivet.ArcUtils for this task.
>>
>> I know that my problem is a little bit off topic, but I hope you could
>> give additional tips.
>>
>> Thanks again in advance.
>>
>> --- In arc...@ya..., stack <stack@...> wrote:
>>>
>>> alxartes wrote:
>>>> St.Ack, thanks again for the reply.
>>>>
>>>> Most of the pages are not displayed the way they should be when
>>>> viewed from the source.

When you view the source of the web page displayed in the lower frame of the WERA timeline view, you will not see the links rewritten. The source you see is the source before the JS "kicks in". This can be a bit annoying when it comes to debugging the JS ;-)

>>> At this time, are you viewing the pages with WERA? Or how are they
>>> being viewed?
>>>
>>>> I guess it is because the CSS and javascript files are not being
>>>> fetched properly at the loading of the HTML from the ARC file. We
>>>> arrived at this conclusion since we can directly retrieve the CSS
>>>> and JS through WERA.
>>>
>>> So pages are showing fine when viewed with WERA (generally)?
>>>
>>>> I am planning on modifying the HTMLs inside the ARC files to correct
>>>> this problem. What tool can I use to expand the ARC files so that I
>>>> can modify the files inside, and a tool that will bring the ARC file
>>>> together once again? I think this is somewhat off topic, but I am a
>>>> little bit out of time and would greatly appreciate any input.
>>>
>>> I'm trying to understand. You want to rewrite ARC files, changing all
>>> links so they point back into ARCs (or back to a disk populated with
>>> the documents from a set of ARCs)? You do not want to use WERA for
>>> viewing pages?
>>>
>>> This section from the developer manual might be of use:
>>> http://crawler.archive.org/articles/developer_manual.html#arcs
>>> It talks about tools for reading ARCs.
>>>
>>> One approach would subclass ARCReader. This will get you a stream
>>> onto ARCs
>>> (http://crawler.archive.org/apidocs/org/archive/io/arc/ARCReader.html).
>>> Use the adjacent ARCWriter to write new ARCs. To modify the links in
>>> pages, you'll first have to find them. You could start with the
>>> Extractors that are in Heritrix, subclassing them to add link-rewrite
>>> functionality. Such a tool has been asked for on this list in the
>>> past, but it's a bit of a job, and in the end you'll never
>>> successfully be able to rewrite all links (think URLs produced by JS
>>> in the page).
>>>
>>> Will WERA (or the coming Wayback,
>>> http://archive-access.sourceforge.net/projects/wayback/) not suffice?
>>>
>>> Yours,
>>> St.Ack
>>>
>>>> Thanks again.
>>>>
>>>> --- In arc...@ya..., stack <stack@> wrote:
>>>>>
>>>>> alxartes wrote:
>>>>>> Thanks, St.Ack.
>>>>>>
>>>>>> It is really worrisome to see those errors, especially when we are
>>>>>> not viewing the ARC files properly in WERA.
>>>>>
>>>>> Can you say more about what 'not viewing the arcfiles properly in
>>>>> Wera' means? Are pages not being found, or are they missing
>>>>> images/stylesheets?
>>>>>
>>>>> Regarding the local-errors.log, I've upped priority on an RFE that
>>>>> proposes cleaning this log (and added your experience to the issue):
>>>>> http://sourceforge.net/tracker/index.php?func=detail&aid=1091580&group_id=73833&atid=539099
>>>>>
>>>>>> Here is an excerpt from the crawl.log:
>>>>>>
>>>>>> 84046144 http://www.hdb.gov.sg/hdbwww/ownkvb.mpeg
>>>>>> 84046144 http://www5.hdb.gov.sg/hdbwww/ownkvb.mpeg
>>>>>> 84046144 http://www7.hdb.gov.sg/hdbwww/ownkvb.mpeg
>>>>>> 84046144 https://www5.hdb.gov.sg/hdbwww/ownkvb.mpeg
>>>>>> 47097784 http://www.hdb.gov.sg/hdbwww/ownkvs.mpeg
>>>>>> 47097784 http://www5.hdb.gov.sg/hdbwww/ownkvs.mpeg
>>>>>> 47097784 http://www7.hdb.gov.sg/hdbwww/ownkvs.mpeg
>>>>>> 47097784 https://www5.hdb.gov.sg/hdbwww/ownkvs.mpeg
>>>>>> 22292823 http://www.hdb.gov.sg/hdbwww/fallingwindow.wmv
>>>>>> 22292823 http://www5.hdb.gov.sg/hdbwww/fallingwindow.wmv
>>>>>> 22292823 http://www7.hdb.gov.sg/hdbwww/fallingwindow.wmv
>>>>>> 22292823 https://www5.hdb.gov.sg/hdbwww/fallingwindow.wmv
>>>>>>
>>>>>> As you can see, a certain file is crawled 4 times. I have done this
>>>>>> crawl using domain scope. Would path scope with a seed of
>>>>>> http://www.hdb.gov.sg prevent the other sites from being crawled?
>>>>>> If not, are there other ways to prevent it from happening?
>>>>>
>>>>> Yeah, the domain scope warns: "It will however reach subdomains of
>>>>> the seeds' original domains. www[#].host is considered to be the
>>>>> same as host." Explicitly stating 'www.hdb.gov.sg' doesn't look like
>>>>> it will avoid the problem either, reading the code.
>>>>>
>>>>> FYI, we're moving away from *scope scopes -- i.e. domainscope,
>>>>> pathscope, etc. -- toward decidingscope. The latter gives you "more
>>>>> rope" in designing scopes.
>>>>>
>>>>> It looks like the On*DecideRule, though, has the same issue with
>>>>> 'www'. It looks like you can write a SURT form, something like
>>>>> '(sg,gov,hdb,www)', that will only include URIs with a host of
>>>>> 'www.hdb.gov.sg' (though it looks like http and https are flattened
>>>>> to be the same scheme).
>>>>>
>>>>> I'll let others -- Igor or Gordon? -- respond. They can give a
>>>>> better quality answer than I.
>>>>>
>>>>> Good stuff,
>>>>> St.Ack
>>>>>
>>>>>> Thank you so much for your time.
>>>>>>
>>>>>> --- In arc...@ya..., stack <stack@> wrote:
>>>>>>>
>>>>>>> alxartes wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am investigating the log files of my crawls and found the error
>>>>>>>> below. I hope someone could explain what this means, because the
>>>>>>>> other javascripts are crawled fine.
>>>>>>>>
>>>>>>>> 2006-02-15T03:35:21.747Z
>>>>>>>> http://www.macromedia.com/uber/js/omniture_s_code.js "Unsupported
>>>>>>>> scheme: javascript"
>>>>>>>> javascript:,macromedia,dreamweaver,flash,shockwave,sdc,markme,sdc.shockwave,infopoll,developerlocator.macromedia
>>>>>>>
>>>>>>> In short, the above is just stating that Heritrix does not support
>>>>>>> fetching the 'URI' "javascript:,macromedia,dreamweaver...". It's
>>>>>>> not an 'error'.
>>>>>>>
>>>>>>> Heritrix is regexing over the content of
>>>>>>> 'http://www.macromedia.com/uber/js/omniture_s_code.js' looking for
>>>>>>> URIs. It found the string
>>>>>>> "javascript:,macromedia,dreamweaver,flash,shockwave,sdc,markme,sdc.shockwave,infopoll,developerlocator.macromedia".
>>>>>>> To the Heritrix regex, that string looks like a likely URI: it's
>>>>>>> inside quotes and starts with what could be a URI scheme (i.e.
>>>>>>> 'javascript:').
>>>>>>>
>>>>>>> So, the candidate URI is passed to our URI parser class,
>>>>>>> org.archive.net.UURIFactory. This class takes configuration in
>>>>>>> heritrix.properties about which URI schemes Heritrix will accept.
>>>>>>> Here's the relevant extract:
>>>>>>>
>>>>>>> ##############################################################
>>>>>>> # U U R I                                                    #
>>>>>>> ##############################################################
>>>>>>> # Any scheme not listed in the below will generate an
>>>>>>> # UnsupportedUriScheme exception. Make the list empty to
>>>>>>> # support all schemes.
>>>>>>> org.archive.net.UURIFactory.schemes = http, https, dns, invalid
>>>>>>>
>>>>>>> (We don't currently have an 'UnsupportedUriScheme' exception. We
>>>>>>> should add one.)
>>>>>>>
>>>>>>> Here is where the test is done:
>>>>>>> http://crawler.archive.org/xref/org/archive/net/UURIFactory.html#443
>>>>>>>
>>>>>>> Because the 'javascript' scheme is not in the above supported
>>>>>>> schemes list (nor in the list of schemes to ignore, which appears
>>>>>>> later in heritrix.properties), it generates a URIException with an
>>>>>>> 'unsupported scheme' message.
>>>>>>>
>>>>>>> We could do with some clean up in here.
Currently all URI > > > > > > exceptions > > > > > > > are lumped into URIException. We could add subclasses of > > URIE > > > > so > > > > > > the > > > > > > > non-errors get logged at a different level: e.g. FINE for > > > > > > unsupported > > > > > > > scheme exceptions. > > > > > > > > > > > > > > St.Ack > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > SPONSORED LINKS > > > > > > Computer security > > > > > > <http://groups.yahoo.com/gads? > > > > > > t=ms&k=Computer+security&w1=Computer+security&w2=Computer+training&c=2 > > > > &s=46&.sig=BHmcxBg5sKfN9-gcWnJWDg> > > > > > > Computer training > > > > > > <http://groups.yahoo.com/gads? > > > > > > t=ms&k=Computer+training&w1=Computer+security&w2=Computer+training&c=2 > > > > &s=46&.sig=v0JjJWA4s7mLnWQWdFxuTQ> > > > > > > > > > > > > > > > > > > > > > > > > -------------------------------------------------------------- > > ---- > > > > ------ > > > > > > YAHOO! GROUPS LINKS > > > > > > > > > > > > * Visit your group "archive-crawler > > > > > > <http://groups.yahoo.com/group/archive-crawler>" on the > > web. > > > > > > > > > > > > * To unsubscribe from this group, send an email to: > > > > > > arc...@ya... > > > > > > <mailto:arc...@ya...? > > > > subject=Unsubscribe> > > > > > > > > > > > > * Your use of Yahoo! Groups is subject to the Yahoo! > > Terms of > > > > > > Service <http://docs.yahoo.com/info/terms/>. > > > > > > > > > > > > > > > > > > -------------------------------------------------------------- > > ---- > > > > ------ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > SPONSORED LINKS > > > > Computer security > > > > <http://groups.yahoo.com/gads? > > t=ms&k=Computer+security&w1=Computer+security&w2=Computer+training&c=2 > > &s=46&.sig=BHmcxBg5sKfN9-gcWnJWDg> > > > > Computer training > > > > <http://groups.yahoo.com/gads? 
> > t=ms&k=Computer+training&w1=Computer+security&w2=Computer+training&c=2 > > &s=46&.sig=v0JjJWA4s7mLnWQWdFxuTQ> > > > > > > > > > > > > > > > > ------------------------------------------------------------------ > > ------ > > > > YAHOO! GROUPS LINKS > > > > > > > > * Visit your group "archive-crawler > > > > <http://groups.yahoo.com/group/archive-crawler>" on the web. > > > > > > > > * To unsubscribe from this group, send an email to: > > > > arc...@ya... > > > > <mailto:arc...@ya...? > > subject=Unsubscribe> > > > > > > > > * Your use of Yahoo! Groups is subject to the Yahoo! Terms of > > > > Service <http://docs.yahoo.com/info/terms/>. > > > > > > > > > > > > ------------------------------------------------------------------ > > ------ > > > > > > > > > > > > > > > > > > > > > SPONSORED LINKS > > Computer security > > <http://groups.yahoo.com/gads?t=ms&k=Computer+security&w1=Computer+security&w2=Computer+training&c=2&s=46&.sig=BHmcxBg5sKfN9-gcWnJWDg> > > Computer training > > <http://groups.yahoo.com/gads?t=ms&k=Computer+training&w1=Computer+security&w2=Computer+training&c=2&s=46&.sig=v0JjJWA4s7mLnWQWdFxuTQ> > > > > > > > > ------------------------------------------------------------------------ > > YAHOO! GROUPS LINKS > > > > * Visit your group "archive-crawler > > <http://groups.yahoo.com/group/archive-crawler>" on the web. > > > > * To unsubscribe from this group, send an email to: > > arc...@ya... > > <mailto:arc...@ya...?subject=Unsubscribe> > > > > * Your use of Yahoo! Groups is subject to the Yahoo! Terms of > > Service <http://docs.yahoo.com/info/terms/>. > > > > > > ------------------------------------------------------------------------ > > > > > > ------------------------------------------------------- > This SF.Net email is sponsored by xPML, a groundbreaking scripting language > that extends applications into web and mobile media. Attend the live webcast > and join the prime developer group breaking into this new coding territory! 
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642 > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
|
From: Kaisa K. <kau...@cs...> - 2006-02-28 12:28:29
|
Hello, can you listen to music files using Wera if you have various music formats in your archive? kaisa |
|
From: stack <st...@ar...> - 2006-02-23 18:31:23
|
stack wrote:
> stack wrote:
>> (Forwarded discussion from the Heritrix list)
>>
>> ------------------------------------------------------------------------
>> ...
>> Generally, the pages are shown fine with the exception of
>> javascripts that are retrieved from the live site instead of our arc
>> files. Also, WERA is unable to dynamically replace the links inside
>> the javascripts.
>>
Related, here are all current issues regarding JS (the first originally
reported by Charles of LU):

"[ 1312214 ] [wera/wayback js] More redirects to llive web (look at it)."
https://sourceforge.net/tracker/index.php?func=detail&aid=1312214&group_id=118427&atid=681137

"[ 1280447 ] [wera/wayback js] Link rewritng not working well for frames"
https://sourceforge.net/tracker/index.php?func=detail&aid=1280447&group_id=118427&atid=681137

"[ 1421112 ] WERA web page display Menus in JS"
https://sourceforge.net/tracker/index.php?func=detail&aid=1421112&group_id=118427&atid=681137

St.Ack |
|
From: stack <st...@ar...> - 2006-02-23 17:40:55
|
Charles Foetz wrote:
> Hello St.Ack, Lukas and everyone else,

Good to hear from you again Charles. We still owe you a response to the
long list of issues you found in the WERA+NutchWAX combo. A good few
have been addressed but others still remain.

> Long time since I posted any news concerning Luxembourg's web
> archiving efforts - as you know, we are very limited in human
> resources (we only have 2 IT people at the national library) and
> therefore need to find a balance between many different projects.
>
> Last time we were forced to put our web archiving project on hold due
> to the known limitations of the WERA access tool (no canonicalization
> of URLs, no handling of redirects, encoding issues)... As a prototype
> project we had archived, at several dates, the sites of 7 political
> parties during local elections. The two limitations above made it
> impossible for WERA to access most parts of 4 out of these 7 archived
> sites (links to "http://site.com" instead of "http://www.site.com"
> were quite common, for instance), so we had pretty much nothing
> to "show" and didn't go further than the prototype.

Your report on canonicalization failures was captured as this issue:
https://sourceforge.net/tracker/index.php?func=detail&aid=1312202&group_id=118427&atid=681137.
We should make WERA requery with the 'www' stripped (or prepended) if it
gets a 404 out of the index.

> I am now wondering what the plans are for WERA... are the issues above
> likely to be fixed any time soon or are they considered low priority?
> Is a new release planned or is the focus on other tools at the moment
> (I realise you guys also struggle on many fronts at the same time)?
> If you think we could help on the development of WERA ourselves and
> maybe should have a go at trying to fix the issues above, let me
> know.
> Another question: via the archive-access-cvs list, I noticed a lot of
> updates on the wayback project. What is this project? An open-source
> implementation of the Wayback Machine (I've heard this mentioned
> before)? Has there been a release and at which stage is it? alpha?
> beta? working version? Where does this project fit in? Should it be
> seen as an alternative to WERA?
>
Sverre knows the WERA story best. I'll let him speak to the above.
Would be sweet if we could fix enough for you to launch at least a
prototype.

The long term plan is to transition from WERA on to the new wayback.
Sverre points this out in the section on the future of WERA at the end
of the 'What is WERA?' document:
http://archive-access.sourceforge.net/projects/wera/articles/what-is-wera.html#N100AE.

For a description of the new wayback, see
http://archive-access.sourceforge.net/projects/wayback/. The front page
does a good job situating the project. It's alpha software currently,
though a pending release will move it past this designation (let me kick
our Brad and get him to introduce the wayback on this list). Wayback is
currently focusing on scaling and on being able to act as a replacement
for the http://web.archive.org wayback for small collections.

IMO, we're a ways yet from the wayback replacing WERA. While it already
has capabilities in excess of the WERA+ARCRetriever combination in
certain regards, its focus is elsewhere -- at least for now -- and it
lacks core WERA UI functionality, the quality documentation, and the
sweet installer.

St.Ack

> Best regards,
>
> Charlie Foetz
> Bibliothèque nationale Luxembourg
> Specialist in electronic information management
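The 'www' fallback stack suggests above could look roughly like the
following sketch. This is hypothetical illustration only, not WERA code:
the index lookup itself is stubbed out, and only the host-variant
generation is shown.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the suggested fallback: if a lookup for a URL 404s in the
// index, retry with 'www.' stripped (or prepended) before giving up.
public class HostFallback {

    // Given "http://site.com/path", return the variants to try in order:
    // the original URL first, then the same URL with 'www.' prepended
    // (or stripped, if it was already there). Assumes a scheme://host form.
    static List<String> candidates(String url) {
        List<String> tries = new ArrayList<>();
        tries.add(url);
        int start = url.indexOf("://") + 3;
        int end = url.indexOf('/', start);
        if (end < 0) end = url.length();
        String host = url.substring(start, end);
        String altHost = host.startsWith("www.")
            ? host.substring(4)          // strip leading 'www.'
            : "www." + host;             // or prepend it
        tries.add(url.substring(0, start) + altHost + url.substring(end));
        return tries;
    }

    public static void main(String[] args) {
        // prints [http://site.com/index.html, http://www.site.com/index.html]
        System.out.println(candidates("http://site.com/index.html"));
    }
}
```

A real implementation would walk this list, returning the first variant
that hits in the index and a 404 only if all miss.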
|
From: Charles F. <Cha...@bn...> - 2006-02-23 09:43:36
|
Hello St.Ack, Lukas and everyone else,

Long time since I posted any news concerning Luxembourg's web archiving
efforts - as you know, we are very limited in human resources (we only
have 2 IT people at the national library) and therefore need to find a
balance between many different projects.

Last time we were forced to put our web archiving project on hold due to
the known limitations of the WERA access tool (no canonicalization of
URLs, no handling of redirects, encoding issues)... As a prototype
project we had archived, at several dates, the sites of 7 political
parties during local elections. The two limitations above made it
impossible for WERA to access most parts of 4 out of these 7 archived
sites (links to "http://site.com" instead of "http://www.site.com" were
quite common, for instance), so we had pretty much nothing to "show" and
didn't go further than the prototype.

I am now wondering what the plans are for WERA... are the issues above
likely to be fixed any time soon or are they considered low priority? Is
a new release planned or is the focus on other tools at the moment (I
realise you guys also struggle on many fronts at the same time)?

If you think we could help on the development of WERA ourselves and
maybe should have a go at trying to fix the issues above, let me know.

Another question: via the archive-access-cvs list, I noticed a lot of
updates on the wayback project. What is this project? An open-source
implementation of the Wayback Machine (I've heard this mentioned
before)? Has there been a release and at which stage is it? alpha? beta?
working version? Where does this project fit in? Should it be seen as an
alternative to WERA?

Best regards,

Charlie Foetz
Bibliothèque nationale Luxembourg
Specialist in electronic information management |
|
From: stack <st...@ar...> - 2006-02-23 06:37:27
|
stack wrote:
> (Forwarded discussion from the Heritrix list)
>
> ------------------------------------------------------------------------
> ...
> Generally, the pages are shown fine with the exception of
> javascripts that are retrieved from the live site instead of our arc
> files. Also, WERA is unable to dynamically replace the links inside
> the javascripts.
>
This leaking to the live web is a difficult problem. Perhaps this
particular JS can be fixed in WERA but, given the variety of ways in
which JS can be conjured, it's unlikely all permutations will be guarded
against.

> Sample case: http://nwa.nb.no/wera/result.php?time=&url=http%3A%2F%
> 2Fwww.nla.gov.au%2F&query=nla.
>
> Check the properties of the webpage to verify that you are still
> within "http://nwa.nb.no". Click on the "Exquisite Watercolors" link
> and verify that we are still viewing the arcfiles. Go back a page and
> try any menu links. When you view the properties, it will show that
> you are indeed browsing the live site instead of the arcfiles.
>
Have you considered setting your browser to go to your collection via a
proxy? (I don't think this mode is supported yet in WERA. I think it's
possible to set the wayback into a proxy mode.) The proxy could ensure
you never strayed off your ARC collection, returning errors if a
resource is not found.

> What I wanted to accomplish are the following:
> 1) Help WERA load the javascripts from our arcfiles instead of the
> live sites by modifying the loading of the scripts from the html.
> Instead of the relative /js/xxxx.js, we will change it to
> http://localhost/wera/......../js/xxxx.js.
(Sverre or Brad: Does the JS inserted at the end of the page by WERA,
adding a base to the page, not affect such JS URLs?)
> 2) Modify the relative links inside javascript files if WERA is not
> capable of dynamically modifying them also.
Would be sweet if any modifications you'd do in the rewriting of the ARC files was instead done for you by WERA (or wayback). St.Ack > > I am planning to use the dk.netarkivet.ArcUtils for this task. > > I know that my problem is a little bit off topic but I hope you could > give additional tips. > > Thanks again in advance. > > --- In arc...@ya..., stack <stack@...> wrote: > > > > alxartes wrote: > > > St. Ack thanks again for the reply. > > > > > > Most of the pages are not displayed the way it should be when > viewed > > > from the source. > > > > At this time, are you viewing the pages with WERA? Or how are they > > being viewed? > > > I guess it is because the css and javascripts file > > > are not being fetched properly at the loading of the html from the > > > arcfile. We arrived at this conclusion since we can directly > > > retrieved the css and js through WERA. > > So pages are showing fine when viewed with WERA (generally)? > > > > > > I am planning on modifying the htmls inside the arcfiles to > correct > > > this problem. > > I'm trying to understand. You want to rewrite ARC files changing > all > > links so they point back into ARCs (or back to a disk populated > with the > > documents from a set of ARCs)? You do not want to use WERA viewing > pages? > > > > > What tool can I use to expand the arcfiles so that I > > > can modify the files inside? and a tool that will bring the > arcfile > > > together once again? I think this is somewhat out of topic but I > am a > > > little bit out of time and would greatly appreciate any inputs. > > This section from dev. manual might be of use: > > http://crawler.archive.org/articles/developer_manual.html#arcs. > Talks > > about tools for reading ARCs. > > > > One approach would subclass ARCReader. This will get you a stream > onto > > ARCs > > > (http://crawler.archive.org/apidocs/org/archive/io/arc/ARCReader.html) > <http://crawler.archive.org/apidocs/org/archive/io/arc/ARCReader.html%29> > .. 
> > Use the adjacent ARCWriter to write new ARCs. Modifying the links > in > > pages, you'll first have to find them. You could start with the > > Extractors that are in Heritrix subclassing them to add a link > rewrite > > functionality. Such a tool has been asked for on this list in the > past > > but its a bit of job and in the end, you'll never successfully be > able > > to rewrite all links (Think URLs produced by JS in the page). > > > > Will a WERA (or the coming wayback, > > http://archive-access.sourceforge.net/projects/wayback/) > <http://archive-access.sourceforge.net/projects/wayback/%29> not > suffice? > > > > Yours, > > St.Ack > > > > > > > > > > > > Thanks again. > > > > > > > > > > > > --- In arc...@ya..., stack <stack@> wrote: > > > > > > > > alxartes wrote: > > > > > Thanks St. Ack. > > > > > > > > > > It is really worisome to see those errors especially when we > are > > > not > > > > > viewing the arcfiles properly in Wera. > > > > > > > > Can you say more about what 'not viewing the arcfiles properly > in > > > > Wera'? Are pages not being found or are missing > images/stylesheets? > > > > > > > > Regards the local-errors.log, I've upped priority on an RFE that > > > > proposes cleaning this log (and added your experience to the > > > issue): > > > > http://sourceforge.net/tracker/index.php? > > > func=detail&aid=1091580&group_id=73833&atid=539099. 
> > > > > > > > > > Here is an excerpt from the crawl.log: > > > > > > > > > > 84046144 http://www.hdb.gov.sg/hdbwww/ownkvb.mpeg > > > > > 84046144 http://www5.hdb.gov.sg/hdbwww/ownkvb.mpeg > > > > > 84046144 http://www7.hdb.gov.sg/hdbwww/ownkvb.mpeg > > > > > 84046144 https://www5.hdb.gov.sg/hdbwww/ownkvb.mpeg > > > > > 47097784 http://www.hdb.gov.sg/hdbwww/ownkvs.mpeg > > > > > 47097784 http://www5.hdb.gov.sg/hdbwww/ownkvs.mpeg > > > > > 47097784 http://www7.hdb.gov.sg/hdbwww/ownkvs.mpeg > > > > > 47097784 https://www5.hdb.gov.sg/hdbwww/ownkvs.mpeg > > > > > 22292823 http://www.hdb.gov.sg/hdbwww/fallingwindow.wmv > > > > > 22292823 http://www5.hdb.gov.sg/hdbwww/fallingwindow.wmv > > > > > 22292823 http://www7.hdb.gov.sg/hdbwww/fallingwindow.wmv > > > > > 22292823 https://www5.hdb.gov.sg/hdbwww/fallingwindow.wmv > > > > > > > > > > As you can see, a certain file is crawled 4 times. I have done > > > this > > > > > crawl using domain scope. Would pathscope with a seed of > > > > > http://www.hdb.gov.sg prevent the other sites to being > crawled? If > > > > > not, are there other ways to prevent it from happening? > > > > > > > > Yeah, the domain scope warns: "It will however reach subdomains > of > > > the > > > > seeds' original domains. www[#].host is considered to be the > same > > > as > > > > host." Explicitly stating 'www.hdb.gov.sg' doesn't look like it > > > will > > > > avoid the problem either reading the code. > > > > > > > > FYI, we're moving away from *scope scopes -- i.e. domainscope, > > > > pathscope, etc. -- toward decidingscope. The latter gives > > > you "more > > > > rope" designing scopes. > > > > > > > > It looks like the On*DecideRule though has same issue > with 'www'. > > > Looks > > > > like you can write a SURT form, something > like '(sg,gov,hdb,www)', > > > that > > > > will only include URIs with a host of 'www.hdb.gov.sg' (though > it > > > looks > > > > like http and https are flattened to be same scheme). 
> > > > > > > > I'll let others -- Igor or Gordon? -- respond. They can give a > > > better > > > > quality answer than I. > > > > > > > > Good stuff, > > > > St.Ack > > > > > > > > > > > > > > Thank you so much for your time. > > > > > > > > > > > > > > > > > > > > --- In arc...@ya..., stack <stack@> wrote: > > > > > > > > > > > > alxartes wrote: > > > > > > > Hi, > > > > > > > > > > > > > > I am investigating the log files of my crawls and found > the > > > error > > > > > > > below. I hope someone could explain what this means > because > > > the > > > > > other > > > > > > > javascripts are crawled fine. > > > > > > > > > > > > > > 2006-02-15T03:35:21.747Z > > > > > > > > > > http://www.macromedia.com/uber/js/omniture_s_code.js "Unsupported > > > > > > > scheme: javascript" > > > > > > > > > > > > > > > > javascript:,macromedia,dreamweaver,flash,shockwave,sdc,markme,sdc.shoc > > > > > kw > > > > > > > ave,infopoll,developerlocator.macromedia > > > > > > > > > > > > In short, the above is just stating that Heritrix does not > > > support > > > > > > fetching > the 'URI' "javascript:,macromedia,dreamweaver...". Its > > > > > not an > > > > > > 'error'. > > > > > > > > > > > > Heritrix is regexing over the content of > > > > > > 'http://www.macromedia.com/uber/js/omniture_s_code.js' > <http://www.macromedia.com/uber/js/omniture_s_code.js%27> > > > <http://www.macromedia.com/uber/js/omniture_s_code.js%27> > > > > > <http://www.macromedia.com/uber/js/omniture_s_code.js%27> > looking > > > for > > > > > > URIs. It found the string > > > > > > > > > > > > > > > > "javascript:,macromedia,dreamweaver,flash,shockwave,sdc,markme,sdc.s > > > > > hockwave,infopoll,developerlocator.macromedia" > > > > > > > > > > > > > > > > > > To the Heritrix regex, the above string looks like a likely > URI. > > > > > Its > > > > > > inside quotes > > > > > > and starts with what could be an URI scheme > > > (i.e. 'javascript:'). 
> > > > > > So, the candidate URI is passed to our URI parser class,
> > > > > > org.archive.net.UURIFactory. This class takes configuration in
> > > > > > heritrix.properties about which URI schemes Heritrix will
> > > > > > accept. Here's the relevant extract:
> > > > > >
> > > > > > ##############################################################
> > > > > > #                        U U R I                             #
> > > > > > ##############################################################
> > > > > > # Any scheme not listed in the below will generate an
> > > > > > # UnsupportedUriScheme exception. Make the list empty to
> > > > > > # support all schemes.
> > > > > > org.archive.net.UURIFactory.schemes = http, https, dns, invalid
> > > > > >
> > > > > > (We don't currently have an 'UnsupportedUriScheme' exception.
> > > > > > We should add one).
> > > > > >
> > > > > > Here is where the test is done:
> > > > > > http://crawler.archive.org/xref/org/archive/net/UURIFactory.html#443
> > > > > >
> > > > > > Because the 'javascript' scheme is not in the above supported
> > > > > > schemes list (nor in the list of schemes to ignore which
> > > > > > appears later in heritrix.properties), it generates a
> > > > > > URIException with an 'unsupported scheme' message.
> > > > > >
> > > > > > We could do with some cleanup in here. Currently all URI
> > > > > > exceptions are lumped into URIException. We could add
> > > > > > subclasses of URIException so the non-errors get logged at a
> > > > > > different level: e.g. FINE for unsupported scheme exceptions.
> > > > > >
> > > > > > St.Ack |
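The scheme check described in that thread can be pictured with a
stand-alone sketch. This is a simplified illustration, not the real
UURIFactory code: a candidate 'URI' string pulled out of JavaScript is
rejected when its scheme is not on the configured allowlist.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the UURIFactory behavior discussed above:
// candidate URIs whose scheme is not on the allowlist are reported as
// unsupported (which is what shows up in crawl.log).
public class SchemeCheck {

    // Mirrors the default from the heritrix.properties extract above.
    static final List<String> SCHEMES =
        Arrays.asList("http", "https", "dns", "invalid");

    static String check(String candidate) {
        int colon = candidate.indexOf(':');
        if (colon <= 0) {
            return "no scheme";
        }
        String scheme = candidate.substring(0, colon).toLowerCase();
        return SCHEMES.contains(scheme)
            ? "supported scheme: " + scheme
            : "Unsupported scheme: " + scheme;
    }

    public static void main(String[] args) {
        // prints "supported scheme: http"
        System.out.println(check("http://www.macromedia.com/uber/js/omniture_s_code.js"));
        // prints "Unsupported scheme: javascript"
        System.out.println(check("javascript:,macromedia,dreamweaver,flash"));
    }
}
```

This is why the "javascript:,macromedia,..." string found by the link
extractor's regex is reported, harmlessly, as an unsupported scheme.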
|
From: stack <st...@ar...> - 2006-02-23 05:27:50
|
(Forwarded discussion from the Heritrix list) |
|
From: stack <st...@ar...> - 2006-02-17 16:49:52
|
Lukas Matejka wrote:
> first i made links through command 'setup' than i used 'segment' to create
> segments from arcs and than i wanted to use 'links' to process pages and
> links to webdb, but command 'links' uses
>
> ${NUTCH}/bin/nutch admin "${nutchdb}" -create
> before updating db from segments and updating back segments from db
>
> shall I create new WebDB or continue on an old one(for example disabling this
> creating command)?
>
>
For incremental indexing, you will want to keep updating the one webdb
rather than create it anew each incremental indexing, so yes, modify
the indexarcs.sh script so it doesn't invoke create of the webdb. You
will likely also need to change the steps that follow so that it passes
only the segments that are part of the incremental update set rather
than all segments (Currently it's written as 'segments/*').
Tell us more about the size of your incremental updates? How frequently
are you planning to do them and how much data are you adding? Our
experience trying to do frequent updates has not been good: index
merging and webdb updating all can take a long time to complete. Tell
me more about the rates of update you are considering and meantime I'll
try and get some figures on our experience posted.
The story should be better in the new Nutch, though I guess index
merging works much as it did before: a pure Lucene operation.
On the NutchWAX 0.6.0 release status -- a NutchWAX running on top of
MapReduce Nutch -- development is going well. We've been using a rack of 35 or so
(very) slow processors to test indexing collections of 100M and more.
We're having some robustness and performance issues but they are being
addressed. We're still looking at an end-of-March/start-of-April
release. Will keep the list posted.
St.Ack
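The segment-selection change suggested above might look like the
following sketch. The directory layout (a 'segments/' directory plus a
marker file touched after each successful update) and the bin/nutch
invocation are assumptions for illustration; the nutch command is echoed
rather than executed here.

```shell
#!/bin/sh
# Sketch of an incremental update driver: pass only the segments that
# are new since the last run, instead of the 'segments/*' glob.

# Print the segments modified since the marker file (all of them if the
# marker does not exist yet).
pick_new_segments() {
  marker=$1
  for seg in segments/*; do
    [ -d "$seg" ] || continue
    if [ ! -e "$marker" ] || [ "$seg" -nt "$marker" ]; then
      printf '%s\n' "$seg"
    fi
  done
}

newsegs=$(pick_new_segments last_update)
if [ -n "$newsegs" ]; then
  # Update the existing webdb -- no '-create' -- with the new segments only.
  echo bin/nutch updatedb db $newsegs
  touch last_update
fi
```

The key points from the thread are preserved: the webdb is created once
and then only updated, and each update step sees just the incremental
segment set.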
|
|
From: Lukas M. <mat...@ce...> - 2006-02-17 15:10:13
|
Hi,
I'm just testing incremental indexing and I want to ask for a little
help (really simple for you :)).
I've used the file nutchwax/bin/indexarc.sh...
First I made links through the command 'setup', then I used 'segment' to
create segments from the ARCs, and then I wanted to use 'links' to
process pages and links into the webdb, but the command 'links' uses

${NUTCH}/bin/nutch admin "${nutchdb}" -create

before updating the db from the segments and updating the segments back
from the db.
Shall I create a new WebDB or continue with the old one (for example, by
disabling this create command)?
Putting all the indexes together is working well.
Thanks for the advice,
l.
|
|
From: Lukas M. <mat...@ce...> - 2006-02-14 17:22:33
|
> > > > > ---------- Forwarded message ----------
> From: stack <st...@ar...>
> To: stack <st...@ar...>
> Date: Sat, 11 Feb 2006 10:59:07 -0800
> Subject: Re: [Archive-access-discuss] Re: nutchwax
>
Lukáš:
>
> I committed code to undo any html entity encoding found in text to be
> emitted by OpenSearchServlet. I committed on the nutchwax 'release-0_4'
> branch so be careful you get this branch from CVS rather than HEAD if
> building from source. If you just want the WAR with the fix, it's
> available here: http://archive.org/~stack/nutchwax.war. Let me know if
> you want me to make up a complete nutchwax tarball. Let me know if the
> fix works for you (Here's the bug:
> https://sourceforge.net/tracker/index.php?func=detail&aid=1429788&group_id=118427&atid=681137).

it works very well! good work. I've just downloaded nutchwax.war and ..
it seems to be ok :)

-lm

> This is a band-aid fix until the core issue gets addressed in nutch.
> I'll work on trying to get this done this week.
>
> This is a pretty serious issue. Text snippets -- i.e. the 'description'
> field in the XML -- that have anything but plain ASCII are mangled,
> showing ugly numeric character representations, 'ŗ', etc., in place
> of legit UTF-8 characters. It was also possible to by-pass our
> legit-xml character checking by encoding illegal characters: e.g. ''.
> If the fix works for you Lukáš, I'll make a new release of nutchwax with
> the band-aid incorporated later this week (hopefully by the release of
> the 0.6.0 mapreduce version of NutchWAX, it will have the real fix
> incorporated).
>
> Good stuff,
> St.Ack
>
> stack wrote:
> > Lukas Matejka wrote:
> >> On Thu 9 February 2006 18:51, stack wrote:
> >>> ....
> >>> I see the 0x07 Bell character in the original page. Below is an 'od'
> >>> dump of the relevant section with the ascii line underwritten by its
> >>> hex representation. The last line has the 0x07 character.
> >>
> >> you're absolutely right with the bell character, but I think there is
> >> another different thing. I'll try to explain.
> >>
> >> I will search for the word 'kniha' (which means book) through
> >> http://war.mzk.cz:8080/nutchwax/opensearch?query=kniha&start=0&hitsPerPage=10&hitsPerDup=1&dedupField=exacturl
> >>
> >> The answer is valid XML (57472 hits for the word kniha), but in the
> >> result, in the entity 'description', there are html entities that
> >> represent czech characters with diacritics, and that's the problem.
> >> The original site doesn't contain these html entities but regular
> >> czech characters.
> >>
> >> The interesting thing is that the entity 'title' shows czech
> >> characters well, but the entity 'description' shows html entities
> >> (for instance the html entity &#253; represents the special
> >> character ý, y with an acute accent).
> >>
> >> Have you any idea where the problem could be?
> >
> > Thanks for the extra info Lukas.
> >
> > Digging in, I see that the generation of summaries runs the text
> > through org.apache.nutch.html.Entities. Here is the code for the
> > Entities#encode method that all summary text is run through:
> >
> >   static final public String encode(String s) {
> >     int length = s.length();
> >     StringBuffer buffer = new StringBuffer(length * 2);
> >     for (int i = 0; i < length; i++) {
> >       char c = s.charAt(i);
> >       int j = (int)c;
> >       if (j < 0x100 && encoder[j] != null) {
> >         buffer.append(encoder[j]);   // have a named encoding
> >         buffer.append(';');
> >       } else if (j < 0x80) {
> >         buffer.append(c);            // use ASCII value
> >       } else {
> >         buffer.append("&#");         // use numeric encoding
> >         buffer.append((int)c);
> >         buffer.append(';');
> >       }
> >     }
> >     return buffer.toString();
> >   }
> >
> > Any character that is super-ASCII gets a numeric character encoding.
> > Assuming all is UTF-8 in nutch, then we probably don't want HTML
> > entity encoding when we're outputting UTF-8 XML. In fact, it looks
> > like we don't want any html entity encoding at all when outputting
> > XML.
> > > > The call to Entities#encode is buried in nutch inside the Fragment > > inner class of Summary. It would take a good bit of work making up a > > NutchBean that called an alternate Summary-maker when outputting XML. > > > > Meantime, I have a quick fix that adds HTML entity decoding to the > > Nutchwax OpenSearchServlet. Let me do some more testing and hopefully > > I can commit later today. I'll let you know. > > > > St.Ack > > > > > > ------------------------------------------------------- > > This SF.net email is sponsored by: Splunk Inc. Do you grep through log > > files > > for problems? Stop! Download the new AJAX search engine that makes > > searching your log files as easy as surfing the web. DOWNLOAD SPLUNK! > > http://sel.as-us.falkag.net/sel?cmd=3Dk&kid=103432&bid#0486&dat=121642 > > _______________________________________________ > > Archive-access-discuss mailing list > > Arc...@li... > > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > ------------------------------------------------------- > This SF.net email is sponsored by: Splunk Inc. Do you grep through log > files for problems? Stop! Download the new AJAX search engine that makes > searching your log files as easy as surfing the web. DOWNLOAD SPLUNK! > http://sel.as-us.falkag.net/sel?cmdlnk&kid=103432&bid#0486&dat=121642 > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss =2D-=20 =2D----------------------------- Bc.Lukas Matejka email:mat...@ce... GSM:+420777093233 |
|
From: stack <st...@ar...> - 2006-02-11 19:00:22
|
Lukáš:

I committed code to undo any HTML entity encoding found in text to be
emitted by OpenSearchServlet. I committed on the nutchwax 'release-0_4'
branch, so be careful you get this branch from CVS rather than HEAD if
building from source. If you just want the WAR with the fix, it's
available here: http://archive.org/~stack/nutchwax.war. Let me know if
you want me to make up a complete nutchwax tarball. Let me know if the
fix works for you (here's the bug:
https://sourceforge.net/tracker/index.php?func=detail&aid=1429788&group_id=118427&atid=681137).

This is a band-aid fix until the core issue gets addressed in nutch.
I'll work on trying to get this done this week.

This is a pretty serious issue. Text snippets -- i.e. the 'description'
field in the XML -- that have anything but plain ASCII are mangled,
showing ugly numeric character representations, '&#343;' (which renders
as 'ŗ'), etc., in place of legitimate UTF-8 characters. It was also
possible to bypass our legit-XML character checking by encoding illegal
characters, e.g. '&#7;' (the Bell). If the fix works for you Lukáš,
I'll make a new release of nutchwax with the band-aid incorporated
later this week (hopefully by the release of the 0.6.0 mapreduce
version of NutchWAX, it will have the real fix incorporated).

Good stuff,
St.Ack
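The band-aid described above — undoing entity encoding in the servlet's output — might look roughly like the following sketch. The class and method names are made up for illustration; this is not the actual committed NutchWAX code, and it handles only the numeric (`&#NNN;`) form that nutch's `Entities#encode` emits for non-Latin-1 characters.

```java
// Illustrative sketch only: undo numeric character references
// ("&#253;" -> 'ý') so the servlet can emit the raw characters
// in its UTF-8 XML instead. BMP code points only; anything that
// is not a well-formed numeric reference is copied through as-is.
public final class EntityDecoder {
    public static String decodeNumericRefs(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '&' && i + 2 < s.length() && s.charAt(i + 1) == '#') {
                int end = s.indexOf(';', i + 2);
                if (end > i + 2) {
                    try {
                        int code = Integer.parseInt(s.substring(i + 2, end));
                        out.append((char) code);
                        i = end + 1;
                        continue;
                    } catch (NumberFormatException ignored) {
                        // not a numeric reference; fall through, copy '&'
                    }
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }
}
```

Named entities (`&yacute;` and friends, which `Entities#encode` also produces) would additionally need a lookup table; the sketch covers only the numeric form.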
|
From: stack <st...@ar...> - 2006-02-10 19:33:03
|
Lukas Matejka wrote:
> On Thu, 9 Feb 2006 18:51, stack wrote:
>> ....
>> I see the 0x07 Bell character in the original page. Below is an 'od'
>> dump of the relevant section with the ascii line underwritten by its
>> hex representation. The last line has the 0x07 character.
>
> You're absolutely right about the bell character, but I think there is
> another, different thing going on. I'll try to explain.
>
> I will search for the word 'kniha' (which means book) through
> http://war.mzk.cz:8080/nutchwax/opensearch?query=kniha&start=0&hitsPerPage=10&hitsPerDup=1&dedupField=exacturl
>
> The answer is valid XML (57472 hits for the word kniha), but in the
> result, the 'description' entity contains HTML entities that represent
> czech characters with diacritics, and that's the problem. The original
> site doesn't contain these HTML entities, just regular czech characters.
>
> The interesting thing is that the 'title' entity shows czech characters
> fine, but the 'description' entity shows them as HTML entities (for
> instance, the HTML entity &yacute; represents the special character y
> with an acute accent).
>
> Have you any idea where the problem could be?

Thanks for the extra info Lukas.

Digging in, I see that the generation of summaries runs the text through
org.apache.nutch.html.Entities. Here is the code for the Entities#encode
method that all summary text is run through:

static final public String encode(String s) {
  int length = s.length();
  StringBuffer buffer = new StringBuffer(length * 2);
  for (int i = 0; i < length; i++) {
    char c = s.charAt(i);
    int j = (int)c;
    if (j < 0x100 && encoder[j] != null) {
      buffer.append(encoder[j]);  // have a named encoding
      buffer.append(';');
    } else if (j < 0x80) {
      buffer.append(c);           // use ASCII value
    } else {
      buffer.append("&#");        // use numeric encoding
      buffer.append((int)c);
      buffer.append(';');
    }
  }
  return buffer.toString();
}

Any character that is super-ASCII gets a numeric character encoding.
Assuming all is UTF-8 in nutch, then we probably don't want HTML entity
encoding when we're outputting UTF-8 XML. In fact, it looks like we
don't want any HTML entity encoding at all when outputting XML.

The call to Entities#encode is buried in nutch inside the Fragment inner
class of Summary. It would take a good bit of work making up a NutchBean
that called an alternate Summary-maker when outputting XML.

Meantime, I have a quick fix that adds HTML entity decoding to the
Nutchwax OpenSearchServlet. Let me do some more testing and hopefully I
can commit later today. I'll let you know.

St.Ack
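Following the analysis above, if the servlet built its own XML, entity encoding could be dropped entirely: in UTF-8 output only the XML-reserved characters need replacing, and accented characters can be emitted as-is. A hypothetical escaper along those lines (illustrative, not nutch code):

```java
// Escape only what XML itself requires; 'ý', 'š', etc. pass
// through untouched because the output stream is already UTF-8.
public final class XmlText {
    public static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

With something like this in place, the czech diacritics Lukas reported would appear in the 'description' field exactly as they do in 'title'.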
|
From: Lukas M. <mat...@ce...> - 2006-02-10 14:52:20
|
On Thu, 9 Feb 2006 18:51, stack wrote:
> Lukáš Matějka wrote:
>> Hi,
>>
>> I still can't handle this issue..
>
> Pardon the late reply Lukáš.
>
> Here seems to be a page with problematic characters:
> http://dig.vkol.cz/vz/vz01_12.htm. I get it by following Sverre's
> recipe below, adding hitsPerDup=0, etc.
>
> If I get the page via the opensearchservlet, firefox complains about
> the '&#7;' character in the description field. The ascii 'Bell'
> character is illegal in XML, even though it's represented by a numeric
> character reference (here's the grammar for XML Char:
> http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char).
>
> I see the 0x07 Bell character in the original page. Below is an 'od'
> dump of the relevant section with the ascii line underwritten by its
> hex representation. The last line has the 0x07 character.

You're absolutely right about the bell character, but I think there is
another, different thing going on. I'll try to explain.

I will search for the word 'kniha' (which means book) through
http://war.mzk.cz:8080/nutchwax/opensearch?query=kniha&start=0&hitsPerPage=10&hitsPerDup=1&dedupField=exacturl

The answer is valid XML (57472 hits for the word kniha), but in the
result, the 'description' entity contains HTML entities that represent
czech characters with diacritics, and that's the problem. The original
site doesn't contain these HTML entities, just regular czech characters.

The interesting thing is that the 'title' entity shows czech characters
fine, but the 'description' entity shows them as HTML entities (for
instance, the HTML entity &yacute; represents the special character y
with an acute accent).

Have you any idea where the problem could be?

l.

--
------------------------------
Bc. Lukas Matejka
email: mat...@ce...
GSM: +420777093233
|
From: stack <st...@ar...> - 2006-02-09 17:52:17
|
Lukáš Matějka wrote:
> Hi,
>
> I still can't handle this issue..

Pardon the late reply Lukáš.

Here seems to be a page with problematic characters:
http://dig.vkol.cz/vz/vz01_12.htm. I get it by following Sverre's recipe
below, adding hitsPerDup=0, etc.

If I get the page via the opensearchservlet, firefox complains about the
'&#7;' character in the description field. The ascii 'Bell' character is
illegal in XML, even though it's represented by a numeric character
reference (here's the grammar for XML Char:
http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char).

I see the 0x07 Bell character in the original page. Below is an 'od'
dump of the relevant section with the ascii line underwritten by its hex
representation. The last line has the 0x07 character.

....
0008832   8   0   .   <   /   p   >   <   p   >  nl   <   b   >   M   K
              3830 2e3c 2f70 3e3c 703e 0a3c 623e 4d4b
0008848  sp  c8   R   <   /   b   >  sp   z   a  f8   a   d   i   l   o
              20c8 523c 2f62 3e20 7a61 f861 6469 6c6f
0008864  sp   n   a  sp   s   e   z   n   a   m  sp   n   e   j   c   e
              206e 6120 7365 7a6e 616d 206e 656a 6365
0008880   n   n  ec   j  b9  ed   c   h  sp   d   o   k   l   a   d  f9
              6e6e ec6a b9ed 6368 2064 6f6b 6c61 64f9
0008896   .  bel  sp   /   B   i   b   l   e  sp   b   o   s   k   o   v
              2e07 202f 4269 626c 6520 626f 736b 6f76
...

Since the illegal character shows up in the description text as a
character reference, it has probably been encoded earlier in the
processing of the document.

Regardless, the opensearchservlet should probably look for such illegal
encodings and just strip them (it's doing this already for raw
characters). Let me try and fix this.

St.Ack

> Does anybody know how to help?
> Can NutchWAX produce output with HTML entities? (Output from NutchWAX
> should be UTF, shouldn't it?) Because (in the cases written below)
> invalid XML is caused by special characters in HTML entities.
>
> Thanks for any help
>
> -lm
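The stripping stack proposes above — dropping anything outside the XML 1.0 Char production — might look like this sketch for raw characters (names invented for illustration; not the actual servlet code):

```java
// Illustrative sketch: drop characters that are illegal in XML 1.0.
// The Char production allows #x9, #xA, #xD, [#x20-#xD7FF],
// [#xE000-#xFFFD] and [#x10000-#x10FFFF]; anything else (such as
// the 0x07 Bell found in the page above) is silently removed.
public final class XmlCharFilter {
    public static String stripIllegal(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            boolean legal = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (legal) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

For the encoded form ('&#7;' and the like), the same legality test would be applied to the value of each numeric character reference before letting it through.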
|
From: <mat...@ce...> - 2006-02-09 12:08:33
|
Hi,

I still can't handle this issue..

Does anybody know how to help?
Can NutchWAX produce output with HTML entities? (Output from NutchWAX
should be UTF, shouldn't it?) Because (in the cases written below)
invalid XML is caused by special characters in HTML entities.

Thanks for any help

-lm

______________________________________________________________
> From: sve...@nb...
> To: stack <st...@ar...>
> CC: Lukáš Matějka <mat...@ce...>
> Date: 12.01.2006 10:38
> Subject: Re: nutchwax
>
> Hi Michael, Lukáš ..
>
> On Thursday 12 January 2006 01:33, stack wrote:
>> Lukáš Matějka wrote:
> ...
>>> What's the difference between these cases?
>>>
>>> 1)
>>> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l+louck%C3%BD&start=0&hitsPerDup=0&hitsPerPage=10&dedupField=exacturl
>>> -> output is not valid XML (called from WERA)
>>>
>>> 2)
>>> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l%20louck%C3%BD&start=0&hitsPerPage=10&hitsPerDup=1&dedupField=exacturl
>>> -> output is valid XML (called from Nutchwax search.jsp)
>
> If I try the above URLs I find quite the opposite! Case 1 produces
> valid XML, case 2 produces invalid XML.
>
> Test results:
>
> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l+louck%C3%BD
> -> valid XML
>
> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l%20louck%C3%BD
> -> valid XML
>
> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l+louck%C3%BD&hitsPerDup=0&dedupField=exacturl
> -> valid XML
>
> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l%20louck%C3%BD&hitsPerDup=0&dedupField=exacturl
> -> valid XML
>
> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l+louck%C3%BD&hitsPerDup=1&dedupField=exacturl
> -> INVALID XML
>
> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l%20louck%C3%BD&hitsPerDup=1&dedupField=exacturl
> -> INVALID XML
>
> Setting hitsPerDup=2 results in valid XML
>
> Conclusion:
> A specific record in the index contains invalid XML chars, and it is
> only part of the result list when hitsPerDup=1. Setting hitsPerDup=0
> and start=10 will produce a result list including the record with the
> invalid XML chars.
>
> I don't know if the above was of any help to you, I just had to say
> something about it ;-)
>
> Sverre