From: Brad T. <br...@ar...> - 2006-04-06 17:53:32
|
cc'd archive-access-discuss so others can comment, etc.

Lukas Matejka wrote:
> On Thu, 6 April 2006 at 05:06, you wrote:
>
>>> You can visit
>>> http://raptor.webarchiv.cz:8080/wayback/20060214122556/www.agromanual.cz/cz/
>>> There are question marks instead of the special Czech characters (but I'm
>>> not sure whether your browser is able to show any Czech characters).
>>>
>>> I think there may be a bug in sending characters to the output.
>>> What exactly is done with a page when it is restored from the archive?
>>
>> I'm pretty sure this was caused by a character encoding bug (aka a
>> total lack of proper handling of non-ASCII documents...) which I'm
>> hoping is now fixed in HEAD.
>>
>> If you get a chance to try out the fix and let me know if you're still
>> seeing the problem, that would be great.
>
> I've just applied the patch and it seems to work well!
>
> Good work!

Great! Glad to hear it's working better, and thanks! I'm hoping to release an 0.4.1 today which includes this fix.

> l.
>
> p.s.
> What exactly does the description "--> (redirect) (new version)" in the
> results mean?

Each line in the index indicates, among other things:

* the HTTP redirect (Location) URL
* the document digest

If the page is thought to redirect to another URL, this is indicated in the search result page with a "(redirect)".

If the digest has changed from the value for the previous version of the same URL, then "(new version)" is indicated.

(At least this is how it's supposed to be working.)

We haven't yet spent much time making the rendering JSPs very pretty. If you have suggestions, etc., please forward them on.

>> Thanks!
>>
>> Brad
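The "(new version)" logic Brad describes can be sketched roughly like this (a hypothetical illustration, not the actual Wayback code): walk the index entries for one URL in date order and flag a capture whenever its digest differs from the previous capture's digest.

```python
# Hypothetical sketch of the "(new version)" annotation described above:
# flag a capture when its content digest differs from the previous
# capture of the same URL. Not the actual Wayback implementation.

def annotate_versions(captures):
    """captures: list of (timestamp, digest) for one URL, oldest first."""
    annotated = []
    prev_digest = None
    for timestamp, digest in captures:
        new_version = digest != prev_digest
        annotated.append((timestamp, digest, new_version))
        prev_digest = digest
    return annotated

captures = [
    ("20060214122556", "AAYVISCY"),
    ("20060301080000", "AAYVISCY"),   # same digest: not a new version
    ("20060315090000", "CB44IW2L"),   # digest changed: "(new version)"
]
```

The timestamp and digest values here are made up for illustration; only the comparison rule follows the description above.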
|
From: Oskar G. <osk...@kb...> - 2006-04-06 14:19:26
|
Hi everyone!

Let me first introduce myself to those of you who don't know me already. My name is Oskar Grenholm and I work as a programmer at the National Library of Sweden, mainly on things related to our web archive here.

Lately I have made some minor improvements to the way proxy mode works in the Open Wayback Machine. Those changes make it possible to surf not only the most recent copy of a page in the web archive, but any available copy. This can be done with just the Wayback Machine, but to aid (and perhaps simplify) the surfing I have also started working on a Firefox extension that helps the user with common tasks encountered when surfing a web archive. Among other things, this WAX Toolbar provides a search field for searching the Wayback Machine for different URLs, or for doing a full-text search against a NutchWAX index (if one is available, of course). You can also use the toolbar to switch between proxy mode and the regular Internet, and, when in proxy mode, easily go back and forth in time.

The changes made to the Wayback are not many. The main idea is that you have a BDB index that holds mappings between ids (a unique id if the toolbar was used, otherwise the IP address the request was made from) and a preferred time to surf at. This timestamp is set either when you choose a page to visit from the search interface in the WB or by the WAX Toolbar. For each request made to the proxy, the WB then looks up this timestamp and returns the page that is closest in time.

Patches for these changes are attached to this e-mail. Four of the files are existing files that have been modified somewhat, and two of them are new (BDBMapper.java and Redirect.jsp). Also attached is a tar file containing the source for the Firefox extension. If you untar it and enter the directory, you can just run 'ant' and a file named WaxToolbar.xpi will be built. That is the actual Firefox extension, and it can be installed like any other extension (i.e., by double-clicking it from within Firefox). When the extension is installed (and after a restart of Firefox) a new toolbar will be there. In the Tools menu there will also be a WAX Toolbar Configuration option; using this you can set the proxy to use (the WB) and a server running NutchWAX.

Finally, I have attached an example of a web.xml that can be used when running the WB with these new changes and the WAX Toolbar. Some new things have been added to it, namely a parameter specifying the redirect path (the Redirect.jsp mentioned above) and a servlet called xmlquery that runs in parallel with the normal query interface and is used by the extension to find the times at which a page has been archived.

So, let the feedback begin!

Regards, Oskar.
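The "closest in time" lookup Oskar describes could look roughly like this (an illustrative sketch, not the code from the attached patches): given the client's preferred 14-digit timestamp and the list of timestamps at which a URL was captured, pick the capture with the smallest absolute time distance.

```python
# Illustrative sketch of the proxy's "closest in time" selection
# described above; not the code from the attached patches. Timestamps
# are 14-digit YYYYMMDDHHMMSS strings as used in Wayback URLs.
from datetime import datetime

def closest_capture(preferred, capture_timestamps):
    fmt = "%Y%m%d%H%M%S"
    target = datetime.strptime(preferred, fmt)
    # Choose the capture whose parsed time is nearest the preferred time.
    return min(
        capture_timestamps,
        key=lambda ts: abs(datetime.strptime(ts, fmt) - target),
    )

# A client surfing "at" 2006-02-20 gets the 2006-02-14 capture,
# since it is nearer than the 2005 or April 2006 ones.
print(closest_capture("20060220000000",
                      ["20051101120000", "20060214122556", "20060406175332"]))
```

The capture timestamps here are invented for the example; the point is only the nearest-in-time selection rule.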
|
From: Brad T. <br...@ar...> - 2006-04-03 21:49:54
|
Hello archive-access!

I wanted to take a few minutes to introduce myself and the new Wayback project, which has been mentioned on this list but never formally announced.

This project is designed to eventually be the Internet Archive's standard tool for querying and replaying archived content. The current production Wayback Machine (web.archive.org) software allows Internet users to view archived documents from the Internet Archive's web collection, which contains over 60 billion resources. This new Wayback project seeks to replace the classic Wayback Machine's functionality in an open-source, extensible and redistributable Java package.

There are dramatic variations in the ways that people want to use this software. At one end of the spectrum is the user who simply wants to look at content they've just crawled with the Heritrix web crawler on their personal workstation. At the other end is the Internet Archive, needing to serve hundreds of requests per second against its 20-million-ARC-file collection. In between is everything from users experimenting with full-text search technologies to others trying out new methods of replaying archived content using browser extensions. To address these varying requirements, a good deal of the project's focus is on modularity and extensibility, so that components can be swapped out and combined to satisfy diverse installation needs.

The very early (and unannounced) 0.2.0 release enabled two methods of replaying content in ARC format: the "standard" archival URL mode, and also a new proxy mode, where a user configures their browser to proxy requests through a Wayback server. This proxy mode addresses many, if not most, of the problems reported with the production Wayback Machine's archival URL replay mechanism. The 0.2.0 version operated only in a standalone mode, requiring that all ARC files be located on the same machine running the Wayback software.

We have just released a new version, 0.4.0, of the Wayback software, which you can read about in more detail at the project's home page: http://archive-access.sourceforge.net/projects/wayback/

This version has solidified some of the internal workings of the software, addressed the usual set of bugs found in new codebases, and also includes some major new capabilities. The first major feature is the ability to access documents from ARC files stored on remote servers, which has significant scaling ramifications. There have also been substantial improvements in both the query UI capabilities and in replaying documents. Also, the Wayback software can now be queried using an OpenSearch API, and preliminary development has been completed to allow requests to be satisfied using a NutchWAX full-text index.

We plan to release 0.6.0 in the next couple of months, which will include better packaging and substantial UI improvements, to make the Wayback software feature-comparable with the WERA application. The major features currently present in WERA that have not yet been developed in the Wayback are:

* clickable "timeline" view in replay mode
* very slick install application
* vastly better documentation
* better support/testing for international character sets

This is my first Java project, so I'm very appreciative of coaching and suggestions on coding style and things I'm doing wrong. Please let me know if you have problems, suggestions, or questions, and thanks in advance for the feedback!

Brad Tofel
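The "archival URL" replay mode mentioned above embeds a 14-digit capture timestamp into the request path (e.g. .../wayback/20060214122556/www.example.org/). A minimal sketch of composing and splitting such URLs (illustrative only; the prefix and helpers are hypothetical, not the Wayback source):

```python
# Illustrative sketch of the archival-URL scheme: a 14-digit capture
# timestamp embedded in the request path. The prefix and function names
# are hypothetical; this is not code from the Wayback project.

def make_archival_url(prefix, timestamp, original_url):
    return f"{prefix}/{timestamp}/{original_url}"

def parse_archival_url(prefix, archival_url):
    """Split an archival URL back into (timestamp, original URL)."""
    rest = archival_url[len(prefix) + 1:]
    timestamp, original = rest.split("/", 1)
    return timestamp, original

url = make_archival_url("http://archive.example/wayback",
                        "20060214122556", "www.example.org/cz/")
print(parse_archival_url("http://archive.example/wayback", url))
```

One design consequence of this scheme, and a reason the proxy mode exists, is that any absolute link the replayed page emits without the timestamp prefix escapes to the live web.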
|
From: alexis a. <alx...@ya...> - 2006-03-30 03:48:00
|
Hi,

I encountered a problem while trying to search WERA using Chinese characters as my criteria: the search result will always contain a WMV file. Initially I thought Nutch was indexing WMV files, which Stack clarified it does not. Stack suggested that I check anchors.jsp to see if there are anchors to that file; however, the anchors.jsp of the file does not have anything in it.

I consulted the crawl.log file and found these lines:

2006-03-18T04:10:33.604Z 200 1056898 http://info.channelnewsasia.com/rovingdv/boundaries/shthong.wmv LE http://www.channelnewsasia.com/boundaries/videos.htm text/plain #030 20060318041033072+335 AAYVISCYYWBMBZWKBSFSB7Z6XXAKWC2F 3t
2006-03-18T04:10:36.251Z 200 2700490 http://info.channelnewsasia.com/rovingdv/boundaries/asha.wmv LE http://www.channelnewsasia.com/boundaries/videos.htm text/plain #008 20060318041034946+792 CB44IW2LSWLMD6AWIYWOXZFK3HZ4AABU -
2006-03-18T04:10:38.954Z 200 1393066 http://info.channelnewsasia.com/rovingdv/boundaries/drtan.wmv LE http://www.channelnewsasia.com/boundaries/videos.htm text/plain #001 20060318041038254+436 INDK5OOERVNJXO464GLIIT2EPUDIOXZP -
2006-03-18T04:10:41.199Z 200 891220 http://info.channelnewsasia.com/rovingdv/boundaries/chiam.wmv LE http://www.channelnewsasia.com/boundaries/videos.htm text/plain #042 20060318041040700+334 5T3Q2FSH2BSOCI7WSUZKYBYZ6RXLF74H -

When I looked at the HTML page that contains the file, I found this markup for one of the files:

<object id="MediaPlayer" classid="CLSID:22d6f312-b0f6-11d0-94ab-0080c74c7e95" codebase="http://activex.microsoft.com/activex/controls/mplayer/en/nsmp2inf.cab#Version=5,1,52,701" standby="Loading Microsoft Windows Media Player components..." type="application/x-oleobject" width="200" height="190" border="0" vspace="0" hspace="0" align="top">
  <param name="FileName" value="http://info.channelnewsasia.com/rovingdv/boundaries/chiam.wmv">
  <param name="AnimationatStart" value="true">
  <param name="TransparentatStart" value="true">
  <param name="AutoStart" value="false">
  <param name="ShowControls" value="1">
  <embed type="application/x-mplayer2" pluginspage="http://www.microsoft.com/windows95/downloads/contents/wurecommended/s_wufeatured/mediaplayer/default.asp" showcontrols=1 width=200 height=190 src="http://info.channelnewsasia.com/rovingdv/boundaries/chiam.wmv" border="0" vspace="0" hspace="0" align="top">
  </embed>
</object>

It seems to me that Heritrix was not able to categorize the said file correctly, which is why it was indexed. However, Stack mentioned Nutch doing some more tests to check whether the file is indeed text/html. I hope you guys can double-check my problem.

Best Regards,
Alexis Artes
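The crawl.log excerpt above records each capture's reported MIME type (here text/plain) next to its URL, so a mismatch like a .wmv served as text/* can be spotted mechanically. A rough sketch (the field positions follow the excerpt above; this is not an official Heritrix log tool):

```python
# Rough sketch: flag crawl.log entries whose URL extension suggests a
# media file but whose reported MIME type is text/*, as in the excerpt
# above. Field positions follow that excerpt (timestamp, status, size,
# URL, discovery path, via, mimetype, ...); not an official Heritrix tool.

MEDIA_EXTENSIONS = (".wmv", ".mpeg", ".mpg", ".avi")

def suspicious_entries(log_lines):
    flagged = []
    for line in log_lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip malformed or truncated lines
        url, mimetype = fields[3], fields[6]
        if url.lower().endswith(MEDIA_EXTENSIONS) and mimetype.startswith("text/"):
            flagged.append(url)
    return flagged
```

Running this over the four lines quoted above would flag all four .wmv URLs, which matches the symptom Alexis describes: the files were treated as text and therefore indexed.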
|
From: Brad T. <br...@ar...> - 2006-03-20 21:20:20
|
The old Wayback Machine rewrites these link tags on the server side, before transmitting to the clients. I believe this is because JS modification of these during page load has no effect. Not positive, but it should be easy to test.

I just added a feature about two weeks ago to the new Java Wayback Machine to do this rewriting on the server side. (It also does FRAMEs, and a couple of other tag types, too.)

Sverre Bang wrote:
> On Thu, 2006-03-16 at 23:23 +0800, Boon Ling Aw wrote:
>
>> Hello,
>>
>> When using WERA to view a website archive, a JS script is inserted by
>> WERA to ensure that links point to WERA rather than out to the Internet.
>>
>> However, this redirection of links is not being applied to CSS style
>> sheets used in the web page:
>>
>> <link rel="stylesheet" href="http://... .../styles.css" type="text/css" />
>>
>> The CSS page does exist within the archive. However, it refers to the
>> Internet's copy when viewed using WERA.
>>
>> Any reason for this?
>
> No, must be a bug in the JS responsible for rewriting links. We'd
> appreciate any contributions in improving the JS rewriter from anyone
> having the necessary JS skills - i know i don't ;-).
>
> Sverre
>
>> Thanks in advance for any replies...
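The server-side rewriting Brad describes can be approximated with a substitution that prefixes absolute URLs in href/src attributes with an archival-URL prefix. A hypothetical sketch (the prefix is invented, and a real rewriter such as Wayback's handles many more cases: relative URLs, BASE tags, FRAMEs, scripts):

```python
# Hypothetical sketch of server-side rewriting of <link>/<frame>
# references, in the spirit of what is described above. The archive
# prefix is made up for illustration; a real replay tool covers far
# more cases (relative URLs, BASE tags, embedded scripts, etc.).
import re

def rewrite_links(html, prefix, timestamp):
    def repl(match):
        attr, quote, url = match.group(1), match.group(2), match.group(3)
        # Point the attribute back into the archive at this capture time.
        return f'{attr}={quote}{prefix}/{timestamp}/{url}{quote}'
    # Rewrite absolute http:// URLs appearing in href= and src= attributes.
    return re.sub(r'(href|src)=(["\'])(http://[^"\']+)\2', repl, html)

html = '<link rel="stylesheet" href="http://example.org/styles.css" type="text/css" />'
print(rewrite_links(html, "http://archive.example/wayback", "20060214122556"))
# The stylesheet href now points into the archive instead of the live site.
```

This is exactly the class of rewriting that the client-side JS approach misses for stylesheets, which is the bug reported in the quoted message.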
|
From: Michael S. <st...@du...> - 2006-03-20 18:51:07
|
NutchWAX-0.4.3 fixes the following bugs (hope my little ASCII table makes it across):

+---------+------+----------------------------------------------+------------------+----------+----------+
| ID      | Type | Summary                                      | Open Date        | By       | Filer    |
+---------+------+----------------------------------------------+------------------+----------+----------+
| 1454710 | Fix  | Index '.arc' (as well as '.arc.gz')          | 2006-03-20 08:54 | stack-sf | stack-sf |
| 1454714 | Fix  | Null mimetype stops indexing                 | 2006-03-20 09:00 | stack-sf | stack-sf |
| 1429788 | Fix  | xml output destroyed by html entity encoding | 2006-03-20 08:59 | stack-sf | stack-sf |
+---------+------+----------------------------------------------+------------------+----------+----------+

For sure this will be the last release before the 0.6.0 move up onto the Nutch MapReduce platform.

Yours,
St.Ack
|
From: Sverre B. <sve...@nb...> - 2006-03-20 15:21:57
|
This note is to announce a new release candidate, 0.4.2-RC1, of WERA (WEb aRchive Access), the web archive collection search and navigation tool. Release 0.4.2-RC1 includes URL canonicalization and proxy support. See the release notes and manual for details.

Release notes: http://sourceforge.net/project/shownotes.php?release_id=403083&group_id=118427
Online manual: http://nwa.nb.no/wera/articles/manual.html
Download: http://sourceforge.net/project/showfiles.php?group_id=118427&package_id=167210&release_id=403083
The WERA home page: http://archive-access.sourceforge.net/projects/wera/

A demo of WERA 0.4.2-RC1 is available at http://nwa.nb.no/wera/ (proxy setup is not supported by the web server at nwa.nb.no - I'll let you know when that is ready).

Yours,
Sverre Bang
|
From: Sverre B. <sve...@nb...> - 2006-03-17 11:43:28
|
On Thu, 2006-03-16 at 23:23 +0800, Boon Ling Aw wrote:
> Hello,
>
> When using WERA to view a website archive, a JS script is inserted by
> WERA to ensure that links point to WERA rather than out to the Internet.
>
> However, this redirection of links is not being applied to CSS style
> sheets used in the web page:
>
> <link rel="stylesheet" href="http://... .../styles.css" type="text/css" />
>
> The CSS page does exist within the archive. However, it refers to the
> Internet's copy when viewed using WERA.
>
> Any reason for this?

No, must be a bug in the JS responsible for rewriting links. We'd appreciate any contributions in improving the JS rewriter from anyone having the necessary JS skills - i know i don't ;-).

Sverre

> Thanks in advance for any replies...
|
From: Michael S. <st...@ar...> - 2006-03-16 16:46:59
|
(Forwarding to the list because likely of general interest)

Sverre Bang wrote:
> ...
>
> On Thu, 2006-03-16 at 05:39 +0000, Banu Gandhi wrote:
>
>> Hi Sverre,
>>
>> We are using NutchWAX as an indexing tool for our web archive (WERA
>> and HERITRIX).
>>
>> We wish to implement multiple indexing, as well as incremental
>> indexing; we keep our crawl files on a separate server.
>>
>> 1. I have some questions when I try to do incremental indexing.
>> I mount the ARC files from another server. I have created a queue.
>> When I segment it, it shows an error message that there is no such
>> file in the queue folder, even though the ARC files are linked
>> properly in the arcs folder.

Please paste in the error, Banu, and the commands you run (you're not using the indexarcs wrapper script?). That'll help with diagnosis.

>> When I try to implement the update statements from new segments, I
>> get the message "FS not specified default LOCAL". How can I specify
>> this as not local? The update message does show that the update
>> finished successfully.

Be careful here. You probably want LOCAL for your case, at least for the moment. The alternative is NDFS, the Nutch distributed file system that has since evolved in later versions of Nutch -- NutchWAX is based on Nutch 0.7 -- to become DFS, part of the new Hadoop Apache project. Its phrasing is ominous, as though you've left out some important specification, but it's just a message telling you which FS it's about to use.

>> The same message is shown when I update segments from the db. After
>> that, if I check the arcs folder of old segments, I can't see the new
>> ARC files.

The arcs folder or the queue folder?

>> Can you explain to me where I made the mistake?
>>
>> 2. Can we maintain multiple indexed folders, meaning multiple ARC
>> file folders on the same machine, each indexed under a different
>> folder? Can WERA access all the indexed folders for search results?

I think WERA passes the ARCRetriever a full path, so multiple folders should be possible (Sverre?). Do you have an idea of how many ARC files you'll be dealing with? There'll be upper limits to how many ARCs you can keep on a single machine... so a means of keeping them distributed over multiple machines is needed. The open source Wayback will have such a facility, and we'll slot it into place when ready, in place of ARCRetriever.

>> 3. Regarding the scalability of NutchWAX: if I don't want to index
>> the image files for full-text searching, and wish to have just the
>> URL link to the images, how can we do that?

That's what currently happens. image/* and their like are passed to the default parser. All it does is add meta info such as URL, type, etc. to the index. These resource types are not 'indexed' in the way text/* are.

>> Also, please let me know where I can find the functionality part of
>> all the folders, as well as the scripts of NutchWAX, other than the
>> FAQ.

I'm not clear what you're asking above. Please retry.

Thanks Banu,
St.Ack
|
From: Boon L. A. <aw...@ho...> - 2006-03-16 15:23:19
|
Hello,

When using WERA to view a website archive, a JS script is inserted by WERA to ensure that links point to WERA rather than out to the Internet.

However, this redirection of links is not being applied to CSS style sheets used in the web page:

<link rel="stylesheet" href="http://... .../styles.css" type="text/css" />

The CSS page does exist within the archive. However, it refers to the Internet's copy when viewed using WERA.

Any reason for this?

Thanks in advance for any replies...
|
From: Sverre B. <sve...@nb...> - 2006-03-14 15:08:30
|
Hi there. Sorry for not responding earlier to all the issues discussed in recent weeks. I'm working on adding URL canonicalization and proxy mode support in WERA, and the results so far are promising. Some comments below. I'll prepare a new release this week. I'll even try to convince the people maintaining our web servers to add the proxy setup on the WERA demo site. Please ask more questions; this time I promise to get back to you a bit sooner.

Regards,
Sverre

On Wed, 2006-02-22 at 22:36 -0800, stack wrote:
> stack wrote:
>> (Forwarded discussion from the Heritrix list)
>> ...
>> Generally, the pages are shown fine, with the exception of javascripts
>> that are retrieved from the live site instead of our ARC files. Also,
>> WERA is unable to dynamically replace the links inside the javascripts.
>
> This leaking to the live web is a difficult problem. Perhaps this
> particular JS can be fixed in WERA, but given the variety of ways in
> which JS can be conjured, it's unlikely all permutations will be
> guarded against.

No way we're gonna catch all JS-generated URLs by using JavaScript or server-side parsing. At least I'm not going to invest a lot of time in bullet-proofing the JS; others are welcome though ;-) I'd prefer a combination of JS and/or server-side parsing and a proxy solution that catches "the rest", i.e. the leakages out to the Internet.

>> Sample case:
>> http://nwa.nb.no/wera/result.php?time=&url=http%3A%2F%2Fwww.nla.gov.au%2F&query=nla
>>
>> Check the properties of the web page to verify that you are still
>> within "http://nwa.nb.no". Click on the "Exquisite Watercolors" link
>> and verify that we are still viewing the ARC files. Go back a page and
>> try any menu links. When you view the properties, it will show that
>> you are indeed browsing the live site instead of the ARC files.

The proxy mode I'm working on does handle the above case.

> Have you considered setting your browser to go to your collection via a
> proxy? (I don't think this mode is supported yet in WERA. I think it's
> possible to set the Wayback into a proxy mode.) The proxy could ensure
> you never strayed off your ARC collection, returning errors if a
> resource is not found.
>
>> What I wanted to accomplish are the following:
>>
>> 1) Help WERA load the javascripts from our ARC files instead of the
>> live sites by modifying the loading of the scripts from the HTML.
>> Instead of the relative /js/xxxx.js, we will change it to
>> http://localhost/wera/......../js/xxxx.js.
>
> (Sverre or Brad: Does the JS inserted at the end of the page by WERA,
> adding a base to the page, not affect such JS URLs?)

The JS injected by WERA should take care of this (eh, I'm not an expert on JavaScript - I did modify IA's original Wayback JS to fit WERA, but I do not have a thorough understanding of it). A big problem with WERA as it is now is that you have no easy way of telling what is fetched from the Internet and what is fetched through WERA. Cutting the browser off from the Internet by using a proxy that redirects the leaking links back to WERA makes it a lot easier to debug and improve the JS rewriting.

>> 2) Modify the relative links inside javascript files if WERA is not
>> capable of dynamically modifying them also.

If you really need to change the javascript files before feeding them to the client, I would recommend implementing this in WERA rather than start messing with the ARC files. If you look in the WERA config file you'll see that there are different handlers for different mime types ($conf_document_handler). The text/html handler injects the JS for rewriting links. Any other mime type is handled by a passthrough handler. If the javascripts are stored in the archive with one (or more) distinguished mime types, you could write a handler especially for this/these.

> Would be sweet if any modifications you'd do in the rewriting of the
> ARC files were instead done for you by WERA (or Wayback).
>
> St.Ack
>
>> I am planning to use dk.netarkivet.ArcUtils for this task.
>>
>> I know that my problem is a little bit off topic, but I hope you could
>> give additional tips.
>>
>> Thanks again in advance.
>>
>> --- In arc...@ya..., stack <stack@...> wrote:
>>>
>>> alxartes wrote:
>>>> St.Ack, thanks again for the reply.
>>>>
>>>> Most of the pages are not displayed the way they should be when
>>>> viewed from the source.

When you view the source of the web page displayed in the lower frame of the WERA timeline view, you will not see the links rewritten. The source you see is the source before the JS "kicks in". This can be a bit annoying when it comes to debugging the JS ;-)

>>> At this time, are you viewing the pages with WERA? Or how are they
>>> being viewed?
>>>
>>>> I guess it is because the CSS and javascript files are not being
>>>> fetched properly at the loading of the HTML from the ARC file. We
>>>> arrived at this conclusion since we can directly retrieve the CSS
>>>> and JS through WERA.
>>>
>>> So pages are showing fine when viewed with WERA (generally)?
>>>
>>>> I am planning on modifying the HTMLs inside the ARC files to correct
>>>> this problem. What tool can I use to expand the ARC files so that I
>>>> can modify the files inside, and a tool that will bring the ARC file
>>>> together once again? I think this is somewhat off topic, but I am a
>>>> little bit out of time and would greatly appreciate any input.
>>>
>>> I'm trying to understand. You want to rewrite ARC files, changing all
>>> links so they point back into ARCs (or back to a disk populated with
>>> the documents from a set of ARCs)? You do not want to use WERA for
>>> viewing pages?
>>>
>>> This section from the developer manual might be of use:
>>> http://crawler.archive.org/articles/developer_manual.html#arcs
>>> It talks about tools for reading ARCs.
>>>
>>> One approach would subclass ARCReader. This will get you a stream
>>> onto ARCs
>>> (http://crawler.archive.org/apidocs/org/archive/io/arc/ARCReader.html).
>>> Use the adjacent ARCWriter to write new ARCs. To modify the links in
>>> pages, you'll first have to find them. You could start with the
>>> Extractors that are in Heritrix, subclassing them to add link-rewrite
>>> functionality. Such a tool has been asked for on this list in the
>>> past, but it's a bit of a job, and in the end you'll never
>>> successfully be able to rewrite all links (think URLs produced by JS
>>> in the page).
>>>
>>> Will WERA (or the coming Wayback,
>>> http://archive-access.sourceforge.net/projects/wayback/) not suffice?
>>>
>>> Yours,
>>> St.Ack
>>>
>>>> Thanks again.
>>>>
>>>> --- In arc...@ya..., stack <stack@> wrote:
>>>>>
>>>>> alxartes wrote:
>>>>>> Thanks, St.Ack.
>>>>>>
>>>>>> It is really worrisome to see those errors, especially when we are
>>>>>> not viewing the ARC files properly in WERA.
>>>>>
>>>>> Can you say more about what 'not viewing the arcfiles properly in
>>>>> Wera' means? Are pages not being found, or are they missing
>>>>> images/stylesheets?
>>>>>
>>>>> Regarding the local-errors.log, I've upped priority on an RFE that
>>>>> proposes cleaning this log (and added your experience to the issue):
>>>>> http://sourceforge.net/tracker/index.php?func=detail&aid=1091580&group_id=73833&atid=539099
>>>>>
>>>>>> Here is an excerpt from the crawl.log:
>>>>>>
>>>>>> 84046144 http://www.hdb.gov.sg/hdbwww/ownkvb.mpeg
>>>>>> 84046144 http://www5.hdb.gov.sg/hdbwww/ownkvb.mpeg
>>>>>> 84046144 http://www7.hdb.gov.sg/hdbwww/ownkvb.mpeg
>>>>>> 84046144 https://www5.hdb.gov.sg/hdbwww/ownkvb.mpeg
>>>>>> 47097784 http://www.hdb.gov.sg/hdbwww/ownkvs.mpeg
>>>>>> 47097784 http://www5.hdb.gov.sg/hdbwww/ownkvs.mpeg
>>>>>> 47097784 http://www7.hdb.gov.sg/hdbwww/ownkvs.mpeg
>>>>>> 47097784 https://www5.hdb.gov.sg/hdbwww/ownkvs.mpeg
>>>>>> 22292823 http://www.hdb.gov.sg/hdbwww/fallingwindow.wmv
>>>>>> 22292823 http://www5.hdb.gov.sg/hdbwww/fallingwindow.wmv
>>>>>> 22292823 http://www7.hdb.gov.sg/hdbwww/fallingwindow.wmv
>>>>>> 22292823 https://www5.hdb.gov.sg/hdbwww/fallingwindow.wmv
>>>>>>
>>>>>> As you can see, a certain file is crawled 4 times. I have done this
>>>>>> crawl using domain scope. Would path scope with a seed of
>>>>>> http://www.hdb.gov.sg prevent the other sites from being crawled?
>>>>>> If not, are there other ways to prevent it from happening?
>>>>>
>>>>> Yeah, the domain scope warns: "It will however reach subdomains of
>>>>> the seeds' original domains. www[#].host is considered to be the
>>>>> same as host." Explicitly stating 'www.hdb.gov.sg' doesn't look like
>>>>> it will avoid the problem either, reading the code.
>>>>>
>>>>> FYI, we're moving away from *scope scopes -- i.e. domainscope,
>>>>> pathscope, etc. -- toward decidingscope. The latter gives you "more
>>>>> rope" in designing scopes.
>>>>>
>>>>> It looks like the On*DecideRule, though, has the same issue with
>>>>> 'www'. It looks like you can write a SURT form, something like
>>>>> '(sg,gov,hdb,www)', that will only include URIs with a host of
>>>>> 'www.hdb.gov.sg' (though it looks like http and https are flattened
>>>>> to be the same scheme).
>>>>>
>>>>> I'll let others -- Igor or Gordon? -- respond. They can give a
>>>>> better quality answer than I.
>>>>>
>>>>> Good stuff,
>>>>> St.Ack
>>>>>
>>>>>> Thank you so much for your time.
>>>>>>
>>>>>> --- In arc...@ya..., stack <stack@> wrote:
>>>>>>>
>>>>>>> alxartes wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am investigating the log files of my crawls and found the error
>>>>>>>> below. I hope someone could explain what this means, because the
>>>>>>>> other javascripts are crawled fine.
>>>>>>>>
>>>>>>>> 2006-02-15T03:35:21.747Z
>>>>>>>> http://www.macromedia.com/uber/js/omniture_s_code.js "Unsupported
>>>>>>>> scheme: javascript"
>>>>>>>> javascript:,macromedia,dreamweaver,flash,shockwave,sdc,markme,sdc.shockwave,infopoll,developerlocator.macromedia
>>>>>>>
>>>>>>> In short, the above is just stating that Heritrix does not support
>>>>>>> fetching the 'URI' "javascript:,macromedia,dreamweaver...". It's
>>>>>>> not an 'error'.
>>>>>>>
>>>>>>> Heritrix is regexing over the content of
>>>>>>> 'http://www.macromedia.com/uber/js/omniture_s_code.js' looking for
>>>>>>> URIs. It found the string
>>>>>>> "javascript:,macromedia,dreamweaver,flash,shockwave,sdc,markme,sdc.shockwave,infopoll,developerlocator.macromedia".
>>>>>>> To the Heritrix regex, that string looks like a likely URI: it's
>>>>>>> inside quotes and starts with what could be a URI scheme (i.e.
>>>>>>> 'javascript:').
>>>>>>>
>>>>>>> So, the candidate URI is passed to our URI parser class,
>>>>>>> org.archive.net.UURIFactory. This class takes configuration in
>>>>>>> heritrix.properties about which URI schemes Heritrix will accept.
>>>>>>> Here's the relevant extract:
>>>>>>>
>>>>>>> ##############################################################
>>>>>>> # U U R I                                                    #
>>>>>>> ##############################################################
>>>>>>> # Any scheme not listed in the below will generate an
>>>>>>> # UnsupportedUriScheme exception. Make the list empty to
>>>>>>> # support all schemes.
>>>>>>> org.archive.net.UURIFactory.schemes = http, https, dns, invalid
>>>>>>>
>>>>>>> (We don't currently have an 'UnsupportedUriScheme' exception. We
>>>>>>> should add one.)
>>>>>>>
>>>>>>> Here is where the test is done:
>>>>>>> http://crawler.archive.org/xref/org/archive/net/UURIFactory.html#443
>>>>>>>
>>>>>>> Because the 'javascript' scheme is not in the above supported
>>>>>>> schemes list (nor in the list of schemes to ignore, which appears
>>>>>>> later in heritrix.properties), it generates a URIException with an
>>>>>>> 'unsupported scheme' message.
>>>>>>>
>>>>>>> We could do with some clean up in here.
Currently all URI > > > > > > exceptions > > > > > > > are lumped into URIException. We could add subclasses of > > URIE > > > > so > > > > > > the > > > > > > > non-errors get logged at a different level: e.g. FINE for > > > > > > unsupported > > > > > > > scheme exceptions. > > > > > > > > > > > > > > St.Ack > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > SPONSORED LINKS > > > > > > Computer security > > > > > > <http://groups.yahoo.com/gads? > > > > > > t=ms&k=Computer+security&w1=Computer+security&w2=Computer+training&c=2 > > > > &s=46&.sig=BHmcxBg5sKfN9-gcWnJWDg> > > > > > > Computer training > > > > > > <http://groups.yahoo.com/gads? > > > > > > t=ms&k=Computer+training&w1=Computer+security&w2=Computer+training&c=2 > > > > &s=46&.sig=v0JjJWA4s7mLnWQWdFxuTQ> > > > > > > > > > > > > > > > > > > > > > > > > -------------------------------------------------------------- > > ---- > > > > ------ > > > > > > YAHOO! GROUPS LINKS > > > > > > > > > > > > * Visit your group "archive-crawler > > > > > > <http://groups.yahoo.com/group/archive-crawler>" on the > > web. > > > > > > > > > > > > * To unsubscribe from this group, send an email to: > > > > > > arc...@ya... > > > > > > <mailto:arc...@ya...? > > > > subject=Unsubscribe> > > > > > > > > > > > > * Your use of Yahoo! Groups is subject to the Yahoo! > > Terms of > > > > > > Service <http://docs.yahoo.com/info/terms/>. > > > > > > > > > > > > > > > > > > -------------------------------------------------------------- > > ---- > > > > ------ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > SPONSORED LINKS > > > > Computer security > > > > <http://groups.yahoo.com/gads? > > t=ms&k=Computer+security&w1=Computer+security&w2=Computer+training&c=2 > > &s=46&.sig=BHmcxBg5sKfN9-gcWnJWDg> > > > > Computer training > > > > <http://groups.yahoo.com/gads? 
> > t=ms&k=Computer+training&w1=Computer+security&w2=Computer+training&c=2 > > &s=46&.sig=v0JjJWA4s7mLnWQWdFxuTQ> > > > > > > > > > > > > > > > > ------------------------------------------------------------------ > > ------ > > > > YAHOO! GROUPS LINKS > > > > > > > > * Visit your group "archive-crawler > > > > <http://groups.yahoo.com/group/archive-crawler>" on the web. > > > > > > > > * To unsubscribe from this group, send an email to: > > > > arc...@ya... > > > > <mailto:arc...@ya...? > > subject=Unsubscribe> > > > > > > > > * Your use of Yahoo! Groups is subject to the Yahoo! Terms of > > > > Service <http://docs.yahoo.com/info/terms/>. > > > > > > > > > > > > ------------------------------------------------------------------ > > ------ > > > > > > > > > > > > > > > > > > > > > SPONSORED LINKS > > Computer security > > <http://groups.yahoo.com/gads?t=ms&k=Computer+security&w1=Computer+security&w2=Computer+training&c=2&s=46&.sig=BHmcxBg5sKfN9-gcWnJWDg> > > Computer training > > <http://groups.yahoo.com/gads?t=ms&k=Computer+training&w1=Computer+security&w2=Computer+training&c=2&s=46&.sig=v0JjJWA4s7mLnWQWdFxuTQ> > > > > > > > > ------------------------------------------------------------------------ > > YAHOO! GROUPS LINKS > > > > * Visit your group "archive-crawler > > <http://groups.yahoo.com/group/archive-crawler>" on the web. > > > > * To unsubscribe from this group, send an email to: > > arc...@ya... > > <mailto:arc...@ya...?subject=Unsubscribe> > > > > * Your use of Yahoo! Groups is subject to the Yahoo! Terms of > > Service <http://docs.yahoo.com/info/terms/>. > > > > > > ------------------------------------------------------------------------ > > > > > > ------------------------------------------------------- > This SF.Net email is sponsored by xPML, a groundbreaking scripting language > that extends applications into web and mobile media. Attend the live webcast > and join the prime developer group breaking into this new coding territory! 
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642 > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
|
From: Kaisa K. <kau...@cs...> - 2006-02-28 12:28:29
|
Hello, can you listen to music files using Wera if you have various music formats in your archive? kaisa |
|
From: stack <st...@ar...> - 2006-02-23 18:31:23
|
stack wrote:
> stack wrote:
>> (Forwarded discussion from the Heritrix list)
>>
>> ------------------------------------------------------------------------
>> ...
>> Generally, the pages are shown fine with the exception of
>> javascripts that are retrieved from the live site instead of our arc
>> files. Also, WERA is unable to dynamically replace the links inside
>> the javascripts.
>>
Related, here are all current issues regarding JS (the first originally
reported by Charles of LU):

"[ 1312214 ] [wera/wayback js] More redirects to llive web (look at it)."
https://sourceforge.net/tracker/index.php?func=detail&aid=1312214&group_id=118427&atid=681137

"[ 1280447 ] [wera/wayback js] Link rewritng not working well for frames"
https://sourceforge.net/tracker/index.php?func=detail&aid=1280447&group_id=118427&atid=681137

"[ 1421112 ] WERA web page display Menus in JS"
https://sourceforge.net/tracker/index.php?func=detail&aid=1421112&group_id=118427&atid=681137

St.Ack |
|
From: stack <st...@ar...> - 2006-02-23 17:40:55
|
Charles Foetz wrote:
> Hello St.Ack, Lukas and everyone else,

Good to hear from you again Charles. We still owe you a response to the
long list of issues you found in the WERA+NutchWAX combo. A good few
have been addressed but others still remain.

> Long time since I posted any news concerning Luxembourg's web
> archiving efforts - as you know, we are very limited in human
> resources (we only have 2 IT people at the national library) and
> therefore need to find a balance between many different projects.
>
> Last time we were forced to put our web archiving project on hold due
> to the known limitations of the WERA access tool (no canonicalization
> of URLs, no handling of redirects, encoding issues)... As a prototype
> project we had archived, at several dates, the sites of 7 political
> parties during local elections. The two limitations above made it
> impossible for WERA to access most parts of 4 out of these 7 archived
> sites (links to "http://site.com" instead of "http://www.site.com"
> were quite common, for instance), so we had pretty much nothing
> to "show" and didn't go further than the prototype.

Your report on canonicalization failures was captured as this issue:
https://sourceforge.net/tracker/index.php?func=detail&aid=1312202&group_id=118427&atid=681137.
We should make WERA requery with the 'www' stripped (or prepended) if it
gets a 404 out of the index.

> I am now wondering what the plans are for WERA... are the issues above
> likely to be fixed any time soon or are they considered low priority?
> Is a new release planned or is the focus on other tools at the moment
> (I realise you guys also struggle on many fronts at the same time)?
> If you think we could help on the development of WERA ourselves and
> maybe should have a go at trying to fix the issues above, let me
> know.
> Another question: via the archive-access-cvs list, I noticed a lot of
> updates on the wayback project. What is this project? An open-source
> implementation of the Wayback Machine (I've heard this mentioned
> before)? Has there been a release and at which stage is it? alpha?
> beta? working version? Where does this project fit in? Should it be
> seen as an alternative to WERA?
>
Sverre knows the WERA story best. I'll let him speak to the above.
Would be sweet if we could fix enough for you to launch at least a
prototype.

The long term plan is to transition from WERA on to the new wayback.
Sverre points this out in the section on the future of WERA at the end
of the 'What is WERA?' document:
http://archive-access.sourceforge.net/projects/wera/articles/what-is-wera.html#N100AE.

For a description of the new wayback, see
http://archive-access.sourceforge.net/projects/wayback/. The front page
does a good job situating the project. It's alpha software currently,
though a pending release will move it past this designation (let me kick
our Brad and get him to introduce the wayback on this list). Wayback is
currently focusing on scaling and on being able to act as a replacement
for the http://web.archive.org wayback for small collections.

IMO, we're a ways yet from the wayback replacing WERA. While it already
has capabilities in excess of the WERA+ARCRetriever combination in
certain regards, its focus is elsewhere -- at least for now -- and it
lacks core WERA UI functionality, the quality documentation, and the
sweet installer.

St.Ack

> Best regards,
>
> Charlie Foetz
> Bibliothèque nationale Luxembourg
> Specialist in electronic information management
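The 'www' fallback stack suggests above could look roughly like the
following sketch. This is hypothetical illustration only, not WERA code:
the index lookup itself is stubbed out, and only the host-variant
generation is shown.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the suggested fallback: if a lookup for a URL 404s in the
// index, retry with 'www.' stripped (or prepended) before giving up.
public class HostFallback {

    // Given "http://site.com/path", return the variants to try in order:
    // the original URL first, then the same URL with 'www.' prepended
    // (or stripped, if it was already there). Assumes a scheme://host form.
    static List<String> candidates(String url) {
        List<String> tries = new ArrayList<>();
        tries.add(url);
        int start = url.indexOf("://") + 3;
        int end = url.indexOf('/', start);
        if (end < 0) end = url.length();
        String host = url.substring(start, end);
        String altHost = host.startsWith("www.")
            ? host.substring(4)          // strip leading 'www.'
            : "www." + host;             // or prepend it
        tries.add(url.substring(0, start) + altHost + url.substring(end));
        return tries;
    }

    public static void main(String[] args) {
        // prints [http://site.com/index.html, http://www.site.com/index.html]
        System.out.println(candidates("http://site.com/index.html"));
    }
}
```

A real implementation would walk this list, returning the first variant
that hits in the index and a 404 only if all miss.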
|
From: Charles F. <Cha...@bn...> - 2006-02-23 09:43:36
|
Hello St.Ack, Lukas and everyone else,

Long time since I posted any news concerning Luxembourg's web archiving
efforts - as you know, we are very limited in human resources (we only
have 2 IT people at the national library) and therefore need to find a
balance between many different projects.

Last time we were forced to put our web archiving project on hold due to
the known limitations of the WERA access tool (no canonicalization of
URLs, no handling of redirects, encoding issues)... As a prototype
project we had archived, at several dates, the sites of 7 political
parties during local elections. The two limitations above made it
impossible for WERA to access most parts of 4 out of these 7 archived
sites (links to "http://site.com" instead of "http://www.site.com" were
quite common, for instance), so we had pretty much nothing to "show" and
didn't go further than the prototype.

I am now wondering what the plans are for WERA... are the issues above
likely to be fixed any time soon or are they considered low priority? Is
a new release planned or is the focus on other tools at the moment (I
realise you guys also struggle on many fronts at the same time)?

If you think we could help on the development of WERA ourselves and
maybe should have a go at trying to fix the issues above, let me know.

Another question: via the archive-access-cvs list, I noticed a lot of
updates on the wayback project. What is this project? An open-source
implementation of the Wayback Machine (I've heard this mentioned
before)? Has there been a release and at which stage is it? alpha? beta?
working version? Where does this project fit in? Should it be seen as an
alternative to WERA?

Best regards,

Charlie Foetz
Bibliothèque nationale Luxembourg
Specialist in electronic information management |
|
From: stack <st...@ar...> - 2006-02-23 06:37:27
|
stack wrote:
> (Forwarded discussion from the Heritrix list)
>
> ------------------------------------------------------------------------
> ...
> Generally, the pages are shown fine with the exception of
> javascripts that are retrieved from the live site instead of our arc
> files. Also, WERA is unable to dynamically replace the links inside
> the javascripts.
>
This leaking to the live web is a difficult problem. Perhaps this
particular JS can be fixed in WERA but, given the variety of ways in
which JS can be conjured, it's unlikely all permutations will be guarded
against.

> Sample case: http://nwa.nb.no/wera/result.php?time=&url=http%3A%2F%
> 2Fwww.nla.gov.au%2F&query=nla.
>
> Check the properties of the webpage to verify that you are still
> within "http://nwa.nb.no". Click on the "Exquisite Watercolors" link
> and verify that we are still viewing the arcfiles. Go back a page and
> try any menu links. When you view the properties, it will show that
> you are indeed browsing the live site instead of the arcfiles.
>
Have you considered setting your browser to go to your collection via a
proxy? (I don't think this mode is supported yet in WERA. I think it's
possible to set the wayback into a proxy mode.) The proxy could ensure
you never strayed off your ARC collection, returning errors if a
resource is not found.

> What I wanted to accomplish are the following:
> 1) Help WERA load the javascripts from our arcfiles instead of the
> live sites by modifying the loading of the scripts from the html.
> Instead of the relative /js/xxxx.js, we will change it to
> http://localhost/wera/......../js/xxxx.js.
(Sverre or Brad: Does the JS inserted at the end of the page by WERA,
adding a base to the page, not affect such JS URLs?)
> 2) Modify the relative links inside javascript files if WERA is not
> capable of dynamically modifying them also.
Would be sweet if any modifications you'd do in the rewriting of the ARC files was instead done for you by WERA (or wayback). St.Ack > > I am planning to use the dk.netarkivet.ArcUtils for this task. > > I know that my problem is a little bit off topic but I hope you could > give additional tips. > > Thanks again in advance. > > --- In arc...@ya..., stack <stack@...> wrote: > > > > alxartes wrote: > > > St. Ack thanks again for the reply. > > > > > > Most of the pages are not displayed the way it should be when > viewed > > > from the source. > > > > At this time, are you viewing the pages with WERA? Or how are they > > being viewed? > > > I guess it is because the css and javascripts file > > > are not being fetched properly at the loading of the html from the > > > arcfile. We arrived at this conclusion since we can directly > > > retrieved the css and js through WERA. > > So pages are showing fine when viewed with WERA (generally)? > > > > > > I am planning on modifying the htmls inside the arcfiles to > correct > > > this problem. > > I'm trying to understand. You want to rewrite ARC files changing > all > > links so they point back into ARCs (or back to a disk populated > with the > > documents from a set of ARCs)? You do not want to use WERA viewing > pages? > > > > > What tool can I use to expand the arcfiles so that I > > > can modify the files inside? and a tool that will bring the > arcfile > > > together once again? I think this is somewhat out of topic but I > am a > > > little bit out of time and would greatly appreciate any inputs. > > This section from dev. manual might be of use: > > http://crawler.archive.org/articles/developer_manual.html#arcs. > Talks > > about tools for reading ARCs. > > > > One approach would subclass ARCReader. This will get you a stream > onto > > ARCs > > > (http://crawler.archive.org/apidocs/org/archive/io/arc/ARCReader.html) > <http://crawler.archive.org/apidocs/org/archive/io/arc/ARCReader.html%29> > .. 
> > Use the adjacent ARCWriter to write new ARCs. Modifying the links > in > > pages, you'll first have to find them. You could start with the > > Extractors that are in Heritrix subclassing them to add a link > rewrite > > functionality. Such a tool has been asked for on this list in the > past > > but its a bit of job and in the end, you'll never successfully be > able > > to rewrite all links (Think URLs produced by JS in the page). > > > > Will a WERA (or the coming wayback, > > http://archive-access.sourceforge.net/projects/wayback/) > <http://archive-access.sourceforge.net/projects/wayback/%29> not > suffice? > > > > Yours, > > St.Ack > > > > > > > > > > > > Thanks again. > > > > > > > > > > > > --- In arc...@ya..., stack <stack@> wrote: > > > > > > > > alxartes wrote: > > > > > Thanks St. Ack. > > > > > > > > > > It is really worisome to see those errors especially when we > are > > > not > > > > > viewing the arcfiles properly in Wera. > > > > > > > > Can you say more about what 'not viewing the arcfiles properly > in > > > > Wera'? Are pages not being found or are missing > images/stylesheets? > > > > > > > > Regards the local-errors.log, I've upped priority on an RFE that > > > > proposes cleaning this log (and added your experience to the > > > issue): > > > > http://sourceforge.net/tracker/index.php? > > > func=detail&aid=1091580&group_id=73833&atid=539099. 
> > > > > > > > > > Here is an excerpt from the crawl.log: > > > > > > > > > > 84046144 http://www.hdb.gov.sg/hdbwww/ownkvb.mpeg > > > > > 84046144 http://www5.hdb.gov.sg/hdbwww/ownkvb.mpeg > > > > > 84046144 http://www7.hdb.gov.sg/hdbwww/ownkvb.mpeg > > > > > 84046144 https://www5.hdb.gov.sg/hdbwww/ownkvb.mpeg > > > > > 47097784 http://www.hdb.gov.sg/hdbwww/ownkvs.mpeg > > > > > 47097784 http://www5.hdb.gov.sg/hdbwww/ownkvs.mpeg > > > > > 47097784 http://www7.hdb.gov.sg/hdbwww/ownkvs.mpeg > > > > > 47097784 https://www5.hdb.gov.sg/hdbwww/ownkvs.mpeg > > > > > 22292823 http://www.hdb.gov.sg/hdbwww/fallingwindow.wmv > > > > > 22292823 http://www5.hdb.gov.sg/hdbwww/fallingwindow.wmv > > > > > 22292823 http://www7.hdb.gov.sg/hdbwww/fallingwindow.wmv > > > > > 22292823 https://www5.hdb.gov.sg/hdbwww/fallingwindow.wmv > > > > > > > > > > As you can see, a certain file is crawled 4 times. I have done > > > this > > > > > crawl using domain scope. Would pathscope with a seed of > > > > > http://www.hdb.gov.sg prevent the other sites to being > crawled? If > > > > > not, are there other ways to prevent it from happening? > > > > > > > > Yeah, the domain scope warns: "It will however reach subdomains > of > > > the > > > > seeds' original domains. www[#].host is considered to be the > same > > > as > > > > host." Explicitly stating 'www.hdb.gov.sg' doesn't look like it > > > will > > > > avoid the problem either reading the code. > > > > > > > > FYI, we're moving away from *scope scopes -- i.e. domainscope, > > > > pathscope, etc. -- toward decidingscope. The latter gives > > > you "more > > > > rope" designing scopes. > > > > > > > > It looks like the On*DecideRule though has same issue > with 'www'. > > > Looks > > > > like you can write a SURT form, something > like '(sg,gov,hdb,www)', > > > that > > > > will only include URIs with a host of 'www.hdb.gov.sg' (though > it > > > looks > > > > like http and https are flattened to be same scheme). 
> > > > > > > > I'll let others -- Igor or Gordon? -- respond. They can give a > > > better > > > > quality answer than I. > > > > > > > > Good stuff, > > > > St.Ack > > > > > > > > > > > > > > Thank you so much for your time. > > > > > > > > > > > > > > > > > > > > --- In arc...@ya..., stack <stack@> wrote: > > > > > > > > > > > > alxartes wrote: > > > > > > > Hi, > > > > > > > > > > > > > > I am investigating the log files of my crawls and found > the > > > error > > > > > > > below. I hope someone could explain what this means > because > > > the > > > > > other > > > > > > > javascripts are crawled fine. > > > > > > > > > > > > > > 2006-02-15T03:35:21.747Z > > > > > > > > > > http://www.macromedia.com/uber/js/omniture_s_code.js "Unsupported > > > > > > > scheme: javascript" > > > > > > > > > > > > > > > > javascript:,macromedia,dreamweaver,flash,shockwave,sdc,markme,sdc.shoc > > > > > kw > > > > > > > ave,infopoll,developerlocator.macromedia > > > > > > > > > > > > In short, the above is just stating that Heritrix does not > > > support > > > > > > fetching > the 'URI' "javascript:,macromedia,dreamweaver...". Its > > > > > not an > > > > > > 'error'. > > > > > > > > > > > > Heritrix is regexing over the content of > > > > > > 'http://www.macromedia.com/uber/js/omniture_s_code.js' > <http://www.macromedia.com/uber/js/omniture_s_code.js%27> > > > <http://www.macromedia.com/uber/js/omniture_s_code.js%27> > > > > > <http://www.macromedia.com/uber/js/omniture_s_code.js%27> > looking > > > for > > > > > > URIs. It found the string > > > > > > > > > > > > > > > > "javascript:,macromedia,dreamweaver,flash,shockwave,sdc,markme,sdc.s > > > > > hockwave,infopoll,developerlocator.macromedia" > > > > > > > > > > > > > > > > > > To the Heritrix regex, the above string looks like a likely > URI. > > > > > Its > > > > > > inside quotes > > > > > > and starts with what could be an URI scheme > > > (i.e. 'javascript:'). 
> > > > > > So, the candidate URI is passed to our URI parser class,
> > > > > > org.archive.net.UURIFactory. This class takes configuration in
> > > > > > heritrix.properties about which URI schemes Heritrix will
> > > > > > accept. Here's the relevant extract:
> > > > > >
> > > > > > ##############################################################
> > > > > > #                        U U R I                             #
> > > > > > ##############################################################
> > > > > > # Any scheme not listed in the below will generate an
> > > > > > # UnsupportedUriScheme exception. Make the list empty to
> > > > > > # support all schemes.
> > > > > > org.archive.net.UURIFactory.schemes = http, https, dns, invalid
> > > > > >
> > > > > > (We don't currently have an 'UnsupportedUriScheme' exception.
> > > > > > We should add one).
> > > > > >
> > > > > > Here is where the test is done:
> > > > > > http://crawler.archive.org/xref/org/archive/net/UURIFactory.html#443
> > > > > >
> > > > > > Because the 'javascript' scheme is not in the above supported
> > > > > > schemes list (nor in the list of schemes to ignore which
> > > > > > appears later in heritrix.properties), it generates a
> > > > > > URIException with an 'unsupported scheme' message.
> > > > > >
> > > > > > We could do with some cleanup in here. Currently all URI
> > > > > > exceptions are lumped into URIException. We could add
> > > > > > subclasses of URIException so the non-errors get logged at a
> > > > > > different level: e.g. FINE for unsupported scheme exceptions.
> > > > > >
> > > > > > St.Ack |
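The scheme check described in that thread can be pictured with a
stand-alone sketch. This is a simplified illustration, not the real
UURIFactory code: a candidate 'URI' string pulled out of JavaScript is
rejected when its scheme is not on the configured allowlist.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the UURIFactory behavior discussed above:
// candidate URIs whose scheme is not on the allowlist are reported as
// unsupported (which is what shows up in crawl.log).
public class SchemeCheck {

    // Mirrors the default from the heritrix.properties extract above.
    static final List<String> SCHEMES =
        Arrays.asList("http", "https", "dns", "invalid");

    static String check(String candidate) {
        int colon = candidate.indexOf(':');
        if (colon <= 0) {
            return "no scheme";
        }
        String scheme = candidate.substring(0, colon).toLowerCase();
        return SCHEMES.contains(scheme)
            ? "supported scheme: " + scheme
            : "Unsupported scheme: " + scheme;
    }

    public static void main(String[] args) {
        // prints "supported scheme: http"
        System.out.println(check("http://www.macromedia.com/uber/js/omniture_s_code.js"));
        // prints "Unsupported scheme: javascript"
        System.out.println(check("javascript:,macromedia,dreamweaver,flash"));
    }
}
```

This is why the "javascript:,macromedia,..." string found by the link
extractor's regex is reported, harmlessly, as an unsupported scheme.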
|
From: stack <st...@ar...> - 2006-02-23 05:27:50
|
(Forwarded discussion from the Heritrix list) |
|
From: stack <st...@ar...> - 2006-02-17 16:49:52
|
Lukas Matejka wrote:
> first i made links through command 'setup' than i used 'segment' to create
> segments from arcs and than i wanted to use 'links' to process pages and
> links to webdb, but command 'links' uses
>
> ${NUTCH}/bin/nutch admin "${nutchdb}" -create
> before updating db from segments and updating back segments from db
>
> shall I create new WebDB or continue on an old one(for example disabling this
> creating command)?
>
>
For incremental indexing, you will want to keep updating the one webdb
rather than create it anew each incremental indexing, so yes, modify
the indexarcs.sh script so it doesn't invoke create of the webdb. You
will likely also need to change the steps that follow so that it passes
only the segments that are part of the incremental update set rather
than all segments (Currently it's written as 'segments/*').
Tell us more about the size of your incremental updates? How frequently
are you planning to do them and how much data are you adding? Our
experience trying to do frequent updates has not been good: index
merging and webdb updating all can take a long time to complete. Tell
me more about the rates of update you are considering and meantime I'll
try and get some figures on our experience posted.
The story should be better in the new Nutch, though I guess index
merging works much as it did before: a pure Lucene operation.
On the NutchWAX 0.6.0 release status -- a NutchWAX running on top of
MapReduce Nutch -- development is going well. We've been using a rack of 35 or so
(very) slow processors to test indexing collections of 100M and more.
We're having some robustness and performance issues but they are being
addressed. We're still looking at an end-of-March/start-of-April
release. Will keep the list posted.
St.Ack
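The segment-selection change suggested above might look like the
following sketch. The directory layout (a 'segments/' directory plus a
marker file touched after each successful update) and the bin/nutch
invocation are assumptions for illustration; the nutch command is echoed
rather than executed here.

```shell
#!/bin/sh
# Sketch of an incremental update driver: pass only the segments that
# are new since the last run, instead of the 'segments/*' glob.

# Print the segments modified since the marker file (all of them if the
# marker does not exist yet).
pick_new_segments() {
  marker=$1
  for seg in segments/*; do
    [ -d "$seg" ] || continue
    if [ ! -e "$marker" ] || [ "$seg" -nt "$marker" ]; then
      printf '%s\n' "$seg"
    fi
  done
}

newsegs=$(pick_new_segments last_update)
if [ -n "$newsegs" ]; then
  # Update the existing webdb -- no '-create' -- with the new segments only.
  echo bin/nutch updatedb db $newsegs
  touch last_update
fi
```

The key points from the thread are preserved: the webdb is created once
and then only updated, and each update step sees just the incremental
segment set.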
|
|
From: Lukas M. <mat...@ce...> - 2006-02-17 15:10:13
|
Hi,
I'm just testing incremental indexing and I want to ask for a little
help (really simple for you :)).
I've used the file nutchwax/bin/indexarc.sh...
First I made links through the command 'setup', then I used 'segment' to
create segments from the ARCs, and then I wanted to use 'links' to
process pages and links into the webdb, but the command 'links' uses

${NUTCH}/bin/nutch admin "${nutchdb}" -create

before updating the db from the segments and updating the segments back
from the db.
Shall I create a new WebDB or continue with the old one (for example, by
disabling this create command)?
Putting all the indexes together is working well.
Thanks for the advice,
l.
|
|
From: Lukas M. <mat...@ce...> - 2006-02-14 17:22:33
|
> > > > > ---------- Forwarded message ----------
> From: stack <st...@ar...>
> To: stack <st...@ar...>
> Date: Sat, 11 Feb 2006 10:59:07 -0800
> Subject: Re: [Archive-access-discuss] Re: nutchwax
>
Lukáš:
>
> I committed code to undo any html entity encoding found in text to be
> emitted by OpenSearchServlet. I committed on the nutchwax 'release-0_4'
> branch so be careful you get this branch from CVS rather than HEAD if
> building from source. If you just want the WAR with the fix, it's
> available here: http://archive.org/~stack/nutchwax.war. Let me know if
> you want me to make up a complete nutchwax tarball. Let me know if the
> fix works for you (Here's the bug:
> https://sourceforge.net/tracker/index.php?func=detail&aid=1429788&group_id=118427&atid=681137).

it works very well! good work. I've just downloaded nutchwax.war and ..
it seems to be ok :)

-lm

> This is a band-aid fix until the core issue gets addressed in nutch.
> I'll work on trying to get this done this week.
>
> This is a pretty serious issue. Text snippets -- i.e. the 'description'
> field in the XML -- that have anything but plain ASCII are mangled,
> showing ugly numeric character representations, 'ŗ', etc., in place
> of legit UTF-8 characters. It was also possible to by-pass our
> legit-xml character checking by encoding illegal characters: e.g. ''.
> If the fix works for you Lukáš, I'll make a new release of nutchwax with
> the band-aid incorporated later this week (hopefully by the release of
> the 0.6.0 mapreduce version of NutchWAX, it will have the real fix
> incorporated).
>
> Good stuff,
> St.Ack
>
> stack wrote:
> > Lukas Matejka wrote:
> >> On Thu 9 February 2006 18:51, stack wrote:
> >>> ....
> >>> I see the 0x07 Bell character in the original page. Below is an 'od'
> >>> dump of the relevant section with the ascii line underwritten by its
> >>> hex representation. The last line has the 0x07 character.
> >>
> >> you're absolutely right with the bell character, but I think there is
> >> another different thing. I'll try to explain.
> >>
> >> I will search for the word 'kniha' (which means book) through
> >> http://war.mzk.cz:8080/nutchwax/opensearch?query=kniha&start=0&hitsPerPage=10&hitsPerDup=1&dedupField=exacturl
> >>
> >> The answer is valid XML (57472 hits for the word kniha), but in the
> >> result, in the entity 'description', there are html entities that
> >> represent czech characters with diacritics, and that's the problem.
> >> The original site doesn't contain these html entities but regular
> >> czech characters.
> >>
> >> The interesting thing is that the entity 'title' shows czech
> >> characters well, but the entity 'description' shows html entities
> >> (for instance the html entity &#253; represents the special
> >> character ý, y with an acute accent).
> >>
> >> Have you any idea where the problem could be?
> >
> > Thanks for the extra info Lukas.
> >
> > Digging in, I see that the generation of summaries runs the text
> > through org.apache.nutch.html.Entities. Here is the code for the
> > Entities#encode method that all summary text is run through:
> >
> >   static final public String encode(String s) {
> >     int length = s.length();
> >     StringBuffer buffer = new StringBuffer(length * 2);
> >     for (int i = 0; i < length; i++) {
> >       char c = s.charAt(i);
> >       int j = (int)c;
> >       if (j < 0x100 && encoder[j] != null) {
> >         buffer.append(encoder[j]);   // have a named encoding
> >         buffer.append(';');
> >       } else if (j < 0x80) {
> >         buffer.append(c);            // use ASCII value
> >       } else {
> >         buffer.append("&#");         // use numeric encoding
> >         buffer.append((int)c);
> >         buffer.append(';');
> >       }
> >     }
> >     return buffer.toString();
> >   }
> >
> > Any character that is super-ASCII gets a numeric character encoding.
> > Assuming all is UTF-8 in nutch, then we probably don't want HTML
> > entity encoding when we're outputting UTF-8 XML. In fact, it looks
> > like we don't want any html entity encoding at all when outputting
> > XML.
> > > > The call to Entities#encode is buried in nutch inside the Fragment > > inner class of Summary. It would take a good bit of work making up a > > NutchBean that called an alternate Summary-maker when outputting XML. > > > > Meantime, I have a quick fix that adds HTML entity decoding to the > > Nutchwax OpenSearchServlet. Let me do some more testing and hopefully > > I can commit later today. I'll let you know. > > > > St.Ack > > > > > > ------------------------------------------------------- > > This SF.net email is sponsored by: Splunk Inc. Do you grep through log > > files > > for problems? Stop! Download the new AJAX search engine that makes > > searching your log files as easy as surfing the web. DOWNLOAD SPLUNK! > > http://sel.as-us.falkag.net/sel?cmd=3Dk&kid=103432&bid#0486&dat=121642 > > _______________________________________________ > > Archive-access-discuss mailing list > > Arc...@li... > > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > ------------------------------------------------------- > This SF.net email is sponsored by: Splunk Inc. Do you grep through log > files for problems? Stop! Download the new AJAX search engine that makes > searching your log files as easy as surfing the web. DOWNLOAD SPLUNK! > http://sel.as-us.falkag.net/sel?cmdlnk&kid=103432&bid#0486&dat=121642 > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss =2D-=20 =2D----------------------------- Bc.Lukas Matejka email:mat...@ce... GSM:+420777093233 |
|
From: stack <st...@ar...> - 2006-02-11 19:00:22
|
Lukáš:

I committed code to undo any HTML entity encoding found in text to be
emitted by OpenSearchServlet. I committed on the nutchwax 'release-0_4'
branch, so be careful you get this branch from CVS rather than HEAD if
building from source. If you just want the WAR with the fix, it's
available here: http://archive.org/~stack/nutchwax.war. Let me know if
you want me to make up a complete nutchwax tarball. Let me know if the
fix works for you (here's the bug:
https://sourceforge.net/tracker/index.php?func=detail&aid=1429788&group_id=118427&atid=681137).

This is a band-aid fix until the core issue gets addressed in nutch.
I'll work on trying to get this done this week.

This is a pretty serious issue. Text snippets -- i.e. the 'description'
field in the XML -- that have anything but plain ASCII are mangled,
showing ugly numeric character representations, '&#343;' (which renders
as 'ŗ'), etc., in place of legitimate UTF-8 characters. It was also
possible to bypass our legit-XML character checking by encoding illegal
characters, e.g. '&#7;' (the Bell). If the fix works for you Lukáš,
I'll make a new release of nutchwax with the band-aid incorporated
later this week (hopefully by the release of the 0.6.0 mapreduce
version of NutchWAX, it will have the real fix incorporated).

Good stuff,
St.Ack
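The band-aid described above — undoing entity encoding in the servlet's output — might look roughly like the following sketch. The class and method names are made up for illustration; this is not the actual committed NutchWAX code, and it handles only the numeric (`&#NNN;`) form that nutch's `Entities#encode` emits for non-Latin-1 characters.

```java
// Illustrative sketch only: undo numeric character references
// ("&#253;" -> 'ý') so the servlet can emit the raw characters
// in its UTF-8 XML instead. BMP code points only; anything that
// is not a well-formed numeric reference is copied through as-is.
public final class EntityDecoder {
    public static String decodeNumericRefs(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '&' && i + 2 < s.length() && s.charAt(i + 1) == '#') {
                int end = s.indexOf(';', i + 2);
                if (end > i + 2) {
                    try {
                        int code = Integer.parseInt(s.substring(i + 2, end));
                        out.append((char) code);
                        i = end + 1;
                        continue;
                    } catch (NumberFormatException ignored) {
                        // not a numeric reference; fall through, copy '&'
                    }
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }
}
```

Named entities (`&yacute;` and friends, which `Entities#encode` also produces) would additionally need a lookup table; the sketch covers only the numeric form.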
|
From: stack <st...@ar...> - 2006-02-10 19:33:03
|
Lukas Matejka wrote:
> On Thu, 9 Feb 2006 18:51, stack wrote:
>> ....
>> I see the 0x07 Bell character in the original page. Below is an 'od'
>> dump of the relevant section with the ascii line underwritten by its
>> hex representation. The last line has the 0x07 character.
>
> You're absolutely right about the bell character, but I think there is
> another, different thing going on. I'll try to explain.
>
> I will search for the word 'kniha' (which means book) through
> http://war.mzk.cz:8080/nutchwax/opensearch?query=kniha&start=0&hitsPerPage=10&hitsPerDup=1&dedupField=exacturl
>
> The answer is valid XML (57472 hits for the word kniha), but in the
> result, the 'description' entity contains HTML entities that represent
> czech characters with diacritics, and that's the problem. The original
> site doesn't contain these HTML entities, just regular czech characters.
>
> The interesting thing is that the 'title' entity shows czech characters
> fine, but the 'description' entity shows them as HTML entities (for
> instance, the HTML entity &yacute; represents the special character y
> with an acute accent).
>
> Have you any idea where the problem could be?

Thanks for the extra info Lukas.

Digging in, I see that the generation of summaries runs the text through
org.apache.nutch.html.Entities. Here is the code for the Entities#encode
method that all summary text is run through:

static final public String encode(String s) {
  int length = s.length();
  StringBuffer buffer = new StringBuffer(length * 2);
  for (int i = 0; i < length; i++) {
    char c = s.charAt(i);
    int j = (int)c;
    if (j < 0x100 && encoder[j] != null) {
      buffer.append(encoder[j]);  // have a named encoding
      buffer.append(';');
    } else if (j < 0x80) {
      buffer.append(c);           // use ASCII value
    } else {
      buffer.append("&#");        // use numeric encoding
      buffer.append((int)c);
      buffer.append(';');
    }
  }
  return buffer.toString();
}

Any character that is super-ASCII gets a numeric character encoding.
Assuming all is UTF-8 in nutch, then we probably don't want HTML entity
encoding when we're outputting UTF-8 XML. In fact, it looks like we
don't want any HTML entity encoding at all when outputting XML.

The call to Entities#encode is buried in nutch inside the Fragment inner
class of Summary. It would take a good bit of work making up a NutchBean
that called an alternate Summary-maker when outputting XML.

Meantime, I have a quick fix that adds HTML entity decoding to the
Nutchwax OpenSearchServlet. Let me do some more testing and hopefully I
can commit later today. I'll let you know.

St.Ack
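Following the analysis above, if the servlet built its own XML, entity encoding could be dropped entirely: in UTF-8 output only the XML-reserved characters need replacing, and accented characters can be emitted as-is. A hypothetical escaper along those lines (illustrative, not nutch code):

```java
// Escape only what XML itself requires; 'ý', 'š', etc. pass
// through untouched because the output stream is already UTF-8.
public final class XmlText {
    public static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

With something like this in place, the czech diacritics Lukas reported would appear in the 'description' field exactly as they do in 'title'.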
|
From: Lukas M. <mat...@ce...> - 2006-02-10 14:52:20
|
On Thu, 9 Feb 2006 18:51, stack wrote:
> Lukáš Matějka wrote:
>> Hi,
>>
>> I still can't handle this issue..
>
> Pardon the late reply Lukáš.
>
> Here seems to be a page with problematic characters:
> http://dig.vkol.cz/vz/vz01_12.htm. I get it by following Sverre's
> recipe below, adding hitsPerDup=0, etc.
>
> If I get the page via the opensearchservlet, firefox complains about
> the '&#7;' character in the description field. The ascii 'Bell'
> character is illegal in XML, even though it's represented by a numeric
> character reference (here's the grammar for XML Char:
> http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char).
>
> I see the 0x07 Bell character in the original page. Below is an 'od'
> dump of the relevant section with the ascii line underwritten by its
> hex representation. The last line has the 0x07 character.

You're absolutely right about the bell character, but I think there is
another, different thing going on. I'll try to explain.

I will search for the word 'kniha' (which means book) through
http://war.mzk.cz:8080/nutchwax/opensearch?query=kniha&start=0&hitsPerPage=10&hitsPerDup=1&dedupField=exacturl

The answer is valid XML (57472 hits for the word kniha), but in the
result, the 'description' entity contains HTML entities that represent
czech characters with diacritics, and that's the problem. The original
site doesn't contain these HTML entities, just regular czech characters.

The interesting thing is that the 'title' entity shows czech characters
fine, but the 'description' entity shows them as HTML entities (for
instance, the HTML entity &yacute; represents the special character y
with an acute accent).

Have you any idea where the problem could be?

l.

--
------------------------------
Bc. Lukas Matejka
email: mat...@ce...
GSM: +420777093233
|
From: stack <st...@ar...> - 2006-02-09 17:52:17
|
Lukáš Matějka wrote:
> Hi,
>
> I still can't handle this issue..

Pardon the late reply Lukáš.

Here seems to be a page with problematic characters:
http://dig.vkol.cz/vz/vz01_12.htm. I get it by following Sverre's recipe
below, adding hitsPerDup=0, etc.

If I get the page via the opensearchservlet, firefox complains about the
'&#7;' character in the description field. The ascii 'Bell' character is
illegal in XML, even though it's represented by a numeric character
reference (here's the grammar for XML Char:
http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char).

I see the 0x07 Bell character in the original page. Below is an 'od'
dump of the relevant section with the ascii line underwritten by its hex
representation. The last line has the 0x07 character.

....
0008832   8   0   .   <   /   p   >   <   p   >  nl   <   b   >   M   K
              3830 2e3c 2f70 3e3c 703e 0a3c 623e 4d4b
0008848  sp  c8   R   <   /   b   >  sp   z   a  f8   a   d   i   l   o
              20c8 523c 2f62 3e20 7a61 f861 6469 6c6f
0008864  sp   n   a  sp   s   e   z   n   a   m  sp   n   e   j   c   e
              206e 6120 7365 7a6e 616d 206e 656a 6365
0008880   n   n  ec   j  b9  ed   c   h  sp   d   o   k   l   a   d  f9
              6e6e ec6a b9ed 6368 2064 6f6b 6c61 64f9
0008896   .  bel  sp   /   B   i   b   l   e  sp   b   o   s   k   o   v
              2e07 202f 4269 626c 6520 626f 736b 6f76
...

Since the illegal character shows up in the description text as a
character reference, it has probably been encoded earlier in the
processing of the document.

Regardless, the opensearchservlet should probably look for such illegal
encodings and just strip them (it's doing this already for raw
characters). Let me try and fix this.

St.Ack

> Does anybody know how to help?
> Can NutchWAX produce output with HTML entities? (Output from NutchWAX
> should be UTF, shouldn't it?) Because (in the cases written below)
> invalid XML is caused by special characters in HTML entities.
>
> Thanks for any help
>
> -lm
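The stripping stack proposes above — dropping anything outside the XML 1.0 Char production — might look like this sketch for raw characters (names invented for illustration; not the actual servlet code):

```java
// Illustrative sketch: drop characters that are illegal in XML 1.0.
// The Char production allows #x9, #xA, #xD, [#x20-#xD7FF],
// [#xE000-#xFFFD] and [#x10000-#x10FFFF]; anything else (such as
// the 0x07 Bell found in the page above) is silently removed.
public final class XmlCharFilter {
    public static String stripIllegal(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            boolean legal = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (legal) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

For the encoded form ('&#7;' and the like), the same legality test would be applied to the value of each numeric character reference before letting it through.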
|
From: <mat...@ce...> - 2006-02-09 12:08:33
|
Hi,

I still can't handle this issue..

Does anybody know how to help?
Can NutchWAX produce output with HTML entities? (Output from NutchWAX
should be UTF, shouldn't it?) Because (in the cases written below)
invalid XML is caused by special characters in HTML entities.

Thanks for any help

-lm

______________________________________________________________
> From: sve...@nb...
> To: stack <st...@ar...>
> CC: Lukáš Matějka <mat...@ce...>
> Date: 12.01.2006 10:38
> Subject: Re: nutchwax
>
> Hi Michael, Lukáš ..
>
> On Thursday 12 January 2006 01:33, stack wrote:
>> Lukáš Matějka wrote:
> ...
>>> What's the difference between these cases?
>>>
>>> 1)
>>> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l+louck%C3%BD&start=0&hitsPerDup=0&hitsPerPage=10&dedupField=exacturl
>>> -> output is not valid XML (called from WERA)
>>>
>>> 2)
>>> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l%20louck%C3%BD&start=0&hitsPerPage=10&hitsPerDup=1&dedupField=exacturl
>>> -> output is valid XML (called from Nutchwax search.jsp)
>
> If I try the above URLs I find quite the opposite! Case 1 produces
> valid XML, case 2 produces invalid XML.
>
> Test results:
>
> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l+louck%C3%BD
> -> valid XML
>
> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l%20louck%C3%BD
> -> valid XML
>
> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l+louck%C3%BD&hitsPerDup=0&dedupField=exacturl
> -> valid XML
>
> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l%20louck%C3%BD&hitsPerDup=0&dedupField=exacturl
> -> valid XML
>
> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l+louck%C3%BD&hitsPerDup=1&dedupField=exacturl
> -> INVALID XML
>
> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l%20louck%C3%BD&hitsPerDup=1&dedupField=exacturl
> -> INVALID XML
>
> Setting hitsPerDup=2 results in valid XML
>
> Conclusion:
> A specific record in the index contains invalid XML chars, and it is
> only part of the result list when hitsPerDup=1. Setting hitsPerDup=0
> and start=10 will produce a result list including the record with the
> invalid XML chars.
>
> I don't know if the above was of any help to you, I just had to say
> something about it ;-)
>
> Sverre