From: Michael S. <st...@ar...> - 2006-07-03 22:59:08
João Cláudio Luzio wrote:
> Oops.. forgot to say that the arcs were on
> /var/local/webarchive/heritrix/jobs/bn_18_test-20060619172505727/
> directory but with .arc.gz instead of .arc.

This should be fine.

> João Cláudio Luzio wrote:
>> Hi,
>> I've been trying to get the pair up and running for a while now but
>> had some problems..
>> Using nutchwax 0.4.3 and the wera (0.4.2RC1 & 0.5.0) I managed to
>> get it running but some of the related files (images)
>> aren't displayed. Those get:
>> <retrievermessage>
>>   <head>
>>     <errorcode>4</errorcode>
>>     <errormessage>Unable to parse Archive Identifier</errormessage>
>>   </head>
>> </retrievermessage>
>> Using wera debug I found that the "[archiveidentifier] =>
>> 2770/IAH-20060619172903-00000-webarchive1" for a specific search I made.
>> (Starting tomcat from the nutchwax indexed data)

So, it generally works but some of the images don't show sometimes?

>> Using wayback I don't have the same problems (I don't use nutchwax
>> with wayback..).
>>
>> I've tried to get nutchwax 0.6.1 and wera running but the opensearch
>> servlet for the rss from nutchwax gives an exception..

Do you still have the exception?

>> So I tried nutchwax 0.7.0 (with latest hadoop - standalone), but now
>> the arcretriever gives an exception when trying to get the document.
>> Using wera debug I found that the "[archiveidentifier] =>
>> 2234331/filedesc://IAH-20060619172903-00000-webarchive1.arc" for the
>> same search I made.
>> (Starting tomcat from anywhere)
>>
>> The exception:
>> 7 Bad function argument Cause: java.io.FileNotFoundException:
>> /var/local/webarchive/heritrix/jobs/bn_18_test-20060619172505727/filedesc:/IAH-20060619172903-00000-webarchive1.arc
>> does not exist. Stack trace:
>> org.archive.io.arc.ARCUtils.isReadable(ARCUtils.java:171)
>> org.archive.io.arc.ARCUtils.testCompressedARCFile(ARCUtils.java:94)
>> org.archive.io.arc.ARCReaderFactory.get(ARCReaderFactory.java:200)
>> org.archive.io.arc.ARCReaderFactory.get(ARCReaderFactory.java:194)
>> no.nb.nwa.retriever.ARCRetriever.getDocument(ARCRetriever.java:410)
>> no.nb.nwa.retriever.ARCRetriever.doGet(ARCRetriever.java:131)

Looks like we shouldn't be putting the 'filedesc:' in front of the ARC
name? Does ARCRetriever work if you make a request with
IAH-20060619172903-00000-webarchive1.arc instead of
filedesc:/IAH-20060619172903-00000-webarchive1.arc?
St.Ack
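A minimal sketch of the cleanup St.Ack suggests (a hypothetical helper,
not ARCRetriever's actual code): strip any "filedesc:" scheme off the
archive identifier before it is resolved against the ARC directory.

    // Hypothetical helper illustrating the fix discussed above:
    // normalize an archive identifier before building a file path.
    public class ArcNameCleaner {
        static String stripFiledesc(String name) {
            // "filedesc://IAH-x.arc" or "filedesc:/IAH-x.arc" -> "IAH-x.arc"
            return name.replaceFirst("^filedesc:/*", "");
        }

        public static void main(String[] args) {
            System.out.println(stripFiledesc(
                    "filedesc://IAH-20060619172903-00000-webarchive1.arc"));
            // prints: IAH-20060619172903-00000-webarchive1.arc
        }
    }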
From: Natalia T. <nt...@ce...> - 2006-07-03 11:43:55
Hello
I tried to add the new job by moving the indexes directory before
starting the index process, and it works fine. Thanks!!
So, every time I want to index a new job I need to move the indexes
directory? If I move this directory, will the NutchWAX search still
work? This process takes many hours ...
Natalia
From: <jl...@ex...> - 2006-06-29 10:51:11
Oops.. forgot to say that the arcs were on
/var/local/webarchive/heritrix/jobs/bn_18_test-20060619172505727/
directory but with .arc.gz instead of .arc.

João Cláudio Luzio wrote:
> Hi,
> I've been trying to get the pair up and running for a while now but
> had some problems..
> [...]
> Thanks in advance,
> João Luzio
From: <jl...@ex...> - 2006-06-28 17:29:18
Hi,
I've been trying to get the pair up and running for a while now but
had some problems..

Using nutchwax 0.4.3 and the wera (0.4.2RC1 & 0.5.0) I managed to
get it running but some of the related files (images)
aren't displayed. Those get:

<retrievermessage>
  <head>
    <errorcode>4</errorcode>
    <errormessage>Unable to parse Archive Identifier</errormessage>
  </head>
</retrievermessage>

Using wera debug I found that the "[archiveidentifier] =>
2770/IAH-20060619172903-00000-webarchive1" for a specific search I made.
(Starting tomcat from the nutchwax indexed data)

Using wayback I don't have the same problems (I don't use nutchwax with
wayback..).

I've tried to get nutchwax 0.6.1 and wera running but the opensearch
servlet for the rss from nutchwax gives an exception..

So I tried nutchwax 0.7.0 (with latest hadoop - standalone), but now
the arcretriever gives an exception when trying to get the document.
Using wera debug I found that the "[archiveidentifier] =>
2234331/filedesc://IAH-20060619172903-00000-webarchive1.arc" for the
same search I made.
(Starting tomcat from anywhere)

The exception:

7 Bad function argument Cause: java.io.FileNotFoundException:
/var/local/webarchive/heritrix/jobs/bn_18_test-20060619172505727/filedesc:/IAH-20060619172903-00000-webarchive1.arc
does not exist. Stack trace:
org.archive.io.arc.ARCUtils.isReadable(ARCUtils.java:171)
org.archive.io.arc.ARCUtils.testCompressedARCFile(ARCUtils.java:94)
org.archive.io.arc.ARCReaderFactory.get(ARCReaderFactory.java:200)
org.archive.io.arc.ARCReaderFactory.get(ARCReaderFactory.java:194)
no.nb.nwa.retriever.ARCRetriever.getDocument(ARCRetriever.java:410)
no.nb.nwa.retriever.ARCRetriever.doGet(ARCRetriever.java:131)
....

Also using,
JDK 1.5.0_05
Tomcat 5.5.16
Heritrix 1.6.0

I have tried to figure it out but I'm not having any luck.. I'm a
newbie with these tools so I appreciate all the help I can get in
getting the latest nutchwax+wera setting going.

Thanks in advance,
João Luzio
From: Natalia T. <nt...@ce...> - 2006-06-26 16:46:13
Hello
Using this now it's running and I can index!! I can read the explanation
and "more from the site" links but I can't access the title.
I have a doubt about the "collectionsHost" variable. It points to the
server that will return the content of ARCs. I put the arc files (or
arc.gz files) directly on this server but the links on "title" and
"other versions" in the nutchwax search results don't work. Which
information does this server offer??
Thanks.
Natalia
From: Michael S. <st...@ar...> - 2006-06-23 20:27:29
Looks like it will be a while before I can get to a release. I'm out
next week. Meantime I just took this build for a test run:
http://crawltools.archive.org:8080/cruisecontrol/artifacts/HEAD-archive-access/20060623113807.
Fixes at least your collection issue. Requires hadoop 0.3.2.
Yours,
St.Ack

Natalia Torres wrote:
> when I index my jobs with nutchwax 0.6.1 I use this command
>
> hadoop jar /usr/local/nutchwax-0.6.1/nutchwax-0.6.1.jar all
> /data/inputs/ /data/outputs ciencia
>
> help explains that input, output, collection are required
> I put "ciencia" as collection name, not null, but listing search
> results this name is not included in the path ...
>
> If I edit the search.jsp page and add my collection name then search
> doesn't work (it doesn't recognize this collection).
>
> Natalia
From: Michael S. <st...@ar...> - 2006-06-23 15:33:46
Natalia Torres wrote:
> when I index my jobs with nutchwax 0.6.1 I use this command
>
> hadoop jar /usr/local/nutchwax-0.6.1/nutchwax-0.6.1.jar all
> /data/inputs/ /data/outputs ciencia
>
> help explains that input, output, collection are required
> I put "ciencia" as collection name, not null, but listing search
> results this name is not included in the path ...
>
> If I edit the search.jsp page and add my collection name then search
> doesn't work (it doesn't recognize this collection).

Just add it to the path that gets made as part of the unrolling of
search results. Put in place 'ciencia' instead of the value of
collection at that point -- around line 196 where we assign to the
archiveCollection value.

Collection name not being passed to the index is a bug. Looks like the
fix is not in 0.6.1. It was fixed 2006/05/12. I'll make a 0.6.2 --
hopefully today.
St.Ack

P.S. Regarding why the archives are not in place, from SF support:
Per the site status page:
( 2006-06-20 12:41:07 - Mailing List Service ) On 2006-06-20 the Mailing
List Archives were taken down for preventative maintenance that occurs
about once every two years. We expect the duration of this downtime to
last between 1 to 3 days.
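A runnable sketch of the substitution being described (the variable
name archiveCollection comes from the message above; everything else,
including the path layout, is illustrative rather than the actual
search.jsp code):

    public class CollectionPathFix {
        public static void main(String[] args) {
            // The value unrolled from the search hit; null is the bug
            // discussed in this thread.
            String collectionFromHit = null;
            // The workaround: put the collection name used at index
            // time in its place.
            String archiveCollection =
                    (collectionFromHit != null) ? collectionFromHit
                                                : "ciencia";
            System.out.println("http://www.myurl.com/" + archiveCollection
                    + "/*/http://www.urlcrawled.com");
            // prints: http://www.myurl.com/ciencia/*/http://www.urlcrawled.com
        }
    }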
From: Natalia T. <nt...@ce...> - 2006-06-23 11:56:27
when I index my jobs with nutchwax 0.6.1 I use this command

hadoop jar /usr/local/nutchwax-0.6.1/nutchwax-0.6.1.jar all
/data/inputs/ /data/outputs ciencia

help explains that input, output, collection are required
I put "ciencia" as collection name, not null, but listing search
results this name is not included in the path ...

If I edit the search.jsp page and add my collection name then search
doesn't work (it doesn't recognize this collection).

Natalia
From: Michael S. <st...@ar...> - 2006-06-22 21:50:06
Natalia Torres wrote:
> Hello
>
> I have a problem indexing new jobs with hadoop and nutchwax. The forum
> archive of this list doesn't work so I can't find information about it.

I wrote sourceforge asking where's our archive!

> I indexed a couple of jobs crawled with Heritrix to try NutchWax
> search and it seems to work.
>
> I search a word in Nutchwax Search and the results are shown. But when
> I click the title or "other versions" the url is wrong. It's something
> like http://www.myurl.com/null/*/http//www.urlcrawled.com.

The host 'myurl.com' is a server that will return the content of ARCs?

> Surfing examples on the internet archive web I think that "null" in
> the path may be the collection name used at index time, am I right?
> Why null?

Looks like your collection name is 'null'. If you do an explain of your
search result, is there a 'collection' field, and if so, is its value
null?

You used 0.6.2 Nutchwax? With that version it was not possible to do an
indexing without supplying a collection name -- supposedly. You can
edit the search.jsp and add in a collection name.

> Is there any way to list the collections used at indexing?

See the explain above. Otherwise, use nutch tools to read the content
of metadata in your segments -- let me know and I'll supply more detail
-- or you can look at the index produced using tools like luke
(http://www.getopt.org/luke/) or some quick lucene code that iterates
over each document printing out the content of the content field
(Sounds like yours is null though).

> After trying it I decided to add new jobs. When I try to index new
> jobs using the same command an error appears because the indexes
> directory in the output dir exists.

Is this at the merge indices step? Try moving aside the old merged
index -- i.e. DATA_DIR/index -- and retry running the single merge
step. The 'all' command for nutchwax is for running through a complete
indexing -- from start to finish. Adding increments needs work in
nutchwax. Adding doc on howto, with my experience running a few here,
will be the focus of the next nutchwax release.
St.Ack

> How can I add jobs to this index?
>
> Thanks
>
> Natalia
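Along the lines of the "quick lucene code" suggestion above, a minimal
sketch (assuming the Lucene API of that era, and adapting the printout
to the 'collection' field at issue in this thread):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;

    public class DumpCollectionField {
        public static void main(String[] args) throws Exception {
            // args[0]: path to the merged index directory.
            IndexReader reader = IndexReader.open(args[0]);
            try {
                for (int i = 0; i < reader.maxDoc(); i++) {
                    if (reader.isDeleted(i)) {
                        continue;
                    }
                    Document doc = reader.document(i);
                    // Prints 'null' for documents indexed without a
                    // collection name.
                    System.out.println(i + ": " + doc.get("collection"));
                }
            } finally {
                reader.close();
            }
        }
    }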
From: Natalia T. <nt...@ce...> - 2006-06-22 10:02:28
Hello

I have a problem indexing new jobs with hadoop and nutchwax. The forum
archive of this list doesn't work so I can't find information about it.
I indexed a couple of jobs crawled with Heritrix to try NutchWax search
and it seems to work.

I search a word in Nutchwax Search and the results are shown. But when
I click the title or "other versions" the url is wrong. It's something
like http://www.myurl.com/null/*/http//www.urlcrawled.com. Surfing
examples on the internet archive web I think that "null" in the path
may be the collection name used at index time, am I right? Why null?
Is there any way to list the collections used at indexing?

After trying it I decided to add new jobs. When I try to index new jobs
using the same command an error appears because the indexes directory
in the output dir exists. How can I add jobs to this index?

Thanks

Natalia
From: Natalia T. <nt...@ce...> - 2006-06-16 08:11:40
You're right!! Now it works fine!!
Thanks
Natalia
From: Michael S. <st...@ar...> - 2006-06-15 16:30:43
You need hadoop 0.2.0 at least (Get 0.2.1). HADOOP-189 --
http://issues.apache.org/jira/browse/HADOOP-189 -- is needed to run
NutchWAX in 'non-distributed, /Standalone/ mode', which I'm guessing
you're trying to do since that's what the documentation suggests.

The documentation used to describe 'Pseudo-distributed configuration'
-- see the hadoop doc. for a definition of what this means -- but I
redid the doc. after HADOOP-189 was fixed, only I forgot to update the
required hadoop version. I've fixed the doc.

Hopefully this fixes your problem (Looked like it couldn't find stuff
on the CLASSPATH). Let me know. Sorry for any inconvenience.
St.Ack

Natalia Torres wrote:
> Hi
> I'm using:
>
> Hadoop 0.1.1
> Nutchwax 0.6.1
> OS Debian
> Java 1.4.1
>
> N.
From: Natalia T. <nt...@ce...> - 2006-06-15 08:50:55
Hi
I'm using:

Hadoop 0.1.1
Nutchwax 0.6.1
OS Debian
Java 1.4.1

N.
From: Michael S. <st...@ar...> - 2006-06-14 15:50:45
Tell us more Natalia. It looks like you are doing the right thing but
the below exception is odd: We're not finding items on CLASSPATH. What
hadoop are you using? I just tried nutchwax-0.6.1 locally and didn't
get the below. Have you made any config. in hadoop or is it set to all
defaults? What operating system and what JVM?
St.Ack
Natalia Torres wrote:
> Hi
> this is my first experience with NutchWax after crawling with heritrix.
>
> I've installed all software (hadoop, nutch ...) to run nutchwax+wera
> with jobs crawled.
>
> When I run all of the indexing steps in one go by passing the 'all'
> directive to NutchWAX using this command
>
> % ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax-0.6.1.jar
> all /tmp/inputs /tmp/outputs test
>
> I get this error
>
> java.lang.RuntimeException: org.apache.nutch.net.URLFilter not found.
> at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:47)
> at
> org.apache.nutch.parse.ParseOutputFormat.getRecordWriter(ParseOutputFormat.java:47)
> at
> org.apache.nutch.fetcher.FetcherOutputFormat$1.<init>(FetcherOutputFormat.java:69)
> at
> org.apache.nutch.fetcher.FetcherOutputFormat.getRecordWriter(FetcherOutputFormat.java:58)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:265)
> at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
>
> Can anyone tell me what the problem is?
>
> Thanks
>
> Natalia
From: Natalia T. <nt...@ce...> - 2006-06-14 08:01:22
Hi
this is my first experience with NutchWax after crawling with heritrix.
I've installed all software (hadoop, nutch ...) to run nutchwax+wera
with jobs crawled.
When I run all of the indexing steps in one go by passing the 'all'
directive to NutchWAX using this command
% ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax-0.6.1.jar
all /tmp/inputs /tmp/outputs test
I get this error
java.lang.RuntimeException: org.apache.nutch.net.URLFilter not found.
at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:47)
at
org.apache.nutch.parse.ParseOutputFormat.getRecordWriter(ParseOutputFormat.java:47)
at
org.apache.nutch.fetcher.FetcherOutputFormat$1.<init>(FetcherOutputFormat.java:69)
at
org.apache.nutch.fetcher.FetcherOutputFormat.getRecordWriter(FetcherOutputFormat.java:58)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:265)
at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
Can anyone tell me what the problem is?
Thanks
Natalia
|
From: Gordon M. <go...@ar...> - 2006-05-03 21:50:17
The Internet Archive, home of the Heritrix web crawler, Wayback archive
browser, and NutchWAX archive search engine projects, has current
opportunities for both student and career open source software
developers.

We are looking for a full-time Java Software Engineer to complement our
core team in San Francisco. See details at:
http://www.archive.org/about/webjobs.php#JavaSoftwareEngineer

We are also participating in the Google 2006 "Summer of Code" program
which awards students a stipend for mentored work on open source
projects. (Monday May 8th is the deadline to apply for this program.)

More info for student applicants:
http://code.google.com/soc/studentfaq.html
Ideas page for Internet Archive projects:
http://webteam.archive.org/confluence/display/SOC06/Ideas
(other ideas are welcome!)
Student application entry page:
http://code.google.com/soc/student_step1.html

- Gordon @ IA
From: Michael S. <st...@ar...> - 2006-05-01 22:36:26
With this release, NutchWAX uses mapreduce Nutch at its base: i.e.
Nutch 0.8-dev+ (Previous NutchWAX releases were based on Nutch 0.7.x).
This allows NutchWAX to scale to index even larger collections while at
the same time requiring less user intervention than was previously
necessary. A recent indexing, using a rack of ~33 dual-core 2Ghz
Athlons, each with 4Gigs of RAM, took 3.8 days to index end-to-end 50k
ARCs of 141 million documents (We'll post more stats to the list as we
knock off collection indexings).

Be aware that 0.6.0 bears little resemblance to previous releases, both
in how it goes about its work and in how it's run by the user. Be
prepared to leave aside all old NutchWAX assumptions. For an
introduction, see
http://archive-access.sourceforge.net/projects/nutch/apidocs/overview-summary.html#getting_started

Release notes are available here:
http://archive-access.sourceforge.net/projects/nutch/articles/releasenotes.html

Note that indices made with earlier versions of NutchWAX will not be
compatible with 0.6.0.

Yours,
St.Ack
From: Oskar G. <osk...@kb...> - 2006-04-27 11:35:27
Hi everybody! (insert voice of Dr. Nick Riviera here)

WAXToolbar is a firefox extension, to aid browsing a web archive, that
works tightly together with the new open source Wayback Machine. The
first official release -- 0.2.0 -- is now available at:
http://archive-access.sourceforge.net/projects/waxtoolbar/

Some basic information on how to install and use it is also there. A
few minor changes have to be made to the configuration of the Wayback
as well, but those are also covered in the documentation.

**Please note that for the toolbar to work you have to get the latest
Wayback Machine from CVS HEAD, since changes have been added since the
0.4.1 release.**

Regards,
Oskar Grenholm
National Library of Sweden
From: Michael S. <st...@ar...> - 2006-04-27 05:13:47
I'd like to make the 0.6.0 NutchWAX release this weekend. It's ready to
go. If any are so inspired, it'd be great if ye could take the
application for a spin before saturday/sunday to verify it basically
works for you. Comments on the documentation or problems you ran into
would be much appreciated.

Beware. NutchWAX 0.6.0 is very different from 0.4.x. Be sure to read
the 'Getting Started' documentation linked off the home page --
http://archive-access.sourceforge.net/projects/nutch/ -- so you can get
a handle on the new mapreduce mode of operation.

To download candidate builds, grab the latest NutchWAX bundle from
under the "Build Artifacts" on this page on our continuous build
server:
http://crawltools.archive.org:8080/cruisecontrol/buildresults/HEAD-archive-access.
Builds are still labeled nutchwax-0.5.0-xxxx. We haven't yet revved the
build to be 0.6.0.

Thanks all,
St.Ack
From: Lukas M. <mat...@ce...> - 2006-04-24 20:27:41
On Monday 24 April 2006 18:38, Michael Stack wrote:
> Andrea Goethals wrote:
>> Hello,
>
> Hello Andrea.
>
>> I have been reading documentation for nutchwax, nutch and lucene
>> trying to figure out if there's a way to do what I need to do:
>> basically to allow curators to "tag" particular archived web sites
>> as belonging to a collection for the purposes of restricting
>> searches to that collection and for generating web pages related to
>> that collection.
>>
>> The trick is that these collections can be defined at any time,
>> post-harvest (heritrix), post-index (nutchwax). And web sites can
>> belong to multiple collections. Sometimes the collections are
>> hierarchical, sometimes they are not.
>>
>> If I restrict their collections to be defined by a set of seed URIs
>> (rather than all archived URIs) I think it's more manageable.
>> Picture a database (call it myDB) that manages a list of seed URIs,
>> each associated with a unique seedId. I can perform separate
>> heritrix crawls per seed URI. When I index that set of ARC files
>> (associated with a single seed URI) I can set the command-line
>> "collection" field to the seed ID. Then all URIs associated with a
>> seed URI will have the same indexed "collection" value. This posting
>> might be hard to read because the word collection is overused - the
>> nutchwax collection field would be used in this case to group
>> together sites that came from the same seed.

We had to solve a similar situation. In a separate database we defined
special metadata (e.g. collection, contract with author) for each URI,
and then we'd like to feed nutchwax with a tagged set of records from
the ARCs (arc name + offset). The database's extra metadata is also
used for accessing documents through the wayback machine.

> [...]
From: Michael S. <st...@ar...> - 2006-04-24 17:13:32
Andrea Goethals wrote:
.....
>> Not at the moment but I've been playing and we could add a new step
>> that did nothing but read from a data source and add metadata from
>> the data source to the Nutch(WAX) segment (In particular, rewrite
>> the parse_data file in the segment, the file that holds the
>> 'metadata' such as fetch time, etc.).
>>
>> So, you wouldn't have to touch the ARCs, just the product of the
>> Nutch(WAX) parse.
>>
>> You could tag a page as being of multiple collections: E.g. of
>> collection 1, 7 and 8.
>>
>> After adding the metadata, you'd have to reindex.
>>
>> Would that work for you?
>
> That would be great! There is a need in general (at least for us -
> probably for others?) to be able to add metadata to already-harvested
> content. We have another situation like this where the curators would
> want to add subjects to particular URLs. So we could come up with a
> generic solution to this - add any field (e.g. collection, subject),
> tell it which ? to apply this to. Would the new fields be associated
> at the ARC-level or URI level?

At URI level. I've added an RFE:
http://sourceforge.net/tracker/index.php?func=detail&aid=1475667&group_id=118427&atid=681140
I'll start in on it after the 0.6.0 release of nutchwax (Should be any
week soon -- but I've been saying that for a while now...).

...

> I think that that could work. To reduce the query length problem you
> could hide the actual query syntax from the user. Like the
> archiveit.org way, you could use the collection IDs in the query to
> keep the query shorter:
> 'collection:1,7,8 cats'
> by either translating the user's selection of asia, europe and
> australia from a drop-down list, or translating the user's typed-in
> collection:asia,europe,australia to collection:1,7,8 before the query
> is executed.

Yes. That sounds right.
St.Ack
From: Andrea G. <and...@ha...> - 2006-04-24 17:00:26
Hi Michael,

>> I have been reading documentation for nutchwax, nutch and lucene
>> trying to figure out if there's a way to do what I need to do:
>> basically to allow curators to "tag" particular archived web sites
>> as belonging to a collection for the purposes of restricting
>> searches to that collection and for generating web pages related to
>> that collection.
>>
>> The trick is that these collections can be defined at any time,
>> post-harvest (heritrix), post-index (nutchwax). And web sites can
>> belong to multiple collections. Sometimes the collections are
>> hierarchical, sometimes they are not.
>>
>> This is my thinking on it so far. I'm hoping that someone will step
>> in with a better or more elegant way to do this.
>>
>> If I restrict their collections to be defined by a set of seed URIs
>> (rather than all archived URIs) I think it's more manageable.
>> Picture a database (call it myDB) that manages a list of seed URIs,
>> each associated with a unique seedId. I can perform separate
>> heritrix crawls per seed URI. When I index that set of ARC files
>> (associated with a single seed URI) I can set the command-line
>> "collection" field to the seed ID. Then all URIs associated with a
>> seed URI will have the same indexed "collection" value. This posting
>> might be hard to read because the word collection is overused - the
>> nutchwax collection field would be used in this case to group
>> together sites that came from the same seed.
>>
>> Then in that separate database (myDB) I can manage associations
>> defined at any time between curator-defined collections and the
>> seedIds. I wouldn't want to add these collection "tags" to the index
>> because to do this I'd probably have to add these new collection
>> values to the content in the ARC files, then write a parse, index
>> and query filter to handle the new field, right? Or is there a way
>> to just add a field directly to the index for a set of seedIds?
>
> Not at the moment but I've been playing and we could add a new step
> that did nothing but read from a data source and add metadata from
> the data source to the Nutch(WAX) segment (In particular, rewrite the
> parse_data file in the segment, the file that holds the 'metadata'
> such as fetch time, etc.).
>
> So, you wouldn't have to touch the ARCs, just the product of the
> Nutch(WAX) parse.
>
> You could tag a page as being of multiple collections: E.g. of
> collection 1, 7 and 8.
>
> After adding the metadata, you'd have to reindex.
>
> Would that work for you?

That would be great! There is a need in general (at least for us -
probably for others?) to be able to add metadata to already-harvested
content. We have another situation like this where the curators would
want to add subjects to particular URLs. So we could come up with a
generic solution to this - add any field (e.g. collection, subject),
tell it which ? to apply this to. Would the new fields be associated at
the ARC-level or URI level?

>> So the idea is to translate a user's search query into indexed
>> fields using the associations in myDB. Say the user searches with
>> (and myCollection is the field name for the curator's collection,
>> which isn't a lucene field):
>> myCollection:Asia cats
>> then this could be translated behind-the-scenes to a lucene query:
>> collection:seed1 OR collection:seed7 OR collection:seed8 cats
>> (assuming the crawls associated with seeds 1, 7 and 8 were mapped to
>> the Asia collection.)
>>
>> I don't think nutchWAX can support the OR queries yet, is that right?
>
> That's right. No OR yet.
>
> But, we're sort of having a similar problem to you here at the
> archive (archiveit.org in particular).
>
> They have done similar to your idea in that they have tried to add in
> a little indirection, naming collections by ID instead of explicitly.
> Querying one collection works now, or querying all collections, but
> awkward is querying a couple of collections. One thought is to amend
> the collection query-time plugin so it can take a list of
> collections: E.g. 'collection:asia,europe,australia cats'. This would
> find instances of cats in all three listed collections. Would break
> if the list of collections was in the hundreds I'd imagine. And it's
> not what you want.

I think that that could work. To reduce the query length problem you
could hide the actual query syntax from the user. Like the
archiveit.org way, you could use the collection IDs in the query to
keep the query shorter:
'collection:1,7,8 cats'
by either translating the user's selection of asia, europe and
australia from a drop-down list, or translating the user's typed-in
collection:asia,europe,australia to collection:1,7,8 before the query
is executed.

> I suppose you could do 3 separate queries, aggregating the results?
> Would that be onerous?

I'll probably try the single query in a list first to not have to deal
with ordering the results.

thanks,
Andrea

> St.Ack
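A sketch of the pre-query translation Andrea describes (the class, the
name-to-ID map, and the query syntax handling are all illustrative
assumptions, not NutchWAX code; in practice the map would come from
myDB):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CollectionQueryRewriter {
        // Illustrative mapping of curator collection names to IDs.
        private static final Map<String, String> NAME_TO_ID =
                new HashMap<String, String>();
        static {
            NAME_TO_ID.put("asia", "1");
            NAME_TO_ID.put("europe", "7");
            NAME_TO_ID.put("australia", "8");
        }

        // "collection:asia,europe,australia cats" -> "collection:1,7,8 cats"
        static String rewrite(String query) {
            Matcher m = Pattern.compile("collection:(\\S+)").matcher(query);
            if (!m.find()) {
                return query;
            }
            StringBuilder ids = new StringBuilder();
            for (String name : m.group(1).split(",")) {
                if (ids.length() > 0) {
                    ids.append(',');
                }
                String id = NAME_TO_ID.get(name.toLowerCase());
                ids.append(id != null ? id : name); // pass unknowns through
            }
            return m.replaceFirst("collection:" + ids);
        }

        public static void main(String[] args) {
            System.out.println(rewrite("collection:asia,europe,australia cats"));
            // prints: collection:1,7,8 cats
        }
    }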
From: Michael S. <st...@ar...> - 2006-04-24 16:38:33
Andrea Goethals wrote:
> Hello,

Hello Andrea.

> I have been reading documentation for nutchwax, nutch and lucene
> trying to figure out if there's a way to do what I need to do:
> basically to allow curators to "tag" particular archived web sites as
> belonging to a collection for the purposes of restricting searches to
> that collection and for generating web pages related to that
> collection.
>
> The trick is that these collections can be defined at any time,
> post-harvest (heritrix), post-index (nutchwax). And web sites can
> belong to multiple collections. Sometimes the collections are
> hierarchical, sometimes they are not.
>
> This is my thinking on it so far. I'm hoping that someone will step in
> with a better or more elegant way to do this.
>
> If I restrict their collections to be defined by a set of seed URIs
> (rather than all archived URIs) I think it's more manageable. Picture
> a database (call it myDB) that manages a list of seed URIs, each
> associated with a unique seedId. I can perform separate heritrix
> crawls per seed URI. When I index that set of ARC files (associated
> with a single seed URI) I can set the command-line "collection" field
> to the seed ID. Then all URIs associated with a seed URI will have the
> same indexed "collection" value. This posting might be hard to read
> because the word collection is overused - the nutchwax collection
> field would be used in this case to group together sites that came
> from the same seed.
>
> Then in that separate database (myDB) I can manage associations
> defined at any time between curator-defined collections and the
> seedIds. I wouldn't want to add these collection "tags" to the index
> because to do this I'd probably have to add these new collection
> values to the content in the ARC files, then write a parse, index and
> query filter to handle the new field, right? Or is there a way to just
> add a field directly to the index for a set of seedIds?

Not at the moment but I've been playing and we could add a new step
that did nothing but read from a data source and add metadata from the
data source to the Nutch(WAX) segment (In particular, rewrite the
parse_data file in the segment, the file that holds the 'metadata' such
as fetch time, etc.).

So, you wouldn't have to touch the ARCs, just the product of the
Nutch(WAX) parse.

You could tag a page as being of multiple collections: E.g. of
collection 1, 7 and 8.

After adding the metadata, you'd have to reindex.

Would that work for you?

> So the idea is to translate a user's search query into indexed fields
> using the associations in myDB. Say the user searches with (and
> myCollection is the field name for the curator's collection, which
> isn't a lucene field):
> myCollection:Asia cats
> then this could be translated behind-the-scenes to a lucene query:
> collection:seed1 OR collection:seed7 OR collection:seed8 cats
> (assuming the crawls associated with seeds 1, 7 and 8 were mapped to
> the Asia collection.)
>
> I don't think nutchWAX can support the OR queries yet, is that right?

That's right. No OR yet.

But, we're sort of having a similar problem to you here at the archive
(archiveit.org in particular).

They have done similar to your idea in that they have tried to add in a
little indirection, naming collections by ID instead of explicitly.
Querying one collection works now, or querying all collections, but
awkward is querying a couple of collections. One thought is to amend
the collection query-time plugin so it can take a list of collections:
E.g. 'collection:asia,europe,australia cats'. This would find instances
of cats in all three listed collections. Would break if the list of
collections was in the hundreds, I'd imagine. And it's not what you
want.

I suppose you could do 3 separate queries, aggregating the results?
Would that be onerous?

St.Ack
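A pure-Lucene sketch of the query shape such a list-taking plugin would
produce (this is not the actual Nutch query-filter API; the class name
is made up, and a Lucene version with BooleanClause.Occur is assumed):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class CollectionListClause {
        // Expand "asia,europe,australia" into an OR across the
        // collection field; the caller would AND this with the rest
        // of the translated query.
        static Query collectionClause(String csv) {
            BooleanQuery or = new BooleanQuery();
            for (String name : csv.split(",")) {
                or.add(new TermQuery(new Term("collection", name)),
                        BooleanClause.Occur.SHOULD);
            }
            return or;
        }

        public static void main(String[] args) {
            System.out.println(collectionClause("asia,europe,australia"));
            // prints: collection:asia collection:europe collection:australia
        }
    }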
From: Andrea G. <and...@ha...> - 2006-04-21 18:54:17
Hello,

I have been reading documentation for nutchwax, nutch and lucene trying
to figure out if there's a way to do what I need to do: basically to
allow curators to "tag" particular archived web sites as belonging to a
collection for the purposes of restricting searches to that collection
and for generating web pages related to that collection.

The trick is that these collections can be defined at any time,
post-harvest (heritrix), post-index (nutchwax). And web sites can
belong to multiple collections. Sometimes the collections are
hierarchical, sometimes they are not.

This is my thinking on it so far. I'm hoping that someone will step in
with a better or more elegant way to do this.

If I restrict their collections to be defined by a set of seed URIs
(rather than all archived URIs) I think it's more manageable. Picture a
database (call it myDB) that manages a list of seed URIs, each
associated with a unique seedId. I can perform separate heritrix crawls
per seed URI. When I index that set of ARC files (associated with a
single seed URI) I can set the command-line "collection" field to the
seed ID. Then all URIs associated with a seed URI will have the same
indexed "collection" value. This posting might be hard to read because
the word collection is overused - the nutchwax collection field would
be used in this case to group together sites that came from the same
seed.

Then in that separate database (myDB) I can manage associations defined
at any time between curator-defined collections and the seedIds. I
wouldn't want to add these collection "tags" to the index because to do
this I'd probably have to add these new collection values to the
content in the ARC files, then write a parse, index and query filter to
handle the new field, right? Or is there a way to just add a field
directly to the index for a set of seedIds?

So the idea is to translate a user's search query into indexed fields
using the associations in myDB. Say the user searches with (and
myCollection is the field name for the curator's collection, which
isn't a lucene field):
myCollection:Asia cats
then this could be translated behind-the-scenes to a lucene query:
collection:seed1 OR collection:seed7 OR collection:seed8 cats
(assuming the crawls associated with seeds 1, 7 and 8 were mapped to
the Asia collection.)

I don't think nutchWAX can support the OR queries yet, is that right?
Has anyone else figured out a different way to do this or have a
different idea?

thanks,
Andrea
From: <st...@ar...> - 2006-04-06 21:59:08
Excellent Oskar!
Do you want us to host your firefox extension at archive-access? If so,
we can set up a subproject for it and give you access.
St.Ack
Oskar Grenholm wrote:
> Hi everyone!
>
> Let me first introduce me to those of you who don't know me already.
> My name is Oskar Grenholm and I work as a programmer at The National Library
> of Sweden. I mainly work with things related to our web archive here.
>
> Lately I have made some minor improvements to the way the proxy-mode works in
> the Open Wayback Machine. Those changes have made it possible to surf not
> only the most recent copy of a page in the web archive, but instead any copy
> available.
> This can be done with just the Wayback Machine, but to aid (and perhaps
> simplify) the surfing I have also started working on a Firefox extension that
> will help the user with common tasks often encountered when surfing a web
> archive. Among the things this WAX Toolbar does is providing a search field
> for searching the Wayback Machine for different URL:s OR do a full-text
> search from a NutchWAX index (if one is available of course). You can also
> use the toolbar to switch between proxy-mode and the regular Internet, and
> when in proxy-mode easily go back and forth in time.
>
> The changes made to the Wayback are not many. The main idea is that you have a
> BDB index that holds mappings between id:s (a unique id if the toolbar was
> used, otherwise the ip-address the request was made from) and a preferred
> time to surf at. This timestamp is set either when you choose a page to visit
> from the search interface in the WB or by the WAX Toolbar.
> Then for each request made to the proxy the WB will look up this timestamp and
> return the page that is the closest in time.
>
> Patches for these changes are attached to this e-mail. Four of the files are
> earlier existing files that have been modified somewhat and two of them are
> new (BDBMap.java and Redirect.jsp).
>
> Attached is also a tar-file containing the source for the Firefox extension.
> If you untar this and enter the directory you can just run 'ant' and a file
> named WaxToolbar.xpi will be built. That is the actual Firefox extension and
> it can be installed as any other extension (i.e., double-clicking it from
> within Firefox).
> When the extension is installed (and after a re-start of Firefox) a new
> toolbar will be there. In the Tools menu there will also be a WAX Toolbar
> Configuration option. Using this you can set the proxy to use (the WB) and a
> server running NutchWAX.
>
> Finally I have attached an example of a web.xml that can be used when running
> the WB with these new changes and the WAX Toolbar. In it some new stuff has
> been added, namely a parameter specifying the redirect path (the Redirect.jsp
> mentioned above) and a servlet called xmlquery that runs in parallell with
> the normal query interface and is used by the extension to find the times a
> page has been archived.
>
> So, let the feedback begin!
>
> Regards, Oskar.
> ------------------------------------------------------------------------
>
> Index: BDBMap.java
> ===================================================================
> RCS file: BDBMap.java
> diff -N BDBMap.java
> --- /dev/null 1 Jan 1970 00:00:00 -0000
> +++ BDBMap.java 1 Jan 1970 00:00:00 -0000
> @@ -0,0 +1,94 @@
> +/*
> + * Created on 2006-apr-05
> + *
> + * Copyright (C) 2006 Royal Library of Sweden.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + */
> +package org.archive.wayback.core;
> +
> +import java.io.File;
> +import java.io.UnsupportedEncodingException;
> +
> +import com.sleepycat.je.Database;
> +import com.sleepycat.je.DatabaseConfig;
> +import com.sleepycat.je.DatabaseEntry;
> +import com.sleepycat.je.DatabaseException;
> +import com.sleepycat.je.Environment;
> +import com.sleepycat.je.EnvironmentConfig;
> +import com.sleepycat.je.LockMode;
> +import com.sleepycat.je.OperationStatus;
> +
> +public class BDBMap {
> +
> + protected Environment env = null;
> + protected Database db = null;
> + protected String name;
> + protected String dir;
> +
> + public BDBMap(String name, String dir) {
> + this.name = name;
> + this.dir = dir;
> + init();
> + }
> +
> + protected void init() {
> + try {
> + EnvironmentConfig envConf = new EnvironmentConfig();
> + envConf.setAllowCreate(true);
> + File envDir = new File(dir);
> + if (!envDir.exists())
> + envDir.mkdirs();
> + env = new Environment(envDir, envConf);
> +
> + DatabaseConfig dbConf = new DatabaseConfig();
> + dbConf.setAllowCreate(true);
> + dbConf.setSortedDuplicates(false);
> + db = env.openDatabase(null, name, dbConf);
> + } catch (DatabaseException e) {
> + e.printStackTrace();
> + }
> + }
> +
> + public void put(String keyStr, String valueStr) {
> + try {
> + DatabaseEntry key = new DatabaseEntry(keyStr.getBytes("UTF-8"));
> + DatabaseEntry data = new DatabaseEntry(valueStr.getBytes("UTF-8"));
> + db.put(null, key, data);
> + } catch (DatabaseException e) {
> + e.printStackTrace();
> + } catch (UnsupportedEncodingException e) {
> + e.printStackTrace();
> + }
> + }
> +
> + public String get(String keyStr) {
> + String result = null;
> + try {
> + DatabaseEntry key = new DatabaseEntry(keyStr.getBytes("UTF-8"));
> + DatabaseEntry data = new DatabaseEntry();
> + if (db.get(null, key, data, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
> + byte[] bytes = data.getData();
> + result = new String(bytes, "UTF-8");
> + }
> + } catch (DatabaseException e) {
> + e.printStackTrace();
> + } catch (UnsupportedEncodingException e) {
> + e.printStackTrace();
> + }
> + return result;
> + }
> +
> +}
> ------------------------------------------------------------------------
>
> Index: ResultURIConverter.java
> ===================================================================
> RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/proxy/ResultURIConverter.java,v
> retrieving revision 1.3
> diff -u -r1.3 ResultURIConverter.java
> --- ResultURIConverter.java 1 Dec 2005 02:08:34 -0000 1.3
> +++ ResultURIConverter.java 6 Apr 2006 11:36:25 -0000
> @@ -41,10 +41,19 @@
> * @version $Date: 2005/12/01 02:08:34 $, $Revision: 1.3 $
> */
> public class ResultURIConverter implements ReplayResultURIConverter {
> - /* (non-Javadoc)
> +
> + private static final String REDIRECT_PATH_PROPERTY = "proxy.redirectpath";
> +
> + private String redirectPath;
> +
> + /* (non-Javadoc)
> * @see org.archive.wayback.ReplayResultURIConverter#init(java.util.Properties)
> */
> public void init(Properties p) throws ConfigurationException {
> + redirectPath = (String) p.get(REDIRECT_PATH_PROPERTY);
> + if (redirectPath == null || redirectPath.length() <= 0) {
> + throw new ConfigurationException("Failed to find " + REDIRECT_PATH_PROPERTY);
> + }
> }
>
> /* (non-Javadoc)
> @@ -52,10 +61,12 @@
> */
> public String makeReplayURI(SearchResult result) {
> String finalUrl = result.get(WaybackConstants.RESULT_URL);
> + String finalTime = result.get(WaybackConstants.RESULT_CAPTURE_DATE);
> if(!finalUrl.startsWith(WaybackConstants.HTTP_URL_PREFIX)) {
> finalUrl = WaybackConstants.HTTP_URL_PREFIX + finalUrl;
> }
> - return finalUrl;
> + //return finalUrl;
> + return redirectPath + "?url=" + finalUrl + "&time=" + finalTime;
> }
>
> /**
> @@ -70,6 +81,7 @@
> */
> public String makeRedirectReplayURI(SearchResult result, String url) {
> String finalUrl = url;
> + String finalTime = result.get(WaybackConstants.RESULT_CAPTURE_DATE);
> try {
>
> UURI origURI = UURIFactory.getInstance(url);
> @@ -86,6 +98,7 @@
> if(!finalUrl.startsWith(WaybackConstants.HTTP_URL_PREFIX)) {
> finalUrl = WaybackConstants.HTTP_URL_PREFIX + finalUrl;
> }
> - return finalUrl;
> + //return finalUrl;
> + return redirectPath + "?url=" + finalUrl + "&time=" + finalTime;
> }
> }
> ------------------------------------------------------------------------
>
> Index: Timestamp.java
> ===================================================================
> RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/core/Timestamp.java,v
> retrieving revision 1.7
> diff -u -r1.7 Timestamp.java
> --- Timestamp.java 16 Feb 2006 03:14:42 -0000 1.7
> +++ Timestamp.java 6 Apr 2006 11:34:06 -0000
> @@ -56,6 +56,11 @@
>
> private final static String[] months = { "Jan", "Feb", "Mar", "Apr", "May",
> "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec" };
> +
> + // Acts as a mapping between an ID and a timestamp to surf at.
> + // The dir should probably be configurable somehow.
> + private static String BDB_DIR = System.getProperty("java.io.tmpdir") + "/wayback/bdb";
> + private static BDBMap idToTimestamp = new BDBMap("IdToTimestamp", BDB_DIR);
>
> private String dateStr = null;
> private Date date = null;
> @@ -430,6 +435,7 @@
> public static Timestamp currentTimestamp() {
> return new Timestamp(new Date());
> }
> +
> /**
> * @return Timestamp object representing the latest possible date.
> */
> @@ -437,12 +443,20 @@
> return currentTimestamp();
> }
>
> -
> /**
> * @return Timestamp object representing the earliest possible date.
> */
> public static Timestamp earliestTimestamp() {
> return new Timestamp(SSE_1996);
> }
> +
> + public static String getTimestampForId(String ip) {
> + String dateStr = idToTimestamp.get(ip);
> + return (dateStr != null) ? dateStr : currentTimestamp().getDateStr();
> + }
> +
> + public static void addTimestampForId(String ip, String time) {
> + idToTimestamp.put(ip, time);
> + }
>
> }
> ------------------------------------------------------------------------
>
> Index: Redirect.jsp
> ===================================================================
> RCS file: Redirect.jsp
> diff -N Redirect.jsp
> --- /dev/null 1 Jan 1970 00:00:00 -0000
> +++ Redirect.jsp 1 Jan 1970 00:00:00 -0000
> @@ -0,0 +1,14 @@
> +<%@ page import="org.archive.wayback.core.Timestamp" %>
> +
> +<%
> + String url = request.getParameter("url");
> + String time = request.getParameter("time");
> +
> + // Put time-mapping for this id, or if no id, the ip-addr.
> + String id = request.getHeader("Proxy-Id");
> + if(id == null) id = request.getRemoteAddr();
> + Timestamp.addTimestampForId(id, time);
> +
> + // Now redirect to the page the user wanted.
> + response.sendRedirect(url);
> +%>
> ------------------------------------------------------------------------
>
> Index: ReplayFilter.java
> ===================================================================
> RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/proxy/ReplayFilter.java,v
> retrieving revision 1.4
> diff -u -r1.4 ReplayFilter.java
> --- ReplayFilter.java 18 Jan 2006 02:04:12 -0000 1.4
> +++ ReplayFilter.java 6 Apr 2006 11:36:02 -0000
> @@ -84,10 +84,15 @@
> referer = "";
> }
> wbRequest.put(WaybackConstants.REQUEST_REFERER_URL,referer);
> -
> - wbRequest.put(WaybackConstants.REQUEST_EXACT_DATE,
> - Timestamp.currentTimestamp().getDateStr());
> -
> +
> + // Original
> + //wbRequest.put(WaybackConstants.REQUEST_EXACT_DATE, Timestamp.currentTimestamp().getDateStr());
> +
> + // Get the id from the request. If no id, use the ip-address instead.
> + // Then get the timestamp (or rather datestr) matching this id.
> + String id = httpRequest.getHeader("Proxy-Id");
> + if(id == null) id = httpRequest.getRemoteAddr();
> + wbRequest.put(WaybackConstants.REQUEST_EXACT_DATE, Timestamp.getTimestampForId(id));
>
> return wbRequest;
> }
> ------------------------------------------------------------------------
>
> Index: QueryServlet.java
> ===================================================================
> RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/query/QueryServlet.java,v
> retrieving revision 1.5
> diff -u -r1.5 QueryServlet.java
> --- QueryServlet.java 7 Mar 2006 23:22:20 -0000 1.5
> +++ QueryServlet.java 6 Apr 2006 11:38:30 -0000
> @@ -25,7 +25,9 @@
> package org.archive.wayback.query;
>
> import java.io.IOException;
> +import java.text.ParseException;
> import java.util.Enumeration;
> +import java.util.Iterator;
> import java.util.Properties;
>
> import javax.servlet.ServletConfig;
> @@ -39,7 +41,9 @@
> import org.archive.wayback.QueryRenderer;
> import org.archive.wayback.ReplayResultURIConverter;
> import org.archive.wayback.ResourceIndex;
> +import org.archive.wayback.core.SearchResult;
> import org.archive.wayback.core.SearchResults;
> +import org.archive.wayback.core.Timestamp;
> import org.archive.wayback.core.WaybackLogic;
> import org.archive.wayback.core.WaybackRequest;
> import org.archive.wayback.exception.BadQueryException;
> @@ -119,6 +123,14 @@
> if (wbRequest.get(WaybackConstants.REQUEST_TYPE).equals(
> WaybackConstants.REQUEST_URL_QUERY)) {
>
> + // Annotate the closest matching hit so that it can
> + // be retrieved later from the xml.
> + try {
> + annotateClosest(results, wbRequest, httpRequest);
> + } catch (ParseException e) {
> + e.printStackTrace();
> + }
> +
> renderer.renderUrlResults(httpRequest, httpResponse,
> wbRequest, results, uriConverter);
>
> @@ -144,4 +156,34 @@
>
> }
> }
> +
> + // Annotates the search result closest in time to the timestamp
> + // registered for this request's id (Proxy-Id header or remote address).
> + private void annotateClosest(SearchResults results,
> + WaybackRequest wbRequest, HttpServletRequest request) throws ParseException {
> +
> + SearchResult closest = null;
> + long closestDistance = 0;
> + SearchResult cur = null;
> + String id = request.getHeader("Proxy-Id");
> + if(id == null) id = request.getRemoteAddr();
> + String requestsDate = Timestamp.getTimestampForId(id);
> + Timestamp wantTimestamp;
> + wantTimestamp = Timestamp.parseBefore(requestsDate);
> +
> + Iterator itr = results.iterator();
> + while (itr.hasNext()) {
> + cur = (SearchResult) itr.next();
> + long curDistance;
> + Timestamp curTimestamp = Timestamp.parseBefore(cur
> + .get(WaybackConstants.RESULT_CAPTURE_DATE));
> + curDistance = curTimestamp.absDistanceFromTimestamp(wantTimestamp);
> +
> + if ((closest == null) || (curDistance < closestDistance)) {
> + closest = cur;
> + closestDistance = curDistance;
> + }
> + }
> + if (closest != null) closest.put("closest", "true");
> + }
> }
> ------------------------------------------------------------------------
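
The selection loop in annotateClosest() is a standard nearest-value scan; a
standalone sketch of the same shape follows. Comparing the 14-digit date
strings as longs is only a rough stand-in for
Timestamp.absDistanceFromTimestamp(), which I assume measures real time
distance, but the structure is the point:

    // Standalone sketch of the closest-capture scan over 14-digit date
    // strings. Returns null for an empty input, which is why the servlet
    // version needs the null guard before annotating.
    public class ClosestCaptureSketch {
        public static String closest(String wantDateStr, String[] captureDates) {
            long want = Long.parseLong(wantDateStr);
            String closest = null;
            long closestDistance = Long.MAX_VALUE;
            for (String capture : captureDates) {
                long distance = Math.abs(Long.parseLong(capture) - want);
                if (distance < closestDistance) {
                    closest = capture;
                    closestDistance = distance;
                }
            }
            return closest;
        }
    }
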
>
> <?xml version="1.0"?>
> <!DOCTYPE web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
> "http://java.sun.com/dtd/web-app_2_3.dtd">
> <web-app>
>
> <!-- General Installation information
> -->
>
> <context-param>
> <param-name>installationname</param-name>
> <param-value>Local Proxy Installation</param-value>
> <description>
> This text will appear on the Wayback Configuration and Status page
> and may assist in determining which installation users are viewing
> via their web browser in environments with multiple Wayback
> installations.
> </description>
> </context-param>
>
>
> <!-- Local Arc Path Configuration:
> used by both indexpipeline and LocalARCResourceStore
> -->
>
> <context-param>
> <param-name>arcpath</param-name>
> <param-value>/tmp/wayback/arcs</param-value>
> <description>
> Directory where ARC files are found (possibly where Heritrix writes them).
> This directory must exist.
> </description>
> </context-param>
>
>
>
> <!-- ResourceStore Configuration -->
>
> <context-param>
> <param-name>resourcestore.classname</param-name>
> <param-value>org.archive.wayback.localresourcestore.LocalARCResourceStore</param-value>
> <description>Class that implements ResourceStore for this Wayback</description>
> </context-param>
>
>
>
> <!-- ResourceIndex Configuration -->
>
> <context-param>
> <param-name>resourceindex.classname</param-name>
> <param-value>org.archive.wayback.cdx.LocalBDBResourceIndex</param-value>
> <description>Class that implements ResourceIndex for this Wayback</description>
> </context-param>
>
> <context-param>
> <param-name>resourceindex.indexpath</param-name>
> <param-value>/tmp/wayback/index</param-value>
> <description>
> LocalBDBResourceIndex specific directory to store the BDB files.
> This directory must exist.
> </description>
> </context-param>
>
> <context-param>
> <param-name>resourceindex.dbname</param-name>
> <param-value>DB1</param-value>
> <description>
> LocalBDBResourceIndex specific name for BDB database
> </description>
> </context-param>
>
>
> <!-- ResourceIndex Pipeline Configuration -->
>
> <context-param>
> <param-name>indexpipeline.workpath</param-name>
> <param-value>/tmp/wayback/pipeline</param-value>
> <description>
> LocalBDBResourceIndex specific directory to store flag files and
> temporary index data. This directory must exist.
> </description>
> </context-param>
>
> <context-param>
> <param-name>indexpipeline.runpipeline</param-name>
> <param-value>1</param-value>
> <description>
> if set to '1' then a background indexing thread will automatically
> update the BDB index when new ARC files are noticed in the 'arcpath'
> directory.
> </description>
> </context-param>
>
> <!-- Pipeline Filter Configuration
> this enables a trivial (and still in-progress) UI for viewing the
> pipeline status.
> -->
>
> <filter>
> <filter-name>PipelineFilter</filter-name>
> <filter-class>org.archive.wayback.cdx.indexer.PipelineFilter</filter-class>
> <init-param>
> <param-name>pipeline.statusjsp</param-name>
> <param-value>jsp/PipelineUI/PipelineStatus.jsp</param-value>
> </init-param>
> </filter>
> <filter-mapping>
> <filter-name>PipelineFilter</filter-name>
> <url-pattern>/pipeline</url-pattern>
> </filter-mapping>
>
>
>
>
> <!-- Query Servlet Configuration -->
>
> <servlet>
> <servlet-name>QueryServlet</servlet-name>
> <servlet-class>org.archive.wayback.query.QueryServlet</servlet-class>
> <init-param>
> <param-name>queryui.jsppath</param-name>
> <param-value>jsp/QueryUI</param-value>
> </init-param>
> </servlet>
> <servlet-mapping>
> <servlet-name>QueryServlet</servlet-name>
> <url-pattern>/query</url-pattern>
> </servlet-mapping>
>
> <!-- XMLQuery Servlet Configuration -->
>
> <servlet>
> <servlet-name>XMLQueryServlet</servlet-name>
> <servlet-class>org.archive.wayback.query.QueryServlet</servlet-class>
> <init-param>
> <param-name>queryui.jsppath</param-name>
> <param-value>jsp/QueryXMLUI</param-value>
> </init-param>
> </servlet>
> <servlet-mapping>
> <servlet-name>XMLQueryServlet</servlet-name>
> <url-pattern>/xmlquery</url-pattern>
> </servlet-mapping>
>
> <!-- QueryUI Configuration -->
>
> <context-param>
> <param-name>queryrenderer.classname</param-name>
> <param-value>org.archive.wayback.query.Renderer</param-value>
> <description>Implementation responsible for drawing Index Query results</description>
> </context-param>
>
> <context-param>
> <param-name>proxy.redirectpath</param-name>
> <param-value>/jsp/QueryUI/Redirect.jsp</param-value>
> </context-param>
>
>
>
> <!-- Replay Servlet Configuration -->
>
> <servlet>
> <servlet-name>ReplayServlet</servlet-name>
> <servlet-class>org.archive.wayback.replay.ReplayServlet</servlet-class>
> </servlet>
> <servlet-mapping>
> <servlet-name>ReplayServlet</servlet-name>
> <url-pattern>/replay</url-pattern>
> </servlet-mapping>
>
>
>
> <!-- Proxy RawReplayUI Configuration -->
>
> <context-param>
> <param-name>replayrenderer.classname</param-name>
> <param-value>org.archive.wayback.proxy.RawReplayRenderer</param-value>
> <description>Implementation responsible for drawing replayed resources and replay error messages</description>
> </context-param>
>
> <context-param>
> <param-name>replayui.jsppath</param-name>
> <param-value>jsp/ReplayUI</param-value>
> <description>
> RawReplayUI specific path to jsp pages. relative to webapp/
> </description>
> </context-param>
>
> <!-- Proxy URI Conversion Configuration -->
>
> <context-param>
> <param-name>replayuriconverter.classname</param-name>
> <param-value>org.archive.wayback.proxy.ResultURIConverter</param-value>
> <description>Class that implements translation of index results to Replayable URIs for this Wayback</description>
> </context-param>
>
> <!-- Proxy ReplayFilter Configuration -->
>
> <filter>
> <filter-name>ReplayFilter</filter-name>
> <filter-class>org.archive.wayback.proxy.ReplayFilter</filter-class>
>
> <init-param>
> <param-name>handler.url</param-name>
> <param-value>/replay</param-value>
> </init-param>
> </filter>
> <filter-mapping>
> <filter-name>ReplayFilter</filter-name>
> <url-pattern>/*</url-pattern>
> </filter-mapping>
>
> </web-app>
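
With this web.xml deployed, proxy mode should be testable by pointing an HTTP
client's proxy setting at the Tomcat host and port and fetching any archived
URL; until Redirect.jsp records a mapping, ReplayFilter falls back to the
current timestamp. A hedged smoke test, where the localhost:8080 deployment,
the target URL and the id value are all assumptions:

    // Hypothetical smoke test: fetch a page through the Wayback proxy with an
    // explicit Proxy-Id header and print the response code.
    import java.net.HttpURLConnection;
    import java.net.InetSocketAddress;
    import java.net.Proxy;
    import java.net.URL;

    public class ProxySmokeTest {
        public static void main(String[] args) throws Exception {
            Proxy proxy = new Proxy(Proxy.Type.HTTP,
                    new InetSocketAddress("localhost", 8080));
            HttpURLConnection conn = (HttpURLConnection)
                    new URL("http://example.com/").openConnection(proxy);
            conn.setRequestProperty("Proxy-Id", "user42");
            System.out.println("HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }
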