From: Sverre B. <sve...@nb...> - 2006-01-19 11:57:03
|
Good for me ;-) The behaviour you reported is actually different from the "Fragment identifier" problem I reported as a bug for v0.4.1. The problem with URLs breaking up at '&' is fixed in 0.4.1 (http://sourceforge.net/tracker/index.php?func=detail&aid=1354276&group_id=118427&atid=681137). I hope upgrading to WERA 0.4.1 solves your problems. Regards, Sverre On Thu, 2006-01-19 at 12:18 +0100, Remy Cristini wrote: > Hi, > > I installed NutchWAX 0.4.2 and WERA 0.4.0. Oh, but now I see that I'm already behind; I'll get the 0.4.1 tomorrow when I'm back in the office. But I guess you reported the problem using the new release, right? By the way, the NutchWAX resolver and the ARCretriever are running on Apache Tomcat 5.5.12, and for the WERA front-end I now use XAMPP 1.5.1 with PHP 4.4.1. And all that on an external disk attached to a PC running Windows XP Pro... But otherwise it works fine! Only have to address one more issue, which is to enable searches in Chinese, but I think I already found the instructions I need (change a <Connector> tag in a Tomcat config file). > > Best, > Remy |
|
From: Remy C. <r.v...@le...> - 2006-01-19 11:19:13
|
Hi, I installed NutchWAX 0.4.2 and WERA 0.4.0. Oh, but now I see that I'm already behind; I'll get the 0.4.1 tomorrow when I'm back in the office. But I guess you reported the problem using the new release, right? By the way, the NutchWAX resolver and the ARCretriever are running on Apache Tomcat 5.5.12, and for the WERA front-end I now use XAMPP 1.5.1 with PHP 4.4.1. And all that on an external disk attached to a PC running Windows XP Pro... But otherwise it works fine! Only have to address one more issue, which is to enable searches in Chinese, but I think I already found the instructions I need (change a <Connector> tag in a Tomcat config file). Best, Remy ----- Original message ----- From: "Sverre Bang" <sve...@nb...> To: <dac...@us...> CC: "archive-access-discuss" <arc...@li... e.net> Sent: Thursday 19 January 2006 10:02 Subject: Re: [ archive-access-Bugs-1409045 ] [wera] Fragment identifiers in URI's not handled correctly Hi, What version of WERA+NutchWAX are you using? Sverre |
|
From: Sverre B. <sve...@nb...> - 2006-01-19 09:02:15
|
Hi, What version of WERA+NutchWAX are you using? Sverre On Wed, 2006-01-18 at 15:56 -0800, SourceForge.net wrote: > Bugs item #1409045, was opened at 2006-01-18 14:55 > Message generated for change (Comment added) made by dachs_remy > You can respond by visiting: > https://sourceforge.net/tracker/?func=detail&atid=681137&aid=1409045&group_id=118427 > > Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. > Category: wera > Group: None > Status: Open > Resolution: None > Priority: 5 > Submitted By: Sverre Bang (sverreb) > Assigned to: Sverre Bang (sverreb) > Summary: [wera] Fragment identifiers in URI's not handled correctly > > Initial Comment: > Some URIs refer to a location within a resource. This kind of URI ends with "#" followed by an anchor identifier (fragment identifier). Wera does not handle this correctly. See > http://nwa.nb.no/wera/result.php?time=&url=http%3A%2F%2Fwww.lib.helsinki.fi%2Fsibelius%2F > for an example (try one of the links in the displayed page). > > ---------------------------------------------------------------------- > > Comment By: Remy Cristini (dachs_remy) > Date: 2006-01-19 00:56 > > Message: > Logged In: YES > user_id=1430678 > > I just got WERA running today (great!!) and encountered (what seems to be) the same problem: when browsing through different pages within an archived website, I get "Sorry, no documents with the given uri were found" when the URI ends with an identifier containing "&". It seems that the exacturl request from WERA to NutchWAX loses the last part of the uri when browsing (not when searching for the exact uri!). > > Example: > > I click on a link and the URL in the address bar shows: > > http://localhost/wera/result.php?time=20060116091645&mode=standalone&url=http://www.someserver.org/show.aspx?id=126&cid=5 > > However, the uri in the search field on the "sorry..." page just shows: > > http://www.someserver.org/show.aspx?id=126 > > Of course, when I add the &cid=5 part at the end of the search string, the requested page shows without a problem. Now the address bar shows: > > http://localhost/wera/result.php?url=http%3A%2F%2Fwww.someserver.org%2Fshow.aspx%3Fid%3D126%26cid%3D5&level=6&time=20060116091645 > > So there seems to be a difference in how a uri is requested from the NutchWAX index, depending on whether it's typed directly into the search field or whether you browse from one page to the other: it will then lose the last identifier. > > ---------------------------------------------------------------------- > > You can respond by visiting: > https://sourceforge.net/tracker/?func=detail&atid=681137&aid=1409045&group_id=118427 |
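The behaviour Remy reports is what happens when the target URL is concatenated into WERA's query string without percent-encoding. The sketch below is not WERA's actual source; the file and parameter names are taken from the example in the comment above and are only illustrative, but it shows why everything after the first unencoded '&' is dropped and how PHP's rawurlencode() preserves it:

```php
<?php
// Illustrative only -- not WERA's actual code. Shows how a browse link
// loses everything after the first '&' when the target URL is pasted
// into the query string unencoded, and how rawurlencode() avoids it.
$target = 'http://www.someserver.org/show.aspx?id=126&cid=5';

$broken = 'result.php?time=20060116091645&url=' . $target;
// PHP parses this as url=http://www.someserver.org/show.aspx?id=126
// plus a separate parameter cid=5, so the exacturl lookup misses.

$fixed = 'result.php?time=20060116091645&url=' . rawurlencode($target);
// url=http%3A%2F%2Fwww.someserver.org%2Fshow.aspx%3Fid%3D126%26cid%3D5
// decodes back to the full URL, matching the working case quoted above.
?>
```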
|
From: Sverre B. <sve...@nb...> - 2006-01-18 11:51:03
|
This note is to announce a new release 0.4.1 of WERA (WEb aRchive Access), the web archive collection search and navigation tool. Release 0.4.1 includes improved url encoding, added metadata view in Timeline, and Google-like result presentation. See the Release Notes for more detail on changes (and current known limitations) at http://archive-access.sourceforge.net/projects/wera/articles/releasenotes.html The WERA home page is at: http://archive-access.sourceforge.net/projects/wera/ Demo of WERA 0.4.1 (using NutchWAX 0.4.2): http://nwa.nb.no/wera/ Yours, Sverre Bang |
|
From: Adam B. <ad...@ya...> - 2005-12-19 17:51:21
|
Hi There, Being a member of mailing lists, I know this is really bad form, and sorry to everyone, but I am looking for Charlie Foetz, and saw from googling that he posted on 26/10 in the link below: http://sourceforge.net/mailarchive/forum.php?thread_id=8812320&forum_id=45842 Charlie, it is Adam from Reuters - can you please email me at adamwrx at yahoo.com or at Reuters with your updated email details. Apologies once again to everyone else on this mailing list. Thanks, Adam __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com |
|
From: Sverre B. <sve...@nb...> - 2005-12-14 10:04:15
|
Hi, Good that you got the jpg's displaying. Not so good the &'s. I don't have that problem myself (!?). I hope you can assist in flushing out what's wrong. Some questions related to the &-urls: 1. Does the result list display those urls correctly? I.e., are the numbers correct (e.g. 1/1 and not 0/1)? 2. Does the URL display properly in the url field of the timeline view? 3. When in the timeline view, could you take a look at the source and check if the url is correctly formatted there (the one that starts with: frame src="http://your_wera_host/wera/documentDispatcher.php?url= .....) 4. What NutchWax version are you using? And are you sure you are using the nutchwax webapp from the same version in Tomcat? 5. What Tomcat and jvm (java -version) versions are you using? 6. Could you create a new file, e.g. phpinfo.php, in the wera directory with the contents <?php phpinfo(); ?>, then point a browser to the corresponding url, save the output to a file and send that file to me. If you could just quickly report on these questions, I'm sure I can come up with some more questions ;-) Sverre Kaisa Kaunonen wrote: > Hi, > with the new package my archive pages look better than before. All jpg:s having % character in urls are now visible. > Thanks! > > I still have some problems, though. Trying to open a link > http://www.helsinki2005.fi/index.php?Name=newsitem&item454 > results in error 'Sorry, no documents with the given url, > http://www.helsinki2005.fi/index.php?Name=newsitem > were found' > > In a similar way all other links are cut after the first & (ampersand) character in the url, > > e.g. > http://www.event-travel.fi/?openmenu=16&docID=71&sitelang=en > becomes > http://www.event-travel.fi/?openmenu=16 > > Kaisa > > On Thu, 8 Dec 2005, Sverre Bang wrote: > >> Sorry about that. Here is a manual pack, not including the retriever war file (use the one you got). >> >> You have to update lib/config.inc. >> >> Sverre >> >> -----Original Message----- >> From: Kaisa Kaunonen [mailto:kau...@cs...] >> Sent: Thu 08.12.2005 13:42 >> To: Sverre Bang >> Cc: st...@ar... >> Subject: Re: [Archive-access-discuss] Wera 0.4.0 and &%s in urls >> >> Hi, >> the java installer has never worked here. Even the new version below doesn't work, so I'd need a package to install manually. (I have the strange PC, WinaXe and Unix combination discussed before.) >> >> But anyway it's fine that a new version is available. >> >> Kaisa >> >> On Thu, 8 Dec 2005, Sverre Bang wrote: >> >>> Hi Kaisa, >>> I've rewritten the url encoding stuff in Wera (and a few other things as well ;-) ). Could you please try it out? >>> >>> http://nwa.nb.no/tmp/wera-200512081017-installer.jar >>> >>> I've tested on nutchWax 0.4.2 >>> >>> Sverre |
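The diagnostic file Sverre asks for above is a one-line PHP script; a minimal version, using the file name given in the mail, would be:

```php
<?php
// wera/phpinfo.php -- dumps the PHP build and configuration so the
// settings relevant to the encoding problem can be inspected.
phpinfo();
?>
```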
|
From: <mat...@ce...> - 2005-12-13 16:34:00
|
Hi, we still have a few problems with output from nutchwax... I'd like to ask you for a little help. The XML output from the nutchwax servlet http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l+louck%C3%BD&hitsPerDup=1&hitsPerPage=10 consists of HTML entities (like Graduál). Is that right? Should Nutchwax return XML with HTML entities instead of characters in UTF-8 (like gradu%C3%A1l)? And what's the difference between these cases? 1) http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l+louck%C3%BD&start=0&hitsPerDup=0&hitsPerPage=10&dedupField=exacturl -> output is not valid XML (called from WERA) 2) http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l%20louck%C3%BD&start=0&hitsPerPage=10&hitsPerDup=1&dedupField=exacturl -> output is valid XML (called from NutchWAX search.jsp) Another confusing issue for me :) characters in the "title" element are displayed correctly, but text in the "description" element consists of HTML entities (as I described above). Thanks for any help, Lukas |
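The asymmetry Lukas sees (readable text in "title", character entities in "description") can be smoothed over on the consuming side by decoding the entities after the XML has been read. This is only a sketch of that idea, not code from either project; the sample string is hypothetical, and full UTF-8 support in html_entity_decode depends on the PHP version in use:

```php
<?php
// Sketch only: normalise an entity-encoded <description> value from the
// opensearch XML back to plain UTF-8 text on the WERA side.
$description = 'Gradu&aacute;l louck&yacute;';   // hypothetical entity-encoded snippet

$plain = html_entity_decode($description, ENT_QUOTES, 'UTF-8');
// $plain is now "Graduál loucký", matching how <title> already appears.
?>
```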
|
From: stack <st...@ar...> - 2005-11-29 01:37:09
|
Following up on a note posted earlier (reproduced below), we've just uploaded release 0.4.2 of nutchwax -- the last release based on Nutch-0.7. This new release has a few minor fixes mostly build cleanup including making the binary work with Java 1.4.x and fixes to make all links relative rather than absolute so the webapp can sit behind an apache proxy pass-through: E.g. See http://websearch.archive.org/ (The Katrina collection search is also reachable off the archive's home page as our first publically-accessible full-text search. Look for Katrina in the top-left announcements). The default search.jsp now also has Google-like paging through search results thanks to Dan Avery. Finally, also added some small functionality needed by WERA. From here on out, nutchwax will be dependent on the mapreduce-based Nutch. As has been noted below, the move to mapreduce is a pretty radical break with how things have been done in the past but intent is to provide tools and documentation to smooth the transition. Will keep the list updated with progress. Yours, St.Ack Subject: [Archive-access-discuss] Nutchwax future: 0.6.0 to be mapreduce-based. From: Michael Stack <st...@du...> Date: Thu, 17 Nov 2005 18:00:16 -0800 To: arc...@li... This note advocates that the next major nutchwax release be based on a mapreduce version of nutch and use the nutch distributed filesystem, NDFS (See references at end of the following paper if you want to learn more about NDFS, mapreduce and mapreduce in nutch: http://archive-access.sourceforge.net/projects/nutch/iwaw/iwaw-wacsearch.pdf). Here at the Archive, we've been bumping up against the limitations of the nutch-0.7.x release pre-mapreduce model trying to index collections >100million documents, so we've started moving nutchwax to work atop NDFS and the coming mapreduce version of nutch (See the 'mapred' branch in nutchwax and the 'mapred' branch in nutch). Doug Cutting has been doing the bulk of this development work and the running of initial indexings. So far, things look good. An index made of a collection of 200million URLs ran to completion atop a rack of 30 odd machines. While the indexing speed is currently not that impressive -- the rack is comprised of lowly processors and there remains much room for optimization -- the amount of oversight required to complete the task was close to zero (This is impressive). We also want to move onto mapreduce and NDFS because we see it getting us over incremental indexing woes we've run into using current nutchwax: segments and the webdb have outgrown large disks and various steps in the process take interminably long to complete. A mapreduce based nutchwax internally operates in a manner very different from how current nutchwax works; documentation, plugins, wrapper, tools, and packaging scripts will all have to be adapted/developed to go against the new mapreduce model (More on how the two models differ in correspondence to follow). While we could try carrying-forward the two branches of nutchwax, keeping the two branches in sync -- especially when the underlying model differs so -- would take a lot of work. Rather, lets just switch completely to be mapreduce (and NDFS) going forward. Advantages are outlined at above. Main disadvantage is that mapreduce-based nutch is still in early-stages of development. 
I was thinking I'd make one more point release of nutchwax in the next week or so based on Nutch-0.7 to fix a couple of critical issues and to add functionality WERA will need for its next release -- see list below -- but that thereafter, the next major release would be a mapreduce/NDFS-based nutchwax, a version 0.6.0 (perhaps coordinated with a mapreduce-based nutch release, likely 0.8.0). I'd switch HEAD to be mapreduce immediately after the point release, but I'd imagine it'd be the new year before sufficient work was done to release nutchwax 0.6.0. Comments/thoughts on the above are appreciated, as is word of any critical nutchwax bug or feature missing from the list below. Yours, St.Ack Bugs: 1354276 [nutchwax] Still have URL encoding issues <http://sourceforge.net/tracker/index.php?func=detail&aid=1354276&group_id=118427&atid=681137> (2005-11-11 10:27, priority 8, Open, stack-sf). 1312200 [nutchwax+wera] Pages at end of redirects not found <http://sourceforge.net/tracker/index.php?func=detail&aid=1312200&group_id=118427&atid=681137> (2005-10-03 12:06, priority 8, Open, stack-sf). Features: 1247519 [nutchwax] next/previous in search.jsp needs improvement <http://sourceforge.net/tracker/index.php?func=detail&aid=1247519&group_id=118427&atid=681140> (2005-07-29 08:33, priority 8, Open, nobody, stack-sf). |
|
From: stack <st...@ar...> - 2005-11-21 19:31:51
|
kau...@cs... wrote: >Hi, >we are planning here a new hardware configuration for February 2006. >It will have big disks, but such a directory tree that maximum size for >a leaf directory is 2 TB. > > >I suppose that this will be no problem for future Nutchwax, because it >can read arcs from a directory structure instead of a single large >directory and keep the index in pieces not larger than 2 TB ? > > > As you say, the 2TB limit should not be a problem. Using current nutchwax/nutch, only part that could overwhelm -- because all else on disk can be partitioned -- is the webdb. I'm guessing you'd have to have a collection of >300million documents to bump up against the 2TB limit. When we move nutchwax to the mapreduce/ndfs-based nutchwax, all indexing data and the link db are kept in the distributed file system. If the NDFS footprint on a particular disk threatens the 2TB limit, add more machines (or new directories) to NDFS. For performance, it will be better to keep the merged index outside of NDFS on the native filesystem. Long before this approaches 2TB maximum size, it'll have made sense to distribute the querying across multiple instances of the query webapp. St.Ack >I've been reading about 'Full indexing' from Nutchwax operation >instructions. But I haven't tried indexing *big* archives in practice. > >On 11/18/2005, "Michael Stack" <st...@du...> wrote: > > >>This note advocates that the next major nutchwax release be based on a >>mapreduce version of nutch and use the nutch distributed filesystem, >>NDFS (See references at end of the following paper if you want to learn >>more about NDFS, mapreduce and mapreduce in nutch: >>http://archive-access.sourceforge.net/projects/nutch/iwaw/iwaw-wacsearch.pdf). >> >>Here at the Archive, we've been bumping up against the limitations of >>the nutch-0.7.x release pre-mapreduce model trying to index collections >> >100million documents, so we've started moving nutchwax to work atop >>NDFS and the coming mapreduce version of nutch (See the 'mapred' branch >>in nutchwax and the 'mapred' branch in nutch). Doug Cutting has been >>doing the bulk of this development work and the running of initial >>indexings. So far, things look good. An index made of a collection of >>200million URLs ran to completion atop a rack of 30 odd machines. While >>the indexing speed is currently not that impressive -- the rack is >>comprised of lowly processors and there remains much room for >>optimization -- the amount of oversight required to complete the task >>was close to zero (This is impressive). >> >>We also want to move onto mapreduce and NDFS because we see it getting >>us over incremental indexing woes we've run into using current nutchwax: >>segments and the webdb have outgrown large disks and various steps in >>the process take interminably long to complete. >> >>A mapreduce based nutchwax internally operates in a manner very >>different from how current nutchwax works; documentation, plugins, >>wrapper, tools, and packaging scripts will all have to be >>adapted/developed to go against the new mapreduce model (More on how the >>two models differ in correspondence to follow). While we could try >>carrying-forward the two branches of nutchwax, keeping the two branches >>in sync -- especially when the underlying model differs so -- would take >>a lot of work. Rather, lets just switch completely to be mapreduce (and >>NDFS) going forward. >> >>Advantages are outlined at above. 
Main disadvantage is that >>mapreduce-based nutch is still in early-stages of development. >> >>I was thinking I'd make one more point release of nutchwax in the next >>week or so based on Nutch-0.7 to fix a couple of critical issues and to >>add functionality WERA will need for its next release -- see list below >>-- but that thereafter, the next major release would be a >>mapreduce/NDFS-based nutchwax, a version 0.6.0 (Perhaps coordinated with >>a mapreduce-based nutch release, likely 0.8.0). I'd switch HEAD to be >>mapreduce immediately after the point release but I'd imagine it'd be >>the new year before sufficent work was done to release nutchwax 0.6.0. >> >>Comments/thoughts on above appreciated or if you think there is a >>critical nutchwax bug or feature missing from the list below. >> >>Yours, >>St.Ack >> >> >>Bugs: >> >>1354276 [nutchwax] Still have URL encoding issues >><http://sourceforge.net/tracker/index.php?func=detail&aid=1354276&group_id=118427&atid=681137> >> 2005-11-11 10:27 8 Open stack-sf >><http://sourceforge.net/users/stack-sf/>Project Admin >><http://sourceforge.net/help/icon_legend.php?context=group_admin&uname=stack-sf&this_group=118427&return_to=%2F> >> stack-sf <http://sourceforge.net/users/stack-sf/>Project Admin >><http://sourceforge.net/help/icon_legend.php?context=group_admin&uname=stack-sf&this_group=118427&return_to=%2F> >> >>1312200 [nutchwax+wera] Pages at end of redirects not found. >><http://sourceforge.net/tracker/index.php?func=detail&aid=1312200&group_id=118427&atid=681137> >> ** 2005-10-03 12:06 * 8 Open stack-sf >><http://sourceforge.net/users/stack-sf/>Project Admin >><http://sourceforge.net/help/icon_legend.php?context=group_admin&uname=stack-sf&this_group=118427&return_to=%2F> >> stack-sf <http://sourceforge.net/users/stack-sf/>Project Admin >><http://sourceforge.net/help/icon_legend.php?context=group_admin&uname=stack-sf&this_group=118427&return_to=%2F> >> >> >> >> >>Features: >> >>1247519 [nutchwax] next/previous in search.jsp needs improvement >><http://sourceforge.net/tracker/index.php?func=detail&aid=1247519&group_id=118427&atid=681140> >> ** 2005-07-29 08:33 * 8 Open nobody stack-sf >><http://sourceforge.net/users/stack-sf/>Project Admin >><http://sourceforge.net/help/icon_legend.php?context=group_admin&uname=stack-sf&this_group=118427&return_to=%2F> >> >> >> >> >> >> >> >>------------------------------------------------------- >>This SF.Net email is sponsored by the JBoss Inc. Get Certified Today >>Register for a JBoss Training Course. Free Certification Exam >>for All Training Attendees Through End of 2005. For more info visit: >>http://ads.osdn.com/?ad_id=7628&alloc_id=16845&op=click >>_______________________________________________ >>Archive-access-discuss mailing list >>Arc...@li... >>https://lists.sourceforge.net/lists/listinfo/archive-access-discuss >> >> > > >------------------------------------------------------- >This SF.Net email is sponsored by the JBoss Inc. Get Certified Today >Register for a JBoss Training Course. Free Certification Exam >for All Training Attendees Through End of 2005. For more info visit: >http://ads.osdn.com/?ad_idv28&alloc_id845&op=click >_______________________________________________ >Archive-access-discuss mailing list >Arc...@li... >https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > |
|
From: stack <st...@ar...> - 2005-11-21 19:06:56
|
I just tried it with the nutchwax 0.4.1 binary and seems to parse pdfs fine. The exception below is a complaint about a missing method and a missing class. Do you build nutchwax yourself Lukáą? If so, did you match the nutch and nutchwax versions? (Nutchwax works with Nutch 0.7). Maybe you've misnamed the extension point class? Compare against my parse-ext plugin.xml below. Do you have your own wrapper script for the nutch process? Maybe you've left off mention of the plugin dirs? (The parse-ext is present in your plugin dir?). St.Ack <?xml version="1.0" encoding="UTF-8"?> <plugin id="parse-ext" name="External Parser Plug-in" version="1.0.0" provider-name="nutch.org"> <extension-point id="org.apache.nutch.parse.Parser" name="Nutch Content Parser"/> <runtime> <library name="parse-ext.jar"> <export name="*"/> </library> </runtime> <extension id="org.apache.nutch.parse.ext" name="ExtParse" point="org.apache.nutch.parse.Parser"> <implementation id="ExtParser" class="org.apache.nutch.parse.ext.ExtParser" contentType="application/pdf" pathSuffix="pdf" command="/home/stack/workspace/archive-access/projects/nutch/target/distributions/nutchwax-0.4.1/bin/parse-pdf.sh" timeout="30"/> </extension> </plugin> Lukáš Matějka wrote: >Hi, > >i changed path in parser-ext to my local pdf-parser > >have any idea? > >adding 103214 bytes of mimetype application/pdf http://www.inforum.cz/infomedia98/pdf/telenor.pdf >051118 163001 SEVERE Error processing /home/nwa/nutchwax/data//queue/NEDLIB--20051116141001-00000.arc.gz >java.lang.NoSuchMethodError: org.apache.nutch.plugin.ExtensionPoint.getExtensions()[Lorg/apache/nutch/plugin/Extensions; > at org.apache.nutch.parse.ext.ExtParser.<clinit>(ExtParser.java:60) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source) > at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source) > at java.lang.reflect.Constructor.newInstance(Unknown Source) > at java.lang.Class.newInstance0(Unknown Source) > at java.lang.Class.newInstance(Unknown Source) > at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:144) > at org.apache.nutch.parse.ParserFactory.getParser(ParserFactory.java:63) > at org.archive.access.nutch.Arc2Segment.addRecord(Arc2Segment.java:243) > at org.archive.access.nutch.Arc2Segment.addArc(Arc2Segment.java:146) > at org.archive.access.nutch.Arc2Segment.main(Arc2Segment.java:326) > > > >051118 163001 adding 80205 bytes of mimetype application/pdf http://full.nkp.cz/nkkr/pdf/0101/nk0101051.pdf >051118 163001 SEVERE Error processing /home/nwa/nutchwax/data//queue/NEDLIB--20051116141314-00000.arc.gz >java.lang.NoClassDefFoundError > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source) > at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source) > at java.lang.reflect.Constructor.newInstance(Unknown Source) > at java.lang.Class.newInstance0(Unknown Source) > at java.lang.Class.newInstance(Unknown Source) > at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:144) > at org.apache.nutch.parse.ParserFactory.getParser(ParserFactory.java:63) > at org.archive.access.nutch.Arc2Segment.addRecord(Arc2Segment.java:243) > at org.archive.access.nutch.Arc2Segment.addArc(Arc2Segment.java:146) > at org.archive.access.nutch.Arc2Segment.main(Arc2Segment.java:326) > >thanks for comments > >lukas > > > 
>------------------------------------------------------- >This SF.Net email is sponsored by the JBoss Inc. Get Certified Today >Register for a JBoss Training Course. Free Certification Exam >for All Training Attendees Through End of 2005. For more info visit: >http://ads.osdn.com/?ad_id=7628&alloc_id=16845&op=click >_______________________________________________ >Archive-access-discuss mailing list >Arc...@li... >https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > |
|
From: <mat...@ce...> - 2005-11-21 11:35:36
|
Hi, i changed path in parser-ext to my local pdf-parser have any idea? adding 103214 bytes of mimetype application/pdf http://www.inforum.cz/infomedia98/pdf/telenor.pdf 051118 163001 SEVERE Error processing /home/nwa/nutchwax/data//queue/NEDLIB--20051116141001-00000.arc.gz java.lang.NoSuchMethodError: org.apache.nutch.plugin.ExtensionPoint.getExtensions()[Lorg/apache/nutch/plugin/Extensions; at org.apache.nutch.parse.ext.ExtParser.<clinit>(ExtParser.java:60) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source) at java.lang.reflect.Constructor.newInstance(Unknown Source) at java.lang.Class.newInstance0(Unknown Source) at java.lang.Class.newInstance(Unknown Source) at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:144) at org.apache.nutch.parse.ParserFactory.getParser(ParserFactory.java:63) at org.archive.access.nutch.Arc2Segment.addRecord(Arc2Segment.java:243) at org.archive.access.nutch.Arc2Segment.addArc(Arc2Segment.java:146) at org.archive.access.nutch.Arc2Segment.main(Arc2Segment.java:326) 051118 163001 adding 80205 bytes of mimetype application/pdf http://full.nkp.cz/nkkr/pdf/0101/nk0101051.pdf 051118 163001 SEVERE Error processing /home/nwa/nutchwax/data//queue/NEDLIB--20051116141314-00000.arc.gz java.lang.NoClassDefFoundError at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source) at java.lang.reflect.Constructor.newInstance(Unknown Source) at java.lang.Class.newInstance0(Unknown Source) at java.lang.Class.newInstance(Unknown Source) at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:144) at org.apache.nutch.parse.ParserFactory.getParser(ParserFactory.java:63) at org.archive.access.nutch.Arc2Segment.addRecord(Arc2Segment.java:243) at org.archive.access.nutch.Arc2Segment.addArc(Arc2Segment.java:146) at org.archive.access.nutch.Arc2Segment.main(Arc2Segment.java:326) thanks for comments lukas |
|
From: <kau...@cs...> - 2005-11-21 08:43:37
|
Hi, we are planning here a new hardware configuration for February 2006. It will have big disks, but such a directory tree that maximum size for a leaf directory is 2 TB. I suppose that this will be no problem for future Nutchwax, because it can read arcs from a directory structure instead of a single large directory and keep the index in pieces not larger than 2 TB ? I've been reading about 'Full indexing' from Nutchwax operation instructions. But I haven't tried indexing *big* archives in practice. On 11/18/2005, "Michael Stack" <st...@du...> wrote: > This note advocates that the next major nutchwax release be based on a > mapreduce version of nutch and use the nutch distributed filesystem, > NDFS (See references at end of the following paper if you want to learn > more about NDFS, mapreduce and mapreduce in nutch: > http://archive-access.sourceforge.net/projects/nutch/iwaw/iwaw-wacsearch.pd= f). >=20 > Here at the Archive, we've been bumping up against the limitations of > the nutch-0.7.x release pre-mapreduce model trying to index collections > >100million documents, so we've started moving nutchwax to work atop > NDFS and the coming mapreduce version of nutch (See the 'mapred' branch > in nutchwax and the 'mapred' branch in nutch). Doug Cutting has been > doing the bulk of this development work and the running of initial > indexings. So far, things look good. An index made of a collection of > 200million URLs ran to completion atop a rack of 30 odd machines. While > the indexing speed is currently not that impressive -- the rack is > comprised of lowly processors and there remains much room for > optimization -- the amount of oversight required to complete the task > was close to zero (This is impressive). >=20 > We also want to move onto mapreduce and NDFS because we see it getting > us over incremental indexing woes we've run into using current nutchwax: > segments and the webdb have outgrown large disks and various steps in > the process take interminably long to complete. >=20 > A mapreduce based nutchwax internally operates in a manner very > different from how current nutchwax works; documentation, plugins, > wrapper, tools, and packaging scripts will all have to be > adapted/developed to go against the new mapreduce model (More on how the > two models differ in correspondence to follow). While we could try > carrying-forward the two branches of nutchwax, keeping the two branches > in sync -- especially when the underlying model differs so -- would take > a lot of work. Rather, lets just switch completely to be mapreduce (and > NDFS) going forward. >=20 > Advantages are outlined at above. Main disadvantage is that > mapreduce-based nutch is still in early-stages of development. >=20 > I was thinking I'd make one more point release of nutchwax in the next > week or so based on Nutch-0.7 to fix a couple of critical issues and to > add functionality WERA will need for its next release -- see list below > -- but that thereafter, the next major release would be a > mapreduce/NDFS-based nutchwax, a version 0.6.0 (Perhaps coordinated with > a mapreduce-based nutch release, likely 0.8.0). I'd switch HEAD to be > mapreduce immediately after the point release but I'd imagine it'd be > the new year before sufficent work was done to release nutchwax 0.6.0. >=20 > Comments/thoughts on above appreciated or if you think there is a > critical nutchwax bug or feature missing from the list below. 
>=20 > Yours, > St.Ack >=20 >=20 > Bugs: >=20 > 1354276 =09[nutchwax] Still have URL encoding issues > <http://sourceforge.net/tracker/index.php?func=3Ddetail&aid=3D1354276&group= _id=3D118427&atid=3D681137> > =09 2005-11-11 10:27 =098 =09Open =09stack-sf > <http://sourceforge.net/users/stack-sf/>Project Admin > <http://sourceforge.net/help/icon_legend.php?context=3Dgroup_admin&uname=3D= stack-sf&this_group=3D118427&return_to=3D%2F> > =09stack-sf <http://sourceforge.net/users/stack-sf/>Project Admin > <http://sourceforge.net/help/icon_legend.php?context=3Dgroup_admin&uname=3D= stack-sf&this_group=3D118427&return_to=3D%2F> >=20 > 1312200 =09[nutchwax+wera] Pages at end of redirects not found. > <http://sourceforge.net/tracker/index.php?func=3Ddetail&aid=3D1312200&group= _id=3D118427&atid=3D681137> > =09** 2005-10-03 12:06 * =098 =09Open =09stack-sf > <http://sourceforge.net/users/stack-sf/>Project Admin > <http://sourceforge.net/help/icon_legend.php?context=3Dgroup_admin&uname=3D= stack-sf&this_group=3D118427&return_to=3D%2F> > =09stack-sf <http://sourceforge.net/users/stack-sf/>Project Admin > <http://sourceforge.net/help/icon_legend.php?context=3Dgroup_admin&uname=3D= stack-sf&this_group=3D118427&return_to=3D%2F> >=20 >=20 >=20 >=20 > Features: >=20 > 1247519 =09[nutchwax] next/previous in search.jsp needs improvement > <http://sourceforge.net/tracker/index.php?func=3Ddetail&aid=3D1247519&group= _id=3D118427&atid=3D681140> > =09** 2005-07-29 08:33 * =098 =09Open =09nobody =09stack-sf > <http://sourceforge.net/users/stack-sf/>Project Admin > <http://sourceforge.net/help/icon_legend.php?context=3Dgroup_admin&uname=3D= stack-sf&this_group=3D118427&return_to=3D%2F> >=20 >=20 >=20 >=20 >=20 >=20 >=20 > ------------------------------------------------------- > This SF.Net email is sponsored by the JBoss Inc. Get Certified Today > Register for a JBoss Training Course. Free Certification Exam > for All Training Attendees Through End of 2005. For more info visit: > http://ads.osdn.com/?ad_id=3D7628&alloc_id=3D16845&op=3Dclick > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
|
From: Michael S. <st...@du...> - 2005-11-18 02:00:40
|
This note advocates that the next major nutchwax release be based on a mapreduce version of nutch and use the nutch distributed filesystem, NDFS (See references at end of the following paper if you want to learn more about NDFS, mapreduce and mapreduce in nutch: http://archive-access.sourceforge.net/projects/nutch/iwaw/iwaw-wacsearch.pdf). Here at the Archive, we've been bumping up against the limitations of the nutch-0.7.x release pre-mapreduce model trying to index collections >100million documents, so we've started moving nutchwax to work atop NDFS and the coming mapreduce version of nutch (See the 'mapred' branch in nutchwax and the 'mapred' branch in nutch). Doug Cutting has been doing the bulk of this development work and the running of initial indexings. So far, things look good. An index made of a collection of 200million URLs ran to completion atop a rack of 30 odd machines. While the indexing speed is currently not that impressive -- the rack is comprised of lowly processors and there remains much room for optimization -- the amount of oversight required to complete the task was close to zero (This is impressive). We also want to move onto mapreduce and NDFS because we see it getting us over incremental indexing woes we've run into using current nutchwax: segments and the webdb have outgrown large disks and various steps in the process take interminably long to complete. A mapreduce based nutchwax internally operates in a manner very different from how current nutchwax works; documentation, plugins, wrapper, tools, and packaging scripts will all have to be adapted/developed to go against the new mapreduce model (More on how the two models differ in correspondence to follow). While we could try carrying-forward the two branches of nutchwax, keeping the two branches in sync -- especially when the underlying model differs so -- would take a lot of work. Rather, lets just switch completely to be mapreduce (and NDFS) going forward. Advantages are outlined at above. Main disadvantage is that mapreduce-based nutch is still in early-stages of development. I was thinking I'd make one more point release of nutchwax in the next week or so based on Nutch-0.7 to fix a couple of critical issues and to add functionality WERA will need for its next release -- see list below -- but that thereafter, the next major release would be a mapreduce/NDFS-based nutchwax, a version 0.6.0 (Perhaps coordinated with a mapreduce-based nutch release, likely 0.8.0). I'd switch HEAD to be mapreduce immediately after the point release but I'd imagine it'd be the new year before sufficent work was done to release nutchwax 0.6.0. Comments/thoughts on above appreciated or if you think there is a critical nutchwax bug or feature missing from the list below. Yours, St.Ack Bugs: 1354276 [nutchwax] Still have URL encoding issues <http://sourceforge.net/tracker/index.php?func=detail&aid=1354276&group_id=118427&atid=681137> 2005-11-11 10:27 8 Open stack-sf <http://sourceforge.net/users/stack-sf/>Project Admin <http://sourceforge.net/help/icon_legend.php?context=group_admin&uname=stack-sf&this_group=118427&return_to=%2F> stack-sf <http://sourceforge.net/users/stack-sf/>Project Admin <http://sourceforge.net/help/icon_legend.php?context=group_admin&uname=stack-sf&this_group=118427&return_to=%2F> 1312200 [nutchwax+wera] Pages at end of redirects not found. 
<http://sourceforge.net/tracker/index.php?func=detail&aid=1312200&group_id=118427&atid=681137> ** 2005-10-03 12:06 * 8 Open stack-sf <http://sourceforge.net/users/stack-sf/>Project Admin <http://sourceforge.net/help/icon_legend.php?context=group_admin&uname=stack-sf&this_group=118427&return_to=%2F> stack-sf <http://sourceforge.net/users/stack-sf/>Project Admin <http://sourceforge.net/help/icon_legend.php?context=group_admin&uname=stack-sf&this_group=118427&return_to=%2F> Features: 1247519 [nutchwax] next/previous in search.jsp needs improvement <http://sourceforge.net/tracker/index.php?func=detail&aid=1247519&group_id=118427&atid=681140> ** 2005-07-29 08:33 * 8 Open nobody stack-sf <http://sourceforge.net/users/stack-sf/>Project Admin <http://sourceforge.net/help/icon_legend.php?context=group_admin&uname=stack-sf&this_group=118427&return_to=%2F> |
|
From: stack <st...@ar...> - 2005-11-11 18:28:18
|
kau...@cs... wrote: >On 11/11/2005, "stack" <st...@ar...> wrote: > > > >>The below should be fixed by upgrade to nutchwax 0.4.1, if you haven't >>already. >> >>St.Ack >> >> > > > Sounds like still work to do. Thanks for the detailed report Kaisa (I've pasted below into new encoding issue and will try and figure whats going on). St.Ack >Yes I noticed that errors in my archive were of the same type as reported >later by other people. After installing nutchwax 0.4.1 the archive looks >better now, thanks very much. > >I still have some isolated cases where a file is inside archive but wera >shows 'not found'. Here are some examples of problem urls > >Perhaps '&' right after '?' is too much >http://www.helsinki2005.fi/index.php?&Lang=eng >http://www.helsinki2005.fi/index.php?&Name=xteams >http://www.helsinki2005.fi/index.php?&Name=tickets > >http://www.noc.fi/mp/db/tiedotteet/foo/IMG?num=18157&FIELD=kuva0_kk&R=897830 >http://www.noc.fi/taustasivut/artikkeliarkisto/?num=17075&JKNUM=17075 >http://www.slu.fi/mp/db/tiedotteet/foo/IMG?num=29512&FIELD=kuva0_pieni&R=064184 > >Other urls with '&' work fine, but these with 'Name=something&' do >not. >http://www.helsinki2005.fi/index.php?Name=newsitem&item=322 >http://www.helsinki2005.fi/index.php?Name=newsitem&item=405 >http://www.helsinki2005.fi/index.php?Name=tickets&lang=eng > >When I make in wera a query >url:http://www.helsinki2005.fi/index.php?Name=tickets&lang=eng >it reports 102 hits, the first one being >http://www.helsinki2005.fi/index.php?Name=tickets_1 >but wera only wants to display hits 1-10 and 11-16. > >For some reason all images with a '%' character in url still refuse to >come out. This could apply to html file urls as well if there were any >in the archive. >http://www.helsinki2005.fi/files/pics/1079364264_mascot%20medium.gif > >I'm not sure which part of nutchwax&wera combination causes it. > >Kaisa > > >------------------------------------------------------- >SF.Net email is sponsored by: >Tame your development challenges with Apache's Geronimo App Server. Download >it for free - -and be entered to win a 42" plasma tv or your very own >Sony(tm)PSP. Click here to play: http://sourceforge.net/geronimo.php >_______________________________________________ >Archive-access-discuss mailing list >Arc...@li... >https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > |
|
From: <kau...@cs...> - 2005-11-11 12:39:10
|
That would be a useful option to have. On 11/10/2005, "stack" <st...@ar...> wrote: > It's almost as though we shouldn't be counting outlink anchor text as part of the current document, especially for the (common-on-front-of-sites-like-newspapers) case you describe below. > > Should we make an option for the parse-html that excludes outlink anchor text? > > St.Ack > > kau...@cs... wrote: > >Here's the example again, in pseudo html, hopefully browsers don't warp this one. > > > ><pre> > >----- Page start ----- > >[h1]English cricket results 9.11.2005[/h1] > > > >........ lots of text ..... > > > >[h2]Other top stories now[/h2] > > > >[a href="http://www.tekstia.com/news/top/earthquake+timbuktu/107994"] > >Earthquake in Timbuktu second time this year[/a] > > > >[a href="http://www.tekstia.com/news/top/tokyo+stocks+explode/107996"] > >Tokyo stocks explode[/a] > > > >10 further links to different subjects .. > > > >----- Page end ----- > ></pre> |
|
From: <kau...@cs...> - 2005-11-11 12:35:07
|
On 11/11/2005, "stack" <st...@ar...> wrote: > The below should be fixed by upgrade to nutchwax 0.4.1, if you haven't already. > > St.Ack Yes, I noticed that errors in my archive were of the same type as reported later by other people. After installing nutchwax 0.4.1 the archive looks better now, thanks very much. I still have some isolated cases where a file is inside the archive but wera shows 'not found'. Here are some examples of problem urls. Perhaps '&' right after '?' is too much: http://www.helsinki2005.fi/index.php?&Lang=eng http://www.helsinki2005.fi/index.php?&Name=xteams http://www.helsinki2005.fi/index.php?&Name=tickets http://www.noc.fi/mp/db/tiedotteet/foo/IMG?num=18157&FIELD=kuva0_kk&R=897830 http://www.noc.fi/taustasivut/artikkeliarkisto/?num=17075&JKNUM=17075 http://www.slu.fi/mp/db/tiedotteet/foo/IMG?num=29512&FIELD=kuva0_pieni&R=064184 Other urls with '&' work fine, but these with 'Name=something&' do not: http://www.helsinki2005.fi/index.php?Name=newsitem&item=322 http://www.helsinki2005.fi/index.php?Name=newsitem&item=405 http://www.helsinki2005.fi/index.php?Name=tickets&lang=eng When I make in wera a query url:http://www.helsinki2005.fi/index.php?Name=tickets&lang=eng it reports 102 hits, the first one being http://www.helsinki2005.fi/index.php?Name=tickets_1 but wera only wants to display hits 1-10 and 11-16. For some reason all images with a '%' character in the url still refuse to come out. This could apply to html file urls as well if there were any in the archive. http://www.helsinki2005.fi/files/pics/1079364264_mascot%20medium.gif I'm not sure which part of the nutchwax&wera combination causes it. Kaisa |
|
From: <kau...@cs...> - 2005-11-09 12:43:22
|
On 11/7/2005, "stack" <st...@ar...> wrote: > >Many news sites have a structure which may distort ranking of search results. On these sites each page focuses on one current news item but it also has loads of tiny links to other news, like 'other top stories', 'news of previous day' or 'stories of this week'. > Just to be clear, in the above, you mean that the outlink anchor text says 'other top stories' and 'stories of this week'? Hi, I tried to describe pages and links like ------- Page start ------- <h1>English cricket results 9.11.2005</h1> ........ lots of text ..... <h2>Other top stories now</h2> <a href="http://www.tekstia.com/news/top/earthquake+timbuktu/107994"> Earthquake in Timbuktu second time this year</a> <a href="http://www.tekstia.com/news/top/tokyo+stocks+explode/107996"> Tokyo stocks explode</a> 10 further links to different subjects .. ---- Page end ------------- Above, the links are <a href=url>text</a>, and both the text and the url contain words which actually don't belong to the body text of the cricket news article. > >In an indexed archive, you could search for 'earth quake' and find close to top search results a page with headline 'cricket results'. Reason being that the cricket page has one link to earth quake news of the previous day. > > > >This feature is noticeable when there are few pages with real earth quake reports but lots of other pages having links to them. > > > Yes. Makes sense. > > You can set the boost on inlink anchor text -- see the just-added FAQ -- but it looks like you want to be able to set separately the boost on a document's outlink anchor text. Looking at the Nutch html parser code, currently the outlink anchor text just gets added to the StringBuffer accumulating all document parsed 'text'; the outlink anchor text is not distinguished in any way from the general text of the document. There is currently no means of making its boost be different from that of the general document 'text'. > > Should we add such a feature Kaisa? |
|
From: stack <st...@ar...> - 2005-11-07 20:47:26
|
kau...@cs... wrote: >Hi and thanks for the addition to faq. > >Many news sites have a structure which may distort ranking of search results. On these sites each page focuses on one current news item but it also has loads of tiny links to other news, like 'other top stories', 'news of previous day' or 'stories of this week'. Just to be clear, in the above, you mean that the outlink anchor text says 'other top stories' and 'stories of this week'? >In an indexed archive, you could search for 'earth quake' and find close to top search results a page with headline 'cricket results'. Reason being that the cricket page has one link to earth quake news of the previous day. > >This feature is noticeable when there are few pages with real earth quake reports but lots of other pages having links to them. Yes. Makes sense. >Link texts should have a low priority in indexing .. probably I can make this happen when I find the correct parameters. You can set the boost on inlink anchor text -- see the just-added FAQ -- but it looks like you want to be able to set separately the boost on a document's outlink anchor text. Looking at the Nutch html parser code, currently the outlink anchor text just gets added to the StringBuffer accumulating all document parsed 'text'; the outlink anchor text is not distinguished in any way from the general text of the document. There is currently no means of making its boost be different from that of the general document 'text'. Should we add such a feature Kaisa? Yours, St.Ack >kaisa > >On 11/2/2005, "stack" <st...@ar...> wrote: >>It should be doing this for you Kaisa. In general, are you not seeing the most significant links showing first in results? |
|
From: <kau...@cs...> - 2005-11-05 11:49:47
|
Hi and thanks for the addition to faq. Many news sites have a structure which may distort ranking of search results. On these sites each page focuses on one current news item but it also has loads of tiny links to other news, like 'other top stories', 'news of previous day' or 'stories of this week'. In an indexed archive, you could search for 'earth quake' and find close to top search results a page with headline 'cricket results'. Reason being that the cricket page has one link to earth quake news of previous day. This feature is noticeable when there are few pages with real earth quake reports but lots of other pages having links to them. Link texts should have a low priority in indexing .. probably I can make this happen when I find the correct parameters. kaisa On 11/2/2005, "stack" <st...@ar...> wrote: > It should be doing this for you Kaisa. In general, are you not seeing > the most significant links showing first in results? |
|
From: <st...@ar...> - 2005-11-04 15:28:10
|
Thanks to all for helping isolate the problem. St.Ack |
|
From: stack <st...@ar...> - 2005-11-03 19:14:25
|
Kristinn Sigurdsson wrote:

>Looks like you managed to fix the problem on your end.
>
>The issue of 0/0 versions is tied to the incorrect encoding of characters in the XML (namely & being double-escaped to &amp;amp; rather than just &amp;). This causes any URIs that contain & (or other special characters like < and >) to show up as 0/0 versions.
>
>Any idea what fixed the problem?

Looks like an old version of the NutchWAX WAR file, with an earlier version of OpenSearchServlet, fixed the problem. Trying to get to the bottom of it. Will update the list when I learn more (it looks like a point release of NutchWAX will be needed).
St.Ack

>- Kris
>
>>-----Original Message-----
>>From: arc...@li... [mailto:arc...@li...] On Behalf Of Lukáš Matějka
>>Sent: 3. nóvember 2005 08:30
>>To: Sve...@nb...
>>Cc: arc...@li...
>>Subject: RE: [Archive-access-discuss] wera results
>>
>>______________________________________________________________
>>>Od: Sve...@nb...
>>>Komu: mat...@ce...
>>>CC:
>>>Datum: 02.11.2005 19:41
>>>Předmět: RE: [Archive-access-discuss] wera results
>>>
>>>I tried the latest OpenSearch servlet myself. It messed up my Wera, lots of 0/0 ...
>>>
>>>;-)
>>
>>Now I'm using what you sent to me ... and everything seems fine.
>>I can't find any 0/0 :)
>>
>>I will test it more :)
>>
>>-lm
>>
>>>Sverre
>>>
>>>-----Original Message-----
>>>From: Lukáš Matejka [mailto:mat...@ce...]
>>>Sent: Wed 11/2/2005 4:43 PM
>>>To: Sverre Bang
>>>Cc: arc...@li...
>>>Subject: RE: [Archive-access-discuss] wera results
>>>
>>>______________________________________________________________
>>>>Od: sve...@nb...
>>>>Komu: arc...@li...
>>>>CC:
>>>>Datum: 02.11.2005 14:33
>>>>Predmet: RE: [Archive-access-discuss] wera results
>>>>
>>>>Hi there,
>>>>Definitely something wrong in NutchWAX. If I execute
>>>>http://war.mzk.cz/~nwa/wera/wera/index.php?query=kniha&year_from=&year_to=
>>>>and click the timeline link of the first hit showing 0/0 hits, I get
>>>
>>>Where did you find a hit showing 0/0? It works fine for me (I've just explored 150 URLs ... and no 0/0 hits).
>>>Do you remember the number of total hits? (If it's the same: I experimented with a previous version of NutchWAX, starting Tomcat on various instances.)
>>>
>>>I had, for the word "kniha":
>>>Total number of versions found : 49087. Displaying URL's 1-10
>>>
>>>-lm
>>>
>>>>'Sorry, no documents with the given uri were found'. The URL displayed seems fine, but if you look at the source of the uppermost frame you will see that the URL sent to the script was http://full.nkp.cz/nkdb/rejstriky/rejstrik.asp?irj=12&start=V. The & separating the parameters irj and start has been replaced by its HTML character entity reference.
>>>>
>>>>If I press the go button now, the URL submitted to the script will be OK.
>>>>
>>>>If I look in the NutchWAX result set of the initial search (add &debug=1 to the search URL to bring out the NutchWAX search URLs), I see that the URL (link element) returned is wrong already here.
>>>>
>>>>Conclusion: NutchWAX mangles the URL it returns by introducing HTML entities instead of keeping the URL in its original form.
>>>>
>>>>What version of NutchWAX are you using?
>>>>
>>>>Sverre
>>>>
>>>>On Wed, 2005-11-02 at 12:41 +0000, Kristinn Sigurdsson wrote:
>>>>>This looks like the same (or a very similar) problem as I've got. I've been discussing it (off-list) with Stack and Sverre Bang, so I know it is being looked into.
>>>>>
>>>>>I notice in your search results (as in mine) that URIs with & in them are showing up as 0/0 versions. I believe that both problems are due to the escaping (or unescaping) of HTML characters in the NutchWAX XML that is used to pass the results to WERA.
>>>>>
>>>>>Possibly this is a misconfiguration of either Tomcat or Apache...?
>>>>>
>>>>>- Kris
>>>>>
>>>>>>-----Original Message-----
>>>>>>From: arc...@li... [mailto:arc...@li...] On Behalf Of Lukáš Matějka
>>>>>>Sent: 2. nóvember 2005 11:21
>>>>>>To: arc...@li...
>>>>>>Subject: [Archive-access-discuss] wera results
>>>>>>
>>>>>>Hi,
>>>>>>
>>>>>>for example
>>>>>>http://war.mzk.cz/~nwa/wera/wera/index.php?query=kniha&year_from=&year_to=
>>>>>>
>>>>>>description of each record is not well-displayed
>>>>>>
>>>>>>1. SKIP, Moje kniha (http://skip.nkp.cz/akcMojekn.htm)
>>>>>>(<b> ... </b>prístupu k internetu v knihovnách propagovat vyuzití internetu pri zjistování názoru obyvatel 2. Anketa Pomocí krátké ankety bude zjistována nejoblíbenejsí <b>kniha</b> obyvatel Ceské republiky. Pojem nejoblíbenejsí <b>kniha</b> je specifikován dalsími výklady, jako "<b>kniha</b>, která me nejvíce ovlivnila", "<b>kniha</b>, ke které se casto vracím", "<b>kniha</b>, kterou bych doporucil/a dobrým prátelum", "<b>kniha</b>, která zmenila muj zivot", "<b>kniha</b> na kterou nemohu zapomenout", "<b>kniha</b>, která mne uvedla do jiného sveta", "<b>kniha</b>, kterou bych si s sebou vzal/a jako jedinou<b> ... </b>)
>>>>>>Versions (matching query/total) 3/3
>>>>>>Timeline | Overview
>>>>>>
>>>>>>"prístupu" should be "přístupu" (without diacritics "pristupu")
>>>>>>
>>>>>>does anybody have the same problem?
>>>>>>
>>>>>>-lm
|
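To make the double-escaping described above concrete, here is a minimal Java sketch. It is not the NutchWAX OpenSearchServlet source; the class and method names are made up for illustration, and the guarded variant is only one possible way to keep an already-escaped '&' from being escaped a second time.

    // Sketch only: illustrates how "&" can end up as "&amp;amp;" in the XML handed to WERA.
    public final class XmlEscapeSketch {

        /** Naive escape: rewrites every '&', even one that already starts an entity. */
        static String escapeXml(String s) {
            return s.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
        }

        /** Hypothetical guard: leaves '&' alone when it already begins an XML entity. */
        static String escapeXmlOnce(String s) {
            return s.replaceAll("&(?!(amp|lt|gt|quot|apos|#\\d+|#x[0-9a-fA-F]+);)", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
        }

        public static void main(String[] args) {
            String url = "http://full.nkp.cz/nkdb/rejstriky/rejstrik.asp?irj=12&start=V";

            String once  = escapeXml(url);   // ...rejstrik.asp?irj=12&amp;start=V      (correct, single escape)
            String twice = escapeXml(once);  // ...rejstrik.asp?irj=12&amp;amp;start=V  (double escape, the 0/0 case)

            System.out.println(once);
            System.out.println(twice);
            System.out.println(escapeXmlOnce(once)); // stays singly escaped
        }
    }

Running it against the rejstrik.asp URL from Sverre's mail prints the correctly escaped link value first and the double-escaped form, the one WERA ends up showing as a 0/0 version, second.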
|
From: Kristinn S. <kr...@ar...> - 2005-11-03 09:04:18
|
Looks like you managed to fix the problem on your end.

The issue of 0/0 versions is tied to the incorrect encoding of characters in the XML (namely & being double-escaped to &amp;amp; rather than just &amp;). This causes any URIs that contain & (or other special characters like < and >) to show up as 0/0 versions.

Any idea what fixed the problem?

- Kris

> -----Original Message-----
> From: arc...@li... [mailto:arc...@li...] On Behalf Of Lukáš Matějka
> Sent: 3. nóvember 2005 08:30
> To: Sve...@nb...
> Cc: arc...@li...
> Subject: RE: [Archive-access-discuss] wera results
>
> ______________________________________________________________
> > Od: Sve...@nb...
> > Komu: mat...@ce...
> > CC:
> > Datum: 02.11.2005 19:41
> > Předmět: RE: [Archive-access-discuss] wera results
> >
> > I tried the latest OpenSearch servlet myself. It messed up my Wera, lots of 0/0 ...
> >
> > ;-)
>
> Now I'm using what you sent to me ... and everything seems fine.
> I can't find any 0/0 :)
>
> I will test it more :)
>
> -lm
>
> > Sverre
> >
> > -----Original Message-----
> > From: Lukáš Matejka [mailto:mat...@ce...]
> > Sent: Wed 11/2/2005 4:43 PM
> > To: Sverre Bang
> > Cc: arc...@li...
> > Subject: RE: [Archive-access-discuss] wera results
> >
> > ______________________________________________________________
> > > Od: sve...@nb...
> > > Komu: arc...@li...
> > > CC:
> > > Datum: 02.11.2005 14:33
> > > Predmet: RE: [Archive-access-discuss] wera results
> > >
> > > Hi there,
> > > Definitely something wrong in NutchWAX. If I execute
> > > http://war.mzk.cz/~nwa/wera/wera/index.php?query=kniha&year_from=&year_to=
> > > and click the timeline link of the first hit showing 0/0 hits, I get
> >
> > Where did you find a hit showing 0/0? It works fine for me (I've just explored 150 URLs ... and no 0/0 hits).
> > Do you remember the number of total hits? (If it's the same: I experimented with a previous version of NutchWAX, starting Tomcat on various instances.)
> >
> > I had, for the word "kniha":
> > Total number of versions found : 49087. Displaying URL's 1-10
> >
> > -lm
> >
> > > 'Sorry, no documents with the given uri were found'. The URL displayed seems fine, but if you look at the source of the uppermost frame you will see that the URL sent to the script was http://full.nkp.cz/nkdb/rejstriky/rejstrik.asp?irj=12&start=V. The & separating the parameters irj and start has been replaced by its HTML character entity reference.
> > >
> > > If I press the go button now, the URL submitted to the script will be OK.
> > >
> > > If I look in the NutchWAX result set of the initial search (add &debug=1 to the search URL to bring out the NutchWAX search URLs), I see that the URL (link element) returned is wrong already here.
> > >
> > > Conclusion: NutchWAX mangles the URL it returns by introducing HTML entities instead of keeping the URL in its original form.
> > >
> > > What version of NutchWAX are you using?
> > >
> > > Sverre
> > >
> > > On Wed, 2005-11-02 at 12:41 +0000, Kristinn Sigurdsson wrote:
> > > > This looks like the same (or a very similar) problem as I've got. I've been discussing it (off-list) with Stack and Sverre Bang, so I know it is being looked into.
> > > >
> > > > I notice in your search results (as in mine) that URIs with & in them are showing up as 0/0 versions. I believe that both problems are due to the escaping (or unescaping) of HTML characters in the NutchWAX XML that is used to pass the results to WERA.
> > > >
> > > > Possibly this is a misconfiguration of either Tomcat or Apache...?
> > > >
> > > > - Kris
> > > >
> > > > > -----Original Message-----
> > > > > From: arc...@li... [mailto:arc...@li...] On Behalf Of Lukáš Matějka
> > > > > Sent: 2. nóvember 2005 11:21
> > > > > To: arc...@li...
> > > > > Subject: [Archive-access-discuss] wera results
> > > > >
> > > > > Hi,
> > > > >
> > > > > for example
> > > > > http://war.mzk.cz/~nwa/wera/wera/index.php?query=kniha&year_from=&year_to=
> > > > >
> > > > > description of each record is not well-displayed
> > > > >
> > > > > 1. SKIP, Moje kniha (http://skip.nkp.cz/akcMojekn.htm)
> > > > > (<b> ... </b>prístupu k internetu v knihovnách propagovat vyuzití internetu pri zjistování názoru obyvatel 2. Anketa Pomocí krátké ankety bude zjistována nejoblíbenejsí <b>kniha</b> obyvatel Ceské republiky. Pojem nejoblíbenejsí <b>kniha</b> je specifikován dalsími výklady, jako "<b>kniha</b>, která me nejvíce ovlivnila", "<b>kniha</b>, ke které se casto vracím", "<b>kniha</b>, kterou bych doporucil/a dobrým prátelum", "<b>kniha</b>, která zmenila muj zivot", "<b>kniha</b> na kterou nemohu zapomenout", "<b>kniha</b>, která mne uvedla do jiného sveta", "<b>kniha</b>, kterou bych si s sebou vzal/a jako jedinou<b> ... </b>)
> > > > > Versions (matching query/total) 3/3
> > > > > Timeline | Overview
> > > > >
> > > > > "prístupu" should be "přístupu" (without diacritics "pristupu")
> > > > >
> > > > > does anybody have the same problem?
> > > > >
> > > > > -lm
|
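A quick way to confirm which behaviour a given installation shows is to fetch the raw OpenSearch XML that WERA consumes and look for the double-escaped form. The sketch below assumes a locally deployed NutchWAX; the endpoint URL and query parameter are placeholders rather than a documented interface, so adjust them to match your own setup.

    // Rough diagnostic sketch: fetch the OpenSearch XML and flag double-escaped entities.
    // The default URL below is a placeholder - point it at your own NutchWAX servlet.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class FindDoubleEscapes {
        public static void main(String[] args) throws Exception {
            String endpoint = args.length > 0
                    ? args[0]
                    : "http://localhost:8080/nutchwax/opensearch?query=kniha"; // assumed path

            StringBuilder xml = new StringBuilder();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(endpoint).openStream(), "UTF-8"));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    xml.append(line).append('\n');
                }
            } finally {
                in.close();
            }

            // "&amp;amp;" in the raw XML means "&" was escaped twice; such link
            // values are the URIs that show up as 0/0 versions in WERA.
            int hits = 0;
            int idx = 0;
            while ((idx = xml.indexOf("&amp;amp;", idx)) != -1) {
                hits++;
                idx += "&amp;amp;".length();
            }
            System.out.println(hits == 0
                    ? "No double-escaped entities found."
                    : hits + " occurrence(s) of &amp;amp; found - expect 0/0 versions in WERA.");
        }
    }

If it reports occurrences of &amp;amp;, the corresponding link elements are the ones WERA will display as 0/0.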
|
From: <mat...@ce...> - 2005-11-03 08:54:07
|
______________________________________________________________
> Od: st...@ar...
> Komu: mat...@ce...
> CC: arc...@li...
> Datum: 03.11.2005 02:09
> Předmět: Re: [Archive-access-discuss] wera results
>
> Lukáš Matějka wrote:
>
> >Hi,
> >
> >for example
> >http://war.mzk.cz/~nwa/wera/wera/index.php?query=kniha&year_from=&year_to=
> >
> >description of each record is not well-displayed
> >
> >1. SKIP, Moje kniha (http://skip.nkp.cz/akcMojekn.htm)
> >(<b> ... </b>přístupu k internetu v knihovnách propagovat využití internetu při zjišťování názorů obyvatel 2. Anketa Pomocí krátké ankety bude zjišťována nejoblíbenější <b>kniha</b> obyvatel České republiky. Pojem nejoblíbenější <b>kniha</b> je specifikován dalšími výklady, jako "<b>kniha</b>, která mě nejvíce ovlivnila", "<b>kniha</b>, ke které se často vracím", "<b>kniha</b>, kterou bych doporučil/a dobrým přátelům", "<b>kniha</b>, která změnila můj život", "<b>kniha</b> na kterou nemohu zapomenout", "<b>kniha</b>, která mne uvedla do jiného světa", "<b>kniha</b>, kterou bych si s sebou vzal/a jako jedinou<b> ... </b>)
> >Versions (matching query/total) 3/3
> >Timeline | Overview
> >
> >"přístupu" should be "přístupu" (without diacritics "pristupu")
> >
> >does anybody have the same problem?
> >
> Did you change something Lukáš? When I browse to the link given above, the display looks correct: i.e. see "přístupu" in the below (hopefully this mail makes it across preserving the original characters).
> St.Ack

Yes, in an earlier discussion Stack sent me a file - an old nutchwax.war. He said that the problem was in OpenSearchServlet ... now the results seem to be OK.

-lm

> *1. SKIP, Moje kniha* (http://skip.nkp.cz/akcMojekn.htm)
> (* ... *přístupu k internetu v knihovnách propagovat využití internetu při zjišťování názorů obyvatel 2. Anketa Pomocí krátké ankety bude zjišťována nejoblíbenější *kniha* obyvatel České republiky. Pojem nejoblíbenější *kniha* je specifikován dalšími výklady, jako "*kniha*, která mě nejvíce ovlivnila", "*kniha*, ke které se často vracím", "*kniha*, kterou bych doporučil/a dobrým přátelům", "*kniha*, která změnila můj život", "*kniha* na kterou nemohu zapomenout", "*kniha*, která mne uvedla do jiného světa", "*kniha*, kterou bych si s sebou vzal/a jako jedinou* ... *)
> Versions (matching query/total) 3/3
> *Timeline <http://war.mzk.cz/%7Enwa/wera/wera/result.php?time=20041212180928&url=httpINDX3AINDX2FINDX2FskipINDXDOTnkpINDXDOTczINDX2FakcMojeknINDXDOThtm> | Overview <http://war.mzk.cz/%7Enwa/wera/wera/overview.php?url=httpINDX3AINDX2FINDX2FskipINDXDOTnkpINDXDOTczINDX2FakcMojeknINDXDOThtm>*
>
> >-lm
|
|
From: <mat...@ce...> - 2005-11-03 08:54:06
|
______________________________________________________________
> Od: Sve...@nb...
> Komu: mat...@ce...
> CC:
> Datum: 02.11.2005 19:41
> Předmět: RE: [Archive-access-discuss] wera results
>
> I tried the latest OpenSearch servlet myself. It messed up my Wera, lots of 0/0 ...
>
> ;-)

Now I'm using what you sent to me ... and everything seems fine.
I can't find any 0/0 :)

I will test it more :)

-lm

> Sverre
>
> -----Original Message-----
> From: Lukáš Matejka [mailto:mat...@ce...]
> Sent: Wed 11/2/2005 4:43 PM
> To: Sverre Bang
> Cc: arc...@li...
> Subject: RE: [Archive-access-discuss] wera results
>
> ______________________________________________________________
> > Od: sve...@nb...
> > Komu: arc...@li...
> > CC:
> > Datum: 02.11.2005 14:33
> > Predmet: RE: [Archive-access-discuss] wera results
> >
> > Hi there,
> > Definitely something wrong in NutchWAX. If I execute
> > http://war.mzk.cz/~nwa/wera/wera/index.php?query=kniha&year_from=&year_to=
> > and click the timeline link of the first hit showing 0/0 hits, I get
>
> Where did you find a hit showing 0/0? It works fine for me (I've just explored 150 URLs ... and no 0/0 hits).
> Do you remember the number of total hits? (If it's the same: I experimented with a previous version of NutchWAX, starting Tomcat on various instances.)
>
> I had, for the word "kniha":
> Total number of versions found : 49087. Displaying URL's 1-10
>
> -lm
>
> > 'Sorry, no documents with the given uri were found'. The URL displayed seems fine, but if you look at the source of the uppermost frame you will see that the URL sent to the script was http://full.nkp.cz/nkdb/rejstriky/rejstrik.asp?irj=12&start=V. The & separating the parameters irj and start has been replaced by its HTML character entity reference.
> >
> > If I press the go button now, the URL submitted to the script will be OK.
> >
> > If I look in the NutchWAX result set of the initial search (add &debug=1 to the search URL to bring out the NutchWAX search URLs), I see that the URL (link element) returned is wrong already here.
> >
> > Conclusion: NutchWAX mangles the URL it returns by introducing HTML entities instead of keeping the URL in its original form.
> >
> > What version of NutchWAX are you using?
> >
> > Sverre
> >
> > On Wed, 2005-11-02 at 12:41 +0000, Kristinn Sigurdsson wrote:
> > > This looks like the same (or a very similar) problem as I've got. I've been discussing it (off-list) with Stack and Sverre Bang, so I know it is being looked into.
> > >
> > > I notice in your search results (as in mine) that URIs with & in them are showing up as 0/0 versions. I believe that both problems are due to the escaping (or unescaping) of HTML characters in the NutchWAX XML that is used to pass the results to WERA.
> > >
> > > Possibly this is a misconfiguration of either Tomcat or Apache...?
> > >
> > > - Kris
> > >
> > > > -----Original Message-----
> > > > From: arc...@li... [mailto:arc...@li...] On Behalf Of Lukáš Matějka
> > > > Sent: 2. nóvember 2005 11:21
> > > > To: arc...@li...
> > > > Subject: [Archive-access-discuss] wera results
> > > >
> > > > Hi,
> > > >
> > > > for example
> > > > http://war.mzk.cz/~nwa/wera/wera/index.php?query=kniha&year_from=&year_to=
> > > >
> > > > description of each record is not well-displayed
> > > >
> > > > 1. SKIP, Moje kniha (http://skip.nkp.cz/akcMojekn.htm)
> > > > (<b> ... </b>prístupu k internetu v knihovnách propagovat vyuzití internetu pri zjistování názoru obyvatel 2. Anketa Pomocí krátké ankety bude zjistována nejoblíbenejsí <b>kniha</b> obyvatel Ceské republiky. Pojem nejoblíbenejsí <b>kniha</b> je specifikován dalsími výklady, jako "<b>kniha</b>, která me nejvíce ovlivnila", "<b>kniha</b>, ke které se casto vracím", "<b>kniha</b>, kterou bych doporucil/a dobrým prátelum", "<b>kniha</b>, která zmenila muj zivot", "<b>kniha</b> na kterou nemohu zapomenout", "<b>kniha</b>, která mne uvedla do jiného sveta", "<b>kniha</b>, kterou bych si s sebou vzal/a jako jedinou<b> ... </b>)
> > > > Versions (matching query/total) 3/3
> > > > Timeline | Overview
> > > >
> > > > "prístupu" should be "přístupu" (without diacritics "pristupu")
> > > >
> > > > does anybody have the same problem?
> > > >
> > > > -lm
|
|
From: stack <st...@ar...> - 2005-11-03 01:09:06
|
Lukáš Matějka wrote:

>Hi,
>
>for example
>http://war.mzk.cz/~nwa/wera/wera/index.php?query=kniha&year_from=&year_to=
>
>description of each record is not well-displayed
>
>1. SKIP, Moje kniha (http://skip.nkp.cz/akcMojekn.htm)
>(<b> ... </b>přístupu k internetu v knihovnách propagovat využití internetu při zjišťování názorů obyvatel 2. Anketa Pomocí krátké ankety bude zjišťována nejoblíbenější <b>kniha</b> obyvatel České republiky. Pojem nejoblíbenější <b>kniha</b> je specifikován dalšími výklady, jako "<b>kniha</b>, která mě nejvíce ovlivnila", "<b>kniha</b>, ke které se často vracím", "<b>kniha</b>, kterou bych doporučil/a dobrým přátelům", "<b>kniha</b>, která změnila můj život", "<b>kniha</b> na kterou nemohu zapomenout", "<b>kniha</b>, která mne uvedla do jiného světa", "<b>kniha</b>, kterou bych si s sebou vzal/a jako jedinou<b> ... </b>)
>Versions (matching query/total) 3/3
>Timeline | Overview
>
>"přístupu" should be "přístupu" (without diacritics "pristupu")
>
>does anybody have the same problem?
>
Did you change something Lukáš? When I browse to the link given above, the display looks correct: i.e. see "přístupu" in the below (hopefully this mail makes it across preserving the original characters).
St.Ack

*1. SKIP, Moje kniha* (http://skip.nkp.cz/akcMojekn.htm)
(* ... *přístupu k internetu v knihovnách propagovat využití internetu při zjišťování názorů obyvatel 2. Anketa Pomocí krátké ankety bude zjišťována nejoblíbenější *kniha* obyvatel České republiky. Pojem nejoblíbenější *kniha* je specifikován dalšími výklady, jako "*kniha*, která mě nejvíce ovlivnila", "*kniha*, ke které se často vracím", "*kniha*, kterou bych doporučil/a dobrým přátelům", "*kniha*, která změnila můj život", "*kniha* na kterou nemohu zapomenout", "*kniha*, která mne uvedla do jiného světa", "*kniha*, kterou bych si s sebou vzal/a jako jedinou* ... *)
Versions (matching query/total) 3/3
*Timeline <http://war.mzk.cz/%7Enwa/wera/wera/result.php?time=20041212180928&url=httpINDX3AINDX2FINDX2FskipINDXDOTnkpINDXDOTczINDX2FakcMojeknINDXDOThtm> | Overview <http://war.mzk.cz/%7Enwa/wera/wera/overview.php?url=httpINDX3AINDX2FINDX2FskipINDXDOTnkpINDXDOTczINDX2FakcMojeknINDXDOThtm>*

>-lm
|
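The diacritics question in this exchange comes down to the same escaping code path: if the servlet escapes only XML's reserved characters and writes the result as UTF-8, words such as "přístupu" survive intact instead of being turned into numeric character references. The sketch below is illustrative only; it is not the NutchWAX code, and the actual fix in this thread was reverting to the earlier OpenSearchServlet.

    // Sketch of the escaping behaviour discussed above - not NutchWAX code.
    // Escaping only XML's five reserved characters and emitting UTF-8 keeps
    // non-ASCII text readable instead of producing numeric references.
    public final class MinimalXmlEscape {

        static String escape(String s) {
            StringBuilder out = new StringBuilder(s.length());
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                switch (c) {
                    case '&':  out.append("&amp;");  break;
                    case '<':  out.append("&lt;");   break;
                    case '>':  out.append("&gt;");   break;
                    case '"':  out.append("&quot;"); break;
                    case '\'': out.append("&apos;"); break;
                    default:   out.append(c);        // non-ASCII passes through as-is
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            // "přístupu" keeps its diacritics; only markup characters are escaped.
            System.out.println(escape("<b>kniha</b>, přístupu & více"));
            // -> &lt;b&gt;kniha&lt;/b&gt;, přístupu &amp; více
        }
    }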