From: Jian J. <jia...@gm...> - 2008-03-28 01:10:53
|
Hello, Brad, Thanks for your reply. Yes, you are right. I think the correct full URL is not correctly sent to Wayback. But I still have no idea how to solve it. The problem probably lies in the communication between the Apache2 and Tomcat. Actually, we have AJP connectors set up on our back end servers and we map the port 80 to 8080. Suppose the back end servers A and B, and load balancer C. When I tried the URL http://A/wayback-webapp-1.2.0/wayback/, I got a successful message from access_log "GET /wayback-webapp-1.2.0/wayback/ HTTP/1.1" 200 3610. But when I tried the URL http://C/wayback-webapp-1.2.0/wayback/, I got an error message from the access_log of Apache2 "GET /wayback-webapp-1.2.0/wayback/ HTTP/1.1" 404 1042. Obviously, the two requests are both successfully received by Apache2, so why are the responses different? In the http.conf, I already added "JkMount /wayback* ajp13". Since we already use AJP, I prefer to do the modification based on that. Would you please explain in more detail what I should do to ensure the correct requests are received by Wayback. Thanks very much and best regards. Jian On Thu, Mar 27, 2008 at 1:06 PM, Brad Tofel <br...@ar...> wrote: > Hi Jian, > > The problem is probably that the Wayback needs to know the fully > qualified hostname where it is running, so it can generate links correctly. > > To do the kind of setup you're trying to, I know of two solutions using > the current software: > 1) use AJP to ensure that the requests are received by the Wayback with > the correct hostname, port, and context information. > 2) use the "ProxyHost" (and possibly "ProxyPort") settings on the > "Connector" tag within tomcat's server.xml configuration file. This > allows you to explicitly set the values returned by > HttpServletRequest.getServer*(). > > These two settings do not allow as much flexibility (specifically with > proxying a different path from a front end node to a backend wayback > access point) so probably going forward we will change the software to > allow the AccessPoint URI to be set explicitly within the wayback > configuration. > > Please let me know if you have questions on this, and how it works for you. > > Brad > > > > Jian Jiao wrote: > > Hello, > > > > I have a problem when I try to deploy Wayback. > > I have two servers running Wayback at the back end and a balancer > > controller at the front. > > My Wayback works well separatedly on the two back end servers. > > However, When I try to access Wayback using the URL of the balancer > > controller, > > there is always an error. > > > > It seems that Wayback does not know the access point. > > > > In the statement in index.jsp > > ArrayList<String> accessPoints = (ArrayList<String>) names > > accessPoints is always null. > > > > I already modified the wayback.xml to change the replayURIPrefix to > > the new domain. > > > > I don't know what's the wrong with it. Can you please help me to > > figure out what is the problem? > > > > Any help or suggestions would be appreciated. > > > > > > -- Best wishes, Jian Jiao |
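A minimal sketch of Brad's second suggestion, for readers hitting the same front-end/back-end mismatch. On Tomcat's Connector element the documented attribute names are proxyName and proxyPort (the messages above call them "ProxyHost"/"ProxyPort"); the hostname and port values below are placeholders, not taken from this thread:

```xml
<!-- Tomcat server.xml sketch: an AJP connector that reports the load
     balancer's identity. proxyName/proxyPort control what
     HttpServletRequest.getServerName()/getServerPort() return, so Wayback
     generates links that point at the front end rather than the back end.
     Hostname and ports are placeholders. -->
<Connector port="8009" protocol="AJP/1.3"
           proxyName="balancer.example.org"
           proxyPort="80" />
```

With mod_jk in front, the Apache side is the JkMount line Jian already has; the connector attributes above only change what the servlet API reports to Wayback.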
|
From: Brad T. <br...@ar...> - 2008-03-27 17:05:32
|
Hi Jian, The problem is probably that the Wayback needs to know the fully qualified hostname where it is running, so it can generate links correctly. To do the kind of setup you're trying to, I know of two solutions using the current software: 1) use AJP to ensure that the requests are received by the Wayback with the correct hostname, port, and context information. 2) use the "ProxyHost" (and possibly "ProxyPort") settings on the "Connector" tag within tomcat's server.xml configuration file. This allows you to explicitly set the values returned by HttpServletRequest.getServer*(). These two settings do not allow as much flexibility (specifically with proxying a different path from a front end node to a backend wayback access point) so probably going forward we will change the software to allow the AccessPoint URI to be set explicitly within the wayback configuration. Please let me know if you have questions on this, and how it works for you. Brad Jian Jiao wrote: > Hello, > > I have a problem when I try to deploy Wayback. > I have two servers running Wayback at the back end and a balancer > controller at the front. > My Wayback works well separatedly on the two back end servers. > However, When I try to access Wayback using the URL of the balancer > controller, > there is always an error. > > It seems that Wayback does not know the access point. > > In the statement in index.jsp > ArrayList<String> accessPoints = (ArrayList<String>) names > accessPoints is always null. > > I already modified the wayback.xml to change the replayURIPrefix to > the new domain. > > I don't know what's the wrong with it. Can you please help me to > figure out what is the problem? > > Any help or suggestions would be appreciated. > > |
|
From: Jian J. <jia...@gm...> - 2008-03-24 19:38:51
|
Hello, I have a problem when I try to deploy Wayback. I have two servers running Wayback at the back end and a balancer controller at the front. My Wayback works well separately on the two back end servers. However, when I try to access Wayback using the URL of the balancer controller, there is always an error. It seems that Wayback does not know the access point. In the statement in index.jsp, ArrayList<String> accessPoints = (ArrayList<String>) names, accessPoints is always null. I already modified the wayback.xml to change the replayURIPrefix to the new domain. I don't know what's wrong with it. Can you please help me to figure out what the problem is? Any help or suggestions would be appreciated. -- Best wishes, Jian |
|
From: Lukáš M. <lma...@gm...> - 2008-03-20 08:37:05
|
Hello, we've been using and testing Wayback for several years at WebArchiv.cz and we're aware that IA has put a lot of effort into i18n, especially in the last releases. In particular, we appreciate the support for language properties and the configuration of individual jsp pages; nevertheless, we're still facing issues with utf-8 encoding. I'd like to ask others (from non-ascii countries) about their experiences and how they solved this issue. In general, with each new release, we always have to make the following changes (assuming that we store our language properties in utf-8): 1. Convert all jsp into utf-8 2. Add the meta tag "<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">" to each JSP so that the browser can recognize the right encoding 3. Add the directive <%@ page language="java" pageEncoding="utf-8" contentType="text/html;charset=utf-8"%> to each JSP to say that the server should send the response in UTF-8 4. If we also want to send unicode text from a form to the server, we have to implement a filter that sets the encoding on the request (req.setCharacterEncoding(encoding);) -- see the sketch after this message. With these changes we're able to customize each release; however, it might help other non-english speaking countries if this were incorporated into wayback. Or is there another intended way to treat this issue? Thanks in advance for your reply. Best Regards -- Lukas Matejka WebArchiv.cz CZ National Library |
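A minimal sketch of the request-encoding filter described in step 4, against the servlet 2.x API of this era; the class name and init parameter are illustrative (not from the Wayback source), and the filter would be registered in web.xml with a /* filter-mapping:

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

// Sketch of step 4 above: force a character encoding on each request
// before any parameter is parsed. Class name and init-param name are
// illustrative.
public class SetCharacterEncodingFilter implements Filter {
    private String encoding = "UTF-8";

    public void init(FilterConfig config) {
        String configured = config.getInitParameter("encoding");
        if (configured != null) {
            encoding = configured;
        }
    }

    public void doFilter(ServletRequest req, ServletResponse res,
            FilterChain chain) throws IOException, ServletException {
        // Respect an encoding the client declared explicitly.
        if (req.getCharacterEncoding() == null) {
            req.setCharacterEncoding(encoding);
        }
        chain.doFilter(req, res);
    }

    public void destroy() {
    }
}
```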
|
From: Jian J. <jia...@gm...> - 2008-03-20 01:34:11
|
Hello there, I want to change the working directory of Wayback from /tmp to another directory. However, when I try to do that by modifying the configuration file wayback.xml, every directory seems to work well except one: /tmp/index. When I change /tmp/index to another location, Wayback just does not work. How can I do that? Any help would be appreciated. -- Best wishes, Jian Jiao |
|
From: Ignacio G. <igc...@gm...> - 2008-03-12 13:32:32
|
Hello Brad, I just started "playing" with this new version of Wayback, and there is one thing that seems very strange to me. On every page resource I visit, I always get the header information plastered at the top of the page. (e.g. HTTP/1.1 200 OK Server: Microsoft-IIS/5.0 Date: Tue, 03 Oct 2000 07:31:49 GMT Connection: Keep-Alive Content-Length: 13027 Content-Type: text/html Set-Cookie: GWBSiteCookie=header%5Ftype=Text&mode=false&browser=Default&browser%5Fchecked=true&browser%5Fwidth=0; path=/ Cache-control: private) This information denotes the header information that was retrieved at the time of crawl (as you can see by the date); the thing I do not understand is why I am seeing it when I access a page via Wayback. It appears at the very top, even over the TimeLine section. Any ideas on why this might be, or how to get rid of it? Thanks. On 2/29/08, Brad Tofel <br...@ar...> wrote: > > Hi Thomas, > > Thanks for the kind feedback. > > Couple of suggestions, and also some follow-up questions interspersed: > > Thomas Beekman wrote: > > Hi all, > > > > At the KB we are severely testing Wayback 1.2.0 at the moment. My first > > impression is quite positive; many new functions are added, it is quite > > easy to implement different modules for different access points and > > several indexing threads can live side by side now. > > > > I have a few questions though. First of all, I'm experiencing errors > > which did not occur in older versions; java.lang.OutOfMemoryError: GC > > overhead limit exceeded. Does anyone know how to fix this? > > > > > I haven't seen this before, and some quick google searches indicate it > may be one of: > > A) a JVM problem (which JVM are you using?) > B) too little heap space in the java startup arguments > C) the wayback software doing lots of object creation+destruction. > > Since we have large installations in production at the IA, one using > 700+ Collections and 1400+ AccessPoints. Note that these all use CDX > indexes, which are more resource efficient. I'm hoping that C is not the > problem, but we haven't yet needed to do a heavy optimization pass over > the code, so it could be Wayback itself. Are you using IBM's JVM? Have > you tried increasing the heap? If that doesn't address the problem, can > you please send me a copy of your wayback.xml Spring configuration? > > > Second; when closing down Wayback in Tomcat, the lock file for the > > localbdb is not erased. A restart is therefore not possible. Could this > > be fixed so that if the webapp is closed down, the lock file is erased? > > > > > > On what platform (OS+JVM) are you running Wayback? Is the BDB index > stored over NFS or another networked file system? I haven't experienced > this problem on any of our systems -- the BDBJE just starts up, even > with the lock file still existing. I haven't looked into this, but > guessed that it was using the lock file via flock() type semantics, > instead of using it's existence to indicate a lock. BDBJE may determine > that the DB is on a remote system, where flock() semantics don't work, > in which case it may be falling back to using the existence of the lock > file to indicate usage.. > > In any case, I've just implemented the "clean shutdown" processing in my > development environment, but will probably hold off to do more testing > before including it in a release. 
> > We are preparing a 1.2.1 release which addresses a couple bugs > discovered by folks in the field, but are holding this release for > feedback from one more user having trouble reading some ARC files. > > > Third; with a few websites the timeline GUI is scrambled. I get a full > > yellow screen with on every line a mark. After scrolling down that page, > > the website is presented normally. This is not the case with every > > website. > > > > > Yes, the css implementation in the current timeline is prone to > inheriting some styles from some web pages. Could you please send me a > few example pages on the live web that demonstrate the problem you're > seeing? > > > My fourth and last problem is in the configuration. I would like to do > > some tests using the remote NutchWAX search, but there is not a clear > > manual of how to implement this precisely, which beans to use for > > example. Does anyone have a good example for me? > > > > > > Setting up a collection with this bean: > > <property name="resourceIndex"> > <bean class="org.archive.wayback.resourceindex.NutchResourceIndex" > init-method="init"> > <property name="searchUrlBase" > value="http://webteam-ws.us.archive.org:8080/katrina/opensearch" /> > <property name="maxRecords" value="100" /> > </bean> > </property> > > Should do the trick. Note that if using Archival URL mode, you should be > sure to set the maxRecords property on the RequestParser to the same > value for maxRecords.. This may be a bug -- would be more friendly to > use the min() of both values.. > > <property name="parser"> > <bean class="org.archive.wayback.archivalurl.ArchivalUrlRequestParser" > init-method="init"> > <property name="maxRecords" value="100" /> > <property name="earliestTimestamp" value="1996" /> > </bean> > </property> > > > Hopefully this works for you, and please let me know about the questions > above. > > Brad > > > ------------------------------------------------------------------------- > This SF.net email is sponsored by: Microsoft > Defy all challenges. Microsoft(R) Visual Studio 2008. > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > |
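One plausible mechanism behind the plastered headers, offered as a hedged guess rather than a confirmed diagnosis: each ARC record stores the HTTP response headers in front of the body, and replay has to consume that block before streaming content, or the headers render as text exactly as described above. A sketch against the Heritrix 1.x ARCRecord API (verify skipHttpHeader() against the commons jar actually on your classpath; a jar mismatch in this area is reported elsewhere on this list):

```java
import java.io.IOException;
import java.io.OutputStream;
import org.archive.io.arc.ARCRecord;

// Hedged sketch, not Wayback's actual replay code: position past the
// stored HTTP header block before copying the record body to the client.
// If this step is skipped or silently fails, the archived headers
// ("HTTP/1.1 200 OK ...") are streamed as if they were page content.
void replayBody(ARCRecord rec, OutputStream out) throws IOException {
    rec.skipHttpHeader();              // consume the stored header block
    byte[] buf = new byte[4096];
    for (int n = rec.read(buf); n != -1; n = rec.read(buf)) {
        out.write(buf, 0, n);          // body only
    }
}
```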
|
From: Miguel C. <mig...@fc...> - 2008-03-08 11:01:26
|
Hi Brad, The problem is more like a bug than a future feature that should be implemented. You say "The Wayback will return the document that is closest to the current document being viewed.". This is the expected behavior, but this is only partially true. Look at this example. A version of a page with timestamp 20080228191941 is presented: http://t3.tomba.fccn.pt:8080/wayback/wayback/20080228191941/http://xldb.fc.ul.pt/daniel/ This page contains an embedded image with a later timestamp 20080228192041 (one minute more), but the Wayback submits a query to find the versions of this image from a minimum bound (20010101000000) to a maximum bound (timestamp of the source page = 20080228191941): http://t3.tomba.fccn.pt:8080/nutchwax/opensearch?query=date%3A20010101000000-20080228191941+exacturl%3Ahttp%3A%2F%2Fxldb.fc.ul.pt%2Fdaniel%2Fimages%2FretratoDanielGomes.jpg The desired version is excluded by the date range constraint in the query, since the image's timestamp is one minute later. The Wayback ONLY THEN computes the closest version from the versions with timestamp up to 20080228192041. To achieve the desired behavior and include the embedded image in the page using the closest date, the query should be bounded above by a broader date range, or alternatively the date range constraint should be removed. Currently in the code, the getRequestUrl() method on the NutchResourceIndex class receives a WaybackRequest parameter, containing 3 values: startDateStr (minimum bound) = 20010101000000 exactDateStr (URL timestamp) = 20080228191945 endDateStr (maximum bound = current time) = 20080305190212 These values are used to compute the date range constraint at line 287 in the NutchResourceIndex class: if ((endDateStr == null || endDateStr.length() == 0) && exactDateStr != null && exactDateStr.length() > 0) { ms.append("date%3A").append(exactDateStr).append('+'); } else { ms.append("date%3A").append(startDateStr).append('-').append( exactDateStr != null ? exactDateStr : endDateStr).append('+'); } This date range constraint is computed as "startDateStr-exactDateStr", but it should be "startDateStr-endDateStr" or "startDateStr-exactDateStr + xTime". Best regards, -- Miguel Costa -----Original Message----- From: arc...@li... [mailto:arc...@li...] On Behalf Of Brad Tofel Sent: Tuesday, 4 March 2008 23:13 To: Daniel Gomes Cc: archive-access discussion list Subject: Re: [Archive-access-discuss] url bounded by timestamp: explanation Hi Daniel, Thanks for the elaboration and the excellent suggestions. We've been discussing adding functionality to Wayback to allow users to target a specific date they want to stay "near" within a replay session. Currently when retrieving an embedded object for a web page, or when navigating between two archived web pages, the Wayback will return the document that is closest to the current document being viewed. We'd like to add the capability for users to specify a specific date, as well as a maximum range before and after that date to stay within for these embedded requests, and for navigations. In somewhat more detail, we plan to expand greatly the "in page presence" of the Wayback software, which in this particular case would mean including a banner or additional element in the page that would allow users to temporarily expand the maximum range of embedded elements in a specific page to potentially allow replay of captures that were archived, but are outside the standard maximum range. 
I think this is the same functionality you're suggesting, and we're hoping to have this in the 1.4 release, in a 2-3 month time frame. Wayback HEAD may include this functionality before that, and I'll let you know how that progresses. However, in the context you're using Wayback, with a Nutch ResourceIndex, this may require more functionality within Nutch as well. I'm not sure what the schedule might be for that, but again will keep you posted. Please let me know if I've misunderstood your suggestion, and the functionality we've discussed is not the same as your suggestions. Brad Daniel Gomes wrote: > The last email from my colleague Miguel Costa might have been a bit > confusing. I will try to clarify the problem we identified because it > looked quite important to us. > > We noticed that the wayback machine issues a query ranged by date to > find embedded objects, such as images in an HTML page. > > Our first question is "Why is the query ranged by date instead of > being restricted to the collection identifier?". > > A search by collection identifier would be more efficient because the > search would be based on an exact match of the collection id and would > present the images that most likely belong to that page. > One may argue that this way if the image was not crawled in the last > collection it would not be presented in the page. While using a date > range query the image would still be included. The problem we see in > this approach, is that we might be including images that although > exist in the archive, were never published in the page. > > This situation lead to our second doubt: > > The date range issued in the query is from a static date of the first > collection (e.g. 20010101000000) to the timestamp of the page (e.g. > 20080218201945). > > We believe this situation leads to several problems: > > 1. The date range of the query is unnecessarily broad, if we are > looking for the images embedded in a page crawled in 2008, looking for > them since 2001 seems excessive. > > 2. Pages can be presented containing old images that were never > published together (problem mentioned above) > > 3. Embedded images that have timestamps posterior to the page date > (even some minutes later) are not found and not rendered along with the page. > Notice, that pages must be crawled first to extract links to the > embedded images, so most images will have a date later than the page > and will not be presented by the wayback. In theory, it makes sense to > not present pages including contents "from the future", but > considering that crawls can not be executed instantly, using a sliding > time window seems to be more adequate to find embedded objects and even links to other pages. > > We propose that the wayback/nutchwax should be configurable to: > > 1. Find contents to be rendered together based on the collection id > or; > > 2.Find contents within a configurable date range centered on the date > of the page. Say if the page date is 2008/01/03, we would consider > that the embedded URLs crawled 3 days before and after this date could > be rendered along with it. Notice, that if one is performing a crawl > every > 3 months, the timespan could be 1 month instead of 3 days. The > timespan should be configured according to the duration and frequency of the crawls. > We believe that contents from previous crawls should not be rendered > together with a page. > > We would deeply appreciate that you validate our conclusions and gave > us feedback about this issue. 
> > Best regards, > /Daniel Gomes > Portuguese web archive > http://xldb.fc.ul.pt/daniel/ > > > *From:* arc...@li... > [mailto:arc...@li...] *On > Behalf Of *Miguel Costa > *Sent:* segunda-feira, 3 de Março de 2008 19:15 > *To:* arc...@li... > *Subject:* [Archive-access-discuss] url bounded by timestamp > > Hi, > > When a page is presented in the wayback machine, the linked images > (and other resources) are searched to be presented also. > The problem is that my wayback machine is searching using the nutchwax > index, through the opensearch servlet, and the nutchwax bounds the > search of the images (resources) by date (the timestamp of the source page): > > eg: date:20010101000000-20080218201945 > exacturl:http://xldb.fc.ul.pt/daniel/scripts/statCounter.js > > after the url be called inside the source page: > http://t3.tomba.fccn.pt:8080/wayback/wayback/20080218201945/http://xld > b.fc.u > l.pt/daniel/scripts/statCounter.js > > If the statCounter.js, for instance, has a higher timestamp (eg: > 20080218201955), that is usual, this resource is not found. > Does anyone know why these nutchwax searches don't use the collection > id instead the timestamp, to find the linked images (resources). Does > anyone know a solution for the problem? > > > Regards, > > > > -- Miguel Costa > > Portuguese Web Archive > > > -- > /Daniel Gomes > FCCN > Av. do Brasil, n.º 101 > 1700-066 Lisboa > Tel.: +351 21 8440190 > Fax: +351 218472167 > www.fccn.pt > > Aviso de Confidencialidade > > Esta mensagem é exclusivamente destinada ao seu destinatário, podendo > conter informação CONFIDENCIAL, cuja divulgação está expressamente > vedada nos termos da lei. Caso tenha recepcionado indevidamente esta > mensagem, solicitamos-lhe que nos comunique esse mesmo facto por esta > via ou para o telefone +351 218440100 devendo apagar o seu conteúdo de > imediato. This message is intended exclusively for its addressee. It > may contain CONFIDENTIAL information protected by law. If this message > has been received by error, please notify us via e-mail or by > telephone +351 218440100 and delete it immediately. > > > > > ------------------------------------------------------------------------- This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ _______________________________________________ Archive-access-discuss mailing list Arc...@li... https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
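The one-line change Miguel proposes can be read directly off the fragment he quotes; a sketch follows, with variable names taken from the quoted code (ms, startDateStr, exactDateStr, endDateStr), and with the padded alternative ("exactDateStr + xTime") left as a comment since the thread does not fix the padding amount:

```java
// Sketch of the fix proposed above for the quoted NutchResourceIndex
// fragment: bound the query above by endDateStr (current time) rather
// than by the page's own timestamp, so embedded objects captured a few
// minutes after the page still match.
if ((endDateStr == null || endDateStr.length() == 0)
        && exactDateStr != null && exactDateStr.length() > 0) {
    ms.append("date%3A").append(exactDateStr).append('+');
} else {
    ms.append("date%3A").append(startDateStr).append('-')
            // was: exactDateStr != null ? exactDateStr : endDateStr
            .append(endDateStr)
            .append('+');
    // Alternative from the thread: keep exactDateStr as the bound but
    // pad it by some window ("exactDateStr + xTime"); the amount is
    // left open in the discussion.
}
```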
|
From: Arturas S. <Art...@si...> - 2008-03-06 12:12:18
|
Hello, we are using NetarchiveSuite + Wayback-1.2.0 + Nutchwax in the National Library of Lithuania. Our archive can be accessed at http://eia.libis.lt:8080/archyvas/viesas/ Most pages are displayed correctly, but on some pages we have problems with wayback. The links http://eia.libis.lt:8080/archyvas/viesas/20080208070802/http://www.arzinai.lt/, http://eia.libis.lt:8080/archyvas/viesas/20080216094637/http://www.valstietis.lt/ display the error HTTP Status 500 - type Exception report message description The server encountered an internal error () that prevented it from fulfilling this request. exception java.lang.StringIndexOutOfBoundsException: String index out of range: -3 java.lang.String.substring(String.java:1768) org.archive.wayback.replay.TagMagix.markupStyleUrls(TagMagix.java:176) ..... Another problem is that old data harvested with the Nedlib harvester and converted to arc.gz format with the Nedlib2Arc tool is not always displayed correctly. When searching www.lzinios.lt, results from 2008 are displayed correctly, but links from 2002 don't work. Also, when using server-side rendering, pages with framesets can't be displayed. One more feature would be great: if wayback could access and index files from resourceStore dataDir subdirectories, because we will have very many arc files (>10000) and it is a problem to put symbolic links to them in one directory. Regards, Artūras Sagidulinas |
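For readers chasing the same stack trace: a negative "String index out of range" is the classic signature of an indexOf() miss feeding substring(). An illustrative reproduction of the failure class (not Wayback's actual TagMagix code) and the kind of guard that avoids it:

```java
// Illustrative only: when indexOf() misses it returns -1, and passing a
// value derived from it to substring() throws
// StringIndexOutOfBoundsException with a negative "range", like the -3
// reported above.
String css = "background: url(broken";   // unterminated url(...) in CSS
int start = css.indexOf("url(") + 4;
int end = css.indexOf(')', start);        // -1: no closing parenthesis
String url = (end >= start)
        ? css.substring(start, end)
        : css.substring(start);           // guarded fallback, no exception
```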
|
From: Daniel G. <dan...@fc...> - 2008-03-05 14:37:07
|
Hi everyone. Heritrix and the Wayback machine already support ARC files. However, we haven't found any documentation about the capability of NutchWax to index wARC files. Is this supported? Thank you for your attention. Best regards, -- /Daniel Gomes FCCN Av. do Brasil, n.º 101 1700-066 Lisboa Tel.: +351 21 8440190 Fax: +351 218472167 www.fccn.pt Aviso de Confidencialidade Esta mensagem é exclusivamente destinada ao seu destinatário, podendo conter informação CONFIDENCIAL, cuja divulgação está expressamente vedada nos termos da lei. Caso tenha recepcionado indevidamente esta mensagem, solicitamos-lhe que nos comunique esse mesmo facto por esta via ou para o telefone +351 218440100 devendo apagar o seu conteúdo de imediato. This message is intended exclusively for its addressee. It may contain CONFIDENTIAL information protected by law. If this message has been received by error, please notify us via e-mail or by telephone +351 218440100 and delete it immediately. |
|
From: Brad T. <br...@ar...> - 2008-03-04 23:09:52
|
Hi Daniel, Thanks for the elaboration and the excellent suggestions. We've been discussing adding functionality to Wayback to allow users to target a specific date they want to stay "near" within a replay session. Currently when retrieving an embedded object for a web page, or when navigating between two archived web pages, the Wayback will return the document that is closest to the current document being viewed. We'd like to add the capability for users to specify a specific date, as well as a maximum range before and after that date to stay within for these embedded requests, and for navigations. In somewhat more detail, we plan to expand greatly the "in page presence" of the Wayback software, which in this particular case would mean including a banner or additional element in the page that would allow users to temporarily expand the maximum range of embedded elements in a specific page to potentially allow replay of captures that were archived, but are outside the standard maximum range. I think this is the same functionality you're suggesting, and we're hoping to have this in the 1.4 release, in a 2-3 month time frame. Wayback HEAD may include this functionality before that, and I'll let you know how that progresses. However, in the context you're using Wayback, with a Nutch ResourceIndex, this may require more functionality within Nutch as well. I'm not sure what the schedule might be for that, but again will keep you posted. Please let me know if I've misunderstood your suggestion and the functionality we've discussed is not the same as what you're suggesting. Brad Daniel Gomes wrote: > The last email from my colleague Miguel Costa might have been a bit > confusing. I will try to clarify the problem we identified because it looked > quite important to us. > > We noticed that the wayback machine issues a query ranged by date to find > embedded objects, such as images in an HTML page. > > Our first question is "Why is the query ranged by date instead of being > restricted to the collection identifier?". > > A search by collection identifier would be more efficient because the search > would be based on an exact match of the collection id and would present the images that most > likely belong to that page. > One may argue that this way if the image was not crawled in the last > collection it would not be presented in the page. While using a date range > query the image would still be included. The problem we see in this approach, is that > we might be including images that although exist in the archive, were never > published in the page. > > This situation lead to our second doubt: > > The date range issued in the query is from a static date of the first > collection (e.g. 20010101000000) to the timestamp of the page (e.g. > 20080218201945). > > We believe this situation leads to several problems: > > 1. The date range of the query is unnecessarily broad, if we are looking for > the images embedded in a page crawled in 2008, looking for them since 2001 > seems excessive. > > 2. Pages can be presented containing old images that were never published > together (problem mentioned above) > > 3. Embedded images that have timestamps posterior to the page date (even > some minutes later) are not found and not rendered along with the page. > Notice, that pages must be crawled first to extract links to the embedded > images, so most images will have a date later than the page and will not be > presented by the wayback. 
In theory, it makes sense to not present pages including contents "from the > future", but considering that crawls can not be executed instantly, using a > sliding time window seems to be more adequate to find embedded objects and > even links to other pages. > > We propose that the wayback/nutchwax should be configurable to: > > 1. Find contents to be rendered together based on the collection id or; > > 2.Find contents within a configurable date range centered on the date of the > page. Say if the page date is 2008/01/03, we would consider that the > embedded URLs crawled 3 days before and after this date could be rendered > along with it. Notice, that if one is performing a crawl every > 3 months, the timespan could be 1 month instead of 3 days. The timespan > should be configured according to the duration and frequency of the crawls. > We believe that contents from previous crawls should not be rendered > together with a page. > > We would deeply appreciate that you validate our conclusions and gave us > feedback about this issue. > > Best regards, > /Daniel Gomes > Portuguese web archive > http://xldb.fc.ul.pt/daniel/ > > > *From:* arc...@li... > [mailto:arc...@li...] *On Behalf Of > *Miguel Costa > *Sent:* segunda-feira, 3 de Março de 2008 19:15 > *To:* arc...@li... > *Subject:* [Archive-access-discuss] url bounded by timestamp > > Hi, > > When a page is presented in the wayback machine, the linked images (and > other resources) are searched to be presented also. > The problem is that my wayback machine is searching using the nutchwax > index, through the opensearch servlet, and the nutchwax bounds the search of > the images (resources) by date (the timestamp of the source page): > > eg: date:20010101000000-20080218201945 > exacturl:http://xldb.fc.ul.pt/daniel/scripts/statCounter.js > > after the url be called inside the source page: > http://t3.tomba.fccn.pt:8080/wayback/wayback/20080218201945/http://xldb.fc.u > l.pt/daniel/scripts/statCounter.js > > If the statCounter.js, for instance, has a higher timestamp (eg: > 20080218201955), that is usual, this resource is not found. > Does anyone know why these nutchwax searches don't use the collection id > instead the timestamp, to find the linked images (resources). Does anyone > know a solution for the problem? > > > Regards, > > > > -- Miguel Costa > > Portuguese Web Archive > > > -- > /Daniel Gomes > FCCN > Av. do Brasil, n.º 101 > 1700-066 Lisboa > Tel.: +351 21 8440190 > Fax: +351 218472167 > www.fccn.pt > > Aviso de Confidencialidade > > Esta mensagem é exclusivamente destinada ao seu destinatário, podendo conter > informação CONFIDENCIAL, cuja divulgação está expressamente vedada nos > termos da lei. Caso tenha recepcionado indevidamente esta mensagem, > solicitamos-lhe que nos comunique esse mesmo facto por esta via ou para o > telefone +351 218440100 devendo apagar o seu conteúdo de imediato. This > message is intended exclusively for its addressee. It may contain > CONFIDENTIAL information protected by law. If this message has been received > by error, please notify us via e-mail or by telephone +351 218440100 and > delete it immediately. > > > > > |
|
From: Brad T. <br...@ar...> - 2008-03-04 21:43:24
|
Hi Arnaud, This is a good question, which has come up several times in the past month or so, which hopefully means we'll be addressing it better in the 1.4 release, 2-3 month time frame. You do need to put all the ARC/WARC files in a single directory to serve them all from a single collection. This can be accomplished by copying/moving them to a single directory, or by using symbolic links. As another, more complex, alternative you could set up an ARC Proxy and expose all the ARCs in their various directories via HTTP 1.1, possibly just for access to the local machine, which would then be configured to use a RemoteResourceStore. The primary downside to doing this is that you need to manage updating your index yourself. Or, possibly better, would be to create multiple collections, each with a distinct LocalResourceStore pointing at the correct directory containing the appropriate ARC files. This would also require creating a separate ResourceIndex for each collection, and a separate AccessPoint for each of those collections. IA and other users have been doing this extensively in our deployments, but the down side is that you won't be able to search across all collections with a single query, but that may be what you want. So, the simplest is to move the files, or use symbolic links, but there are other options that can accomplish this. Brad Arn...@he... wrote: > Hello, > sorry certainly for this stupid question but after spending time into > heritrix manuals, wayback manuals and mailing list archive , I hope > users of this mailing can help me! > 1/ In heritrix Arcs files are created in 'arcs' directories under each > different Job. So several directories. > 2/ In 'wayback.xml' I have to define the 'dataDir'. So one directory. > How to organized my arcs files in one directory to be used by the > wayback machine? > Do I need to regularly copy the arcs files in a specific directory? > > Currently I tested by setting dataDir to one of my job arcs directory > but I obtain this error message when I hit the 'take me back' button > > > Etat HTTP 404 - /wayback-webapp-1.2.0/query > > I haven't error messages in tomcat log file. > > Arnaud. > > > > ------------------------------------------------------------------------- > This SF.net email is sponsored by: Microsoft > Defy all challenges. Microsoft(R) Visual Studio 2008. > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ > ------------------------------------------------------------------------ > > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > |
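To make the "multiple collections" alternative concrete, a hypothetical sketch of the per-job layout in wayback.xml. LocalResourceStore and dataDir are both named in this thread, but the bean id and full package paths here are placeholders patterned on the other beans quoted on this list, not copied from a shipped configuration; check them against the wayback.xml of your release:

```xml
<!-- Hypothetical sketch of "one collection per crawl-job directory".
     Bean id and package paths are illustrative placeholders. -->
<bean id="job1Collection" class="org.archive.wayback.webapp.WaybackCollection">
  <property name="resourceStore">
    <bean class="org.archive.wayback.resourcestore.LocalResourceStore">
      <!-- one crawl job's arcs directory; repeat the pattern per job -->
      <property name="dataDir" value="/heritrix/jobs/job1/arcs" />
    </bean>
  </property>
  <!-- plus a separate resourceIndex and AccessPoint per collection,
       as Brad describes above -->
</bean>
```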
|
From: Daniel G. <dan...@fc...> - 2008-03-04 17:55:13
|
The last email from my colleague Miguel Costa might have been a bit confusing. I will try to clarify the problem we identified because it looked quite important to us. We noticed that the wayback machine issues a query ranged by date to find embedded objects, such as images in an HTML page. Our first question is "Why is the query ranged by date instead of being restricted to the collection identifier?". A search by collection identifier would be more efficient because the search would be based on an exact match of the collection id and would present the images that most likely belong to that page. One may argue that this way, if the image was not crawled in the last collection, it would not be presented in the page. While using a date range query the image would still be included. The problem we see in this approach is that we might be including images that, although they exist in the archive, were never published in the page. This situation leads to our second doubt: The date range issued in the query is from a static date of the first collection (e.g. 20010101000000) to the timestamp of the page (e.g. 20080218201945). We believe this situation leads to several problems: 1. The date range of the query is unnecessarily broad: if we are looking for the images embedded in a page crawled in 2008, looking for them since 2001 seems excessive. 2. Pages can be presented containing old images that were never published together (the problem mentioned above). 3. Embedded images that have timestamps later than the page date (even by some minutes) are not found and not rendered along with the page. Notice that pages must be crawled first to extract links to the embedded images, so most images will have a date later than the page and will not be presented by the wayback. In theory, it makes sense not to present pages including contents "from the future", but considering that crawls cannot be executed instantly, using a sliding time window seems to be more adequate to find embedded objects and even links to other pages. We propose that the wayback/nutchwax should be configurable to: 1. Find contents to be rendered together based on the collection id; or 2. Find contents within a configurable date range centered on the date of the page (a sketch of this windowing follows this message). Say the page date is 2008/01/03: we would consider that the embedded URLs crawled 3 days before and after this date could be rendered along with it. Notice that if one is performing a crawl every 3 months, the timespan could be 1 month instead of 3 days. The timespan should be configured according to the duration and frequency of the crawls. We believe that contents from previous crawls should not be rendered together with a page. We would deeply appreciate it if you would validate our conclusions and give us feedback about this issue. Best regards, /Daniel Gomes Portuguese web archive http://xldb.fc.ul.pt/daniel/ From: arc...@li... [mailto:arc...@li...] On Behalf Of Miguel Costa Sent: Monday, 3 March 2008 19:15 To: arc...@li... Subject: [Archive-access-discuss] url bounded by timestamp Hi, When a page is presented in the wayback machine, the linked images (and other resources) are also searched so that they can be presented. 
The problem is that my wayback machine is searching using the nutchwax index, through the opensearch servlet, and the nutchwax bounds the search of the images (resources) by date (the timestamp of the source page): eg: date:20010101000000-20080218201945 exacturl:http://xldb.fc.ul.pt/daniel/scripts/statCounter.js after the URL is called inside the source page: http://t3.tomba.fccn.pt:8080/wayback/wayback/20080218201945/http://xldb.fc.ul.pt/daniel/scripts/statCounter.js If the statCounter.js, for instance, has a higher timestamp (e.g. 20080218201955), which is usual, this resource is not found. Does anyone know why these nutchwax searches don't use the collection id instead of the timestamp to find the linked images (resources)? Does anyone know a solution for the problem? Regards, -- Miguel Costa Portuguese Web Archive -- /Daniel Gomes FCCN Av. do Brasil, n.º 101 1700-066 Lisboa Tel.: +351 21 8440190 Fax: +351 218472167 www.fccn.pt Aviso de Confidencialidade Esta mensagem é exclusivamente destinada ao seu destinatário, podendo conter informação CONFIDENCIAL, cuja divulgação está expressamente vedada nos termos da lei. Caso tenha recepcionado indevidamente esta mensagem, solicitamos-lhe que nos comunique esse mesmo facto por esta via ou para o telefone +351 218440100 devendo apagar o seu conteúdo de imediato. This message is intended exclusively for its addressee. It may contain CONFIDENTIAL information protected by law. If this message has been received by error, please notify us via e-mail or by telephone +351 218440100 and delete it immediately. |
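Daniel's second proposal maps naturally onto Wayback's 14-digit timestamps; a self-contained sketch follows (the class and method names and the window size are illustrative, and the thread leaves the span configurable):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.TimeZone;

// Sketch of proposal 2 above: a query window centered on the page's
// capture date, e.g. +/- 3 days for frequent crawls, wider for rare
// ones. Names are illustrative, not from the Wayback source.
public class SlidingWindow {
    private static final String FMT = "yyyyMMddHHmmss"; // 14-digit stamps

    static String[] window(String pageTimestamp, int days)
            throws ParseException {
        SimpleDateFormat f = new SimpleDateFormat(FMT);
        f.setTimeZone(TimeZone.getTimeZone("GMT"));
        Calendar c = Calendar.getInstance(TimeZone.getTimeZone("GMT"));
        c.setTime(f.parse(pageTimestamp));
        c.add(Calendar.DAY_OF_MONTH, -days);
        String from = f.format(c.getTime());
        c.add(Calendar.DAY_OF_MONTH, 2 * days);
        String to = f.format(c.getTime());
        return new String[] { from, to };
    }

    public static void main(String[] args) throws ParseException {
        String[] w = window("20080103120000", 3); // page from 2008/01/03
        System.out.println("date:" + w[0] + "-" + w[1]);
        // prints date:20071231120000-20080106120000
    }
}
```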
|
From: Miguel C. <mig...@fc...> - 2008-03-03 19:15:32
|
Hi, When a page is presented in the wayback machine, the linked images (and other resources) are also searched so that they can be presented. The problem is that my wayback machine is searching using the nutchwax index, through the opensearch servlet, and the nutchwax bounds the search of the images (resources) by date (the timestamp of the source page): eg: date:20010101000000-20080218201945 exacturl:http://xldb.fc.ul.pt/daniel/scripts/statCounter.js after the URL is called inside the source page: http://t3.tomba.fccn.pt:8080/wayback/wayback/20080218201945/http://xldb.fc.ul.pt/daniel/scripts/statCounter.js If the statCounter.js, for instance, has a higher timestamp (e.g. 20080218201955), which is usual, this resource is not found. Does anyone know why these nutchwax searches don't use the collection id instead of the timestamp to find the linked images (resources)? Does anyone know a solution for the problem? Regards, -- Miguel Costa Portuguese Web Archive |
|
From: Miguel C. <mig...@fc...> - 2008-03-03 16:05:35
|
Hi,
I found the bug in the ImportArcs class. This bug makes the import command
build segments with wrong arc names.
The map method receives a "value" parameter containing an ARCRecord.
This ARCRecord has the url, arc filename and offset. All values are used in
this method, except the arc filename, which is set only the first time the map
method is called. So, when a thread is working over a new arc file, the
output for the index will reference the old arc filename.
The bug occurs at line 301 ("checkArcName(rec);"). I commented out line 545 of
the checkArcName() method to fix the bug (a sketch of the corresponding fix
follows at the end of this message).
Regards,
-- Miguel Costa
Portuguese Web Archive
-----Original Message-----
From: Miguel Costa [mailto:mig...@fc...]
Sent: Thursday, 28 February 2008 11:44
To: 'Brad Tofel'
Cc: 'Daniel Gomes'
Subject: FW: [Archive-access-discuss] org.archive.io.NoGzipMagicException
Hi,
I don't know if you found anything else about this problem, but I found the
cause of the problem.
The index has bad references for the ARC files. The offsets returned are ok
but not the ARC files, usually one ARC filename behind:
e.g. returns IAH-20080218190013-00000-T4.arc.gz instead of
IAH-20080218190013-00001-T4.arc.gz
You can see the ARC file and offset debugging the NutchResourceIndex (line
122: document = getHttpDocument(requestUrl)) or, much more simply, by
submitting the url in the browser
e.g.
http://localhost:8080/nutchwax/opensearch?query=date%3A20010101000000-20080218190351+exacturl%3Ahttp%3A%2F%2Fwww.icat.fc.ul.pt%2Fstyles.css&hitsPerPage=1000&start=0&dedupField=site&hitsPerDup=1000&hitsPerSite=1000
The wayback machine is using the nutchwax index through the opensearch link.
The nutchwax sends the XML information for the url match. This shows an ARC file
and an offset, but if you use the ARC READER over all ARCs to find this
offset:
e.g. find offset 24042995
arcreader `ls *.arc.gz` | grep 24042995
20080218190054 194.117.42.131 http://www.icat.fc.ul.pt/images/background_voltar.gif image/gif - - 24042995 388 IAH-20080218190013-00002-T4
you get an ARC file different from the expected one. In the cases where the
NoGzipMagicException doesn't occur, the ARC file is the correct one.
This occurs with one or more reduce tasks in hadoop, so it doesn't seem to be a
problem with the merge command.
Do you have any idea to solve this?
Regards,
-----Original Message-----
From: Miguel Costa [mailto:mig...@fc...]
Sent: Friday, 22 February 2008 15:22
To: 'Brad Tofel'
Subject: RE: [Archive-access-discuss] org.archive.io.NoGzipMagicException
Some more information on the problem.
I'm debugging the code using the org.archive.io.arc.ARCReader class from the
command line.
I can parse and dump all URLs from the arc.gz file.
When I use the offset returned by this dump I can see that the file is OK.
e.g.
/home/nutchwax/heritrix-1.12.1/src/scripts/arcreader -o 2332619 /home/nutchwax/arcs/IAH-20080123100910-00023-thessalian.arc.gz
20080123110842 70.85.38.82 http://www.gastronomias.com/moirasencantadas/imagens/logo.jpg image/jpeg - 4AN5CYCD3OYOZH7ZMOMJEKR37NTTXLT6 2332629 7722 IAH-20080123100910-00023-thessalian
When I put another offset I get the same exception:
Exception in thread "main" java.io.IOException: Not in GZIP format
at
java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:137)
So, the problem seems to be in the creation of the index because the offsets
are computed incorrectly.
-----Original Message-----
From: Brad Tofel [mailto:br...@ar...]
Sent: Thursday, 21 February 2008 19:52
To: Miguel Costa
Subject: Re: [Archive-access-discuss] org.archive.io.NoGzipMagicException
Darn. The problem didn't surface given the small input you sent.. Ran into
"Unepected End of ZLIB input stream" before the problem you are seeing.
Is there someplace online where you can post the entire file so I can
download it and examine it?
I should be able to receive a 100MB (how large is the original?) attachment
as well, if sending the whole file via email is an option for you.
Thanks,
Brad
Miguel Costa wrote:
|
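Miguel's description implies a per-record fix along these lines; the field name and the metadata accessor are illustrative reconstructions (shown per the Heritrix 1.x ARCRecordMetaData API as best recalled; verify against your version), since only the two line numbers (301 and 545) are quoted from ImportArcs:

```java
// Hedged reconstruction of the ImportArcs fix described above: keep the
// cached ARC name in sync on every map() call instead of pinning it to
// the first record seen. This is a fragment of the mapper class, not a
// complete file.
private String arcName; // cached name used when emitting index entries

private void checkArcName(ARCRecord rec) {
    String current = rec.getMetaData().getArc();
    if (arcName == null || !arcName.equals(current)) {
        // Updating on change keeps offsets paired with the right file
        // when a task crosses an ARC boundary (the reported symptom was
        // entries pointing one ARC filename behind).
        arcName = current;
    }
}
```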
|
From: Brad T. <br...@ar...> - 2008-03-01 02:08:51
|
Hi Thomas, Thanks for the kind feedback. A couple of suggestions, and also some follow-up questions interspersed: Thomas Beekman wrote: > Hi all, > > At the KB we are severely testing Wayback 1.2.0 at the moment. My first > impression is quite positive; many new functions are added, it is quite > easy to implement different modules for different access points and > several indexing threads can live side by side now. > > I have a few questions though. First of all, I'm experiencing errors > which did not occur in older versions; java.lang.OutOfMemoryError: GC > overhead limit exceeded. Does anyone know how to fix this? > > I haven't seen this before, and some quick google searches indicate it may be one of: A) a JVM problem (which JVM are you using?) B) too little heap space in the java startup arguments C) the wayback software doing lots of object creation+destruction. We have large installations in production at the IA, one using 700+ Collections and 1400+ AccessPoints; note that these all use CDX indexes, which are more resource efficient. I'm hoping that C is not the problem, but we haven't yet needed to do a heavy optimization pass over the code, so it could be Wayback itself. Are you using IBM's JVM? Have you tried increasing the heap? If that doesn't address the problem, can you please send me a copy of your wayback.xml Spring configuration? > Second; when closing down Wayback in Tomcat, the lock file for the > localbdb is not erased. A restart is therefore not possible. Could this > be fixed so that if the webapp is closed down, the lock file is erased? > > On what platform (OS+JVM) are you running Wayback? Is the BDB index stored over NFS or another networked file system? I haven't experienced this problem on any of our systems -- the BDBJE just starts up, even with the lock file still existing. I haven't looked into this, but guessed that it was using the lock file via flock() type semantics, instead of using its existence to indicate a lock. BDBJE may determine that the DB is on a remote system, where flock() semantics don't work, in which case it may be falling back to using the existence of the lock file to indicate usage.. In any case, I've just implemented the "clean shutdown" processing in my development environment, but will probably hold off to do more testing before including it in a release. We are preparing a 1.2.1 release which addresses a couple bugs discovered by folks in the field, but are holding this release for feedback from one more user having trouble reading some ARC files. > Third; with a few websites the timeline GUI is scrambled. I get a full > yellow screen with on every line a mark. After scrolling down that page, > the website is presented normally. This is not the case with every > website. > > Yes, the css implementation in the current timeline is prone to inheriting some styles from some web pages. Could you please send me a few example pages on the live web that demonstrate the problem you're seeing? > My fourth and last problem is in the configuration. I would like to do > some tests using the remote NutchWAX search, but there is not a clear > manual of how to implement this precisely, which beans to use for > example. Does anyone have a good example for me? 
> > Setting up a collection with this bean: <property name="resourceIndex"> <bean class="org.archive.wayback.resourceindex.NutchResourceIndex" init-method="init"> <property name="searchUrlBase" value="http://webteam-ws.us.archive.org:8080/katrina/opensearch" /> <property name="maxRecords" value="100" /> </bean> </property> Should do the trick. Note that if using Archival URL mode, you should be sure to set the maxRecords property on the RequestParser to the same value for maxRecords.. This may be a bug -- would be more friendly to use the min() of both values.. <property name="parser"> <bean class="org.archive.wayback.archivalurl.ArchivalUrlRequestParser" init-method="init"> <property name="maxRecords" value="100" /> <property name="earliestTimestamp" value="1996" /> </bean> </property> Hopefully this works for you, and please let me know about the questions above. Brad |
|
From: Miguel C. <mig...@fc...> - 2008-02-28 17:49:58
|
Hi, I would like to know the ratio between (index size)/(collection size) for collections larger than 1 TB. My objective is to have the whole index in memory, so given X GB of memory, what is the maximum size of a collection I can index? Can anyone give me some numbers? Regards, -- Miguel Costa FCCN-Fundação para a Computação Científica Nacional Av. do Brasil, n.º 101 1700-066 Lisboa Tel.: +351 21 8440190 Fax: +351 218472167 www.fccn.pt Aviso de Confidencialidade Esta mensagem é exclusivamente destinada ao seu destinatário, podendo conter informação CONFIDENCIAL, cuja divulgação está expressamente vedada nos termos da lei. Caso tenha recepcionado indevidamente esta mensagem, solicitamos-lhe que nos comunique esse mesmo facto por esta via ou para o telefone +351 218440100 devendo apagar o seu conteúdo de imediato. This message is intended exclusively for its addressee. It may contain CONFIDENTIAL information protected by law. If this message has been received by error, please notify us via e-mail or by telephone +351 218440100 and delete it immediately. |
|
From: <Arn...@he...> - 2008-02-21 12:43:03
|
Hello, sorry for what is certainly a stupid question, but after spending time in the heritrix manuals, the wayback manuals and the mailing list archive, I hope users of this mailing list can help me! 1/ In heritrix, Arc files are created in 'arcs' directories under each different Job. So several directories. 2/ In 'wayback.xml' I have to define the 'dataDir'. So one directory. How should I organize my arc files into one directory to be used by the wayback machine? Do I need to regularly copy the arc files into a specific directory? Currently I tested by setting dataDir to one of my job arcs directories, but I obtain this error message when I hit the 'take me back' button Etat HTTP 404 - /wayback-webapp-1.2.0/query I have no error messages in the tomcat log file. Arnaud. |
|
From: Thomas B. <Tho...@KB...> - 2008-02-21 11:48:41
|
Hi all, At the KB we are testing Wayback 1.2.0 intensively at the moment. My first impression is quite positive; many new functions have been added, it is quite easy to implement different modules for different access points, and several indexing threads can live side by side now. I have a few questions though. First of all, I'm experiencing errors which did not occur in older versions: java.lang.OutOfMemoryError: GC overhead limit exceeded. Does anyone know how to fix this? Second; when closing down Wayback in Tomcat, the lock file for the localbdb is not erased. A restart is therefore not possible. Could this be fixed so that if the webapp is closed down, the lock file is erased? Third; with a few websites the timeline GUI is scrambled. I get a full yellow screen with a mark on every line. After scrolling down that page, the website is presented normally. This is not the case with every website. My fourth and last problem is in the configuration. I would like to do some tests using the remote NutchWAX search, but there is not a clear manual of how to implement this precisely, which beans to use for example. Does anyone have a good example for me? Keep up the good work! Wayback is really becoming a beautiful piece of software. Cheers, Thomas Beekman Technical Lead KB (National Library of the Netherlands) |
|
From: Miguel C. <mig...@fc...> - 2008-02-18 16:26:04
|
Hi, Does anyone know how to split a segment into N sub-segments? With the "org.apache.nutch.segment.SegmentMerger -split" command I can split a segment by number of URLs, but how do I split it into N parts? Regards, -- Miguel Costa |
|
From: Miguel C. <mig...@fc...> - 2008-02-15 15:36:15
|
Hi Lee, Thanks for your reply. I have two doubts about your response: 1- After I deploy on 10 (n) machines, should I index each subset locally, in parallel on the 10 machines, or in a distributed way (indexing the 10 subsets sequentially)? 2- If I split the ARCs, will the ranking values use local statistics from the ARC subset or global statistics from the whole collection and web graph? If local, the ranking will not be normalized between subsets; if global, when are these values merged? At runtime, during query responses? Regards, _____ From: arc...@li... [mailto:arc...@li...] On Behalf Of John H. Lee Sent: Tuesday, 5 February 2008 20:30 To: arc...@li... Subject: Re: [Archive-access-discuss] how to partition the index? Hi Miguel. To use distributed search, you need to plan ahead a bit and generate multiple indices. I don't know of a way to partition an existing large index into smaller chunks. For example, if you're indexing 100,000 ARCs and want to deploy on 10 machines, you should split your list of ARCs into 10 chunks of 10,000, invoke ImportArcs for each chunk, and invoke NutchwaxIndexer for each chunk. This will produce 10 segment/index pairs, each of which could be deployed on one of your 10 machines. For large jobs, I usually split the ARCs into groups of 1000. This produces segment/index pairs that are small enough to be manageable and flexible when it comes to deployment layout. Hope this helps. -J On Feb 5, 2008, at 5:12 AM, Miguel Costa wrote: Hi to all, After reading the nutchwax + nutch documentation I can index ARC files and search them using the nutchwax + wayback machine. However, I would like to perform a distributed search, but I can't find any documentation on how to partition the index in n parts/segments for n machines. On the other hand there is information explaining how to distribute search using the search-servers.txt file, but I need to partition the index first. Can anyone explain to me or give me a clue on how to partition an index for n machines? Regards, Miguel Costa ------------------------------------------------------------------------- This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/_____________________ __________________________ Archive-access-discuss mailing list Arc...@li... https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
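The chunking step John describes is just a fixed-size partition of the ARC list; a tiny sketch of the planning code (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the step John describes: split the ARC list into fixed-size
// chunks, then run ImportArcs and NutchwaxIndexer once per chunk to get
// one segment/index pair per search machine.
public class ArcChunker {
    static List<List<String>> chunk(List<String> arcs, int size) {
        List<List<String>> chunks = new ArrayList<List<String>>();
        for (int i = 0; i < arcs.size(); i += size) {
            chunks.add(new ArrayList<String>(
                    arcs.subList(i, Math.min(i + size, arcs.size()))));
        }
        return chunks; // e.g. chunk(arcList, 1000) per the advice above
    }
}
```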
|
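John's recipe maps directly onto a small shell loop. A sketch, assuming arcs.txt lists one ARC file per line; run_import and run_index are hypothetical placeholders for whatever ImportArcs and NutchwaxIndexer invocations your NutchWAX version provides:

    # Split the ARC list into chunks of 10,000 (chunk_aa, chunk_ab, ...).
    split -l 10000 arcs.txt chunk_
    # Build one segment/index pair per chunk; each pair can then be
    # deployed to one search machine.
    for chunk in chunk_*; do
        run_import "$chunk" "segments/$chunk"           # ImportArcs step
        run_index "segments/$chunk" "indexes/$chunk"    # NutchwaxIndexer step
    done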
From: Oskar G. <Osk...@kb...> - 2008-02-15 11:03:42
Hi!

I've downloaded WB 1.2.0 (which is excellent, by the way) and got it working right away with ARC files. But when I later turned my attention to WARC files (downloaded with Heritrix 1.12.1), I couldn't get it to work at first. When using "warc-indexer" to create CDX files, it just spat out a NullPointerException and crashed. After some browsing through the code and testing, I found that the cause was that the method getUrl() in ArchiveRecordHeader in the jar file commons-2.0.1-SNAPSHOT.jar ALWAYS returned null. Why that is I haven't looked into, but it caused line 300 in WARCRecordToSearchResultAdapter.java ( String origHost = uriStr.substring(WaybackConstants.DNS_URL_PREFIX.length()); ) to throw the NPE. The solution for me was to replace "commons-2.0.1-SNAPSHOT.jar" with "heritrix.1.12.1.jar", and then it worked fine.

Best regards,
Oskar
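Oskar's workaround boils down to swapping one jar in the deployed webapp. A sketch with illustrative paths (your Tomcat home, webapp name, and Heritrix install location will differ); stop Tomcat before making the change:

    cd $CATALINA_HOME/webapps/wayback/WEB-INF/lib
    # Take the broken jar out of the classpath; Tomcat only loads *.jar.
    mv commons-2.0.1-SNAPSHOT.jar commons-2.0.1-SNAPSHOT.jar.disabled
    # Drop in the Heritrix 1.12.1 jar described above.
    cp /path/to/heritrix.1.12.1.jar .
    # Restart Tomcat so the webapp picks up the replacement.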
From: stack <st...@du...> - 2008-02-08 16:51:13
Pope, Jackson wrote:
> Hiya all,
>
> I've created a lot of nutchwax indices, deployed the segments and index
> for each to the search directory, and got nutchwax/wayback to search
> these successfully.
>
> However, when I try to add more than 40 I hit the 'too many open files'
> problem I mentioned before. Several people have suggested upping the
> 'ulimit' to 32768, but I've already got it set to 1024, so upping it to
> 32768 would theoretically allow me to create 30 x 40 indices, still an
> order of magnitude smaller than I need.

Regarding 1024, do the math. Each index is made of, say, 20 files (do a listing of an index to know for sure). 40 * 20 = 800, not counting the other files the application needs to open (jar files, configuration files, etc.). As you can see, 1024 probably ain't enough.

Searching many indices is slower than searching a single index. That's another reason to do merging.

> Next step I've tried is index merging.
>
> I've run the IndexMerger over some of my indices successfully, but when
> I replace the indexes directory (which contains the individual indices)
> with the new index, nutchwax stops working. It tells me that it's found
> some hits for my search term, but it doesn't list them, and wayback
> claims the index is unavailable. What else do I need to do to deploy a
> merged index?

Any exceptions in the Tomcat log? Or, looking at the logging, is it looking in the right place for the index? You might need to add an empty index.done file to the merged index if it's not there already (see the end of this FAQ: http://archive-access.sourceforge.net/projects/nutchwax/faq.html#incremental) -- but I'm fuzzy on this stuff, so that might not be it.

St.Ack
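Both suggestions above are quick to act on from a shell; the index directory names below (indexes/index-0001, indexes/merged) are illustrative:

    # Estimate descriptors needed: files per index * number of indices,
    # plus jars, configuration files, sockets, etc.
    ls indexes/index-0001 | wc -l

    # Raise the open-file limit in the shell that starts Tomcat
    # (raising the hard limit may require root).
    ulimit -n 32768

    # Per the FAQ referenced above, mark the merged index as complete
    # with an empty index.done file.
    touch indexes/merged/index.done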
From: Pope, J. <Jac...@bl...> - 2008-02-08 11:38:04
Hiya all,

I've created a lot of nutchwax indices, deployed the segments and index for each to the search directory, and got nutchwax/wayback to search these successfully.

However, when I try to add more than 40 I hit the 'too many open files' problem I mentioned before. Several people have suggested upping the 'ulimit' to 32768, but I've already got it set to 1024, so upping it to 32768 would theoretically allow me to create 30 x 40 indices, still an order of magnitude smaller than I need.

Next step I've tried is index merging. I've run the IndexMerger over some of my indices successfully, but when I replace the indexes directory (which contains the individual indices) with the new index, nutchwax stops working. It tells me that it's found some hits for my search term, but it doesn't list them, and wayback claims the index is unavailable. What else do I need to do to deploy a merged index?

Cheers,

Jack

Jackson Pope
Technical Lead
Web Archiving Team
The British Library
+44 (0)1937 54 6942
From: Brad T. <br...@ar...> - 2008-02-07 02:19:49
Wayback is an open-source Java implementation of the Internet Archive's Wayback Machine service. The 1.2.0 release includes support for compressed and uncompressed ARC and WARC files, support for duplicate-reduction WARC records, a new JavaScript-free ArchivalURL replay mode, and many bug fixes and other minor enhancements. For detailed features and changes, please see the Release Notes page on the Wayback project site at http://archive-access.sourceforge.net/projects/wayback/release_notes.html

Yours,
Internet Archive Webteam
From: Brad T. <br...@ar...> - 2008-02-06 00:06:34
Good question. This was a bug/missing feature in the software, but I've just tested a check-in to HEAD (the 1.2.0 release candidate) that addresses this issue. We're still not handling non-HTTP protocols correctly, but that will wait until 1.4.0, which will have a new index format allowing better searches, and should expose additional search options via the UI, allowing end users to relax canonicalization if they are not finding the documents they want.

So, as of now, the following tar.gz is the release candidate, and should fix this issue as well as numerous other bugs:

http://builds.archive.org:8080/maven2/org/archive/wayback/wayback/1.1.0-SNAPSHOT/wayback-1.1.0-20080204.230115-24-1.1.0-SNAPSHOT.tar.gz

Let me know if this works for you, and whether you find any other problems with this version.

Brad

Chris Vicary wrote:
> Hi all,
>
> I am having a problem retrieving harvested resources whose URLs include
> port numbers, using Wayback 1.0.1. We have a seed that includes a port
> number that was harvested using Heritrix. The resulting ARC files were
> indexed using Wayback, and the URLs stored in the index include the port
> number. Using the Wayback web address search interface, I am able to
> find the URLs by including the port number in the search string (if the
> port number is not included, no results are found - which is expected).
> The link for the search result does not include the port number,
> however, and clicking it does not retrieve the harvested resource. If
> the port number is inserted into the search result link, retrieval works
> fine. Even so, rewritten links on the retrieved page do not include a
> port number where applicable. So my question is, how do I ensure that
> port numbers are preserved in Wayback search results and in rewritten
> links?
>
> Thanks,
>
> Chris