From: Brad T. <br...@ar...> - 2008-03-04 23:09:52
Hi Daniel,

Thanks for the elaboration and the excellent suggestions.

We've been discussing adding functionality to Wayback to allow users to target a specific date they want to stay "near" within a replay session. Currently, when retrieving an embedded object for a web page, or when navigating between two archived web pages, Wayback returns the document that is closest to the current document being viewed. We'd like to add the capability for users to specify a date, as well as a maximum range before and after that date to stay within for these embedded requests and for navigation.

In somewhat more detail, we plan to greatly expand the "in-page presence" of the Wayback software. In this particular case, that would mean including a banner or additional element in the page that lets users temporarily expand the maximum range of embedded elements for a specific page, potentially allowing replay of captures that were archived but fall outside the standard maximum range.

I think this is the same functionality you're suggesting, and we're hoping to have this in the 1.4 release, in a 2-3 month time frame. Wayback HEAD may include this functionality before that, and I'll let you know how that progresses.

However, in the context in which you're using Wayback, with a Nutch ResourceIndex, this may require more functionality within Nutch as well. I'm not sure what the schedule might be for that, but again I will keep you posted.

Please let me know if I've misunderstood your suggestion and the functionality we've discussed is not what you had in mind.

Brad

Daniel Gomes wrote:
> The last email from my colleague Miguel Costa might have been a bit
> confusing. I will try to clarify the problem we identified because it looked
> quite important to us.
>
> We noticed that the wayback machine issues a query ranged by date to find
> embedded objects, such as images in an HTML page.
>
> Our first question is: "Why is the query ranged by date instead of being
> restricted to the collection identifier?"
>
> A search by collection identifier would be more efficient, because it would
> be based on an exact match of the collection id, and it would present the
> images that most likely belong to that page.
> One may argue that, this way, an image that was not crawled in the last
> collection would not be presented in the page, while with a date range
> query the image would still be included. The problem we see with the date
> range approach is that we might include images that, although they exist in
> the archive, were never published in the page.
>
> This situation leads to our second doubt:
>
> The date range issued in the query is from a static date of the first
> collection (e.g. 20010101000000) to the timestamp of the page (e.g.
> 20080218201945).
>
> We believe this situation leads to several problems:
>
> 1. The date range of the query is unnecessarily broad. If we are looking for
> the images embedded in a page crawled in 2008, looking for them since 2001
> seems excessive.
>
> 2. Pages can be presented containing old images that were never published
> together with them (the problem mentioned above).
>
> 3. Embedded images whose timestamps are later than the page date (even by
> a few minutes) are not found and not rendered along with the page.
> Note that pages must be crawled first to extract links to the embedded
> images, so most images will have a date later than the page and will not be
> presented by the wayback.
>
> In theory, it makes sense not to present pages including contents "from the
> future", but considering that crawls cannot be executed instantly, using a
> sliding time window seems more adequate to find embedded objects and
> even links to other pages.
>
> We propose that the wayback/nutchwax should be configurable to:
>
> 1. Find contents to be rendered together based on the collection id; or
>
> 2. Find contents within a configurable date range centered on the date of
> the page. Say the page date is 2008/01/03: we would consider that the
> embedded URLs crawled 3 days before and after this date could be rendered
> along with it. Note that if one is performing a crawl every
> 3 months, the timespan could be 1 month instead of 3 days. The timespan
> should be configured according to the duration and frequency of the crawls.
> We believe that contents from previous crawls should not be rendered
> together with a page.
>
> We would deeply appreciate it if you could validate our conclusions and give
> us feedback about this issue.
>
> Best regards,
> /Daniel Gomes
> Portuguese web archive
> http://xldb.fc.ul.pt/daniel/
>
>
> *From:* arc...@li...
> [mailto:arc...@li...] *On Behalf Of
> *Miguel Costa
> *Sent:* Monday, 3 March 2008 19:15
> *To:* arc...@li...
> *Subject:* [Archive-access-discuss] url bounded by timestamp
>
> Hi,
>
> When a page is presented in the wayback machine, the linked images (and
> other resources) are also searched for so they can be presented.
> The problem is that my wayback machine searches using the nutchwax
> index, through the opensearch servlet, and nutchwax bounds the search for
> the images (resources) by date (the timestamp of the source page):
>
> e.g.: date:20010101000000-20080218201945
> exacturl:http://xldb.fc.ul.pt/daniel/scripts/statCounter.js
>
> after the URL is requested from within the source page:
> http://t3.tomba.fccn.pt:8080/wayback/wayback/20080218201945/http://xldb.fc.ul.pt/daniel/scripts/statCounter.js
>
> If statCounter.js, for instance, has a later timestamp (e.g.
> 20080218201955), which is common, the resource is not found.
> Does anyone know why these nutchwax searches don't use the collection id
> instead of the timestamp to find the linked images (resources)? Does anyone
> know a solution for the problem?
>
> Regards,
>
> -- Miguel Costa
> Portuguese Web Archive
>
> --
> /Daniel Gomes
> FCCN
> Av. do Brasil, n.º 101
> 1700-066 Lisboa
> Tel.: +351 21 8440190
> Fax: +351 218472167
> www.fccn.pt
>
> Confidentiality Notice
>
> This message is intended exclusively for its addressee. It may contain
> CONFIDENTIAL information protected by law. If this message has been received
> in error, please notify us via e-mail or by telephone +351 218440100 and
> delete it immediately.
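
For illustration, the windowed selection discussed in this thread (return the capture closest to the page date, but only within a maximum range before and after it) might be sketched roughly as follows. This is a minimal Python sketch and not Wayback or NutchWAX code: the function names and the max_before/max_after parameters are hypothetical, and only the 14-digit timestamp format and the example timestamps are taken from the messages above.

```python
from datetime import datetime, timedelta

# 14-digit capture timestamps as they appear in the thread (YYYYMMDDhhmmss).
TS_FORMAT = "%Y%m%d%H%M%S"


def parse_ts(ts):
    """Parse a 14-digit timestamp string into a datetime."""
    return datetime.strptime(ts, TS_FORMAT)


def closest_within_window(page_ts, capture_ts_list, max_before, max_after):
    """Hypothetical helper: pick the capture closest to the page date,
    considering only captures inside [page - max_before, page + max_after].
    Returns None when no capture falls inside the window."""
    page_time = parse_ts(page_ts)
    earliest = page_time - max_before
    latest = page_time + max_after
    in_window = [
        ts for ts in capture_ts_list
        if earliest <= parse_ts(ts) <= latest
    ]
    if not in_window:
        return None
    # Smallest absolute distance to the page date wins.
    return min(in_window, key=lambda ts: abs(parse_ts(ts) - page_time))


# Example values from the thread: the script was captured 10 seconds
# after the page, so a range ending at the page timestamp would miss it.
window = timedelta(days=3)
chosen = closest_within_window(
    "20080218201945",
    ["20071101120000", "20080218201955"],
    max_before=window,
    max_after=window,
)
```

With a 3-day window on each side, the capture taken shortly after the page is eligible, while a capture from an earlier crawl is excluded. The window size would presumably be tuned to the duration and frequency of the crawls, as suggested in the quoted proposal.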