From: Daniel G. <dan...@fc...> - 2008-03-04 17:55:13
The last email from my colleague Miguel Costa might have been a bit confusing. I will try to clarify the problem we identified, because it looks quite important to us.

We noticed that the wayback machine issues a query ranged by date to find embedded objects, such as images in an HTML page.

Our first question is: "Why is the query ranged by date instead of being restricted to the collection identifier?" A search by collection identifier would be more efficient, because it would be based on an exact match of the collection id, and it would return the images that most likely belong to that page. One may argue that, this way, an image that was not crawled in the last collection would not be presented in the page, while with a date range query the image would still be included. The problem we see in the date range approach is that we might include images that, although they exist in the archive, were never published in the page.

This leads to our second doubt. The date range issued in the query goes from a static date of the first collection (e.g. 20010101000000) to the timestamp of the page (e.g. 20080218201945). We believe this leads to several problems:

1. The date range of the query is unnecessarily broad. If we are looking for the images embedded in a page crawled in 2008, searching back to 2001 seems excessive.
2. Pages can be presented containing old images that were never published together with them (the problem mentioned above).
3. Embedded images with timestamps later than the page date (even by a few minutes) are not found and are not rendered along with the page. Notice that pages must be crawled first to extract the links to the embedded images, so most images will have a date later than the page and will not be presented by the wayback.
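To make problems 2 and 3 concrete, here is a minimal Python sketch of the range check as we understand it from the queries we observed. It is only a model of the behaviour, not the actual wayback or nutchwax code; the function names and the default start date are our own assumptions for illustration.

```python
from datetime import datetime

# 14-digit timestamps, as they appear in the observed queries.
TS_FORMAT = "%Y%m%d%H%M%S"


def parse_ts(ts: str) -> datetime:
    """Parse a 14-digit timestamp such as 20080218201945."""
    return datetime.strptime(ts, TS_FORMAT)


def in_current_range(resource_ts: str, page_ts: str,
                     first_collection_ts: str = "20010101000000") -> bool:
    """Hypothetical model of the observed range:
    [static start of the first collection, timestamp of the page]."""
    return parse_ts(first_collection_ts) <= parse_ts(resource_ts) <= parse_ts(page_ts)


page_ts = "20080218201945"

# Problem 3: an image crawled 10 seconds after the page is excluded.
print(in_current_range("20080218201955", page_ts))  # False

# Problem 2: a capture from years earlier is still accepted.
print(in_current_range("20030510120000", page_ts))  # True
```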
In theory, it makes sense not to present pages including contents "from the future", but considering that crawls cannot be executed instantly, a sliding time window seems more adequate for finding embedded objects and even links to other pages.

We propose that the wayback/nutchwax should be configurable to:

1. Find contents to be rendered together based on the collection id; or
2. Find contents within a configurable date range centered on the date of the page. Say the page date is 2008/01/03: we would consider that the embedded URLs crawled up to 3 days before or after this date could be rendered along with it. Notice that if one performs a crawl every 3 months, the timespan could be 1 month instead of 3 days. The timespan should be configured according to the duration and frequency of the crawls.

We believe that contents from previous crawls should not be rendered together with a page.

We would greatly appreciate it if you could validate our conclusions and give us feedback about this issue.

Best regards,
/Daniel Gomes
Portuguese web archive
http://xldb.fc.ul.pt/daniel/

*From:* arc...@li... [mailto:arc...@li...] *On Behalf Of *Miguel Costa
*Sent:* Monday, 3 March 2008 19:15
*To:* arc...@li...
*Subject:* [Archive-access-discuss] url bounded by timestamp

Hi,

When a page is presented in the wayback machine, the linked images (and other resources) are also searched for, so that they can be presented with it.
The problem is that my wayback machine searches the nutchwax index through the opensearch servlet, and nutchwax bounds the search for the images (resources) by date (up to the timestamp of the source page), e.g.:

date:20010101000000-20080218201945 exacturl:http://xldb.fc.ul.pt/daniel/scripts/statCounter.js

This query is issued after the URL is requested from inside the source page:

http://t3.tomba.fccn.pt:8080/wayback/wayback/20080218201945/http://xldb.fc.ul.pt/daniel/scripts/statCounter.js

If statCounter.js, for instance, has a later timestamp (e.g. 20080218201955), which is common, the resource is not found.

Does anyone know why these nutchwax searches use the timestamp instead of the collection id to find the linked images (resources)? Does anyone know a solution for this problem?

Regards,
--
Miguel Costa
Portuguese Web Archive

--
/Daniel Gomes
FCCN
Av. do Brasil, n.º 101
1700-066 Lisboa
Tel.: +351 21 8440190
Fax: +351 218472167
www.fccn.pt

Confidentiality notice
This message is intended exclusively for its addressee. It may contain CONFIDENTIAL information protected by law. If this message has been received by error, please notify us via e-mail or by telephone +351 218440100 and delete it immediately.
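As a rough illustration of the two options proposed in the message above, the following Python sketch builds the query strings that each option might produce. The `date:` and `exacturl:` syntax is copied from the queries we observed; the `collection:` field name, the function names and the `timespan` parameter are assumptions made only for this example, not a confirmed nutchwax or wayback API.

```python
from datetime import datetime, timedelta

# 14-digit timestamps, as they appear in the observed queries.
TS_FORMAT = "%Y%m%d%H%M%S"


def window_query(page_ts: str, url: str, timespan: timedelta) -> str:
    """Option 2 (sketch): date range centered on the page timestamp."""
    center = datetime.strptime(page_ts, TS_FORMAT)
    start = (center - timespan).strftime(TS_FORMAT)
    end = (center + timespan).strftime(TS_FORMAT)
    return f"date:{start}-{end} exacturl:{url}"


def collection_query(collection_id: str, url: str) -> str:
    """Option 1 (sketch): exact match on the collection id.
    The 'collection:' field name is an assumption."""
    return f"collection:{collection_id} exacturl:{url}"


url = "http://xldb.fc.ul.pt/daniel/scripts/statCounter.js"

# Page dated 2008/01/03 with a timespan of 3 days on each side.
print(window_query("20080103000000", url, timedelta(days=3)))
# date:20071231000000-20080106000000 exacturl:http://xldb.fc.ul.pt/daniel/scripts/statCounter.js
```

With a window like this, the statCounter.js capture at 20080218201955 would fall inside the range for the page at 20080218201945, while captures from earlier crawls would fall outside it, provided the timespan is shorter than the interval between crawls.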