From: Graham, L. <lg...@lo...> - 2011-04-27 17:36:47
Thanks, we did find the Heritrix template and will test naming without tildes. It seems reasonably easy to do. I did share here the benefits that Brad outlined in having the tildes. While we weren't having issues receiving bags with files named with tildes in our system, the advice was: if you can avoid them, do so. Maybe 3.1 could have an option to turn on tildes, with hyphens remaining the default? (Also, sorry, I realize I'm probably on the wrong list with this Heritrix issue.)

Thanks much.

Laura Graham
Library of Congress

----------------------------------------------------------------------

Message: 1
Date: Mon, 25 Apr 2011 09:04:41 -0700
From: Gordon Mohr <go...@ar...>
Subject: Re: [Archive-access-discuss] heritrix-3.1.0-beta: tildas in warc filename
To: arc...@li...
Message-ID: <4DB...@ar...>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Thanks for the note; we've heard some concern about the use of tildes, but so far no reports of anyplace where it actually breaks things. We could consider changing the defaults before the official release.

However, by using something other than '-', the new 'pid~host~port' construct can be thought of as filling the exact same dash-delimited position as 'host' previously did. (Any processing based on the old naming that assumed dashes and the prior fields should continue to work, getting the same number of tokens.)

One recommendation: if for any reason a project does need a different naming formula than the default, it's best to change it in the crawler configuration (specifically the 'template' property on the W/ARCWriterProcessor) so that the file is initially written with the desired name, rather than by a separate renaming step after it is written. Both the ARC and WARC formats include an internal reference to their filename-as-originally-written, so any later renaming creates a mismatch with the internally declared name and thus some risk of confusion.
- Gordon @ IA

On 4/25/11 6:41 AM, Graham, Laura wrote:
> This is just a comment, as it may not be an issue for other institutions. But we have noticed in heritrix-3.1.0-beta that WARCs are written with tildes in the filenames. From the documentation, these are to indicate host, port, and pid values in the filename.
>
> After consulting with our repository tool development team here at the Library of Congress, we will likely rename these files going forward, replacing the tildes with hyphens.
>
> While our current bit preservation inventory tool accepts the tildes, and while there are lots of issues with filenames in general, which any system or set of tools will need to deal with, we've decided to take this extra renaming step. It only takes a moment, and it's one less possible issue to track in our work going forward.
>
> Again, just a comment.
>
> Thanks,
> Laura Graham
> Library of Congress
>
> _______________________________________________
> Archive-access-discuss mailing list
> Arc...@li...
> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
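
A rough sketch of the token-count point made in the thread, for anyone weighing the rename: splitting on '-' gives the same number of fields for the old naming and the tilde naming, while replacing tildes with hyphens adds two fields. The filenames below are made up for illustration and are not actual Heritrix output; the helper names (dash_fields, matches_declared_name) are hypothetical, and the exact field layout is an assumption. The last helper only illustrates the mismatch risk: it assumes the originally declared name has already been read from the file by some other means (for example, the WARC-Filename field of a warcinfo record).

```python
import os

# Extensions to strip before splitting; longest forms are checked first.
EXTENSIONS = (".warc.gz", ".arc.gz", ".warc", ".arc")


def dash_fields(filename: str) -> list[str]:
    """Split a W/ARC filename into its dash-delimited fields, ignoring the extension."""
    stem = filename
    for ext in EXTENSIONS:
        if stem.endswith(ext):
            stem = stem[: -len(ext)]
            break
    return stem.split("-")


def matches_declared_name(path_on_disk: str, declared_name: str) -> bool:
    """True if the on-disk basename still equals the name declared inside the file."""
    return os.path.basename(path_on_disk) == declared_name


# Hypothetical names: old style ends in 'host', new style ends in 'pid~host~port'.
old_name = "EXAMPLE-20110425160441-00000-crawler01.example.org.warc.gz"
new_name = "EXAMPLE-20110425160441123-00000-4321~crawler01.example.org~8443.warc.gz"
renamed = new_name.replace("~", "-")  # the post-hoc renaming step discussed above

print(len(dash_fields(old_name)))  # 4 fields
print(len(dash_fields(new_name)))  # 4 fields: the tilde group occupies the old 'host' slot
print(len(dash_fields(renamed)))   # 6 fields: hyphens split pid, host, and port apart

# The tilde group can still be unpacked when the extra values are wanted.
pid, host, port = dash_fields(new_name)[-1].split("~")

# After renaming, the on-disk name no longer matches the originally written name.
print(matches_declared_name("/data/" + renamed, new_name))  # False
```

Note that a hostname or prefix containing its own hyphens would change the counts for every variant, so a dash-based parser needs to account for that regardless of which naming is used.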