From: Bradley T. <br...@ar...> - 2011-04-25 19:15:04
Hi Laura,

Note that the initial WARCINFO record contains the name of the file in the WARC-Filename field. This has some possible benefits:

1) Later processing steps can deduce the filename from the data stream itself, without relying on the actual filename or on wrapping tools to forward the actual filename. If files are renamed, any software which assumes this field reflects the actual filename (perhaps so that other processes can later access the data) may break.

1b) Similar to #1, the format allows the contents of multiple WARC files to be concatenated into a single stream and fed to a processing tool, which can use the warcinfo record, specifically the WARC-Filename field, to detect the original file boundaries in its input and correctly report the original source of each record.

2) With some filesystems, a corrupt directory block may cause file name information to be lost (for example, ext3 or ext4 will place files into ./lost+found/ named by their inode number). In the past at IA, we've simplified the reconstruction of the original filenames by inspecting the first ARC/WARC record, rather than having to deduce the original filename from a content digest hash plus a lookup of that hash in an external hash-to-filename database.

So, none of these things are insurmountable obstacles to a simple rename, but I wanted to point out some possible downstream issues.

Brad

On 4/25/11 6:41 AM, Graham, Laura wrote:
> This is just a comment, as it may not be an issue for other institutions. But we have noticed in heritrix-3.1.0-beta that warcs are written with tildes in the filenames. From the documentation, these are to indicate host and port and pid values in the filename.
>
> After consulting with our repository tool development team here at the Library of Congress, we will likely rename these files going forward, replacing the tildes with hyphens.
>
> While our current bit preservation inventory tool accepts the tildes, and while there are lots of issues in filenames in general, which any system or set of tools will need to deal with, we've decided to take this extra renaming step. It only takes a moment, and it's one less possible issue to track in our work going forward.
>
> Again, just a comment.
>
> Thanks,
> Laura Graham
> Library of Congress
>
> _______________________________________________
> Archive-access-discuss mailing list
> Arc...@li...
> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
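To illustrate points 1 and 2 of the reply, here is a minimal Python sketch of reading the WARC-Filename field from the first record of a WARC file and comparing it with the name on disk. It is not taken from Heritrix or any IA tool: the function names (open_warc, read_first_record_headers, recorded_filename, was_renamed) are hypothetical, and it assumes the file starts with a warcinfo record, that header fields are not folded across lines, and that a compressed file is an ordinary gzip stream.

```python
import gzip
import os
from typing import BinaryIO, Dict, Optional


def open_warc(path: str) -> BinaryIO:
    """Open a WARC file, transparently handling gzip compression."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rb")
    return open(path, "rb")


def read_first_record_headers(stream: BinaryIO) -> Dict[str, str]:
    """Parse the named header fields of the first record in a WARC stream.

    Assumption: no folded (continuation) header lines.
    """
    version = stream.readline().strip()
    if not version.startswith(b"WARC/"):
        raise ValueError("stream does not start with a WARC record")
    headers: Dict[str, str] = {}
    for raw in iter(stream.readline, b""):
        line = raw.rstrip(b"\r\n")
        if not line:  # a blank line ends the header block
            break
        name, _, value = line.partition(b":")
        # Field names are stored lower-cased so lookups are case-insensitive.
        headers[name.decode("utf-8").strip().lower()] = value.decode("utf-8").strip()
    return headers


def recorded_filename(path: str) -> Optional[str]:
    """Return the WARC-Filename of the leading warcinfo record, if any."""
    with open_warc(path) as stream:
        headers = read_first_record_headers(stream)
    if headers.get("warc-type") != "warcinfo":
        return None
    return headers.get("warc-filename")


def was_renamed(path: str) -> bool:
    """True if the on-disk name differs from the name recorded in the file."""
    recorded = recorded_filename(path)
    return recorded is not None and recorded != os.path.basename(path)
```

Run over a directory of renamed files (or over files recovered from ./lost+found/), recorded_filename gives the name the crawler originally wrote, and was_renamed flags the files where a downstream tool that trusts WARC-Filename would no longer find a match on disk.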