From: Christiaan F. <chr...@ad...> - 2008-07-29 09:14:14
Antoni Myłka wrote:
> Hi Chris,
>
> I've just spotted your commit. Brilliant. I was really afraid that
> tarballs can't be properly detected. In this case, I think I can remove
> the '.tar' hack from the AbstractCompressorSubCrawler. Don't you think so?

According to
http://www.astro.keele.ac.uk/oldusers/rno/Computing/File_magic.html,
POSIX tar files have a magic number and pre-POSIX files don't, and that
page is already five years old. No idea whether you will still encounter
pre-POSIX tar files in practice, but I think you can safely remove the
hack.

I've glanced through the other classes; here are some thoughts.

Can you tell me a bit more (also in Javadoc) about the difference
between the two SubCrawler base classes, i.e. AbstractArchiverSubCrawler
vs. AbstractCompressorSubCrawler? My guess is that the latter is meant
for archives that are created using a two-step process, e.g. a tar file
that is compressed using gzip. What's the purpose of this: not having to
create an extra DataObject that represents the entire uncompressed tar
file itself? I.e., does the DataObject for the tar.gz file have the
members of the tar file as *direct* children?

I wonder whether this is a good thing. For most search apps it may be
more user-friendly to see the info this way, but perhaps other types of
apps would want to see the in-between DataObject. This also depends on
whether compressors like gzip and bzip2 keep the file metadata for the
tar file intact, or whether all you have is really the byte stream. From
a purist point of view, you could see it as part of the UI design whether
such a DataObject should be shown. I'm not (yet) saying that we should
return the in-between DataObject as well, I'm just trying to determine
the pros and cons of the current approach.

A question regarding zip files: are all entries in the zip file direct
children of the zip DataObject, or are they also ordered in a folder
hierarchy?

What's the mechanism for creating URIs of nested DataObjects? Judging
from the code, I think it's the archive URI followed by a slash followed
by the entry's path. Isn't a hash ('#') more appropriate as a separator?
At least then you can see where the archive name ends and the nested
DataObject's name starts (this may be necessary to facilitate opening of
DataObjects).

Crawling of archives is now in place, but is there a way to do a
getDataObject somewhere, to facilitate opening of such DataObjects?

Finally, how hard would it be to add SubCrawler support to the
FileInspector? Right now an Extractor is applied when one is registered
for the MIME type, so you see the metadata and extracted full-text. We
could do something similar with SubCrawlers: show all metadata of the
parent DataObject and perhaps a message telling how many nested
DataObjects there are.

BTW: compilation in Eclipse currently gives a build error: "The type
java.lang.Enum cannot be resolved. It is indirectly referenced from
required .class files".

Regards,

Chris
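
For illustration, the magic-number check for POSIX tar files discussed
above could look roughly like the sketch below. It relies on the POSIX
(ustar) header layout, where the ASCII string "ustar" sits at byte
offset 257 of the first 512-byte header block. The class and method
names are made up for this example and are not part of the existing
code; pre-POSIX tar files would simply not match.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not an existing class in the code base.
final class TarMagic {
    // In a POSIX (ustar) header the magic field starts at offset 257.
    static final int MAGIC_OFFSET = 257;
    static final byte[] MAGIC = "ustar".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the given leading bytes of a file carry the
    // "ustar" magic. Buffers too short to contain the field never match.
    static boolean looksLikePosixTar(byte[] firstBytes) {
        if (firstBytes == null || firstBytes.length < MAGIC_OFFSET + MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (firstBytes[MAGIC_OFFSET + i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Build a fake 512-byte header block that only contains the magic.
        byte[] block = new byte[512];
        System.arraycopy(MAGIC, 0, block, MAGIC_OFFSET, MAGIC.length);
        System.out.println(looksLikePosixTar(block));
        System.out.println(looksLikePosixTar(new byte[512]));
    }
}
```

Only the five "ustar" bytes are compared, so both the POSIX and the GNU
variant of the field (which differ in the bytes that follow) match.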
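
To make the separator question above concrete, here is a small sketch of
what a '#'-separated scheme might look like. Everything in it is an
assumption for the sake of discussion: the class, the method names and
the example URIs are hypothetical and do not describe how the
SubCrawlers currently build URIs. It also ignores escaping of '#'
characters that occur inside entry names.

```java
// Hypothetical sketch of '#'-separated URIs for nested DataObjects.
final class NestedUris {
    static final char SEPARATOR = '#';

    // Builds the URI of an entry inside an archive.
    static String nestedUri(String archiveUri, String entryPath) {
        return archiveUri + SEPARATOR + entryPath;
    }

    // Splits a nested URI at the last separator into
    // { parent URI, entry path }. Returns null if there is no separator,
    // i.e. the URI does not point inside an archive.
    static String[] split(String uri) {
        int i = uri.lastIndexOf(SEPARATOR);
        if (i < 0) {
            return null;
        }
        return new String[] { uri.substring(0, i), uri.substring(i + 1) };
    }

    public static void main(String[] args) {
        String uri = nestedUri("file:/data/docs.zip", "reports/q1.txt");
        String[] parts = split(uri);
        System.out.println(uri);
        System.out.println(parts[0] + " -> " + parts[1]);
    }
}
```

With a slash as separator the same split is ambiguous, because the
entry path itself may contain slashes; with '#' the boundary between
the archive and the nested entry can be recovered from the URI alone.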