Currently DocFetcher fills up /tmp/ with all kinds of temporary files, and some of them are not even deleted after the indexing operation finishes (see also bugs 1955 and 1956). For a large directory I'm trying to index, these files occupy tens of gigabytes of space (on the notoriously full SSD, not on the RAID holding the data being indexed), which ultimately causes the indexing process to fail (after several days of indexing). It would be very helpful to have all temporary files created during indexing placed in separate temporary directories (one directory per extracted archive). This would make cleaning up after finishing an archive a lot easier and more reliable: just recursively delete the temporary directory once the archive contents have been indexed.
Since Java 7 there is a dedicated API to create a temporary directory with a given prefix in the name, so this should be rather easy to implement:
https://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#createTempDirectory%28java.nio.file.Path,%20java.lang.String,%20java.nio.file.attribute.FileAttribute...%29
The prefix could be e.g. "docfetcher" to clearly indicate where the directory comes from (and to let the user quickly clean up all leftovers from a failed or crashed indexing run without deleting any other, unrelated files in /tmp/). It may also be a good idea to add a small text file (e.g. "archive_path.txt") containing the path of the archive being extracted, so that a user can quickly figure out the source of the problem in case extracting a big archive completely fills up the disk partition containing /tmp.
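To illustrate, here is a minimal sketch of what I have in mind. It is not taken from the DocFetcher code base: the class name ArchiveTempDir and its two methods are hypothetical, and it assumes the one-argument-prefix variant of createTempDirectory, which uses the default temporary directory (java.io.tmpdir). It only uses Java 7 APIs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper, not part of DocFetcher.
final class ArchiveTempDir {

    // Creates a directory such as /tmp/docfetcher123456789 and records
    // the path of the archive being extracted in archive_path.txt.
    static Path create(String archivePath) throws IOException {
        Path dir = Files.createTempDirectory("docfetcher");
        Files.write(dir.resolve("archive_path.txt"),
                archivePath.getBytes(StandardCharsets.UTF_8));
        return dir;
    }

    // Deletes the directory and everything below it. Files are removed
    // first, each directory after its contents have been visited.
    static void deleteRecursively(Path dir) throws IOException {
        Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path d, IOException exc)
                    throws IOException {
                if (exc != null) {
                    throw exc; // listing the directory failed, do not hide it
                }
                Files.delete(d);
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```

The indexing code would call create(...) before extracting an archive, extract into the returned directory, and call deleteRecursively(...) in a finally block so that the directory is also removed when indexing the archive fails.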
Best regards
Vincent
Anonymous
Hi,
all good suggestions, but unlikely to be acted upon because (1) DocFetcher's indexing algorithm is a huge tangled mess, and (2) the project is not being actively developed anymore.
Best regards
q:-) <= Quang