No new features are planned for DocFetcher, only bugfixes. Development continues in DocFetcher Pro.
Buying a copy of DocFetcher Pro, the commercial big brother of DocFetcher, is equivalent to making a donation, plus you get a bunch of new features. If you don't need those features and/or DocFetcher Pro costs more than you're willing to donate, you can also "buy" the otherwise free demo of DocFetcher Pro for a price of your choosing.
NEW: DocFetcher does not work with the latest Java runtime. See the pinned notice on the bug tracker for workarounds.
If you experienced problems with the installed version of DocFetcher, consider using the portable version instead (download page). The latter runs on all supported platforms and does not try to detect or download Java runtimes. However, note that the portable version has to be put in a location where you have write permissions. The reason for this is that on the first start, the program figures out what operating system it's running on and whether the operating system is 32-bit or 64-bit. It then tries to unpack the right library files into a subfolder under its own folder, and this will fail without write permissions.
In some cases where DocFetcher doesn't start, the solution is to uninstall all currently installed Java runtimes and then reinstall the latest Java runtime from the Java website. On that website, be sure to pick either the 32-bit or the 64-bit Java runtime, depending on whether your operating system is 32-bit or 64-bit.
Another potential problem: Running DocFetcher with a memory setting of more than 1 GB requires a 64-bit Java runtime. It will not work with 32-bit.
On some systems, the embedded web browser that is used for displaying the manual and HTML files can crash the entire program. As a workaround, disable the embedded web browser by modifying DocFetcher's settings file. Look for the settings file in one of the following locations:
C:\Documents and Settings\<UserName>\Application Data\DocFetcher\conf\settings-conf.txt
C:\Users\<UserName>\AppData\Roaming\DocFetcher\conf\settings-conf.txt
DocFetcher\conf\settings-conf.txt
/Users/<UserName>/.docfetcher/conf/settings-conf.txt
If the settings file doesn't exist at the expected location, create a new, empty text file there named settings-conf.txt. Now, first close DocFetcher, then open the settings file in a text editor and set ShowManualOnStartup = false and PreferHtmlPreview = false in it. While you're at it, you may also set HotkeyEnabled = false to disable the global hotkey. Save and close the file, then try to start DocFetcher.
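If you'd rather script this change than edit the file by hand, the following is a minimal Python sketch of the same edit. It assumes the settings file consists of plain "Key = value" lines encoded as UTF-8, which this FAQ does not spell out, and the function name set_options is made up for illustration. Close DocFetcher before running anything like this.

```python
from pathlib import Path

# Keys and values taken from the instructions above.
OVERRIDES = {
    "ShowManualOnStartup": "false",
    "PreferHtmlPreview": "false",
}


def set_options(settings_file, overrides=OVERRIDES):
    """Set each key in the settings file, creating the file if it is missing.

    Assumes one "Key = value" entry per line (an assumption about the format).
    """
    path = Path(settings_file)
    lines = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    remaining = dict(overrides)
    result = []
    for line in lines:
        key = line.split("=", 1)[0].strip()
        if "=" in line and key in remaining:
            # Replace the existing entry with the new value.
            result.append(f"{key} = {remaining.pop(key)}")
        else:
            # Keep every other line untouched.
            result.append(line)
    # Append the keys that were not present in the file yet.
    result.extend(f"{key} = {value}" for key, value in remaining.items())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(result) + "\n", encoding="utf-8")
```

Call it with the path that applies to your system, e.g. set_options(r"DocFetcher\conf\settings-conf.txt") for the portable version.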
Some users reported startup issues caused by faulty NVIDIA drivers, version 378.xx. See this thread.
If none of the above helps, try launching DocFetcher via one of the alternative launchers:
In the folder DocFetcher\misc, there's a file named DocFetcher.bat. Move this file one level up into the DocFetcher folder. Then open a command prompt and use the cd command to navigate to the DocFetcher folder, like so: cd C:\Program Files (x86)\DocFetcher. Then try launching DocFetcher from the command prompt by entering DocFetcher.bat and pressing Enter. If DocFetcher doesn't start, then chances are an error message will be printed in the command prompt. Post this error message on the DocFetcher forum.
Try launching DocFetcher via the DocFetcher.sh launcher. If that doesn't help, do the following: Open a terminal and use the cd command to navigate to the DocFetcher folder. Then try launching DocFetcher from the terminal by running ./DocFetcher.sh. If DocFetcher doesn't start, then chances are an error message will be printed in the terminal. Post this error message on the DocFetcher forum.
Open a terminal and use the cd command to navigate to the folder Contents/MacOS inside the application bundle, then launch the DocFetcher script from there. If it doesn't start, an error message might be printed in the terminal. Post this error message on the DocFetcher forum.
In the crash report, there's probably a line that says "SWTException: Unable to load graphics library [GDI+ is required]". This indicates that you need to install a package called GDI+ for supporting advanced graphics operations. Here's where you can download GDI+: http://www.microsoft.com/en-us/download/details.aspx?id=18909
Short answer: It's complicated. Last time I checked it had something to do with vector spaces and stuff. For further information, have a look at the scoring page of Lucene (DocFetcher's underlying search engine) and the Wikipedia article about the Vector Space Model.
Here's an extremely simplified explanation of how the scoring works: Suppose you have two files file1.doc and file2.doc with the following contents:
file1.doc contains the word "dog" 10 times, and nothing else
file2.doc contains 100 words, 20 of which are "dog"
Now, if you search for "dog", both files will show up in the results, but file1.doc gets a higher score because 10/10 = 100%, whereas 20/100 = 20%. This illustrates the basic idea: Dividing the number of hits by the word count gives you a measure of how "relevant" a document is with respect to your query. Why is that so? Because file1.doc is entirely about dogs, whereas most of file2.doc is about something else.
Occasionally, you'll see score values greater than 100%. This is because the actual formula used is much more complicated, and the calculated score is not really a percentage, but a fraction greater than or equal to 0.
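The simplified idea (hits divided by word count) can be written down in a few lines of Python. To be clear, this is only a sketch of the toy explanation above, not Lucene's actual scoring formula; the function name simple_score and the filler word "other" are made up for illustration.

```python
def simple_score(words, term):
    """Return the fraction of words in the document that equal the search term."""
    if not words:
        return 0.0
    return words.count(term) / len(words)


# file1.doc: the word "dog" 10 times, and nothing else
file1 = ["dog"] * 10
# file2.doc: 100 words, 20 of which are "dog"
file2 = ["dog"] * 20 + ["other"] * 80

print(simple_score(file1, "dog"))  # 1.0
print(simple_score(file2, "dog"))  # 0.2
```

A ratio like this can never exceed 1.0, which is one way to see that the real formula must be doing something different when it reports scores above 100%.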
Some people have asked for a column on the result table that displays for each file the “hit count”, i.e., the number of occurrences of the query string in the file. This information is currently only displayed for the selected file in a small box at the top of the preview pane.
There are currently no plans to implement a hit count column, due to performance reasons:
This is not really a bug, but a consequence of the fact that DocFetcher splits documents into individual words during indexing, a.k.a. tokenization. This is done in order to build a dictionary (i.e. the index), which DocFetcher then uses to do quick searches. In general, DocFetcher works best with natural language, but not quite as well with text containing digits or special characters.
That being said, there's an Analyzer option in the Advanced Settings which allows you to switch to an alternative tokenization mechanism that works better with source code and other kinds of text not written in natural language.
Additionally, take a look at the Query Syntax section in the manual. Some of the concepts explained in there, e.g. wildcards and phrase searches, might help to work around the above issues.
On the indexing dialog, use this regex exclusion pattern: .*/\.svn/.*
Note the usage of forward slashes to match against path separators (even on Windows!), and escaping the "." with a backward slash.
In addition, "Match Against" must be set to "Absolute path".
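If you want to try the pattern out before rebuilding an index, here's a rough Python sketch. It assumes that the pattern has to match the entire absolute path and that backslashes in Windows paths are turned into forward slashes before matching, as the note above suggests. DocFetcher itself is a Java program, but this particular pattern behaves the same in Python's re module. The function name is_excluded is made up for illustration.

```python
import re

# The exclusion pattern from above: any path containing a ".svn" folder.
SVN_PATTERN = re.compile(r".*/\.svn/.*")


def is_excluded(absolute_path):
    """Return True if the whole path matches the exclusion pattern."""
    # Use forward slashes as path separators, even for Windows paths.
    normalized = absolute_path.replace("\\", "/")
    return SVN_PATTERN.fullmatch(normalized) is not None


print(is_excluded(r"C:\projects\app\.svn\entries"))  # True
print(is_excluded("/home/user/app/src/main.c"))      # False
print(is_excluded("C:/projects/xsvn/file.txt"))      # False
```

The last example shows why the dot is escaped: the pattern only matches a folder literally named ".svn", not any folder whose name merely ends in "svn".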
DocFetcher does not include folder names or file paths in the search, only filenames and file contents. That was a fundamental design decision that was made back then when the core program was written. It may or may not have been a good decision, but the idea was that (1) most of the important stuff the user may want to search for is in the filename and file contents, (2) if searching for words matching some file path component brings up all files on that path, this will decrease the overall quality of search results, and (3) there are already a lot of programs to search filenames and folder names, such as Everything.
Note: DocFetcher Pro is capable of finding folders by name.
DocFetcher uses third-party libraries to perform text extraction. For example, Apache POI is used for MS Office files, and Apache PDFBox for PDF files. Most of the errors that are shown during indexing come directly from the respective extraction libraries, without further translation by DocFetcher.
If DocFetcher gives an error on some file, there's usually not much one can do about it, except waiting for the developers of the respective library to release an update of their software, and then waiting for this update to be included in DocFetcher.
Certain errors can be circumvented as follows:
If your .doc files aren't MS Word files, you can enable mime-type detection for .doc files by putting the pattern .*\.doc in the pattern table on the indexing configuration dialog and setting "Detect mime type" as the action to be performed.
The location of the index files depends on the version of DocFetcher and the operating system:
The indexes folder inside the DocFetcher folder.
C:\Documents and Settings\<UserName>\Application Data\DocFetcher
C:\Users\<UserName>\AppData\Roaming\DocFetcher
/Users/<UserName>/.docfetcher
For customizing the location of the index files, have a look at the file misc/paths.txt inside the DocFetcher folder.
In principle, DocFetcher should work with any amount of data. In practice, however, when dealing with a massive amount of data, there's a high risk that there are some problematic files in there that either cause DocFetcher to run out of memory or to crash altogether. The first often happens with large PDF files, and the second may happen with corrupt or otherwise unusual files. There are a few other potential problems as well. All in all, the following is recommended:
Copy the folders conf and indexes into the new DocFetcher folder. conf contains the program settings, while indexes contains the indexes.
Open a terminal and launch DocFetcher in it via the command docfetcher. In the terminal, you will see a message pointing to the configuration file in which you can set the memory limit. The configuration file contains further instructions.
The likely reason for high CPU usage is that (1) the operating system or some other program is constantly updating some files in one of the indexed folders, and (2) the option "Watch folders for file changes" was selected when the index was created, causing DocFetcher to frequently run index updates.
Accordingly, the workaround is to rebuild the affected index(es) with the option "Watch folders for file changes" turned off.
You can use so-called field searches to search in filenames only. Example: filename:dog
For more info, see the section "Field Searches" on the "Query Syntax" page of the built-in program manual.
For instance, to list all files with the file extension .mm, enter this query: filename:*.mm
This syntax is explained under the section "Field Searches" on the "Query Syntax" page of the built-in program manual.
Note: DocFetcher Pro addresses this problem via a feature called Custom Types. In essence, it is a customizable version of the Document Types pane, allowing you to define your own file types to filter the search results by.
To index files without file extension (i.e. without the dot in the filename), add the following rule in the pattern table on the indexing dialog:
[^\.]*
This rule matches all files whose filenames do not contain a dot, and it will make DocFetcher recognize the matched files as plain text files.
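To see which filenames the rule catches, here's a small Python sketch. It assumes the pattern is matched against the whole filename (not the path); the function name has_no_extension is made up for illustration.

```python
import re

# The rule from above: a filename made up entirely of non-dot characters.
NO_EXTENSION = re.compile(r"[^\.]*")


def has_no_extension(filename):
    """Return True if the whole filename matches the rule."""
    return NO_EXTENSION.fullmatch(filename) is not None


print(has_no_extension("README"))     # True
print(has_no_extension("notes.txt"))  # False
print(has_no_extension(".bashrc"))    # False
```

Note the last case: a file like .bashrc contains a dot, so this rule does not cover it.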
Note: In DocFetcher Pro, files without file extension can be indexed by ticking the checkbox "Index files without file extension as text files" on the indexing dialog.
The error message means the folder you're trying to index contains more levels of subfolders than DocFetcher can handle with its current settings. The workaround is to either move the subfolders around in order to reduce the maximum folder depth, or to change the settings. How the latter is done depends on your operating system:
Move the DocFetcher.bat file from the misc folder inside the DocFetcher folder one level up into the DocFetcher folder (important, otherwise DocFetcher.bat won't run). Now open the DocFetcher.bat file in a text editor. In the last line, you can see a setting -Xss2m. Set this to a higher value, e.g. -Xss4m. From now on, always launch DocFetcher through DocFetcher.bat.
Open DocFetcher.sh in a text editor. In the last line, you can see a setting -Xss2m. Set this to a higher value, e.g. -Xss4m.
FYI, the -Xss setting is the so-called "thread stack size" in megabytes that, among other things, limits the number of folder levels DocFetcher can handle.
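If you're not sure how deeply nested your folders actually are, a short Python script can tell you. This is just a sketch for measuring the folder hierarchy; it says nothing about how many levels a given -Xss value can handle. The function name max_folder_depth is made up for illustration.

```python
import os


def max_folder_depth(root):
    """Return the number of folder levels below root (0 if it has no subfolders)."""
    root = os.path.abspath(root)
    # Number of path separators in the root itself, used as the baseline.
    base = root.rstrip(os.sep).count(os.sep)
    deepest = 0
    for current, _dirs, _files in os.walk(root):
        depth = current.count(os.sep) - base
        deepest = max(deepest, depth)
    return deepest
```

For example, max_folder_depth(r"C:\path\to\folder") returns 3 if the deepest subfolder is C:\path\to\folder\a\b\c.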
Note: DocFetcher Pro is completely immune to the above problem. It can index folder hierarchies of any depth.
If you index a certain folder, say C:\path\to\folder, and then try to index a subfolder of that folder, say C:\path\to\folder\subfolder, DocFetcher will refuse and complain that overlapping indexes aren't allowed. There are technical reasons for this:
This would require determining in advance how many files there are that need to be indexed, which is not a trivial thing to do considering the complexity of DocFetcher's indexing algorithm -- there's really a lot of stuff going on under the hood. One particular problem is that the indexing is designed to work incrementally, so that files that have already been indexed are skipped.
This might be a problem with the specific fonts you are using. Try different font settings on DocFetcher's preferences dialog.
It's possible that the text in these PDF files exists only as scanned images, and is therefore not extractable. You can check this by opening the PDF files and trying to select the visible text. If you can't select the text, that means it's actually just an image. If this is indeed the problem, run your PDF files through OCR software.
To give an example of the problem:
Suppose a file contains the text: The quick brown fox jumps over the lazy dog.
"brown jumps"~10 will bring up the file and highlight the match correctly.
"jumps brown"~10 will also bring up the file, but the match won't be highlighted.
This is a known limitation of the preview highlighting; the searching itself is not affected.
As a workaround, you can combine a proximity search and its reverse with an OR operator, like so: "dog cat"~10 OR "cat dog"~10. With this, you will get highlighting in both directions. Do note that you have to use the OR operator, not the AND operator; OR and AND behave totally differently.
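If you build queries programmatically, the workaround is easy to automate. Here's a tiny Python sketch that assembles the combined query string for two words; the function name proximity_both_ways is made up for illustration, and the sketch only builds the text of the query, it doesn't talk to DocFetcher.

```python
def proximity_both_ways(first, second, distance):
    """Combine a proximity search and its reverse with the OR operator."""
    forward = f'"{first} {second}"~{distance}'
    backward = f'"{second} {first}"~{distance}'
    return f"{forward} OR {backward}"


print(proximity_both_ways("dog", "cat", 10))
# "dog cat"~10 OR "cat dog"~10
```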
This appears to be an issue with the DocFetcher daemon on Windows. The daemon runs whenever DocFetcher isn't running, and seems to prevent Outlook from starting. As a workaround, rename the file docfetcher-daemon-windows.exe in the DocFetcher folder to prevent the daemon from starting, and then reboot Windows.
Note that disabling the daemon comes with the downside that you'll have to update your indexes by hand; otherwise your search results will be out of date after file changes.
For more information about what the daemon does, please see the first section in the DocFetcher manual.
DocFetcher ships with translations of its user interface for a couple of languages. At program start, it will detect the language of your operating system and then either choose a matching translation if available or use the English default.
You can override this auto-detection and explicitly set the user interface language as follows. First, take a look at the contents of the lang folder in the DocFetcher program folder. It contains all available translations, in files named Resource_XX.properties, where XX is a lowercase two-letter ISO-639-1 language code that identifies the language. A complete list of ISO-639-1 language codes for all languages can be found here.
Now, to manually set the user interface language, you have to add a language parameter at the end of the launcher file, which depends on your operating system:
In the folder DocFetcher\misc, there's a file named DocFetcher.bat. Open the file in a text editor, add the parameter described in the next section, then save and close the file. Importantly, move the DocFetcher.bat file one level up into the DocFetcher folder. From now on, always start DocFetcher by double-clicking the DocFetcher.bat file.
Open the file DocFetcher.sh in a text editor, add the parameter described in the next section, then save and close the file. Launch DocFetcher via the modified DocFetcher.sh file.
Now, regarding the language parameter that needs to be added: The last line of the launcher file starts with a "java" command which launches the DocFetcher process. It looks like this:
java -enableassertions -Xmx512m -Xss2m -cp %libclasspath% -Djava.library.path=lib net.sourceforge.docfetcher.Main %1 %2 %3 %4 %5 %6 %7 %8 %9
Modify this line by adding -Duser.language=XX just before net.sourceforge.docfetcher.Main, like so:
java -enableassertions -Xmx512m -Xss2m -cp %libclasspath% -Djava.library.path=lib -Duser.language=XX net.sourceforge.docfetcher.Main %1 %2 %3 %4 %5 %6 %7 %8 %9
Replace XX with the ISO-639-1 language code of the language you want DocFetcher to use.
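To check which language codes your copy of DocFetcher actually ships with, you can list the lang folder by hand or use a small script. The following Python sketch assumes the translation files follow the Resource_XX.properties naming described above; the function names available_languages and language_parameter are made up for illustration.

```python
import re
from pathlib import Path

# Matches names like "Resource_de.properties" and captures the two-letter code.
RESOURCE_NAME = re.compile(r"Resource_([a-z]{2})\.properties")


def available_languages(lang_folder):
    """Return the sorted language codes found in the given lang folder."""
    codes = []
    for entry in Path(lang_folder).iterdir():
        match = RESOURCE_NAME.fullmatch(entry.name)
        if match:
            codes.append(match.group(1))
    return sorted(codes)


def language_parameter(code):
    """Build the parameter to add to the launcher's java command."""
    return f"-Duser.language={code}"
```

For example, if available_languages(r"DocFetcher\lang") includes "de", then language_parameter("de") gives you -Duser.language=de, ready to be pasted into the launcher file.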
No, it doesn't, and since no more features will be added to DocFetcher, it will never have one. However, a variant of DocFetcher Pro, called DocFetcher Pro Server, is planned for 2021. Subscribe to the DocFetcher Pro newsletter if you want to follow the development progress.
When DocFetcher isn't running, the daemon detects file changes in the indexed folders, and marks the corresponding indexes as "needs to be updated".
When DocFetcher is running, the daemon remains inactive, because then DocFetcher assumes the responsibility of detecting file changes, provided that the indexes have been created with the folder watching option enabled.
By itself, the daemon does not do any indexing; it only marks indexes as needing an update. The next time DocFetcher is started, it picks up the information left behind by the daemon and runs the required index updates.
In DocFetcher 1.1.20 and later versions, DocFetcher supports Python-based scripting. This can be used to programmatically execute searches and retrieve the results. For an example of how this is done, see the explanation at the top of the file search.py, which can be found in the DocFetcher program folder.
Alternatively, if you feel like tinkering with the DocFetcher source code, have a look at the source code page for instructions on how to obtain the source code and build DocFetcher.
For Java-based indexing and searching in general, have a look at these Apache projects:
This is mainly due to the fact that DocFetcher is shipped with lots of built-in text extraction libraries, some of which are quite big. The worst offenders are the libraries for MS Office and PDF files. However, the developers of these libraries aren't to blame here: The libraries have to be big because the respective file formats are immensely complex.
The word "Java" refers both to a platform for programs to run on, and to a programming language for writing such programs. Here's why DocFetcher was written in the Java language: Java is a far easier and far more convenient language to develop in than, say, C++. Java's advantages include: Automatic memory management, 10x less error-prone, 10x less effort to make it work on different platforms. If DocFetcher had been written in C++ instead, development time would probably have been twice as long, and the resulting program would have only half the features, but twice the number of bugs. And perhaps you would have to pay for it, or download some crack, because far fewer developers are willing to go through the ordeal of messing with C++ in their unpaid spare time.
Also, while Java programs still start up slowly and memory usage is still high, the runtime performance has improved significantly in recent years and is now comparable to native code as produced by C/C++ programs, according to Wikipedia. (Case in point: I've never heard anybody say that DocFetcher's indexing algorithm is "slow".)
As for Java security, here’s the Truth most non-tech people never seem to quite understand:
One part of the answer is that it's a Java program. The other part is that you're feeding it with huge amounts of data.
Because a preview with full formatting, tables, paging, etc. would require a tremendous amount of programming effort. It's sort of like implementing a miniature version of MS Office inside DocFetcher for every single supported document format. That being said, there are some ready-made solutions for MS Office and PDF files out there, although integrating them into DocFetcher wouldn't be easy either. The benefit is small compared to the cost here, so there are currently no plans to improve the situation.