Try searching with wildcards, e.g. "soft*" instead of "software".
Nope. The only searches that give results are those where the search term is part of a file name. There's no content searching at all.
Here's an example:
Last edit: Anonymous 2013-10-27
Would it be possible for you to post one of those HTML files here? Or you could send it to me if privacy is required. My address (in reverse): users.sourceforge.net <- qforce@
Without a concrete example to look at, there isn't much I can do here.
Another suggestion: Open the preferences dialog, then click on the "Advanced Settings" link at the bottom right. In the advanced settings, make sure the HtmlExtensions setting has a sensible value - otherwise DocFetcher will be unable to read HTML files. The setting's default value is:
HtmlExtensions = html;htm;xhtml;shtml;shtm;php;asp;jsp
The HtmlExtensions setting still has its default value. I've included an example file here:
Ah, I can see the problem: those aren't valid HTML files. If you open the attached HTML file in a text editor, you'll see that it ends with closing HTML tags, like so:

</body></html>

However, there are no corresponding opening HTML tags at the beginning of the file, which should look like this:

<html> <body>

I assume these files were cut out of other, valid HTML files. You can still index these broken HTML files with DocFetcher: on the indexing dialog, add the pattern

.*\.html

to the pattern table and set "Detect mime type" as the action to be performed on matching files. This makes DocFetcher peek into the contents of the HTML files; it will see that they aren't valid and will therefore fall back to treating them as plain text files.
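Roughly speaking, "Detect mime type" amounts to peeking at the start of the file's content instead of trusting the file extension. This is only an illustrative sketch, not DocFetcher's actual implementation, and the function names are made up:

```python
# Illustration only: a naive content sniff in the spirit of the
# "Detect mime type" option. DocFetcher's real logic differs.
def looks_like_html(text):
    """Check whether the content begins with an HTML opening tag."""
    head = text[:1024].lstrip().lower()
    return head.startswith("<!doctype html") or head.startswith("<html")

def choose_parser(text):
    """Pick a parser based on the sniffed content, not the extension."""
    return "html" if looks_like_html(text) else "plain text"
```

A file that only has the closing tags fails the sniff, so it gets handled as plain text.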
Last edit: Nam-Quang Tran 2013-10-28
Your analysis appears to be correct; however, the solution is not. The indexing now proceeds correctly, that is, the files containing my search terms are found. But the files are now treated as text files, and the lower-right panel shows the HTML source instead of the rendered HTML. Still logical behaviour.
Yet such impaired HTML tags are quite common in the HTML world, as demonstrated by all the browsers that can correctly display the defective HTML files, as well as by DocFetcher itself (before it was made to treat the files as text files). I venture the opinion that the program ought to be able to handle this situation.
Meanwhile, I will run a script to add the missing tags :-)
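The script I have in mind is something along these lines. It's just a sketch: it assumes the files only ever lack the opening <html> <body> pair, and the directory layout is an assumption too:

```python
import os

def add_missing_tags(text):
    """Prepend <html> <body> when the closing tags are present but the
    opening ones are missing; otherwise return the text unchanged."""
    lower = text.lower()
    if "</html>" in lower and "<html" not in lower:
        return "<html> <body>\n" + text
    return text

def repair_tree(top):
    """Rewrite every .html/.htm file under 'top' in place."""
    for root, _dirs, files in os.walk(top):
        for name in files:
            if not name.lower().endswith((".html", ".htm")):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                text = f.read()
            fixed = add_missing_tags(text)
            if fixed != text:
                with open(path, "w", encoding="utf-8") as f:
                    f.write(fixed)
```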
Yes, I was aware of that; unfortunately, at the moment there's no better solution than adding the missing tags.
I'm afraid that's not likely to happen. DocFetcher can display broken HTML files because it uses an embedded web browser (probably Internet Explorer, Firefox or Chrome), but it uses something much simpler for text extraction, called Jericho HTML Parser, and the latter probably won't be able to handle broken HTML files anytime soon. So, the display of broken HTML files in DocFetcher may or may not improve, but text extraction on these files won't work correctly.
Last edit: Nam-Quang Tran 2013-10-28