Try searching with wildcards, e.g. "soft*" instead of "software".
Nope. The only searches that give results are those where the search term is part of a file name. There's no content searching at all.
Here's an example:
Last edit: Anonymous 2013-10-27
Would it be possible for you to post one of those HTML files here? Or you could send it to me if privacy is required. My address (in reverse): users.sourceforge.net <- qforce@
Without a concrete example to look at, there isn't much I can do here.
Another suggestion: Open the preferences dialog, then click on the "Advanced Settings" link at the bottom right. In the advanced settings, make sure the HtmlExtensions setting has a sensible value - otherwise DocFetcher will be unable to read HTML files. The setting's default value is:
HtmlExtensions = html;htm;xhtml;shtml;shtm;php;asp;jsp
The HtmlExtensions setting still has its default value. I've included an example file here:
Ah, I can see the problem: those aren't valid HTML files. If you open the attached HTML file in a text editor, you'll see that it ends with closing HTML tags, like so:

</body></html>

However, there are no corresponding opening HTML tags at the beginning of the file, which should look like this:

<html> <body>

I assume these files were cut out of other, valid HTML files. You can still index these broken HTML files with DocFetcher: on the indexing dialog, add the pattern

.*\.html

to the pattern table and set "Detect mime type" as the action to be performed on matching files. This makes DocFetcher peek into the contents of the HTML files; it will see that they aren't valid and will therefore fall back to treating them as plain text files.
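Roughly speaking, "Detect mime type" amounts to peeking at the start of the file's content instead of trusting the file extension. This is only an illustrative sketch, not DocFetcher's actual implementation, and the function names are made up:

```python
# Illustration only: a naive content sniff in the spirit of the
# "Detect mime type" option. DocFetcher's real logic differs.
def looks_like_html(text):
    """Check whether the content begins with an HTML opening tag."""
    head = text[:1024].lstrip().lower()
    return head.startswith("<!doctype html") or head.startswith("<html")

def choose_parser(text):
    """Pick a parser based on the sniffed content, not the extension."""
    return "html" if looks_like_html(text) else "plain text"
```

A file that only has the closing tags fails the sniff, so it gets handled as plain text.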
Last edit: Nam-Quang Tran 2013-10-28
Your analysis appears to be correct; however, the solution is not. The indexing now proceeds correctly, that is, the files containing my search terms are found. But the files are now treated as text files, and the lower-right panel shows the HTML source instead of the rendered HTML. Still logical behaviour.
Yet such impaired HTML tags are quite common in the HTML world, as demonstrated by all the browsers that can correctly display the defective HTML files, as well as by DocFetcher itself (before it was made to treat the files as text files). I venture the opinion that the program ought to be able to handle this situation.
Meanwhile, I will run a script to add the missing tags :-)
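The script I have in mind is something along these lines. It's just a sketch: it assumes the files only ever lack the opening <html> <body> pair, and the directory layout is an assumption too:

```python
import os

def add_missing_tags(text):
    """Prepend <html> <body> when the closing tags are present but the
    opening ones are missing; otherwise return the text unchanged."""
    lower = text.lower()
    if "</html>" in lower and "<html" not in lower:
        return "<html> <body>\n" + text
    return text

def repair_tree(top):
    """Rewrite every .html/.htm file under 'top' in place."""
    for root, _dirs, files in os.walk(top):
        for name in files:
            if not name.lower().endswith((".html", ".htm")):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                text = f.read()
            fixed = add_missing_tags(text)
            if fixed != text:
                with open(path, "w", encoding="utf-8") as f:
                    f.write(fixed)
```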
Yes, I was aware of that; unfortunately, at the moment there's no better solution than adding the missing tags.
I'm afraid that's not likely to happen. DocFetcher can display broken HTML files because it uses an embedded web browser (probably Internet Explorer, Firefox or Chrome), but it uses something much simpler for text extraction, called Jericho HTML Parser, and the latter probably won't be able to handle broken HTML files anytime soon. So, the display of broken HTML files in DocFetcher may or may not improve, but text extraction on these files won't work correctly.
Last edit: Nam-Quang Tran 2013-10-28