#609 It just doesn't work

Milestone: v1.0_(example)
Status: open
Owner: nobody
Labels: None
Priority: 1
Updated: 2013-10-28
Created: 2013-10-25
Creator: Anonymous
Private: No
  • Clean install v 1.1.9 portable on Windows 7
  • indexed a folder with html files, using the default options
    (seeing list of processed files, index appears under search scope)
  • search for words that are copied from the indexed files
    :: Results: 0

Discussion

  • Nam-Quang Tran

    Nam-Quang Tran - 2013-10-25

    Try searching with wildcards, e.g. "soft*" instead of "software".

     
    • Anonymous

      Anonymous - 2013-10-27

      Nope. The only search that gives results is when the search term is part of a file-name. No content searching at all.

      Here's an example:

      • out of the box installation
      • index folder (F:\downloads\repast)
      • list documents
      • select arbitrary (here: F:\downloads\repast\repast-forum_2000-09.html)
      • select and copy arbitrary word (here: politics)
      • paste and search for the word
      • Results: 0
       

      Last edit: Anonymous 2013-10-27
  • Nam-Quang Tran

    Nam-Quang Tran - 2013-10-27

    Would it be possible for you to post one of those HTML files here? Or you could send it to me if privacy is required. My address (in reverse): users.sourceforge.net <- qforce@

    Without a concrete example to look at, there isn't much I can do here.

     
  • Nam-Quang Tran

    Nam-Quang Tran - 2013-10-28

    Another suggestion: Open the preferences dialog, then click on the "Advanced Settings" link at the bottom right. In the advanced settings, make sure the HtmlExtensions setting has a sensible value - otherwise DocFetcher will be unable to read HTML files. The setting's default value is:

    HtmlExtensions = html;htm;xhtml;shtml;shtm;php;asp;jsp
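
To make the role of this setting concrete, here is a minimal Python sketch of how a semicolon-separated extension list like the one above could be checked against a file name. It is only an illustration, not DocFetcher's actual code; the function name `has_html_extension` is hypothetical.

```python
from pathlib import Path

# Default value of the setting, as quoted in the post above.
HTML_EXTENSIONS = "html;htm;xhtml;shtml;shtm;php;asp;jsp"


def has_html_extension(filename: str, setting: str = HTML_EXTENSIONS) -> bool:
    """Return True if the file's extension is in the semicolon-separated list."""
    # Split the setting into a set of lower-case extensions, skipping empty entries.
    allowed = {ext.strip().lower() for ext in setting.split(";") if ext.strip()}
    # Path.suffix includes the leading dot, e.g. ".html"; strip it before comparing.
    suffix = Path(filename).suffix.lstrip(".").lower()
    return suffix in allowed
```

With an empty or mangled setting value, no file would be recognized as HTML, which is the failure mode the suggestion above is meant to rule out.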

     
  • Anonymous

    Anonymous - 2013-10-28

    The HTMLExtensions still have the default value. Example file I include here:

     
  • Nam-Quang Tran

    Nam-Quang Tran - 2013-10-28

    Ah, I can see the problem: Those aren't valid HTML files. If you open the attached HTML file in a text editor, you'll see that the file ends with closing HTML tags, like so:

    </body> </html>
    

    However, there are no corresponding opening HTML tags at the beginning of the file, which should look like this:

    <html> <body>
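
For anyone who wants to check a whole folder for this kind of breakage, the following Python sketch flags text that closes `html` or `body` without ever opening it. It is a rough heuristic under the assumption that the tags are written in the usual form; the function name is hypothetical and this is not how DocFetcher itself inspects files.

```python
import re


def has_unmatched_closing_tags(text: str) -> bool:
    """Return True if </html> or </body> appears without a matching opening tag."""
    lower = text.lower()
    for tag in ("html", "body"):
        # An opening tag is "<html" or "<body" followed by whitespace or ">".
        has_open = re.search(rf"<{tag}[\s>]", lower) is not None
        has_close = re.search(rf"</{tag}\s*>", lower) is not None
        if has_close and not has_open:
            return True
    return False
```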
    

    I assume these files were cut out from other, valid HTML files. You can still index these broken HTML files with DocFetcher: On the indexing dialog, add the pattern

    .*\.html
    

    to the pattern table and set "Detect mime type" as the action to be performed on matching files. This will make DocFetcher peek into the contents of HTML files, and it will see that they aren't valid and hence fall back to treating them as plain text files.
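
To see which file names the pattern above selects, here is a small Python sketch. It assumes the pattern is applied to the whole file name and is case-sensitive; how DocFetcher applies it (file name versus full path) depends on the options chosen in the pattern table, so treat this only as an approximation. The helper name `matches_pattern` is hypothetical.

```python
import re

# The pattern from the post above: any characters followed by a literal ".html".
PATTERN = re.compile(r".*\.html")


def matches_pattern(filename: str) -> bool:
    """Return True if the entire file name matches the pattern."""
    return PATTERN.fullmatch(filename) is not None
```

Under these assumptions, `repast-forum_2000-09.html` matches, while `page.htm` and `page.html.bak` do not.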

     

    Last edit: Nam-Quang Tran 2013-10-28
  • Anonymous

    Anonymous - 2013-10-28

    Your analysis appears to be correct. However, the solution is not. Indexing now works correctly, that is, the files containing my search terms are found. But the files now are treated as text files and the lower-right panel shows the html-source instead of the rendered html. Still, that is logical behaviour.
    Yet such broken HTML tags are quite common in the HTML world, as demonstrated by all the browsers that can correctly display the defective HTML files, as well as by DocFetcher itself (before it was made to treat the file as a text file). I venture the opinion that the program ought to be able to handle the situation.

    Meanwhile I will run a script to add the missing tags :-)
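
A minimal sketch of such a script in Python, assuming the files lack only the opening `<html> <body>` tags described earlier in this thread. The function names and the folder argument are hypothetical, and it is safer to run this on a copy of the folder first.

```python
import re
from pathlib import Path

# Opening tags to prepend when a file closes <html> but never opens it.
OPENING = b"<html> <body>\n"


def add_missing_opening_tags(data: bytes) -> bytes:
    """Prepend '<html> <body>' if the data contains </html> but no opening <html>."""
    lower = data.lower()
    has_open = re.search(rb"<html[\s>]", lower) is not None
    has_close = re.search(rb"</html\s*>", lower) is not None
    if has_close and not has_open:
        return OPENING + data
    return data


def fix_folder(folder: str) -> int:
    """Repair every *.html file directly inside folder; return how many were changed."""
    fixed = 0
    for path in Path(folder).glob("*.html"):
        # Work on raw bytes so the original encoding and line endings are preserved.
        original = path.read_bytes()
        repaired = add_missing_opening_tags(original)
        if repaired != original:
            path.write_bytes(repaired)
            fixed += 1
    return fixed
```

Running `fix_folder` a second time changes nothing, because repaired files already contain an opening `<html>` tag. After the fix, the folder needs to be re-indexed without the "Detect mime type" pattern so the files are handled as HTML again.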

     
  • Nam-Quang Tran

    Nam-Quang Tran - 2013-10-28

    "But the files now are treated as text files and the lower-right panel shows the html-source instead of the rendered html."

    Yes, I was aware of that, and unfortunately, at the moment there's no better solution for this except adding the missing tags.

    "I venture the opinion that the program ought to be able to handle the situation."

    I'm afraid that's not likely to happen. DocFetcher can display broken HTML files because it uses an embedded web browser (probably Internet Explorer, Firefox or Chrome), but it uses something much simpler for text extraction, called Jericho HTML Parser, and the latter probably won't be able to handle broken HTML files anytime soon. So, the display of broken HTML files in DocFetcher may or may not improve, but text extraction on these files won't work correctly.

     

    Last edit: Nam-Quang Tran 2013-10-28
