
#2403 Problem with UTF8 text containing "lock" symbol

Milestone: v1.0_(example)
Status: closed
Owner: nobody
Labels: None
Priority: 1
Updated: 2026-01-07
Created: 2024-12-10
Creator: Anonymous
Private: No

DocFetcher does not index (and therefore cannot find) Cyrillic text in a UTF-8 encoded file if the file contains the symbol 🔒.
For example, create a txt or html file that contains "мама папа hello 🔒 world" and try to index it.
DocFetcher will find "hello" and "world" but will not find "мама" and "папа".
Without this symbol, everything works as expected.
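
A minimal sketch for producing the test file described above. The helper name `write_sample` and the choice to write UTF-8 without a byte order mark are assumptions made for illustration; the report does not say how the original file was created.

```python
from pathlib import Path

# Sample text from the report; "\U0001F512" is the lock symbol.
SAMPLE = "мама папа hello \U0001F512 world"


def write_sample(path: str) -> bytes:
    """Write SAMPLE to `path` as UTF-8 (no BOM) and return the raw bytes."""
    data = SAMPLE.encode("utf-8")
    Path(path).write_bytes(data)
    return data
```

Calling `write_sample("lock-test.txt")` (a hypothetical file name) and then indexing the containing folder should give a file suitable for checking the behaviour described in this ticket.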

1 Attachments

Discussion

  • Nam-Quang Tran

    Nam-Quang Tran - 2024-12-11

    Hi,

    in the Preferences, you can try some of the other word segmentation options. This might make the search work with this special character, but will also significantly impact the search results.

    Regards
    q:-) <= Quang

     
    • Anonymous

      Anonymous - 2024-12-13

      Hi.
      I think the problem is in the codepage detection.
      This lock symbol makes the indexer work in the wrong encoding.
      Please look at my screenshot.
      At the bottom you can see that the found content is shown in the wrong codepage, not the one used by the indexed document.
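
The symptom is consistent with UTF-8 bytes being decoded with a single-byte codepage. The sketch below illustrates the effect only; `latin-1` is an assumed stand-in, since the thread does not say which encoding DocFetcher actually picked, and `tokens_when_misdecoded` is a hypothetical helper, not part of DocFetcher.

```python
# Sample text from the report; "\U0001F512" is the lock symbol.
SAMPLE = "мама папа hello \U0001F512 world"


def tokens_when_misdecoded(text: str, wrong_encoding: str = "latin-1") -> list[str]:
    """Encode `text` as UTF-8, decode it with the wrong codepage,
    and split the result on whitespace."""
    raw = text.encode("utf-8")
    return raw.decode(wrong_encoding).split()


words = tokens_when_misdecoded(SAMPLE)
# ASCII bytes mean the same thing in both encodings, so ASCII words survive.
print("hello" in words, "world" in words)
# Each Cyrillic letter is two bytes in UTF-8, so it turns into two unrelated characters.
print("мама" in words, "папа" in words)
```

Under this assumption the ASCII words remain searchable while the Cyrillic words are stored as different character sequences, which matches what the report describes.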

       
  • Nam-Quang Tran

    Nam-Quang Tran - 2024-12-13

    Yes, it looks like the lock symbol causes DocFetcher to pick the wrong encoding. You can force it to use a particular encoding, but this will then apply to all text files. To force the encoding, open the file "program-conf.txt" and alter the setting "TextEncodingOverride". In this case, the following value works:

    TextEncodingOverride = utf-8

    Then save the file, restart the program, and rebuild all the relevant indexes.

    Alternatively, you can try the commercial DocFetcher Pro. It seems to handle this case just fine, without any text encoding overrides.

     
  • Nam-Quang Tran

    Nam-Quang Tran - 2026-01-07

    Will be fixed in DocFetcher 1.1.27.

     
