#1124 Support prefix search for dictionary

Group: 4.0
Status: closed-fixed
Labels: None
Priority: 5
Updated: 2016-09-06
Created: 2015-09-13
Private: No

The current dictionary implementation uses exact matching for lookups. This makes it difficult to find relevant articles.

Prefix search would also help users find idioms.
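
To illustrate the difference, here is a small self-contained sketch (the toy index and class name are purely illustrative, not OmegaT code) comparing exact lookup with prefix lookup over a sorted index:

    import java.util.NavigableMap;
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class PrefixLookupDemo {
        public static void main(String[] args) {
            // Toy index; real dictionary drivers keep their own index structures.
            NavigableMap<String, String> index = new TreeMap<>();
            index.put("opera", "a dramatic musical work");
            index.put("operate", "to work or function");
            index.put("operation", "the act of operating");

            String query = "oper";

            // Exact match only: none of the three entries is found.
            System.out.println("exact:  " + index.get(query));

            // Prefix match: every key in [query, query + '\uffff') starts with the query.
            SortedMap<String, String> hits = index.subMap(query, query + Character.MAX_VALUE);
            hits.forEach((k, v) -> System.out.println("prefix: " + k + " -> " + v));
        }
    }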

Related

Feature Requests: #1245
Feature Requests: #1250

Discussion

  • Hiroshi Miura

    Hiroshi Miura - 2015-09-15

    It is necessary to extend the IDictionary interface to support this.
    At a minimum, the dictionary map must no longer be exposed to DictionariesManager, i.e. drop 'Map<Word, Object> readHeader()'.

    Ideas for expanding IDictionary:

    1. Methods to indicate which search modes the dictionary driver supports:

    boolean hasExactMatch()
    boolean hasPrefixMatch()
    boolean hasMultipleWordMatch()
    other search modes...

    2. Methods to perform each search mode:

    Object searchExactMatch(String word)
    Object searchPrefixMatch(String word)
    Object searchMultipleWordMatch(String[] words, boolean condition) where condition = OR/AND

    3. A method to retrieve the article:

    String readArticle(String word, Object searchResult)
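
    Taken together, a minimal sketch of the extended interface could look like this (names follow the proposal above and are an assumption, not the final OmegaT API):

        // Sketch only: method names and the opaque Object result type follow the
        // proposal above; this is not the shipped OmegaT interface.
        public interface IDictionary {
            // Capability flags so DictionariesManager can pick a search mode
            // without seeing the driver's internal dictionary map.
            boolean hasExactMatch();
            boolean hasPrefixMatch();
            boolean hasMultipleWordMatch();

            // One lookup method per supported mode; the returned Object is an
            // opaque search-result handle owned by the driver.
            Object searchExactMatch(String word);
            Object searchPrefixMatch(String word);

            // condition selects AND (true) or OR (false) combination of the words.
            Object searchMultipleWordMatch(String[] words, boolean condition);

            // Resolve a search-result handle into the article text.
            String readArticle(String word, Object searchResult);
        }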

     

    Last edit: Hiroshi Miura 2015-09-15
  • Didier Briel

    Didier Briel - 2015-09-15
    • summary: support prefix search for dictionary --> Support prefix search for dictionary
     
  • Hiroshi Miura

    Hiroshi Miura - 2015-09-22

    I found that several tests fail.

    One is the copyright test: I've imported third-party code (because it needed modification), and that makes the test fail.

     

    Last edit: Hiroshi Miura 2015-09-22
  • Hiroshi Miura

    Hiroshi Miura - 2015-11-18

    Updated patch sets:
    1. Add the feature for EBDict (depends on #1123).
    2. Add the feature for StarDict and LingvoDSL.

     
  • Aaron Madlon-Kay

    • assigned_to: Aaron Madlon-Kay
     
  • Aaron Madlon-Kay

    • status: open --> open-fixed
    • Group: future --> 4.0
     
  • Aaron Madlon-Kay

    This is implemented in trunk.

    As part of this development, stopwords are now filtered from dictionary lookup in all cases (previously they were only filtered if Options > Dictionary > Use Fuzzy Matching for Dictionary Entries was enabled).

    Further, prefix search is currently only performed when using fuzzy matching, for words that have no hits under exact search. This is because the tokenizers perform search-oriented stemming instead of lemmatization (e.g. "operation" can become "oper"), thus prefix search is required to get any results in this case. However, unrestricted prefix searching can easily give far too many results, reducing the utility of the Dictionary pane.

    The restrictions on prefix search can be lifted pending improvements in the display layer, which should be a separate RFE.
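
    As a rough sketch of that lookup flow (the class, field, and method names here are illustrative assumptions, not the actual OmegaT implementation):

        import java.util.Collections;
        import java.util.List;
        import java.util.Locale;
        import java.util.Set;
        import java.util.function.Function;

        // Illustrative only: shows the lookup order described above, not OmegaT code.
        class DictionaryLookupSketch {
            private final Set<String> stopWords;                        // e.g. loaded from StopList_en.txt
            private final Function<String, List<String>> exactSearch;   // driver's exact-match mode
            private final Function<String, List<String>> prefixSearch;  // driver's prefix-match mode
            private final Function<String, String> stemmer;             // search-oriented stemmer

            DictionaryLookupSketch(Set<String> stopWords,
                                   Function<String, List<String>> exactSearch,
                                   Function<String, List<String>> prefixSearch,
                                   Function<String, String> stemmer) {
                this.stopWords = stopWords;
                this.exactSearch = exactSearch;
                this.prefixSearch = prefixSearch;
                this.stemmer = stemmer;
            }

            List<String> lookUp(String word, boolean useFuzzyMatching) {
                // Stop words are now filtered in all cases, not only under fuzzy matching.
                if (stopWords.contains(word.toLowerCase(Locale.ENGLISH))) {
                    return Collections.emptyList();
                }
                List<String> hits = exactSearch.apply(word);
                // Prefix search only as a fallback under fuzzy matching: the stemmer may
                // reduce "operation" to "oper", so an exact search would find nothing.
                if (hits.isEmpty() && useFuzzyMatching) {
                    hits = prefixSearch.apply(stemmer.apply(word));
                }
                return hits;
            }
        }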

     

    Last edit: Aaron Madlon-Kay 2016-05-18
  • Didier Briel

    Didier Briel - 2016-05-18

    "As part of this development, stopwords are now filtered from dictionary lookup in all cases"

    If I consider StopList_en.txt, this becomes an issue. Not all of the words in that list are trivial to translate; some would benefit from dictionary entries.

    Didier

     
  • Aaron Madlon-Kay

    I agree. Frankly I think StopList_en.txt is too big. I once tried to research the origin of that list and didn't have much luck divining its intent (beyond the usual use of stop words) or reason for including so many non-trivial words.

    Here's the default stop word set for English in Lucene 5.2.0:

          "a", "an", "and", "are", "as", "at", "be", "but", "by",
          "for", "if", "in", "into", "is", "it",
          "no", "not", "of", "on", "or", "such",
          "that", "the", "their", "then", "there", "these",
          "they", "this", "to", "was", "will", "with"
    
     

    Last edit: Aaron Madlon-Kay 2016-05-18
    • Didier Briel

      Didier Briel - 2016-05-18

      Should we revert to that? We used StopList_en.txt from Okapi only because Lucene had no stop words for English at that time.

      Another option would be an editable list (not only for English, of course), but that's another subject.

      Didier

       
  • Aaron Madlon-Kay

    OK, one more stab at this:

    Since it was originally(?) part of a term extraction step, I can see the logic in including some non-trivial (but common) words. However, our usage is different, so I think we should consider cutting the list down or replacing it entirely.

     

    Last edit: Aaron Madlon-Kay 2016-05-18
  • Didier Briel

    Didier Briel - 2016-05-18

    "Should we revert to that? We used StopList_en.txt from Okapi only because Lucene had no stop words for English at that time."

    That said, the issue I have is with the dictionary. As an English to French translator, I don't really have an issue with these stop words for fuzzy matching.

    Is there any reason we cannot treat the dictionary entries like the glossary ones (where no stop list applies)?

    Didier

     
  • Aaron Madlon-Kay

    We could. But I think stop words are desirable for the dictionary as some dictionaries have entries for every little word that you really don't need to see. My only problem is that StopList_en.txt is overzealous.

     
  • Didier Briel

    Didier Briel - 2016-09-06
    • status: open-fixed --> closed-fixed
     
  • Didier Briel

    Didier Briel - 2016-09-06

    Implemented in the released version 4.0 of OmegaT.

    Didier

     
