It is necessary to extend the IDictionary interface to support this. At a minimum, exposing the dictionary map to DictionariesManager should no longer be mandatory, i.e. drop Map<Word, Object> readHeader().

Expansion ideas for IDictionary:

Methods to indicate which search modes the dictionary driver supports:
boolean hasExactMatch()
boolean hasPrefixMatch()
boolean hasMultipleWordMatch()
(other search modes...)

Search methods:
Object searchExactMatch(String word)
Object searchPrefixMatch(String word)
Object searchMultipleWordMatch(String[] words, boolean condition) where condition = OR/AND
String readArticle(String word, Object searchResult)
Last edit: Hiroshi Miura 2015-09-15
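For illustration, the proposal could be sketched roughly as follows. The method names are the ones listed above; everything else (the MemoryDictionary driver, using headwords as opaque search handles) is an assumption for the sketch, not OmegaT code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the proposed extended dictionary-driver interface.
interface IDictionary {
    // Capability flags so callers can ask which search modes a driver supports.
    boolean hasExactMatch();
    boolean hasPrefixMatch();
    boolean hasMultipleWordMatch();

    // Searches return an opaque handle (null if nothing found);
    // readArticle resolves a hit to the article text.
    Object searchExactMatch(String word);
    Object searchPrefixMatch(String word);
    Object searchMultipleWordMatch(String[] words, boolean requireAll);
    String readArticle(String word, Object searchResult);
}

// Minimal in-memory driver to illustrate the contract.
class MemoryDictionary implements IDictionary {
    private final Map<String, String> articles = new HashMap<>();

    void put(String headword, String article) {
        articles.put(headword, article);
    }

    public boolean hasExactMatch() { return true; }
    public boolean hasPrefixMatch() { return true; }
    public boolean hasMultipleWordMatch() { return false; }

    public Object searchExactMatch(String word) {
        return articles.containsKey(word) ? word : null;
    }

    public Object searchPrefixMatch(String word) {
        List<String> hits = new ArrayList<>();
        for (String key : articles.keySet()) {
            if (key.startsWith(word)) {
                hits.add(key);
            }
        }
        return hits.isEmpty() ? null : hits;
    }

    public Object searchMultipleWordMatch(String[] words, boolean requireAll) {
        return null; // advertised as unsupported via hasMultipleWordMatch()
    }

    public String readArticle(String word, Object searchResult) {
        return articles.get(word);
    }
}
```

A UI could then query the capability flags first and only offer the search modes a given driver actually supports.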
Here is a proposal for this.
https://sourceforge.net/u/miurahr9/omegat/ci/integrate_trie_dictionary/
commit log:
https://sourceforge.net/u/miurahr9/omegat/ci/ad04f3f7ee1522281846b47501939e02e2fe423a/log/
These commits consist of the following improvements:
I found that several tests fail. One is the copyright test: I've imported third-party code (with necessary modifications), and that makes the test fail.
Last edit: Hiroshi Miura 2015-09-22
Updated the patch sets:
1. Add the function for EBDict; depends on #1123.
2. Add the feature for StarDict and LingoDSL.
This is implemented in trunk.
As part of this development, stopwords are now filtered from dictionary lookup in all cases (previously they were only filtered if Options > Dictionary > Use Fuzzy Matching for Dictionary Entries was enabled).

Further, prefix search is currently only performed when using fuzzy matching, for words that have no hits under exact search. This is because the tokenizers perform search-oriented stemming instead of lemmatization (e.g. "operation" can become "oper"), thus prefix search is required to get any results in this case. However, unrestricted prefix searching can easily give far too many results, reducing the utility of the Dictionary pane.
The restrictions on prefix search can be lifted pending improvements in the display layer, which should be a separate RFE.
Last edit: Aaron Madlon-Kay 2016-05-18
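The fallback behavior described above can be sketched like this (assumed names, not OmegaT's actual implementation): a stemmed token such as "oper" has no exact entry, so prefix search is tried, but only when fuzzy matching is enabled.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the restricted prefix-search fallback described above.
class LookupSketch {
    static final Set<String> HEADWORDS = new HashSet<>();

    static boolean exactMatch(String token) {
        return HEADWORDS.contains(token);
    }

    static boolean prefixMatch(String token) {
        for (String h : HEADWORDS) {
            if (h.startsWith(token)) {
                return true;
            }
        }
        return false;
    }

    // Exact search first; fall back to prefix search only when fuzzy
    // matching is enabled and exact search found nothing.
    static boolean lookup(String stemmedToken, boolean fuzzyEnabled) {
        if (exactMatch(stemmedToken)) {
            return true;
        }
        return fuzzyEnabled && prefixMatch(stemmedToken);
    }
}
```

With fuzzy matching off, the stemmed token "oper" finds nothing; with it on, it prefix-matches "operation".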
Considering StopList_en.txt, this is becoming an issue: not all of the words in it are trivial words that wouldn't benefit from dictionary entries.
Didier
I agree. Frankly I think StopList_en.txt is too big. I once tried to research the origin of that list and didn't have much luck divining its intent (beyond the usual use of stop words) or reason for including so many non-trivial words.

Here's the default stop word set for English in Lucene 5.2.0:
Last edit: Aaron Madlon-Kay 2016-05-18
Should we revert to that? We used StopList_en.txt from Okapi only because Lucene had no stop words for English at that time.

Another option would be an editable list (not only for English, of course), but that's another subject.
Didier
OK, one more stab at this. Tracing it back through the omegat-plugins project, I find the commits "Add stop words list from Okapi" and "Started simple term extraction step." It's not clear to me where to look to trace the origin any further than this.
Since it was originally(?) part of a term extraction step, I can see the logic in including some non-trivial (but common) words. However our usage is different, so I think we should consider cutting down the list or replacing it entirely.
Last edit: Aaron Madlon-Kay 2016-05-18
That said, the issue I have is with the dictionary. As an English to French translator, I don't really have an issue with these stop words for fuzzy matching.
Is there any reason we cannot treat the dictionary entries as the glossary ones (where no stop list applies)?
Didier
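The behavior being discussed here, filtering lookup tokens through a stop list before querying the dictionary, could be sketched as follows (the StopWordFilter class and its names are illustrative assumptions, not OmegaT code):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch: drop stop words from the tokens sent to dictionary drivers.
class StopWordFilter {
    private final Set<String> stopWords;

    StopWordFilter(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    // Returns only the tokens worth looking up in the dictionary.
    List<String> filter(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !stopWords.contains(t.toLowerCase()))
                .collect(Collectors.toList());
    }
}
```

Treating dictionary lookups like glossary lookups would amount to skipping this filter entirely; the compromise discussed here is keeping it but with a smaller stop list.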
We could. But I think stop words are desirable for the dictionary, as some dictionaries have entries for every little word that you really don't need to see. My only problem is that StopList_en.txt is overzealous.

Implemented in the released version 4.0 of OmegaT.
Didier