It is necessary to extend the IDictionary interface to support this. At a minimum, exposing the dictionary map to DictionariesManager should no longer be mandatory, i.e. drop Map<Word, Object> readHeader().

Expansion ideas for IDictionary:

Methods to indicate which search modes the dictionary driver supports:
boolean hasExactMatch()
boolean hasPrefixMatch()
boolean hasMultipleWordMatch()
(other search modes...)

Search methods:
Object searchExactMatch(String word)
Object searchPrefixMatch(String word)
Object searchMultipleWordMatch(String[] words, boolean condition) where condition = OR/AND
String readArticle(String word, Object searchResult)
Last edit: Hiroshi Miura 2015-09-15
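For illustration, the proposal could be sketched roughly as follows. The method names are the ones listed above; everything else (the MemoryDictionary driver, using headwords as opaque search handles) is an assumption for the sketch, not OmegaT code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the proposed extended dictionary-driver interface.
interface IDictionary {
    // Capability flags so callers can ask which search modes a driver supports.
    boolean hasExactMatch();
    boolean hasPrefixMatch();
    boolean hasMultipleWordMatch();

    // Searches return an opaque handle (null if nothing found);
    // readArticle resolves a hit to the article text.
    Object searchExactMatch(String word);
    Object searchPrefixMatch(String word);
    Object searchMultipleWordMatch(String[] words, boolean requireAll);
    String readArticle(String word, Object searchResult);
}

// Minimal in-memory driver to illustrate the contract.
class MemoryDictionary implements IDictionary {
    private final Map<String, String> articles = new HashMap<>();

    void put(String headword, String article) {
        articles.put(headword, article);
    }

    public boolean hasExactMatch() { return true; }
    public boolean hasPrefixMatch() { return true; }
    public boolean hasMultipleWordMatch() { return false; }

    public Object searchExactMatch(String word) {
        return articles.containsKey(word) ? word : null;
    }

    public Object searchPrefixMatch(String word) {
        List<String> hits = new ArrayList<>();
        for (String key : articles.keySet()) {
            if (key.startsWith(word)) {
                hits.add(key);
            }
        }
        return hits.isEmpty() ? null : hits;
    }

    public Object searchMultipleWordMatch(String[] words, boolean requireAll) {
        return null; // advertised as unsupported via hasMultipleWordMatch()
    }

    public String readArticle(String word, Object searchResult) {
        return articles.get(word);
    }
}
```

A UI could then query the capability flags first and only offer the search modes a given driver actually supports.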
Here is a proposal for this.
https://sourceforge.net/u/miurahr9/omegat/ci/integrate_trie_dictionary/
commit log:
https://sourceforge.net/u/miurahr9/omegat/ci/ad04f3f7ee1522281846b47501939e02e2fe423a/log/
These commits consist of the following improvements:
I found that several tests fail. One is the copyright test: I've imported third-party code (with necessary modifications), and that makes the test fail.
Last edit: Hiroshi Miura 2015-09-22
Updated the patch sets:
1. Add the function for EBDict; depends on #1123.
2. Add the feature for StarDict and LingoDSL.
This is implemented in trunk.
As part of this development, stopwords are now filtered from dictionary lookup in all cases (previously they were only filtered if Options > Dictionary > Use Fuzzy Matching for Dictionary Entries was enabled).

Further, prefix search is currently only performed when using fuzzy matching, for words that have no hits under exact search. This is because the tokenizers perform search-oriented stemming instead of lemmatization (e.g. "operation" can become "oper"), thus prefix search is required to get any results in this case. However, unrestricted prefix searching can easily give far too many results, reducing the utility of the Dictionary pane.
The restrictions on prefix search can be lifted pending improvements in the display layer, which should be a separate RFE.
Last edit: Aaron Madlon-Kay 2016-05-18
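The fallback behavior described above can be sketched like this (assumed names, not OmegaT's actual implementation): a stemmed token such as "oper" has no exact entry, so prefix search is tried, but only when fuzzy matching is enabled.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the restricted prefix-search fallback described above.
class LookupSketch {
    static final Set<String> HEADWORDS = new HashSet<>();

    static boolean exactMatch(String token) {
        return HEADWORDS.contains(token);
    }

    static boolean prefixMatch(String token) {
        for (String h : HEADWORDS) {
            if (h.startsWith(token)) {
                return true;
            }
        }
        return false;
    }

    // Exact search first; fall back to prefix search only when fuzzy
    // matching is enabled and exact search found nothing.
    static boolean lookup(String stemmedToken, boolean fuzzyEnabled) {
        if (exactMatch(stemmedToken)) {
            return true;
        }
        return fuzzyEnabled && prefixMatch(stemmedToken);
    }
}
```

With fuzzy matching off, the stemmed token "oper" finds nothing; with it on, it prefix-matches "operation".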
Considering StopList_en.txt, this is becoming an issue: not all of the words in it are trivial words that wouldn't benefit from dictionary entries.
Didier
I agree. Frankly I think StopList_en.txt is too big. I once tried to research the origin of that list and didn't have much luck divining its intent (beyond the usual use of stop words) or reason for including so many non-trivial words.

Here's the default stop word set for English in Lucene 5.2.0:
Last edit: Aaron Madlon-Kay 2016-05-18
Should we revert to that? We used StopList_en.txt from Okapi only because Lucene had no stop words for English at that time.

Another option would be an editable list (not only for English, of course), but that's another subject.
Didier
OK, one more stab at this. Tracing it back through the omegat-plugins project, I find the commits "Add stop words list from Okapi" and "Started simple term extraction step." It's not clear to me where to look to trace the origin any further than this.
Since it was originally(?) part of a term extraction step, I can see the logic in including some non-trivial (but common) words. However our usage is different, so I think we should consider cutting down the list or replacing it entirely.
Last edit: Aaron Madlon-Kay 2016-05-18
That said, the issue I have is with the dictionary. As an English to French translator, I don't really have an issue with these stop words for fuzzy matching.
Is there any reason we cannot treat the dictionary entries as the glossary ones (where no stop list applies)?
Didier
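The behavior being discussed here, filtering lookup tokens through a stop list before querying the dictionary, could be sketched as follows (the StopWordFilter class and its names are illustrative assumptions, not OmegaT code):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch: drop stop words from the tokens sent to dictionary drivers.
class StopWordFilter {
    private final Set<String> stopWords;

    StopWordFilter(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    // Returns only the tokens worth looking up in the dictionary.
    List<String> filter(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !stopWords.contains(t.toLowerCase()))
                .collect(Collectors.toList());
    }
}
```

Treating dictionary lookups like glossary lookups would amount to skipping this filter entirely; the compromise discussed here is keeping it but with a smaller stop list.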
We could. But I think stop words are desirable for the dictionary, as some dictionaries have entries for every little word that you really don't need to see. My only problem is that StopList_en.txt is overzealous.

Implemented in the released version 4.0 of OmegaT.
Didier