Currently, OmegaT only finds StarDict index entries when they are in lowercase. However, some dictionaries are in uppercase. For instance, in the French Academy (1935) dictionary (http://download.huzheng.org/fr/) all entries are in full uppercase.
Converting all index entries to lowercase would be a limited strategy, since case is sometimes significant (e.g., in German).
Aaron proposes the following strategy:
- I think some kind of normalization is reasonable. For instance at the moment we aren't performing Unicode normalization on dictionary entries either, but really we should be.
- Lowercasing seems OK to me (yes, we should use the right locale). Regarding your point about German, even in English it would be better to distinguish between "post" (the physical object) and "POST" (the HTTP verb). It would take more memory (especially if a dictionary is entirely uppercase), but it seems like the smart thing to do would be to retain both the original key and the lowercased key when they differ.
- We should be doing the same normalization on the search words when doing lookup: (First, Unicode-normalize, then) look up the word as-is; if there are no hits then look up the lowercased word.
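The strategy in the three points above could be sketched roughly as follows. This is only an illustration, not OmegaT's actual code: the class `DictionaryIndexSketch` and its methods `add` and `lookup` are hypothetical names, the choice of NFC as the Unicode normalization form is an assumption (the proposal does not name one), and the in-memory map stands in for whatever index structure the StarDict reader really uses.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical in-memory index; not an OmegaT class. */
class DictionaryIndexSketch {
    private final Locale locale;
    private final Map<String, List<String>> entries = new HashMap<>();

    DictionaryIndexSketch(Locale locale) {
        // Locale used for lowercasing (e.g. Turkish dotted/dotless i).
        this.locale = locale;
    }

    // Assumption: NFC is the normalization form applied to keys and queries.
    private static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Stores the article under the original key and, if different, the lowercased key. */
    void add(String key, String article) {
        String original = normalize(key);
        put(original, article);
        String lower = original.toLowerCase(locale);
        if (!lower.equals(original)) {
            put(lower, article);
        }
    }

    private void put(String key, String article) {
        entries.computeIfAbsent(key, k -> new ArrayList<>()).add(article);
    }

    /** Looks up the normalized word as-is; falls back to the lowercased word if there are no hits. */
    List<String> lookup(String word) {
        String normalized = normalize(word);
        List<String> hits = entries.get(normalized);
        if (hits == null) {
            hits = entries.get(normalized.toLowerCase(locale));
        }
        return hits == null
                ? Collections.emptyList()
                : Collections.unmodifiableList(hits);
    }
}
```

With this sketch, an all-uppercase entry such as "MAISON" is found by the queries "MAISON", "maison" and "Maison", while a dictionary that contains both "POST" and "post" still returns only the uppercase article for the exact query "POST". The cost is the extra memory mentioned above: a fully uppercase dictionary stores every key twice.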
Didier
A prototype incorporating this, [#1124] and [#1242] is available here: https://omegat.ci.cloudbees.com/job/omegat-prototype/42/
Related
Feature Requests: #1124, #1242
Last edit: Aaron Madlon-Kay 2016-05-13
We were actually already doing that, in DictionariesManager.findWords().
This is addressed in trunk for both StarDict and LingvoDSL.
Implemented in the released version 4.0 of OmegaT.
Didier