#8 OmegaT+ matching functionality

open
None
9
2012-10-23
2006-10-14
Raymond Martin
No

Improve matching. The current implementation lacks state-of-the-art
technology that can provide better matching and usage.

The edit distance dynamic programming algorithm is non-optimized.
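
For reference, here is a minimal sketch of the kind of edit distance computation the request refers to. It is not taken from OmegaT or OmegaT+; the class and method names are hypothetical, and keeping only two rows of the table is just one common optimization, assumed here for illustration.

```java
final class EditDistanceSketch {

    /**
     * Levenshtein distance between a and b, compared per UTF-16 code unit.
     * Only two rows of the DP table are kept, so memory is O(|b|) instead of
     * O(|a| * |b|); running time is still O(|a| * |b|).
     */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    /** Maps the distance to a score in [0, 1], where 1 means identical strings. */
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
        System.out.println(similarity("Open the file", "Open the files"));
    }
}
```

A translation memory lookup would typically call something like `similarity` once per stored segment, which is why the cost of the inner loop matters.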

Discussion

  • Raymond Martin
    2006-10-15

    Logged In: YES
    user_id=1111672

    Improved matching will be added in an upcoming omegat
    milestone release.

     
  • Logged In: YES
    user_id=915082

    All the RFEs that you filed in October already have code available in OmegaT
    except for this one. So if you could focus your energy here, it would be
    really nice.

    We have identified a number of issues related to how Java tokenizes strings.
    If you want to be kept up to date, don't hesitate to ask questions where
    these things are discussed.

    Also, I don't want to be picky about English, not being a native speaker, but
    when you write "algorithm is non-optimized", do you mean it has been
    consciously made so, or do you rather mean "algorithm is not optimized"? I
    am confused.

     
  • Raymond Martin
    2006-10-19

    Logged In: YES
    user_id=1111672

    I'll do what I can, as time permits.

    It may be the case that Java string tokenizing affects
    possible use for matching, but a lack of in-depth knowledge
    is also a factor affecting OmegaT development in this
    regard.

    "Optimized" is the past tense of the transitive verb
    "Optimize". You are confused.

     
  • Logged In: YES
    user_id=915082

    Rather than a lack of knowledge, it is a lack of time that plagues "us".

    As for tokenization, it affects how languages without spaces between words,
    Japanese for example, are handled by Java. As far as I am concerned, that is
    one of the most important issues facing dev right now.

    Do we match on the tokens or on the token substrings (character by
    character)? The previous version of the match engine did not use
    tokenization: strings were matched character by character, whatever the
    position of the characters in the strings and regardless of the overall
    semantic value of sign families (kanji should have more weight than kana,
    for example).
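
    The difference between the two granularities can be sketched as follows. This is an illustration only, not code from either project: the class and method names are hypothetical, `java.text.BreakIterator` is assumed as the tokenizer, and how it segments text without spaces (Japanese, for instance) depends on the Java runtime.

    ```java
    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    final class MatchGranularitySketch {

        /** Word tokens as reported by BreakIterator, with whitespace-only pieces dropped. */
        static List<String> tokens(String text, Locale locale) {
            BreakIterator it = BreakIterator.getWordInstance(locale);
            it.setText(text);
            List<String> out = new ArrayList<>();
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                String piece = text.substring(start, end);
                if (!piece.trim().isEmpty()) {
                    out.add(piece);
                }
            }
            return out;
        }

        /** One element per Unicode code point, for character-by-character matching. */
        static List<String> characters(String text) {
            List<String> out = new ArrayList<>();
            text.codePoints().forEach(cp -> out.add(new String(Character.toChars(cp))));
            return out;
        }

        /** Unit-cost edit distance over any sequence: tokens or single characters. */
        static <T> int editDistance(List<T> a, List<T> b) {
            int[] prev = new int[b.size() + 1];
            int[] curr = new int[b.size() + 1];
            for (int j = 0; j <= b.size(); j++) {
                prev[j] = j;
            }
            for (int i = 1; i <= a.size(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.size(); j++) {
                    int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                    curr[j] = Math.min(prev[j - 1] + cost, Math.min(prev[j], curr[j - 1]) + 1);
                }
                int[] swap = prev;
                prev = curr;
                curr = swap;
            }
            return prev[b.size()];
        }

        public static void main(String[] args) {
            String s = "the red car";
            String t = "the blue car";
            // One token differs ("red" vs "blue").
            System.out.println(editDistance(tokens(s, Locale.ENGLISH), tokens(t, Locale.ENGLISH))); // prints 1
            // Four character edits are needed to turn "red" into "blue".
            System.out.println(editDistance(characters(s), characters(t))); // prints 4
        }
    }
    ```

    Weighting some characters more than others (kanji over kana, say) would mean replacing the fixed cost of 1 with a per-element cost, which this sketch does not attempt.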

    I'm willing to test anything that improves on the current situation if you have
    code available.

    As for "non-optimized" and "not optimized" I think there is a major difference
    between the two and wondered why you had chosen "non-optimized".

     
  • Raymond Martin
    2006-10-19

    Logged In: YES
    user_id=1111672

    Obviously, the problem is only with certain languages that
    cannot be tokenized in the simple word-like manner of
    European languages or others.

    Without looking further into it, I would guess that a more
    evolved approach is needed that does not just rely on a
    single methodology to process content. Perhaps a hybrid
    approach that takes into account the features of a
    particular language. The problem is basically one of scope
    (i.e. character vs. word). The development on OmegaT is not
    being considerate of certain issues as it moves forward;
    this shows a lack of knowledge.

    Standard software development practices require regression
    testing to show that new versions do not break working
    functionality. Experienced developers know this and work
    with it. OmegaT does not follow good development practices.

    In regard to "non-optimized", "non-" is a standard prefix
    that can be added to many words in English, even to create
    those that are not listed in dictionaries. Perhaps its
    nature is more colloquial. Nonetheless, it is used. Just do
    a search for it.

    Some people even use "nonoptimized", which to me is
    somewhat, if not completely, incorrect.

     
  • Logged In: YES
    user_id=915082

    Well, obviously that is the case. Thank you for reminding me what I just
    wrote. As for the more evolved approach, this is in fact what my remarks are
    about. Since you think the current implementation is non-optimized, I am
    suggesting that this item in particular could be on your priority list.

    As for omegat dev not being considerate, well, it is quite the opposite. The
    current behavior, although not producing results as good as I expected, is way
    better than matching a string character by character, as was implemented
    before.

    Since omegat dev currently is not able to naturally parse such languages,
    there is a problem here that can't easily be solved. One thing that could be
    investigated, for example, is to further sub-tokenize the Java tokens by
    "semantic" means, with the help of a dictionary. Obviously that leads to
    language modularization and possibly memory-intensive loading.
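
    A rough sketch of dictionary-based sub-tokenization, assuming a greedy longest-match strategy and a hypothetical in-memory word set (neither is specified in this thread). ASCII words are used only to keep the example readable; the same loop applies to any script whose characters fit in a single `char`.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    final class DictionarySubTokenizerSketch {
        private final Set<String> dictionary; // hypothetical per-language word list
        private final int maxWordLength;

        DictionarySubTokenizerSketch(Set<String> dictionary) {
            this.dictionary = dictionary;
            int max = 1;
            for (String word : dictionary) {
                max = Math.max(max, word.length());
            }
            this.maxWordLength = max;
        }

        /**
         * Splits one token into the longest dictionary words found from left to
         * right. A character that starts no known word becomes its own piece.
         * Works on UTF-16 code units, so supplementary characters are not handled.
         */
        List<String> split(String token) {
            List<String> pieces = new ArrayList<>();
            int pos = 0;
            while (pos < token.length()) {
                int end = Math.min(token.length(), pos + maxWordLength);
                while (end > pos + 1 && !dictionary.contains(token.substring(pos, end))) {
                    end--; // shrink the candidate until it is a known word or one char
                }
                pieces.add(token.substring(pos, end));
                pos = end;
            }
            return pieces;
        }

        public static void main(String[] args) {
            Set<String> words = new HashSet<>(Arrays.asList("data", "base", "bases"));
            DictionarySubTokenizerSketch splitter = new DictionarySubTokenizerSketch(words);
            System.out.println(splitter.split("databases")); // prints [data, bases]
        }
    }
    ```

    The memory concern raised above shows up here directly: the whole word list has to be resident for each language that needs it.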

    Or, as was suggested in my preceding comment (thank you for considering it
    a valid approach, by the way), by using a "hybrid" approach that does further
    matching within the unmatched tokens. That could be used for any language
    (especially ones that add a number of suffixes, like German, French, or
    Russian) to match simple plurals, verb conjugations, etc.
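
    One way to read the "hybrid" idea, sketched under assumptions: align the segments token by token, but charge a mismatched token pair only in proportion to how different the two tokens are character by character. The names and the cost formula below are hypothetical, not something either project is known to use.

    ```java
    import java.util.Arrays;
    import java.util.List;

    final class HybridMatchSketch {

        /** Plain character-level Levenshtein distance (per UTF-16 code unit). */
        static int charDistance(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) {
                prev[j] = j;
            }
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(prev[j - 1] + cost, Math.min(prev[j], curr[j - 1]) + 1);
                }
                int[] swap = prev;
                prev = curr;
                curr = swap;
            }
            return prev[b.length()];
        }

        /** Cost in [0, 1] of replacing token a with token b: 0 if equal, small if close. */
        static double tokenCost(String a, String b) {
            int longest = Math.max(a.length(), b.length());
            return longest == 0 ? 0.0 : (double) charDistance(a, b) / longest;
        }

        /** Token-level edit distance where substitutions get partial credit. */
        static double hybridDistance(List<String> a, List<String> b) {
            double[][] d = new double[a.size() + 1][b.size() + 1];
            for (int i = 1; i <= a.size(); i++) {
                d[i][0] = i; // dropping a whole token costs 1
            }
            for (int j = 1; j <= b.size(); j++) {
                d[0][j] = j; // adding a whole token costs 1
            }
            for (int i = 1; i <= a.size(); i++) {
                for (int j = 1; j <= b.size(); j++) {
                    double substitute = d[i - 1][j - 1] + tokenCost(a.get(i - 1), b.get(j - 1));
                    double skip = Math.min(d[i - 1][j], d[i][j - 1]) + 1.0;
                    d[i][j] = Math.min(substitute, skip);
                }
            }
            return d[a.size()][b.size()];
        }

        public static void main(String[] args) {
            List<String> singular = Arrays.asList("the", "red", "car");
            List<String> plural = Arrays.asList("the", "red", "cars");
            // Pure token matching would count "car" vs "cars" as a full mismatch (1.0).
            System.out.println(hybridDistance(singular, plural)); // prints 0.25
        }
    }
    ```

    With this cost, a plural or a conjugated form lowers the match score only slightly instead of counting as a completely different word.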

    If you are really serious about improving the matching engine, you will need
    to look further into it at some time. Hopefully your professional approach will
    produce something that will be of value to the omegat code.

    If you need further information about how natural languages modify their
    structures and how to deal with that in omegat there is a whole team of
    linguists on the other side of your fence.

     
  • Raymond Martin
    2006-10-19

    Logged In: YES
    user_id=1111672

    This forum is for OmegaT+ feature requests, not lengthy
    discussions of other issues. Please post your messages on
    the OmegaT+ development list, Google group, or other
    location.

     
  • Logged In: YES
    user_id=915082

    I am suggesting approaches to you that you may not have considered. I am not
    just "discussing".

    You know that the dev list is currently non-functional; it is merely a place
    where you put memos on what you think you'll do someday, and as far as
    Google is concerned, I am not allowed to "discuss" there.

    You may be unclear on what "add a comment" means: it is the string right
    above the window in which I am currently typing.

    You are free to ignore my comments regarding your feature requests. If you
    think my comments are out of place here, feel free to ignore them
    altogether.

    I am actually giving you hints here so that you don't have to keep lurking on
    omegat's dev list for fresh ideas on how to deal with issues that you don't
    seem to understand very well.

     
  • Raymond Martin
    2006-10-19

    Logged In: YES
    user_id=1111672

    I don't really care what you think and view anything you
    say with suspicion.

    I am not interested in helping OmegaT. If what I do on
    OmegaT+ or elsewhere does help in some way, then that is
    just a side effect of the open source method. I have no
    intention of directly aiding those who work against me.

    Your words lack credibility and you cannot be trusted.

     
  • Raymond Martin
    2008-02-16

    Logged In: YES
    user_id=1111672
    Originator: YES

    Improved functionality will be added towards OmegaT+ 1.0. This is a very fundamental issue that is of high importance. Thus it requires serious consideration of design issues, e.g. usefulness, efficiency, optimization, and so forth.

    More comments at a later date.

    Raymond