
#227 Aspell removes some Polish characters if language is non-PL

closed
None
5
2009-01-20
2009-01-18
przemoc
No

The examples below show Aspell removing ą, ć, ę, ł (and their capital versions), depending on the language used.

$ echo 'äöüß ÄÖÜ ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ' | aspell list --encoding=utf-8 -l pl
äöüß
ÄÖÜ
ąćęłńóśźż
ĄĆĘŁŃÓŚŹŻ

$ echo 'äöüß ÄÖÜ ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ' | aspell list --encoding=utf-8 -l de
äöüß
ÄÖÜ
ąćę
ńóśźż
ĄĆĘ
ŃÓŚŹŻ

$ echo 'äöüß ÄÖÜ ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ' | aspell list --encoding=utf-8 -l en
äöüß
ÄÖÜ
ńóśźż
ŃÓŚŹŻ

$ aspell -v
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.6)

Discussion

  • Kevin Atkinson

    Kevin Atkinson - 2009-01-19

    The short answer is that this is by design.

    Can you please be more specific about the problem this is causing you?

     
  • Kevin Atkinson

    Kevin Atkinson - 2009-01-19
    • assigned_to: nobody --> kevina
    • status: open --> pending
     
  • przemoc

    przemoc - 2009-01-19

    It's obvious that each of the outputs shown should be the same.
    Why are all German characters preserved? Because this is how it should be done for all languages: no characters should be removed.

    Maybe I'm misunderstanding something, but how can a design define which national characters stay intact and which do not?

    I'm using aspell to find misspelled words in "multilingual" texts (e.g. in Poland we sometimes use English words that don't have a good Polish equivalent in the given context).
    Generally, I would like to use something like:

    $ cat textfile | aspell list --encoding=utf-8 -l pl | aspell list --encoding=utf-8 -l en

    but as I stated before, it doesn't work, because every word containing ą, ć, ę, ł (or their upper-case counterparts) is split, and the splitting character is removed.

     
  • przemoc

    przemoc - 2009-01-19
    • summary: Aspell removes some polish characters if language is non-pl --> Aspell removes some Polish characters if language is non-PL
    • status: pending --> open
     
  • Kevin Atkinson

    Kevin Atkinson - 2009-01-19

    Now suppose your text contains Cyrillic characters. Do you want Aspell to recognize those also? Do you want it to spit out Russian words?

    How Aspell splits words is language dependent; that is the part that is by design. For example, some languages split on the '-' and others don't, depending on the decision of the dictionary author.

    Technically, Aspell is 8-bit internally, and when splitting words it will reject any characters outside of the character set used for the particular language. This is a limitation of Aspell that will be very hard to fix.

    If this is important to you I have a solution in mind. Please post to aspell-user@gnu.org as this is not the place for it.
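
The character-set limitation described above can be illustrated with a toy model. This is only a sketch, not Aspell's actual algorithm: the function name `split_words` is hypothetical, and the choice of ISO-8859-2 for Polish and ISO-8859-1 for a Western European language is an assumption made for illustration. It treats any character that cannot be encoded in the language's 8-bit character set as a word separator and drops it.

```python
def split_words(text, charset):
    """Toy word splitter (NOT Aspell's real implementation).

    A character counts as part of a word only if it is alphabetic AND
    can be encoded in the given 8-bit charset. Anything else ends the
    current word and is discarded.
    """
    words, current = [], []
    for ch in text:
        try:
            ch.encode(charset)
            encodable = True
        except UnicodeEncodeError:
            encodable = False

        if encodable and ch.isalpha():
            current.append(ch)
        elif current:
            # Separator (or character outside the charset): close the word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


sample = "äöüß ÄÖÜ ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ"

# Assumed charsets: ISO-8859-2 covers the Polish letters, ISO-8859-1 does not.
print(split_words(sample, "iso-8859-2"))
print(split_words(sample, "iso-8859-1"))
```

With the ISO-8859-2 assumption all four words survive intact, while with ISO-8859-1 the Polish words are broken apart. The exact fragments differ from the Aspell output in the report, which suggests Aspell applies further conversions that this sketch does not model.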

     
  • przemoc

    przemoc - 2009-01-19

    Since you ask: of course I want aspell to spit out Russian words. Recognition depends on the chosen language. Everything that is not in the chosen dictionary should be left untouched.

    Now I see the point: aspell is 8-bit internally, so UTF-8 support is only a feint.

    OK, I'll post to aspell-user@gnu.org.

     
  • Kevin Atkinson

    Kevin Atkinson - 2009-01-20
    • status: open --> closed