#227 Aspell removes some Polish characters if language is non-PL

Status: closed

Owner: Kevin Atkinson

Labels: None

Priority: 5

Updated: 2009-01-20

Created: 2009-01-18

Creator: przemoc

Private: No

The output below shows Aspell dropping ą, ć, ę, ł (and their capital versions) depending on the language used: with -l de only ł/Ł is lost, with -l en all four are.

$ echo 'äöüß ÄÖÜ ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ' | aspell list --encoding=utf-8 -l pl
äöüß
ÄÖÜ
ąćęłńóśźż
ĄĆĘŁŃÓŚŹŻ

$ echo 'äöüß ÄÖÜ ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ' | aspell list --encoding=utf-8 -l de
äöüß
ÄÖÜ
ąćę
ńóśźż
ĄĆĘ
ŃÓŚŹŻ

$ echo 'äöüß ÄÖÜ ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ' | aspell list --encoding=utf-8 -l en
äöüß
ÄÖÜ
ńóśźż
ŃÓŚŹŻ

$ aspell -v
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.6)

Discussion

Kevin Atkinson - 2009-01-19

The short answer is that this is by design.

Can you please be more specific about the problem this is causing you?

Kevin Atkinson - 2009-01-19

assigned_to: nobody --> kevina

status: open --> pending

przemoc - 2009-01-19

It's obvious that all three outputs should be the same.
Why are all German characters preserved? Because that is how it should work for all languages: no characters should be removed.

Maybe I'm misunderstanding something, but how can the design determine which national characters stay intact and which don't?

I'm using aspell to find misspelled words in "multilingual" texts (e.g. in Poland we sometimes use English words that have no good Polish equivalent in the given context).
Normally I would like to use something like:

$ cat textfile | aspell list --encoding=utf-8 -l pl | aspell list --encoding=utf-8 -l en

but as I stated before, it doesn't work, because every word containing ą, ć, ę, ł (or their upper-case counterparts) is split and the splitting character is removed.

przemoc - 2009-01-19

summary: Aspell removes some polish characters if language is non-pl --> Aspell removes some Polish characters if language is non-PL

status: pending --> open

Kevin Atkinson - 2009-01-19

Now suppose your text contains Cyrillic characters. Do you want Aspell to recognize those as well? Do you want it to spit out Russian words?

How Aspell splits words is language dependent; that is the part that is by design. For example, some languages split on '-' and others don't, depending on the decision of the dictionary author.

Technically, Aspell is 8-bit internally, and when splitting words it will reject any words outside of the character set used for the particular language. This is a limitation of Aspell that will be very hard to fix.
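
The effect described above can be illustrated with a simplified model. The sketch below is not Aspell's code: it only assumes that each language has a fixed set of letters and that any character outside that set ends the current word and is dropped. The three alphabets are hypothetical, chosen solely so that the model reproduces the output shown in the report; they are not taken from Aspell's language data files.

```python
# Simplified model of alphabet-restricted word splitting.
# ASSUMPTION: the alphabets below are invented to reproduce the output in
# this report; they are not Aspell's real per-language character sets.

SAMPLE = "äöüß ÄÖÜ ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ"

ALL_LETTERS = set(SAMPLE) - {" "}
ALPHABETS = {
    "pl": ALL_LETTERS,                     # every sample letter accepted
    "de": ALL_LETTERS - set("łŁ"),         # hypothetical: only ł/Ł rejected
    "en": ALL_LETTERS - set("ąćęłĄĆĘŁ"),   # hypothetical: ą, ć, ę, ł rejected
}


def split_words(text, alphabet):
    """Return the maximal runs of characters that belong to `alphabet`.

    Any character outside the alphabet ends the current word and is
    discarded, so a word containing such a character is cut in two.
    """
    words, current = [], []
    for ch in text:
        if ch in alphabet:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


if __name__ == "__main__":
    for lang in ("pl", "de", "en"):
        print(lang, split_words(SAMPLE, ALPHABETS[lang]))
```

Under these assumptions the "pl" alphabet keeps all four words whole, while the "de" and "en" alphabets cut the Polish words at the rejected letters, which is the same pattern as in the three `aspell list` runs at the top of the report.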

If this is important to you, I have a solution in mind. Please post to aspell-user@gnu.org, as this is not the place for it.

przemoc - 2009-01-19

Since you ask: of course I want aspell to spit out Russian words. Recognition depends on the chosen language. Everything that is not in the chosen dictionary should be left untouched.

Now I see the point: aspell is 8-bit internally, so UTF-8 support is only a feint.

OK, I'll post to aspell-user@gnu.org.

Kevin Atkinson - 2009-01-20

For the solution, see:
http://lists.gnu.org/archive/html/aspell-user/2009-01/msg00004.html

Kevin Atkinson - 2009-01-20

status: open --> closed