From: Kevin Patrick Scannell <scannell@sl...> - 2005-09-23 17:48:30
On 09:58 Fri 23 Sep , Jonathon Blake wrote:
>> In Afrikaans we have the word 'n (which is similar to the English word a)
> That should be one Unicode character. I don't remember the Unicode
> number offhand, but it is in the Latin Extended-A range. [I devote one
> paragraph to that character in _OOo in a Multi-Lingual Environment_.]
Interesting! Thanks Jonathon. I see it now in my print copy - it's
code point U+0149. gramadoir-af is currently set up as Latin-1,
so obviously that would need to be changed to permit this
character (it definitely makes tokenizing easier).
This is something that will need some work in the
gramadoir engine eventually. As it stands now, you could
redo gramadoir-af in Unicode and then flag an error every time
the two-character ASCII sequence "'n" appears in place of
your Unicode character above -- but presumably this isn't ideal,
since I'm guessing the great majority of "real" texts use "'n".
It would be nice to allow user-defined filters that preprocess
input texts and normalize them to the "best-practice" encoding which
would also be the one used in the gramadoir-xx lexicon.
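
A filter of this kind might look roughly like the Python sketch below. It is
only an illustration under stated assumptions: gramadoir has no such hook
today, and the names normalize_af_article, FILTERS, and preprocess are made up
for the example. It rewrites the standalone two-character sequence "'n" as
U+0149 and leaves apostrophes inside other words alone.

```python
import re
from typing import Callable, Dict, List

# Hypothetical names throughout; none of this is part of the gramadoir engine.
# Match "'n" only as a standalone token: the apostrophe must not follow a
# word character, and the "n" must not be followed by one.
_AF_ARTICLE = re.compile(r"(?<!\w)'n(?!\w)")


def normalize_af_article(text: str) -> str:
    """Replace the ASCII sequence 'n with the single character U+0149."""
    return _AF_ARTICLE.sub("\u0149", text)


# Assumed registry: language code -> filters applied in order at load time.
FILTERS: Dict[str, List[Callable[[str], str]]] = {
    "af": [normalize_af_article],
}


def preprocess(text: str, lang: str) -> str:
    """Run every filter registered for lang; unknown languages pass through."""
    for apply_filter in FILTERS.get(lang, []):
        text = apply_filter(text)
    return text
```

With something like this in place, the lexicon could be stored in the
"best-practice" encoding only, and preprocess(text, "af") would run before
tokenizing.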
I talked about a similar kind of normalization last year
with Pablo S. for Walloon - he told me that proper typesetting
of Walloon texts requires a non-breaking space after apostrophes,
as in "dj' a", even though most real texts just use a plain ASCII
space. These could all be fixed up right when the text is loaded.
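
The Walloon case could be handled by the same kind of load-time filter. The
Python sketch below is a guess at the simplest rule, not something gramadoir
provides: the function name is hypothetical, and it assumes that any plain
space directly after a letter plus apostrophe should become U+00A0. That rule
would also catch a closing single quote followed by a space, so a real filter
would need to be more careful.

```python
import re

# Hypothetical filter, not part of gramadoir.
# Match a plain ASCII space that directly follows a word character and an
# apostrophe, as in "dj' a".
_WA_ELISION_SPACE = re.compile(r"(?<=\w') ")


def normalize_wa_apostrophe_space(text: str) -> str:
    """Turn the space after an elision apostrophe into a non-breaking space."""
    return _WA_ELISION_SPACE.sub("\u00a0", text)
```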