ElixirFM

Otakar Smrž

Functional Arabic Morphology

ElixirFM is a high-level implementation of Functional Arabic Morphology, described in http://ufal.mff.cuni.cz/~smrz/elixir-thesis.pdf. The core of ElixirFM is written in Haskell, while interfaces in Perl support lexicon editing and other interactions.

ElixirFM is further documented in the ElixirHaskell and ElixirPerl guidelines.

Support

ElixirFM has been developed within the 'Resources and Tools for Information Systems' project (No. 1ET101120413) financed by the Grant Agency of the Academy of Sciences of the Czech Republic.

Introduction

ElixirFM is an implementation of a novel computational model of the morphological processes in Modern Written Arabic, which was introduced in (Smrž, 2007) and is still in active development (Smrž and Bielický, 2009). In the larger context, ElixirFM is related to the Prague Arabic Dependency Treebank project (Hajič et al., 2004, Smrž et al., 2008).

The system includes two essential components: a multi-purpose programming library promoting clear style and abstraction in the model, and a linguistically refined, yet intuitive and efficient, morphological lexicon. There are various interfaces to the system, ranging from command-line interpreters and executables to graphical linguistic annotation environments and user-friendly web applications, such as the ElixirFM Online Interface at http://quest.ms.mff.cuni.cz/elixir/.

ElixirFM provides the user with four different modes of operation, in addition to the unique lexical resource and all the other open-source material of the implementation:

  • Resolve
    provides tokenization and morphological analysis of the inserted text, even if one omits some symbols or does not spell everything correctly. The text can be entered not only in the original script and orthography, but also in other notations, including a purely phonetic transcription.
  • Inflect
    transforms words into the forms required by context. The user only needs to define the grammatical parameters of the expected word forms, which can be encoded using either common natural language descriptions or concise positional morphological tags.
  • Derive
    converts words into their counterparts of similar meaning but different grammatical category, specified via natural language descriptions or morphological tags.
  • Lookup
    can look up lexical entries by their citation form and nests of entries by their root. One can even browse the dictionary in the reverse direction, searching for expressions in English.

The online interface includes example queries for each of the modes and incorporates modern web tools to facilitate the input method and to interactively organize the output of the system.

For details on the complete system and its underlying technology, as well as for a proper discussion of the relevant literature, please consult the references. ElixirFM is in some aspects inspired by the methodology of Functional Morphology (Forsberg and Ranta, 2004) and initially relied on the re-processed Buckwalter lexicon (Buckwalter, 2002).

Characteristics

One of the crucial abstractions in ElixirFM is that word forms are encoded via carefully designed morphophonemic patterns that interlock with roots or literal word stems. These templates can be merged very efficiently into a string of characters in the extended ArabTeX notation that can be further converted into either the original Arabic orthography, or into a phonetic transcription. The formal type of a pattern that defines a word stem can be one of triliteral, quadriliteral, literal, or string.
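The interlocking of a root with a pattern can be illustrated by a minimal Haskell sketch. It is not the ElixirFM library code: the names Root, Pattern and mergeTri are hypothetical, patterns are plain strings rather than typed templates, and the capitals F, C, L marking the three root slots are an assumption of this sketch. It covers only the triliteral case and applies no phonological rules.

```haskell
-- Hypothetical, simplified stand-in for merging a root with a pattern.
-- In a pattern, the capitals F, C, L mark the slots of the first, second
-- and third root consonant; every other character is copied literally.

type Root    = String   -- root consonants separated by spaces, e.g. "k t b"
type Pattern = String   -- e.g. "FaCaL", "FACiL", "maFCUL"

-- | Interlock a triliteral root with a pattern.
-- Returns Nothing when the root does not have exactly three consonants.
mergeTri :: Root -> Pattern -> Maybe String
mergeTri root pattern =
  case words root of
    [f, c, l] -> Just (concatMap (slot f c l) pattern)
    _         -> Nothing
  where
    -- Replace a slot marker by its consonant, keep anything else as is.
    slot f _ _ 'F' = f
    slot _ c _ 'C' = c
    slot _ _ l 'L' = l
    slot _ _ _ ch  = [ch]
```

In GHCi, mergeTri "k t b" "FaCaL" evaluates to Just "katab" and mergeTri "k t b" "maFCUL" to Just "maktUb", both strings in an ArabTeX-like notation. A root with the wrong number of consonants, such as "k t", yields Nothing.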

Morphology is modeled in terms of paradigms, grammatical categories, lexemes and word classes. ElixirFM implements the comprehensive rules that draw the information from the lexicon and generate the word forms given the appropriate morphosyntactic parameters. Inflected forms need not be merged with roots yet, and can retain their internal structure.

The lexicon and the parameters determine the choice of paradigms. There are just three verbal paradigms, which define the affixes for perfectives, imperfectives, and imperatives. The nominal paradigms of inflection need to discern no more than five kinds of structural endings. A highlight of Arabic morphology is that ‘irregular’ inflection actually consists in strictly observing some additional rules, which are phonological in nature. The clarity of the model is due to the design of the morphophonemic templates and of the merge function, in which the phonological rules are actually enforced, irrespective of the inflectional parameters or lexical information.
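To make the idea of a paradigm as a table of affixes concrete, here is a small Haskell sketch. It is an illustration only, not the ElixirFM definition: the types Aspect, Person, Gender and the functions affixes and inflect are hypothetical names, the table covers only singular forms of sound Form I verbs (imperfective in the indicative), the imperative is left out, and stems are passed in as ready-made strings.

```haskell
-- Hypothetical sketch: a verbal paradigm as a table of (prefix, suffix)
-- pairs. Only singular forms of sound Form I verbs are covered.

data Aspect = Perfective | Imperfective deriving (Eq, Show)
data Person = First | Second | Third    deriving (Eq, Show)
data Gender = Masc | Fem                deriving (Eq, Show)

-- | Affixes in an ArabTeX-like notation (I = long i, ' = hamza).
affixes :: Aspect -> Person -> Gender -> (String, String)
affixes Perfective   Third  Masc = ("",   "a")
affixes Perfective   Third  Fem  = ("",   "at")
affixes Perfective   Second Masc = ("",   "ta")
affixes Perfective   Second Fem  = ("",   "ti")
affixes Perfective   First  _    = ("",   "tu")
affixes Imperfective Third  Masc = ("ya", "u")
affixes Imperfective Third  Fem  = ("ta", "u")
affixes Imperfective Second Masc = ("ta", "u")
affixes Imperfective Second Fem  = ("ta", "Ina")
affixes Imperfective First  _    = ("'a", "u")

-- | Attach the affixes selected by the parameters to a given stem.
inflect :: Aspect -> Person -> Gender -> String -> String
inflect aspect person gender stem = prefix ++ stem ++ suffix
  where (prefix, suffix) = affixes aspect person gender
```

For instance, inflect Perfective Third Fem "katab" gives "katabat", and inflect Imperfective Second Fem "ktub" gives "taktubIna". Plain concatenation suffices here only because the stems are sound. For weak roots, the phonological rules would have to apply when the form is merged, which this sketch does not attempt.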

The introflexive template selection mechanism differs for nominals (providing plural or feminine word forms) and for verbs (providing all needed stem alternations, to the extent that they are specified in the entries of printed dictionaries like Wehr, 1979), yet it is quite clear-cut. It seems even more explicit and modular than what can be found in the best grammars (Fischer, 2002, Holes, 2004, Badawi et al., 2004, Ryding, 2005). The morphological model is thus greatly simplified while remaining accurate. Again, we credit this to the particular design of the morphophonemic patterns.

ElixirFM also implements derivation, in any direction, between verbs, active or passive participles, and masdars, i.e. deverbal nouns (which can be, if necessary, associated with their verbs in the lexicon). Other kinds of lexical derivation processes can be rendered right in the internal structure of the morphophonemic templates of words.

Unless a root consonant is weak, i.e. one of y, w or hamza, and unless it assimilates in certain kinds of patterns, this consonant will be part of any word form defined with this root. ElixirFM effectively exploits this inflectional invariant during the resolution of word forms. It checks the derivations and inflections of the identified or hypothesized roots only, and need not inflect the whole lexicon before analyzing the inflected forms in question. While this seems the obvious way in which learners of Arabic analyze unknown words to look them up in the dictionary, it contrasts strongly with the practice in the design of computational analyzers, where pre-compiled finite-state transducers or, analogously, tries are most often used. Of course, languages other than Arabic need not have such convenient invariants. The situation is thus very fortunate for the generative model in ElixirFM.
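The root invariant can be sketched in Haskell as follows. This is a toy illustration rather than the ElixirFM resolver: the lexicon, the names toyLexicon, candidateRoots and hypothesize, and the crude vowel test are all assumptions of the sketch, and weak or assimilating consonants are ignored. The point is only that a word form narrows the search to the few roots whose consonants it contains in order.

```haskell
import Data.List (nub, subsequences)

type Root = String

-- Hypothetical toy lexicon: roots paired with some of their entries,
-- written in an ArabTeX-like notation.
toyLexicon :: [(Root, [String])]
toyLexicon =
  [ ("ktb", ["katab", "kitAb", "kAtib", "maktab"])
  , ("drs", ["daras", "dars", "madrasaT"])
  ]

-- Crude assumption: these letters are vowels, everything else a consonant.
isVowel :: Char -> Bool
isVowel ch = ch `elem` "aiuAIU"

-- | All ordered three-consonant subsequences of a word form.
candidateRoots :: String -> [Root]
candidateRoots form =
  nub [ r | r <- subsequences (filter (not . isVowel) form), length r == 3 ]

-- | Only the lexicon nests of hypothesized roots need to be inflected
-- and compared with the form; the rest of the lexicon is never touched.
hypothesize :: String -> [(Root, [String])]
hypothesize form =
  [ (r, entries) | r <- candidateRoots form, Just entries <- [lookup r toyLexicon] ]
```

With this lexicon, map fst (hypothesize "yaktubUna") is ["ktb"], so only the entries under that one root would have to be inflected and compared with the input.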

When inflected word forms are combined in speech or in writing, additional phonological and orthographic changes can take place. Inverting this process is called tokenization, and is usually nondeterministic. Modeling the morphosyntactic constraints among the reconstructed tokens, such as that verbs be followed by accusatives and nominals by genitives, can be viewed as a special case of restrictions on the hypothesized properties of the tokens. ElixirFM presents the results of tokenization and morphological analysis in the form of MorphoTrees (Smrž and Pajas, 2004), which introduce intuitive hierarchies over the tokens and their readings that can be further pruned and disambiguated.
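The nondeterminism of tokenization can be pictured with a short Haskell sketch that returns every possible split as a list. It is a simplification under stated assumptions, not the ElixirFM tokenizer: the names proclitics, tokenize and validTokenizations are hypothetical, only four proclitics in a Latin transliteration are considered, and no phonological or orthographic changes at token boundaries are undone.

```haskell
import Data.List (stripPrefix)

-- Hypothetical set of proclitics in a Latin transliteration.
proclitics :: [String]
proclitics = ["wa", "fa", "bi", "li"]

-- | Every way of splitting zero or more proclitics off the front of a
-- form. The unsplit form itself is always among the results.
tokenize :: String -> [[String]]
tokenize form =
  [form] :
  [ p : rest
  | p <- proclitics
  , Just remainder <- [stripPrefix p form]
  , not (null remainder)
  , rest <- tokenize remainder
  ]

-- | Prune the hypotheses: keep a tokenization only if its final token is
-- a known word form (a stand-in for real morphosyntactic constraints).
validTokenizations :: [String] -> String -> [[String]]
validTokenizations known = filter (\tokens -> last tokens `elem` known) . tokenize
```

Here tokenize "wabikitAbihi" yields three hypotheses: the unsplit form, ["wa", "bikitAbihi"], and ["wa", "bi", "kitAbihi"]. Given the known form "kitAbihi", validTokenizations keeps only the last one. Constraints on the hypothesized properties of the tokens would prune the alternatives in the same manner.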

Examples

Screenshots of the ElixirFM lexicon being edited in TrEd (Pajas and Štěpánek, 2008). The left pane displays some valency frames associated with the entries (Bielický and Smrž, 2008).

References

Elsaid Badawi, Mike G. Carter, and Adrian Gully. Modern Written Arabic: A Comprehensive Grammar. Routledge, 2004.

Viktor Bielický and Otakar Smrž. Building the Valency Lexicon of Arabic Verbs. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, 2008.

Viktor Bielický and Otakar Smrž. Enhancing the ElixirFM Lexicon with Verbal Valency Frames. In Proceedings of the Second International Conference on Arabic Language Resources and Tools (MEDAR 2009), Cairo, Egypt, 2009.

Tim Buckwalter. Buckwalter Arabic Morphological Analyzer Version 1.0. LDC2002L49, ISBN 1-58563-257-0, 2002.

Wolfdietrich Fischer. A Grammar of Classical Arabic. Yale Language Series. Yale University Press, 2002.

Markus Forsberg and Aarne Ranta. Functional Morphology. In Proceedings of the Ninth ACM SIGPLAN International Conference on Functional Programming, ICFP 2004, pages 213–223, Snowbird, Utah, 2004.

Jan Hajič, Otakar Smrž, Petr Zemánek, Jan Šnaidauf, and Emanuel Beška. Prague Arabic Dependency Treebank: Development in Data and Tools. In NEMLAR International Conference on Arabic Language Resources and Tools, pages 110–117, Cairo, Egypt, 2004.

Clive Holes. Modern Arabic: Structures, Functions, and Varieties. Georgetown University Press, 2004.

Petr Pajas and Jan Štěpánek. Recent Advances in a Feature-Rich Framework for Treebank Annotation. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 673–680, Manchester, United Kingdom, 2008.

Karin C. Ryding. A Reference Grammar of Modern Standard Arabic. Cambridge University Press, 2005.

Otakar Smrž and Petr Pajas. MorphoTrees of Arabic and Their Annotation in the TrEd Environment. In NEMLAR International Conference on Arabic Language Resources and Tools, pages 38–41, Cairo, Egypt, 2004.

Otakar Smrž. ElixirFM — Implementation of Functional Arabic Morphology. In ACL 2007 Proceedings of the Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, pages 1–8, Prague, Czech Republic, 2007.

Otakar Smrž. Functional Arabic Morphology: Dissertation Summary. Prague Bulletin of Mathematical Linguistics, 88:5–30, 2007.

Otakar Smrž. Functional Arabic Morphology. Formal System and Implementation. PhD thesis, Charles University in Prague, 2007.

Otakar Smrž, Viktor Bielický, Iveta Kouřilová, Jakub Kráčmar, Jan Hajič, and Petr Zemánek. Prague Arabic Dependency Treebank: A Word on the Million Words. In Proceedings of the Workshop on Arabic and Local Languages (LREC 2008), pages 16–23, Marrakech, Morocco, 2008.

Otakar Smrž. ElixirFM Functional Arabic Morphology. ALECSO Workshop on Arabic Morphological Analyzers, Academy of the Arabic Language, Damascus, 2009.

Otakar Smrž. Prague Arabic Dependency Treebank: Research Directions. ALECSO Workshop on Enrichment of Arabic Digital Content, Higher Institute for Applied Science and Technology, Damascus, 2010.

Otakar Smrž and Viktor Bielický. ElixirFM. High-level implementation of Functional Arabic Morphology, http://sourceforge.net/projects/elixir-fm/, 2010.

Otakar Smrž and Hyun-Jo You. Finding the Structure of Words. In Multilingual Natural Language Applications: From Theory to Practice, Daniel Bikel and Imed Zitouni (eds.), Prentice Hall, to appear.

Hans Wehr. Arabic-English Dictionary: The Hans Wehr Dictionary of Modern Written Arabic. Spoken Language Services, 1979.


Related

Wiki: ElixirHaskell
Wiki: ElixirPerl