CSAw - NLP for low-resource languages Wiki

CSAw is an NLP framework for low-resource languages

Brought to you by: patrickcconnor


The CSAw Framework (pronounced “seesaw”) is designed to perform advanced natural language processing functions using a rule-based approach. The name CSA comes from the “Concept Specification and Abstraction” semantic representation (SR), which forms the basis of the framework. The primary capabilities, upon which advanced tools are to be built, are the morphological and syntactic parsing of surface text into an SR, transfer from an SR of a source language to an SR of a receptor language, and synthesis of surface text from an SR. You might say this bears some resemblance to the action of a seesaw, hence the name.

The framework’s primary purpose is to provide computer-aided translation tools by automatically building a receptor language model from a very small bilingual corpus. It contains many capabilities common to NLP toolkits, such as tokenization, part-of-speech tagging, and labelled dependency parsing. Examples of foreseeable advanced tools, besides translation, include rephrasing within a language, grammar checking, search by grammatical form, and translation quality checking.

The CSA semantic representation is described thoroughly in a research article. The goal of the CSA SR is to provide the simplest way of canonically describing the exact meaning of a surface text, devoid of morphology and word order. We have also made it a goal to provide the most intuitive means of manually building a lexicon and morphosyntactic rules, which, together with a few more elements, make up the model of a language.

CSAw differs from most other rule-based engines used for translation (e.g., Apertium). Instead of performing a shallow transfer, where the syntax (or word order) of one language is mapped onto another, CSAw performs deep parsing and generation monolingually, and the transfer step is purely concept-to-concept. We believe this reduces the challenge of dealing with the sometimes great morphosyntactic differences between disparate languages. In addition, the only bilingual knowledge required is conceptual, not morphosyntactic, so monolingual speakers can manually develop the language model of their mother tongue in full, if desired.
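The three stages (parsing, transfer, synthesis) and the concept-only transfer can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `LanguageModel` class, the functions `parse`, `transfer`, and `synthesize`, the role names, the concept labels, and the invented receptor-language words are all hypothetical and are not CSAw's API or the actual CSA SR, which is considerably richer. The sketch only shows the shape of the idea: word order lives in each monolingual model, while the bilingual resource is a flat concept-to-concept map.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageModel:
    """Toy monolingual model (hypothetical): a lexicon plus one word-order rule."""
    lexicon: dict  # surface word -> language-specific concept
    order: tuple   # semantic roles in the order they appear on the surface


def parse(text, model):
    """Surface text -> toy SR (a dict of role -> concept, with no word order)."""
    words = text.split()
    return {role: model.lexicon[word] for role, word in zip(model.order, words)}


def transfer(sr, concept_map):
    """Source SR -> receptor SR. Only concepts are mapped; roles are untouched."""
    return {role: concept_map[concept] for role, concept in sr.items()}


def synthesize(sr, model):
    """Toy SR -> surface text, using the receptor model's own word order."""
    surface = {concept: word for word, concept in model.lexicon.items()}
    return " ".join(surface[sr[role]] for role in model.order)


# Source model: English-like, subject-verb-object order.
source = LanguageModel(
    lexicon={"dog": "src:DOG", "sees": "src:SEE", "cat": "src:CAT"},
    order=("agent", "predicate", "patient"),
)

# Receptor model: an invented language with subject-object-verb order.
receptor = LanguageModel(
    lexicon={"balu": "rec:DOG", "sena": "rec:SEE", "tiko": "rec:CAT"},
    order=("agent", "patient", "predicate"),
)

# The only bilingual knowledge: a concept-to-concept map, no syntax rules.
concept_map = {"src:DOG": "rec:DOG", "src:SEE": "rec:SEE", "src:CAT": "rec:CAT"}

source_sr = parse("dog sees cat", source)
receptor_sr = transfer(source_sr, concept_map)
print(synthesize(receptor_sr, receptor))  # balu tiko sena
```

Note that no rule in the sketch mentions both languages' word orders at once: the reordering falls out of parsing with one monolingual model and synthesizing with the other.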

This framework is currently in the pre-alpha stage and is provided primarily for informational purposes. In time, the project may become more active. Further documentation is provided as part of the release, including an overview of the framework's capabilities, the process of building a language model manually, and a living document describing the steps toward building a language model (lexicon and rule set) automatically from limited bilingual text using this approach.