part2 of suggestions: RE: [Indic-computing-devel] Language Info Guidelines
From: fpohlmann <fpo...@ba...> - 2002-07-08 16:10:59
Section 1 Linguistic Analysis
--------------------------------

Some more in-depth background of the language from a linguistic
perspective, with a focus on issues relevant to computing, display and
text processing.

"I would be more specific and say: natural language processing (NLP),
grammatical analysis, searching and sorting, corpus linguistics (covers
much of text processing) and visual display."

- List of Writing Systems: A list of different writing systems used to
  represent the language in text. For each writing system, one would try
  to include:
  - Graphemes: the basic glyphs present in the system, basic combination
    rules, and mapping to semantic characters.
    "Meaning? What do you mean by 'present in the system'?"
  - Usage: Usage details (Is it still used? Where? For what purpose? By
    how many people in what contexts?)
    "Yes, we would also have to make very clear (if relevant) whether a
    writing system is relevant for the processing and display of
    technical documentation as opposed to, say, medieval Kannada
    literature."
- Basic Grammatical Info: Grammatical information about basic sentence
  structure and grammar rules.
  "We might want to adopt a somewhat old-fashioned approach and start off
  with the basic phonemes and graphemes, then continue with words (and
  morphemes generally), then proceed to sentences (syntax) and end with
  texts (corpus linguistics for pre-modern texts and technical/scientific
  conventions)."

Section 2 Character Encoding
-----------------------------

- List of Encodings: A list of character encodings to store this language
  in digital format.
  - Size of a character: (in bits)
  - Code Point / Character Map: Map between code points and semantic
    characters.
  - Outstanding Issues: Issues with how this encoding represents the
    language.
    Types of issues could include the following:
    - Missing Chars
    - Missing Semantics
    - Missing Processing Rules
    - Redundant / Extra Chars
    - Erroneous Semantics
    - Erroneous Processing Rules
  - Writing Systems / Language Variants Supported: Which variants and
    different writing systems does this encoding support for the given
    language?
  - Who created the encoding?
  - Who is in charge of the encoding management and modification process?
  - Software / OS support - What software and OSes support this encoding?
    "This is a bit of a nightmare to document :) I would focus on the
    (network) operating systems and system libraries supporting the char
    sets. You might want to add database support (Oracle, MySQL and the
    like) and don't forget standards like XML and Unicode. Unicode by no
    means supports all Indian languages at this point. I would not pay
    too much attention to e.g. AbiWord or M$ Excel though."
  - IDE / DB Support - What database systems and development tools
    support this encoding?
    "Yes, well. Perhaps:
    1) (network) operating systems like Linux, FreeBSD, Solaris,
       Windows XP, Novell etc.
    2) Databases
    3) Programming languages and their base libraries (C, C++, Java,
       Perl, Python etc.)
    4) Standards (Unicode, ISO 10646, XML, Linux Standard Base (LSB)
       etc.)"

Section 3 Fonts
----------------

- List of fonts or font families available for this language. For each
  font:
  - What type of font is it? (TTF, Type 1, X Window, OTF, other)
  - What is the availability?
  - Who is the creator of the font?
  - Who currently manages / develops / owns the font?
  - Is it Open Source?
  - What encodings are supported?
  - What is the glyph set?
  - Brief description of semantic character / glyph mapping
  - Brief description of positioning and substitution issues

Section 4 Input Methods
-------------------------

- List of Keyboard Layouts for a language
  - Keyboard Type - keyboard types (hardware) supported
  - Key - Char Mapping - Mapping between keys and code points
  - Usage Information - Information about how the layout is used in
    practice
    - Prevalence
    - Types of Users
  - Encodings Supported

Section 5 Text Processing
--------------------------

Information about the language useful from a text processing (searching,
sorting, spelling, etc.) point of view.

- List of Sort Orders - Different ways the language can be sorted.
- Searching / Matching Semantics - What it means for one word to equal
  another.
- Word Roots
- Prefix / Suffix Rules
- Line Break Rules - When to break a line
- Hyphenation Rules

"This relates very much to info provided in sections 0 and 1. Maybe this
should be section 2? It is linguistics that's discussed here."

Section 6 Typography and Display
--------------------------------

- Basics
- Ligatures
- Punctuation
- Justification

"What about questions of mixing languages and having 2 commonly used
languages in one window? E.g. English and Hindi, or Hindi and another
Indian language."

Section 7 Locale Info
----------------------

Locale-Specific Information would include info about the following:

- List of Possible Locales - List of locales the language could be
  applicable for. Could refer to a previously described locale.
- Time
  - Time Systems
  - Clock Time
  - Calendar
- Numeric System
- Measures
- Currency
- Salutations

Section 8 New Areas
--------------------

A list of people / projects working on each of the following for the
language:

- Text to Speech Support
- Voice Recognition
- OCR Support
- Natural Language Processing and Machine Translation

Section 9 Language Resources
--------------------------------

Other important resources regarding the language:

- Local Language Software Available - Different types of software and
  systems that support the language in one way or another
- Organizations - Different organizations, people and institutions
  interested in the language, either from a computing perspective or not
- Dictionaries - On-line and off-line dictionaries for the language
- Other Language Links and Resources

"And books, articles, linguistic and otherwise, about the languages. What
is missing is the question of marking up text in Indian languages, e.g.
using SGML, HTML, XML etc."

Ok, thanks.

-Frank