part2 of suggestions: RE: [Indic-computing-devel] Language Info Guidelines

Section 1 Linguistic Analysis
--------------------------------
Some more in-depth background of the language from a linguistic 
perspective, with a focus on issues relevant to computing, display and text 
processing.

"I would be more specific and say: natural language processing (NLP), 
grammatical analysis, searching and sorting, corpus linguistics (covers 
much of text processing) and visual display."

	- List of Writing Systems : A list of different writing systems used 
to represent the language in text.  For each writing system, one would 
try to include:
		- Graphemes: the basic glyphs present in the system,

"Meaning?  What do you mean by 'present in the system'?"

 basic 
combination rules, and mapping to semantic characters.
		- Usage: Usage details (Is it still used?  Where?  For what purpose?)

"Yes, we would also have to make very clear, where relevant, whether a 
writing system matters for the processing and display of technical 
documentation (as opposed to, say, medieval Kannada literature)."

By how many people in what contexts?)
	- Basic Grammatical Info: Grammatical information about basic sentence 
structure and grammar rules.

"We might want to adopt a somewhat old-fashioned approach and start off with 
the basic phonemes and graphemes, then continue with words (and morphemes 
generally), then proceed to sentences (syntax) and end with texts (corpus 
linguistics for pre-modern texts and technical/scientific conventions)."

Section 2 Character Encoding
-----------------------------
	- List of Encodings:  A list of character encodings to store this 
language in digital format.
		- Size of a character: (in bits)
		- Code Point / Character Map: Map between code points and semantic 
characters.
		- Outstanding Issues: Issues with how this encoding represents the 
language.  Types of issues could include the following:
			- Missing Chars
			- Missing Semantics
			- Missing Processing Rules
			- Redundant / Extra Chars
			- Erroneous Semantics
			- Erroneous Processing Rules
		- Writing Systems / Language Variants Supported: Which variants and 
different writing systems does this encoding support for the given 
language.
		- Who created the encoding?
		- Who is in charge of the encoding management and modification 
process?
		- Software / OS support - What software and OS's support this 
encoding.

"This is a bit of a nightmare to document :) I would focus on the (network) 
operating systems and system libraries supporting the char sets. You might 
want to add database support (Oracle, MySQL and the like), and don't forget 
standards like XML and Unicode. Unicode by no means supports all Indian 
languages at this point. I would not pay too much attention to e.g. AbiWord 
or M$ Excel though."

		- IDE / DB Support - What Database systems and Development tools 
support this encoding?

"Yes, well. Perhaps:

1) (network) operating systems like Linux, FreeBSD, Solaris, Windows XP, 
Novell etc.
2) Databases
3) Programming languages and their base libraries (C, C++, Java, Perl, Python 
etc.)
4) Standards (Unicode, ISO/IEC 10646, XML, Linux Standard Base (LSB) etc.)"
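
To make the "Size of a character" and "Code Point / Character Map" items 
concrete, here is a minimal sketch in Python. It assumes Unicode as the 
encoding under review and uses a Devanagari sample word picked purely for 
illustration; the helper names code_point_map and encoded_bits are 
hypothetical, not part of any proposed guideline.

```python
import unicodedata

# Sample word chosen for illustration only: "Hindi" written in Devanagari.
sample = "\u0939\u093f\u0928\u094d\u0926\u0940"

def code_point_map(text):
    """Return (code point, Unicode character name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

def encoded_bits(text, encoding):
    """Return how many bits the text occupies once stored in `encoding`."""
    return len(text.encode(encoding)) * 8

for cp, name in code_point_map(sample):
    print(cp, name)

# The same six characters take a different amount of storage per encoding.
for enc in ("utf-8", "utf-16-be", "utf-32-be"):
    print(enc, encoded_bits(sample, enc), "bits")
```

A table like this, produced per encoding, would let a reader compare 
storage cost and spot code points that an encoding cannot represent.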

Section 3 Fonts
----------------
	- List of fonts or font families available for this language.  For 
each font,
		- What type of Font is it? (TTF, Type 1, X Window, OTF, other)
		- What is the availability?
		- Who is the creator of the font?
		- Who currently manages / develops / owns the font?
		- Is it Open Source?
		- What encodings are supported?
		- What is the glyph set?
		- Brief description of semantic character / glyph mapping
		- Brief description of positioning and substitution issues

Section 4 Input Methods
-------------------------
	- List of Keyboard Layouts for a language
		- Keyboard Type - keyboard types (hardware) supported
		- Key - Char Mapping - Mapping between keys and code points
		- Usage Information - Information about how the layout is used in 
practice
			- Prevalence
			- Types of Users
	- Encodings Supported
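
The "Key - Char Mapping" item could be recorded as a simple table from key 
names to code points. The sketch below is only an illustration: the mapping 
is hypothetical and is not taken from any published keyboard layout, and 
the names KEY_MAP and keys_to_text are made up for this example.

```python
# Hypothetical key-to-code-point table, for illustration only; it does not
# reproduce any real keyboard layout.
KEY_MAP = {
    "k": 0x0915,  # DEVANAGARI LETTER KA
    "m": 0x092E,  # DEVANAGARI LETTER MA
    "l": 0x0932,  # DEVANAGARI LETTER LA
    "a": 0x093E,  # DEVANAGARI VOWEL SIGN AA
}

def keys_to_text(keys, key_map=KEY_MAP):
    """Translate a sequence of key names into text.

    Keys without an entry in the table are passed through unchanged.
    """
    return "".join(chr(key_map[k]) if k in key_map else k for k in keys)

print(keys_to_text("kml"))  # the three consonants KA, MA, LA
```

Real layouts also depend on modifier keys and dead keys, which a flat 
table like this does not capture.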

Section 5 Text Processing
--------------------------
Information about the language useful from a text processing 
(searching, sorting, spelling, etc.) point of view.
	- List of Sort Orders - Different ways the language can be sorted.
	- Searching / Matching Semantics - What it means for one word to equal 
another.
	- Word Roots
		- Prefix / Suffix Rules
	- Line Break Rules - When to break a line
		- Hyphenation Rules

"This relates very much to info provided in sections 0 and 1. Maybe this 
should be section 2? It is linguistics that's discussed here."
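
As one illustration of "Searching / Matching Semantics", the sketch below 
assumes Unicode text and shows two spellings of the same Devanagari letter 
that differ code point by code point but compare equal after Unicode 
normalization. The helper name same_text is hypothetical, and normalization 
is only one of several possible matching rules.

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Compare two strings after applying the same Unicode normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

precomposed = "\u0958"       # DEVANAGARI LETTER QA as a single code point
decomposed = "\u0915\u093c"  # DEVANAGARI LETTER KA followed by SIGN NUKTA

print(precomposed == decomposed)            # raw code point comparison
print(same_text(precomposed, decomposed))   # comparison after normalization
```

A language entry could list which such equivalences a search should honour 
and which it should not.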

Section 6 Typography and Display
--------------------------------
	- Basics
	- Ligatures
	- Punctuation
	- Justification

"What about questions of mixing languages and having 2 commonly used languages 
in one window? E.g. English and Hindi, or Hindi and another Indian language."
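
For the mixed-language case, a display layer first has to split a line into 
runs of a single script before choosing fonts. The following is a rough 
sketch under simplifying assumptions: only the Devanagari block 
(U+0900 to U+097F) and ASCII letters are recognised, and the names 
script_of and script_runs are hypothetical.

```python
from itertools import groupby

def script_of(ch):
    """Classify a character by code point range (a deliberately coarse test)."""
    cp = ord(ch)
    if 0x0900 <= cp <= 0x097F:
        return "Devanagari"
    if cp < 0x80 and ch.isalpha():
        return "Latin"
    return "Common"  # spaces, digits, punctuation and everything else

def script_runs(text):
    """Split text into (script, substring) runs of consecutive characters."""
    return [(script, "".join(chars))
            for script, chars in groupby(text, key=script_of)]

mixed = "Hindi \u0939\u093f\u0928\u094d\u0926\u0940"
for script, run in script_runs(mixed):
    print(script, repr(run))
```

Each run could then be handed to a font that covers that script; a fuller 
treatment would use proper script properties rather than block ranges.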

Section 7 Locale Info
----------------------
Locale-Specific Information would include info about the following:
	- List of Possible Locales - List of locales the language could be 
applicable for.  Could refer to a previously described locale.
		- Time - Time Systems
			- Clock Time
			- Calendar
		- Numeric System
		- Measures
		- Currency
		- Salutations
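
The "Numeric System" item covers both digit shapes and digit grouping. As a 
small sketch, assuming the lakh/crore grouping commonly used in India and 
Devanagari digits (U+0966 to U+096F), the two could be described as below; 
the function names group_indian and to_devanagari_digits are hypothetical.

```python
def group_indian(n):
    """Format a non-negative integer with Indian-style digit grouping.

    The last three digits form one group; the remaining digits are
    grouped in twos, counting from the right.
    """
    s = str(n)
    if len(s) <= 3:
        return s
    head, tail = s[:-3], s[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])

def to_devanagari_digits(s):
    """Replace ASCII digits in s with the corresponding Devanagari digits."""
    return s.translate({ord("0") + i: 0x0966 + i for i in range(10)})

print(group_indian(12345678))
print(to_devanagari_digits(group_indian(12345678)))
```

Which grouping and digit set apply to a given locale is exactly the kind of 
fact this section would need to record per language.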

Section 8 New Areas
--------------------
A list of people / projects working on each of the following for the 
language:
	- Text to Speech Support
	- Voice Recognition
	- OCR Support
	- Natural Language Processing and Machine Translation

Section 9 Language Resources
--------------------------------
Other important resources regarding the language:
	- Local Language Software Available - Different types of software and 
systems that support the language in one way or another
	- Organizations - Different organizations, people and institutions 
interested in the language, either from a computing perspective or not
	- Dictionaries - On-line and Off-line dictionaries for the language
	- Other Language Links and Resources

"And books and articles, linguistic and otherwise, about the languages. What 
is missing is the question of marking up text in Indian languages, e.g. using 
SGML, HTML, XML etc."

Ok, thanks.

-Frank