Re: [Openadaptxt-linguists] New Language
From: Chris B. <cbi...@gm...> - 2012-10-20 06:36:54
Hello,

Last question: if I'm working in a bilingual setting (e.g. in Samoa we use both Samoan and English), should I incorporate English as well? I have no experience with predictive texting at all, but people will want to text in English sometimes and in Samoan other times, depending on who they're interacting with. How will that be accomplished?

For other software I have made, I have incorporated the Samoan into the English so that users still get the full English functionality, rather than having a separate Samoan language. Does that make sense?

I'm almost finished with the corpus, so I just need to know that really.

Regards
ChrisB

On 10/16/12, Chris Bickers <cbi...@gm...> wrote:
> I'm not worried about copyright. I'm creating a unique list/corpus from
> multiple freely available works, with the authors' consent. As soon as
> I've finished cleaning everything up, I'll get started. I looked online
> to see what's available, as Kevin does, but it's all so faulty as to be
> worthless without fixing every second word. What I'm working with is
> perfect (when I've finished with it).
> I'm working on 1500+ pages of corpus though, so it's quite an effort.
> Have a great day guys, and thanks for the advice.
> ChrisB
>
> On 10/16/12, Michael Bauer <fi...@ak...> wrote:
>>
>> 15/10/2012 14:25, Jens Christensen wrote:
>>> The dictionary creator can take an input corpus in txt (Unicode) format.
>>> You do need a list of valid words for it to work though. Additionally,
>>> since the input files will be published on SourceForge, there might be
>>> copyright issues (if there aren't, then it's fine to use a normal text
>>> corpus) and we'll have to go with the same approach as we do for our
>>> other dictionaries (a corpus with each word repeated as many times as
>>> needed and some generic context).
>> What that means is that you can either calculate the statistics and
>> modify the file accordingly or just stick a piece of text in. For
>> example (faking some words), if you know that proportionally aga occurs
>> x%, aba y%, ata z%, you can create a file like this (where the number of
>> repetitions indicates the % of occurrence):
>> aga ,
>> aga ,
>> aga ,
>> aga ,
>> aga ,
>> aga ,
>> aga ,
>> aga ,
>> aba ,
>> aba ,
>> aba ,
>> aba ,
>> aba ,
>> ata ,
>> ata ,
>> ata ,
>>
>> Or you can put in a coherent piece of text like (again faking it)
>> ki aga aga le ata shi to ku aga aba ...
>>
>> and the system, during creation, will calculate the stats for you. But
>> as Jens said, that will appear on SourceForge, so make sure it's a text
>> which isn't copyrighted or something you have permission for. Worst
>> case, Bible texts are often available.
>>
>>> I'm not sure how Michael did it for the Gaelic languages, but he might
>>> be able to help you.
>> Kevin Scannell has statistical data on lots of languages. Because it
>> sometimes contains badly spelled words, we used the Gaelic spellchecker
>> file (which is clean), compared that against his file, stripped out
>> anything that wasn't in the spellchecker and then calculated the stats
>> for the rest. If you have clean data for the languages in question,
>> that might work too. Failing that, especially since Polynesian languages
>> don't do that much crazy morphology, you can probably do that manually
>> with a bit of guesstimating. Perhaps work off a small learners' wordlist
>> or something.
>>
>> For the phrases, if you chuck in a coherent text, the system will do
>> those too, but I found that adding some manually worked better for Manx
>> and Scots Gaelic by going through a learners' textbook and picking out
>> common patterns.
>>
>> Tata
>>
>> Michael
>>
>
> --
> Christopher Bickers
> Managing Director
> Bickers Services
> Samoa

--
Christopher Bickers
Managing Director
Bickers Services
Samoa
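
For anyone following the thread, the two steps Michael describes (keep only words found in a clean word list, then repeat each word in proportion to how often it occurs) might be sketched roughly as below. This is only an illustration under assumptions: the function names, the 16-line budget and the naive `\w+` tokenisation are made up for the example, and the `word ,` line shape is simply copied from the sample in the quoted message rather than from any documented dictionary creator format.

```python
import re
from collections import Counter


def count_valid_words(text, valid_words):
    """Count tokens in `text`, keeping only those present in `valid_words`."""
    # Naive tokenisation (an assumption): runs of Unicode word characters.
    # Apostrophes and glottal-stop marks would split a word here.
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if t in valid_words)


def repeated_word_lines(counts, total_lines=16):
    """Repeat each word roughly in proportion to its share of all counts."""
    total = sum(counts.values())
    lines = []
    for word, n in counts.most_common():
        # Every kept word appears at least once, even if it is very rare.
        repeats = max(1, round(total_lines * n / total))
        lines.extend([f"{word} ,"] * repeats)
    return lines


# Made-up words from the thread; "xyz" stands in for a misspelling.
sample = "aga " * 8 + "aba " * 5 + "ata " * 3 + "xyz"
valid = {"aga", "aba", "ata"}

counts = count_valid_words(sample, valid)
lines = repeated_word_lines(counts, total_lines=16)
print("\n".join(lines))
```

Because of rounding and the at-least-once rule, the number of lines produced can differ slightly from `total_lines` on real data.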