Re: [Openadaptxt-linguists] New Language
From: Michael B. <fi...@ak...> - 2012-10-15 14:07:38
On 15/10/2012 14:25, Jens Christensen wrote:
> The dictionary creator can take an input corpus in txt (Unicode) format. You do need a list of valid words for it to work, though. Additionally, since the input files will be published on SourceForge, there might be copyright issues (if there aren't, then it's fine to use a normal text corpus), and we'll have to go with the same approach as we do for our other dictionaries (a corpus with each word repeated as many times as needed and some generic context).

What that means is that you can either calculate the statistics and modify the file accordingly, or just stick a piece of text in. For example (faking some words), if you know that aga accounts for x% of occurrences, aba for y% and ata for z%, you can create a file like this, where the number of repetitions reflects the percentage of occurrence:

aga , aga , aga , aga , aga , aga , aga , aga , aba , aba , aba , aba , aba , ata , ata , ata ,

Or you can put in a coherent piece of text like (again faking it)

ki aga aga le ata shi to ku aga aba ...

and the system will calculate the stats for you when it creates the dictionary. But as Jens said, the file will appear on SourceForge, so make sure it's a text which isn't copyrighted, or something you have permission for. Worst case, Bible texts are often available.

> I'm not sure how Michael did it for the Gaelic languages, but he might be able to help you.

Kevin Scannell has statistical data on lots of languages. Because it sometimes contains badly spelled words, we took the Gaelic spellchecker file (which is clean), compared it against his file, stripped out anything that wasn't in the spellchecker and then calculated the stats for the rest. If you have clean data for the languages in question, that might work too. Failing that, especially since Polynesian languages don't do that much crazy morphology, you can probably do it manually with a bit of guesstimating. Perhaps work off a small learners' wordlist or something.
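To make the two steps above a bit more concrete, here is a rough Python sketch: filter raw frequency data against a clean word list, turn the counts into percentages, and then repeat each word in proportion to its share. It is only an illustration, not the dictionary creator's own code. The function names, the `total_tokens` parameter and the "at least one repetition per word" rule are all assumptions made for this example, and the " , " separator is simply copied from the sample file shown above.

```python
def filter_counts(raw_counts, valid_words):
    """Keep only the words that also appear in the clean word list."""
    valid = set(valid_words)
    return {word: count for word, count in raw_counts.items() if word in valid}


def to_percentages(counts):
    """Convert raw counts into percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: 100.0 * count / total for word, count in counts.items()}


def build_repeated_corpus(percentages, total_tokens=100):
    """Repeat each word roughly in proportion to its percentage.

    Assumption: every word is emitted at least once, so rare words are
    not dropped. Note that Python's round() rounds exact halves to the
    nearest even number, so the result can be off by one token.
    """
    tokens = []
    # Most frequent words first, ties broken alphabetically.
    ordered = sorted(percentages.items(), key=lambda item: (-item[1], item[0]))
    for word, pct in ordered:
        repeats = max(1, round(pct * total_tokens / 100))
        tokens.extend([word] * repeats)
    return tokens


if __name__ == "__main__":
    # Made-up counts; "agga" stands in for a misspelling in the raw data.
    raw = {"aga": 8, "aba": 5, "ata": 3, "agga": 2}
    clean = filter_counts(raw, ["aga", "aba", "ata"])
    tokens = build_repeated_corpus(to_percentages(clean), total_tokens=16)
    # Separator copied from the sample file; the real format may differ.
    print(" , ".join(tokens) + " ,")
```

With real data, `raw` would come from the statistical word list and the second argument to `filter_counts` from the spellchecker file; how those files are parsed depends on their formats, which are not described here.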
For the phrases, if you chuck in a coherent text, the system will do those too, but for Manx and Scots Gaelic I found it worked better to add some manually, by going through a learners' textbook and picking out common patterns.

Tata

Michael