Re: [Openadaptxt-linguists] New Language
From: Chris B. <cbi...@gm...> - 2012-10-15 18:22:51
I'm not worried about copyright. I'm creating a unique list/corpus from
multiple freely available works, with the authors' consent. As soon as I've
finished cleaning everything up, I'll get started. I looked online to see
what's available, as Kevin does, but it's all so faulty as to be worthless
without fixing every second word. What I'm working with is perfect (when
I've finished with it). I'm working on 1500+ pages of corpus, though, so
it's quite an effort.

Have a great day guys and thanks for the advice.

ChrisB

On 10/16/12, Michael Bauer <fi...@ak...> wrote:
>
> 15/10/2012 14:25, Jens Christensen wrote:
>> The dictionary creator can take an input corpus in txt (Unicode) format.
>> You do need a list of valid words for it to work though. Additionally,
>> since the input files will be published on SourceForge, there might be
>> copyright issues (if there aren't, then it's fine to use a normal text
>> corpus) and we'll have to go with the same approach as we do for our other
>> dictionaries (a corpus with each word repeated as many times as needed and
>> some generic context).
> What that means is that you can either calculate the statistics and
> modify the file accordingly or just stick a piece of text in. For
> example (faking some words), if you know that proportionally aga occurs
> x%, aba y%, ata z%, you can create a file like this (where the number of
> repetitions indicates the % of occurrence):
> aga ,
> aga ,
> aga ,
> aga ,
> aga ,
> aga ,
> aga ,
> aga ,
> aba ,
> aba ,
> aba ,
> aba ,
> aba ,
> ata ,
> ata ,
> ata ,
>
> Or you can put in a coherent piece of text like (again faking it)
> ki aga aga le ata shi to ku aga aba ...
>
> and the system, during creation, will calculate the stats for you. But
> as Jens said, that will appear on SourceForge, so make sure it's a text
> which isn't copyrighted or something you have permission for. Worst
> case, Bible texts are often available.
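
Editorial sketch: the repeated-word file described above could be generated from a table of percentages along these lines. This is only an illustration under assumptions. The function name `build_repeated_corpus`, the `total_lines` parameter and the percentage values are hypothetical, and the `word ,` line layout is copied from the faked example rather than from any documented input format of the dictionary creator.

```python
def build_repeated_corpus(percentages, total_lines=16):
    """Return lines in which each word is repeated in proportion to its share.

    percentages: dict mapping word -> share of occurrences in percent.
    total_lines: rough size of the output (hypothetical parameter).
    """
    lines = []
    for word, pct in percentages.items():
        # Keep at least one line per word so rare words are not dropped.
        count = max(1, round(total_lines * pct / 100))
        # "word ," mirrors the layout of the faked example in the email.
        lines.extend([f"{word} ,"] * count)
    return lines


# Hypothetical shares chosen to reproduce the 8 / 5 / 3 split shown above.
shares = {"aga": 50.0, "aba": 31.25, "ata": 18.75}
corpus_lines = build_repeated_corpus(shares, total_lines=16)
print("\n".join(corpus_lines))
```

A larger `total_lines` gives finer resolution for rare words, at the cost of a longer file.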
>
>> I'm not sure how Michael did it for the Gaelic languages, but he might be
>> able to help you.
> Kevin Scannell has statistical data on lots of languages. Because it
> sometimes contains badly spelled words, we used the Gaelic spellchecker
> file (which is clean), compared that against his file, stripped out
> anything that wasn't in the spellchecker and then calculated the stats
> for the rest. If you have clean data for the languages in question,
> that might work too. Failing that, especially since Polynesian languages
> don't do that much crazy morphology, you can probably do that manually
> with a bit of guesstimating. Perhaps work off a small learners' wordlist
> or something.
>
> For the phrases, if you chuck in a coherent text, the system will do
> those too, but I found that adding some manually worked better for Manx
> and Scots Gaelic by going through a learners' textbook and picking out
> common patterns.
>
> Tata
>
> Michael
>
>

--
Christopher Bickers
Managing Director
Bickers Services Samoa
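
Editorial sketch: the clean-up Michael describes (count words, drop anything missing from a clean spellchecker list, then recompute the statistics on what remains) might look roughly like this. The helper names `count_words` and `clean_frequencies` are hypothetical, the sample text is the faked one from the email, and splitting on whitespace is a simplifying assumption that ignores punctuation and language-specific tokenisation.

```python
from collections import Counter


def count_words(text):
    """Count whitespace-separated, lower-cased tokens in a text."""
    return Counter(text.lower().split())


def clean_frequencies(raw_counts, valid_words):
    """Keep only words found in the clean list and return their shares in percent."""
    kept = {w: c for w, c in raw_counts.items() if w in valid_words}
    total = sum(kept.values())
    if total == 0:
        return {}
    # Shares are recomputed over the surviving words only.
    return {w: 100.0 * c / total for w, c in kept.items()}


# Faked text from the email and a hypothetical clean word list.
raw = count_words("ki aga aga le ata shi to ku aga aba")
valid = {"aga", "aba", "ata"}
stats = clean_frequencies(raw, valid)
for word, pct in sorted(stats.items(), key=lambda item: -item[1]):
    print(f"{word}\t{pct:.1f}%")
```

The resulting percentages can then be used either to build a repeated-word file or as a sanity check on a hand-made list.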