openadaptxt-linguists Mailing List for OpenAdaptxt
Brought to you by:
keypoint,
openadaptxt
|
From: Chris B. <cbi...@gm...> - 2012-10-23 17:38:50
|
Actually, I meant it differently. Samoan and English are separate languages; we don't need to mix them in the sense that our sentences are mixed (apart from some nouns). But they're both used here, so, for instance, my spell-check autocorrect software will do either or both simultaneously without the need to change any language settings, keyboard layouts, etc.

Maori, by contrast, entails changing your keyboard layout, downloading additional software, changing language settings and I don't know what else, and it only works on Windows; there's a whole different process for other OSs, and a bunch of jiggery-pokery before it all starts working. My way, you just double-click the installer and it starts working, whatever the OS or version. Then whether you choose to use Samoan or English or a mixture is immaterial; it will handle the lot without any user intervention. Ideally this is what I want my Adaptxt project to do as well.

Regards
ChrisB

On 10/24/12, Jens Christensen <jch...@ke...> wrote:
> Hi Michael,
> It’s not quite the same. The main difference is that it’s written with the
> Latin alphabet as opposed to the normal alphabet (differs depending on which
> Indian language you’re talking about). There’s also a lot of English mixed
> into it, so there might be some similarities to Samoan there.
>
> Cheers,
> Jens
> ________________________________
> From: Michael Bauer [mailto:fi...@ak...]
> Sent: 20 October 2012 12:14
> To: Jens Christensen
> Cc: Chris Bickers; ope...@li...
> Subject: Re: [Openadaptxt-linguists] New Language
>
> Is that what those Hinglish and Singlish dictionaries are about?
> > Michael > > 20/10/2012 11:21, sgrìobh Jens Christensen: > Additionally if you find that you do want to make a mixed dictionary after > all it’s usually a lot easier to add words than remove them :-) > > Cheers, > Jens > > -- > Akerbeltz<http://www.faclair.com/> > Goireasan Gàidhlig air an lìon > Fòn: +44-141-946 4437 > Facs: +44-141-945 2701 > > Tha Gàidhlig aig a' choimpiutair agad - feuch e: > Addtoany<http://share.lockerz.com/buttons/> ◦ > Facebook<http://userscripts.org/scripts/show/126511> ◦ > Firefox<http://www.mozilla.com/gd/> ◦ > FreeFileSync<http://sourceforge.net/projects/freefilesync/> ◦ > Google<http://www.google.com/webhp?hl=gd> ◦ > Joomla!<http://community.joomla.org/translations/joomla-16-translations.html> > ◦ LibreOffice<http://gd.libreoffice.org/> ◦ > Outlook.com<http://akerbeltz.org/index.php?title=iG%C3%A0idhlig#Outlook.com> > ◦ Opera 11<http://www.opera.com/> ◦ Opera > Mini<http://www.opera.com/mobile/> > phpBB<http://www.foramnagaidhlig.net/foram/viewtopic.php?f=28&t=2071> ◦ > MediaWiki<http://www.mediawiki.org/wiki/MediaWiki> ◦ > Skype<http://www.foramnagaidhlig.net/foram/viewtopic.php?f=28&t=2196> ◦ > Thunderbird<http://www.mozillamessaging.com/gd/thunderbird/> ◦ > WordPress.com<http://gd.wordpress.com/> ◦ > WordPress.org<http://gd.wordpress.org/> ◦ An > Uicipeid<http://gd.wikipedia.org/wiki/Pr%C3%AComh-Dhuilleag> ◦ VLC Media > Player<http://www.videolan.org/vlc/index.gd.html> > Innealan do chleachdaichean Firefox/Thunderbird/LibreOffice: > Accentuate<https://addons.mozilla.org/en-US/firefox/addon/accentuateus/> ◦ > An Dearbhair Beag > (Mozilla)<https://addons.mozilla.org/en-US/firefox/addon/scottish-gaelic-spell-checker/> > ◦ An Dearbhair Beag > (LibreOffice)<http://extensions.libreoffice.org/extension-center/an-dearbhair-beag-scottish-gaelic-spellchecker> > ◦ Lightning<http://www.mozilla.org/projects/calendar/lightning/> ◦ > QLS<https://addons.mozilla.org/en-US/firefox/addon/quick-locale-switcher/> > Geamannan, spòrs ┐ feadhainn eile: 
> An Crochadair<http://www.foramnagaidhlig.net/stuth/geamannan/crochadair/> ◦ > FreeCiv<http://www.foramnagaidhlig.net/foram/viewtopic.php?f=28&t=2098&p=14625#p14625> > ◦ GenoPro<http://www.genopro.com/> ◦ LiChess<http://gd.lichess.org/> ◦ > Scrabble<http://sourceforge.net/apps/mediawiki/scrabble/index.php?title=Main_Page/gd> > ◦ Tetris<http://www.foramnagaidhlig.net/stuth/geamannan/tetris/> > Faclairean ┐ Briathrachas > Am Faclair Beag<http://www.faclair.com/> ◦ Dwelly-d<http://www.dwelly.info/> > ◦ Wordlink<http://www2.smo.uhi.ac.uk/wordlink/> > -- Christopher Bickers Managing Director Bickers Services Samoa |
|
From: Jens C. <jch...@ke...> - 2012-10-23 16:32:15
|
Hi Michael,

It’s not quite the same. The main difference is that it’s written with the Latin alphabet as opposed to the normal alphabet (differs depending on which Indian language you’re talking about). There’s also a lot of English mixed into it, so there might be some similarities to Samoan there.

Cheers,
Jens
________________________________
From: Michael Bauer [mailto:fi...@ak...]
Sent: 20 October 2012 12:14
To: Jens Christensen
Cc: Chris Bickers; ope...@li...
Subject: Re: [Openadaptxt-linguists] New Language

Is that what those Hinglish and Singlish dictionaries are about?

Michael

On 20/10/2012 11:21, Jens Christensen wrote:
Additionally, if you find that you do want to make a mixed dictionary after all, it’s usually a lot easier to add words than to remove them :-)

Cheers,
Jens |
|
From: Michael B. <fi...@ak...> - 2012-10-22 16:30:54
|
A colleague picked my brains before a presentation in the Basque Country, at a conference on small languages and technology, and Adaptxt ended up getting a mention :)

It's on slide 21 if you go to http://www.euskarabildua.com/euskara/ponentziak/davyth-hicks and click on Aurkezpena ("Presentation"; the slides are in English, only the site is in Basque).

Tata

Michael |
|
From: Jens C. <jch...@ke...> - 2012-10-20 10:24:56
|
Hi Chris,

As Michael says, most predictive text applications allow multiple languages to be installed at the same time (Adaptxt certainly does), so my advice would be to go for a purely native dictionary, maybe with a few very common English words included (e.g. ‘the’ is so commonly used that in many languages it has almost become part of the language).

Additionally, if you find that you do want to make a mixed dictionary after all, it’s usually a lot easier to add words than to remove them :-)

Cheers,
Jens
________________________________
From: Michael Bauer [mailto:fi...@ak...]
Sent: 20 October 2012 10:54
To: Chris Bickers
Cc: Jens Christensen; ope...@li...
Subject: Re: [Openadaptxt-linguists] New Language

Easy one :) It's the same here in Scotland. You just download two languages, in my case Gaelic and English. In the live system, it only takes a swipe of the spacebar (onscreen) to switch languages.

Michael

On 20/10/2012 07:36, Chris Bickers wrote:
Hello, last question: if I'm working with a bilingual community (e.g. in Samoa we use both Samoan and English), should I incorporate English as well?
I have no experience with predictive texting at all, but people will want to text in English sometimes and Samoan at other times, depending on who they're interacting with. How will that be accomplished?
For other software I have made, I have incorporated the Samoan into the English so that users still get the full English functionality, rather than having a separate Samoan language.
Does that make sense?
I'm almost finished with the corpus, so I just need to know that really.

Regards
ChrisB |
|
From: Michael B. <fi...@ak...> - 2012-10-20 09:54:09
|
Easy one :) It's the same here in Scotland. You just download two languages, in my case Gaelic and English. In the live system, it only takes a swipe of the spacebar (onscreen) to switch languages.

Michael

On 20/10/2012 07:36, Chris Bickers wrote:
> Hello, last question: if I'm working with a bilingual community (e.g. in Samoa we use both Samoan and English), should I incorporate English as well?
> I have no experience with predictive texting at all, but people will want to text in English sometimes and Samoan at other times, depending on who they're interacting with. How will that be accomplished?
> For other software I have made, I have incorporated the Samoan into the English so that users still get the full English functionality, rather than having a separate Samoan language.
> Does that make sense?
> I'm almost finished with the corpus, so I just need to know that really.
>
> Regards
> ChrisB |
|
From: Chris B. <cbi...@gm...> - 2012-10-20 06:36:54
|
Hello, last question: if I'm working with a bilingual community (e.g. in Samoa we use both Samoan and English), should I incorporate English as well?
I have no experience with predictive texting at all, but people will want to text in English sometimes and Samoan at other times, depending on who they're interacting with. How will that be accomplished?
For other software I have made, I have incorporated the Samoan into the English so that users still get the full English functionality, rather than having a separate Samoan language.
Does that make sense?
I'm almost finished with the corpus, so I just need to know that really.

Regards
ChrisB

On 10/16/12, Chris Bickers <cbi...@gm...> wrote:
> I'm not worried about copyright; I'm creating a unique list/corpus from multiple freely available works, with the authors' consent. As soon as I've finished cleaning everything up, I'll get started. I looked online to see what's available, as Kevin does, but it's all so faulty as to be worthless without fixing every second word. What I'm working with is perfect (when I've finished with it).
> I'm working on 1500+ pages of corpus though, so it's quite an effort.
> Have a great day guys and thanks for the advice.
> ChrisB
>
> On 10/16/12, Michael Bauer <fi...@ak...> wrote:
>>
>> On 15/10/2012 14:25, Jens Christensen wrote:
>>> The dictionary creator can take an input corpus in txt (Unicode) format. You do need a list of valid words for it to work though. Additionally, since the input files will be published on SourceForge, there might be copyright issues (if there aren't, then it's fine to use a normal text corpus) and we'll have to go with the same approach as we do for our other dictionaries (a corpus with each word repeated as many times as needed and some generic context).
>> What that means is that you can either calculate the statistics and modify the file accordingly, or just stick a piece of text in. For example (faking some words), if you know that proportionally aga occurs x%, aba y% and ata z%, you can create a file like this (where the number of repetitions indicates the % of occurrence):
>> aga ,
>> aga ,
>> aga ,
>> aga ,
>> aga ,
>> aga ,
>> aga ,
>> aga ,
>> aba ,
>> aba ,
>> aba ,
>> aba ,
>> aba ,
>> ata ,
>> ata ,
>> ata ,
>>
>> Or you can put in a coherent piece of text like (again faking it)
>> ki aga aga le ata shi to ku aga aba ...
>>
>> and the system, while creating the dictionary, will calculate the stats for you. But as Jens said, that will appear on SourceForge, so make sure it's a text which isn't copyrighted or something you have permission for. Worst case, Bible texts are often available.
>>
>>> I'm not sure how Michael did it for the Gaelic languages, but he might be able to help you.
>> Kevin Scannell has statistical data on lots of languages. Because it sometimes contains badly spelled words, we used the Gaelic spellchecker file (which is clean), compared that against his file, stripped out anything that wasn't in the spellchecker and then calculated the stats for the rest. If you have clean data for the languages in question, that might work too. Failing that, especially since Polynesian languages don't do that much crazy morphology, you can probably do that manually with a bit of guesstimating. Perhaps work off a small learners' wordlist or something.
>>
>> For the phrases, if you chuck in a coherent text, the system will do those too, but I found that adding some manually worked better for Manx and Scots Gaelic, by going through a learners' textbook and picking out common patterns.
>>
>> Tata
>>
>> Michael
>
> --
> Christopher Bickers
> Managing Director
> Bickers Services
> Samoa

--
Christopher Bickers
Managing Director
Bickers Services
Samoa |
|
From: Chris B. <cbi...@gm...> - 2012-10-15 18:22:51
|
I'm not worried about copyright; I'm creating a unique list/corpus from multiple freely available works, with the authors' consent. As soon as I've finished cleaning everything up, I'll get started. I looked online to see what's available, as Kevin does, but it's all so faulty as to be worthless without fixing every second word. What I'm working with is perfect (when I've finished with it).
I'm working on 1500+ pages of corpus though, so it's quite an effort.
Have a great day guys and thanks for the advice.
ChrisB

On 10/16/12, Michael Bauer <fi...@ak...> wrote:
>
> On 15/10/2012 14:25, Jens Christensen wrote:
>> The dictionary creator can take an input corpus in txt (Unicode) format. You do need a list of valid words for it to work though. Additionally, since the input files will be published on SourceForge, there might be copyright issues (if there aren't, then it's fine to use a normal text corpus) and we'll have to go with the same approach as we do for our other dictionaries (a corpus with each word repeated as many times as needed and some generic context).
> What that means is that you can either calculate the statistics and modify the file accordingly, or just stick a piece of text in. For example (faking some words), if you know that proportionally aga occurs x%, aba y% and ata z%, you can create a file like this (where the number of repetitions indicates the % of occurrence):
> aga ,
> aga ,
> aga ,
> aga ,
> aga ,
> aga ,
> aga ,
> aga ,
> aba ,
> aba ,
> aba ,
> aba ,
> aba ,
> ata ,
> ata ,
> ata ,
>
> Or you can put in a coherent piece of text like (again faking it)
> ki aga aga le ata shi to ku aga aba ...
>
> and the system, while creating the dictionary, will calculate the stats for you. But as Jens said, that will appear on SourceForge, so make sure it's a text which isn't copyrighted or something you have permission for. Worst case, Bible texts are often available.
>
>> I'm not sure how Michael did it for the Gaelic languages, but he might be able to help you.
> Kevin Scannell has statistical data on lots of languages. Because it sometimes contains badly spelled words, we used the Gaelic spellchecker file (which is clean), compared that against his file, stripped out anything that wasn't in the spellchecker and then calculated the stats for the rest. If you have clean data for the languages in question, that might work too. Failing that, especially since Polynesian languages don't do that much crazy morphology, you can probably do that manually with a bit of guesstimating. Perhaps work off a small learners' wordlist or something.
>
> For the phrases, if you chuck in a coherent text, the system will do those too, but I found that adding some manually worked better for Manx and Scots Gaelic, by going through a learners' textbook and picking out common patterns.
>
> Tata
>
> Michael

--
Christopher Bickers
Managing Director
Bickers Services
Samoa |
|
From: Michael B. <fi...@ak...> - 2012-10-15 14:07:38
|
On 15/10/2012 14:25, Jens Christensen wrote:
> The dictionary creator can take an input corpus in txt (Unicode) format. You do need a list of valid words for it to work though. Additionally, since the input files will be published on SourceForge, there might be copyright issues (if there aren't, then it's fine to use a normal text corpus) and we'll have to go with the same approach as we do for our other dictionaries (a corpus with each word repeated as many times as needed and some generic context).

What that means is that you can either calculate the statistics and modify the file accordingly, or just stick a piece of text in. For example (faking some words), if you know that proportionally aga occurs x%, aba y% and ata z%, you can create a file like this (where the number of repetitions indicates the % of occurrence):

aga ,
aga ,
aga ,
aga ,
aga ,
aga ,
aga ,
aga ,
aba ,
aba ,
aba ,
aba ,
aba ,
ata ,
ata ,
ata ,

Or you can put in a coherent piece of text like (again faking it)

ki aga aga le ata shi to ku aga aba ...

and the system, while creating the dictionary, will calculate the stats for you. But as Jens said, that will appear on SourceForge, so make sure it's a text which isn't copyrighted or something you have permission for. Worst case, Bible texts are often available.

> I'm not sure how Michael did it for the Gaelic languages, but he might be able to help you.

Kevin Scannell has statistical data on lots of languages. Because it sometimes contains badly spelled words, we used the Gaelic spellchecker file (which is clean), compared that against his file, stripped out anything that wasn't in the spellchecker and then calculated the stats for the rest. If you have clean data for the languages in question, that might work too. Failing that, especially since Polynesian languages don't do that much crazy morphology, you can probably do that manually with a bit of guesstimating. Perhaps work off a small learners' wordlist or something.

For the phrases, if you chuck in a coherent text, the system will do those too, but I found that adding some manually worked better for Manx and Scots Gaelic, by going through a learners' textbook and picking out common patterns.

Tata

Michael |
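Editor's sketch: the approach described in this message (drop anything not in a clean wordlist, then repeat each remaining word in proportion to how often it occurs) might look roughly like the Python below. This is only an illustration under stated assumptions, not the Adaptxt tooling: the function names, the sample counts and the `total_lines` parameter are made up, and the `word ,` line format is simply copied from the example in the message.

```python
from collections import Counter


def filter_counts(counts, valid_words):
    """Keep only the words that appear in the clean (spellchecked) wordlist."""
    valid = set(valid_words)
    return Counter({word: n for word, n in counts.items() if word in valid})


def repeated_word_lines(counts, total_lines=16):
    """Turn raw counts into roughly `total_lines` lines of the form 'word ,'.

    Each word is repeated in proportion to its share of all occurrences.
    Every word gets at least one line, so the result may be slightly
    longer than `total_lines`.
    """
    total = sum(counts.values())
    lines = []
    for word, n in counts.most_common():
        repeats = max(1, round(n * total_lines / total))
        lines.extend([f"{word} ,"] * repeats)
    return lines


# Hypothetical raw frequency data; "agga" stands in for a misspelling
# that is not in the clean wordlist.
raw = Counter({"aga": 80, "aba": 50, "ata": 30, "agga": 4})
clean = filter_counts(raw, ["aga", "aba", "ata"])
corpus_lines = repeated_word_lines(clean, total_lines=16)
print("\n".join(corpus_lines))
```

With these made-up numbers the output has the same shape as the sample file in the message: eight lines of `aga ,`, five of `aba ,` and three of `ata ,`.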
|
From: Jens C. <jch...@ke...> - 2012-10-15 13:28:18
|
The dictionary creator can take an input corpus in txt (Unicode) format. You do need a list of valid words for it to work though. Additionally, since the input files will be published on SourceForge, there might be copyright issues (if there aren't, then it's fine to use a normal text corpus) and we'll have to go with the same approach as we do for our other dictionaries (a corpus with each word repeated as many times as needed and some generic context).

I'm not sure how Michael did it for the Gaelic languages, but he might be able to help you.

Cheers,
Jens

-----Original Message-----
From: Chris Bickers [mailto:cbi...@gm...]
Sent: 15 October 2012 12:59
To: Jens Christensen
Cc: fi...@ak...; ope...@li...
Subject: Re: [Openadaptxt-linguists] New Language

Sounds OK, just macrons with this first language. Is there an easy way to separate words from the corpus so I can build a wordlist from it and avoid repetition? Long weekend here, so I haven't had a chance to ask my programmers.

Regards
ChrisB

On 10/16/12, Jens Christensen <jch...@ke...> wrote:
> Hi Chris,
> Once you have the dictionary files ready we can have a test dictionary ready within a day or two, depending on any issues that we come across, of course. Given that the Polynesian languages are relatively simple (at least technically speaking), I don't think there should be too many problems.
>
> The only thing that can hold us back is if a language has some completely unsupported feature which will have to be developed first (as was the case for the Gaelic languages), but I can't immediately see that for the Polynesian languages (at least not from looking at Wikipedia :-).
>
> Cheers,
> Jens
>
> -----Original Message-----
> From: Michael Bauer [mailto:fi...@ak...]
> Sent: 15 October 2012 11:01
> To: Chris Bickers
> Cc: Jens Christensen; ope...@li...
> Subject: Re: [Openadaptxt-linguists] New Language
>
> The Gaelics started in July 2011 but, before you pass out, we had some language-specific problems (Gaelic has "weird" stuff around prefixes like h- t- n-) which required some development work. I think Polynesian languages will be much faster as they tend only to have a macron at most, but I'd say it depends on how many "special requirements" the language has and how long you take testing.
>
> Michael
>
> On 15/10/2012 10:51, Chris Bickers wrote:
>> Excellent Jens, nice to meet you and great to be here. Michael has already given me some pointers on how to get started; I'm cleaning up a corpus and should have the first language to submit in a couple of days. What sort of timeframe would I be looking at before I have something to download and use in the community?
>> Regards
>> ChrisB

--
Christopher Bickers
Managing Director
Bickers Services
Samoa |
|
From: Chris B. <cbi...@gm...> - 2012-10-15 11:59:24
|
Sounds OK, just macrons with this first language. Is there an easy way to separate words from the corpus so I can build a wordlist from it and avoid repetition? Long weekend here, so I haven't had a chance to ask my programmers.

Regards
ChrisB

On 10/16/12, Jens Christensen <jch...@ke...> wrote:
> Hi Chris,
> Once you have the dictionary files ready we can have a test dictionary ready within a day or two, depending on any issues that we come across, of course. Given that the Polynesian languages are relatively simple (at least technically speaking), I don't think there should be too many problems.
>
> The only thing that can hold us back is if a language has some completely unsupported feature which will have to be developed first (as was the case for the Gaelic languages), but I can't immediately see that for the Polynesian languages (at least not from looking at Wikipedia :-).
>
> Cheers,
> Jens
>
> -----Original Message-----
> From: Michael Bauer [mailto:fi...@ak...]
> Sent: 15 October 2012 11:01
> To: Chris Bickers
> Cc: Jens Christensen; ope...@li...
> Subject: Re: [Openadaptxt-linguists] New Language
>
> The Gaelics started in July 2011 but, before you pass out, we had some language-specific problems (Gaelic has "weird" stuff around prefixes like h- t- n-) which required some development work. I think Polynesian languages will be much faster as they tend only to have a macron at most, but I'd say it depends on how many "special requirements" the language has and how long you take testing.
>
> Michael
>
> On 15/10/2012 10:51, Chris Bickers wrote:
>> Excellent Jens, nice to meet you and great to be here. Michael has already given me some pointers on how to get started; I'm cleaning up a corpus and should have the first language to submit in a couple of days. What sort of timeframe would I be looking at before I have something to download and use in the community?
>> Regards
>> ChrisB

--
Christopher Bickers
Managing Director
Bickers Services
Samoa |
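Editor's sketch: one possible way to do what Chris asks here (split a corpus into words and list each word only once) is shown below in Python. It is a minimal illustration, not something from the thread, and it rests on a few assumptions: a "word" is taken to be any run of Unicode letters, the text is normalised to NFC first so that a letter typed as a base vowel plus a combining macron counts as the same word as its precomposed form, and everything is lower-cased. The function name and sample text are made up; the pattern would need adjusting if apostrophes or glottal-stop marks should count as part of a word.

```python
import re
import unicodedata
from collections import Counter

# Runs of Unicode letters only (no digits, underscores or punctuation).
WORD_RE = re.compile(r"[^\W\d_]+")


def word_counts(text):
    """Count lower-cased words after NFC normalisation."""
    normalised = unicodedata.normalize("NFC", text).lower()
    return Counter(WORD_RE.findall(normalised))


# Made-up sample; the last two words use combining macrons (U+0304)
# on purpose to show why the normalisation step matters.
sample = "Ki aga aga le ata, shi to ku aga aba. Ma\u0304lo\u0304 ma\u0304lo\u0304!"
counts = word_counts(sample)
wordlist = sorted(counts)  # every distinct word exactly once

print(wordlist)
print(counts.most_common(3))
```

The same counts can then double as the frequency statistics discussed earlier in the thread, so the corpus only has to be read once.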
|
From: Jens C. <jch...@ke...> - 2012-10-15 11:05:56
|
Hi Chris,

Once you have the dictionary files ready we can have a test dictionary ready within a day or two, depending on any issues that we come across, of course. Given that the Polynesian languages are relatively simple (at least technically speaking), I don't think there should be too many problems.

The only thing that can hold us back is if a language has some completely unsupported feature which will have to be developed first (as was the case for the Gaelic languages), but I can't immediately see that for the Polynesian languages (at least not from looking at Wikipedia :-).

Cheers,
Jens

-----Original Message-----
From: Michael Bauer [mailto:fi...@ak...]
Sent: 15 October 2012 11:01
To: Chris Bickers
Cc: Jens Christensen; ope...@li...
Subject: Re: [Openadaptxt-linguists] New Language

The Gaelics started in July 2011 but, before you pass out, we had some language-specific problems (Gaelic has "weird" stuff around prefixes like h- t- n-) which required some development work. I think Polynesian languages will be much faster as they tend only to have a macron at most, but I'd say it depends on how many "special requirements" the language has and how long you take testing.

Michael

On 15/10/2012 10:51, Chris Bickers wrote:
> Excellent Jens, nice to meet you and great to be here. Michael has already given me some pointers on how to get started; I'm cleaning up a corpus and should have the first language to submit in a couple of days. What sort of timeframe would I be looking at before I have something to download and use in the community?
> Regards
> ChrisB |
|
From: Michael B. <fi...@ak...> - 2012-10-15 10:01:12
|
The Gaelics started in July 2011 but, before you pass out, we had some language-specific problems (Gaelic has "weird" stuff around prefixes like h- t- n-) which required some development work. I think Polynesian languages will be much faster as they tend only to have a macron at most, but I'd say it depends on how many "special requirements" the language has and how long you take testing.

Michael

On 15/10/2012 10:51, Chris Bickers wrote:
> Excellent Jens, nice to meet you and great to be here. Michael has already given me some pointers on how to get started; I'm cleaning up a corpus and should have the first language to submit in a couple of days. What sort of timeframe would I be looking at before I have something to download and use in the community?
> Regards
> ChrisB |
|
From: Chris B. <cbi...@gm...> - 2012-10-15 09:51:49
|
Excellent Jens, nice to meet you and great to be here. Michael has already given me some pointers on how to get started; I'm cleaning up a corpus and should have the first language to submit in a couple of days. What sort of timeframe would I be looking at before I have something to download and use in the community?

Regards
ChrisB

On 10/15/12, Jens Christensen <jch...@ke...> wrote:
> Hi Chris,
> Welcome to Openadaptxt. Currently there isn't anybody working on any of the Polynesian languages. Please feel free to ask any questions you might have, and also to have a look through the previous threads in the mailing list for some more info on how to create a dictionary.
>
> Michael: That does sound like a good idea. While I look into it I will at least create a better description of how to create a dictionary, based on our experience with the Gaelic languages. Sorry you had to be the guinea pig ;-)
>
> Cheers,
> Jens
>
> -----Original Message-----
> From: Michael Bauer [mailto:fi...@ak...]
> Sent: 14 October 2012 23:29
> To: ope...@li...
> Subject: Re: [Openadaptxt-linguists] New Language
>
> Welcome to the list Chris ;)
>
> Jens, Chris is the chap I mentioned some time ago when we were talking about Samoan, but I think he intends to start on some others first. Anyway, something occurred to me while thinking about that. Given the number of new languages, would it not be better to integrate the gubbins into a web interface in a password-protected area, so that once someone has signed up to maintain/do a locale, they input the language name, ISO code and special characters needed, and the site builds the atd/xml etc. files and grabs a number for the language? On a separate screen, one could upload the txt files - perhaps even split up as a txt file for the line-separated tokens, a file for the URL stuff, one for connected text if one wants to, and so on - and at the end it just outputs the files. It would probably have saved you and me a lot of time, given the knots I managed to tie myself into.
>
> Just a thought
>
> Michael
>
> On 14/10/2012 21:51, Chris Bickers wrote:
>> Hello, I intend to do a few Polynesian languages; is anyone already working on any?
>> Regards
>> ChrisB
>
> _______________________________________________
> Openadaptxt-linguists mailing list
> Ope...@li...
> https://lists.sourceforge.net/lists/listinfo/openadaptxt-linguists

--
Christopher Bickers
Managing Director
Bickers Services
Samoa |
|
From: Jens C. <jch...@ke...> - 2012-10-15 08:54:06
|
Hi Chris,

Welcome to Openadaptxt. Currently there isn't anybody working on any of the Polynesian languages. Please feel free to ask any questions you might have and also to have a look through the previous threads in the mailing list for some more info on how to create a dictionary.

Michael: That does sound like a good idea. While I look into it I will at least create a better description of how to create a dictionary, based on our experience with the Gaelic languages. Sorry you had to be the guinea pig ;-)

Cheers,
Jens

-----Original Message-----
From: Michael Bauer [mailto:fi...@ak...]
Sent: 14 October 2012 23:29
To: ope...@li...
Subject: Re: [Openadaptxt-linguists] New Language

Welcome to the list Chris ;)

Jens, Chris is the chap I mentioned some time ago when we were talking about Samoan but I think he intends to start on some others first. Anyway, something occurred to me while thinking about that. Given the number of new languages, would it not be better to integrate the gubbins into a web interface in a password-protected area, so that once someone has signed up to maintain/do a locale, they input the language name, ISO code and special characters needed, and the site builds the atd/xml etc. files and grabs a number for the language? On a separate screen, one could upload the txt files - perhaps even split up as a txt file for the line-separated tokens, a file for the url stuff, one for connected text if one wants to and so on - and at the end it just outputs the files. It would probably have saved you and me a lot of time, given the knots I managed to tie myself into.

Just a thought

Michael

On 14/10/2012 21:51, Chris Bickers wrote:
> Hello, I intend to do a few Polynesian languages, is anyone already
> working on any?
> Regards
> ChrisB

_______________________________________________
Openadaptxt-linguists mailing list
Ope...@li...
https://lists.sourceforge.net/lists/listinfo/openadaptxt-linguists |
|
From: Michael B. <fi...@ak...> - 2012-10-14 22:42:17
|
Welcome to the list Chris ;)

Jens, Chris is the chap I mentioned some time ago when we were talking about Samoan but I think he intends to start on some others first. Anyway, something occurred to me while thinking about that. Given the number of new languages, would it not be better to integrate the gubbins into a web interface in a password-protected area, so that once someone has signed up to maintain/do a locale, they input the language name, ISO code and special characters needed, and the site builds the atd/xml etc. files and grabs a number for the language? On a separate screen, one could upload the txt files - perhaps even split up as a txt file for the line-separated tokens, a file for the url stuff, one for connected text if one wants to and so on - and at the end it just outputs the files. It would probably have saved you and me a lot of time, given the knots I managed to tie myself into.

Just a thought

Michael

On 14/10/2012 21:51, Chris Bickers wrote:
> Hello, I intend to do a few Polynesian languages, is anyone already
> working on any?
> Regards
> ChrisB |
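The sign-up step Michael describes (language name, ISO code and special characters in, descriptor file out) could be sketched roughly as below. This is only an illustration of the idea: the XML element names, the `LocaleRequest` class and the `build_descriptor` function are all made up for this sketch and are not the real atd/xml format used by the OpenAdaptxt tools, which is not described in this thread.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class LocaleRequest:
    """Hypothetical form data a maintainer would enter when signing up."""
    language_name: str
    iso_code: str
    special_chars: str  # extra letters the language needs beyond a-z

    def validate(self) -> None:
        if not self.language_name.strip():
            raise ValueError("language name is required")
        code = self.iso_code
        # ISO 639 codes are two or three lowercase ASCII letters.
        if not (code.isascii() and code.isalpha() and code.islower()
                and len(code) in (2, 3)):
            raise ValueError("ISO code must be 2 or 3 lowercase letters")


def build_descriptor(req: LocaleRequest, language_id: int) -> str:
    """Return a made-up XML descriptor; NOT the real atd/xml layout."""
    req.validate()
    root = ET.Element("locale", id=str(language_id), iso=req.iso_code)
    ET.SubElement(root, "name").text = req.language_name
    chars = ET.SubElement(root, "specialCharacters")
    # De-duplicate and sort so the output is stable between runs.
    for ch in sorted(set(req.special_chars)):
        ET.SubElement(chars, "char").text = ch
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Samoan writes long vowels with macrons; 42 is an arbitrary example id.
    request = LocaleRequest("Samoan", "sm", "\u0101\u0113\u012b\u014d\u016b")
    print(build_descriptor(request, language_id=42))
```

Validating the input up front is the main point of the sketch: a web form that rejects a malformed ISO code or an empty name would catch the kind of slip that otherwise only shows up after the files have been built by hand.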
|
From: Chris B. <cbi...@gm...> - 2012-10-14 20:52:02
|
Hello,

I intend to do a few Polynesian languages. Is anyone already working on any?

Regards
ChrisB |
|
From: Michael B. <fi...@ak...> - 2012-09-06 13:54:34
|
Hi guys,

Question for you - we've just been given a slot at the Open Night at the Gaelic School in Glasgow to run an event on Gaelic software. It would be really cool if we could include Adaptxt in that, but I kinda need to know beforehand whether we can include it or not. The event is on the 12th - any chance the launch will be before then (that is, for those Gaelic/Irish/Manx tools we've been working on)?

Cheers
Michael |
|
From: Jens C. <jch...@ke...> - 2012-02-13 11:09:39
|
Hi,

That's great. Yes, if you send the text files (corpus and inclusion), the output files and the nuance files (both the xml and abn) I will upload them to Sourceforge. You can of course update the files any time you want, just send the updated files to me.

Thanks,
Jens

-----Original Message-----
From: Michael Bauer [mailto:fi...@ak...]
Sent: 10 February 2012 18:51
To: Jens Christensen
Cc: ope...@li...
Subject: Re: Irish

Ok, all fixed :) What do you need? All the files that got generated plus the txt files?

Cheers
Michael |
|
From: Michael B. <fi...@ak...> - 2012-02-10 18:50:52
|
Ok, all fixed :) What do you need? All the files that got generated plus the txt files?

Cheers
Michael |
|
From: Michael B. <fi...@ak...> - 2012-02-09 23:16:27
|
Right, I'm royally stuck and I don't know why. We did the test files a while ago and it worked fine. Following advice, we added some phrases and then maxed out the 100,000 word count cause we had "spare room". I can still create the necessary files and Qtapp opens them alright, but there are some things that just aren't working right. For example, "o mhathair" should yield "ó mháthair" as it used to, but it doesn't anymore, and something is no longer working right with the elision stuff either.

I've gone back to basics, redone the abn file from scratch just in case, recompiled the files and all, checked the encoding... nothing I do seems to make a difference. The only difference I've spotted is that rather than putting the phrases at the end, we've accidentally sorted them alphabetically into the rest of the entries, but as they're all on their own lines, I don't see why that should matter.

I've uploaded the base files and the output to http://www.akerbeltz.org/sealach/Irish.zip

Any pointers would be much appreciated.

Another question... when using a predicted word, there's a space after it, even if I follow it with a , or . I assume that extra space is part of the "deal" and users just have to live with it?

I note with excitement though that more OSes have been added, such as QNX/BlackBerry and iOS :)))

Cheers,
Michael |
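For the "checked encoding" step, one thing that is easy to miss by eye is that an accented letter such as ó can be stored either as a single precomposed code point or as a plain letter plus a combining accent. The two look identical on screen but do not compare equal. The sketch below is a generic check, not part of the OpenAdaptxt tools, and nothing in this thread confirms that this was the cause of the problem. It reports a byte order mark, invalid UTF-8, and lines that are not in precomposed (NFC) form; the `diagnose` function name is made up here.

```python
import unicodedata
from typing import List


def diagnose(data: bytes) -> List[str]:
    """List encoding oddities found in the raw bytes of a text file."""
    problems = []
    if data.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 byte order mark")
        data = data[3:]  # offsets reported below are relative to the rest
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return problems + [f"not valid UTF-8 at byte {exc.start}"]
    for number, line in enumerate(text.splitlines(), start=1):
        # is_normalized needs Python 3.8 or later.
        if not unicodedata.is_normalized("NFC", line):
            problems.append(f"line {number} is not NFC-normalised: {line!r}")
    return problems


if __name__ == "__main__":
    # "ó mháthair" written with precomposed letters, then decomposed.
    composed = "\u00f3 mh\u00e1thair\n"
    decomposed = unicodedata.normalize("NFD", composed)
    print(diagnose(composed.encode("utf-8")))
    print(diagnose(decomposed.encode("utf-8")))
```

To run it against a real file, read the file in binary mode (`open(path, "rb").read()`) and pass the bytes in, so that the check sees exactly what is on disk rather than what an editor chooses to display.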
|
From: Michael B. <fi...@ak...> - 2012-02-08 10:38:25
|
Makes sense, thanks!

Michael

On 08/02/2012 09:37, Jens Christensen wrote:
> Hi Michael,
> Yes, you will get strange stuff like that, even if you set the cutoff quite high. Of course the higher the cutoff the fewer you should get, but then you will also get fewer of the "good" ones, so it's a trade-off either way. If you want to remove the oddities I can't really think of any other way of doing that than reviewing the list manually. It's up to yourselves to judge what would be necessary to remove, if anything, and whether it's necessary to review the context - you are the experts :-)
>
> As you noted, the current dictionaries (not just the English one) only contain a very small amount of context, basically the very most common. This is partly to avoid "oddities" as you mention, but more to leave the dictionaries open for people like yourselves or anybody else who might be interested in using or working with openadaptxt. Basically we didn't want to say "these are the final dictionaries" but rather leave it to the community to decide what to do with the dictionaries.
>
> Cheers,
> Jens |
|
From: Jens C. <jch...@ke...> - 2012-02-08 09:39:46
|
Hi Michael,

Yes, you will get strange stuff like that, even if you set the cutoff quite high. Of course the higher the cutoff the fewer you should get, but then you will also get fewer of the "good" ones, so it's a trade-off either way. If you want to remove the oddities I can't really think of any other way of doing that than reviewing the list manually. It's up to yourselves to judge what would be necessary to remove, if anything, and whether it's necessary to review the context - you are the experts :-)

As you noted, the current dictionaries (not just the English one) only contain a very small amount of context, basically the very most common. This is partly to avoid "oddities" as you mention, but more to leave the dictionaries open for people like yourselves or anybody else who might be interested in using or working with openadaptxt. Basically we didn't want to say "these are the final dictionaries" but rather leave it to the community to decide what to do with the dictionaries.

Cheers,
Jens

-----Original Message-----
From: Michael Bauer [mailto:fi...@ak...]
Sent: 08 February 2012 00:28
To: Jens Christensen
Cc: ope...@li...
Subject: Re: [Openadaptxt-linguists] Phrases & text

Ok, took us a while cause I had to build a Gaelic corpus. Question though. If we just analyze for 2-4 word combos, won't that result in some oddities like "Jimmy said"? Or do you set a fairly high cutoff? The English corpus (the 2-4 word items) doesn't actually look that big for something built on 4MB of text.

Cheers
Michael

On 20/01/2012 12:20, Jens Christensen wrote:
> The phrases that we added are simply a selection of the most common 2-4 word phrases that appear in the original corpora (it varies a bit from language to language). |
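The cutoff trade-off discussed in this message can be illustrated with a small sketch. It counts every 2-4 word sequence in a text and keeps those seen at least `cutoff` times. This is a generic n-gram count written for illustration, not the method Keypoint actually used to select phrases. The function names, the sentence splitting on `.`, `!` and `?`, and the word pattern are all assumptions of the sketch.

```python
from collections import Counter
import re

# A word is a run of letters, optionally joined by apostrophes or hyphens.
WORD = re.compile(r"[^\W\d_]+(?:['\u2019-][^\W\d_]+)*")


def phrase_counts(text: str, min_len: int = 2, max_len: int = 4) -> Counter:
    """Count every run of min_len..max_len consecutive words."""
    counts = Counter()
    # Split on sentence-ending punctuation so phrases never span sentences.
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = WORD.findall(sentence)
        for n in range(min_len, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts


def apply_cutoff(counts: Counter, cutoff: int) -> dict:
    """Keep only phrases seen at least `cutoff` times."""
    return {phrase: c for phrase, c in counts.items() if c >= cutoff}


if __name__ == "__main__":
    sample = ("Jimmy said hello. Thank you very much. "
              "Thank you very much indeed. Thank you.")
    counts = phrase_counts(sample)
    for cutoff in (1, 2, 3):
        print(cutoff, sorted(apply_cutoff(counts, cutoff)))
```

With this sample, a cutoff of 1 keeps one-off combinations such as "jimmy said", a cutoff of 2 drops them but keeps the repeated "thank you very much" family, and a cutoff of 3 leaves only "thank you". Raising the cutoff removes oddities and useful phrases alike, which is why a manual review of the survivors is still suggested above.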
|
From: Michael B. <fi...@ak...> - 2012-02-08 00:27:56
|
Ok, took us a while cause I had to build a Gaelic corpus. Question though. If we just analyze for 2-4 word combos, won't that result in some oddities like "Jimmy said"? Or do you set a fairly high cutoff? The English corpus (the 2-4 word items) doesn't actually look that big for something built on 4MB of text.

Cheers
Michael

On 20/01/2012 12:20, Jens Christensen wrote:
> The phrases that we added are simply a selection of the most common 2-4 word phrases that appear in the original corpora (it varies a bit from language to language). |
|
From: Jens C. <jch...@ke...> - 2012-01-20 12:22:01
|
Hi,

I'm not entirely sure if you had it right before, have it right now, or both :-) The corpus file can contain text in pretty much any format (short phrases, long pieces of text, individual words, etc.). The dictionary creator then does most of the work, extracting context and frequency information from the corpus file. However, as you mentioned yourself, there can be copyright issues if we upload the raw corpora to SourceForge. That is one of the reasons we went with the approach we have, where the corpus file is basically a frequency list with some added phrases (the approach you describe below). The phrases that we added are simply a selection of the most common 2-4 word phrases that appear in the original corpora (it varies a bit from language to language).

Regards,
Jens

-----Original Message-----
From: Michael Bauer [mailto:fi...@ak...]
Sent: 20 January 2012 11:33
To: Jens Christensen
Cc: ope...@li...
Subject: Re: [Openadaptxt-linguists] Phrases & text

Ah I get it, I must have misunderstood you. So essentially it goes text corpus > extract phrases and frequency > add terms to inclusion.txt and phrases (plus stats) to corpus.txt? I thought we just paste the text into either file and the package generator somehow does the work. Ok.

This raises a question though. In terms of this project, how did you define "phrase"? Did you simply do a search for the longest phrases which repeat or did you use some type of limitation, linguistic or otherwise?

Thanks
Michael |
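The "frequency list instead of raw corpus" approach described here can be sketched as follows. The idea is to reduce copyrighted running text to word counts before anything is shared. The tab-separated `word<TAB>count` layout, the function names and the lowercasing are assumptions made for illustration. The actual layout that corpus.txt and inclusion.txt expect is not spelled out in this thread, so check it against the dictionary creator's own documentation.

```python
from collections import Counter
import re

# A word is a run of letters, optionally joined by apostrophes or hyphens.
WORD = re.compile(r"[^\W\d_]+(?:['\u2019-][^\W\d_]+)*")


def frequency_list(raw_text: str, max_entries: int = 100_000) -> list:
    """Return (word, count) pairs, most frequent first, capped in length."""
    # Lowercasing merges "Tha" and "tha"; it also flattens proper nouns,
    # which may or may not be what a given language needs.
    counts = Counter(WORD.findall(raw_text.lower()))
    return counts.most_common(max_entries)


def render(entries: list) -> str:
    """Format the list one entry per line (assumed layout, see above)."""
    return "\n".join(f"{word}\t{count}" for word, count in entries)


if __name__ == "__main__":
    raw = "Tha mi an seo. Tha mi toilichte."
    print(render(frequency_list(raw, max_entries=3)))
```

Because only words and counts leave the script, the original sentences cannot be reconstructed from the output. Phrases would then be added to the list separately, as Jens describes.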