Re: [Openadaptxt-linguists] Phrases & text
From: Jens C. <jch...@ke...> - 2012-01-20 12:22:01
Hi,

I'm not entirely sure if you had it right before, have it right now, or both :-)

The corpus file can contain any text in pretty much any format (short phrases, long pieces of text, individual words, etc.). The dictionary creator then does most of the work, extracting context and frequency information from the corpus file.

However, as you mentioned yourself, there can be copyright issues if we upload the raw corpora to SourceForge. That is one of the reasons we had to go with the approach that we have, where the corpus file is basically a frequency list with some added phrases (the approach you describe below).

The phrases that we added are simply a selection of the most common 2-4 word phrases that appear in the original corpora (it varies a bit from language to language).

Regards,
Jens

-----Original Message-----
From: Michael Bauer [mailto:fi...@ak...]
Sent: 20 January 2012 11:33
To: Jens Christensen
Cc: ope...@li...
Subject: Re: [Openadaptxt-linguists] Phrases & text

Ah I get it, I must have misunderstood you. So essentially it goes text corpus > extract phrases and frequency > add terms to inclusion.txt and phrases (plus stats) to corpus.txt? I thought we just paste the text into either file and the package generator somehow does the work. Ok.

This raises a question though. In terms of this project, how did you define "phrase"? Did you simply do a search for the longest phrases which repeat or did you use some type of limitation, linguistic or otherwise?

Thanks

Michael
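As a rough illustration of the pipeline discussed in this thread (raw text in, word frequencies and the most common 2-4 word phrases out), here is a minimal Python sketch. It is not the OpenAdaptxt dictionary creator, and nothing in the thread specifies the real tokenisation rules or the corpus.txt layout. The function names, the regex tokeniser, the per-line phrase boundary, and the `min_count` cut-off are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed tokenisation rule: runs of word characters, allowing inner apostrophes.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*")


def tokenize(line):
    """Lowercase a line and split it into word tokens (hypothetical rule)."""
    return TOKEN_RE.findall(line.lower())


def count_words_and_phrases(lines, min_n=2, max_n=4):
    """Count single words and every min_n..max_n word sequence.

    Phrases are not allowed to span two input lines (an assumption).
    """
    words = Counter()
    phrases = Counter()
    for line in lines:
        tokens = tokenize(line)
        words.update(tokens)
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                phrases[" ".join(tokens[i:i + n])] += 1
    return words, phrases


def most_common_phrases(phrases, limit, min_count=2):
    """Return up to `limit` phrases, most frequent first, seen at least min_count times."""
    return [(p, c) for p, c in phrases.most_common(limit) if c >= min_count]
```

For example, calling `count_words_and_phrases(open("raw_corpus.txt", encoding="utf-8"))` (the file name is hypothetical) yields the two counters, and `most_common_phrases(phrases, limit=500)` gives a shortlist. Only those counts, rather than the raw text, would then need to be shared, which is the point Jens makes about copyright.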