| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| README.txt | 2012-09-12 | 6.3 kB | |
| sample_result.txt | 2012-09-12 | 13.1 kB | |
| Totals: 2 Items | | 19.4 kB | 0 |
The entire project is structured as an Eclipse [1] PyDev project [2],
so you need to download and install Eclipse and PyDev.
Note: the current implementation uses the dictionary files from dict.cc [4].
For licensing reasons you must download the files yourself [5] and
replace the corresponding stub files, e.g. EN-DE.txt. You can also add new
languages by adding the files to the root directory, naming them
<Source Language>-<Target Language>.txt, and adding the target language to
the configuration file. (Note: only one source language is currently
supported.)
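The naming convention above can be checked with a small script before starting the program. The following is only a sketch and not part of the project: the function names expected_dictionary_files and missing_dictionary_files are hypothetical, and it assumes the language codes are written exactly as in the file names (e.g. EN, DE).

```python
from pathlib import Path


def expected_dictionary_files(source, targets):
    """Build the file names implied by <Source Language>-<Target Language>.txt.

    Assumption: the codes are used verbatim, e.g. "EN" and "DE" -> "EN-DE.txt".
    """
    return [f"{source}-{target}.txt" for target in targets]


def missing_dictionary_files(root, source, targets):
    """Return the expected dictionary files that do not exist in root."""
    root = Path(root)
    return [
        name
        for name in expected_dictionary_files(source, targets)
        if not (root / name).is_file()
    ]


if __name__ == "__main__":
    # One source language, several target languages, as in the sample config.
    for name in missing_dictionary_files(".", "EN", ["DE", "FR", "IT", "PL", "ES"]):
        print("missing dictionary file:", name)
```

Note that this only tests whether a file exists. It cannot tell a stub file from a real dict.cc word list.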
A configuration file (config.xml) is located in the root folder.
The following example illustrates a sample configuration.
==============================================================================
<config xmlns="http://ztt.fh-worms.de/lgeorgieff/CAiL/">
<!-- the following part is responsible for requesting and crawling new
webpages that are used as text for the following workers -->
<search>
<!-- seed: the seed pages that are crawled; the links found on them are
added to the seed recursively -->
<!-- query: if the seed becomes empty, the query items are sent to a web
search engine and the resulting pages are used as seed,
the default implementation uses Google's search engine -->
<seed>
<page>http://en.wikipedia.org/wiki/Main_Page</page>
<page>http://www.msn.com/</page>
<page>http://www.cnn.com/</page>
<page>http://www.nationalgeographic.com/</page>
<page>http://web.mit.edu/</page>
<query>language</query>
<query>news</query>
</seed>
<!-- use: the language of the web pages that are taken into account.
Several use tags are allowed within the languages tag. Pages that
declare a different language or do not declare a language
are ignored -->
<!-- discard: the language of web pages that are not taken into
account. Several discard tags are allowed within the languages tag. -->
<languages>
<!-- iso 639: http://www.mathguide.de/info/tools/languagecode.html -->
<!--<use lang="en-us"/>-->
<!--<use lang="en-gb"/>-->
<use lang="en"/>
<!--<discard lang="pl"/>-->
</languages>
<!-- once the crawler has reached the maxRequests value, it stops -->
<maxRequests value="500"/><!-- for infinite: -1 -->
<!-- for the Google page downloader we use the additional tag google -->
<!-- with the apiKey attribute for accessing the Google REST API and the -->
<!-- cx attribute. -->
<!-- Note: it is important to encrypt the cx value! -->
<google cx="..." apiKey="..."/>
</search>
<!-- the following part is responsible for the translation of tokens -->
<!-- currently there are two implementations available:
one for the Microsoft Bing translator [3, 6] (not finished) and another
for an offline translator based on dictionary files from dict.cc [4].
the translator actually used is the offline translator. -->
<translate>
<!-- some configuration settings for the Microsoft Translator -->
<!-- Note: it is important to encrypt the clientSecret value! -->
<microsoft clientId="..."
clientSecret="...">
<language source="en" targets="de fr it pl es"/>
<minRating value="5"/>
<minMatchDegree value="100"/>
<averageCountRation value="1.5"/>
<pagewindow value="3"/>
</microsoft>
<offline>
<!-- pageWindow: the number of web pages over which words are collected
into one set, e.g. 3 => the set consists of words collected over
3 web pages -->
<!-- language: specifies the source language and the target
languages -->
<pageWindow value="3"/>
<language source="EN" targets="DE FR IT PL ES"/>
</offline>
</translate>
<!-- the following part is responsible for finding the actual
synonymous words -->
<synonymFinder>
<!-- ignoreCaseOfTranslations: if set to true, translated words are
compared ignoring upper and lower case. This applies to the
translations within one language, e.g. the translations "gehen" and
"Gehen" are treated as a match although their word types differ,
so within that single language the source words are considered
synonyms. -->
<!-- minLanguageMatches: the number of languages in which a potential
synonym pair must match to be considered a valid synonym pair, e.g. the
synonym pair (pretty, beautiful) must be matched in at least two
languages if this property is set to 2. -->
<!-- ignoreWordTypeWhenComparingLanguages: when set to false, not only
different languages are taken into account but also the word type
of the translations, e.g. if minLanguageMatches is set to 2, the
synonym pair (pretty, beautiful) must not only be matched in at least
2 languages, but additionally the word types of the translations
must be identical, e.g. noun -->
<!-- the case of translated text, e.g. "gehen" and (das)"Gehen" -->
<ignoreCaseOfTranslations value="false"/>
<!-- the minimum number of languages in which a found synonym pair must be matched -->
<minLanguageMatches value="2"/>
<ignoreWordTypeWhenComparingLanguages value="false"/>
</synonymFinder>
<!-- the following part is responsible for writing the actual results
into a file -->
<resultWriter>
<resultingFile>
<!-- path: the actual file path of the resulting file -->
<!-- fileMode: *override: overwrites a potentially existing file
*append: creates a new file or, if one already exists,
appends the results to the existing file
*create: creates a new file; if one already exists,
an error is thrown -->
<path value="./resultFile.txt"/>
<fileMode value="append"/> <!-- override | append | create -->
</resultingFile>
</resultWriter>
</config>
==============================================================================
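Because the root element declares a default namespace, every element lookup has to be namespace-qualified. The following sketch shows one way to read a few of the settings above with Python's standard xml.etree.ElementTree module. It is not the project's actual loader: read_settings, FILE_MODES and SAMPLE are hypothetical names, SAMPLE is a shortened copy of the configuration above, and the mapping of the fileMode values to Python open() modes is an assumption based on the descriptions in the comments.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute of the <config> element.
NS = {"c": "http://ztt.fh-worms.de/lgeorgieff/CAiL/"}

# Assumed mapping of fileMode values to open() modes:
# "w" truncates, "a" appends or creates, "x" fails if the file exists.
FILE_MODES = {"override": "w", "append": "a", "create": "x"}

# Shortened copy of the sample configuration (illustration only).
SAMPLE = """<config xmlns="http://ztt.fh-worms.de/lgeorgieff/CAiL/">
  <search>
    <seed>
      <page>http://en.wikipedia.org/wiki/Main_Page</page>
      <query>language</query>
    </seed>
    <languages><use lang="en"/></languages>
    <maxRequests value="500"/>
  </search>
  <translate>
    <offline>
      <pageWindow value="3"/>
      <language source="EN" targets="DE FR IT PL ES"/>
    </offline>
  </translate>
  <synonymFinder>
    <ignoreCaseOfTranslations value="false"/>
    <minLanguageMatches value="2"/>
  </synonymFinder>
  <resultWriter>
    <resultingFile>
      <path value="./resultFile.txt"/>
      <fileMode value="append"/>
    </resultingFile>
  </resultWriter>
</config>"""


def read_settings(xml_text):
    """Extract a few settings from the configuration into a plain dict."""
    root = ET.fromstring(xml_text)

    def value(path):
        # Most settings are stored in the "value" attribute of an element.
        return root.find(path, NS).get("value")

    language = root.find("c:translate/c:offline/c:language", NS)
    return {
        "seed_pages": [p.text for p in root.findall("c:search/c:seed/c:page", NS)],
        "queries": [q.text for q in root.findall("c:search/c:seed/c:query", NS)],
        "use_languages": [
            u.get("lang") for u in root.findall("c:search/c:languages/c:use", NS)
        ],
        "max_requests": int(value("c:search/c:maxRequests")),
        "page_window": int(value("c:translate/c:offline/c:pageWindow")),
        "source": language.get("source"),
        "targets": language.get("targets").split(),
        "ignore_case": value("c:synonymFinder/c:ignoreCaseOfTranslations") == "true",
        "min_language_matches": int(value("c:synonymFinder/c:minLanguageMatches")),
        "result_path": value("c:resultWriter/c:resultingFile/c:path"),
        "open_mode": FILE_MODES[value("c:resultWriter/c:resultingFile/c:fileMode")],
    }


if __name__ == "__main__":
    for key, setting in read_settings(SAMPLE).items():
        print(key, "=", setting)
```

To read the real file instead of the embedded sample, the same lookups can be applied to ET.parse("config.xml").getroot().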
1: http://www.eclipse.org/downloads/
2: http://pydev.org/
3: http://www.bing.com/translator/
4: http://www.dict.cc/
5: http://www.dict.cc/?s=about%3Awordlist
6: http://www.microsoft.com/web/post/using-the-free-bing-translation-apis