Linking Language to Knowledge with Distributional Semantics
JoBimText is a software solution for automatic text expansion using contextualized distributional similarity. It provides text analysis tools for large corpora and can create distributional semantic models (JoBimText models) and multi-word expressions.
An Arabic Corpora Processing Tool
The new version is available at https://sourceforge.net/projects/ghawwasv4/
wordTabulator is a text analysis program. It generates an index of word elements extracted from a defined set of texts. Word elements may be words, N-grams (of a defined size), or phrases (syntagms). The program can process texts both in ordinary 1-byte encodings (ANSI) and in multibyte UTF-8. Source texts are defined as a set of flat text files or HTML/XML/SGML documents; in the latter case the program can filter the content from the markup. You can also restrict processing to the content within selected paired tags, or exclude that content from processing. As an additional feature, you can analyse a pair of text sets and compare them by their common or differing elements. The output word index may be generated in HTML format, containing the frequency of each text element and links to the original content, or as a flat text file. Words in the index are ordered alphabetically, by value, or by frequency.
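The indexing idea described above can be sketched in a few lines of Python. This is only an illustration, not wordTabulator's own code: the function names (`extract_elements`, `build_index`, `ordered`, `compare`) and the simple `\w+` tokenization are assumptions made for the example.

```python
import re
from collections import Counter

def extract_elements(text, n=1):
    """Return the word n-grams of size n found in text, lowercased."""
    words = re.findall(r"\w+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_index(texts, n=1):
    """Count how often each n-gram occurs across a set of texts."""
    index = Counter()
    for text in texts:
        index.update(extract_elements(text, n))
    return index

def ordered(index, by="alphabet"):
    """Order index entries alphabetically or by descending frequency."""
    if by == "frequency":
        return sorted(index.items(), key=lambda item: (-item[1], item[0]))
    return sorted(index.items())

def compare(index_a, index_b):
    """Return (common, only in A, only in B) element sets for two indexes."""
    a, b = set(index_a), set(index_b)
    return a & b, a - b, b - a
```

Calling `build_index(texts, n=2)` would index word bigrams instead of single words; `compare` corresponds to the comparison of two text sets by common or differing elements.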
AzConvert is an open source program that converts between the different scripts of the Azerbaijani language (Latin, Arabic and Cyrillic). It is written with Qt.
A corpus containing more than 1 M distinct Arabic words.
This project was developed as part of a master's thesis titled "Edit Distance Adapted to Natural Language Words". The project consists of three parts. First, a corpus of more than one million distinct Arabic words. Second, the text files of the Arabic resources. Third, an index file that gives some information about these resources. Additional details about these parts are available in the README file.
Kana no quiz is a small educational tool for memorizing the transcription and pronunciation of Japanese kana (katakana & hiragana), presented as a quiz. It is written in Python and uses a GTK+ interface for nice cross-platform rendering!
Part of Speech tagger.
a pronunciation dictionary of American English
This dictionary can be searched in reverse by pronunciation or normally by spelling. It uses CMUdict* as its data source. Pronunciation is transcribed in IPA** symbols. The program runs only on Windows. * http://www.speech.cs.cmu.edu/cgi-bin/cmudict ** http://en.wikipedia.org/wiki/International_Phonetic_Alphabet BRAND NEW VERSION 0.3 RELEASED! Warning: Despite best efforts, this dictionary is not flawless; some dictionary entries contain errors. Also, the marking of stressed syllables does not always work perfectly, but it has been improved in this version (based on a set of possible syllable onsets). GitHub repo: https://github.com/JiriVaclavik/PronunDict NOTE: If you have anything to say about this project, please add a review. Thank you.
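As a rough illustration of searching in both directions, the sketch below builds two lookup tables from lines in CMUdict's plain-text layout (a word followed by its ARPAbet phonemes). It is not the project's code: the `load` function and the four sample entries are assumptions for the example, and the phonemes are kept in ARPAbet rather than converted to IPA as the program does.

```python
import re
from collections import defaultdict

# Small illustrative sample in CMUdict's text layout (assumed for this sketch).
SAMPLE = """\
;;; comment lines start with three semicolons
READ  R EH1 D
READ(1)  R IY1 D
RED  R EH1 D
REED  R IY1 D
"""

def load(lines):
    """Build spelling -> pronunciations and pronunciation -> spellings tables."""
    by_spelling = defaultdict(list)
    by_pronunciation = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue
        word, phonemes = line.split(None, 1)
        word = re.sub(r"\(\d+\)$", "", word)  # drop variant markers such as "(1)"
        pron = tuple(phonemes.split())
        by_spelling[word].append(pron)
        if word not in by_pronunciation[pron]:
            by_pronunciation[pron].append(word)
    return by_spelling, by_pronunciation

by_spelling, by_pronunciation = load(SAMPLE.splitlines())
```

A normal search is then `by_spelling["READ"]`, and a reverse search is `by_pronunciation[("R", "EH1", "D")]`, which lists every spelling that shares that pronunciation.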
Collector and manager of semiotic analysis data
This program is a web application for collecting and organizing text analysis data. It works with sets of texts, and the analyses are performed on sentence-length portions. One of the preprocessing modules is based on CoGroo (A LibreOffice & OpenOffice.org Portuguese Grammar Checker).
A parallel corpora (bitext) aligning tool. Create TMX databases
(Full support available at superalign.sourceforge.net) Features: aligning parallel corpora; creating TMX, CSV, and tab-delimited TMs; automatic aligning of text; super fast handling of multiple files; very easy GUI handling of files under Windows; CAT tool assistant.
A Python module for EuroWordNet files and data.
A portable, platform-independent, open source tool for converting different Kurdish scripts
SuperAlign was fully updated as of 15 July 2013 and is now also released under the name eAlign. A parallel corpora (bitext) aligning tool. Create TMX databases and align translations for Translation Memory databases. Use multiple files in multiple formats to align them with their translations. The full workflow is built into the GUI. SuperAlign-eAlign uses the hunalign algorithm.
Unicode Conversion Gateway is a web-based proxy server that converts some Indian-language web pages from proprietary encodings into Unicode. Padma, a popular Firefox extension, was extended and reimplemented in PHP to create this proxy server.
Simply convert your PDF files into audio books
Summary: Are your eyes tired of reading ebooks on a tablet or cell-phone screen? Do you have difficulty reading from an LCD screen, especially in a moving vehicle? This software is for you! It converts your PDF files to MP3 audio books. Special features (compared to similar projects): each page is in a separate MP3 file; the created MP3 files have ID3v2 tags showing the book name and page number; conversion is multi-threaded, so all CPU cores are used and conversion is several times faster.
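The page-per-file, multi-worker layout described above might look roughly like the following Python sketch. Everything here is hypothetical: `synthesize` is a stand-in that returns fake audio bytes instead of calling a real text-to-speech engine, the `page_0001.mp3` naming is invented for the example, and PDF text extraction and ID3v2 tagging are left out.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def synthesize(page_text):
    """Hypothetical stand-in for a text-to-speech call; returns fake audio bytes."""
    return page_text.encode("utf-8")

def convert_page(job):
    """Convert one (page number, text) pair into a (file name, audio) pair."""
    number, text = job
    return f"page_{number:04d}.mp3", synthesize(text)

def convert_book(pages, workers=None):
    """Convert all pages concurrently, one output file per page, in page order."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_page, enumerate(pages, start=1)))
```

In CPython, threads help mainly when the speech engine runs outside the interpreter (an external process or native library); for pure-Python CPU-bound work, `ProcessPoolExecutor` from the same module would be the usual choice.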
The JINSECT toolkit is a Java-based toolkit and library that supports and demonstrates the use of n-gram graphs within Natural Language Processing applications, ranging from summarization and summary evaluation to text classification and indexing.
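For readers unfamiliar with n-gram graphs, here is a deliberately simplified Python sketch of the idea: character n-grams become nodes, and an edge links two n-grams that occur within a small window of each other, weighted by how often that happens. This is not the JINSECT API; the `ngram_graph` function, its parameters, and the choice of directed edges are assumptions made for illustration.

```python
from collections import Counter

def ngram_graph(text, n=3, window=2):
    """Map (ngram, following ngram) edges to their co-occurrence counts."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        # Link each n-gram to the next `window` n-grams that follow it.
        for neighbour in grams[i + 1:i + 1 + window]:
            edges[(gram, neighbour)] += 1
    return edges
```

Two texts can then be compared by looking at the edges their graphs share, which is the kind of comparison that summary evaluation and classification build on.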
A simulation package for investigating the dynamics of complex controversy.
Core program and associated utilities for building a machine translation system using the Example-Based paradigm, where previously-translated text is used to infer new translations of previously-unseen text.
No more support for this project - TAKE A LOOK AT FALCONSEARCH: https://sourceforge.net/projects/falcontextsearch/
Projects with GPL licensed resources
This project contains projects that depend on other projects/libraries licensed under the GPL.
Pes is a simple programming language. It has just a few basic constructs and a C-like and Pascal-like syntax. The compiler output is Jasmin, an assembler for the Java virtual machine. It was my school project.
This is a fast C implementation of Arturo Camacho's SWIPE' pitch extraction algorithm. See the project homepage for more about the advantages of the SWIPE' algorithm. swipe-1.0.tar.gz contains the current source, which should compile quite neatly.
Classify any two TXT documents, no training required - JAVA
This program addresses the two most common issues with known classification algorithms: first, over-training, and second, a shortage of data for training categories. Instead of being assigned to a category, each TXT file is a category of its own. In a way this is similar to clustering, but it is not really a clustering algorithm, since some training is involved. The summarizer from Classifier4J has been adjusted to accept two inputs (let's call them A and B). The summarizer is trained with A to summarize document B, and vice versa. This extracts a relevant structure for both documents (and thus avoids over-training); the structures are then compared using vector-space analysis to give a degree of belonging of one document to the other (and thus avoids the shortage of information). The method can also be used to create user-defined classes by merging texts of certain categories and then calculating the relevant distances between the documents, but this is not necessary.
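The vector-space comparison step can be illustrated with a short Python sketch. It assumes cosine similarity over raw term counts, which is one common form of vector-space analysis; the description does not say which measure the program actually uses. The Classifier4J summarization step is not reproduced, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Represent a text as a bag-of-words vector of lowercase term counts."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With non-negative counts the score lies between 0 (no shared terms) and 1 (identical term distributions), which matches the idea of a degree of belonging of one document to another.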
Java API and tools for performing NLP and other AI tasks
Java API and tools for performing a wide range of AI tasks such as: word sense disambiguation (released), optimization (5 evolutionary algorithms implemented, ETA February 2014), opinion mining (ETA November 2014) and text wikification (ETA July 2014). Gannu includes some graphical interfaces for scientific purposes. When using Gannu please cite: Jiménez, F. V., Gelbukh, A. F. & Sidorov, G. (2013). Simple Window Selection Strategies for the Simplified Lesk Algorithm for Word Sense Disambiguation. In F. Castro, A. F. Gelbukh & M. González (eds.), MICAI (1) (pp. 217-227). Springer. ISBN: 978-3-642-45113-3 The zip file contains the Gannu jar, source, API documentation and the resources needed for performing research. Gannu uses the following projects: Weka, JExcel API, Stanford POS Tagger and WordNet. Please cite them when using Gannu.
dual grammar translation project
The aim of the neurotranslator compiler is to be a powerful and natural translator language. Several output streams (files or stdout) can be used through grammar channels. It is suited to the following uses: data extraction, file or format conversion/encoding/decoding, ... Note that neurotranslator is not based on the XBNF standard. Please leave a review at: http://neurotranslator.sourceforge.net
To build this project:
    wget https://sourceforge.net/projects/neurotranslator/files/nt_Linux-x86_64.2016-06-09.tar.gz
    gzip -dc nt_Linux-x86_64.2016-06-09.tar.gz | tar -xvf -
    cd git
    make
To install the /usr/bin/nt command, as root, run:
    make install
To perform a test:
    nt -i SAMPLES/logic.txt -o - SAMPLES/logic.xbnf
You have now translated a dual BNF grammar to stdout.