Showing 43 open source projects for "english corpus"

  • 1
    LF Aligner helps translators create translation memories from texts and their translations. It relies on Hunalign for automatic sentence pairing. Input: txt, doc, docx, rtf, pdf, html. Output: tab delimited txt, TMX and xls. With web features. My email address is listed in readme.txt; for support, use the forum here. My personal website: www.farkastranslations.com.
    Downloads: 171 This Week
  • 2
    TXM

    Unicode-XML-TEI text/corpus analysis platform

    TXM is a free and open-source cross-platform Unicode & XML based text/corpus analysis environment and graphical client, supporting Windows, Linux and Mac OS X. It can also be used online as a J2EE standard compliant web portal (GWT based) with access control built in. Download the latest version of TXM: http://textometrie.ens-lyon.fr/spip.php?rubrique61&lang=en. TXM offers a comprehensive range of analysis tools (concordances, collocate search, frequency lists, etc.) based on the powerful...
    Downloads: 20 This Week
  • 3
    TEXminer

    Text Mining Classification for Texts in ASCII, Unicode and PDF Format.

    ... is not designed to have a Reference Corpus; Thematic Model Statistics uses Language Models (lexicons) to provide Background Knowledge about certain languages (English, German, French, Spanish, Italian, Russian), which are derived from the Decaleon Project. The Thematic Models for Standard Vocabulary were extended in spring 2015. The Thematic Models for Technical Terms were extended in autumn 2015. The Thematic Models for additional Standard Vocabularies were extended from 2015 to 2019.
    Downloads: 2 This Week
  • 4

    Linguistic Analyzer

    The Linguistic Analyzer is a tool for corpus analysis and comparison

    The Linguistic Analyzer (Almuhalil Alloghawy) is a free tool designed by a team from Al-Imam Muhammad bin Saud Islamic University. It can be used for corpus analysis and comparison across several linguistic features, such as frequency list generation, concordances, collocation extraction, the difference between two words, and keyword identification.
    Downloads: 4 This Week
  • 5

    DWDS/Dialing Concordance

    a collection of indexing and search tools for corpus linguists

    DWDS/Dialing Concordance (DDC) - a collection of index and search tools for corpus linguists
    Downloads: 23 This Week
  • 6
    iramuteq
    IRAMUTEQ (Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires): an R interface for multidimensional analyses of texts and questionnaires. Data-processing software for text corpora or for individuals/characters-type data. In particular, it can perform "ALCESTE"-type analyses.
    Downloads: 720 This Week
  • 7

    Arabic Corpus

    Text categorization, arabic language processing, language modeling

    The Arabic Corpus, compiled by Dr. Mourad Abbas (http://sites.google.com/site/mouradabbas9/corpora). The Khaleej-2004 corpus contains 5690 documents divided into 4 topics (categories). The Watan-2004 corpus contains 20291 documents organized into 6 topics (categories). Researchers who use these two corpora should cite the two main references: (1) For the Watan-2004 corpus: M. Abbas, K. Smaili, D. Berkani (2011), Evaluation of Topic Identification Methods...
    Downloads: 6 This Week
  • 8
    concordia

    Powerful search library, best suited for computer-aided translation

    Concordia - Roman goddess of agreement. Concordance searcher - tool for translators who need their translations to "agree" with one standard. Concordia is a C++ library for fast text lookup in large corpora. It uses a RAM stored index, which takes up approximately 600MB of memory for a corpus of 2 million sentences. It is based on the idea of a suffix array, enhanced by the presence of other auxiliary data structures. The effects are stunning - Concordia is able to do simple substring...
    Downloads: 0 This Week
  • 9
    QJDicExample

    QJDicExample is an English <-> Japanese dictionary.

    QJDicExample is a Japanese-to-English and English-to-Japanese dictionary featuring search by words, names, kanji, and sentences. QJDicExample uses the JMdict, JMnedict, Kanjidic2, Radkfilex, KanjiVG, and Tanaka Corpus / Tatoeba databases for translations, and the zinnia recognition library for handwritten kanji recognition. Latest source code: git clone git://git.code.sf.net/p/qjdicexample/code qjdicexample-code
    Downloads: 0 This Week
  • 10
    Corpus Toolkit

    A text management tool for linguistic purposes...

    Downloads: 0 This Week
  • 11
    The English-Vietnamese Bilingual Corpus (EVBCorpus) is a collection of English and Vietnamese parallel translations and bitexts.
    Downloads: 1 This Week
  • 12

    texrex

    Web corpus creation software (moved to GitHub)

    This project has moved to GitHub: https://github.com/rsling/texrex and https://github.com/rsling/cow
    Downloads: 0 This Week
  • 13
    ICE Nigeria

    Nigerian component of the International Corpus of English

    This is the Nigerian component of the International Corpus of English, a one million word corpus of written and spoken Nigerian English for linguistic research. It can be used as a stand-alone corpus or in conjunction with other components of the International Corpus of English (such as ICE-GB, ICE-India, etc.) to compare international varieties of English. This is the first release of the complete corpus. The corpus can be downloaded in several parts. The written part can be downloaded...
    Downloads: 5 This Week
  • 14
    Osman Arabic Text Readability

    Open Source tool for Arabic text readability

    ... All the readability metrics mentioned in Section \ref{calcRead} are included in the open-source code. They all work with vocalised and non-vocalised text, but based on our results presented here we recommend adding the diacritics by using the addTashkeel() method. See the Files section for the vocalised version of the UN Arabic-English parallel paragraphs.
    Downloads: 0 This Week
  • 15
    The AFEWC corpus is a multilingual collection of comparable text articles in Arabic, French, and English. Each triple of articles covers the same topic (aligned at the article level). The AFEWC corpus is collected from Wikipedia and is available free of charge for research purposes only. It is composed of 40K aligned articles, 91.3M English words, 57.8M French words, 22M Arabic words, 2.8M English unique words, 1.9M French unique words, and 1.5M Arabic unique words. Wikipedia text is available...
    Downloads: 1 This Week
  • 16

    mwetoolkit

    THIS PROJECT MIGRATED TO https://gitlab.com/mwetoolkit/mwetoolkit3/

    The Multiword Expressions toolkit aids in the automatic identification and extraction of multiword units in running text. These include idioms (kick the bucket), noun compounds (cable car), phrasal verbs (take off, give up), etc. Even though it focuses on multiword expressions, the framework is quite complete and can also be useful in any corpus-based study in computational linguistics. The mwetoolkit can be applied...
    Downloads: 0 This Week
  • 17

    Natural Language Analysis with Ngrams

    NLP tool for statistical analysis of words, sentences, documents

    ... will JAR it once I decide that it can be called a final release. This project was made by creating a corpus from the Google Ngrams data for the English language, version 20120701. The EOWL list of English words was used to filter out words from the Ngrams data. For each year and each word, the data was summed and used to calculate the average appearance of the word per document for that year. Before using this program, you MUST download the corpus.
    Downloads: 0 This Week
  • 18
    LeaP corpus

    A phonological corpus of learner English and learner German

    The LeaP corpus is a phonologically annotated corpus that comprises spoken language produced by 46 learners of English and 55 learners of German as well as recordings with 4 native speakers of English and 7 native speakers of German. In total, it consists of 12 hours of speech and was collected at the University of Bielefeld (Germany) between 2001 and 2003 as part of the LeaP (Learning Prosody in a Foreign Language) project, which investigated the acquisition of prosody by second language...
    Downloads: 3 This Week
  • 19
    DisMo

    A POS, disfluency and multi-word unit annotator for spoken language

    DisMo is a part-of-speech, disfluency and multi-word unit automatic annotator. It is designed to manage the complexities and phenomena specific to spoken language. It currently supports English and French, with support for more languages coming soon. It is developed and maintained by George Christodoulides (Centre Valibel, IL&C, University of Louvain, Louvain-la-Neuve, Belgium). Visit www.corpusannotation.org to find out more about DisMo and other annotation tools for language corpora...
    Downloads: 0 This Week
  • 20

    Transml

    Phrase-based Statistical Machine Translation system for English Language

    This software translates English to Malayalam and vice versa. Statistical Machine Translation (SMT) is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. SMT is a corpus-based approach, so a massive parallel corpus is required for training SMT systems.
    Downloads: 0 This Week
  • 21
    zkanji - Japanese Language Study Suite

    Japanese vocabulary and kanji study tool with built-in dictionary

    zkanji is a feature-rich Japanese language study suite and dictionary for Windows. It has several kanji look-up methods, optional example sentences for many Japanese words, vocabulary printing, JLPT levels indicated for words and kanji at all N levels, a spaced-repetition system for studying, and more. Visit http://zkanji.sourceforge.net for details.
    Downloads: 62 This Week
  • 22
    CorpusSearch finds syntactic structures in a corpus of annotated sentence trees. It can be used as a research tool on a corpus, or as a development tool for building the corpus.
    Downloads: 20 This Week
  • 23

    English-Khmer S. Machine Translation

    English-Khmer Automatic Statistical Machine Translation (SMT)

    The Automatic Machine Translation from English to Khmer project is the first effort in the natural language processing field to translate English into the Khmer (Cambodian) language. This project uses Domy CE, an open source SMT toolkit, for training on a parallel corpus, and web technologies such as Python, Apache2, HTML, XML, and XSLT for developing a web-based application. This project is developed by Ms. Kim Sokphyrum (DU) and Ms. Suos Samak (Jamia), under the supervision of Mr. Javier Sola, a Program Manager...
    Downloads: 0 This Week
  • 24

    Khmer Automatic Translation

    Khmer-English-Khmer Automatic Translation

    The project attempts to develop a high-quality hybrid English-Khmer-English automatic translation system that is built on a parallel corpus, based on statistical analysis, and enhanced with part-of-speech analysis.
    Downloads: 0 This Week
  • 25

    Corpora of Misspellings

    Corpora with misspellings marked

    This is a project for creating corpora with misspellings marked and the correct word given. Example use could be for testing spell checkers.
    Downloads: 0 This Week