corpora free download

Showing 24 open source projects for "corpora"

1

IMS Open Corpus Workbench

Indexing and query tools for very large text corpora

The IMS Open Corpus Workbench is a collection of tools for managing and querying large text corpora (100 M words and more) with linguistic annotations. Its central component is the flexible and efficient query processor CQP, which can be used interactively in a terminal session, as a backend (e.g. from a Perl script), or through the Web-based GUI CQPweb.

Downloads: 33 This Week

Last Update: 2026-05-20
See Project
2

Tokenized Text Aligner

Aligns tokens in two versions of a text with differing tokenization.

This tool performs token-by-token alignment of two versions of a text with differing tokenization by interpreting the results of a file diff (https://docs.python.org/3/library/difflib.html). It is intended for use in the preparation of annotated linguistic corpora, where differences in tokenization may arise (i) following corrections or modifications to the source text or (ii) through the creation of different layers of annotation (part-of-speech, treebank) requiring different tokenization. In its default implementation, it produces a human-readable CSV table associating tokens in text A with tokens in text B, and can also inject token-level annotation from text B to text A. ...
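The diff-based alignment described above can be sketched with Python's standard difflib module. This is a minimal illustration, not the tool's actual code: the function name align_tokens and the grouping policy (identical runs align one-to-one, differing spans are kept together as one group) are assumptions made for the example.

```python
from difflib import SequenceMatcher


def align_tokens(tokens_a, tokens_b):
    """Align two tokenizations of the same text.

    Returns a list of (group_a, group_b) pairs, where each group is a
    list of tokens. Hypothetical helper, written for illustration only.
    """
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical runs align token by token.
            for tok_a, tok_b in zip(tokens_a[i1:i2], tokens_b[j1:j2]):
                pairs.append(([tok_a], [tok_b]))
        else:
            # replace / delete / insert: keep the differing spans
            # together as a single many-to-many group.
            pairs.append((tokens_a[i1:i2], tokens_b[j1:j2]))
    return pairs


text_a = ["I", "do", "n't", "know", "."]
text_b = ["I", "don't", "know", "."]
for group_a, group_b in align_tokens(text_a, text_b):
    print(" ".join(group_a), "<->", " ".join(group_b))
```

In this sketch the split form "do n't" in text A is grouped against the single token "don't" in text B, which is the kind of row a token-level alignment table would need before annotations can be copied from one version to the other.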

Downloads: 0 This Week

Last Update: 2026-02-06
See Project
3

TXM

Unicode XML TEI text analysis platform

TXM is a free and open-source, cross-platform, Unicode- and XML-based text analysis environment and graphical client that supports Windows, Linux and Mac OS X. It can also be used online as a J2EE standard-compliant web portal (GWT-based) with built-in access control. Download the latest version of TXM: http://textometrie.ens-lyon.fr/spip.php?rubrique61&lang=en TXM offers a comprehensive range of analysis tools (concordances, collocate search, frequency lists, etc.) based on the powerful CQP...

Downloads: 5 This Week

Last Update: 2024-12-09
See Project
4

JoBimText

Linking Language to Knowledge with Distributional Semantics

JoBimText is a software solution for automatic text expansion using contextualized distributional similarity. It provides text analysis tools for large corpora and can create distributional semantic models (JoBimText models) and multi-word expressions.

Downloads: 0 This Week

Last Update: 2022-08-04
See Project
5

Linguistic Analyzer

The Linguistic Analyzer is a tool for corpus analysis and comparison

The Linguistic Analyzer (Almuhalil Alloghawy) is a free tool designed by a team from Al-Imam Muhammad bin Saud Islamic University. It can be used for corpus analysis and comparison in terms of several linguistic characteristics, such as frequency list generation, concordances, collocation extraction, the difference between two words, and keyword identification.

Downloads: 0 This Week

Last Update: 2022-04-16
See Project
6

UnsupervisedMT

Phrase-Based & Neural Unsupervised Machine Translation

Unsupervised Machine Translation is a research repository that implements both phrase-based SMT and neural MT approaches for translation without parallel corpora. The neural component supports multiple architectures—seq2seq, biLSTM with attention, and Transformer—and allows extensive parameter sharing across languages to improve data efficiency. Training relies on denoising auto-encoding and back-translation, with on-the-fly, multithreaded generation of synthetic parallel data to continually refresh supervision signals. ...

Downloads: 0 This Week

Last Update: 6 days ago
See Project
7

Arabic Corpus

Text categorization, Arabic language processing, language modeling

...More useful references to check: https://sites.google.com/site/mouradabbas9/corpora

Downloads: 4 This Week

Last Update: 2019-03-05
See Project
8

concordia

Powerful search library, best suited for computer-aided translation

...This project now contains a fully functional Concordia search library. In the near future, it will be extended with concordia-server: a lightweight, robust web server providing corpus search functionality.

Downloads: 0 This Week

Last Update: 2019-02-28
See Project
9

Ghawwas_V4

An open source system for Arabic corpora processing

Ghawwas (previously known as Khawas) is an open source system for Arabic corpora processing. Ghawwas V4.0 provides the following main functions:
a. Frequency list for single words and N-grams
b. Concordance
c. Collocation (MI, Chi-squared, LL, T-score, Z-score, Dice, Log Dice, Weirdness Coefficient)
d. Lexical pattern search
e. Frequency profile comparison of two corpora based on MI, Chi-squared, LL, T-score, Z-score, Dice, Log Dice, Weirdness Coefficient
f.
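Three of the collocation measures named above (MI, T-score and Log Dice) can be sketched from their usual textbook definitions. This is a hedged illustration only: the function and parameter names are assumptions, and nothing here is taken from the Ghawwas source, which may compute these scores differently.

```python
import math


def mutual_information(o11, f1, f2, n):
    """MI = log2(observed / expected), with expected = f1 * f2 / n.

    o11: co-occurrence count of the word pair
    f1, f2: corpus frequencies of the two words
    n: corpus size in tokens
    """
    expected = f1 * f2 / n
    return math.log2(o11 / expected)


def t_score(o11, f1, f2, n):
    """T-score = (observed - expected) / sqrt(observed)."""
    expected = f1 * f2 / n
    return (o11 - expected) / math.sqrt(o11)


def log_dice(o11, f1, f2):
    """Log Dice = 14 + log2(2 * observed / (f1 + f2))."""
    return 14 + math.log2(2 * o11 / (f1 + f2))


# Toy counts chosen so the arithmetic is easy to follow by hand.
print(mutual_information(8, 16, 32, 1024))
print(t_score(16, 16, 32, 1024))
print(log_dice(8, 16, 16))
```

Log Dice does not depend on the corpus size, which makes its scores easier to compare across corpora, whereas MI and T-score both depend on n through the expected frequency.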

1 Review

Downloads: 1 This Week

Last Update: 2018-12-09
See Project
10

poliqarp2

natural language corpora search engine

This project aims at building an efficient indexer and search engine for natural language corpora with multilevel annotations.

Downloads: 0 This Week

Last Update: 2016-12-19
See Project
11

BioC

We describe a simple XML format to share text documents and annotations

...Allows a large number of different annotations to be represented.
Project files contain:
- simple code to hold/read/write data and perform sample processing
- BioC-formatted corpora
- BioC tools that work with BioC corpora
BioC goals:
- simplicity
- interoperability
- broad use
- reuse
There should be little investment required to learn to use a format or a software module to process that format. We are interested in reuse, and we focus on common NLP tasks that are broadly useful for text mining.

Downloads: 7 This Week

Last Update: 2016-08-08
See Project
12

diasim

Dialogue Similarity

Tools for calculating similarity (including lexical and syntactic) between speakers in dialogue, across standard and randomised corpora.

Downloads: 0 This Week

Last Update: 2016-03-31
See Project
13

Cross-Language Computational Linguistics

Cross-language resources

...It is composed of 40K aligned articles, 91.3M English words, 57.8M French words, 22M Arabic words, 2.8M English unique words, 1.9M French unique words, and 1.5M Arabic unique words. Wikipedia text is available under Creative Commons Attribution-ShareAlike 3.0 License. https://en.wikipedia.org/wiki/Wikipedia:About To cite the corpora: M. Saad, D. Langlois, and K. Smaïli. Extracting Comparable Articles from Wikipedia and Measuring their Comparabilities. Procedia - Social and Behavioral Sciences, 95(0):40 – 47, 2013. ISSN 1877-0428.

Downloads: 0 This Week

Last Update: 2015-09-11
See Project
14

mwetoolkit

THIS PROJECT MIGRATED TO https://gitlab.com/mwetoolkit/mwetoolkit3/

The Multiword Expressions toolkit aids in the automatic identification and extraction of multiword units in running text. These include idioms (kick the bucket), noun compounds (cable car), phrasal verbs (take off, give up), etc. Even though it focuses on multiword expressions, the framework is quite complete and can also be useful in any corpus-based study in computational linguistics. The mwetoolkit can be...

1 Review

Downloads: 1 This Week

Last Update: 2019-05-01
See Project
15

Aelius Brazilian Portuguese POS-Tagger

Python, NLTK-based package for shallow parsing of Brazilian Portuguese

...It also includes language resources such as language models, sample texts, and gold standards. Aelius currently offers facilities for POS-tagging and chunking corpora and for outputting annotations in different formats, such as XML in the TEI P5 encoding scheme.

1 Review

Downloads: 0 This Week

Last Update: 2014-11-03
See Project
16

Khawas

An Arabic Corpora Processing Tool

The new version is available at https://sourceforge.net/projects/ghawwasv4/

Downloads: 0 This Week

Last Update: 2014-08-02
See Project
17

Donatus Parsing Tools for Portuguese

Donatus is an ongoing project consisting of Python, NLTK-based tools and grammars for deep parsing and syntactic annotation of Brazilian Portuguese corpora. It includes a user-friendly graphical user interface for building syntactic parsers with the NLTK, providing some additional functionality.

Downloads: 0 This Week

Last Update: 2016-08-28
See Project
18

Hermes Natural Language Processing

A repository of software, documentation and data for NLP

Hermes is a repository of software, documentation and data for NLP. I am currently adding corpora extracted from Wikipedia (mostly in Romance languages).

Downloads: 0 This Week

Last Update: 2013-04-26
See Project
19

Uplug corpus tools

Various tools for creating annotated parallel corpora, including pre-trained tagging and parsing models for various languages, sentence alignment tools and word alignment tools. Uplug also includes a web-based interface for interactive sentence and word alignment, and scripts for indexing and querying parallel corpora using the Corpus Workbench (CWB). Download 'uplug-main' first and then add other packages.

Downloads: 0 This Week

Last Update: 2013-04-29
See Project
20

Gargantua

Fast Unsupervised Sentence Aligner described in "Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora", COLING 2010. News: release 1.0b fixes a bug (release 1.0a is deprecated).

1 Review

Downloads: 1 This Week

Last Update: 2015-10-24
See Project
21

Richextr

A tool for large richly annotated parallel corpora preprocessing and Moses phrase-table extraction.

Downloads: 0 This Week

Last Update: 2015-11-12
See Project
22

CorpSe

CORPSE (CORPus SEarch) is a powerful search engine written in Java. The aim is to provide an efficient implementation of a word level inverted index search with various cool functions that can be used on very large corpora.
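The idea of a word-level inverted index can be sketched in a few lines. CorpSe itself is written in Java and its internals are not described here, so the Python below is only an illustration of the general technique: the names build_index and find_phrase are hypothetical, and a real engine for very large corpora would use compressed on-disk postings rather than in-memory lists.

```python
from collections import defaultdict


def build_index(tokens):
    """Map each lower-cased word to the sorted list of its token positions."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token.lower()].append(position)
    return index


def find_phrase(index, phrase):
    """Return the start positions where the words of `phrase` occur consecutively."""
    words = [word.lower() for word in phrase]
    if not words:
        return []
    # Position sets for the second, third, ... word of the phrase.
    later = [set(index.get(word, ())) for word in words[1:]]
    return [
        start
        for start in index.get(words[0], ())
        if all(start + offset + 1 in positions
               for offset, positions in enumerate(later))
    ]


corpus = "the cat sat on the mat the cat ran".split()
index = build_index(corpus)
print(find_phrase(index, ["the", "cat"]))
```

Because each lookup touches only the postings of the query words, search time depends on how often those words occur and not on the total size of the corpus.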

1 Review

Downloads: 0 This Week

Last Update: 2013-04-26
See Project
23

MedTag - Annotated Corpora

A database of linguistic annotation of medical text (from MEDLINE), including corpora used with ABGene, BioCreative I and II, and the MedPost training corpus.

Downloads: 0 This Week

Last Update: 2014-02-05
See Project
24

NooJ - linguistic engineering developmen

NooJ is used by linguists to describe linguistic phenomena and to apply formalized morphological, syntactic or semantic rules to corpora. It is also used by non-linguists in fields such as psychology, sociology, history and literature studies.

Downloads: 0 This Week

Last Update: 2013-04-22
See Project
