corpora free download

20 projects for "corpora" with 2 filters applied:

Scientific/Engineering BSD Clear Filters & Widen Search

Build Agents and Models on One Platform
Everything you need to build production-ready agents and models. Access 200+ Google and third-party AI models and tools.

Gemini Enterprise Agent Platform is Google Cloud's comprehensive platform for developers to build, scale, govern, and optimize agents and models. Choose from Google's most advanced models and third-party models like Anthropic's Claude Model Family.

Try It Free
Our Free Plans just got better! | Auth0
With up to 25k MAUs and unlimited Okta connections, our Free Plan lets you focus on what you do best—building great apps.

You asked, we delivered! Auth0 is excited to expand our Free and Paid plans to include more options so you can focus on building, deploying, and scaling applications without having to worry about your security. Auth0 now, thank yourself later.

Try free now
1

Tokenized Text Aligner

Aligns tokens in two versions of a text with differing tokenization.

This tool performs token-by-token alignment of two versions of a text with differing tokenization by interpreting the results of a file diff (https://docs.python.org/3/library/difflib.html). It is intended for use in the preparation of annotated linguistic corpora, where differences in tokenization may arise (i) following corrections or modifications to the source text or (ii) through the creation of different layers of annotation (part-of-speech, treebank) requiring different tokenization. In its default implementation, it produces a human-readable CSV table associating tokens in text A with tokens in text B, and can also inject token-level annotation from text B to text A. ...

Downloads: 0 This Week

Last Update: 2026-02-06
See Project
2

JoBimText

Linking Language to Knowledge with Distributional Semantics

JobimText is a software solution for automatic text expansion using contextualized distributional similarity. It provides text analysis tools for large corpora and has capabilities to create distributional semantic models (JoBimText models) and multi-word expressions.

Downloads: 0 This Week

Last Update: 2022-08-04
See Project
3

UnsupervisedMT

Phrase-Based & Neural Unsupervised Machine Translation

Unsupervised Machine Translation is a research repository that implements both phrase-based SMT and neural MT approaches for translation without parallel corpora. The neural component supports multiple architectures—seq2seq, biLSTM with attention, and Transformer—and allows extensive parameter sharing across languages to improve data efficiency. Training relies on denoising auto-encoding and back-translation, with on-the-fly, multithreaded generation of synthetic parallel data to continually refresh supervision signals. ...

Downloads: 0 This Week

Last Update: 1 day ago
See Project
4

concordia

Powerful search library, best suited for computer-aided translation

...This project now contains fully functional Concordia search library. In the near future, it will be extended by concordia-server: ligthweight, robust web server providing corpora search functionalities

Downloads: 0 This Week

Last Update: 2019-02-28
See Project
Forever Free Full-Stack Observability | Grafana Cloud
Our generous forever free tier includes the full platform, including the AI Assistant, for 3 users with 10k metrics, 50GB logs, and 50GB traces.

Built on open standards like Prometheus and OpenTelemetry, Grafana Cloud includes Kubernetes Monitoring, Application Observability, Incident Response, plus the AI-powered Grafana Assistant. Get started with our generous free tier today.

Create free account
5

rcqp

R interface to the Corpus Query Protocol

Implements the Corpus Query Protocol as a package for the R statistical environment. It allows to query linguistic corpora and manipulate the data as native R objects. It is based on the CWB software.

Downloads: 0 This Week

Last Update: 2018-03-13
See Project
6

poliqarp2

natural language corpora search engine

This project aims at building an efficient indexer and search engine for natural language corpora with multilevel annotations.

Downloads: 0 This Week

Last Update: 2016-12-19
See Project
7

diasim

Dialogue Similarity

Tools for calculating similarity (including lexical and syntactic) between speakers in dialogue, across standard and randomised corpora.

Downloads: 0 This Week

Last Update: 2016-03-31
See Project
8

mwetoolkit

THIS PROJECT MIGRATED TO https://gitlab.com/mwetoolkit/mwetoolkit3/

THIS PROJECT MIGRATED TO https://gitlab.com/mwetoolkit/mwetoolkit3/ The Multiword Expressions toolkit aids in the automatic identification and extraction of multiword units in running text. These include idioms (kick the bucket), noun compounds (cable car), phrasal verbs (take off, give up), etc. Even though it focuses on multiword expresisons, the framework is quite complete and can also be useful in any corpus-based study in computational linguistics. The mwetoolkit can be...

1 Review

Downloads: 1 This Week

Last Update: 2019-05-01
See Project
9

EXMARaLDA

EXMARaLDA stands for "Extensible Markup Language for Discourse Annotation". It's a system of concepts, data formats and tools for the computer assisted transcription and annotation of spoken language, and the analysis of spoken language corpora. This project's source code has moved to https://github.com/Exmaralda-Org/exmaralda

Downloads: 0 This Week

Last Update: 2020-05-05
See Project
Custom VMs From 1 to 96 vCPUs With 99.95% Uptime
General-purpose, compute-optimized, or GPU/TPU-accelerated. Built to your exact specs.

Live migration and automatic failover keep workloads online through maintenance. One free e2-micro VM every month.

Try Free
10

Aelius Brazilian Portuguese POS-Tagger

Python, NLTK-based package for shallow parsing of Brazilian Portuguese

...It also includes language resources such as language models, sample texts, and gold standards. Presently, Aelius already offers facilities for POS-tagging and chunking corpora and outputting annotations in different formats, such as in XML in the TEI P5 encoding scheme.

1 Review

Downloads: 0 This Week

Last Update: 2014-11-03
See Project
11

Donatus Parsing Tools for Portuguese

Donatus is an on-going project consisting of Python, NLTK-based tools and grammars for deep parsing and syntactical annotation of Brazilian Portuguese corpora. It includes a user-friendly graphical user interface for building syntactic parsers with the NLTK, providing some additional functionalities.

Downloads: 0 This Week

Last Update: 2016-08-28
See Project
12

Hermes Natural Language Processing

A repository of software, documentation and data for NLP

Hermes is a repository of software, documentation and data for NLP. I am currently adding corpora extracted from Wikipedia (mostrly in Romance languages).

Downloads: 0 This Week

Last Update: 2013-04-26
See Project
13

Uplug corpus tools

Various tools for creating annotated parallel corpora including pre-trained tagging and parsing models for various languages, sentence alignment tools and word alignment tools. Uplug also includes a web-based interface for interactive sentence and word alignment and scripts for indexing and querying parallel corpora using the Corpus Work Bench CWB. Download 'uplug-main' first and then add other packages.

Downloads: 0 This Week

Last Update: 2013-04-29
See Project
14

Richextr

A tool for large richly annotated parallel corpora preprocessing and Moses phrase-table extraction.

Downloads: 0 This Week

Last Update: 2015-11-12
See Project
15

CorpSe

CORPSE (CORPus SEarch) is a powerful search engine written in Java. The aim is to provide an efficient implementation of a word level inverted index search with various cool functions that can be used on very large corpora.

1 Review

Downloads: 0 This Week

Last Update: 2013-04-26
See Project
16

CorpusReader

Enrich and query corpora in the TEI-XML vocabulary. CorpusReader manage very large corpora and corpora containing milestone annotation. It provides tools for enriching corpora with output of linguistic parsers, and for extracting quantitative information

Downloads: 0 This Week

Last Update: 2013-04-18
See Project
17

NooJ - linguistic engineering developmen

NooJ is used by linguists to describe linguistic phenomena and apply the formalized morphological, syntactic or semantic rules to corpora . It is used by non linguists in fields like psychology, sociology, history, literature studies as well.

Downloads: 0 This Week

Last Update: 2013-04-22
See Project
18

The NITE XML Toolkit

The NITE XML Toolkit supports the creation, analysis, and browsing of annotated multimodal, text, or spoken language corpora, and represents both timing and rich linguistic structure. It contains libraries for developers and some end user tools.

Downloads: 0 This Week

Last Update: 2013-04-22
See Project
19

AmiGram

AmiGram is the AMI Graphical Representation and Annotation Module. It is a general-purpose tool for multimodal corpus annotation and allows the time line based annoation of NXT corpora in a layer based environment.

Downloads: 1 This Week

Last Update: 2013-03-08
See Project
20

Crouton

A programming language designed for searching and manipulating tree-structured data, particularly corpora of natural languages encoded in an s-expression-like format.

Downloads: 0 This Week

Last Update: 2013-03-08
See Project