Page 3 | corpora free download

Showing 105 open source projects for "corpora"

1

Arabic Corpus

Text categorization, arabic language processing, language modeling

...More useful references to check: https://sites.google.com/site/mouradabbas9/corpora

Downloads: 4 This Week

Last Update: 2019-03-05
See Project
2

Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation

...The repository is structured as a full training pipeline: dataset preparation, preprocessing into spectrograms, Tacotron training, WaveNet (or Griffin-Lim) vocoder training, and final waveform synthesis. It includes directory layouts and logging directories for multiple datasets such as LJSpeech and M-AILABS en_US/en_UK, making it easier to adapt to new English corpora. Separate log trees track mel-spectrograms, attention plots, evaluation audio, and vocoder outputs, so you can inspect how alignment and audio quality evolve over time.

Downloads: 0 This Week

Last Update: 2025-11-28
See Project
3

concordia

Powerful search library, best suited for computer-aided translation

...This project now contains a fully functional Concordia search library. In the near future, it will be extended by concordia-server: a lightweight, robust web server providing corpora search functionality.

Downloads: 0 This Week

Last Update: 2019-02-28
See Project
4

Queries for OSAC (Arabic) Corpus

43 Queries for Arabic Information Retrieval Collection

...The corpus is created from the OSAC corpus of journalistic texts, consisting of 4763 articles retrieved from Arabic BBC News. https://sourceforge.net/projects/ar-text-mining/files/Arabic-Corpora/

Downloads: 0 This Week

Last Update: 2019-01-07
See Project
5

Ghawwas_V4

An open source system for Arabic corpora processing

Ghawwas (previously known as Khawas) is an open source system for Arabic corpora processing. Ghawwas V4.0 provides the following main functions:
a. Frequency list for single words and N-grams
b. Concordance
c. Collocation (MI, CHI Squared, LL, T-Score, Z Score, Dice, Log Dice, Weirdness Coefficient)
d. Lexical patterns search
e. Two corpora frequency profile comparison based on MI, CHI, LL, T-Score, Z Score, Dice, Log Dice, Weirdness Coefficient
f. ...

1 Review

Downloads: 1 This Week

Last Update: 2018-12-09
See Project
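
The collocation measures named in the Ghawwas description can be illustrated with a minimal sketch. This is not Ghawwas code: it assumes the standard textbook definitions of pointwise MI, Dice, and Log Dice computed from raw frequency counts, and the function and parameter names (`pmi`, `dice`, `log_dice`, `f_xy`, `f_x`, `f_y`, `n`) are hypothetical. Ghawwas itself may use different variants or smoothing.

```python
import math


def pmi(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Pointwise mutual information (base 2).

    f_xy: co-occurrence count of the word pair
    f_x, f_y: individual counts of each word
    n: total number of tokens (or pairs) in the corpus
    """
    return math.log2((f_xy * n) / (f_x * f_y))


def dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Dice coefficient: harmonic-mean style overlap in [0, 1]."""
    return 2 * f_xy / (f_x + f_y)


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Log Dice: 14 + log2(Dice), so the theoretical maximum is 14."""
    return 14 + math.log2(dice(f_xy, f_x, f_y))


if __name__ == "__main__":
    # Toy counts, invented purely for illustration.
    print(pmi(8, 16, 16, 64))
    print(dice(8, 16, 16))
    print(log_dice(8, 16, 16))
```

Unlike MI, Dice and Log Dice do not depend on corpus size, which is one reason a tool might offer several measures side by side.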
6

HipparchiaServer

front end to Hipparchia corpora: searching, browsing, concordances, texts, dictionaries, parsing

Downloads: 0 This Week

Last Update: 2018-06-15
See Project
7

Corpus Manager

Yet another corpus manager. It allows HTTP access to annotated text corpora; the client does not need to install any special software to access the server (any browser with JavaScript support will do).

Downloads: 0 This Week

Last Update: 2017-10-05
See Project
8

Chinese Poetry

The most comprehensive database of Chinese poetry

This repository is a curated collection of Chinese poems and poets organized into catalogs, metadata, and text representations suitable for research, creative and cultural use. It includes major dynastic corpora, such as Tang and Song poems, as well as biographical and categorization data. Each poem entry is structured with fields like author, dynasty, title, content, and sometimes annotations or alternate versions. Developers and scholars can build tools that query by author, era, keyword, or poetic form using the standardized data structure. ...

Downloads: 0 This Week

Last Update: 2025-09-05
See Project
9

Scattertext 0.2.1

Beautiful visualizations of how language differs among document types

A tool for finding distinguishing terms in corpora and displaying them in an interactive HTML scatter plot. Points corresponding to terms are selectively labeled so that they don't overlap with other labels or points.

Downloads: 0 This Week

Last Update: 2024-08-09
See Project
10

BioNLP-Corpora

BioNLP-Corpora is a repository of biomedically and linguistically annotated corpora and biomedical data sources. There are many resources available in separate packages in this project.

Downloads: 1 This Week

Last Update: 2016-11-22
See Project
11

Arabic Stemming Corpora

The corpus contains 81,000 tagged words of Arabic text drawn from the Corpus of Contemporary Arabic (CCA) [1] and Arabic Wikipedia [2], annotated with basic tags (verb, noun, adjective). [1] http://www.comp.leeds.ac.uk/eric/latifa/research.htm. [2] http://ar.wikipedia.org.

Downloads: 0 This Week

Last Update: 2016-12-04
See Project
12

Arabic business corpora

Arabic business and management corpus

This corpus is made up of 3 sub-corpora, as follows: 1) Management Corpus: 400 articles by chairmen and CEOs of Arabic companies in the Middle East. 2) Economics News: 400 news articles from different Arabic online newspapers. 3) Stock Market News: 400 articles collected from investing.com. The main corpus contains 1200 articles. The articles have been tagged using the Stanford Arabic Part of Speech Tagger.

Downloads: 0 This Week

Last Update: 2016-11-01
See Project
13

poliqarp2

natural language corpora search engine

This project aims at building an efficient indexer and search engine for natural language corpora with multilevel annotations.

Downloads: 0 This Week

Last Update: 2016-12-19
See Project
14

BioC

We describe a simple XML format to share text documents and annotations

...Allows a large number of different annotations to be represented.

Project files contain:
- simple code to hold/read/write data and perform sample processing
- BioC-formatted corpora
- BioC tools that work with BioC corpora

BioC goals:
- simplicity
- interoperability
- broad use
- reuse

There should be little investment required to learn to use a format or a software module to process that format. We are interested in reuse, and we focus on common NLP tasks that are broadly useful for text mining.

Downloads: 7 This Week

Last Update: 2016-08-08
See Project
15

Question Answering Corpus

Question answering dataset in "Teaching Machines to Read & Comprehend"

RC-Data is a dataset generation framework created by Google DeepMind to produce large-scale reading comprehension question-answer pairs from CNN and Daily Mail news articles. The dataset, introduced in the 2015 paper “Teaching Machines to Read and Comprehend” (Hermann et al., NIPS 2015), was among the first large corpora designed to train and evaluate machine reading and comprehension models. The repository provides scripts for downloading archived CNN and Daily Mail articles from the Wayback Machine and automatically generating cloze-style questions where entities in the text are replaced with placeholders. Each data instance consists of a news article (context), a generated question, and its corresponding answer, making it suitable for supervised machine learning setups. ...

Downloads: 0 This Week

Last Update: 6 days ago
See Project
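
The cloze-style question generation described for RC-Data can be sketched as follows. This is an illustrative toy, not the repository's script: the function name `make_cloze`, the `@placeholder` token default, and the example sentence are assumptions, and the sketch replaces only the first whole-word match of a single known entity string.

```python
import re
from typing import Optional, Tuple


def make_cloze(sentence: str, entity: str,
               placeholder: str = "@placeholder") -> Optional[Tuple[str, str]]:
    """Turn a sentence into a (question, answer) pair by masking one entity.

    Returns None when the entity does not occur in the sentence.
    Assumes the entity starts and ends with word characters, so that
    the \\b word-boundary anchors behave as expected.
    """
    pattern = re.compile(r"\b" + re.escape(entity) + r"\b")
    if not pattern.search(sentence):
        return None
    # Mask only the first occurrence; the masked entity is the answer.
    question = pattern.sub(placeholder, sentence, count=1)
    return question, entity


if __name__ == "__main__":
    # Invented example sentence, not taken from the dataset.
    print(make_cloze("Alice met Bob in Paris.", "Bob"))
```

In a supervised setup, the article text would serve as the context, the masked sentence as the question, and the removed entity as the answer label.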
16

diasim

Dialogue Similarity

Tools for calculating similarity (including lexical and syntactic) between speakers in dialogue, across standard and randomised corpora.

Downloads: 0 This Week

Last Update: 2016-03-31
See Project
17

TermEvaluator

A tool for evaluating automatic terminology extraction.

...S., Faez F., and Amjadian, E. 2016. International Journal of Computational Linguistics and Applications. This is a tool to evaluate terminology extraction performed on corpora. Human judges can mark each extracted term as correct or incorrect. Already known terms (say, from a dictionary) can be imported and marked automatically as correct, so that the annotators won't have to judge them. The tool offers comparison and analysis among annotators. Annotators can save their progress and resume annotation at any time. ...

Downloads: 0 This Week

Last Update: 2016-04-02
See Project
18

GloVe

GloVe model for distributed word representation

...Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. The links provided contain word vectors obtained from the respective corpora. If you want word vectors trained on massive web datasets, you need only download one of these text files! Pre-trained word vectors are made available under the Public Domain Dedication and License. If the web datasets above don't match the semantics of your end use case, you can train word vectors on your own corpus. The demo.sh script downloads a small corpus, consisting of the first 100M characters of Wikipedia. ...

Downloads: 1 This Week

Last Update: 2021-09-30
See Project
19

WebCorpus

Hadoop framework for scalable processing of large web corpora

WebCorpus is a Hadoop-based framework that enables you to calculate statistics on large web corpora extracted from web crawls.

Downloads: 0 This Week

Last Update: 2016-11-09
See Project
20

Cross-Language Computational Linguistics

Cross-language resources

...It is composed of 40K aligned articles, 91.3M English words, 57.8M French words, 22M Arabic words, 2.8M English unique words, 1.9M French unique words, and 1.5M Arabic unique words. Wikipedia text is available under Creative Commons Attribution-ShareAlike 3.0 License. https://en.wikipedia.org/wiki/Wikipedia:About To cite the corpora: M. Saad, D. Langlois, and K. Smaïli. Extracting Comparable Articles from Wikipedia and Measuring their Comparabilities. Procedia - Social and Behavioral Sciences, 95(0):40 – 47, 2013. ISSN 1877-0428.

Downloads: 0 This Week

Last Update: 2015-09-11
See Project
21

mwetoolkit

THIS PROJECT MIGRATED TO https://gitlab.com/mwetoolkit/mwetoolkit3/

The Multiword Expressions toolkit aids in the automatic identification and extraction of multiword units in running text. These include idioms (kick the bucket), noun compounds (cable car), phrasal verbs (take off, give up), etc. Even though it focuses on multiword expressions, the framework is quite complete and can also be useful in any corpus-based study in computational linguistics. The mwetoolkit can be...

1 Review

Downloads: 1 This Week

Last Update: 2019-05-01
See Project
22

EXMARaLDA

EXMARaLDA stands for "Extensible Markup Language for Discourse Annotation". It's a system of concepts, data formats and tools for the computer assisted transcription and annotation of spoken language, and the analysis of spoken language corpora. This project's source code has moved to https://github.com/Exmaralda-Org/exmaralda

Downloads: 0 This Week

Last Update: 2020-05-05
See Project
23

DeSR

DeSR is a multilingual statistical dependency parser. It produces dependency parse trees for natural language sentences using a parsing model learned from annotated corpora.

Downloads: 0 This Week

Last Update: 2014-11-04
See Project
24

Aelius Brazilian Portuguese POS-Tagger

Python, NLTK-based package for shallow parsing of Brazilian Portuguese

...It also includes language resources such as language models, sample texts, and gold standards. Aelius currently offers facilities for POS-tagging and chunking corpora and for outputting annotations in different formats, such as XML in the TEI P5 encoding scheme.

1 Review

Downloads: 0 This Week

Last Update: 2014-11-03
See Project
25

Khawas

An Arabic Corpora Processing Tool

The new version is available at https://sourceforge.net/projects/ghawwasv4/

Downloads: 0 This Week

Last Update: 2014-08-02
See Project

