Page 2 | corpora free download

Showing 49 open source projects for "corpora"

1

FastEdit

Editing large language models within 10 seconds

...It implements practical editing algorithms that insert or revise knowledge with targeted parameter updates, aiming to preserve model quality outside the edited scope. This approach is valuable when you need urgent corrections—think product names, APIs, or fast-changing facts—without retraining on large corpora. The repository provides evaluation harnesses so you can measure locality (does the change stay contained?) and generalization (does the change apply where it should?). It’s structured for repeatable experiments, making side-by-side comparisons of editing methods and hyperparameters straightforward. For applied teams, FastEdit offers a toolbox to keep models current and compliant while minimizing collateral damage to overall performance.

Downloads: 0 This Week

Last Update: 2025-11-10
See Project
2

Metaseq

Repo for external large-scale work

...The framework was used internally at Meta to train models like OPT (Open Pre-trained Transformer) and serves as a reference implementation for scaling transformer architectures efficiently across GPUs and nodes. It supports both pretraining and fine-tuning workflows with data pipelines for text, multilingual corpora, and custom tokenization schemes. Metaseq also includes APIs for evaluation, generation, and model serving, enabling seamless transitions from training to inference.

Downloads: 0 This Week

Last Update: 2025-10-06
See Project
3

GiantMIDI-Piano

Classical piano MIDI dataset

...Because the dataset is machine-generated via an automated transcription pipeline, it offers consistency, scale, and accessibility that would be difficult to achieve manually — enabling researchers to work with large corpora of piano music without copyright restrictions on symbolic data.

Downloads: 2 This Week

Last Update: 2025-12-02
See Project
4

DrQA

Reading Wikipedia to Answer Open-Domain Questions

DrQA is an open-domain question answering system that reads large text corpora—famously Wikipedia—to answer natural language questions with extractive spans. It follows a two-stage pipeline: a fast document retriever first narrows down candidate articles, and a neural machine reader then predicts the exact answer span from those passages. The retriever relies on classic IR features (like TF-IDF and n-gram statistics) to remain lightweight and scalable to millions of documents. ...

Downloads: 0 This Week

Last Update: 2025-10-07
See Project
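
The retrieval stage described for DrQA above can be illustrated with a minimal TF-IDF ranking sketch. This is not DrQA's code: the function names, the smoothed IDF formula, and the toy documents are all assumptions made for illustration, and the neural reader stage is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Return per-document term counts and smoothed inverse document frequencies."""
    term_counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    n = len(docs)
    # Smoothed IDF (an assumed variant): rarer terms get larger weights.
    idf = {t: math.log((n + 1) / (df + 1)) + 1.0 for t, df in doc_freq.items()}
    return term_counts, idf


def retrieve(query, docs, k=1):
    """Rank documents by the summed TF-IDF weight of the query terms."""
    term_counts, idf = build_index(docs)
    query_terms = tokenize(query)
    scores = []
    for i, counts in enumerate(term_counts):
        score = sum(counts[t] * idf.get(t, 0.0) for t in query_terms)
        scores.append((score, i))
    # Highest score first; ties broken by document position.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:k]]


# Toy corpus (hypothetical data).
docs = [
    "Paris is the capital of France.",
    "The Nile is a river in Africa.",
    "Python is a programming language.",
]
top = retrieve("What is the capital of France?", docs, k=1)
```

In a two-stage pipeline, the indices returned by `retrieve` would select the passages handed to the reader model.
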
5

XLM (Cross-lingual Language Model)

PyTorch original implementation of Cross-lingual Language Model

...The repository provides preprocessing pipelines, training code, and fine-tuning scripts so you can reproduce benchmark results or adapt models to your own multilingual corpora. Pretrained checkpoints cover dozens of languages and multiple model sizes, balancing quality and compute needs.

Downloads: 0 This Week

Last Update: 2025-10-07
See Project
6

PyTorch SimCLR

PyTorch implementation of SimCLR: A Simple Framework

...Nowadays, pre-trained Deep Convolutional Neural Networks (DCNNs) are the go-to starting point for learning a new task. These large models are trained on huge supervised corpora, such as ImageNet. Most importantly, their features are known to adapt well to new problems. This is particularly useful when annotated training data is scarce. In such situations, we take the model's pre-trained weights, append a new classifier layer on top, and retrain the network. This is called transfer learning, and it is one of the most widely used techniques in computer vision. ...

Downloads: 0 This Week

Last Update: 2022-08-15
See Project
7

CC-Net

Tools to download and cleanup Common Crawl data

cc_net provides tools to download, segment, clean, and filter Common Crawl to build large-scale text corpora, including monolingual datasets and the multilingual CC-100 collection introduced in the associated paper. It includes pipelines to fetch snapshots, extract text, de-duplicate, identify language, and apply quality filtering based on heuristics and language models. The outputs are intended for pretraining language models and for creating standardized corpora that can be reproduced or updated with new crawls. ...

Downloads: 0 This Week

Last Update: 2025-10-11
See Project
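
The de-duplication step mentioned in the CC-Net description can be sketched as hash-based paragraph filtering. This is a sketch under assumptions, not cc_net's implementation: the normalization rules, the choice of SHA-1, and the function names are illustrative only.

```python
import hashlib
import re


def normalize(paragraph):
    """Lowercase, replace digits and punctuation with spaces, trim the ends."""
    text = paragraph.lower()
    text = re.sub(r"[\d\W_]+", " ", text)
    return text.strip()


def paragraph_key(paragraph):
    """Hash the normalized paragraph so near-identical copies share a key."""
    return hashlib.sha1(normalize(paragraph).encode("utf-8")).hexdigest()


def deduplicate(documents):
    """Drop every paragraph whose normalized form was already seen."""
    seen = set()
    cleaned = []
    for doc in documents:
        kept = []
        for para in doc.split("\n"):
            if not para.strip():
                continue  # skip empty lines
            key = paragraph_key(para)
            if key in seen:
                continue  # duplicate of an earlier paragraph
            seen.add(key)
            kept.append(para)
        if kept:
            cleaned.append("\n".join(kept))
    return cleaned


# Hypothetical pages sharing a boilerplate line.
pages = [
    "Cookie policy.\nUnique text one.",
    "COOKIE POLICY!\nUnique text two.",
]
result = deduplicate(pages)
```

Keeping only a set of hashes rather than the paragraphs themselves keeps memory use bounded by the number of distinct paragraphs, which matters at crawl scale.
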
8

Albedo

A recommender system for discovering GitHub repos

...A reproducible setup and Makefile-driven workflow streamline tasks like spinning up services, loading datasets, training models, and generating candidate lists. Because it’s built around Spark’s scalable primitives, Albedo can experiment on substantial snapshots of GitHub metadata rather than toy corpora. The repo is also educational: it demonstrates a practical end-to-end pipeline from ingestion and feature preparation to training and ranking.

Downloads: 0 This Week

Last Update: 2025-10-16
See Project
9

NLP Best Practices

Natural Language Processing Best Practices & Examples

...Data scientists started moving from traditional methods to state-of-the-art (SOTA) deep neural network (DNN) algorithms which use language models pretrained on large text corpora. This repository contains examples and best practices for building NLP systems, provided as Jupyter notebooks and utility functions. The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language. The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems.

Downloads: 0 This Week

Last Update: 2022-08-01
See Project
10

cocoNLP

A Chinese information extraction tool

cocoNLP is a lightweight natural-language processing toolkit geared toward practical information extraction from raw text, especially for Chinese and mixed Chinese–English content. Instead of requiring a heavy pipeline, it focuses on quick wins such as extracting names, places, organizations, emails, phone numbers, and dates directly from unstructured sentences. The project blends pattern-based methods with NLP heuristics, giving developers dependable results for real-world texts like chats,...

Downloads: 0 This Week

Last Update: 2025-11-05
See Project
11

UnsupervisedMT

Phrase-Based & Neural Unsupervised Machine Translation

Unsupervised Machine Translation is a research repository that implements both phrase-based SMT and neural MT approaches for translation without parallel corpora. The neural component supports multiple architectures—seq2seq, biLSTM with attention, and Transformer—and allows extensive parameter sharing across languages to improve data efficiency. Training relies on denoising auto-encoding and back-translation, with on-the-fly, multithreaded generation of synthetic parallel data to continually refresh supervision signals. ...

Downloads: 0 This Week

Last Update: 5 days ago
See Project
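
The denoising auto-encoding objective mentioned for UnsupervisedMT trains a model to reconstruct a sentence from a corrupted copy. Below is a minimal sketch of one common corruption scheme (word dropout plus a bounded local shuffle). The function name, parameter names, and default values are assumptions, not taken from the repository.

```python
import random


def add_noise(tokens, drop_prob=0.1, max_shift=3, rng=None):
    """Return a corrupted copy of tokens for denoising training."""
    rng = rng or random.Random()

    # Word dropout: remove each token independently with probability drop_prob.
    kept = [t for t in tokens if rng.random() >= drop_prob]
    if not kept and tokens:
        kept = [rng.choice(tokens)]  # never return an empty sentence

    # Local shuffle: sort by position plus a random offset in [0, max_shift + 1],
    # which keeps every token within max_shift places of where it started.
    keys = [i + rng.uniform(0, max_shift + 1) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=lambda i: keys[i])
    return [kept[i] for i in order]


# Example with a fixed seed so runs are repeatable.
sentence = "the quick brown fox jumps over the lazy dog".split()
noisy = add_noise(sentence, drop_prob=0.1, max_shift=3, rng=random.Random(0))
```

The training pair is then (`noisy`, `sentence`): the model reads the corrupted tokens and is asked to emit the original ones.
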
12

Arabic Corpus

Text categorization, Arabic language processing, language modeling

...More useful references to check: https://sites.google.com/site/mouradabbas9/corpora

Downloads: 2 This Week

Last Update: 2019-03-05
See Project
13

Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation

...The repository is structured as a full training pipeline: dataset preparation, preprocessing into spectrograms, Tacotron training, WaveNet (or Griffin-Lim) vocoder training, and final waveform synthesis. It includes directory layouts and logging directories for multiple datasets such as LJSpeech and M-AILABS en_US/en_UK, making it easier to adapt to new English corpora. Separate log trees track mel-spectrograms, attention plots, evaluation audio, and vocoder outputs, so you can inspect how alignment and audio quality evolve over time.

Downloads: 0 This Week

Last Update: 2025-11-28
See Project
14

Scattertext 0.2.1

Beautiful visualizations of how language differs among document types

A tool for finding distinguishing terms in corpora and displaying them in an interactive HTML scatter plot. Points corresponding to terms are selectively labeled so that they don't overlap with other labels or points.

Downloads: 0 This Week

Last Update: 2024-08-09
See Project
15

poliqarp2

natural language corpora search engine

This project aims at building an efficient indexer and search engine for natural language corpora with multilevel annotations.

Downloads: 0 This Week

Last Update: 2016-12-19
See Project
16

BioC

We describe a simple XML format to share text documents and annotation

...Allows a large number of different annotations to be represented.

Project files contain:
- simple code to hold/read/write data and perform sample processing
- BioC-formatted corpora
- BioC tools that work with BioC corpora

BioC goals:
- simplicity
- interoperability
- broad use
- reuse

There should be little investment required to learn to use a format or a software module to process that format. We are interested in reuse, and we focus on common NLP tasks that are broadly useful for text mining.

Downloads: 6 This Week

Last Update: 2016-08-08
See Project
17

Question Answering Corpus

Question answering dataset in "Teaching Machines to Read & Comprehend"

RC-Data is a dataset generation framework created by Google DeepMind to produce large-scale reading comprehension question-answer pairs from CNN and Daily Mail news articles. The dataset, introduced in the 2015 paper “Teaching Machines to Read and Comprehend” (Hermann et al., NIPS 2015), was among the first large corpora designed to train and evaluate machine reading and comprehension models. The repository provides scripts for downloading archived CNN and Daily Mail articles from the Wayback Machine and automatically generating cloze-style questions where entities in the text are replaced with placeholders. Each data instance consists of a news article (context), a generated question, and its corresponding answer, making it suitable for supervised machine learning setups. ...

Downloads: 0 This Week

Last Update: 5 days ago
See Project
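
The cloze-style generation described for RC-Data above (entities replaced with placeholders) can be sketched as follows. This is an illustration under assumptions, not the repository's script: the `@entityN` and `@placeholder` marker strings, the function name, and the toy article are hypothetical, and real entity detection is replaced by a hand-written entity list.

```python
def make_cloze(context, highlight, entities, answer):
    """Build one (context, question, answer) triple from a summary sentence."""
    # Assign each entity a stable anonymous id.
    ids = {e: f"@entity{i}" for i, e in enumerate(entities)}
    # Replace longer names first so a short name nested in a longer one is safe.
    ordered = sorted(entities, key=len, reverse=True)

    anon_context = context
    question = highlight
    for e in ordered:
        anon_context = anon_context.replace(e, ids[e])
        # In the question, the answer entity becomes the blank to fill in.
        marker = "@placeholder" if e == answer else ids[e]
        question = question.replace(e, marker)
    return anon_context, question, ids[answer]


# Toy data (hypothetical).
context = "Alice Smith met Bob Jones in Paris. Bob Jones later thanked Alice Smith."
highlight = "Bob Jones thanked Alice Smith"
entities = ["Alice Smith", "Bob Jones", "Paris"]

anon_context, question, answer_id = make_cloze(
    context, highlight, entities, answer="Bob Jones"
)
```

Anonymizing entities in the context as well as the question means a model has to read the article to find the answer rather than rely on what it already knows about a name.
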
18

mwetoolkit

THIS PROJECT MIGRATED TO https://gitlab.com/mwetoolkit/mwetoolkit3/

The Multiword Expressions toolkit aids in the automatic identification and extraction of multiword units in running text. These include idioms (kick the bucket), noun compounds (cable car), phrasal verbs (take off, give up), etc. Even though it focuses on multiword expressions, the framework is quite complete and can also be useful in any corpus-based study in computational linguistics. The mwetoolkit can be...

1 Review

Downloads: 1 This Week

Last Update: 2019-05-01
See Project
19

TextTools

TextTools is a freeware corpus linguistics tool developed in Python to aid in research. This program analyzes user-created corpora and displays information about word (token) frequency, n-grams, clusters, collocations, keyword in context (KWIC), and keyness. TextTools is designed to be user-friendly and intuitive and will run natively on Mac OS X.

Downloads: 0 This Week

Last Update: 2014-09-28
See Project
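
Two of the analyses listed for TextTools above, n-gram frequency and keyword in context (KWIC), can be sketched in a few lines. This is a generic illustration, not TextTools code: the tokenizer, function names, and sample text are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n=2):
    """Count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Sample text (hypothetical).
text = "The cat sat on the mat. The cat slept."
tokens = tokenize(text)
word_freq = Counter(tokens)
bigrams = ngram_counts(tokens, n=2)
concordance = kwic(tokens, "cat", window=2)
```

Here `word_freq` gives token frequencies, `bigrams` gives 2-gram frequencies, and `concordance` lists each occurrence of the keyword with its neighbouring words.
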
20

Aelius Brazilian Portuguese POS-Tagger

Python, NLTK-based package for shallow parsing of Brazilian Portuguese

...It also includes language resources such as language models, sample texts, and gold standards. Presently, Aelius already offers facilities for POS-tagging and chunking corpora and outputting annotations in different formats, such as in XML in the TEI P5 encoding scheme.

1 Review

Downloads: 0 This Week

Last Update: 2014-11-03
See Project
21

wiki export tool

A tool for exporting Wikipedia data

...To use this, you need to prepare two things: the target Wiki dump and the target page name. Then configure the tool with the dump path, enter the page name, and let it run; you will get the target page. The duration depends on the size of the dump and the number of pages. Take care when using the option "talk, other talk...": it currently works only for French corpora.

Downloads: 0 This Week

Last Update: 2014-08-14
See Project
22

TextBlob

TextBlob is a Python library for processing textual data

...Also, it comes with a WordNet integration. If you only intend to use TextBlob’s default models (no model overrides), you can pass the lite argument. This downloads only those corpora needed for basic functionality. TextBlob is also available as a conda package.

Downloads: 0 This Week

Last Update: 2021-07-23
See Project
23

Donatus Parsing Tools for Portuguese

Donatus is an on-going project consisting of Python, NLTK-based tools and grammars for deep parsing and syntactical annotation of Brazilian Portuguese corpora. It includes a user-friendly graphical user interface for building syntactic parsers with the NLTK, providing some additional functionalities.

Downloads: 0 This Week

Last Update: 2016-08-28
See Project
24

WebSynonymExtractor

a synonym extractor based on web-corpora and a multilingual translator

This project is an approach to synonym extraction and to extending WordNet with the synonyms found. The Python application is realised as a pipeline that starts with a web-corpus reader, followed by several workers (tokenizers, lemmatizers, ...), and ends with a result writer. In contrast to state-of-the-art approaches, this implementation is based on single words found on the web, which is used as a corpus, and translated into other languages. If translations of different...

Downloads: 0 This Week

Last Update: 2016-11-18
See Project