corpora free download

Showing 123 open source projects for "corpora"

1

UltraRAG

Less Code, Lower Barrier, Faster Deployment

UltraRAG 2.0 is a low-code, MCP-enabled RAG framework that aims to lower the barrier to building complex retrieval pipelines for research and production. It provides end-to-end recipes—from encoding and indexing corpora to deploying retrievers and LLMs—so users can reproduce baselines and iterate rapidly. The toolkit comes with built-in support for popular RAG datasets, large corpora, and canonical baselines, plus documentation that walks from “quick start” to debugging and case analysis. It encourages pipeline composition via configuration, enabling researchers to swap retrievers, rerankers, and generators without heavy refactoring. ...

Downloads: 5 This Week

Last Update: 2026-04-09
See Project
2

Natural Language Toolkit

NLTK Source

...NLTK was originally developed to support research and teaching in computational linguistics and artificial intelligence, and it has become one of the most influential educational platforms for learning NLP in Python. The project also includes access to numerous linguistic corpora and lexical resources that can be downloaded and used directly in experiments and applications.

Downloads: 1 This Week

Last Update: 2026-06-11
See Project
3

History LLMs

Information hub for our project training the largest possible LLMs

...This approach enables researchers in the humanities and social sciences to explore how people at different historical moments would have discussed world events, norms, and ideas without later developments influencing the model. It contains documentation about model families like Ranke-4B, which are trained from scratch with historical corpora and can act as “aggregate witnesses” to the textual culture of their era.

Downloads: 0 This Week

Last Update: 2026-01-29
See Project
4

Classical Language Toolkit (CLTK)

The Classical Language Toolkit

The Classical Language Toolkit (CLTK) is a Python library offering natural language processing support for classical languages, including Latin, Greek, and others.

Downloads: 3 This Week

Last Update: 2025-05-04
See Project
5

SciSpaCy

A full spaCy pipeline and models for scientific/biomedical documents

ScispaCy is a spaCy extension optimized for processing biomedical and scientific text, providing domain-specific NLP models for tasks like named entity recognition (NER) and dependency parsing.

Downloads: 3 This Week

Last Update: 2025-10-01
See Project
6

shuyuan

Reading book source

...The name suggests “academy” or “study hall,” and the tool aims to help users ingest, organize, and manage reading content — possibly offering features like text parsing, annotation, metadata generation, translation, or storage for later reference. The repository is set up to support document ingestion, indexing, and maybe some AI-aided summarization or lookup functions, which helps users convert large text corpora into a structured, searchable knowledge base. For learners, researchers, or avid readers, Shuyuan offers a way to bridge from plain text files or eBooks into a manageable, interactive resource — one where notes, references, and reading progress can be tracked. It likely supports different input formats (text, HTML, PDF), and may integrate optional translation or text normalization tools.

Downloads: 1 This Week

Last Update: 2025-11-28
See Project
7

LLM Datasets

Curated list of datasets and tools for post-training

...The repository aims to make datasets easy to inspect and transform, with scripts for downloading, deduping, cleaning, and converting to formats like JSONL that slot into training pipelines. It highlights instruction-tuning and conversation-style corpora while also pointing to code, math, or domain-specific sets for targeted capabilities. Quality is a recurring theme: examples and utilities help filter low-value samples, enforce length limits, and split train/validation consistently so results are comparable. Licensing and provenance are surfaced to encourage compliant usage and to guide dataset selection in commercial settings. ...
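The cleaning steps described above (deduplication, length limits, a consistent train/validation split, JSONL output) can be sketched with the Python standard library alone. This is an illustrative sketch, not code from the LLM Datasets repository: the function names `clean`, `split_of`, and `to_jsonl`, the `text` field, and the thresholds are all assumptions made for the example.

```python
import hashlib
import json


def clean(samples, max_chars=200):
    """Normalize whitespace, drop empty or over-long samples, and deduplicate.

    `max_chars` is an arbitrary illustrative limit, not a recommended value.
    """
    seen = set()
    kept = []
    for sample in samples:
        text = " ".join(sample["text"].split())  # collapse runs of whitespace
        if not text or len(text) > max_chars:
            continue
        key = text.lower()  # case-insensitive exact-match dedup
        if key in seen:
            continue
        seen.add(key)
        kept.append({"text": text})
    return kept


def split_of(text, val_percent=10):
    """Assign a split from a hash of the text, so the same sample always
    lands in the same split regardless of input order."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return "validation" if int(digest, 16) % 100 < val_percent else "train"


def to_jsonl(samples):
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(s, ensure_ascii=False) for s in samples)


raw = [
    {"text": "Hello   world"},
    {"text": "hello world"},        # duplicate after normalization
    {"text": ""},                   # empty
    {"text": "x" * 500},            # exceeds the length limit
    {"text": "Explain recursion."},
]
cleaned = clean(raw)
print(to_jsonl(cleaned))
```

Hashing the text instead of shuffling keeps the split stable when samples are added or reordered, which is one simple way to keep results comparable across runs.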

Downloads: 0 This Week

Last Update: 2026-04-29
See Project
8

Large Concept Model

Language modeling in a sentence representation space

...The repository provides training loops, data tooling, and evaluation routines to learn and probe these concept embeddings, typically from large image–text or weakly supervised corpora. It includes utilities to build concept vocabularies, map supervision signals to those vocabularies, and measure zero-shot or few-shot generalization. Probing tools help diagnose what the model knows—e.g., attribute recognition, relation understanding, or compositionality—so you can iterate on data and objectives. The design is modular, making it straightforward to swap backbones, change objectives, or integrate retrieval components.

Downloads: 0 This Week

Last Update: 2025-10-07
See Project
9

gensim

Topic Modelling for Humans

Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora. The target audience is the natural language processing (NLP) and information retrieval (IR) community.

Downloads: 3 This Week

Last Update: 2025-10-16
See Project
10

Memvid

Video-based AI memory library. Store millions of text chunks in MP4

Memvid encodes text chunks as QR codes within MP4 frames to build a portable “video memory” for AI systems. This innovative approach uses standard video containers and offers millisecond-level semantic search across large corpora with dramatically less storage than vector DBs. It's self-contained—no DB needed—and supports features like PDF indexing, chat integration, and cloud dashboards.

Downloads: 4 This Week

Last Update: 2026-05-27
See Project
11

Alexandrie

Web application for Markdown note taking

Alexandrie is a fast, modern, and open-source web application for taking and organizing notes using an extended Markdown syntax, designed for students, creators, and knowledge workers. It offers a structured note-taking experience with support for workspaces and categories, making it easy to organize large repositories of information intuitively. The application runs as a responsive web interface that works online or offline, with search and export features that help users retrieve and reuse...

Downloads: 2 This Week

Last Update: 2026-07-06
See Project
12

Style-Bert-VITS2

Style-Bert-VITS2: Bert-VITS2 with more controllable voice styles

Style-Bert-VITS2 is a text-to-speech system based on Bert-VITS2 that focuses on highly controllable voice styles and emotional expression. It takes the original Bert-VITS2 v2.1 and its Japanese-Extra variant and extends them so you can control emotion and speaking style with fine-grained intensity, not just choose a generic tone. The project targets both power users and beginners: Windows users without Git or Python can install and run it using bundled .bat scripts, while advanced users can...

Downloads: 9 This Week

Last Update: 2025-11-28
See Project
13

Auto-Deep-Research

Your Fully-Automated Personal AI Assistant

...Users provide a research topic or multifaceted goal, and the system autonomously breaks the objective down into subtasks like literature collection, critical summarization, cross-comparison, citation extraction, metric evaluation, and structured writing. Auto-Deep-Research integrates retrieval from academic and web sources, processes document corpora for relevance and key insights, and organizes outputs into coherent chapters or sections according to research standards. It also embeds validation loops, where intermediate drafts are self-checked for consistency, coverage, and alignment with sound reasoning practices, reducing reliance on raw generation alone.

Downloads: 0 This Week

Last Update: 2026-02-03
See Project
14

nanoGPT

The simplest, fastest repository for training/finetuning models

...The repo is organized with a training pipeline (dataset preprocessing, model definition, optimizer, training loop) and inference script so you can train a small GPT on text datasets like Shakespeare or custom corpora. It emphasizes readability and clarity: the training loop is cleanly written, and the code avoids heavy abstractions, letting students follow the architecture step by step. While simple, it can still train non-trivial models on modern GPUs and generate coherent text. The project has become widely used in tutorials, courses, and experiments for people learning how transformers work under the hood.

Downloads: 1 This Week

Last Update: 2025-11-12
See Project
15

Matcha-TTS

A fast TTS architecture with conditional flow matching

...The repository provides an end-to-end TTS pipeline: a PyTorch/Lightning training stack, configuration files, pre-trained checkpoints, a command-line interface, and a Gradio app for interactive testing. Users can train on standard datasets like LJSpeech or plug in their own corpora, with helper tools for computing dataset statistics, extracting phoneme durations, and running multi-GPU training.

Downloads: 3 This Week

Last Update: 2025-11-28
See Project
16

Engram

A New Axis of Sparsity for Large Language Models

...It provides utilities to generate embeddings from text or other structured data, index them using efficient approximate nearest neighbor algorithms, and perform real-time similarity queries even on large corpora. Engineered with speed and memory efficiency in mind, Engram supports batched indexing, incremental updates, and custom distance metrics so developers can tailor search behaviors to their domain’s needs. In addition to raw similarity search, the project includes tools for clustering, ranking, and filtering results, enabling richer user experiences like “related content”, semantic auto-completion, and contextual filtering.

Downloads: 0 This Week

Last Update: 2026-01-28
See Project
17

CutLER

Code release for Cut and Learn for Unsupervised Object Detection

...The codebase provides training and inference scripts, model configs, and references to benchmarking results that report large gains over prior unsupervised baselines. It’s intended for researchers exploring self-supervised and unsupervised recognition, offering a practical path to scale beyond costly labeled corpora. The README links papers and gives a high-level overview of components and expected outputs, with pointers to demos and assets. The repository is actively starred and structured as a typical research release with license, contribution guidelines, and security policy.

Downloads: 0 This Week

Last Update: 2025-10-09
See Project
18

Chinese-XLNet

Chinese XLNet pre-trained model

Chinese-XLNet is a Chinese language pre-trained model based on the XLNet architecture, providing an advanced foundation for natural language processing tasks in Mandarin and other Chinese dialects. Unlike traditional masked language modeling, XLNet uses a permutation language modeling objective that captures bidirectional context more effectively by training over all possible token orderings, yielding richer contextual representations. This model is trained on large-scale Chinese text...
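To make the permutation language modeling idea concrete, here is a toy sketch of which positions each token may condition on under one sampled factorization order. It is not code from Chinese-XLNet or XLNet: the function name `permutation_contexts` is hypothetical, and real training samples orderings and uses attention masks instead of enumerating contexts like this.

```python
import random


def permutation_contexts(tokens, seed=0):
    """Sample one factorization order and list, for each position, the
    positions that precede it in that order (its visible context)."""
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)  # one sampled ordering of the token positions
    contexts = {}
    for step, pos in enumerate(order):
        # A token is predicted from the positions earlier in the sampled
        # order, which can lie to its left or its right in the sentence.
        contexts[pos] = sorted(order[:step])
    return order, contexts


tokens = ["New", "York", "is", "a", "city"]
order, contexts = permutation_contexts(tokens)
for pos in order:
    visible = [tokens[i] for i in contexts[pos]]
    print(f"predict {tokens[pos]!r} at position {pos} given {visible}")
```

Across many sampled orders, every position ends up being predicted from context on both sides, which is the sense in which the objective captures bidirectional context.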

Downloads: 0 This Week

Last Update: 2026-04-19
See Project
19

Omnilingual ASR

Omnilingual ASR Open-Source Multilingual Speech Recognition

Omnilingual-ASR is a research codebase exploring automatic speech recognition that generalizes across a very large number of languages using shared modeling and training recipes. It focuses on leveraging self-supervised audio pretraining and scalable fine-tuning so low-resource languages can benefit from high-resource data. The project provides data preparation pipelines, training scripts, decoding utilities, and evaluation tools so researchers can reproduce results and extend to new...

Downloads: 0 This Week

Last Update: 2025-12-12
See Project
20

IMS Open Corpus Workbench

Indexing and query tools for very large text corpora

The IMS Open Corpus Workbench is a collection of tools for managing and querying large text corpora (100 M words and more) with linguistic annotations. Its central component is the flexible and efficient query processor CQP, which can be used interactively in a terminal session, as a backend e.g. from a Perl script, or through the Web-based GUI CQPweb.

Downloads: 22 This Week

Last Update: 2026-05-20
See Project
21

WavTokenizer

SOTA discrete acoustic codec models with 40/75 tokens per second

WavTokenizer is a state-of-the-art discrete acoustic codec designed specifically for audio language modeling, capable of compressing 24 kHz audio into just 40 or 75 tokens per second while preserving high perceptual quality. It is built to represent speech, music, and general audio with extremely low bitrate, making it ideal as a front-end for large audio language models like GPT-4o and similar architectures. The model uses a single-quantizer design together with temporal compression to...

Downloads: 0 This Week

Last Update: 2025-11-28
See Project
22

TAME LLM

Traditional Mandarin LLMs for Taiwan

TAME LLM is an open-source initiative focused on building and releasing large language models optimized for Traditional Mandarin and the linguistic context of Taiwan. The project includes models such as Llama-3-Taiwan-70B, which are fine-tuned versions of large transformer architectures trained on extensive corpora containing both Traditional Mandarin and English text. These models are designed to support applications such as conversational AI, knowledge retrieval, and domain-specific reasoning in fields like manufacturing, law, healthcare, and electronics. The training pipeline leverages high-performance computing infrastructure and frameworks such as NVIDIA NeMo and Megatron to enable large-scale model training. ...

Downloads: 0 This Week

Last Update: 2026-03-09
See Project
23

Chinese-LLaMA-Alpaca-3

Chinese Llama-3 LLMs developed from Meta Llama 3

Chinese-LLaMA-Alpaca-3 is an open-source project that provides Mandarin-focused large language models based on Meta’s LLaMA-3 architecture, with both foundational and instruction-tuned variants to support high-quality Chinese natural language understanding and generation. It extends the original LLaMA models with expanded Chinese vocabularies and additional pretraining on Chinese corpora to improve semantic encoding and decoding specifically for Chinese text. Alongside the base models, the project also releases Chinese Alpaca models that are fine-tuned on instruction datasets so they behave more like conversational and instruction-following AI assistants. It includes scripts and tooling that let researchers or developers run training, fine-tuning, quantization, and deployment on local machines (CPU or GPU), making experimentation and testing accessible without requiring large clusters.

Downloads: 0 This Week

Last Update: 2026-01-15
See Project
24

WhisperJAV

A subtitle generator for Japanese Adult Videos.

Transformer-based ASR architectures like Whisper suffer significant performance degradation when applied to the spontaneous and noisy domain of Japanese Adult Videos (JAV). This degradation is driven by specific acoustic and temporal characteristics that defy the statistical distributions of standard training data.

1 Review

Downloads: 39 This Week

Last Update: 2026-06-29
See Project
25

Tokenized Text Aligner

Aligns tokens in two versions of a text with differing tokenization.

This tool performs token-by-token alignment of two versions of a text with differing tokenization by interpreting the results of a file diff (https://docs.python.org/3/library/difflib.html). It is intended for use in the preparation of annotated linguistic corpora, where differences in tokenization may arise (i) following corrections or modifications to the source text or (ii) through the creation of different layers of annotation (part-of-speech, treebank) requiring different tokenization. In its default implementation, it produces a human-readable CSV table associating tokens in text A with tokens in text B, and can also inject token-level annotation from text B to text A. ...
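The diff-based alignment described above can be sketched with Python's `difflib.SequenceMatcher`. This is a minimal illustration of the general technique, not the tool's actual implementation: the function name `align_tokens` and the sample tokens are assumptions made for the example.

```python
from difflib import SequenceMatcher


def align_tokens(tokens_a, tokens_b):
    """Return a list of (tokens from A, tokens from B) pairs, one per
    aligned token or per differing span."""
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    rows = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical runs align one-to-one.
            for tok_a, tok_b in zip(tokens_a[a1:a2], tokens_b[b1:b2]):
                rows.append(([tok_a], [tok_b]))
        else:
            # replace / delete / insert: keep the differing span as one group.
            rows.append((tokens_a[a1:a2], tokens_b[b1:b2]))
    return rows


text_a = ["I", "do", "n't", "know", "."]   # treebank-style tokenization
text_b = ["I", "don't", "know", "."]       # whitespace-style tokenization
rows = align_tokens(text_a, text_b)
for group_a, group_b in rows:
    print(" ".join(group_a), "<->", " ".join(group_b))
```

Each row could be written out as one CSV line, and an annotation attached to a token in text B could then be copied to the group of text A tokens in the same row.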

Downloads: 0 This Week

Last Update: 2026-02-06
See Project