corpora free download

LLM Datasets

Curated list of datasets and tools for post-training

...The repository aims to make datasets easy to inspect and transform, with scripts for downloading, deduping, cleaning, and converting to formats like JSONL that slot into training pipelines. It highlights instruction-tuning and conversation-style corpora while also pointing to code, math, or domain-specific sets for targeted capabilities. Quality is a recurring theme: examples and utilities help filter low-value samples, enforce length limits, and split train/validation consistently so results are comparable. Licensing and provenance are surfaced to encourage compliant usage and to guide dataset selection in commercial settings. ...
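The cleaning steps described above (dedupe, length limits, a consistent train/validation split, JSONL output) can be sketched in a few lines. This is only an illustration using the Python standard library, not the repository's own scripts: the function names, the `text` field, and the `max_chars` and `val_percent` parameters are all assumptions made for the example.

```python
import hashlib
import json


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return " ".join(text.lower().split())


def content_hash(text: str) -> str:
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def clean(records, max_chars=2000):
    """Drop empty, over-long, and duplicate samples (exact match after normalization).

    Assumes each record is a dict with a "text" field (hypothetical schema).
    """
    seen = set()
    for rec in records:
        text = rec.get("text", "").strip()
        if not text or len(text) > max_chars:
            continue
        key = content_hash(text)
        if key in seen:
            continue
        seen.add(key)
        yield {**rec, "text": text}


def split_of(text: str, val_percent: int = 10) -> str:
    """Pick the split from a content hash, so the assignment is stable across runs."""
    bucket = int(content_hash(text), 16) % 100
    return "validation" if bucket < val_percent else "train"


def to_jsonl(records) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Hashing the content rather than shuffling with a random seed keeps a sample in the same split even when the dataset is re-downloaded or reordered, which is one way to keep results comparable.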

Downloads: 3 This Week

Last Update: 2026-04-29


History LLMs

Information hub for our project training the largest possible LLMs

...This approach enables researchers in the humanities and social sciences to explore how people at different historical moments would have discussed world events, norms, and ideas without later developments influencing the model. It contains documentation about model families like Ranke-4B, which are trained from scratch with historical corpora and can act as “aggregate witnesses” to the textual culture of their era.

Downloads: 0 This Week

Last Update: 2026-01-29


Engram

A New Axis of Sparsity for Large Language Models

...It provides utilities to generate embeddings from text or other structured data, index them using efficient approximate nearest neighbor algorithms, and perform real-time similarity queries even on large corpora. Engineered with speed and memory efficiency in mind, Engram supports batched indexing, incremental updates, and custom distance metrics so developers can tailor search behaviors to their domain’s needs. In addition to raw similarity search, the project includes tools for clustering, ranking, and filtering results, enabling richer user experiences like “related content”, semantic auto-completion, and contextual filtering.

Downloads: 0 This Week

Last Update: 2026-01-28


TAME LLM

Traditional Mandarin LLMs for Taiwan

TAME LLM is an open-source initiative focused on building and releasing large language models optimized for Traditional Mandarin and the linguistic context of Taiwan. The project includes models such as Llama-3-Taiwan-70B, which are fine-tuned versions of large transformer architectures trained on extensive corpora containing both Traditional Mandarin and English text. These models are designed to support applications such as conversational AI, knowledge retrieval, and domain-specific reasoning in fields like manufacturing, law, healthcare, and electronics. The training pipeline leverages high-performance computing infrastructure and frameworks such as NVIDIA NeMo and Megatron to enable large-scale model training. ...

Downloads: 0 This Week

Last Update: 2026-03-09


Chinese-LLaMA-Alpaca-3

Chinese Llama-3 LLMs developed from Meta Llama 3

Chinese-LLaMA-Alpaca-3 is an open-source project that provides Mandarin-focused large language models based on Meta’s LLaMA-3 architecture, with both foundational and instruction-tuned variants to support high-quality Chinese natural language understanding and generation. It extends the original LLaMA models with expanded Chinese vocabularies and additional pretraining on Chinese corpora to improve semantic encoding and decoding specifically for Chinese text. Alongside the base models, the project also releases Chinese Alpaca models that are fine-tuned on instruction datasets so they behave more like conversational and instruction-following AI assistants. It includes scripts and tooling that let researchers or developers run training, fine-tuning, quantization, and deployment on local machines (CPU or GPU), making experimentation and testing accessible without requiring large clusters.

Downloads: 0 This Week

Last Update: 2026-01-15


LLMDataHub

Quick guide (especially) for trending instruction finetuning datasets

...The repository focuses particularly on datasets useful for chatbot training, instruction-following tasks, and alignment training scenarios. By organizing these resources into a curated hub, the project helps researchers and developers identify the most relevant training corpora for building conversational AI systems. The repository also highlights datasets suitable for reinforcement learning from human feedback and other alignment strategies used in modern language model training.

Downloads: 0 This Week

Last Update: 2026-03-05


FastEdit

Editing large language models within 10 seconds

...It implements practical editing algorithms that insert or revise knowledge with targeted parameter updates, aiming to preserve model quality outside the edited scope. This approach is valuable when you need urgent corrections—think product names, APIs, or fast-changing facts—without retraining on large corpora. The repository provides evaluation harnesses so you can measure locality (does the change stay contained?) and generalization (does the change apply where it should?). It’s structured for repeatable experiments, making side-by-side comparisons of editing methods and hyperparameters straightforward. For applied teams, FastEdit offers a toolbox to keep models current and compliant while minimizing collateral damage to overall performance.
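The two measures named above can be made concrete with a small sketch. It is not FastEdit's evaluation harness: the `edit_metrics` function, its arguments, and the exact-match scoring are assumptions chosen for illustration, with `predict` standing in for any callable that maps a prompt to the edited model's answer.

```python
def edit_metrics(predict, paraphrases, target, unrelated, baseline):
    """Score an edit on generalization and locality using exact-match answers.

    predict:     callable, prompt -> answer from the edited model (hypothetical interface)
    paraphrases: rewordings of the edited prompt that should all return `target`
    unrelated:   prompts outside the edit scope
    baseline:    dict mapping each unrelated prompt to the pre-edit model's answer
    """
    if not paraphrases or not unrelated:
        raise ValueError("need at least one paraphrase and one unrelated prompt")

    # Generalization: does the change apply where it should?
    hits = sum(predict(p) == target for p in paraphrases)
    # Locality: does the change stay contained, leaving other answers as they were?
    kept = sum(predict(p) == baseline[p] for p in unrelated)

    return {
        "generalization": hits / len(paraphrases),
        "locality": kept / len(unrelated),
    }
```

A toy run with a dictionary lookup in place of a model: if both paraphrases return the new answer but one of two unrelated prompts now answers differently than before, the result is a generalization of 1.0 and a locality of 0.5.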

Downloads: 0 This Week

Last Update: 2025-11-10


Search Results for "corpora"

Showing 7 open source projects for "corpora"

