Showing 122 open source projects for "corpora"

View related business solutions
  • Our Free Plans just got better! | Auth0 Icon
    Our Free Plans just got better! | Auth0

    With up to 25k MAUs and unlimited Okta connections, our Free Plan lets you focus on what you do best—building great apps.

    You asked, we delivered! Auth0 is excited to expand our Free and Paid plans to include more options so you can focus on building, deploying, and scaling applications without having to worry about your security. Auth0 now, thank yourself later.
    Try free now
  • Ship Agents Faster Icon
    Ship Agents Faster

    Transform your applications and workflows into powerful agentic systems at global scale.

    Gemini Enterprise Agent Platform lets you rapidly build, scale, govern and optimize production-ready agents grounded in your organization's data. The platform enables developers to build custom or pre-built agents for virtually any use case. New customers get $300 in free credits.
    Get Started Free
  • 1
    UltraRAG

    UltraRAG

    Less Code, Lower Barrier, Faster Deployment

    UltraRAG 2.0 is a low-code, MCP-enabled RAG framework that aims to lower the barrier to building complex retrieval pipelines for research and production. It provides end-to-end recipes—from encoding and indexing corpora to deploying retrievers and LLMs—so users can reproduce baselines and iterate rapidly. The toolkit comes with built-in support for popular RAG datasets, large corpora, and canonical baselines, plus documentation that walks from “quick start” to debugging and case analysis. It encourages pipeline composition via configuration, enabling researchers to swap retrievers, rerankers, and generators without heavy refactoring. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 2
    LLM Datasets

    LLM Datasets

    Curated list of datasets and tools for post-training

    ...The repository aims to make datasets easy to inspect and transform, with scripts for downloading, deduping, cleaning, and converting to formats like JSONL that slot into training pipelines. It highlights instruction-tuning and conversation-style corpora while also pointing to code, math, or domain-specific sets for targeted capabilities. Quality is a recurring theme: examples and utilities help filter low-value samples, enforce length limits, and split train/validation consistently so results are comparable. Licensing and provenance are surfaced to encourage compliant usage and to guide dataset selection in commercial settings. ...
    Downloads: 3 This Week
    Last Update:
    See Project
  • 3
    Natural Language Toolkit
    ...NLTK was originally developed to support research and teaching in computational linguistics and artificial intelligence, and it has become one of the most influential educational platforms for learning NLP in Python. The project also includes access to numerous linguistic corpora and lexical resources that can be downloaded and used directly in experiments and applications.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 4
    History LLMs

    History LLMs

    Information hub for our project training the largest possible LLMs

    ...This approach enables researchers in the humanities and social sciences to explore how people at different historical moments would have discussed world events, norms, and ideas without later developments influencing the model. It contains documentation about model families like Ranke-4B, which are trained from scratch with historical corpora and can act as “aggregate witnesses” to the textual culture of their era.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Full-stack observability with actually useful AI | Grafana Cloud Icon
    Full-stack observability with actually useful AI | Grafana Cloud

    Our generous forever free tier includes the full platform, including the AI Assistant, for 3 users with 10k metrics, 50GB logs, and 50GB traces.

    Built on open standards like Prometheus and OpenTelemetry, Grafana Cloud includes Kubernetes Monitoring, Application Observability, Incident Response, plus the AI-powered Grafana Assistant. Get started with our generous free tier today.
    Create free account
  • 5
    SciSpaCy

    SciSpaCy

    A full spaCy pipeline and models for scientific/biomedical documents

    ScispaCy is a spaCy extension optimized for processing biomedical and scientific text, providing domain-specific NLP models for tasks like named entity recognition (NER) and dependency parsing.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 6
    Large Concept Model

    Large Concept Model

    Language modeling in a sentence representation space

    ...The repository provides training loops, data tooling, and evaluation routines to learn and probe these concept embeddings, typically from large image–text or weakly supervised corpora. It includes utilities to build concept vocabularies, map supervision signals to those vocabularies, and measure zero-shot or few-shot generalization. Probing tools help diagnose what the model knows—e.g., attribute recognition, relation understanding, or compositionality—so you can iterate on data and objectives. The design is modular, making it straightforward to swap backbones, change objectives, or integrate retrieval components.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 7
    Memvid

    Memvid

    Video-based AI memory library. Store millions of text chunks in MP4

    Memvid encodes text chunks as QR codes within MP4 frames to build a portable “video memory” for AI systems. This innovative approach uses standard video containers and offers millisecond-level semantic search across large corpora with dramatically less storage than vector DBs. It's self-contained—no DB needed—and supports features like PDF indexing, chat integration, and cloud dashboards.
    Downloads: 9 This Week
    Last Update:
    See Project
  • 8
    shuyuan

    shuyuan

    Reading book source

    ...The name suggests “academy” or “study hall,” and the tool aims to help users ingest, organize, and manage reading content — possibly offering features like text parsing, annotation, metadata generation, translation, or storage for later reference. The repository is set up to support document ingestion, indexing, and maybe some AI-aided summarization or lookup functions, which helps users convert large text corpora into a structured, searchable knowledge base. For learners, researchers, or avid readers, Shuyuan offers a way to bridge from plain text files or eBooks into a manageable, interactive resource — one where notes, references, and reading progress can be tracked. It likely supports different input formats (text, HTML, PDF), and may integrate optional translation or text normalization tools.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 9
    Classical Language Toolkit (CLTK)

    Classical Language Toolkit (CLTK)

    The Classical Language Toolkit

    The Classical Language Toolkit (CLTK) is a Python library offering natural language processing support for classical languages, including Latin, Greek, and others.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Secure File Transfer for Windows with Cerberus by Redwood Icon
    Secure File Transfer for Windows with Cerberus by Redwood

    Protect and share files over FTP/S, SFTP, HTTPS and SCP with the #1 rated Windows file transfer server.

    Cerberus supports unlimited users and connections on a single IP, with built-in encryption, 2FA, and a browser-based web client — all deployable in under 15 minutes with a 25-day free trial.
    Try for Free
  • 10
    gensim

    gensim

    Topic Modelling for Humans

    Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora. The target audience is the natural language processing (NLP) and information retrieval (IR) community.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 11
    Matcha-TTS

    Matcha-TTS

    A fast TTS architecture with conditional flow matching

    ...The repository provides an end-to-end TTS pipeline: a PyTorch/Lightning training stack, configuration files, pre-trained checkpoints, a command-line interface, and a Gradio app for interactive testing. Users can train on standard datasets like LJSpeech or plug in their own corpora, with helper tools for computing dataset statistics, extracting phoneme durations, and running multi-GPU training.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 12
    nanoGPT

    nanoGPT

    The simplest, fastest repository for training/finetuning models

    ...The repo is organized with a training pipeline (dataset preprocessing, model definition, optimizer, training loop) and inference script so you can train a small GPT on text datasets like Shakespeare or custom corpora. It emphasizes readability and clarity: the training loop is cleanly written, and the code avoids heavy abstractions, letting students follow the architecture step by step. While simple, it can still train non-trivial models on modern GPUs and generate coherent text. The project has become widely used in tutorials, courses, and experiments for people learning how transformers work under the hood.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 13
    Auto-Deep-Research

    Auto-Deep-Research

    Your Fully-Automated Personal AI Assistant

    ...Users provide a research topic or multifaceted goal, and the system autonomously breaks the objective down into subtasks like literature collection, critical summarization, cross-comparison, citation extraction, metric evaluation, and structured writing. Auto-Deep-Research integrates retrieval from academic and web sources, processes document corpora for relevance and key insights, and organizes outputs into coherent chapters or sections according to research standards. It also embeds validation loops, where intermediate drafts are self-checked for consistency, coverage, and alignment with sound reasoning practices, reducing reliance on raw generation alone.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 14
    Alexandrie

    Alexandrie

    Web application for Markdown note taking

    Alexandrie is a fast, modern, and open-source web application for taking and organizing notes using an extended Markdown syntax, designed for students, creators, and knowledge workers. It offers a structured note-taking experience with support for workspaces and categories, making it easy to organize large repositories of information intuitively. The application runs as a responsive web interface that works online or offline, with search and export features that help users retrieve and reuse...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    Engram

    Engram

    A New Axis of Sparsity for Large Language Models

    ...It provides utilities to generate embeddings from text or other structured data, index them using efficient approximate nearest neighbor algorithms, and perform real-time similarity queries even on large corpora. Engineered with speed and memory efficiency in mind, Engram supports batched indexing, incremental updates, and custom distance metrics so developers can tailor search behaviors to their domain’s needs. In addition to raw similarity search, the project includes tools for clustering, ranking, and filtering results, enabling richer user experiences like “related content”, semantic auto-completion, and contextual filtering.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 16
    CutLER

    CutLER

    Code release for Cut and Learn for Unsupervised Object Detection

    ...The codebase provides training and inference scripts, model configs, and references to benchmarking results that report large gains over prior unsupervised baselines. It’s intended for researchers exploring self-supervised and unsupervised recognition, offering a practical path to scale beyond costly labeled corpora. The README links papers and gives a high-level overview of components and expected outputs, with pointers to demos and assets. The repository is actively starred and structured as a typical research release with license, contribution guidelines, and security policy.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    IMS Open Corpus Workbench

    IMS Open Corpus Workbench

    Indexing and query tools for very large text corpora

    The IMS Open Corpus Workbench is a collection of tools for managing and querying large text corpora (100 M words and more) with linguistic annotations. Its central component is the flexible and efficient query processor CQP, which can be used interactively in a terminal session, as a backend e.g. from a Perl script, or through the Web-based GUI CQPweb.
    Leader badge
    Downloads: 29 This Week
    Last Update:
    See Project
  • 18
    TAME LLM

    TAME LLM

    Traditional Mandarin LLMs for Taiwan

    TAME LLM is an open-source initiative focused on building and releasing large language models optimized for Traditional Mandarin and the linguistic context of Taiwan. The project includes models such as Llama-3-Taiwan-70B, which are fine-tuned versions of large transformer architectures trained on extensive corpora containing both Traditional Mandarin and English text. These models are designed to support applications such as conversational AI, knowledge retrieval, and domain-specific reasoning in fields like manufacturing, law, healthcare, and electronics. The training pipeline leverages high-performance computing infrastructure and frameworks such as NVIDIA NeMo and Megatron to enable large-scale model training. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19
    Chinese-LLaMA-Alpaca-3

    Chinese-LLaMA-Alpaca-3

    Chinese Llama-3 LLMs) developed from Meta Llama 3

    Chinese-LLaMA-Alpaca-3 is an open-source project that provides Mandarin-focused large language models based on Meta’s LLaMA-3 architecture, with both foundational and instruction-tuned variants to support high-quality Chinese natural language understanding and generation. It extends the original LLaMA models with expanded Chinese vocabularies and additional pretraining on Chinese corpora to improve semantic encoding and decoding specifically for Chinese text. Alongside the base models, the project also releases Chinese Alpaca models that are fine-tuned on instruction datasets so they behave more like conversational and instruction-following AI assistants. It includes scripts and tooling that let researchers or developers run training, fine-tuning, quantization, and deployment on local machines (CPU or GPU), making experimentation and testing accessible without requiring large clusters.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 20
    Chinese-XLNet

    Chinese-XLNet

    Chinese XLNet pre-trained model

    Chinese-XLNet is a Chinese language pre-trained model based on the XLNet architecture, providing an advanced foundation for natural language processing tasks in Mandarin and other Chinese dialects. Unlike traditional masked language modeling, XLNet uses a permutation language modeling objective that captures bidirectional context more effectively by training over all possible token orderings, yielding richer contextual representations. This model is trained on large-scale Chinese text...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    Omnilingual ASR

    Omnilingual ASR

    Omnilingual ASR Open-Source Multilingual SpeechRecognition

    Omnilingual-ASR is a research codebase exploring automatic speech recognition that generalizes across a very large number of languages using shared modeling and training recipes. It focuses on leveraging self-supervised audio pretraining and scalable fine-tuning so low-resource languages can benefit from high-resource data. The project provides data preparation pipelines, training scripts, decoding utilities, and evaluation tools so researchers can reproduce results and extend to new...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 22
    WavTokenizer

    WavTokenizer

    SOTA discrete acoustic codec models with 40/75 tokens per second

    WavTokenizer is a state-of-the-art discrete acoustic codec designed specifically for audio language modeling, capable of compressing 24 kHz audio into just 40 or 75 tokens per second while preserving high perceptual quality. It is built to represent speech, music, and general audio with extremely low bitrate, making it ideal as a front-end for large audio language models like GPT-4o and similar architectures. The model uses a single-quantizer design together with temporal compression to...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 23
    Style-Bert-VITS2

    Style-Bert-VITS2

    Style-Bert-VITS2: Bert-VITS2 with more controllable voice styles

    Style-Bert-VITS2 is a text-to-speech system based on Bert-VITS2 that focuses on highly controllable voice styles and emotional expression. It takes the original Bert-VITS2 v2.1 and its Japanese-Extra variant and extends them so you can control emotion and speaking style with fine-grained intensity, not just choose a generic tone. The project targets both power users and beginners: Windows users without Git or Python can install and run it using bundled .bat scripts, while advanced users can...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 24

    WhisperJAV

    A subtitle generator for Japanese Adult Videos.

    A subtitle generator for Japanese Adult Videos. Transformer-based ASR architectures like Whisper suffer significant performance degradation when applied to the spontaneous and noisy domain of JAV. This degradation is driven by specific acoustic and temporal characteristics that defy the statistical distributions of standard training data.
    Leader badge
    Downloads: 43 This Week
    Last Update:
    See Project
  • 25

    Tokenized Text Aligner

    Aligns tokens in two versions of a text with differing tokenization.

    This tool performs token-by-token alignment of two versions of a text with differing tokenization by interpreting the results of a file diff (https://docs.python.org/3/library/difflib.html). It is intended for use in the preparation of annotated linguistic corpora, where differences in tokenization may arise (i) following corrections or modifications to the source text or (ii) through the creation of different layers of annotation (part-of-speech, treebank) requiring different tokenization. In its default implementation, it produces a human-readable CSV table associating tokens in text A with tokens in text B, and can also inject token-level annotation from text B to text A. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • 3
  • 4
  • 5
  • Next
MongoDB Logo MongoDB