5 projects for "corpora" with 2 filters applied:

  • 1
    LLM Datasets

    Curated list of datasets and tools for post-training

    ...The repository aims to make datasets easy to inspect and transform, with scripts for downloading, deduping, cleaning, and converting to formats like JSONL that slot into training pipelines. It highlights instruction-tuning and conversation-style corpora while also pointing to code, math, or domain-specific sets for targeted capabilities. Quality is a recurring theme: examples and utilities help filter low-value samples, enforce length limits, and split train/validation consistently so results are comparable. Licensing and provenance are surfaced to encourage compliant usage and to guide dataset selection in commercial settings. ...
    Downloads: 3 This Week
  • 2
    Fuzzer Test Suite

    Set of tests for fuzzing engines

    The Fuzzer Test Suite is a collection of real-world, bug-rich targets used to evaluate and compare fuzzers under controlled conditions. Rather than synthetic micro-benchmarks, it packages build scripts, corpora, and known-crash oracles so fuzzer authors can measure time-to-crash, coverage growth, and stability. Each target is configured to integrate with common sanitizers, ensuring memory safety bugs surface with precise diagnostics. The suite standardizes experiment parameters—runtime, seeds, and environment—so results are reproducible and comparable across machines and research groups. ...
    Downloads: 0 This Week
  • 3
    Albedo

    A recommender system for discovering GitHub repos

    ...A reproducible setup and Makefile-driven workflow streamline tasks like spinning up services, loading datasets, training models, and generating candidate lists. Because it’s built around Spark’s scalable primitives, Albedo can experiment on substantial snapshots of GitHub metadata rather than toy corpora. The repo is also educational: it demonstrates a practical end-to-end pipeline from ingestion and feature preparation to training and ranking.
    Downloads: 0 This Week
  • 4
    Chinese Poetry

    The most comprehensive database of Chinese poetry

    This repository is a curated collection of Chinese poems and poets organized into catalogs, metadata, and text representations suitable for research, creative and cultural use. It includes major dynastic corpora, such as Tang and Song poems, as well as biographical and categorization data. Each poem entry is structured with fields like author, dynasty, title, content, and sometimes annotations or alternate versions. Developers and scholars can build tools that query by author, era, keyword, or poetic form using the standardized data structure. ...
    Downloads: 0 This Week
  • 5
    A programming language designed for searching and manipulating tree-structured data, particularly corpora of natural languages encoded in an s-expression-like format.
    Downloads: 0 This Week