Showing 10 open source projects for "corpora"

  • 1
    LLM Datasets

    Curated list of datasets and tools for post-training

    ...The repository aims to make datasets easy to inspect and transform, with scripts for downloading, deduping, cleaning, and converting to formats like JSONL that slot into training pipelines. It highlights instruction-tuning and conversation-style corpora while also pointing to code, math, or domain-specific sets for targeted capabilities. Quality is a recurring theme: examples and utilities help filter low-value samples, enforce length limits, and split train/validation consistently so results are comparable. Licensing and provenance are surfaced to encourage compliant usage and to guide dataset selection in commercial settings. ...
    Downloads: 3 This Week
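The cleaning steps named in the LLM Datasets description (deduping, length limits, consistent train/validation splits, JSONL output) could be sketched roughly as follows. This is only an illustration, not the repository's own scripts: the `instruction` and `response` field names, the function names, and the default limits are all assumptions made for the example.

```python
import hashlib
import json


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return " ".join(text.lower().split())


def sample_text(sample: dict) -> str:
    # Hypothetical schema: each sample has "instruction" and "response" fields.
    return sample["instruction"] + "\n" + sample["response"]


def dedupe_and_filter(samples, max_chars=2000, min_chars=1):
    """Drop samples outside the length limits and duplicates after normalization."""
    seen = set()
    kept = []
    for sample in samples:
        text = sample_text(sample)
        if not (min_chars <= len(text) <= max_chars):
            continue
        key = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept.append(sample)
    return kept


def split_of(sample: dict, val_percent: int = 10) -> str:
    """Assign a split from a content hash, so the assignment is stable across runs."""
    digest = hashlib.sha256(normalize(sample_text(sample)).encode("utf-8")).hexdigest()
    return "validation" if int(digest, 16) % 100 < val_percent else "train"


def to_jsonl(samples) -> str:
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(s, ensure_ascii=False) for s in samples)
```

Hashing the content rather than shuffling with a random seed is one way to keep the split consistent when the dataset is regenerated; other approaches would work equally well.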
  • 2
    Fuzzer Test Suite

    Set of tests for fuzzing engines

    The Fuzzer Test Suite is a collection of real-world, bug-rich targets used to evaluate and compare fuzzers under controlled conditions. Rather than synthetic micro-benchmarks, it packages build scripts, corpora, and known-crash oracles so fuzzer authors can measure time-to-crash, coverage growth, and stability. Each target is configured to integrate with common sanitizers, ensuring memory safety bugs surface with precise diagnostics. The suite standardizes experiment parameters—runtime, seeds, and environment—so results are reproducible and comparable across machines and research groups. ...
    Downloads: 0 This Week
  • 3
    PyTorch SimCLR

    PyTorch implementation of SimCLR: A Simple Framework

    ...Nowadays, pre-trained Deep Convolutional Neural Networks (DCNNs) are the go-to starting point for learning a new task. These large models are trained on huge supervised corpora, such as ImageNet. Most importantly, their features are known to adapt well to new problems. This is particularly useful when annotated training data is scarce. In such situations, we take the model's pre-trained weights, append a new classifier layer on top, and retrain the network. This is called transfer learning, and it is one of the most widely used techniques in computer vision. ...
    Downloads: 0 This Week
  • 4
    Albedo

    A recommender system for discovering GitHub repos

    ...A reproducible setup and Makefile-driven workflow streamline tasks like spinning up services, loading datasets, training models, and generating candidate lists. Because it’s built around Spark’s scalable primitives, Albedo can experiment on substantial snapshots of GitHub metadata rather than toy corpora. The repo is also educational: it demonstrates a practical end-to-end pipeline from ingestion and feature preparation to training and ranking.
    Downloads: 0 This Week
  • 5
    concordia

    Powerful search library, best suited for computer-aided translation

    ...This project now contains a fully functional Concordia search library. In the near future, it will be extended with concordia-server: a lightweight, robust web server providing corpus search functionality.
    Downloads: 0 This Week
  • 6
    Chinese Poetry

    The most comprehensive database of Chinese poetry

    This repository is a curated collection of Chinese poems and poets organized into catalogs, metadata, and text representations suitable for research, creative and cultural use. It includes major dynastic corpora, such as Tang and Song poems, as well as biographical and categorization data. Each poem entry is structured with fields like author, dynasty, title, content, and sometimes annotations or alternate versions. Developers and scholars can build tools that query by author, era, keyword, or poetic form using the standardized data structure. ...
    Downloads: 0 This Week
  • 7
    Question Answering Corpus

    Question answering dataset in "Teaching Machines to Read & Comprehend"

    RC-Data is a dataset generation framework created by Google DeepMind to produce large-scale reading comprehension question-answer pairs from CNN and Daily Mail news articles. The dataset, introduced in the 2015 paper “Teaching Machines to Read and Comprehend” (Hermann et al., NIPS 2015), was among the first large corpora designed to train and evaluate machine reading and comprehension models. The repository provides scripts for downloading archived CNN and Daily Mail articles from the Wayback Machine and automatically generating cloze-style questions where entities in the text are replaced with placeholders. Each data instance consists of a news article (context), a generated question, and its corresponding answer, making it suitable for supervised machine learning setups. ...
    Downloads: 0 This Week
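The cloze-style construction described for the Question Answering Corpus (entities replaced with placeholders, yielding a context, a question, and an answer) might look roughly like this. It is a minimal sketch under stated assumptions, not the RC-Data scripts themselves: the `@entityN` and `@placeholder` marker strings, the function names, and the idea of passing in a ready-made entity list are choices made for this example, and entity detection is left out entirely.

```python
import re


def anonymize(text: str, entity_ids: dict) -> str:
    """Replace each known entity string with its marker, longest names first."""
    for name in sorted(entity_ids, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(name) + r"\b", entity_ids[name], text)
    return text


def make_cloze(article: str, statement: str, entities: list, answer: str) -> dict:
    """Build one (context, question, answer) instance from a statement about the article."""
    # Hypothetical marker scheme: one numbered marker per entity.
    entity_ids = {name: f"@entity{i}" for i, name in enumerate(entities)}
    context = anonymize(article, entity_ids)
    question = anonymize(statement, entity_ids)
    # Blank out the first mention of the answer entity; the trailing \b keeps
    # "@entity1" from matching the prefix of "@entity10".
    marker = entity_ids[answer]
    question = re.sub(re.escape(marker) + r"\b", "@placeholder", question, count=1)
    return {"context": context, "question": question, "answer": marker}


example = make_cloze(
    article="Alice Smith met Bob in Paris. Bob later praised Alice Smith.",
    statement="Bob praised Alice Smith after the Paris meeting",
    entities=["Alice Smith", "Bob", "Paris"],
    answer="Bob",
)
```

Replacing names with opaque markers in both the context and the question means a model has to find the answer from the passage rather than from prior knowledge of the names.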
  • 8
    mwetoolkit

    THIS PROJECT MIGRATED TO https://gitlab.com/mwetoolkit/mwetoolkit3/

    The Multiword Expressions toolkit aids in the automatic identification and extraction of multiword units in running text. These include idioms (kick the bucket), noun compounds (cable car), phrasal verbs (take off, give up), etc. Although it focuses on multiword expressions, the framework is fairly complete and can also be useful in any corpus-based study in computational linguistics. The mwetoolkit can be...
    Downloads: 1 This Week
  • 9
    TextBlob

    TextBlob is a Python library for processing textual data

    ...Also, it comes with a WordNet integration. If you only intend to use TextBlob’s default models (no model overrides), you can pass the lite argument. This downloads only those corpora needed for basic functionality. TextBlob is also available as a conda package.
    Downloads: 0 This Week
  • 10
    A programming language designed for searching and manipulating tree-structured data, particularly corpora of natural languages encoded in an s-expression-like format.
    Downloads: 0 This Week