Page 2 | corpora free download

Showing 105 open source projects for "corpora"

1

Detic

Code release for "Detecting Twenty-thousand Classes

...It decouples localization from classification, training a strong box localizer on standard detection data while learning classifiers from weak supervision and large image-tag corpora. A shared region proposal backbone feeds a flexible classification head that can expand to tens of thousands of categories without exhaustive box annotations. The system supports zero- or few-shot extension to novel categories via semantic embeddings and class name supervision, making “open-world” detection practical. Built on Detectron2, the repo includes configs, pretrained weights, and conversion tools to mix fully and weakly supervised sources. ...

Downloads: 2 This Week

Last Update: 2025-10-07
See Project
2

TXM

Unicode XML TEI text analysis platform

TXM is a free and open-source cross-platform Unicode & XML based text analysis environment and graphical client, supporting Windows, Linux and Mac OS X. It can also be used online as a J2EE standard compliant web portal (GWT based) with access control built in. Download the latest version of TXM: http://textometrie.ens-lyon.fr/spip.php?rubrique61&lang=en TXM offers a comprehensive range of analysis tools (concordances, collocate search, frequency lists, etc.) based on the powerful CQP...

Downloads: 6 This Week

Last Update: 2024-12-09
See Project
3

FastEdit

Editing large language models within 10 seconds

...It implements practical editing algorithms that insert or revise knowledge with targeted parameter updates, aiming to preserve model quality outside the edited scope. This approach is valuable when you need urgent corrections—think product names, APIs, or fast-changing facts—without retraining on large corpora. The repository provides evaluation harnesses so you can measure locality (does the change stay contained?) and generalization (does the change apply where it should?). It’s structured for repeatable experiments, making side-by-side comparisons of editing methods and hyperparameters straightforward. For applied teams, FastEdit offers a toolbox to keep models current and compliant while minimizing collateral damage to overall performance.

Downloads: 0 This Week

Last Update: 2025-11-10
See Project
4

Metaseq

Repo for external large-scale work

...The framework was used internally at Meta to train models like OPT (Open Pre-trained Transformer) and serves as a reference implementation for scaling transformer architectures efficiently across GPUs and nodes. It supports both pretraining and fine-tuning workflows with data pipelines for text, multilingual corpora, and custom tokenization schemes. Metaseq also includes APIs for evaluation, generation, and model serving, enabling seamless transitions from training to inference.

Downloads: 0 This Week

Last Update: 2025-10-06
See Project
5

Paul Graham GPT

RAG on Paul Graham's essays

Paul Graham GPT is a specialized AI-powered search and chat app built on a corpus of essays from Paul Graham, giving users the ability to query and discuss his writings in a conversational way. The repo stores the full text of his essays (chunked), uses embeddings (e.g. via OpenAI embeddings) to allow semantic search over that corpus, and hosts a chat interface that combines retrieval results with LLM-based answering — enabling RAG (retrieval-augmented generation) over a fixed dataset. The...

Downloads: 0 This Week

Last Update: 2025-12-08
See Project
6

Open Speech Corpora

A list of accessible speech corpora for ASR, TTS

Open Speech Corpora is a curated catalog of speech datasets intended to support research and development in automatic speech recognition, text-to-speech, and other speech technologies. The repository is organized as a set of tables that list corpora along with their languages, total hours, number of speakers, download links, and licenses, giving practitioners a quick way to find data that matches their needs.

Downloads: 0 This Week

Last Update: 2025-11-28
See Project
7

JoBimText

Linking Language to Knowledge with Distributional Semantics

JoBimText is a software solution for automatic text expansion using contextualized distributional similarity. It provides text analysis tools for large corpora and has capabilities to create distributional semantic models (JoBimText models) and multi-word expressions.

Downloads: 0 This Week

Last Update: 2022-08-04
See Project
8

Linguistic Analyzer

The Linguistic Analyzer is a tool for corpus analysis and comparison

The Linguistic Analyzer (Almuhalil Alloghawy) is a free tool designed by a team from Al-Imam Muhammad bin Saud Islamic University that can be used for corpus analysis and comparison in terms of several linguistic characteristics, such as frequency list generation, concordances, collocation extraction, the difference between two words, and keyword identification.

Downloads: 0 This Week

Last Update: 2022-04-16
See Project
9

GiantMIDI-Piano

Classical piano MIDI dataset

...Because the dataset is machine-generated via an automated transcription pipeline, it offers consistency, scale, and accessibility that would be difficult to achieve manually — enabling researchers to work with large corpora of piano music without copyright restrictions on symbolic data.

Downloads: 2 This Week

Last Update: 2025-12-02
See Project
10

Queries-for-Arabic-OSAC-Corpus

...The corpus is created from the OSAC corpus of journalistic texts consisting of 4763 articles recovered from the Arabic BBC News. https://sourceforge.net/projects/ar-text-mining/files/Arabic-Corpora/

Downloads: 0 This Week

Last Update: 2021-12-03
See Project
11

Fuzzer Test Suite

Set of tests for fuzzing engines

The Fuzzer Test Suite is a collection of real-world, bug-rich targets used to evaluate and compare fuzzers under controlled conditions. Rather than synthetic micro-benchmarks, it packages build scripts, corpora, and known-crash oracles so fuzzer authors can measure time-to-crash, coverage growth, and stability. Each target is configured to integrate with common sanitizers, ensuring memory safety bugs surface with precise diagnostics. The suite standardizes experiment parameters—runtime, seeds, and environment—so results are reproducible and comparable across machines and research groups. ...

Downloads: 0 This Week

Last Update: 2025-10-10
See Project
12

DrQA

Reading Wikipedia to Answer Open-Domain Questions

DrQA is an open-domain question answering system that reads large text corpora—famously Wikipedia—to answer natural language questions with extractive spans. It follows a two-stage pipeline: a fast document retriever first narrows down candidate articles, and a neural machine reader then predicts the exact answer span from those passages. The retriever relies on classic IR features (like TF-IDF and n-gram statistics) to remain lightweight and scalable to millions of documents. ...

Downloads: 0 This Week

Last Update: 2025-10-07
See Project
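
The retrieve-then-read idea in the DrQA description can be illustrated with a small TF-IDF retriever. This is only a sketch under stated assumptions: it is not DrQA's code or API, it uses plain unigram TF-IDF with cosine similarity rather than whatever n-gram features the project uses, and every function name here (`tokenize`, `build_index`, `retrieve`) is hypothetical. The reader stage is omitted.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def build_index(docs):
    """Return one TF-IDF vector (dict) per document plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF so that no weight is zero or negative.
    idf = {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, vectors, idf, k=1):
    """Return the indices of the k documents most similar to the query."""
    q_tf = Counter(tokenize(query))
    # Terms never seen in the collection carry no weight.
    q_vec = {t: c * idf[t] for t, c in q_tf.items() if t in idf}
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(q_vec, vectors[i]),
                    reverse=True)
    return ranked[:k]


docs = [
    "paris is the capital of france",
    "berlin is the capital of germany",
    "the cat sat on the mat",
]
vectors, idf = build_index(docs)
top = retrieve("capital of germany", vectors, idf, k=1)
```

In a two-stage pipeline, the passages selected this way would then be handed to a separate reader model that extracts the answer span.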
13

XLM (Cross-lingual Language Model)

PyTorch original implementation of Cross-lingual Language Model

...The repository provides preprocessing pipelines, training code, and fine-tuning scripts so you can reproduce benchmark results or adapt models to your own multilingual corpora. Pretrained checkpoints cover dozens of languages and multiple model sizes, balancing quality and compute needs.

Downloads: 0 This Week

Last Update: 2025-10-07
See Project
14

PyTorch SimCLR

PyTorch implementation of SimCLR: A Simple Framework

...Nowadays, pre-trained Deep Convolutional Neural Networks (DCNNs) are the go-to starting point for learning a new task. These large models are trained on huge supervised corpora, like ImageNet. Most importantly, their features are known to adapt well to new problems. This is particularly interesting when annotated training data is scarce. In situations like this, we take the models’ pre-trained weights, append a new classifier layer on top of them, and retrain the network. This is called transfer learning, and is one of the most used techniques in CV. ...

Downloads: 0 This Week

Last Update: 2022-08-15
See Project
15

CC-Net

Tools to download and cleanup Common Crawl data

cc_net provides tools to download, segment, clean, and filter Common Crawl to build large-scale text corpora, including monolingual datasets and the multilingual CC-100 collection introduced in the associated paper. It includes pipelines to fetch snapshots, extract text, de-duplicate, identify language, and apply quality filtering based on heuristics and language models. The outputs are intended for pretraining language models and for creating standardized corpora that can be reproduced or updated with new crawls. ...

Downloads: 0 This Week

Last Update: 2025-10-11
See Project
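
The de-duplication step mentioned in the CC-Net description can be sketched as hash-based line filtering. This is an illustration only, not cc_net's implementation: the normalization rule, the choice of a truncated SHA-1 digest, and the names `normalize` and `dedup_lines` are all assumptions made for this example.

```python
import hashlib
import re


def normalize(line):
    """Lowercase and collapse whitespace so near-identical lines collide."""
    return re.sub(r"\s+", " ", line.strip().lower())


def dedup_lines(docs):
    """Drop every line whose normalized form was already seen in any document.

    Only a short hash of each line is stored, which keeps memory use low
    when the collection is large.
    """
    seen = set()
    cleaned = []
    for doc in docs:
        kept = []
        for line in doc.splitlines():
            norm = normalize(line)
            if not norm:
                continue  # skip empty lines
            digest = hashlib.sha1(norm.encode("utf-8")).digest()[:8]
            if digest in seen:
                continue  # duplicate of an earlier line
            seen.add(digest)
            kept.append(line)
        cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "Cookie notice\nHello world",
    "cookie   notice\nAnother page",
]
result = dedup_lines(pages)
```

Repeated site chrome such as cookie banners tends to disappear under this kind of filter, while text unique to each page survives.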
16

American Fuzzy Lop

American fuzzy lop - a security-oriented fuzzer

AFL (American Fuzzy Lop) is a widely used graybox fuzzer that discovers bugs by mutating inputs and steering execution using lightweight instrumentation. Instead of random mutations alone, it uses coverage feedback to evolve input corpora, pushing programs into deeper and more interesting code paths. Its workflow emphasizes quick start: point it at a target binary with compile-time instrumentation (or use QEMU-based mode when recompilation isn’t possible), seed it with a small corpus, and let it iterate. AFL is known for finding serious security issues in complex software due to its corpus minimization, queue management, and deterministic mutation stages that balance breadth and depth. ...

Downloads: 0 This Week

Last Update: 2025-10-09
See Project
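
The coverage-feedback loop in the AFL description can be sketched in a few lines. This is a toy model, not AFL's code: the `target` function reports its own branch IDs instead of relying on compile-time instrumentation, the only mutation is a single random byte overwrite, and all names (`target`, `mutate`, `fuzz`) are hypothetical.

```python
import random


def target(data: bytes) -> set:
    """Hypothetical instrumented target: returns the set of branch IDs hit."""
    hit = {"entry"}
    if len(data) > 0 and data[0] == ord("F"):
        hit.add("b1")
        if len(data) > 1 and data[1] == ord("U"):
            hit.add("b2")
    return hit


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Overwrite one randomly chosen byte with a random value."""
    buf = bytearray(data) or bytearray(b"\x00")
    pos = rng.randrange(len(buf))
    buf[pos] = rng.randrange(256)
    return bytes(buf)


def fuzz(seeds, iterations, seed=0):
    """Keep a mutated input only if it reaches a branch not seen before."""
    rng = random.Random(seed)
    corpus = list(seeds)
    seen = set()
    for s in corpus:
        seen |= target(s)
    for _ in range(iterations):
        parent = rng.choice(corpus)
        child = mutate(parent, rng)
        coverage = target(child)
        if not coverage <= seen:  # child hit at least one new branch
            seen |= coverage
            corpus.append(child)  # it becomes a parent for later mutations
    return corpus, seen


corpus, seen = fuzz([b"AA"], iterations=2000)
```

Because inputs that reach new branches are kept as parents, an input that passes the first check becomes the starting point for finding the second, which is the sense in which coverage feedback evolves the corpus.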
17

Albedo

A recommender system for discovering GitHub repos

...A reproducible setup and Makefile-driven workflow streamline tasks like spinning up services, loading datasets, training models, and generating candidate lists. Because it’s built around Spark’s scalable primitives, Albedo can experiment on substantial snapshots of GitHub metadata rather than toy corpora. The repo is also educational: it demonstrates a practical end-to-end pipeline from ingestion and feature preparation to training and ranking.

Downloads: 0 This Week

Last Update: 2025-10-16
See Project
18

Arabic Word diversity

Word frequency and diversity (distribution) across hundreds of corpora. You'll see both the lemma and the various forms.

Downloads: 0 This Week

Last Update: 2020-05-15
See Project
19

NLP Best Practices

Natural Language Processing Best Practices & Examples

...Data scientists started moving from traditional methods to state-of-the-art (SOTA) deep neural network (DNN) algorithms which use language models pretrained on large text corpora. This repository contains examples and best practices for building NLP systems, provided as Jupyter notebooks and utility functions. The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language. The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems.

Downloads: 0 This Week

Last Update: 2022-08-01
See Project
20

Arabic Rare Words Project

Text Analysis Egyptian Schoolbooks

The purpose is to compare the most common words in the language with the words used in textbooks for students in Egyptian schools. The frequency can help scholars and teachers better teach reading.

Downloads: 0 This Week

Last Update: 2021-04-14
See Project
21

YouTokenToMe

Unsupervised text tokenizer focused on computational efficiency

YouTokenToMe is a fast and efficient unsupervised text tokenization library designed for training subword embeddings, particularly useful for NLP models.

Downloads: 0 This Week

Last Update: 2025-01-24
See Project
22

cocoNLP

A Chinese information extraction tool

cocoNLP is a lightweight natural-language processing toolkit geared toward practical information extraction from raw text, especially for Chinese and mixed Chinese–English content. Instead of requiring a heavy pipeline, it focuses on quick wins such as extracting names, places, organizations, emails, phone numbers, and dates directly from unstructured sentences. The project blends pattern-based methods with NLP heuristics, giving developers dependable results for real-world texts like chats,...

Downloads: 0 This Week

Last Update: 2025-11-05
See Project
23

OLiA

OWL/DL ontologies for linguistic annotations

...The OLiA Reference Model itself is linked to community-maintained repositories such as GOLD (http://linguistics-ontology.org/) and ISOcat (http://www.isocat.org). The OLiA ontologies were originally developed as part of an infrastructure for the sustainable maintenance of linguistic resources (http://www.sfb441.uni-tuebingen.de/c2/index-engl.html); their fields of application include the formalization of annotation schemes, concept-based querying over heterogeneously annotated corpora, and the development of interoperable NLP pipelines.

Downloads: 0 This Week

Last Update: 2019-11-11
See Project
24

UnsupervisedMT

Phrase-Based & Neural Unsupervised Machine Translation

Unsupervised Machine Translation is a research repository that implements both phrase-based SMT and neural MT approaches for translation without parallel corpora. The neural component supports multiple architectures—seq2seq, biLSTM with attention, and Transformer—and allows extensive parameter sharing across languages to improve data efficiency. Training relies on denoising auto-encoding and back-translation, with on-the-fly, multithreaded generation of synthetic parallel data to continually refresh supervision signals. ...

Downloads: 0 This Week

Last Update: 6 days ago
See Project
25

@Note2

@Note2 - A workbench for Biomedical Text Mining

Biomedical Text Mining (BioTM) provides valuable approaches to the automated curation of scientific literature.

1 Review

Downloads: 1 This Week

Last Update: 2019-05-13
See Project