corpus free download - SourceForge

Showing 63 open source projects for "corpus"

Filters: Artificial Intelligence, Linux

1

gensim

Topic Modelling for Humans

Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora. The target audience is the natural language processing (NLP) and information retrieval (IR) community.

Downloads: 7 This Week

Last Update: 2025-10-16
See Project
2

ArXiv MCP Server

A Model Context Protocol server for searching and analyzing arXiv

...Issue threads show feature requests such as extracting embedded LaTeX and improving markdown conversion, reflecting active community use in research flows. It’s designed to be drop-in for MCP clients, giving them typed inputs/outputs and predictable errors around a well-known academic corpus. For developers building research copilots, it removes the glue work of wiring arXiv APIs into an agent toolchain.

Downloads: 0 This Week

Last Update: 2026-04-26
See Project
3

Kimi K2

Kimi K2 is the large language model series developed by Moonshot AI

Kimi K2 is Moonshot AI’s advanced open-source large language model built on a scalable Mixture-of-Experts (MoE) architecture that combines a trillion total parameters with a subset of ~32 billion active parameters to deliver powerful and efficient performance on diverse tasks. It was trained on an enormous corpus of over 15.5 trillion tokens to push frontier capabilities in coding, reasoning, and general agentic tasks while addressing training stability through novel optimizer and architecture design strategies. The model family includes variants like a foundational base model that researchers can fine-tune for specific use cases and an instruct-optimized variant primed for general-purpose chat and agent-style interactions, offering flexibility for both experimentation and deployment. ...

Downloads: 21 This Week

Last Update: 2026-01-27
See Project
4

Reor Project

Private & local AI personal knowledge management app

Reor is an AI-powered desktop note-taking app: it automatically links related notes, answers questions on your notes, provides semantic search and can generate AI flashcards. Everything is stored locally and you can edit your notes with an Obsidian-like markdown editor. The hypothesis of the project is that AI tools for thought should run models locally by default. Reor stands on the shoulders of the giants Ollama, Transformers.js & LanceDB to enable both LLMs and embedding models to run locally.

Downloads: 5 This Week

Last Update: 2025-04-13
See Project
5

Chronos Forecasting

Pretrained (Language) Models for Probabilistic Time Series Forecasting

...Once trained, probabilistic forecasts are obtained by sampling multiple future trajectories given the historical context. Chronos models have been trained on a large corpus of publicly available time series data, as well as synthetic data generated using Gaussian processes.

Downloads: 1 This Week

Last Update: 2025-12-17
See Project
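
The Chronos Forecasting entry above describes obtaining probabilistic forecasts by sampling multiple future trajectories from the historical context. A minimal sketch of that idea follows. It does not use Chronos: the "model" is a toy random walk fitted to the history, and the names `sample_trajectory` and `forecast_quantiles` are hypothetical, introduced only for illustration.

```python
import random
import statistics


def sample_trajectory(history, horizon, rng):
    """Toy stand-in for a trained model (an assumption, not Chronos):
    a random walk whose step mean and spread come from the history."""
    diffs = [b - a for a, b in zip(history, history[1:])]
    drift = statistics.mean(diffs)
    spread = statistics.stdev(diffs)
    value, path = history[-1], []
    for _ in range(horizon):
        value += rng.gauss(drift, spread)
        path.append(value)
    return path


def forecast_quantiles(history, horizon, num_samples=200, seed=0):
    """Sample many futures, then summarize each future step
    by its 10%, 50% and 90% quantiles."""
    rng = random.Random(seed)
    paths = [sample_trajectory(history, horizon, rng) for _ in range(num_samples)]
    summary = []
    for step in zip(*paths):  # all sampled values for one future time step
        deciles = statistics.quantiles(step, n=10)  # 9 cut points: 10% .. 90%
        summary.append((deciles[0], deciles[4], deciles[8]))
    return summary


history = [10.0, 11.0, 13.0, 12.0, 14.0, 15.0]
for low, median, high in forecast_quantiles(history, horizon=3):
    print(f"{low:.2f} <= {median:.2f} <= {high:.2f}")
```

The point of the sketch is the shape of the output: one sampled path is a single possible future, while the spread across many paths gives an interval per time step.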
6

CodeGeeX4

CodeGeeX4-ALL-9B, a versatile model for all AI software development

CodeGeeX4 is the fourth-generation open source multilingual code large language model (LLM) developed by ZhipuAI. Designed as a powerful AI coding assistant, it supports over 100 programming languages and has been trained on a massive code and natural language corpus. Compared to its predecessors, CodeGeeX4 introduces improved reasoning, stronger alignment with developer needs, and better performance on real-world programming benchmarks. It supports tasks such as code completion, generation from natural language descriptions, code translation, bug fixing, and explanation. The repository provides model checkpoints, inference examples, and fine-tuning guides, making it adaptable for both research and practical software development workflows. ...

Downloads: 5 This Week

Last Update: 6 days ago
See Project
7

DeepSeek Coder

DeepSeek Coder: Let the Code Write Itself

DeepSeek-Coder is a series of code-specialized language models designed to generate, complete, and infill code (and mixed code + natural language) with high fluency in both English and Chinese. The models are trained from scratch on a massive corpus (~2 trillion tokens), of which about 87% is code and 13% is natural language. This dataset covers project-level code structure (not just line-by-line snippets), using a large context window (e.g. 16K) and a secondary fill-in-the-blank objective to encourage better contextual completions and infilling. Multiple sizes of the model are offered (e.g. 1B, 5.7B, 6.7B, 33B) so users can trade off inference cost vs capability. ...

Downloads: 8 This Week

Last Update: 2025-11-11
See Project
8

VoxCPM

TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning

...This design helps decouple semantic and acoustic information while preserving fine-grained prosody, leading to more stable and expressive generation than many discrete-token systems. Trained on a large 1.8-million-hour bilingual corpus, VoxCPM can infer appropriate speaking style from context, dynamically adjusting intonation, rhythm, and emotional tone. It supports zero-shot voice cloning from a short reference audio clip, capturing timbre, accent, and pacing to closely mimic a target speaker without per-speaker fine-tuning.

Downloads: 14 This Week

Last Update: 2026-04-28
See Project
9

FlexLLMGen

Running large language models on a single GPU

...This design allows organizations to deploy powerful language models for high-volume tasks without the infrastructure costs typically associated with large-scale AI systems. The project is particularly useful for workloads that prioritize throughput over latency, including benchmarking experiments and large corpus analysis.

Downloads: 0 This Week

Last Update: 2026-03-10
See Project
$300 Free Credits for Your Google Cloud Projects
Start building on Google Cloud with $300 in free credits. No commitment, no credit card required until you're ready to scale.

Launch your next project with $300 in free Google Cloud credits—no strings attached. Test, build, and deploy without risk. Use your credits across the entire Google Cloud platform to find what works best for your needs. After your credits are used, continue with always-free tier services. Only pay when you're ready to scale. Sign up in minutes and start exploring.

Start Free Trial
10

Step3-VL-10B

Multimodal model achieving SOTA performance

...Despite having only about 10 billion parameters, it delivers performance that rivals or even surpasses much larger models (10×–20× larger) on a wide range of multimodal benchmarks covering reasoning, perception, and complex tasks, positioning it as one of the most powerful models in its class. It achieves this efficiency and strong performance through unified pre-training on a massive 1.2 trillion-token multimodal corpus that jointly optimizes a language-aligned perception encoder with a powerful decoder, creating deep synergy between image processing and text understanding.

Downloads: 0 This Week

Last Update: 2026-01-22
See Project
11

nanochat

The best ChatGPT that $100 can buy

nanochat is a from-scratch, end-to-end “mini ChatGPT” that shows the entire path from raw text to a chatty web app in one small, dependency-lean codebase. The repository stitches together every stage of the lifecycle: tokenizer training, pretraining a Transformer on a large web corpus, mid-training on dialogue and multiple-choice tasks, supervised fine-tuning, optional reinforcement learning for alignment, and finally efficient inference with caching. Its north star is approachability and speed: you can boot a fresh GPU box and drive the whole pipeline via a single script, producing a usable chat model in hours and a clear markdown report of what happened. ...

Downloads: 0 This Week

Last Update: 2026-05-05
See Project
12

OSS-Fuzz Gen

LLM powered fuzzing via OSS-Fuzz

...The system integrates with modern LLM-assisted workflows to draft harness code and then iterates based on build errors or low coverage signals. Importantly, it aligns with OSS-Fuzz conventions, generating corpus seeds, build rules, and sanitizer settings so projects can plug in quickly. Reports highlight what functions were targeted, how coverage evolved, and where manual hints could unlock more paths. The goal is pragmatic: shrink the gap between “we should fuzz this” and “we have robust fuzzing running in CI,” especially for understaffed maintainers.

Downloads: 0 This Week

Last Update: 2025-10-12
See Project
13

modnlp

Modular Suite of NLP Tools

...It provides an API and tools for (inverted) indexing, storage and retrieval of large amounts of text, with XML-based handling of metadata; tools for text categorisation, including functionality for XML parsing, term set reduction (and basic keyword extraction), probabilistic classifier induction, sample classification, and evaluation modules; and a suite of corpus management, curation and distributed access tools. If you use the tool, please consider referencing it using the following article: Luz, S., & Sheehan, S. (2020). Methods and visualization tools for the analysis of medical, political and scientific concepts in Genealogies of Knowledge. Palgrave Communications, 6(1), 1-20. ...

Downloads: 2 This Week

Last Update: 2026-06-07
See Project
14

Wikipedia2Vec

A tool for learning vector representations of words and entities

Wikipedia2Vec is an embedding learning tool that creates word and entity vector representations from Wikipedia, enabling NLP models to leverage structured and contextual knowledge.

Downloads: 1 This Week

Last Update: 2025-01-24
See Project
15

NLG-Eval

Evaluation code for various unsupervised automated metrics

NLG-Eval is a toolkit for evaluating the quality of natural language generation (NLG) outputs using multiple automated metrics such as BLEU, METEOR, and ROUGE.

Downloads: 1 This Week

Last Update: 2025-01-24
See Project
16

Paul Graham GPT

RAG on Paul Graham's essays

Paul Graham GPT is a specialized AI-powered search and chat app built on a corpus of essays from Paul Graham, giving users the ability to query and discuss his writings in a conversational way. The repo stores the full text of his essays (chunked), uses embeddings (e.g. via OpenAI embeddings) to allow semantic search over that corpus, and hosts a chat interface that combines retrieval results with LLM-based answering — enabling RAG (retrieval-augmented generation) over a fixed dataset. ...

Downloads: 0 This Week

Last Update: 2025-12-08
See Project
17

Minimal text diffusion

A minimal implementation of diffusion models for text generation

...Note that you may have to increase the sequence length (--seq_len) if your corpus is longer than the simple corpus. The other default arguments are set to match the best setting I found for the simple corpus.

Downloads: 0 This Week

Last Update: 2023-03-23
See Project
18

hebrew-gpt_neo

Hebrew text generation models based on EleutherAI's gpt-neo

Hebrew text generation models based on EleutherAI's gpt-neo. Each was trained on a TPUv3-8 which was made available to me via the TPU Research Cloud Program. The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

Downloads: 0 This Week

Last Update: 2023-03-23
See Project
19

AliceMind

ALIbaba's Collection of Encoder-decoders from MinD

...Pre-trained models for natural language generation (NLG). We propose a novel scheme that jointly pre-trains an autoencoding and autoregressive language model on a large unlabeled corpus, specifically designed for generating new text conditioned on context. It achieves new SOTA results in several downstream tasks.

Downloads: 0 This Week

Last Update: 2022-08-17
See Project
20

CC-Net

Tools to download and cleanup Common Crawl data

cc_net provides tools to download, segment, clean, and filter Common Crawl to build large-scale text corpora, including monolingual datasets and the multilingual CC-100 collection introduced in the associated paper. It includes pipelines to fetch snapshots, extract text, de-duplicate, identify language, and apply quality filtering based on heuristics and language models. The outputs are intended for pretraining language models and for creating standardized corpora that can be reproduced or...

Downloads: 0 This Week

Last Update: 2025-10-11
See Project
21

node-markov-generator

Generates simple sentences based on a given text corpus

This simple generator emits short sentences based on the given text corpus using a Markov chain. To put it simply, it works kinda like the word suggestions you get while typing messages on your smartphone. It analyzes which word is followed by which in the given corpus, and how often. Then, for any given word, it tries to predict what the next one might be. Here you create an instance of TextGenerator, passing an array of strings to it; the array represents your text corpus, which will be used to "train" the generator. ...

Downloads: 0 This Week

Last Update: 2023-03-23
See Project
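
The node-markov-generator entry above describes counting which word follows which in a corpus, and how often, and then predicting the next word from those counts. A minimal sketch of that technique follows. It is written in Python and is not the library's own code or API: `build_transitions` and `generate` are hypothetical names, and the three-line corpus is made up for illustration.

```python
import random
from collections import Counter, defaultdict


def build_transitions(corpus):
    """Count, for each word, which words follow it and how often."""
    transitions = defaultdict(Counter)
    for line in corpus:
        words = line.split()
        for current, following in zip(words, words[1:]):
            transitions[current][following] += 1
    return transitions


def generate(transitions, start, max_words=8, rng=None):
    """Walk the chain from `start`, sampling each next word
    in proportion to how often it followed the previous one."""
    rng = rng or random.Random()
    sentence = [start]
    while len(sentence) < max_words:
        followers = transitions.get(sentence[-1])
        if not followers:  # dead end: this word was never followed by anything
            break
        words, counts = zip(*followers.items())
        sentence.append(rng.choices(words, weights=counts, k=1)[0])
    return " ".join(sentence)


# Made-up corpus: an array of strings, as in the entry's description.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
transitions = build_transitions(corpus)
print(generate(transitions, "the", rng=random.Random(0)))
```

Every adjacent word pair in the output occurs somewhere in the corpus, which is why the sentences look locally plausible even though the chain has no memory beyond the previous word.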
22

GPT2 for Multiple Languages

GPT2 for Multiple Languages, including pretrained models

...Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). Simplified GPT2 training scripts (based on Grover, supporting TPUs). Ported BERT tokenizer, compatible with multilingual corpora. 1.5B GPT2 pretrained Chinese model (~15 GB corpus, 100k steps). Batteries-included Colab demo. 1.5B GPT2 pretrained Chinese model (~30 GB corpus, 220k steps).

Downloads: 1 This Week

Last Update: 2023-03-23
See Project
23

KSUCCA Corpus

A 50-million-token corpus of Classical Arabic.

King Saud University Corpus of Classical Arabic (KSUCCA) is a pioneering 50-million-token annotated corpus of Classical Arabic texts from the pre-Islamic era until the fourth Hijri century (equivalent to the period from the seventh until the early eleventh century CE), which is the period of pure Classical Arabic. The main aim of this corpus is to support the study of the distributional lexical semantics of the words of the Quran.

Downloads: 4 This Week

Last Update: 2020-02-19
See Project
24

jieba

Stuttering Chinese word segmentation

"Jieba" (Chinese for "to stutter") Chinese word segmentation: built to be the best Python Chinese word segmentation component. Four word segmentation modes are supported. Precise mode tries to cut the sentence most precisely and is suitable for text analysis. Full mode scans out every word that can be formed in the sentence; it is very fast but cannot resolve ambiguity. Search engine mode builds on precise mode by splitting long words again to improve recall, which is suitable...

Downloads: 2 This Week

Last Update: 2022-02-18
See Project
25

Dragonfire

The open-source virtual assistant for Ubuntu-based Linux distributions

Dragonfire is the open-source virtual assistant project for Ubuntu-based Linux distributions. Her main objective is to serve as a command and control interface for the helmet user, so that you can give orders using only voice commands and eye movements. That makes the helmet hands-free. We are planning to ship Dragonfire as a preinstalled software package on the DragonOS Linux distribution. DragonOS will be a Linux distribution specially designed for the helmet. It will...

Downloads: 2 This Week

Last Update: 2022-01-13
See Project