corpus free download - SourceForge

Showing 18 open source projects for "corpus"

1

Echidna

Ethereum smart contract fuzzer

...It uses sophisticated grammar-based fuzzing campaigns based on a contract ABI to falsify user-defined predicates or Solidity assertions. We designed Echidna with modularity in mind, so it can be easily extended to include new mutations or test specific contracts in specific cases. Optional corpus collection, mutation and coverage guidance to find deeper bugs. Powered by Slither to extract useful information before the fuzzing campaign. Source code integration to identify which lines are covered after the fuzzing campaign. Curses-based retro UI, text-only or JSON output.

Downloads: 2 This Week

Last Update: 2026-03-27
See Project
2

ARC-AGI

The Abstraction and Reasoning Corpus

ARC-AGI is a benchmark dataset and experimental framework designed to evaluate and advance artificial general intelligence by testing systems on abstract reasoning tasks that require human-like problem-solving abilities. It consists of a curated set of tasks where models must infer patterns from input-output examples and apply those rules to new unseen cases, without relying on memorization or prior training data. The dataset is structured as grid-based puzzles, where each task requires...

Downloads: 0 This Week

Last Update: 2026-04-03
See Project
3

Honggfuzz

Security oriented software fuzzer

honggfuzz is a general-purpose, high-performance fuzzer that mixes coverage feedback with practical crash triage to uncover memory-safety and logic bugs. It supports multiple fuzzing modes—stdin, file, and networking—so targets can be exercised the same way they run in production. Instrumentation via compiler hooks or hardware/perf counters guides mutations toward previously unseen edges, while persistent mode keeps the target process alive to amortize startup costs. The tool integrates...

Downloads: 0 This Week

Last Update: 2026-01-04
See Project
4

Big List of Naughty Strings

List of strings which have a high probability of causing issues

The Big List of Naughty Strings is a community-maintained catalog of “gotcha” inputs that commonly break software, from unusual Unicode to SQL and script injection payloads. It exists so developers and QA engineers can easily test edge cases that normal test data would miss, such as zero-width characters, right-to-left marks, emojis, foreign alphabets, and long or malformed strings. By throwing these strings at forms, APIs, databases, and UIs, teams can discover encoding bugs, sanitizer...

Downloads: 0 This Week

Last Update: 2025-11-05
See Project
5

Application Generator for Stemmers

This is an application generator for conflation algorithms in the Perl language. The system supports generating Perl source code for a stemmer from a rule file, running a stemmer supported by the system, and parsing a corpus file.

Downloads: 0 This Week

Last Update: 2021-06-20
See Project
6

PyTorch Natural Language Processing

Basic Utilities for PyTorch Natural Language Processing (NLP)

...With your batch in hand, you can use PyTorch to develop and train your model using gradient descent. For example, check out this example code for training on the Stanford Natural Language Inference (SNLI) Corpus. Now that you've set up your pipeline, you may want to ensure that some functions run deterministically. Wrap any code that's random with fork_rng and you'll be good to go. Now that you've computed your vocabulary, you may want to make use of pre-trained word vectors to set your embeddings.

Downloads: 0 This Week

Last Update: 2022-08-09
See Project
7

Hypermachiavel

Hypermachiavel is software developed in Java, aimed at end users such as linguists and humanities researchers, offering various manipulations on an aligned corpus of texts.

Downloads: 1 This Week

Last Update: 2018-03-10
See Project
8

Chinese Poetry

The most comprehensive database of Chinese poetry

...Developers and scholars can build tools that query by author, era, keyword, or poetic form using the standardized data structure. Because the project is open, contributors improve metadata, correct text variants, and maintain consistency across classical collections. The corpus enables digital humanities, poetry apps, styling tools, and creative projects that intermix poems with translation, visualization, or musical settings.

Downloads: 0 This Week

Last Update: 2025-09-05
See Project
9

Question Answering Corpus

Question answering dataset in "Teaching Machines to Read & Comprehend"

RC-Data is a dataset generation framework created by Google DeepMind to produce large-scale reading comprehension question-answer pairs from CNN and Daily Mail news articles. The dataset, introduced in the 2015 paper “Teaching Machines to Read and Comprehend” (Hermann et al., NIPS 2015), was among the first large corpora designed to train and evaluate machine reading and comprehension models. The repository provides scripts for downloading archived CNN and Daily Mail articles from the...

Downloads: 3 This Week

Last Update: 6 days ago
See Project
10

mwetoolkit

THIS PROJECT MIGRATED TO https://gitlab.com/mwetoolkit/mwetoolkit3/

...These include idioms (kick the bucket), noun compounds (cable car), phrasal verbs (take off, give up), etc. Even though it focuses on multiword expressions, the framework is quite complete and can also be useful in any corpus-based study in computational linguistics. The mwetoolkit can be applied to virtually any text collection, language, and MWE type. It is a command-line tool written mostly in Python. Its development started in 2010 as a PhD thesis, but the project remains active (see the SVN logs). Up-to-date documentation and details about the tool can be found on the mwetoolkit website: http://mwetoolkit.sourceforge.net/

1 Review

Downloads: 0 This Week

Last Update: 2019-05-01
See Project
11

Australian National Corpus

An ongoing project to collate and provide access to language data

Includes:
• Scripts for the program/code developed
• High level architecture diagrams
• Install guides for developers
• Links to end user documentation on the AusNC website

Note: The BSD license applies to customised plug-ins, scripts and ingest programs developed by the AusNC project team. Additional open source, 3rd party software products used by the AusNC solution are referenced on our SF wiki space.

Downloads: 1 This Week

Last Update: 2016-11-29
See Project
12

Semantic-PA

Semantic technologies for the Public Administration (PA)

Semantic-PA is a research project funded under the POR Puglia 2007-2013 programme. It aims to build a prototype software platform, based on Semantic Web (SW) technologies, capable of integrating the information and services of the Public Administration (PA), with particular attention to how users/citizens access and consult the content of e-government portals.

Downloads: 0 This Week

Last Update: 2013-11-26
See Project
13

knowceans

Utility classes from maps to search engine to random samplers

...Highlights:

org.knowceans.util:
• IndexQuickSort, TableList: apply order of one array/list to others
• Vectors, ArrayUtils: array convenience
• RandomSamplers, CokusRandom, ArmSampler, Densities: random sampling and distributions
• Arguments: command line parser
• StopWatch, Which, ExternalProcess: runtime stuff
• ParallelFor: OpenMP workalike
• PatternString, NamedGroupRegex: regex convenience

org.knowceans.corpus:
• CorpusSearcher: full-text search engine
• LabelNumCorpus: svmlight corpus storage and filtering
• NIPS corpus with text, authors, labels and citations

org.knowceans.map:
• InvertibleHashMultiMap, BijectiveHashMap: implement n:m and 1:1 relations.

Other libs:
• knowceans-arms = port of the Adaptive Rejection Metropolis Sampler (ARMS) for arbitrary distributions
• lda-j = port of lda-c, implementing Latent Dirichlet Allocation (LDA)

Downloads: 0 This Week

Last Update: 2015-11-28
See Project
14

Corsis (formerly Tenka Text)

An open-source corpus analysis class library written in C#. The GUI of Tenka Text 0.1.3 comes with Wordlister, an advanced, extremely fast graphical wordlist tool, and a simple regex concordance tool. Tenka Text is the open-source answer to WordSmith Tools.

Downloads: 1 This Week

Last Update: 2013-05-10
See Project
15

PyAnnotation

PyAnnotation is a Python library for accessing and manipulating linguistically annotated corpus files. Supported file formats are Kura XML, Elan XML and Toolbox files. A Corpus Reader API is provided to support statistical analysis within the NLTK.

Downloads: 0 This Week

Last Update: 2013-04-29
See Project
16

RocketReader Readability

...Unlike well-known readability metrics such as Fog, Kincaid, SMOG, ARI, Flesch, and Coleman-Liau, this metric takes into account far more factors and is standardized against a corpus

Downloads: 0 This Week

Last Update: 2015-08-03
See Project
17

CoPT, Corpus Processing Tools

CoPT, Corpus Processing Tools, is a set of Java classes intended to assist field linguists, NLP researchers and developers, students and software developers in all corpus-related processing.

Downloads: 0 This Week

Last Update: 2013-03-11
See Project
18

reputron

reputron is a knowledge extraction engine platform that covers all aspects of text mining, relevance, indexing and querying on a corpus of text documents.

Downloads: 0 This Week

Last Update: 2015-04-08
See Project