text corpus free download

Showing 65 open source projects for "text corpus"

1

Minimal text diffusion

A minimal implementation of diffusion models for text generation

A minimal implementation of diffusion models of text: it learns a diffusion model of a given text corpus, which can then be used to generate text samples. The main idea was to retain just enough code to allow training a simple diffusion model and generating samples, remove image-related terms, and make it easier to use. To train a model, run scripts/train.sh. By default, this will train a model on the simple corpus. However, you can change this to any text file using the --train_data...

Downloads: 0 This Week

Last Update: 2023-03-23
See Project
2

IMS Open Corpus Workbench

Indexing and query tools for very large text corpora

The IMS Open Corpus Workbench is a collection of tools for managing and querying large text corpora (100 M words and more) with linguistic annotations. Its central component is the flexible and efficient query processor CQP, which can be used interactively in a terminal session, as a backend e.g. from a Perl script, or through the Web-based GUI CQPweb.

Downloads: 78 This Week

Last Update: 1 day ago
See Project
3

Reor Project

Private & local AI personal knowledge management app

Reor is an AI-powered desktop note-taking app: it automatically links related notes, answers questions on your notes, provides semantic search and can generate AI flashcards. Everything is stored locally and you can edit your notes with an Obsidian-like markdown editor. The hypothesis of the project is that AI tools for thought should run models locally by default. Reor stands on the shoulders of the giants Ollama, Transformers.js & LanceDB to enable both LLMs and embedding models to run locally.

Downloads: 0 This Week

Last Update: 2024-09-04
See Project
4

Echidna

Ethereum smart contract fuzzer

... in specific cases. Optional corpus collection, mutation and coverage guidance to find deeper bugs. Powered by Slither to extract useful information before the fuzzing campaign. Source code integration to identify which lines are covered after the fuzzing campaign. Curses-based retro UI, text-only or JSON output.

Downloads: 0 This Week

Last Update: 2024-07-16
See Project
5

TXM

Unicode-XML-TEI text/corpus analysis platform

TXM is a free and open-source cross-platform Unicode & XML based text/corpus analysis environment and graphical client, supporting Windows, Linux and Mac OS X. It can also be used online as a J2EE standard compliant web portal (GWT based) with access control built in. Download the latest version of TXM: http://textometrie.ens-lyon.fr/spip.php?rubrique61&lang=en TXM offers a comprehensive range of analysis tools (concordances, collocate search, frequency lists, etc.) based on the powerful...

Downloads: 10 This Week

Last Update: 2023-10-02
See Project
6

Queries-for-Arabic-OSAC-Corpus

43 queries on various topics for the Information Retrieval Collection. The corpus is created from the OSAC corpus of journalistic texts and consists of 4763 articles recovered from the Arabic BBC News. https://sourceforge.net/projects/ar-text-mining/files/Arabic-Corpora/

Downloads: 1 This Week

Last Update: 2021-12-03
See Project
7

TEXminer

Text Mining Classification for Texts in ASCII, Unicode and PDF Format.

TEXminer uses generic Text Mining Methods to analyze Unicode Files as plain Text or PDF. The Text Database can be saved in XML where the original Text, the Sentence and Word Lists and additional Parameters (e.g. Abbreviations) are stored. TEXminer allows Language Detection by Letter Frequency Analysis, finding important Words by Cooccurrence Analysis, Determination of Central Expressions, Thematic Text Classification (also Semantic Groups) and Fingerprint Comparison. Because TEXminer...

Downloads: 2 This Week

Last Update: 2023-10-21
See Project
8

Syllabic Verse Analysis (SylVA)

Syllabifies and scans syllabic verse texts for metrical annotation

The tool syllabifies and scans texts written in syllabic verse for metrical corpus annotation. It is designed for Old French and Old Occitan and exports the results in PAULA format suitable for the ANNIS platform (http://corpus-tools.org/annis/). It was first used in the preparation of the metrical treebank containing the Old Occitan Boeci text (cf. Rainsford and Scrivner 2014); development continued for use with the Old Gallo-Romance Corpus (http://www.ogr-corpus.org).

Downloads: 0 This Week

Last Update: 2024-08-25
See Project
9

modnlp-plugins

External plugins for modnlp/teccli

This is a general project for modnlp/teccli plugins, with a focus on text visualization.

Downloads: 0 This Week

Last Update: 2023-05-06
See Project
10

modnlp

Modular Suite of NLP Tools

modnlp aims to provide a modular architecture and tools for natural language processing written (mainly) in Java. It provides an API and tools for (inverted) indexing, storage and retrieval of large amounts of text, with (XML-based) handling of meta-data, tools for text categorisation, including functionality for XML parsing, term set reduction (and basic keyword extraction), probabilistic classifier induction, sample classification tools, and evaluation modules, a suite of corpus management...

Downloads: 0 This Week

Last Update: 2024-06-20
See Project
11

Linguistic Analyzer

The Linguistic Analyzer is a tool for corpus analysis and comparison

The Linguistic Analyzer (Almuhalil Alloghawy) is a free tool designed by a team from Al-Imam Muhammad bin Saud Islamic University. It can be used for corpus analysis and comparison across several linguistic characteristics, such as frequency list generation, concordances, collocation extraction, the difference between two words, and keyword identification.

Downloads: 1 This Week

Last Update: 2022-04-16
See Project
12

hebrew-gpt_neo

Hebrew text generation models based on EleutherAI's gpt-neo

Hebrew text generation models based on EleutherAI's gpt-neo. Each was trained on a TPUv3-8 which was made available to me via the TPU Research Cloud Program. The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

Downloads: 0 This Week

Last Update: 2023-03-23
See Project
13

AliceMind

ALIbaba's Collection of Encoder-decoders from MinD

... and sentence levels, respectively. Pre-trained models for natural language generation (NLG). We propose a novel scheme that jointly pre-trains an autoencoding and autoregressive language model on a large unlabeled corpus, specifically designed for generating new text conditioned on context. It achieves new SOTA results in several downstream tasks.

Downloads: 0 This Week

Last Update: 2022-08-17
See Project
14

Open-Content Text Corpus

The OCTC hosts open-content texts, encoded in TEI P5, for many languages, each in a separate subcorpus. Another part of the OCTC stores inter-language alignment info. The project is intended to be an open platform for collaboration.

Downloads: 0 This Week

Last Update: 2021-01-14
See Project
15

agd-text

This corpus contains 10 essays comprising 752 sentences (with a total of 4,160 words). The essays were selected from different collections of partially or fully diacritized Arabic texts, all of which are available in the Tashkeela corpus. Texts in this corpus have been used in the evaluation of AGD checker. There are two types of texts in this corpus: 1- Texts without errors to evaluate AGD in terms of detecting and correcting errors that we do not know about before the checking process 2...

Downloads: 1 This Week

Last Update: 2021-02-01
See Project
16

DWDS/Dialing Concordance

a collection of indexing and search tools for corpus linguists

DWDS/Dialing Concordance (DDC) - a collection of index and search tools for corpus linguists

2 Reviews

Downloads: 2 This Week

Last Update: 2021-06-16
See Project
17

node-markov-generator

Generates simple sentences based on given text corpus

This simple generator emits short sentences based on the given text corpus using a Markov chain. To put it simply, it works kinda like word suggestions that you have while typing messages in your smartphone. It analyzes which word is followed by which in the given corpus and how often. And then, for any given word it tries to predict what the next one might be. Here you create an instance of TextGenerator passing an array of strings to it - it represents your text corpus which will be used...

Downloads: 0 This Week

Last Update: 2023-03-23
See Project
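
The word-follows-word idea in the node-markov-generator description can be sketched in a few lines. This is only an illustration in Python using the standard library, not the project's own code or API: the names build_chain and generate, the start-word parameter, and the tiny sample corpus are all assumptions made for this example.

```python
import random
from collections import defaultdict


def build_chain(sentences):
    """Map each word to the list of words observed right after it.

    Repeated followers are kept, so a word that follows more often
    is proportionally more likely to be picked later.
    """
    chain = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def generate(chain, start, max_words=10, rng=None):
    """Walk the chain from `start`, sampling one observed follower per step."""
    rng = rng or random.Random()
    words = [start]
    while len(words) < max_words:
        followers = chain.get(words[-1])
        if not followers:
            # The last word was never seen with a follower: stop here.
            break
        words.append(rng.choice(followers))
    return " ".join(words)


# Hypothetical two-sentence corpus, passed as a list of strings.
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
chain = build_chain(corpus)
print(generate(chain, "the", rng=random.Random(0)))
```

Storing duplicates in a plain list is the simplest way to keep the "how often" information; a real implementation might store counts instead and handle punctuation and sentence boundaries.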
18

KSUCCA Corpus

A 50-million-token corpus of Classical Arabic.

King Saud University Corpus of Classical Arabic (KSUCCA) is a pioneering 50-million-token annotated corpus of Classical Arabic texts from the pre-Islamic era until the fourth Hijri century (equivalent to the period from the seventh until the early eleventh century CE), which is the period of pure classical Arabic. The main aim of this corpus is to support the study of the distributional lexical semantics of the words of the Quran. However, it can be used for other research purposes...

Downloads: 4 This Week

Last Update: 2020-02-19
See Project
19

korpus

Corpus Linguistics Software

Software for Corpus Linguistics, including a Corpus Text Editor, Web-based search, etc. This project was created for the Belarusian Corpus, but can be used for other languages with some adaptation.

Downloads: 0 This Week

Last Update: 2021-02-02
See Project
20

GPT2 for Multiple Languages

GPT2 for Multiple Languages, including pretrained models

With just 2 clicks (not including Colab auth process), the 1.5B pretrained Chinese model demo is ready to go. The contents in this repository are for academic research purpose, and we do not provide any conclusive remarks. Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). Simplified GPT2 train scripts (based on Grover, supporting TPUs). Ported bert tokenizer, multilingual corpus compatible. 1.5B GPT2 pretrained Chinese model (~15G corpus, 10w steps). Batteries...

Downloads: 0 This Week

Last Update: 2023-03-23
See Project
21

jieba

Stuttering Chinese word segmentation

"Jieba" Chinese word segmentation aims to be the best Python Chinese word segmentation component. Four word segmentation modes are supported. Precise mode tries to cut the sentence most precisely and is suitable for text analysis. Full mode scans out all the words that can be formed in the sentence; it is very fast but cannot resolve ambiguity. Search engine mode builds on precise mode by splitting long words again to improve the recall rate, which is suitable...

Downloads: 0 This Week

Last Update: 2022-02-18
See Project
22

Dragonfire

The open-source virtual assistant for Ubuntu based Linux distributions

... It will contain various software packages for controlling the helmet. It will be the first of its kind. Dragonfire uses Mozilla DeepSpeech to understand your voice commands and Festival Speech Synthesis System to handle text-to-speech tasks.

Downloads: 0 This Week

Last Update: 2022-01-13
See Project
23

PyTorch Natural Language Processing

Basic Utilities for PyTorch Natural Language Processing (NLP)

PyTorch-NLP is a library for Natural Language Processing (NLP) in Python. It’s built with the very latest research in mind, and was designed from day one to support rapid prototyping. PyTorch-NLP comes with pre-trained embeddings, samplers, dataset loaders, metrics, neural network modules and text encoders. It’s open-source software, released under the BSD3 license. With your batch in hand, you can use PyTorch to develop and train your model using gradient descent. For example, check out...

Downloads: 0 This Week

Last Update: 2022-08-09
See Project
24

SimpleLemmatizer

This program is for text lemmatization

It lemmatizes texts based on a supplied model. The base model is for Slovak texts and is created from the Slovak National Corpus, copyright by Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences.

Downloads: 0 This Week

Last Update: 2020-03-22
See Project
25

Arabic Corpus

Text categorization, arabic language processing, language modeling

The Arabic Corpus, compiled by Dr. Mourad Abbas (http://sites.google.com/site/mouradabbas9/corpora). The Khaleej-2004 corpus contains 5690 documents divided into 4 topics (categories). The Watan-2004 corpus contains 20291 documents organized in 6 topics (categories). Researchers who use these two corpora should cite the two main references: (1) For the Watan-2004 corpus: M. Abbas, K. Smaili, D. Berkani, (2011) Evaluation of Topic Identification Methods...

Downloads: 7 This Week

Last Update: 2019-03-05
See Project