Page 2 | corpus free download

Showing 170 open source projects for "corpus"

1

agd-text

This corpus contains 10 essays comprising 752 sentences (4,160 words in total). The essays were selected from different collections of partially or fully diacritized Arabic texts, all of which are available in the Tashkeela corpus. Texts in this corpus have been used in the evaluation of the AGD checker. There are two types of texts in this corpus: 1- Texts without errors, to evaluate AGD in terms of detecting and correcting errors that are not known before the checking process 2...

Downloads: 0 This Week

Last Update: 2021-02-01
See Project
2

GPT2 for Multiple Languages

GPT2 for Multiple Languages, including pretrained models

With just 2 clicks (not including the Colab auth process), the 1.5B pretrained Chinese model demo is ready to go. The contents of this repository are for academic research purposes, and we do not provide any conclusive remarks. Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). Simplified GPT2 training scripts (based on Grover, supporting TPUs). Ported BERT tokenizer, multilingual corpus compatible. 1.5B GPT2 pretrained Chinese model (~15G corpus, 100k steps). Batteries...

Downloads: 0 This Week

Last Update: 2023-03-23
See Project
3

iramuteq

IRAMUTEQ (Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires): an R interface for multidimensional analyses of texts and questionnaires. Data-processing software for text corpora or for individuals/characters tables. In particular, it supports "ALCESTE"-type analyses.

Downloads: 710 This Week

Last Update: 2020-11-04
See Project
4

Korean Analyzer Rhino

Parsing Korean words by morpheme and part-of-speech

RHINO parses Korean words by morpheme and part-of-speech. Its dictionaries are based on the Korean Modern Tagged Corpus (on the scale of 12 million phrases), which was created by the Korean government, so it can analyze many combinations of stems and endings. The newly developed Dynamic Dictionary Technology lets words react to their context; in other words, it is a programmed database. For more information, see the files in the help folder.

Downloads: 6 This Week

Last Update: 2020-10-11
See Project
5

jieba

Stuttering Chinese word segmentation

"Jieba" Chinese word segmentation aims to be the best Python Chinese word segmentation component. Four word segmentation modes are supported. Precise mode tries to cut the sentence as accurately as possible and is suitable for text analysis. Full mode scans out all the words that can be formed from the sentence; it is very fast, but it cannot resolve ambiguity. Search engine mode builds on precise mode and splits long words again to improve the recall rate, which is suitable...

Downloads: 0 This Week

Last Update: 2022-02-18
See Project
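The difference between the segmentation modes described in the jieba entry above can be illustrated with a small sketch. This is not jieba's code or API: the dictionary `VOCAB`, the sample sentence, and the three function names are hypothetical, and greedy forward maximum matching is used here only as a simple stand-in for a "precise" segmenter.

```python
# Toy dictionary and sentence (assumptions made for illustration only).
VOCAB = {"我", "来到", "北京", "清华", "清华大学", "华大", "大学"}
SENTENCE = "我来到北京清华大学"


def full_mode(sentence, vocab):
    """Return every dictionary word found anywhere in the sentence."""
    words = []
    for start in range(len(sentence)):
        for end in range(start + 1, len(sentence) + 1):
            if sentence[start:end] in vocab:
                words.append(sentence[start:end])
    return words


def precise_mode(sentence, vocab):
    """Greedy forward maximum matching: take the longest word at each position."""
    max_len = max(len(word) for word in vocab)
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            # Fall back to a single character when nothing in the dictionary matches.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def search_mode(sentence, vocab):
    """Start from precise mode, then also emit shorter dictionary words inside long words."""
    words = []
    for word in precise_mode(sentence, vocab):
        if len(word) > 2:
            for start in range(len(word)):
                for end in range(start + 2, len(word) + 1):
                    sub = word[start:end]
                    if sub != word and sub in vocab:
                        words.append(sub)
        words.append(word)
    return words


if __name__ == "__main__":
    print("full:   ", full_mode(SENTENCE, VOCAB))
    print("precise:", precise_mode(SENTENCE, VOCAB))
    print("search: ", search_mode(SENTENCE, VOCAB))
```

In this toy setup, full mode reports overlapping candidates such as "清华", "清华大学", and "大学" without choosing between them, precise mode commits to one non-overlapping cut, and search mode keeps that cut while adding the shorter sub-words for better recall.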
6

Dragonfire

The open-source virtual assistant for Ubuntu-based Linux distributions

Dragonfire is the open-source virtual assistant project for Ubuntu-based Linux distributions. Her main objective is to serve as a command and control interface for the helmet user, so that you will be able to give orders using just your voice commands and your eye movements. That makes the helmet hands-free. We are planning to ship Dragonfire as a preinstalled software package on the DragonOS Linux distribution. DragonOS will be a Linux distribution specially designed for the helmet. It will...

Downloads: 0 This Week

Last Update: 2022-01-13
See Project
7

PyTorch Natural Language Processing

Basic Utilities for PyTorch Natural Language Processing (NLP)

... this example code for training on the Stanford Natural Language Inference (SNLI) Corpus. Now that you've set up your pipeline, you may want to ensure that some functions run deterministically. Wrap any code that's random with fork_rng and you'll be good to go. Now that you've computed your vocabulary, you may want to make use of pre-trained word vectors to set your embeddings.

Downloads: 0 This Week

Last Update: 2022-08-09
See Project
8

SimpleLemmatizer

This program is for text lemmatization

It lemmatizes texts based on a supplied model. The base model is for Slovak texts and was created from the Slovak National Corpus, copyright by the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences.

Downloads: 0 This Week

Last Update: 2020-03-22
See Project
9

Arabic Corpus

Text categorization, Arabic language processing, language modeling

The Arabic Corpus, compiled by Dr. Mourad Abbas ( http://sites.google.com/site/mouradabbas9/corpora ). The Khaleej-2004 corpus contains 5690 documents, divided into 4 topics (categories). The Watan-2004 corpus contains 20291 documents organized into 6 topics (categories). Researchers who use these two corpora should cite the two main references: (1) For the Watan-2004 corpus: M. Abbas, K. Smaili, D. Berkani, (2011) Evaluation of Topic Identification Methods...

Downloads: 12 This Week

Last Update: 2019-03-05
See Project
10

CakeChat

CakeChat: Emotional Generative Dialog System

... bidirectional. By default, CuDNNGRU implementation is used for ~25% acceleration during inference. Thought vector is fed into decoder on each decoding step. Decoder can be conditioned on any categorical label, for example, emotion label or persona id. May be initialized using w2v model trained on your corpus. Embedding layer may be either fixed or fine-tuned along with other weights of the network.

Downloads: 0 This Week

Last Update: 2022-08-12
See Project
11

ace2005-preprocessing

ACE 2005 corpus preprocessing for Event Extraction task

This is simple code for preprocessing the ACE 2005 corpus for the Event Extraction task. Using the existing methods was complicated for me, so I made this project. GitHub: https://github.com/nlpcl-lab/ace2005-preprocessing

Downloads: 0 This Week

Last Update: 2019-11-07
See Project
12

NeuroNER

Named-entity recognition using neural networks

...-platform, open source, freely available, and straightforward to use. Enables the users to create or modify annotations for a new or existing corpus. Train the neural network that performs the NER. During the training, NeuroNER allows monitoring of the network. Evaluate the quality of the predictions made by NeuroNER. The performance metrics can be calculated and plotted by comparing the predicted labels with the gold labels.

Downloads: 0 This Week

Last Update: 2022-08-12
See Project
13

Queries for OSAC (Arabic) Corpus

43 Queries for Arabic Information Retrieval Collection

43 queries on various topics for the Information Retrieval Collection. The collection is built from the OSAC corpus of journalistic texts and consists of 4763 articles retrieved from the Arabic BBC News. https://sourceforge.net/projects/ar-text-mining/files/Arabic-Corpora/

Downloads: 0 This Week

Last Update: 2019-01-07
See Project
14

concordia

Powerful search library, best suited for computer-aided translation

Concordia: Roman goddess of agreement. Concordance searcher: a tool for translators who need their translations to "agree" with one standard. Concordia is a C++ library for fast text lookup in large corpora. It uses a RAM-stored index, which takes up approximately 600MB of memory for a corpus of 2 million sentences. It is based on the idea of a suffix array, enhanced by the presence of other auxiliary data structures. The effects are stunning: Concordia is able to do simple substring...

Downloads: 0 This Week

Last Update: 2019-02-28
See Project
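The suffix-array idea mentioned in the concordia entry above can be sketched in a few lines. Concordia itself is a C++ library with additional auxiliary data structures; the Python below is not its API, and the function names `build_suffix_array` and `find_occurrences` are hypothetical. The construction is deliberately naive (it sorts whole suffixes), so it is only suitable for small inputs.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes, sorted lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return all start positions of `pattern` in `text`, using binary search."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in [first, lo) starts with the pattern.
    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    corpus = "banana"
    sa = build_suffix_array(corpus)
    print(sa)
    print(find_occurrences(corpus, sa, "ana"))
```

Because all suffixes that begin with the pattern sit next to each other in the sorted array, a lookup needs only two binary searches over the index rather than a scan of the whole text.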
15

QJDicExample

QJDicExample is an English <-> Japanese dictionary.

QJDicExample is a Japanese-to-English and English-to-Japanese dictionary featuring search over words, names, kanji, and sentences. QJDicExample uses the JMdict, JMnedict, Kanjidic2, Radkfilex, KanjiVG, and Tanaka Corpus / Tatoeba databases for translations and the zinnia recognition library for handwritten kanji recognition. Latest source code: git clone git://git.code.sf.net/p/qjdicexample/code qjdicexample-code

1 Review

Downloads: 0 This Week

Last Update: 2019-01-19
See Project
16

Tashkeela: Arabic diacritization corpus

Tashkeela: Arabic diacritization Corpus (Vocalized texts)

Tashkeela: Arabic diacritization corpus, a resource of vocalized Arabic texts: نصوص عربية مشكولة. Contains vocalized Arabic text in text format; 75.6 million words. Please cite this resource as: T. Zerrouki, A. Balla, Tashkeela: Novel corpus of Arabic vocalized texts, data for auto-diacritization systems, Data in Brief (2017), http://dx.doi.org/10.1016/j.dib.2017.01.011

1 Review

Downloads: 4 This Week

Last Update: 2018-02-15
See Project
17

Corpus Toolkit

A text management tool for linguistic purposes...

Downloads: 0 This Week

Last Update: 2017-11-23
See Project
18

kcws

Deep Learning Chinese Word Segmentation

Deep learning Chinese word segmentation. Install the bazel build tool and install tensorflow (currently this project requires tf 1.0.0alpha version or above). Switch to the code directory of this project and run ./configure. Compile the background service. Follow the "waiting for words" public account and reply "kcws" to get the corpus download address. Extract the corpus to a directory. Change to the code directory. After installing tensorflow, switch to the kcws code directory...

Downloads: 0 This Week

Last Update: 2022-08-09
See Project
19

Corpus Manager

Yet another corpus manager. Provides HTTP access to annotated text corpora; the client does not need to install any special software to access the server (any browser with JavaScript support will do).

Downloads: 0 This Week

Last Update: 2017-10-05
See Project
20

KhmerText

Open data for a Khmer language corpus and lexicographic data that can be used for the development of free language tools for Khmer language, such as automatic translators, dictionaries, linguistic analysis tools, etc.

4 Reviews

Downloads: 64 This Week

Last Update: 2018-05-17
See Project
21

WikiSQL

A large annotated semantic parsing corpus for developing NL interfaces

A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is the dataset released along with our work Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. Regarding tokenization and Stanza: when WikiSQL was written three years ago, it relied on Stanza, a CoreNLP Python wrapper that has since been deprecated. If you'd still like to use the tokenizer, please use the docker image. We do not anticipate switching...

Downloads: 3 This Week

Last Update: 2022-07-26
See Project
22

Corpus DOGC

Corpus of the Diari Oficial de la Generalitat de Catalunya

Download page for the corpus of the Diari Oficial de la Generalitat de Catalunya.

Downloads: 8 This Week

Last Update: 2017-05-03
See Project
23

Bitextor

**CODE MOVED TO GITHUB: https://github.com/bitextor ** Bitextor is an application created to generate translation memories using multilingual websites as a corpus source. It downloads an entire website and applies a set of heuristics (based mainly on HTML tag structure and text block length) to find bitexts.

Downloads: 0 This Week

Last Update: 2018-04-17
See Project
24

rcqp

R interface to the Corpus Query Protocol

Implements the Corpus Query Protocol as a package for the R statistical environment. It allows querying linguistic corpora and manipulating the data as native R objects. It is based on the CWB software.

Downloads: 0 This Week

Last Update: 2018-03-13
See Project
25

Hypermachiavel

Hypermachiavel is software developed in Java, aimed at end users such as linguists and humanities researchers, offering various manipulations on an aligned corpus of texts.

Downloads: 0 This Week

Last Update: 2018-03-10
See Project