corpus free download - SourceForge

Showing 58 open source projects for "corpus"

1

IMS Open Corpus Workbench

Indexing and query tools for very large text corpora

The IMS Open Corpus Workbench is a collection of tools for managing and querying large text corpora (100 M words and more) with linguistic annotations. Its central component is the flexible and efficient query processor CQP, which can be used interactively in a terminal session, as a backend e.g. from a Perl script, or through the Web-based GUI CQPweb.

Downloads: 53 This Week

Last Update: 2026-05-20
See Project
2

iramuteq

IRAMUTEQ: Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires (R interface for multidimensional analyses of texts and questionnaires). Data-processing software for text corpora or individuals/characters tables. Notably, it can perform "ALCESTE"-type analyses.

Downloads: 637 This Week

Last Update: 2024-11-03
See Project
3

modnlp-plugins

External plugins for modnlp/teccli

This is a general project for modnlp/teccli plugins, with a focus on text visualization.

Downloads: 0 This Week

Last Update: 2023-05-06
See Project
4

Linguistic Analyzer

The Linguistic Analyzer is a tool for corpus analysis and comparison

The Linguistic Analyzer (Almuhalil Alloghawy) is a free tool designed by a team from Al-Imam Muhammad bin Saud Islamic University. It can be used for corpus analysis and comparison in terms of several linguistic characteristics, such as frequency list generation, concordances, collocation extraction, the difference between two words, and keyword identification.

Downloads: 10 This Week

Last Update: 2022-04-16
See Project
5

Web as Corpus

Software, information, data sets and documentation for the Web as Corpus community.

Downloads: 0 This Week

Last Update: 2021-04-29
See Project
6

DWDS/Dialing Concordance

a collection of indexing and search tools for corpus linguists

DWDS/Dialing Concordance (DDC) - a collection of index and search tools for corpus linguists

2 Reviews

Downloads: 1 This Week

Last Update: 2021-06-16
See Project
7

Application Generator for Stemmers

This is an application generator for conflation algorithms in Perl. The system supports generating Perl source code for a stemmer from a rule file, running a stemmer that the system supports, and parsing a corpus file.

Downloads: 0 This Week

Last Update: 2021-06-20
See Project
8

korpus

Corpus Linguistics Software

Software for corpus linguistics, including a Corpus Text Editor, web-based search, etc. This project was created for the Belarusian Corpus but can be used for other languages with some adaptation.

Downloads: 0 This Week

Last Update: 2021-02-02
See Project
9

Korean Analyzer Rhino

Parsing Korean words by morpheme and part-of-speech

RHINO parses Korean words by morpheme and part of speech. Its dictionaries are based on the Korean Modern Tagged Corpus (on the scale of 12 million phrases), which was made by the Korean government, so it analyses many cases of stems and endings. The newly developed Dynamic Dictionary Technology lets words react to their context; that is, the dictionary is a programmed database. For more information, see the files in the help folder.

Downloads: 2 This Week

Last Update: 2020-10-11
See Project
10

KSUCCA Corpus

A 50-million-token corpus of Classical Arabic.

King Saud University Corpus of Classical Arabic (KSUCCA) is a pioneering 50-million-token annotated corpus of Classical Arabic texts from the pre-Islamic era until the fourth Hijri century (equivalent to the period from the seventh until the early eleventh century CE), which is the period of pure Classical Arabic. The main aim of this corpus is to support the study of the distributional lexical semantics of the words of the Quran.

Downloads: 4 This Week

Last Update: 2020-02-19
See Project
11

SimpleLemmatizer

This program is for text lemmatization

It lemmatizes texts based on a supplied model. The base model is for Slovak texts and was created from the Slovak National Corpus, copyright Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences.

Downloads: 0 This Week

Last Update: 2020-03-22
See Project
12

Arabic Corpus

Text categorization, arabic language processing, language modeling

The Arabic Corpus, compiled by Dr. Mourad Abbas ( http://sites.google.com/site/mouradabbas9/corpora ). The Khaleej-2004 corpus contains 5690 documents divided into 4 topics (categories). The Watan-2004 corpus contains 20291 documents organized into 6 topics (categories). Researchers who use these two corpora are asked to mention the two main references: (1) For the Watan-2004 corpus: M.

Downloads: 4 This Week

Last Update: 2019-03-05
See Project
13

concordia

Powerful search library, best suited for computer-aided translation

...Concordance searcher - a tool for translators who need their translations to "agree" with one standard. Concordia is a C++ library for fast text lookup in large corpora. It uses a RAM-stored index, which takes up approximately 600MB of memory for a corpus of 2 million sentences. It is based on the idea of a suffix array, enhanced by other auxiliary data structures. The effects are stunning - Concordia is able to do simple substring lookup at a pace of 5000 queries per second (on a personal PC) - a speed which, according to the project, cannot be achieved by any other search library. Moreover, Concordia can perform its own "concordia search". ...

Downloads: 0 This Week

Last Update: 2019-02-28
See Project
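The concordia description says the index is based on a suffix array. The following is a minimal Python sketch of that general idea only, not Concordia's C++ implementation or API: the function names `build_suffix_array` and `find_occurrences` are hypothetical, the construction is deliberately naive, and none of the auxiliary data structures the project mentions are modelled.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes of `text`, sorted lexicographically.

    Naive construction (sorting full suffixes); fine for a demo, far too slow
    for a corpus of millions of sentences.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, suffix_array: List[int], pattern: str) -> List[int]:
    """Find every position where `pattern` occurs, using two binary searches."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in [start, lo) begin with the pattern.
    return sorted(suffix_array[start:lo])


if __name__ == "__main__":
    corpus = "the cat sat on the mat"
    sa = build_suffix_array(corpus)
    print(find_occurrences(corpus, sa, "the"))  # positions of "the"
    print(find_occurrences(corpus, sa, "at"))   # positions of "at"
```

Because all suffixes that start with the query sit next to each other in the sorted array, each lookup costs a logarithmic number of comparisons once the index is built, which is why such an index suits repeated substring queries over a fixed corpus.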
14

KhmerText

Open data for a Khmer language corpus and lexicographic data that can be used for the development of free language tools for Khmer language, such as automatic translators, dictionaries, linguistic analysis tools, etc.

4 Reviews

Downloads: 163 This Week

Last Update: 2018-05-17
See Project
15

Bitextor

**CODE MOVED TO GITHUB: https://github.com/bitextor ** Bitextor is an application created to generate translation memories using multilingual websites as a corpus source. It downloads an entire website and applies a set of heuristics (based mainly on HTML tag structure and text block length) to find bitexts.

Downloads: 0 This Week

Last Update: 2018-04-17
See Project
16

Corpus Toolkit

A text management tool for linguistic purposes...

Downloads: 0 This Week

Last Update: 2017-11-23
See Project
17

Corpus Manager

Yet another corpus manager. Allows for HTTP access to annotated text corpora; the client does not need to install any special software to access the server (any browser with JavaScript support will do).

Downloads: 0 This Week

Last Update: 2017-10-05
See Project
18

PADIC

A multilingual Parallel Arabic DIalectal Corpus

PADIC (Parallel Arabic DIalectal Corpus) is a multi-dialectal corpus built in the framework of the National Research Project "TORJMAN", led by the Scientific and Technical Research Center for the Development of Arabic Language and funded by the Algerian Ministry of Higher Education and Scientific Research. PADIC is composed of 6 dialects (two Algerian dialects, from the cities of Algiers and Annaba, plus Palestinian, Syrian, Tunisian, and Moroccan) and MSA.

Downloads: 1 This Week

Last Update: 2017-05-26
See Project
19

ReLiS

Tool for conducting systematic literature reviews and studies

...When a researcher wants to address a research problem, he starts by looking at what already exists in the scientific literature (published papers) on the topic. ReLiS is a tool that helps him considerably reduce the effort needed to analyze the corpus of papers, which typically varies between hundreds and thousands depending on the research topic. ReLiS allows the user to follow a systematic process and automate the classification of papers as much as possible during the literature study.

Downloads: 0 This Week

Last Update: 2016-08-15
See Project
20

texrex

Web corpus creation software (moved to GitHub)

This project has moved to GitHub: https://github.com/rsling/texrex https://github.com/rsling/cow

Downloads: 0 This Week

Last Update: 2016-04-20
See Project
21

ICE Nigeria

Nigerian component of the International Corpus of English

This is the Nigerian component of the International Corpus of English, a one million word corpus of written and spoken Nigerian English for linguistic research. It can be used as a stand-alone corpus or in conjunction with other components of the International Corpus of English (such as ICE-GB, ICE-India, etc.) to compare international varieties of English. This is the first release of the complete corpus.

1 Review

Downloads: 7 This Week

Last Update: 2015-11-03
See Project
22

Cross-Language Computational Linguistics

cross-languages resources

The AFEWC corpus is a multilingual collection of comparable text articles in Arabic, French, and English. Each article triple is related to the same topic (aligned at article level). The AFEWC corpus is collected from Wikipedia and is available for free for research purposes only. It is composed of 40K aligned articles, 91.3M English words, 57.8M French words, 22M Arabic words, 2.8M English unique words, 1.9M French unique words, and 1.5M Arabic unique words. ...

Downloads: 0 This Week

Last Update: 2015-09-11
See Project
23

mwetoolkit

THIS PROJECT MIGRATED TO https://gitlab.com/mwetoolkit/mwetoolkit3/

...These include idioms (kick the bucket), noun compounds (cable car), phrasal verbs (take off, give up), etc. Even though it focuses on multiword expressions, the framework is quite complete and can also be useful in any corpus-based study in computational linguistics. The mwetoolkit can be applied to virtually any text collection, language, and MWE type. It is a command-line tool written mostly in Python. Its development started in 2010 as a PhD thesis, but the project remains active (see the SVN logs). Up-to-date documentation and details about the tool can be found on the mwetoolkit website: http://mwetoolkit.sourceforge.net/

1 Review

Downloads: 0 This Week

Last Update: 2019-05-01
See Project
24

Drug Extraction

Drug name extraction

...Using CONLL-Evaluation: processed 32065 tokens with 3656 phrases; found: 3251 phrases; correct: 2786. accuracy: 95.25%; precision: 85.70%; recall: 76.20%; FB1: 80.67. Using GATE Corpus Benchmark: Strict: P: 0.65 R: 0.73 F1: 0.69; Lenient: P: 0.74 R: 0.84 F1: 0.78. For details on how to reproduce the evaluation, see the README. To use the standalone version for tagging, download DrugExtractionStandalone.tar.gz from Files.

Downloads: 0 This Week

Last Update: 2015-06-12
See Project
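The CONLL-Evaluation figures quoted in the Drug Extraction entry can be cross-checked from the phrase counts alone. The sketch below assumes the usual phrase-level definitions (precision = correct / found, recall = correct / gold phrases, FB1 = harmonic mean of the two); the function name `precision_recall_f1` is hypothetical and is not part of the project. The 95.25% accuracy is a token-level figure and cannot be derived from these counts.

```python
from typing import Tuple


def precision_recall_f1(correct: int, found: int, gold: int) -> Tuple[float, float, float]:
    """Phrase-level precision, recall, and F1 from raw counts."""
    precision = correct / found   # share of predicted phrases that are right
    recall = correct / gold       # share of reference phrases that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Counts taken from the entry: 3656 phrases in the data, 3251 found, 2786 correct.
    p, r, f = precision_recall_f1(correct=2786, found=3251, gold=3656)
    print(f"precision: {p:.2%}; recall: {r:.2%}; FB1: {100 * f:.2f}")
    # prints "precision: 85.70%; recall: 76.20%; FB1: 80.67"
```

The recomputed values agree with the reported precision, recall, and FB1.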
25

TF-IDF Measure

TF-IDF.jar is a Java Archive file for measuring the TF-IDF of each document in a document collection (corpus). The jar can be used to (a) get all the terms in the corpus; (b) get the document frequency (DF) and inverse document frequency (IDF) of all the terms in the corpus; (c) get the TF-IDF of each document in the corpus; and (d) get each term with its frequency (number of occurrences), term frequency (TF), and TF-IDF in every document.

Downloads: 0 This Week

Last Update: 2015-12-17
See Project
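To make the quantities in the TF-IDF Measure entry concrete, here is a small Python sketch of one common formulation. It is not the jar's code: the entry does not state which weighting variant the jar uses, so the formulas here (TF = count / document length, IDF = ln(N / DF)) and the names `document_frequency` and `tf_idf` are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def document_frequency(corpus: List[List[str]]) -> Dict[str, int]:
    """DF: the number of documents in which each term appears at least once."""
    df: Counter = Counter()
    for doc in corpus:
        df.update(set(doc))  # count each term once per document
    return dict(df)


def tf_idf(corpus: List[List[str]]) -> List[Dict[str, float]]:
    """Per-document TF-IDF scores under the assumed formulas above."""
    n_docs = len(corpus)
    df = document_frequency(corpus)
    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    scores = []
    for doc in corpus:
        counts = Counter(doc)  # raw frequency of each term in this document
        scores.append(
            {term: (count / len(doc)) * idf[term] for term, count in counts.items()}
        )
    return scores


if __name__ == "__main__":
    docs = [
        "corpus tools for corpus linguists".split(),
        "search tools for translators".split(),
    ]
    for i, doc_scores in enumerate(tf_idf(docs)):
        print(i, {term: round(score, 3) for term, score in doc_scores.items()})
```

Under this variant, a term that appears in every document gets an IDF of zero, so only terms that distinguish one document from another receive a positive score.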
