corpora free download

Showing 77 open source projects for "corpora"

1

gensim

Topic Modelling for Humans

Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora. The target audience is the natural language processing (NLP) and information retrieval (IR) community.

Downloads: 0 This Week

Last Update: 2024-07-31
See Project
2

IMS Open Corpus Workbench

Indexing and query tools for very large text corpora

The IMS Open Corpus Workbench is a collection of tools for managing and querying large text corpora (100 M words and more) with linguistic annotations. Its central component is the flexible and efficient query processor CQP, which can be used interactively in a terminal session, as a backend e.g. from a Perl script, or through the Web-based GUI CQPweb.

Downloads: 52 This Week

Last Update: 3 days ago
See Project
3

TXM

Unicode-XML-TEI text/corpus analysis platform

TXM is a free, open-source, cross-platform text/corpus analysis environment and graphical client based on Unicode and XML, supporting Windows, Linux and Mac OS X. It can also be used online as a J2EE standard-compliant web portal (GWT-based) with built-in access control. Download the latest version of TXM: http://textometrie.ens-lyon.fr/spip.php?rubrique61&lang=en TXM offers a comprehensive range of analysis tools (concordances, collocate search, frequency lists, etc.) based on the powerful...

Downloads: 15 This Week

Last Update: 2023-10-02
See Project
4

Tokenized Text Aligner

Aligns tokens in two versions of a text with differing tokenization.

This tool performs token-by-token alignment of two versions of a text with differing tokenization by interpreting the results of a file diff (https://docs.python.org/3/library/difflib.html). It is intended for use in the preparation of annotated linguistic corpora, where differences in tokenization may arise (i) following corrections or modifications to the source text or (ii) through the creation of different layers of annotation (part-of-speech, treebank) requiring different tokenization...

Downloads: 0 This Week

Last Update: 2024-07-31
See Project
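
The diff-based alignment described for Tokenized Text Aligner can be illustrated with Python's standard `difflib` module. This is only a minimal sketch of the general idea, not the project's own code: the function name `align_tokens` and the sample token lists are hypothetical. Identical tokens are paired one-to-one, and any stretch where the two tokenizations differ is grouped into a single many-to-many pair.

```python
from difflib import SequenceMatcher


def align_tokens(tokens_a, tokens_b):
    """Align two tokenizations of the same text using diff opcodes.

    Returns a list of (group_a, group_b) pairs, where each group is a
    list of tokens. Hypothetical helper, for illustration only.
    """
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    aligned = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical runs are aligned token by token.
            for tok_a, tok_b in zip(tokens_a[i1:i2], tokens_b[j1:j2]):
                aligned.append(([tok_a], [tok_b]))
        else:
            # replace / insert / delete: keep the differing spans together.
            aligned.append((tokens_a[i1:i2], tokens_b[j1:j2]))
    return aligned


# Two tokenizations of "I don't know."
version_a = ["I", "do", "n't", "know", "."]
version_b = ["I", "don't", "know", "."]

for group_a, group_b in align_tokens(version_a, version_b):
    print(group_a, "<->", group_b)
```

In this example the span `["do", "n't"]` ends up paired with `["don't"]`, while the unchanged tokens align one-to-one.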
5

JoBimText

Linking Language to Knowledge with Distributional Semantics

JoBimText is a software solution for automatic text expansion using contextualized distributional similarity. It provides text analysis tools for large corpora and can create distributional semantic models (JoBimText models) and multi-word expressions.

Downloads: 0 This Week

Last Update: 2022-08-04
See Project
6

Linguistic Analyzer

The Linguistic Analyzer is a tool for corpus analysis and comparison

The Linguistic Analyzer (Almuhalil Alloghawy) is a free tool designed by a team from Al-Imam Muhammad bin Saud Islamic University. It can be used for corpus analysis and comparison in terms of several linguistic characteristics, such as frequency list generation, concordances, collocation extraction, the difference between two words, and keyword identification.

Downloads: 1 This Week

Last Update: 2022-04-16
See Project
7

Queries-for-Arabic-OSAC-Corpus

43 queries on various topics for the Information Retrieval Collection. The corpus is created from the OSAC corpus of journalistic texts, consisting of 4763 articles retrieved from Arabic BBC News. https://sourceforge.net/projects/ar-text-mining/files/Arabic-Corpora/

Downloads: 0 This Week

Last Update: 2021-12-03
See Project
8

PyTorch SimCLR

PyTorch implementation of SimCLR: A Simple Framework

We have known about the benefits of transfer learning in Computer Vision (CV) applications for quite some time. Nowadays, pre-trained Deep Convolutional Neural Networks (DCNNs) are the first go-to solutions for learning a new task. These large models are trained on huge supervised corpora, like ImageNet. Most importantly, their features are known to adapt well to new problems. This is particularly interesting when annotated training data is scarce. In situations like this, we take the models...

Downloads: 0 This Week

Last Update: 2022-08-15
See Project
9

NLP Best Practices

Natural Language Processing Best Practices & Examples

In recent years, natural language processing (NLP) has seen quick growth in quality and usability, and this has helped to drive business adoption of artificial intelligence (AI) solutions. In the last few years, researchers have been applying newer deep learning methods to NLP. Data scientists started moving from traditional methods to state-of-the-art (SOTA) deep neural network (DNN) algorithms which use language models pretrained on large text corpora. This repository contains examples...

Downloads: 0 This Week

Last Update: 2022-08-01
See Project
10

POWLA

OWL/RDF representation for linguistic corpora

POWLA is a formalism for representing linguistic corpora in RDF. POWLA is an OWL/DL formalization of an abstract data model, PAULA (http://www.sfb632.uni-potsdam.de/d1/paula/doc), that has been developed to represent (a) any type of linguistic annotation applicable to textual data, and (b) any combination of annotation layers. For a detailed motivation of POWLA and its application to facilitate interoperability of annotated corpora, see Christian Chiarcos (to appear 2012...

Downloads: 0 This Week

Last Update: 2020-06-08
See Project
11

Arabic Word diversity

Word frequency and diversity (distribution) across hundreds of corpora. You'll see both the lemma and the various forms.

Downloads: 0 This Week

Last Update: 2020-05-15
See Project
12

Arabic Rare Words Project

Text Analysis Egyptian Schoolbooks

The purpose is to compare the most common words in the language with the words used in textbooks for students in Egyptian schools. The frequency can help scholars and teachers better teach reading.

Downloads: 0 This Week

Last Update: 2021-04-14
See Project
13

OLiA

OWL/DL ontologies for linguistic annotations

.../) and ISOcat (http://www.isocat.org). The OLiA ontologies were originally developed as part of an infrastructure for the sustainable maintenance of linguistic resources (http://www.sfb441.uni-tuebingen.de/c2/index-engl.html). Their fields of application include the formalization of annotation schemes, concept-based querying over heterogeneously annotated corpora, and the development of interoperable NLP pipelines.

Downloads: 1 This Week

Last Update: 2019-11-11
See Project
14

@Note2

@Note2 - A workbench for Biomedical Text Mining

Biomedical Text Mining (BioTM) provides valuable approaches to the automated curation of scientific literature.

1 Review

Downloads: 0 This Week

Last Update: 2019-05-13
See Project
15

Arabic Corpus

Text categorization, Arabic language processing, language modeling

The Arabic Corpus, compiled by Dr. Mourad Abbas (http://sites.google.com/site/mouradabbas9/corpora). The Khaleej-2004 corpus contains 5690 documents divided into 4 topics (categories). The Watan-2004 corpus contains 20291 documents organized in 6 topics (categories). Researchers who use these two corpora should cite the two main references: (1) For the Watan-2004 corpus: M. Abbas, K. Smaili, D. Berkani, (2011) Evaluation of Topic Identification Methods...

Downloads: 8 This Week

Last Update: 2019-03-05
See Project
16

concordia

Powerful search library, best suited for computer-aided translation

Concordia - Roman goddess of agreement. Concordance searcher - tool for translators who need their translations to "agree" with one standard. Concordia is a C++ library for fast text lookup in large corpora. It uses a RAM stored index, which takes up approximately 600MB of memory for a corpus of 2 million sentences. It is based on the idea of a suffix array, enhanced by the presence of other auxiliary data structures. The effects are stunning - Concordia is able to do simple substring...

Downloads: 0 This Week

Last Update: 2019-02-28
See Project
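
The suffix-array idea that Concordia's description mentions can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not Concordia's implementation (which is a C++ library with additional auxiliary data structures): the function names are hypothetical, and the index is built naively by sorting all suffixes, which would not scale to a corpus of millions of sentences. It shows why lookup is fast once the index exists: every occurrence of a pattern sits in one contiguous range of the sorted suffixes, so two binary searches locate all matches.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order.

    Naive construction for illustration; real systems use faster algorithms.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of pattern in text."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


corpus = "banana"
index = build_suffix_array(corpus)
print(find_occurrences(corpus, index, "ana"))
```

Each query costs a logarithmic number of string comparisons in the length of the text, independent of how many sentences precede the match.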
17

Queries for OSAC (Arabic) Corpus

43 Queries for Arabic Information Retrieval Collection

43 queries on various topics for the Information Retrieval Collection. The corpus is created from the OSAC corpus of journalistic texts, consisting of 4763 articles retrieved from Arabic BBC News. https://sourceforge.net/projects/ar-text-mining/files/Arabic-Corpora/

Downloads: 0 This Week

Last Update: 2019-01-07
See Project
18

Ghawwas_V4

An open source system for Arabic corpora processing

Ghawwas (previously known as Khawas) is an open source system for Arabic corpora processing. Ghawwas V4.0 provides the following main functions:
a. Frequency list for single word and N-Grams
b. Concordance
c. Collocation (MI, CHI Squared, LL, T-Score, Z Score, Dice, Log Dice, Weirdness Coefficient)
d. Lexical patterns search
e. Two corpora frequency profile comparison based on MI, CHI, LL, T-Score, Z Score, Dice, Log Dice, Weirdness Coefficient
f. Accept Windows and UTF-8 character...

1 Review

Downloads: 2 This Week

Last Update: 2018-12-09
See Project
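
Three of the collocation measures listed for Ghawwas (MI, Dice, Log Dice) can be computed directly from corpus counts. The sketch below uses the commonly cited textbook definitions and is not taken from Ghawwas itself, whose exact formulas and smoothing are not given here; the function names and the sample counts are hypothetical.

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise mutual information: log2(N * f(x,y) / (f(x) * f(y)))."""
    return math.log2(n * f_xy / (f_x * f_y))


def dice(f_xy, f_x, f_y):
    """Dice coefficient: 2 * f(x,y) / (f(x) + f(y))."""
    return 2 * f_xy / (f_x + f_y)


def log_dice(f_xy, f_x, f_y):
    """Log Dice: 14 + log2(Dice), so the theoretical maximum is 14."""
    return 14 + math.log2(dice(f_xy, f_x, f_y))


# Hypothetical counts: the pair occurs 8 times, word x 16 times,
# word y 32 times, in a corpus of 1024 tokens.
print(mutual_information(8, 16, 32, 1024))
print(dice(8, 16, 32))
print(log_dice(8, 16, 32))
```

MI depends on the corpus size `n`, whereas Dice and Log Dice use only the three frequencies, which makes their scores easier to compare across corpora of different sizes.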
19

HipparchiaServer

Front end to Hipparchia corpora: searching, browsing, concordances, texts, dictionaries, parsing

Downloads: 0 This Week

Last Update: 2018-06-15
See Project
20

rcqp

R interface to the Corpus Query Protocol

Implements the Corpus Query Protocol as a package for the R statistical environment. It allows users to query linguistic corpora and manipulate the data as native R objects. It is based on the CWB software.

Downloads: 0 This Week

Last Update: 2018-03-13
See Project
21

BioNLP-Corpora

BioNLP-Corpora is a repository of biomedically and linguistically annotated corpora and biomedical data sources. There are many resources available in separate packages in this project.

Downloads: 1 This Week

Last Update: 2016-11-22
See Project
22

Arabic Stemming Corpora

The corpus contains 81,000 tagged words of Arabic text drawn from Contemporary Arabic (CCA) [1] and Arabic Wikipedia [2], with the basic tags (verb, noun, adjective). [1] http://www.comp.leeds.ac.uk/eric/latifa/research.htm. [2] http://ar.wikipedia.org.

Downloads: 0 This Week

Last Update: 2016-12-04
See Project
23

Scattertext 0.2.1

Beautiful visualizations of how language differs among document types

A tool for finding distinguishing terms in corpora and displaying them in an interactive HTML scatter plot. Points corresponding to terms are selectively labeled so that they don't overlap with other labels or points.

Downloads: 0 This Week

Last Update: 2024-08-09
See Project
24

Corpus Manager

Yet another corpus manager. Allows HTTP access to annotated text corpora; the client does not need to install any special software to access the server (any browser with JavaScript support will do).

Downloads: 0 This Week

Last Update: 2017-10-05
See Project
25

Arabic business corpora

Arabic business and management corpus

This corpus is made up of 3 sub-corpora: 1) Management Corpus: 400 articles by chairmen and CEOs of Arabic companies in the Middle East. 2) Economics News: 400 news articles from different Arabic online newspapers. 3) Stock Market News: 400 articles collected from investing.com. The main corpus contains 1200 articles. The articles have been tagged using the Stanford Arabic Part of Speech Tagger. Both plain-text and tagged corpora are available to download, check the Files...

Downloads: 6 This Week

Last Update: 2016-11-01
See Project