Showing 171 open source projects for "corpus"

  • 1
    The English-Vietnamese Bilingual Corpus (EVBCorpus) is a collection of English and Vietnamese parallel translations and bitexts.
    Downloads: 0 This Week
  • 2

    PADIC

    A multilingual Parallel Arabic DIalectal Corpus

    PADIC (Parallel Arabic DIalectal Corpus) is a multi-dialectal corpus built in the framework of the National Research Project "TORJMAN", led by the Scientific and Technical Research Center for the Development of Arabic Language and funded by the Algerian Ministry of Higher Education and Scientific Research. PADIC covers six dialects (two Algerian dialects, from the cities of Algiers and Annaba, plus Palestinian, Syrian, Tunisian, and Moroccan) and MSA. Mourad Abbas Computational Linguistics Department, crstdla...
    Downloads: 2 This Week
  • 3

    Reviz-it

    Software tools to re-tell stories in a better way and expand them

    ... ones are inspiring. - Use the inspiring word clouds to rephrase the story in an original way, then expand it. Enrich it with text mining algorithms that automatically retrieve the different ways the same thing is said in a given context (for example, a series of publications on the same topic or from the same organization): latent semantic analysis, topic modeling, rule-based text mining, etc. This allows rewriting a text in the specific 'style' of a corpus.
    Downloads: 0 This Week
  • 4

    Indonesian Learner Corpus

    The Indonesian learner corpus contains around 5K sentence pairs written by second-language learners of Indonesian. Each pair consists of a learner sentence and a native-corrected sentence, automatically annotated with the error type and error position. The native-corrected sentences have had spelling errors filtered out and were manually checked by a native speaker.
    Downloads: 0 This Week
  • 5

    Classical Arabic Corpus

    A corpus containing more than 1 million distinct Arabic words.

    This project was developed as part of a master's thesis named "Edit Distance Adapted to Natural Language Words". The project consists of three parts. First, the corpus, which gathers more than one million distinct Arabic words. Second, the text files of the Arabic resources. Third, the index file, which presents some information about these resources. Additional details about these parts are available in the README file.
    Downloads: 0 This Week
  • 6

    Arabic business corpora

    Arabic business and management corpus

    This corpus is made up of three sub-corpora: 1) Management Corpus: 400 articles by chairmen and CEOs of Arabic companies in the Middle East. 2) Economics News: 400 news articles from different Arabic online newspapers. 3) Stock Market News: 400 articles collected from investing.com. The full corpus contains 1,200 articles. The articles have been tagged using the Stanford Arabic part-of-speech tagger. Both plain-text and tagged corpora are available to download; check the Files section...
    Downloads: 2 This Week
  • 7

    ReLiS

    Tool for conducting systematic literature reviews and studies

    ReLiS stands for "Revue Littéraire Systématique", which is French for "Systematic Literature Review". A researcher who wants to address a research problem starts by looking at what already exists in the scientific literature (published papers) on the topic. ReLiS is a tool that considerably reduces the effort of analyzing the corpus of papers, which typically ranges from hundreds to thousands depending on the research topic. ReLiS allows the user to follow a systematic process...
    Downloads: 0 This Week
  • 8

    GloVe

    GloVe model for distributed word representation

    GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. The links provided contain word vectors obtained from the respective corpora. If you want word vectors trained on massive web datasets, you need only download one of these text files! Pre-trained word vectors...
    Downloads: 0 This Week
  • 9

    texrex

    Web corpus creation software (moved to GitHub)

    This project has moved to GitHub: https://github.com/rsling/texrex https://github.com/rsling/cow
    Downloads: 0 This Week
  • 10
    The Alpheios project is developing tools to facilitate self-directed, corpus-based language learning.
    Downloads: 0 This Week
  • 11

    Projet sumtec

    Cleaning and preparation of interview transcription corpora

    Scripts written as part of the SUMTEC project to prepare transcription corpora for use with RQDA and IRAMUTEQ. http://www.msh-lorraine.fr/index.php?id=623 The project contains 3 Perl programs. The goal is to take unstructured interview transcripts and structure them as an XML tree. The benefit is being able, ultimately, to identify speaking turns and separate the speech of interviewees from that of interviewers.
    Downloads: 0 This Week
  • 12

    ICE Nigeria

    Nigerian component of the International Corpus of English

    This is the Nigerian component of the International Corpus of English, a one million word corpus of written and spoken Nigerian English for linguistic research. It can be used as a stand-alone corpus or in conjunction with other components of the International Corpus of English (such as ICE-GB, ICE-India, etc.) to compare international varieties of English. This is the first release of the complete corpus. The corpus can be downloaded in several parts. The written part can be downloaded as text...
    Downloads: 2 This Week
  • 13

    Osman Arabic Text Readability

    Open Source tool for Arabic text readability

    We present OSMAN (Open Source Metric for Measuring Arabic Narratives), a novel open source Arabic readability metric and tool. The open source Java tool allows users to calculate readability for Arabic text (with and without diacritics). The tool provides methods to split the text into words and sentences and to count syllables, Faseeh letters, and hard and complex words, in addition to adding diacritics (vocalising the text). This makes the tool useful for researchers and educators working with Arabic text....
    Downloads: 1 This Week
  • 14

    Epwing2Anki

    Used to automate creation of Japanese Anki vocabulary cards.

    Epwing2Anki may be used to automatically or semi-automatically create Japanese Anki vocabulary cards based on a provided list of words and one or more of your favorite EPWING dictionaries and/or the included EDICT J-E dictionary and Tatoeba example sentence corpus.
    Downloads: 11 This Week
  • 15
    AFEWC is a multilingual corpus of comparable articles in Arabic, French, and English. Each triple of articles covers the same topic (aligned at the article level). The corpus is collected from Wikipedia and is available for free for research purposes only. It is composed of 40K aligned articles, 91.3M English words, 57.8M French words, 22M Arabic words, 2.8M English unique words, 1.9M French unique words, and 1.5M Arabic unique words. Wikipedia text is available...
    Downloads: 0 This Week
  • 16

    mwetoolkit

    THIS PROJECT MIGRATED TO https://gitlab.com/mwetoolkit/mwetoolkit3/

    The Multiword Expressions toolkit aids in the automatic identification and extraction of multiword units in running text. These include idioms (kick the bucket), noun compounds (cable car), phrasal verbs (take off, give up), etc. Even though it focuses on multiword expressions, the framework is quite complete and can also be useful in any corpus-based study in computational linguistics. The mwetoolkit can be applied...
    Downloads: 0 This Week
  • 17

    Drug Extraction

    Drug name extraction

    ... indicates the presence of the drug name in the DrugBank. Using CONLL-Evaluation: processed 32065 tokens with 3656 phrases; found: 3251 phrases; correct: 2786; accuracy: 95.25%; precision: 85.70%; recall: 76.20%; FB1: 80.67. Using GATE Corpus Benchmark: Strict: P: 0.65, R: 0.73, F1: 0.69; Lenient: P: 0.74, R: 0.84, F1: 0.78. For details on how to reproduce the evaluation, see the README. To use the standalone version for tagging, download DrugExtractionStandalone.tar.gz from Files.
    Downloads: 0 This Week
  • 18

    CSLU_KALDI

    Speech recognition using Kaldi

    Adapting Kaldi speech recognition to a new corpus.
    Downloads: 0 This Week
  • 19

    optimize_topics.sh

    Run multiple MALLET runs and report on search term prevalence.

    Run multiple MALLET runs over a pre-existing corpus and report on search term prevalence in each run.
    Downloads: 0 This Week
  • 21

    Natural Language Analysis with Ngrams

    NLP tool for statistical analysis of words, sentences, documents

    ... will JAR-it once I decide that it can be called a final release. This project was made by creating a corpus from the Google Ngrams data for the English language, version 20120701. The EOWL list of English words was used to filter the words from the Ngrams data. For each word and each year, the data was summed to compute the average number of appearances of the word per document in that year. Before using this program, you MUST download the corpus.
    Downloads: 0 This Week
  • 22

    KNIC Concordances

    Syntactic concordances from TIGERSearch query results

    KNIC concordances permit users of the treebank search software TIGERSearch (http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/tigersearch.en.html) to create a concordance-style table of their query results from an exported TIGER-XML result file. This software was originally designed for the Syntactic Reference Corpus of Medieval French (http://www.srcmf.org). It was developed in collaboration with the developers of the open-source TXM platform (http://textometrie.ens-lyon.fr...
    Downloads: 0 This Week
  • 23

    Persica-A new Persian corpus for NLP

    This project presents a new corpus for NEWS text analysis in Persian

    The lack of multi-application text corpora, despite the surge in text data, is a serious bottleneck in text mining and natural language processing, especially for the Persian language. This project presents a new corpus for news article analysis in Persian called Persica. News analysis includes news classification, topic discovery and classification, category classification, and many more procedures. Dealing with news has special requirements, first of all a valid and reliable corpus to perform...
    Downloads: 14 This Week
  • 24

    DisMo

    A POS, disfluency and multi-word unit annotator for spoken language

    ... are using DisMo to annotate your corpus, please cite the following paper: Christodoulides, George; Avanzi, Mathieu; Goldman, Jean-Philippe. DisMo: A Morphosyntactic, Disfluency and Multi-Word Unit Annotator. An Evaluation on a Corpus of French Spontaneous and Read Speech. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC) 2014, Reykjavik, Iceland, 26-31 May 2014, pp. 3902-3907.
    Downloads: 0 This Week
  • 25
    TextTools
    TextTools is a freeware corpus linguistics tool developed in Python to aid in research. This program analyzes user-created corpora and displays information about word (token) frequency, n-grams, clusters, collocations, keyword in context (KWIC), and keyness. TextTools is designed to be user-friendly and intuitive and will run natively on Mac OS X.
    Downloads: 0 This Week