Showing 170 open source projects for "corpus"

  • 1
    This corpus contains 10 essays comprising 752 sentences (4,160 words in total). The essays were selected from different collections of partially or fully diacritized Arabic texts, all of which are available in the Tashkeela corpus. Texts in this corpus have been used to evaluate the AGD checker. There are two types of texts in this corpus: 1- Texts without errors, used to evaluate AGD in terms of detecting and correcting errors that are not known before the checking process 2...
    Downloads: 0 This Week
  • 2
    GPT2 for Multiple Languages

    GPT2 for Multiple Languages, including pretrained models

    With just 2 clicks (not including the Colab auth process), the 1.5B pretrained Chinese model demo is ready to go. The contents of this repository are for academic research purposes, and we do not provide any conclusive remarks. Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). Simplified GPT2 training scripts (based on Grover, supporting TPUs). Ported BERT tokenizer, compatible with multilingual corpora. 1.5B GPT2 pretrained Chinese model (~15G corpus, 100k steps). Batteries...
    Downloads: 0 This Week
  • 3
    iramuteq
    IRAMUTEQ: Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires (R interface for multidimensional analyses of texts and questionnaires). Data-processing software for text corpora or for individuals/characters tables. In particular, it supports "ALCESTE"-type analyses.
    Downloads: 710 This Week
  • 4
    Korean Analyzer Rhino

    Parsing Korean words by morpheme and part-of-speech

    RHINO parses Korean words by morpheme and part of speech. Its dictionaries are based on the Korean Modern Tagged Corpus (on the scale of 12 million phrases), which was created by the Korean government, so it covers many cases of stems and endings. The newly developed Dynamic Dictionary Technology lets words react to their context; that is, the dictionary acts as a programmed database. For more information, see the files in the help folder.
    Downloads: 6 This Week
  • 5
    jieba

    Stuttering Chinese word segmentation

    "Jieba" Chinese word segmentation aims to be the best Python Chinese word segmentation component. Four word segmentation modes are supported. Precise mode tries to cut the sentence most precisely and is suitable for text analysis. Full mode scans out all the words that can be formed in the sentence; it is very fast but cannot resolve ambiguity. Search engine mode builds on precise mode by splitting long words again to improve recall, which is suitable...
    Downloads: 0 This Week
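The difference between the three modes described in the jieba entry can be illustrated with a small, self-contained sketch. This is not jieba's code or API: the dictionary `TOY_DICT` and the functions `full_mode`, `precise_mode`, and `search_mode` are hypothetical names invented for this illustration, and precise mode is approximated here by simple forward maximum matching rather than whatever algorithm the library actually uses.

```python
# Hypothetical toy dictionary; a real segmenter ships a large word list.
TOY_DICT = {"中国", "科学", "学院", "科学院", "中国科学院"}


def full_mode(sentence, words=TOY_DICT):
    """Return every dictionary word found at any position (overlaps kept)."""
    found = []
    for start in range(len(sentence)):
        for end in range(start + 1, len(sentence) + 1):
            if sentence[start:end] in words:
                found.append(sentence[start:end])
    return found


def precise_mode(sentence, words=TOY_DICT):
    """Return one non-overlapping segmentation.

    Assumption: forward maximum matching stands in for a real
    disambiguation step. Unknown characters become single-character tokens.
    """
    result, start = [], 0
    while start < len(sentence):
        for end in range(len(sentence), start, -1):
            if sentence[start:end] in words or end == start + 1:
                result.append(sentence[start:end])
                start = end
                break
    return result


def search_mode(sentence, words=TOY_DICT):
    """Split long words from precise mode again to improve recall."""
    result = []
    for word in precise_mode(sentence, words):
        for n in (2, 3):  # emit shorter dictionary words inside long words
            if len(word) > n:
                for i in range(len(word) - n + 1):
                    if word[i:i + n] in words:
                        result.append(word[i:i + n])
        result.append(word)
    return result


if __name__ == "__main__":
    text = "中国科学院"
    print("full:   ", full_mode(text))
    print("precise:", precise_mode(text))
    print("search: ", search_mode(text))
```

With this toy dictionary, precise mode keeps "中国科学院" as a single token, while full mode and search mode also surface the shorter words contained in it, which is the recall trade-off the description refers to.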
  • 6
    Dragonfire

    The open-source virtual assistant for Ubuntu-based Linux distributions

    Dragonfire is the open-source virtual assistant project for Ubuntu-based Linux distributions. Her main objective is to serve as a command and control interface for the helmet user, so that you will be able to give orders using just your voice commands and your eye movements. That makes the helmet hands-free. We are planning to ship Dragonfire as a preinstalled software package on the DragonOS Linux distribution. DragonOS will be a Linux distribution specially designed for the helmet. It will...
    Downloads: 0 This Week
  • 7
    PyTorch Natural Language Processing

    Basic Utilities for PyTorch Natural Language Processing (NLP)

    ... this example code for training on the Stanford Natural Language Inference (SNLI) Corpus. Now that you've set up your pipeline, you may want to ensure that some functions run deterministically. Wrap any code that's random with fork_rng and you'll be good to go. Now that you've computed your vocabulary, you may want to make use of pre-trained word vectors to set your embeddings.
    Downloads: 0 This Week
  • 8

    SimpleLemmatizer

    This program is for text lemmatization

    It lemmatizes texts based on a supplied model. The base model is for Slovak texts and was created from the Slovak National Corpus, copyright by the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences.
    Downloads: 0 This Week
  • 9

    Arabic Corpus

    Text categorization, Arabic language processing, language modeling

    The Arabic Corpus, compiled by Dr. Mourad Abbas ( http://sites.google.com/site/mouradabbas9/corpora ). The Khaleej-2004 corpus contains 5690 documents divided into 4 topics (categories). The Watan-2004 corpus contains 20291 documents organized in 6 topics (categories). Researchers who use these two corpora should cite the two main references: (1) For the Watan-2004 corpus: M. Abbas, K. Smaili, D. Berkani, (2011) Evaluation of Topic Identification Methods...
    Downloads: 12 This Week
  • 10
    CakeChat

    CakeChat: Emotional Generative Dialog System

    ... bidirectional. By default, the CuDNNGRU implementation is used for ~25% acceleration during inference. The thought vector is fed into the decoder at each decoding step. The decoder can be conditioned on any categorical label, for example an emotion label or persona ID. May be initialized using a w2v model trained on your corpus. The embedding layer may be either fixed or fine-tuned along with the other weights of the network.
    Downloads: 0 This Week
  • 11
    ace2005-preprocessing

    ACE 2005 corpus preprocessing for the Event Extraction task

    This is simple code for preprocessing the ACE 2005 corpus for the Event Extraction task. Using the existing methods was complicated for me, so I made this project. GitHub: https://github.com/nlpcl-lab/ace2005-preprocessing
    Downloads: 0 This Week
  • 12
    NeuroNER

    Named-entity recognition using neural networks

    ...-platform, open source, freely available, and straightforward to use. Enables the users to create or modify annotations for a new or existing corpus. Train the neural network that performs the NER. During the training, NeuroNER allows monitoring of the network. Evaluate the quality of the predictions made by NeuroNER. The performance metrics can be calculated and plotted by comparing the predicted labels with the gold labels.
    Downloads: 0 This Week
  • 13

    Queries for OSAC (Arabic) Corpus

    43 Queries for Arabic Information Retrieval Collection

    43 queries on various topics for the Information Retrieval Collection. The collection is built from the OSAC corpus of journalistic texts and consists of 4763 articles collected from the Arabic BBC News. https://sourceforge.net/projects/ar-text-mining/files/Arabic-Corpora/
    Downloads: 0 This Week
  • 14
    concordia

    Powerful search library, best suited for computer-aided translation

    Concordia - Roman goddess of agreement. A concordance searcher is a tool for translators who need their translations to "agree" with one standard. Concordia is a C++ library for fast text lookup in large corpora. It uses a RAM-stored index, which takes up approximately 600MB of memory for a corpus of 2 million sentences. It is based on the idea of a suffix array, enhanced with other auxiliary data structures. The effects are stunning - Concordia is able to do simple substring...
    Downloads: 0 This Week
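The suffix-array idea mentioned in the Concordia entry can be sketched in a few lines. This is only a minimal illustration of the general technique in Python, not Concordia's C++ implementation or API; the function names `build_suffix_array` and `find_all` are hypothetical, and the naive construction shown here would not scale to a corpus of millions of sentences.

```python
def build_suffix_array(text):
    """Return the start offsets of all suffixes of text in sorted order.

    Naive construction for illustration only; production libraries use
    far more efficient algorithms and auxiliary structures.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def _bound(text, suffix_array, pattern, strict):
    """Binary search over suffixes compared on their first len(pattern) chars."""
    lo, hi = 0, len(suffix_array)
    m = len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[suffix_array[mid]:suffix_array[mid] + m]
        if prefix < pattern or (not strict and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def find_all(text, suffix_array, pattern):
    """Return the sorted start offsets of every occurrence of pattern."""
    if not pattern:
        return []
    first = _bound(text, suffix_array, pattern, strict=True)
    last = _bound(text, suffix_array, pattern, strict=False)
    return sorted(suffix_array[first:last])


if __name__ == "__main__":
    corpus = "the cat sat on the mat"
    index = build_suffix_array(corpus)
    print(find_all(corpus, index, "the"))
    print(find_all(corpus, index, "at"))
```

Because the suffixes are sorted, all occurrences of a pattern occupy one contiguous range of the index, so each lookup needs only two binary searches instead of a scan of the whole text.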
  • 15
    QJDicExample

    QJDicExample is an English <-> Japanese dictionary.

    QJDicExample is a Japanese-to-English and English-to-Japanese dictionary featuring search over words, names, kanji, and sentences. QJDicExample uses the JMdict, JMnedict, Kanjidic2, Radkfilex, KanjiVG, and Tanaka Corpus / Tatoeba databases for translations and the zinnia recognition library for handwritten kanji recognition. Latest source code: git clone git://git.code.sf.net/p/qjdicexample/code qjdicexample-code
    Downloads: 0 This Week
  • 16
    Tashkeela: Arabic diacritization corpus

    Tashkeela: Arabic diacritization corpus (vocalized texts)

    Tashkeela: Arabic diacritization corpus, a resource of Arabic vocalized texts: نصوص عربية مشكولة. Contains vocalized Arabic text in text format; 75.6 million words. Please cite this resource as: T. Zerrouki, A. Balla, Tashkeela: Novel corpus of Arabic vocalized texts, data for auto-diacritization systems, Data in Brief (2017), http://dx.doi.org/10.1016/j.dib.2017.01.011
    Downloads: 4 This Week
  • 17
    Corpus Toolkit

    A text management tool for linguistic purposes...

    Downloads: 0 This Week
  • 18
    kcws

    Deep learning Chinese word segmentation

    Deep learning Chinese word segmentation. Install the Bazel build tool and install TensorFlow (currently this project requires tf 1.0.0alpha or above). Switch to the code directory of this project and run ./configure. Compile the background service. Follow the public account "waiting for words" and reply "kcws" to get the corpus download address. Extract the corpus to a directory. Change to the code directory. After installing TensorFlow, switch to the kcws code directory...
    Downloads: 0 This Week
  • 19
    Yet another corpus manager. Allows HTTP access to annotated text corpora; the client does not need to install any special software to access the server (any browser with JavaScript support will do).
    Downloads: 0 This Week
  • 20
    Open data for a Khmer language corpus and lexicographic data that can be used to develop free language tools for the Khmer language, such as automatic translators, dictionaries, linguistic analysis tools, etc.
    Downloads: 64 This Week
  • 21
    WikiSQL

    A large annotated semantic parsing corpus for developing NL interfaces

    A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is the dataset released along with our work Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. Regarding tokenization and Stanza: when WikiSQL was written three years ago, it relied on Stanza, a CoreNLP Python wrapper that has since been deprecated. If you'd still like to use the tokenizer, please use the Docker image. We do not anticipate switching...
    Downloads: 3 This Week
  • 22

    Corpus DOGC

    Corpus of the Diari Oficial de la Generalitat de Catalunya (Official Journal of the Government of Catalonia)

    Download page for the corpus of the Diari Oficial de la Generalitat de Catalunya.
    Downloads: 8 This Week
  • 23
    **CODE MOVED TO GITHUB: https://github.com/bitextor ** Bitextor is an application created to generate translation memories using multilingual websites as a corpus source. It downloads an entire website and applies a set of heuristics (based mainly on HTML tag structure and text block length) to find bitexts.
    Downloads: 0 This Week
  • 24

    rcqp

    R interface to the Corpus Query Protocol

    Implements the Corpus Query Protocol as a package for the R statistical environment. It allows querying linguistic corpora and manipulating the data as native R objects. It is based on the CWB software.
    Downloads: 0 This Week
  • 25
    Hypermachiavel is software developed in Java, aimed at end users such as linguists and humanities researchers, that offers various manipulations on an aligned corpus of texts.
    Downloads: 0 This Week