Showing 170 open source projects for "corpus"

  • 1
    Used to collect speech corpora for speech recognition systems such as Sphinx.
    Downloads: 0 This Week
  • 2
    VoiceScribe is a simple highlighting editor. Its purpose is to facilitate the task of creating and correcting transcripts for inclusion in the Vienna Oxford International Corpus of English (VOICE).
    Downloads: 1 This Week
  • 3
    An Arabic word corpus containing a large list of words, starting with 1.5 million words, useful for natural language processing.
    Downloads: 0 This Week
  • 4
    Supporting software for a school research paper to analyze a corpus for letter frequency and word properties.
    Downloads: 0 This Week
  • 5
    Clipsyll is a collection of scripts and programs for downloading, codifying, and analysing (using NLTK) CLIPS, the largest Italian corpus of spoken language. It includes a syllabification module based on the SSP: http://sourceforge.net/projects/silly
    Downloads: 0 This Week
  • 6
    Unsupervised, language-independent morphological parser based on compression and precedence relations between morphemes. It can be run on a Unicode corpus and outputs a lexicon of proposed morphemes in the language.
    Downloads: 0 This Week
  • 7
    Cunei is a data-driven machine translation system that builds dynamic, statistical models based on instances of known translations found in a corpus.
    Downloads: 0 This Week
  • 8
    GHIRL is the Graph-based Heterogeneous Information Representation Language: a Java library for representing, querying, and navigating graph- or network-based data structures.
    Downloads: 0 This Week
  • 9
    This project was carried out as part of our two-year degree in Computer Science at Orleans (France). The aim of this project is to save web pages in order to build a corpus of web documents. Searches are done by keywords, language, website, search engine...
    Downloads: 0 This Week
  • 10
    Sanchay
    Sanchay is a collection of tools and APIs for language researchers. It has implementations of some NLP algorithms, some flexible APIs, several user-friendly annotation interfaces, and the Sanchay Query Language for language resources.
    Downloads: 0 This Week
  • 11
    Web-as-corpus tools in Java: a simple crawler (with Nutch and Heritrix integration), an HTML cleaner to remove boilerplate code, language recognition, and a corpus builder.
    Downloads: 0 This Week
  • 12
    A set of tools, ready to process the Europarl corpus as published by statmt.org (v3).
    Downloads: 0 This Week
  • 13
    This project presents a system that, from a corpus of documents, extracts information about a subject area along with a collection of pedagogical components. This information is packaged into fine-grained learning objects (metadata included).
    Downloads: 0 This Week
  • 14
    Get1T is a tool for filtering through the massive quantity of data available in the Web 1T corpus and extracting only the counts you need - including for simple wildcard patterns.
    Downloads: 0 This Week
  • 15
    This project is intended to list the top R ranked terms that are between M and N in length. It is designed to extract these phrases from a given corpus in an input folder.
    Downloads: 0 This Week
  • 16
    Samudra Manthan uses C and MPI to find interesting n-grams (terms) in a large corpus of data. We use the GigaWord corpus to find the top m interesting n-grams using the TF*IDF measure.
    Downloads: 0 This Week
  • 17
    cl-cc-bnc provides a frontend for learners of the English language. You can enter a URI, whose content is analyzed for word frequencies and compared with word frequencies in the British National Corpus.
    Downloads: 0 This Week
  • 18
    A collection of Python scripts to create and handle an XML corpus (a large collection of text for linguistic purposes) from an original Wikipedia database backup dump. It includes a regular-expression-based parser for the MediaWiki markup language.
    Downloads: 0 This Week
  • 19
    TaCo is a tasty Palm application that enables you to use the Tanaka Corpus on your handheld. The Tanaka Corpus is a collection of Japanese/English sentence pairs that a student of Japanese language can use as a source of example sentences.
    Downloads: 0 This Week
  • 20
    Bi-gram applications based on language models produced by SRILM from a Chinese Wikipedia corpus, including a Chinese word segmenter, a word-based (not character-based) Traditional-Simplified Chinese converter, and a Chinese syllable-to-word converter.
    Downloads: 0 This Week
  • 21
    BabyTALK aims to add another brick in the wall of natural language learning. The baby needs to structure a corpus of texts when its tutor points to and talks about a particular part of the corpus. The baby is also expected to describe any selected part of the corpus.
    Downloads: 0 This Week
  • 22
    A fast way to rate the reading difficulty of a book or text. Unlike well-known readability metrics such as Fog, Kincaid, SMOG, ARI, Flesch, and Coleman-Liau, this metric takes into account far more factors and is standardized against a corpus
    Downloads: 0 This Week
  • 23
    CRFChunker: Conditional Random Fields Phrase Chunker (Phrase Chunking Tool) for English. The model was trained on sections 01..24 of the WSJ corpus, using section 00 as the development test set (F1-score of 95.77). Chunking speed: 700 sentences/s.
    Downloads: 0 This Week
  • 24
    CRFTagger: Conditional Random Fields Part-of-Speech (POS) Tagger for English. The model was trained on sections 01..24 of the WSJ corpus, using section 00 as the development test set (accuracy of 97.00%). Tagging speed: 500 sentences/s.
    Downloads: 0 This Week
  • 25
    Stem-Les (Lexicon Extraction Suite) extracts lexical chunks that are relevant in a corpus of documents. If the corpus is bilingual, Stem-Les also finds translation equivalents for the lexical solution selected by the user.
    Downloads: 0 This Week