Showing 105 open source projects for "corpora"

  • 1

    Khawas

    An Arabic Corpora Processing Tool

    The new version is available at https://sourceforge.net/projects/ghawwasv4/
    Downloads: 0 This Week
    Last Update:
    See Project
  • 2

    Fine-grained Arabic Named Entity Corpora

    Fine-grained Arabic Named Entity Corpora

    ...Those corpora have been manually annotated from the Arabic Wikipedia and Newswire sources respectively. B) Automatically-developed: 1) WikiFANE_Whole: All sentences of the Arabic Wikipedia articles were retrieved to compile the corpus. ~2M tokens. 2) WikiFANE_Selective: Sentences which have at least one NE phrase were retrieved to compile the corpus. ~2M tokens.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 3

    WN-Toolkit

    Creation of WordNets using the expand model

    This toolkit is a set of Python programs for the creation or enlargement of WordNets using the expand model. Several methodologies are available: dictionary-based, BabelNet-based, and methodologies based on the use of parallel corpora.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 4

    BioParallelCorporaExtractor

    BioPCE: a tool to extract parallel corpora of biomedical texts

    BioParallelCorporaExtractor (BioPCE) is a Python tool to extract parallel corpora of biomedical texts. It is joint work by Elise Prieur-Gaston, Antonio Jimeno Yepes and Aurélie Névéol. In the "Files" tab on this page, you can find the Perl script used to web-crawl publisher data and a sample input file created for 5 MEDLINE citations. Each line in the input file should contain the PubMed identifier (PMID) and its Digital Object Identifier (DOI) separated by the pipe symbol. ...
    Downloads: 0 This Week
    Last Update:
    See Project
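The input format described for BioPCE (one PMID and one DOI per line, separated by a pipe) can be read with a few lines of Python. This is only a minimal sketch of that line format, not code from the project: the function name `parse_citation_lines` and the sample identifiers are made up for illustration.

```python
from typing import Iterable, List, Tuple


def parse_citation_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse 'PMID|DOI' lines into (pmid, doi) pairs, skipping blank lines."""
    pairs = []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        # Split on the first pipe only; everything after it is the DOI.
        pmid, sep, doi = line.partition("|")
        if not sep or not pmid.strip() or not doi.strip():
            raise ValueError(f"line {line_number}: expected 'PMID|DOI', got {line!r}")
        pairs.append((pmid.strip(), doi.strip()))
    return pairs


# Hypothetical identifiers, for illustration only.
sample = ["12345678|10.1000/example.doi.1", "", "23456789|10.1000/example.doi.2"]
pairs = parse_citation_lines(sample)
```

Validating each line up front means a malformed entry is reported with its line number before any crawling starts.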
  • 5

    Autshumato Text Anonymiser

    Text anonymiser for the Autshumato project.

    A tool for the anonymisation of text corpora, which entails identifying entities that may convey confidential information and replacing those entities with randomly selected entities of the same type.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 6

    RedLDA

    Redundancy Aware LDA Gibbs Sampler

    Redundancy-Aware Topic Modeling. Copy-paste redundancy and data duplication are prevalent in many corpora. This redundancy has a negative impact on the quality of text mining, and of topic modeling in particular. This is a software package implementing a novel variant of Latent Dirichlet Allocation (LDA) topic modeling, Red-LDA, which takes into account the inherent redundancy of corpora when modeling content. My site: http://www.cs.bgu.ac.il/~cohenrap/ Lab site: http://www.cs.bgu.ac.il/~nlpproj/ Sister project: http://sourceforge.net/projects/corpusredundanc/
    Downloads: 0 This Week
    Last Update:
    See Project
  • 7

    TextBlob

    TextBlob is a Python library for processing textual data

    ...Also, it comes with a WordNet integration. If you only intend to use TextBlob’s default models (no model overrides), you can pass the lite argument. This downloads only those corpora needed for basic functionality. TextBlob is also available as a conda package.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 8
    Knowtator is a general-purpose text annotation tool that is integrated with the Protégé knowledge representation system. Knowtator facilitates the manual creation of training and evaluation corpora for a variety of biomedical language processing tasks.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 9
    Donatus is an on-going project consisting of Python, NLTK-based tools and grammars for deep parsing and syntactical annotation of Brazilian Portuguese corpora. It includes a user-friendly graphical user interface for building syntactic parsers with the NLTK, providing some additional functionalities.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 10

    Hermes Natural Language Processing

    A repository of software, documentation and data for NLP

    Hermes is a repository of software, documentation and data for NLP. I am currently adding corpora extracted from Wikipedia (mostly in Romance languages).
    Downloads: 0 This Week
    Last Update:
    See Project
  • 11
    Various tools for creating annotated parallel corpora, including pre-trained tagging and parsing models for various languages, sentence alignment tools and word alignment tools. Uplug also includes a web-based interface for interactive sentence and word alignment, and scripts for indexing and querying parallel corpora using the Corpus Workbench (CWB). Download 'uplug-main' first and then add other packages.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 12
    Arabic Computational Linguistics resources and Tools, Arabic Text Mining Tools, Arabic Language tools, Arabic Morphological Analysis (Stemming / Light Stemming), Arabic text preprocessing, Arabic Corpora, Open Source Arabic Corpora OSAC, Comparable Corpora. For more information: http://sites.google.com/site/motazsite
    Leader badge
    Downloads: 9 This Week
    Last Update:
    See Project
  • 13

    Corpora of Misspellings

    Corpora with misspellings marked

    This is a project for creating corpora with misspellings marked and the correct word given. Example use could be for testing spell checkers.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 14
    A universal suite of utilities for large corpora processing.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    Fast Unsupervised Sentence Aligner described in "Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora", COLING 2010. NEWS: release 1.0b fixes a bug (release 1.0a is deprecated).
    Downloads: 1 This Week
    Last Update:
    See Project
  • 16
    A tool for large richly annotated parallel corpora preprocessing and Moses phrase-table extraction.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    CORPSE (CORPus SEarch) is a powerful search engine written in Java. The aim is to provide an efficient implementation of a word level inverted index search with various cool functions that can be used on very large corpora.
    Downloads: 0 This Week
    Last Update:
    See Project
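A word-level inverted index of the kind CORPSE describes maps each word to the set of documents that contain it. The Python sketch below only illustrates the general technique; CORPSE itself is written in Java, and the names `build_inverted_index` and `search_all` as well as the sample documents are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(documents: List[str]) -> Dict[str, Set[int]]:
    """Map each lowercased, whitespace-separated word to the ids of documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_all(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Illustrative documents, not taken from any real corpus.
docs = ["the cat sat", "the dog sat", "a cat and a dog"]
index = build_inverted_index(docs)
hits = search_all(index, "cat sat")
```

Lookups touch only the posting sets of the query words, which is why this structure suits very large corpora better than scanning every document.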
  • 18
    A database of linguistic annotation of medical text (from MEDLINE), including corpora used with ABGene, BioCreative I and II, and the MedPost training corpus.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19
    Enrich and query corpora in the TEI-XML vocabulary. CorpusReader manages very large corpora and corpora containing milestone annotation. It provides tools for enriching corpora with the output of linguistic parsers, and for extracting quantitative information.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 20
    NooJ is used by linguists to describe linguistic phenomena and apply the formalized morphological, syntactic or semantic rules to corpora. It is also used by non-linguists in fields like psychology, sociology, history and literature studies.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    The NITE XML Toolkit supports the creation, analysis, and browsing of annotated multimodal, text, or spoken language corpora, and represents both timing and rich linguistic structure. It contains libraries for developers and some end user tools.
    Leader badge
    Downloads: 1 This Week
    Last Update:
    See Project
  • 22
    CorporAl implements a method for processing overlapping corpora. The current version supports parallel corpora. It works by aligning the corresponding language parts and then aligning the alignments between themselves.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 23
    Xaira
    XAIRA (XML Aware Indexing and Retrieval Architecture) supports indexing and analysis of large XML textual resources such as natural language corpora.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 24
    Tool for processing XML-annotated linguistic corpora
    Downloads: 0 This Week
    Last Update:
    See Project
  • 25
    Parallel Corpora tools.
    Downloads: 0 This Week
    Last Update:
    See Project