20 projects for "corpora" with 2 filters applied:

  • 1

    Tokenized Text Aligner

    Aligns tokens in two versions of a text with differing tokenization.

    This tool performs token-by-token alignment of two versions of a text with differing tokenization by interpreting the results of a file diff (https://docs.python.org/3/library/difflib.html). It is intended for use in the preparation of annotated linguistic corpora, where differences in tokenization may arise (i) following corrections or modifications to the source text or (ii) through the creation of different layers of annotation (part-of-speech, treebank) requiring different tokenization. In its default implementation, it produces a human-readable CSV table associating tokens in text A with tokens in text B, and can also inject token-level annotation from text B to text A. ...
    Downloads: 0 This Week
  • 2

    JoBimText

    Linking Language to Knowledge with Distributional Semantics

    JoBimText is a software solution for automatic text expansion using contextualized distributional similarity. It provides text analysis tools for large corpora and can create distributional semantic models (JoBimText models) and multi-word expressions.
    Downloads: 0 This Week
  • 3

    UnsupervisedMT

    Phrase-Based & Neural Unsupervised Machine Translation

    Unsupervised Machine Translation is a research repository that implements both phrase-based SMT and neural MT approaches for translation without parallel corpora. The neural component supports multiple architectures—seq2seq, biLSTM with attention, and Transformer—and allows extensive parameter sharing across languages to improve data efficiency. Training relies on denoising auto-encoding and back-translation, with on-the-fly, multithreaded generation of synthetic parallel data to continually refresh supervision signals. ...
    Downloads: 0 This Week
  • 4

    concordia

    Powerful search library, best suited for computer-aided translation

    ...This project now contains a fully functional Concordia search library. In the near future, it will be extended by concordia-server: a lightweight, robust web server providing corpora search functionalities.
    Downloads: 0 This Week
  • 5

    rcqp

    R interface to the Corpus Query Protocol

    Implements the Corpus Query Protocol as a package for the R statistical environment. It allows querying linguistic corpora and manipulating the data as native R objects. It is based on the CWB software.
    Downloads: 0 This Week
  • 6

    poliqarp2

    natural language corpora search engine

    This project aims at building an efficient indexer and search engine for natural language corpora with multilevel annotations.
    Downloads: 0 This Week
  • 7

    diasim

    Dialogue Similarity

    Tools for calculating similarity (including lexical and syntactic) between speakers in dialogue, across standard and randomised corpora.
    Downloads: 0 This Week
  • 8

    mwetoolkit

    THIS PROJECT MIGRATED TO https://gitlab.com/mwetoolkit/mwetoolkit3/

    The Multiword Expressions toolkit aids in the automatic identification and extraction of multiword units in running text. These include idioms (kick the bucket), noun compounds (cable car), phrasal verbs (take off, give up), etc. Even though it focuses on multiword expressions, the framework is quite complete and can also be useful in any corpus-based study in computational linguistics. The mwetoolkit can be...
    Downloads: 1 This Week
  • 9
    EXMARaLDA
    EXMARaLDA stands for "Extensible Markup Language for Discourse Annotation". It is a system of concepts, data formats and tools for the computer-assisted transcription and annotation of spoken language, and for the analysis of spoken language corpora. This project's source code has moved to https://github.com/Exmaralda-Org/exmaralda
    Downloads: 0 This Week
  • 10

    Aelius Brazilian Portuguese POS-Tagger

    Python, NLTK-based package for shallow parsing of Brazilian Portuguese

    ...It also includes language resources such as language models, sample texts, and gold standards. Aelius currently offers facilities for POS-tagging and chunking corpora and for outputting annotations in different formats, such as XML in the TEI P5 encoding scheme.
    Downloads: 0 This Week
  • 11
    Donatus is an ongoing project consisting of Python, NLTK-based tools and grammars for deep parsing and syntactic annotation of Brazilian Portuguese corpora. It includes a user-friendly graphical interface for building syntactic parsers with the NLTK, providing some additional functionality.
    Downloads: 0 This Week
  • 12

    Hermes Natural Language Processing

    A repository of software, documentation and data for NLP

    Hermes is a repository of software, documentation and data for NLP. I am currently adding corpora extracted from Wikipedia (mostly in Romance languages).
    Downloads: 0 This Week
  • 13
    Various tools for creating annotated parallel corpora, including pre-trained tagging and parsing models for various languages, sentence alignment tools and word alignment tools. Uplug also includes a web-based interface for interactive sentence and word alignment, and scripts for indexing and querying parallel corpora using the Corpus Workbench (CWB). Download 'uplug-main' first and then add other packages.
    Downloads: 0 This Week
  • 14
    A tool for preprocessing large, richly annotated parallel corpora and extracting Moses phrase tables.
    Downloads: 0 This Week
  • 15
    CORPSE (CORPus SEarch) is a powerful search engine written in Java. The aim is to provide an efficient implementation of word-level inverted-index search, with various functions that can be used on very large corpora.
    Downloads: 0 This Week
  • 16
    Enrich and query corpora in the TEI-XML vocabulary. CorpusReader manages very large corpora and corpora containing milestone annotation. It provides tools for enriching corpora with the output of linguistic parsers and for extracting quantitative information.
    Downloads: 0 This Week
  • 17
    NooJ is used by linguists to describe linguistic phenomena and apply the formalized morphological, syntactic or semantic rules to corpora. It is also used by non-linguists in fields like psychology, sociology, history, and literature studies.
    Downloads: 0 This Week
  • 18
    The NITE XML Toolkit supports the creation, analysis, and browsing of annotated multimodal, text, or spoken language corpora, and represents both timing and rich linguistic structure. It contains libraries for developers and some end user tools.
    Downloads: 0 This Week
  • 19
    AmiGram is the AMI Graphical Representation and Annotation Module. It is a general-purpose tool for multimodal corpus annotation and allows the timeline-based annotation of NXT corpora in a layer-based environment.
    Downloads: 1 This Week
  • 20
    A programming language designed for searching and manipulating tree-structured data, particularly corpora of natural languages encoded in an s-expression-like format.
    Downloads: 0 This Week