Showing 49 open source projects for "text processing"

View related business solutions
  • MongoDB Atlas runs apps anywhere Icon
    MongoDB Atlas runs apps anywhere

    Deploy in 115+ regions with the modern database for every enterprise.

    MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
    Start Free
  • $300 in Free Credit Towards Top Cloud Services Icon
    $300 in Free Credit Towards Top Cloud Services

    Build VMs, containers, AI, databases, storage—all in one place.

    Start your project in minutes. After credits run out, 20+ products include free monthly usage. Only pay when you're ready to scale.
    Get Started
  • 1
    pyVideoTrans

    pyVideoTrans

    Translate the video from one language to another and embed dubbing

    pyVideoTrans is an ambitious open-source multimedia processing project that assembles speech recognition, subtitle generation, AI translation, voice synthesis, and video assembly into a unified pipeline for converting videos from one language to another with embedded dubbing and captions. At its core it runs speech-to-text models to transcribe audio tracks, translates the resulting text into a target language using local or cloud-based translation engines, synthesizes new speech to match the translated subtitles, and then merges that speech back into the video, creating a fully localized media file. ...
    Downloads: 27 This Week
    Last Update:
    See Project
  • 2

    multinotes

    Text architecture for music theory.

    ...Furthermore, dynamic interactive documents can be useful for presenting complicated interdependencies to the reader more clearly, far beyond conventional paper publication. The mulitNotes text architecture and processing pipeline is based on d2d and standard technologies (XSLT, ECMAScript. LilyPond, PostScript, etc.) and addresses these issues. An overview about the software architecture and its operation is given in: Journal of the Text Encoding Initiative, Open Issue 18/2024: "Using d2d for Writing XML --- The multiNotes Text Architecture for Musical Analysis" https://doi.org/10.4000/132ex
    Downloads: 0 This Week
    Last Update:
    See Project
  • 3
    TXM

    TXM

    Unicode XML TEI text analysis platform

    TXM is a free and open-source cross-platform Unicode & XML based text analysis environment and graphical client, supporting Windows, Linux and Mac OS X. It can also be used online as a J2EE standard compliant web portal (GWT based) with access control built in. DOWNLOAD LATEST VERSION OF TXM : http://textometrie.ens-lyon.fr/spip.php?rubrique61&lang=en TXM offers a comprehensive range of analysis tools (concordances, collocate search, frequency lists, etc.) based on the powerfull CQP...
    Downloads: 16 This Week
    Last Update:
    See Project
  • 4

    modnlp-plugins

    External plugins for modnlp/teccli

    This is a general project for modnlp/teccli plugins, with focus on text visualizaton.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Compliant and Reliable File Transfers Backed by Top Security Certifications Icon
    Compliant and Reliable File Transfers Backed by Top Security Certifications

    Cerberus FTP Server delivers SOC 2 Type II certified security and FIPS 140-2 validated encryption.

    Stop relying on non-certified, legacy file transfer tools that creak under the weight of modern security demands. Get full audit trails, advanced access controls and more supported by an award-winning team of experts. Start your free 25-day trial today.
    Start Free Trial
  • 5

    MITRE Annotation Toolkit

    A toolkit for managing and manipulating text annotations

    The MITRE Annotation Toolkit (MAT) is a suite of tools which can be used for automated and human tagging of annotations. Annotation is a process, used mostly by researchers in natural language processing, of enhancing documents with information about the various phrase types the documents contain. MAT supports both UI interaction and command-line interaction, and provides various levels of control over the overall annotation process. It can be customized for specific tasks (e.g.,...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 6
    Live Transcribe Speech Engine

    Live Transcribe Speech Engine

    Live Transcribe is an Android application

    Live Transcribe Speech Engine provides on-device speech recognition components that power real-time transcription for accessibility and everyday voice-first experiences. Its design prioritizes latency and robustness in noisy, far-field environments, enabling continuous transcription with low delay on mobile hardware. The engine manages audio front-end processing—such as noise suppression and voice activity detection—before feeding audio into compact, accurate acoustic and language models....
    Downloads: 0 This Week
    Last Update:
    See Project
  • 7
    Leseratte is a Java parser for German written language. Currently, it contains a German lexicon (based on the Wiktionary), inflexion rules, a grammar and a parser. (Semantics component planned.) Usable as a Java library, also provides a graphical UI.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 8
    wordTabulator
    Program wordTabulator is intended for text analysis. It can generate index of word elements extracted from defined text set. Word elements may be words, N-grams or phrases (syntagmes). The program can process texts as in ordinary 1-byte encoding (ANSI), as in multibyte UTF-8 encoding.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 9

    KSUCCA Corpus

    A 50 million tokens corpus of Classical Arabic.

    King Saud University Corpus of Classical Arabic (KSUCCA) is a pioneering 50 million tokens annotated corpus of Classical Arabic texts from the period of pre-Islamic era until the fourth Hijri century (equivalent to the period from the seventh until early eleventh century CE), which is the period of pure classical Arabic. The main aim of this corpus is to be used for studying the distributional lexical semantics of The Quran words. However, it can be used for other research purposes, such...
    Downloads: 3 This Week
    Last Update:
    See Project
  • Stop Cyber Threats with VM-Series Next-Gen Firewall on Azure Icon
    Stop Cyber Threats with VM-Series Next-Gen Firewall on Azure

    Native application identity and user-based security for your Azure cloud

    Gain integrated visibility across all traffic in a single pass. Deploy Palo Alto Networks VM-Series to determine application identity and content while automating security policy updates via rich APIs.
    Get a free trial
  • 10
    TIES

    TIES

    A smart search engine for medical documents

    TIES (Text Information Extraction System) is a clinical text search engine that uses Natural Language Processing techniques to extract medical concepts from free text clinical reports. It provides secure de-identified access to this information and has in built collaboration tools and honest broker functionality. It is licensed for academic use under the BSD license.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 11

    Safe Harbor Deidentification

    Safe Harbor Deidentification for medical documents

    Phalanx - Deidentify Safe Harbor Deidentification Mode of Phalanx is an abridged pipeline of NLP annotators culminating in NER annotators which write output of text offsets. It uses the Safe Harbor deidentification method.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 12

    Arabic Corpus

    Text categorization, arabic language processing, language modeling

    The Arabic Corpus {compiled by Dr. Mourad Abbas ( http://sites.google.com/site/mouradabbas9/corpora ) The corpus Khaleej-2004 contains 5690 documents. It is divided to 4 topics (categories). The corpus Watan-2004 contains 20291 documents organized in 6 topics (categories). Researchers who use these two corpora would mention the two main references: (1) For Watan-2004 corpus ---------------------- M. Abbas, K. Smaili, D. Berkani, (2011) Evaluation of Topic Identification Methods on...
    Leader badge
    Downloads: 9 This Week
    Last Update:
    See Project
  • 13

    Ghawwas_V4

    An open source system for Arabic corpora processing

    Ghawwas (previously known as Khawas) is an open source system for Arabic corpora processing. Ghawwas V4.0 provides the following main functions: a. Frequency list for single word and N-Grams b. Concordance c. Collocation (MI, CHI Squared, LL, T-Score, Z Score, Dice, Log Dice, Weirdness Coefficient) d. Lexical patterns search e. Two corpora frequency profile comparison based on MI, CHI, LL, T-Score, Z Score, Dice, Log Dice, Weirdness Coefficient f. Accept Windows and UTF-8 character...
    Downloads: 4 This Week
    Last Update:
    See Project
  • 14
    IceNLP is an open source Natural Language Processing (NLP) toolkit for analyzing and processing Icelandic text. The toolkit is implemented in Java.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15

    Musaheb

    An Arabic collocation extraction tool

    “Musaheb”, an Arabic collocation extraction tool that has been designed and implemented to overcome the limitations of existing collocation extraction tools. “Musaheb” is able to extract n-gram collocations up to 5-gram, in addition to extracting the collocates of the nodes (the word-types we are looking for its collocates) within a window size of zero to 15 words. Moreover, it provides eight collocation statistics to calculate the strength of the collocation, and permits the input of...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 16

    TEES

    Turku Event Extraction System

    Turku Event Extraction System (TEES) is a free and open source natural language processing system developed for the extraction of events and relations from biomedical text. It is written mostly in Python, and should work in generic Unix/Linux environments. Currently, the TEES source code repository still remains on GitHub at http://jbjorne.github.com/TEES/ where there is also a wiki with more information.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 17
    Welsh Natural Language Toolkit
    The project supports the Welsh Language Technology domain with a set of NLP tools that drive innovation and advance the development of sophisticated textual analysis solutions. The WNLT project delivers four core NLP modules; a) Word Segmentation for separating text into words b) Sentence Boundary Disambiguation for finding sentence boundaries c) Part of Speech Tagger for determining the part of speech of each word d) Morphological Analyser for identifying the root form (lemma) of words....
    Downloads: 0 This Week
    Last Update:
    See Project
  • 18

    Discriminative Language Editor

    Discriminative language editor based on ontologies

    Text editor in Java that is able to detect discriminative expressions while the user is typing. When the internal ontology-based analyzer detects a potential discriminative expression the user is advised by underscoring the related words in the text. A descriptive message about the issue is also shown to the user when the cursor is placed over the potential discriminative expression.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19

    BioC

    We describe a simple XML format to share text documents and annotation

    A minimalist approach to share text documents and data annotations. Allows a large number of different annotations to be represented. Project files contain: - simple code to hold/read/write data and perform sample processing. - BioC-formatted corpora - BioC tools that work with BioC corpora BioC goals - simplicity - interoperability - broad use - reuse There should be little investment required to learn to use a format or a software module to process that format. ...
    Leader badge
    Downloads: 16 This Week
    Last Update:
    See Project
  • 20
    VoynichReader

    VoynichReader

    Voynich Manuscript viewer.

    Software for viewing, searching & analysing Prof. Stolfi's interlinear transcription of the Voynich Manuscript.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    Welsh Natural Language Toolkit

    Welsh Natural Language Toolkit

    WNLT is a suite of open source natural language modules for the Welsh

    The project supports the Welsh Language Technology domain with a set of NLP tools that drive innovation and advance the development of sophisticated textual analysis solutions. The WNLT project delivers four core NLP modules; a) Word Segmentation for separating text into words b) Sentence Boundary Disambiguation for finding sentence boundaries c) Part of Speech Tagger for determining the part of speech of each word d) Morphological Analyser for identifying the root form (lemma) of words....
    Downloads: 0 This Week
    Last Update:
    See Project
  • 22

    bnf2xml

    simple BNF parser makes xml markup of matches

    bnf2xml a simple BNF parser that takes text as input, searches according to a BNF query file, and outputs text marked up by the xml labels that show context. bnf2xml is as simple to use as any text binary ie, awk(1) grep(1). bnf2xml does not require C API because it outputs simple xml labeling. README is visible on file dl page. EXAMPLE: $ echo "hi" | bnf2xml patternfile <word><alph>h</alph><alph>i</alph></word> or <gas>hydrogen iodide</gas> patternfile says how to find...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 23
    tiny-hyphenator

    tiny-hyphenator

    C++ Library to hyphenate a text

    Use this class to achieve efficient and mostly exact word hyphenations. Currently the only language supported is German, but it can very easily be adapted to support other languages. This library spans ~370 lines of C++ code. At the moment there is no real documentation, pleas refer to the comments in the code, they should be quite helpful. A sample is also included. This library is not based on any algorithm published on the internet or elsewhere and it works without a dictionary. The...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 24
    ATTENTION! Morfologik is now at GitHub: https://github.com/morfologik/
    Downloads: 0 This Week
    Last Update:
    See Project
  • 25
    Virastyar

    Virastyar

    Virastyar is an spell checker for low-resource languages

    Virastyar is a free and open-source (FOSS) spell checker. It stands upon the shoulders of many free/libre/open-source (FLOSS) libraries developed for processing low-resource languages, especially Persian and RTL languages Publications: Kashefi, O., Nasri, M., & Kanani, K. (2010). Towards Automatic Persian Spell Checking. SCICT. Kashefi, O., Sharifi, M., & Minaie, B. (2013). A novel string distance metric for ranking Persian respelling suggestions. Natural Language Engineering,...
    Downloads: 56 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • Next
MongoDB Logo MongoDB