Showing 368 open source projects for "text processing"

View related business solutions
  • Gemini 3 and 200+ AI Models on One Platform Icon
    Gemini 3 and 200+ AI Models on One Platform

    Access Google's best plus Claude, Llama, and Gemma. Fine-tune and deploy from one console.

    Build, govern, and optimize agents and models with Gemini Enterprise Agent Platform.
    Start Free
  • MongoDB Atlas runs apps anywhere Icon
    MongoDB Atlas runs apps anywhere

    Deploy in 115+ regions with the modern database for every enterprise.

    MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
    Start Free
  • 1
    gpt2-client

    gpt2-client

    Easy-to-use TensorFlow Wrapper for GPT-2 117M, 345M, 774M, etc.

    GPT-2 is a Natural Language Processing model developed by OpenAI for text generation. It is the successor to the GPT (Generative Pre-trained Transformer) model trained on 40GB of text from the internet. It features a Transformer model that was brought to light by the Attention Is All You Need paper in 2017. The model has 4 versions - 124M, 345M, 774M, and 1558M - that differ in terms of the amount of training data fed to it and the number of parameters they contain. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 2
    MatchZoo

    MatchZoo

    Facilitating the design, comparison and sharing of deep text models

    The goal of MatchZoo is to provide a high-quality codebase for deep text matching research, such as document retrieval, question answering, conversational response ranking, and paraphrase identification. With the unified data processing pipeline, simplified model configuration and automatic hyper-parameters tunning features equipped, MatchZoo is flexible and easy to use. Preprocess your input data in three lines of code, keep track parameters to be passed into the model. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 3

    AhoTTS Multilingual, a Multilingual TTS

    Text-to-Speech TTS for Basque, Spanish, Catalan, Galician and English

    Text-to-Speech conversor for Basque, Spanish, Catalan, Galician and English. It includes linguistic processing and built voices for all the languages aforementioned. Its acoustic engine is based on hts_engine and it uses a high quality vocoder called AhoCoder. Developed by Aholab Signal Processing Laboratory: https://aholab.ehu.es/aholab/ http://aholab.ehu.es/ahocoder/
    Downloads: 0 This Week
    Last Update:
    See Project
  • 4
    artext

    artext

    Probabilistic Noising of Natural Language

    Artext is a work on injecting noise into text without affecting the core meaning for a human reader. This kind of data can be useful for many NLP tasks, particulary to make models robust to erroneous text. This is a work in progress, and we will publish the results of our experiments soon. Meanwhile, if you use artext in your research please cite this repository. Github: https://github.com/nlpcl-lab/artext
    Downloads: 1 This Week
    Last Update:
    See Project
  • Secure File Transfer for Windows with Cerberus by Redwood Icon
    Secure File Transfer for Windows with Cerberus by Redwood

    Protect and share files over FTP/S, SFTP, HTTPS and SCP with the #1 rated Windows file transfer server.

    Cerberus supports unlimited users and connections on a single IP, with built-in encryption, 2FA, and a browser-based web client — all deployable in under 15 minutes with a 25-day free trial.
    Try for Free
  • 5
    FalaBrasil

    FalaBrasil

    Resources for speech processing in Brazilian Portuguese

    The FalaBrasil Group provides free tools and resources for speech and natural language processing in Brazilian Portuguese, most of them under the BSD license. Tools include mainly scripts to do all sort of things with audio and text, whereas resources include ready-to-used acoustic and languages models, phonetic dictionaries, etc.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 6
    TIES

    TIES

    A smart search engine for medical documents

    TIES (Text Information Extraction System) is a clinical text search engine that uses Natural Language Processing techniques to extract medical concepts from free text clinical reports. It provides secure de-identified access to this information and has in built collaboration tools and honest broker functionality. It is licensed for academic use under the BSD license.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 7

    Safe Harbor Deidentification

    Safe Harbor Deidentification for medical documents

    Phalanx - Deidentify Safe Harbor Deidentification Mode of Phalanx is an abridged pipeline of NLP annotators culminating in NER annotators which write output of text offsets. It uses the Safe Harbor deidentification method.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 8
    TEXT2DATA

    TEXT2DATA

    Text Analytics Platform

    Bring Text Analytics Platform that uses NLP (Natural Language Processing) and Machine Learning to your work environment. Extract essential information from your text documents and let Artificial Intelligence save your time. Get detailed and agile reports on your unstructured data.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 9
    lazynlp

    lazynlp

    Library to scrape and clean web pages to create massive datasets

    LazyNLP is a lightweight tool for collecting and curating large-scale text datasets for machine learning and NLP applications with minimal manual effort.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Streamline Azure Security with Palo Alto Networks VM-Series Icon
    Streamline Azure Security with Palo Alto Networks VM-Series

    Centrally manage physical and virtualized firewalls with Panorama

    Improve your security posture and reduce incident response time. Use the VM-Series to natively analyze Azure traffic and dynamically drive policy updates based on workload changes.
    Learn more
  • 10
    NeuroNER

    NeuroNER

    Named-entity recognition using neural networks

    Named-entity recognition (NER) aims at identifying entities of interest in the text, such as location, organization and temporal expression. Identified entities can be used in various downstream applications such as patient note de-identification and information extraction systems. They can also be used as features for machine learning systems for other natural language processing tasks. Leverages the state-of-the-art prediction capabilities of neural networks (a.k.a. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 11
    TextRank

    TextRank

    TextRank implementation for Python 3

    TextRank is an implementation of the TextRank algorithm for extractive text summarization and keyword extraction, inspired by Google’s PageRank.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 12

    Arabic Corpus

    Text categorization, arabic language processing, language modeling

    The Arabic Corpus {compiled by Dr. Mourad Abbas ( http://sites.google.com/site/mouradabbas9/corpora ) The corpus Khaleej-2004 contains 5690 documents. It is divided to 4 topics (categories). The corpus Watan-2004 contains 20291 documents organized in 6 topics (categories). Researchers who use these two corpora would mention the two main references: (1) For Watan-2004 corpus ---------------------- M. Abbas, K. Smaili, D. Berkani, (2011) Evaluation of Topic Identification Methods on...
    Leader badge
    Downloads: 7 This Week
    Last Update:
    See Project
  • 13
    mbFXWords

    mbFXWords

    Analyze text. Diagonal read subject, predicate, obj. Search other pdf.

    Version 1.04. Applies and builds upon Apache OpenNLP. For English, French and German files. JavaFX Application, runs with Oracle Java Runtime Environment version 8 that is including JavaFX. NLP extensions: - Divide sentences in subclauses: segmentation. - Divide plain text: subject, predicate, object. - Count words: stemming. - Search for similar content: pdf's. Gives out subject, predicate and object of sentences of pdf and plain text files. Provides comfortable GUI. Automatic...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 14
    pdfsandwich generates "sandwich" OCR pdf files, i.e. pdf files which contain only images (but no editable text) will be processed by optical character recognition (OCR) and the text will be added to each page invisibly "behind" the images. pdfsandwich is a command line tool which is supposed to be useful to OCR scanned books or journals. It is able to recognize the page layout even for multicolumn text. Essentially, pdfsandwich is a wrapper script which calls the following binaries: convert, unpaper, tesseract, gs, and hocr2pdf (if tesseract < 3.03). ...
    Leader badge
    Downloads: 299 This Week
    Last Update:
    See Project
  • 15
    DSTK - Data Science TooKit 3

    DSTK - Data Science TooKit 3

    Data and Text Mining Software for Everyone

    DSTK - Data Science Toolkit 3 is a set of data and text mining softwares, following the CRISP DM model. DSTK offers data understanding using statistical and text analysis, data preparation using normalization and text processing, modeling and evaluation for machine learning and algorithms. It is based on the old version DSTK at https://sourceforge.net/projects/dstk2/ DSTK Engine is like R.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 16
    DeepLearn

    DeepLearn

    Implementation of research papers on Deep Learning+ NLP+ CV in Python

    Welcome to DeepLearn. This repository contains an implementation of the following research papers on NLP, CV, ML, and deep learning. The required dependencies are mentioned in requirement.txt. I will also use dl-text modules for preparing the datasets. If you haven't use it, please do have a quick look at it. CV, transfer learning, representation learning.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    IceNLP is an open source Natural Language Processing (NLP) toolkit for analyzing and processing Icelandic text. The toolkit is implemented in Java.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 18

    file_lemmater

    text file quick lemmater

    This executable get a text file (input name "in.txt" at the same folder where the executable is) and creates a file called "out.txt" with the same content but each noun, adjective or verb is lemmatized. From the Aseryla (https://memla.000webhostapp.com/index.html) system that combines the Stanford Core NLP (https://stanfordnlp.github.io/CoreNLP/index.html) and the CSTlemmatiser(http://cst.dk/online/lemmatiser/uk/)
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19

    cbrTekStraktor

    an application to automatically extract text from comic books.

    cbrTekStraktor is an application to automatically extract text from the text bubbles or speech balloons present in comic book reader files (CBR). Its prime goal is to perform analysis on the texts of comic books. cbrTekStraktor can however also be used for scanlation or similar purposes. The application also enables to manually define text areas in CBR files. The application comprises a simple graphical editor for further processing the extracted text. ...
    Downloads: 6 This Week
    Last Update:
    See Project
  • 20

    TEES

    Turku Event Extraction System

    Turku Event Extraction System (TEES) is a free and open source natural language processing system developed for the extraction of events and relations from biomedical text. It is written mostly in Python, and should work in generic Unix/Linux environments. Currently, the TEES source code repository still remains on GitHub at http://jbjorne.github.com/TEES/ where there is also a wiki with more information.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    Welsh Natural Language Toolkit
    The project supports the Welsh Language Technology domain with a set of NLP tools that drive innovation and advance the development of sophisticated textual analysis solutions. The WNLT project delivers four core NLP modules; a) Word Segmentation for separating text into words b) Sentence Boundary Disambiguation for finding sentence boundaries c) Part of Speech Tagger for determining the part of speech of each word d) Morphological Analyser for identifying the root form (lemma) of words....
    Downloads: 0 This Week
    Last Update:
    See Project
  • 22
    ...Multiple narratives can be listed in the text file, where narratives are separated using a # symbol. The text upload process entitles the initial (POS) tagging of uploaded text using Stanford (POS) tagger. The user can later modify and extend the initial tagging. The resultant annotations are stored in the supporting database. These results can be exported to excel or text files for further processing.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 23

    JCLTP

    A Java Class Library for Text Processing

    JCLTP is a class library designed for processing text. JCLTP is free, open source and developed with the Java programming language. JCLTP is distributed under the GNU license. It incorporates several technologies that enable process information while applying AI techniques, in order to build predictive models for text classification. Through a flexible structure of interfaces and classes, the opportunity to extend, adapt and add functionality JCLTP is provided. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 24

    BioC

    We describe a simple XML format to share text documents and annotation

    A minimalist approach to share text documents and data annotations. Allows a large number of different annotations to be represented. Project files contain: - simple code to hold/read/write data and perform sample processing. - BioC-formatted corpora - BioC tools that work with BioC corpora BioC goals - simplicity - interoperability - broad use - reuse There should be little investment required to learn to use a format or a software module to process that format. ...
    Leader badge
    Downloads: 2 This Week
    Last Update:
    See Project
  • 25
    Ansj Chinese word segmentation

    Ansj Chinese word segmentation

    Ansj word segmentation

    The real java implementation of ict. The word segmentation effect is faster than the open source version of ict. Chinese word segmentation, name recognition, part-of-speech tagging, user-defined dictionary. This is a java implementation of Chinese word segmentation based on n-Gram+CRF+HMM. The word segmentation speed reaches about 2 million words per second (tested under mac air), and the accuracy rate can reach more than 96%. At present, it has realized the functions of Chinese word...
    Downloads: 0 This Week
    Last Update:
    See Project
MongoDB Logo MongoDB