Showing 14 open source projects for "text processing"

View related business solutions
  • Gemini 3 and 200+ AI Models on One Platform Icon
    Gemini 3 and 200+ AI Models on One Platform

    Access Google's best plus Claude, Llama, and Gemma. Fine-tune and deploy from one console.

    Build, govern, and optimize agents and models with Gemini Enterprise Agent Platform.
    Start Free
  • MongoDB Atlas runs apps anywhere Icon
    MongoDB Atlas runs apps anywhere

    Deploy in 115+ regions with the modern database for every enterprise.

    MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
    Start Free
  • 1
    OCRmyPDF

    OCRmyPDF

    OCRmyPDF adds an OCR text layer to scanned PDF files

    OCRmyPDF adds an optical character recognition (OCR) text layer to scanned PDF files, allowing them to be searched. PDF is the best format for storing and exchanging scanned documents. Unfortunately, PDFs can be difficult to modify. OCRmyPDF makes it easy to apply image processing and OCR (recognized, searchable text) to existing PDFs.
    Downloads: 103 This Week
    Last Update:
    See Project
  • 2
    Unredact

    Unredact

    A simple tool for reading in poorly redacted documents

    Unredact is a specialized tool that attempts to reconstruct redacted or obscured text in images, PDFs, or screenshots using a combination of image processing and generative AI inference to suggest plausible completions of blurred, black-boxed, or jumbled content. Unlike traditional optical character recognition (OCR), which only reads visible text, Unredact focuses on inferring missing content where redaction has been applied by analyzing surrounding context, font characteristics, and linguistic patterns to produce candidate reconstructions. ...
    Downloads: 12 This Week
    Last Update:
    See Project
  • 3
    EpiDoc: Epigraphic Documents in TEI XML

    EpiDoc: Epigraphic Documents in TEI XML

    XML text markup for ancient documents

    The EpiDoc Collaborative is developing specifications and tools for standards-based, digital publication and interchange of scholarly and educational editions of documentary and literary texts like inscriptions and papyri. The link below will take you to the EpiDoc home page on this site.
    Leader badge
    Downloads: 4 This Week
    Last Update:
    See Project
  • 4
    Budou

    Budou

    Budou is an auto organizer tool for beautiful line breaking in CJK

    Budou is a Python library developed by Google to improve web typography for CJK (Chinese, Japanese, Korean) languages by producing semantically meaningful line breaks. Unlike English, CJK scripts lack spaces or hyphenation cues, often resulting in awkward or unreadable text wrapping on web pages. Budou addresses this issue by segmenting sentences into logical lexical chunks and wrapping each chunk in non-breaking HTML <span> tags. These spans can be styled with CSS to ensure smooth, visually coherent line breaks without splitting words or phrases. The tool supports multiple segmentation backends, including Google Cloud Natural Language API, MeCab, and TinySegmenter, enabling flexibility for both cloud-based and offline processing. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • Try Google Cloud Risk-Free With $300 in Credit Icon
    Try Google Cloud Risk-Free With $300 in Credit

    No hidden charges. No surprise bills. Cancel anytime.

    Use your credit across every product. Compute, storage, AI, analytics. When it runs out, 20+ products stay free. You only pay when you choose to.
    Start Free
  • 5
    PDF-Shuffler
    PDF-Shuffler is a small python-gtk application, which helps the user to merge or split pdf documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface. It is a frontend for python-pyPdf.
    Downloads: 37 This Week
    Last Update:
    See Project
  • 6

    dpanalyzer

    postprocessing tool for Project Gutenberg Distributed Proofreaders

    Specialized tool for PostProcessors of books produced by Project Gutenberg Distributed Proofreaders. Parses the markup structure of a project file out of the formatting rounds; reports about the text structure found, and identifies markup errors. Planned future features: generation of normalized dp output by rejoining split paragraphs and moving around footnotes, renumbering of pages; conversion to basic LaTeX and basic HTML markup for further processing.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 7
    A python script that uses wxwidgets. View or edit delimited data.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 8
    PROJECT HAS MOVED: https://github.com/wiki2beamer
    Downloads: 0 This Week
    Last Update:
    See Project
  • 9
    A set of Unix command line tools for quick and convenient batch processing of tabular text files (a.k.a., tab-delimited, csv, or flat file format) with a header line. Provides delimiter and compression detection, column reference by name. * tblmap: per-line ("map") computation: derive columns through an expression, delete, reorder, filter rows. * tblred: compute ("reduce") aggregations (e.g., sum, average) over groups defined by key columns
    Downloads: 0 This Week
    Last Update:
    See Project
  • Train ML Models With SQL You Already Know Icon
    Train ML Models With SQL You Already Know

    BigQuery automates data prep, analysis, and predictions with built-in AI assistance.

    Build and deploy ML models using familiar SQL. Automate data prep with built-in Gemini. Query 1 TB and store 10 GB free monthly.
    Try Free
  • 10
    Multiple format bibliography processor in Python.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 11
    Contains a LaTeX style file and an associated GUI that allow for the annotation of LaTeX documents. Tracks changes made by multiple editors. This package provides a way for multiple authors to collaboratively edit a latex document.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 12
    ZML, the Zeitung Markup Language, is a simple CMS for small newspapers. It was specifically designed to publish a student newspaper in print and on the Web. It uses LaTeX and XHTML. So far, it is documented in German only.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 13
    Pybtex is a drop-in replacement for BibTeX written in Python.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 14
    SchemaDoc is a XML-based markup language for documenting XML schemas. The work products include both the vocabulary and a set of tools for combining it with the schema source (e.g. a DTD) to produce documentation in HTML, XML DocBook, LaTeX, etc.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • Next
MongoDB Logo MongoDB