Search Results for "docx metadata extraction"

Showing 90 open source projects for "docx metadata extraction"

View related business solutions
  • Build Securely on AWS with Proven Frameworks Icon
    Build Securely on AWS with Proven Frameworks

    Lay a foundation for success with Tested Reference Architectures developed by Fortinet’s experts. Learn more in this white paper.

    Moving to the cloud brings new challenges. How can you manage a larger attack surface while ensuring great network performance? Turn to Fortinet’s Tested Reference Architectures, blueprints for designing and securing cloud environments built by cybersecurity experts. Learn more and explore use cases in this white paper.
    Download Now
  • Go From AI Idea to AI App Fast Icon
    Go From AI Idea to AI App Fast

    One platform to build, fine-tune, and deploy ML models. No MLOps team required.

    Access Gemini 3 and 200+ models. Build chatbots, agents, or custom models with built-in monitoring and scaling.
    Try Free
  • 1
    ContextGem

    ContextGem

    ContextGem: Effortless LLM extraction from documents

    ContextGem is an open-source framework designed to simplify the extraction of structured data and insights from documents using large language models (LLMs). It provides a flexible, intuitive API that minimizes boilerplate code, enabling developers to build complex extraction workflows efficiently. ContextGem supports various document formats and integrates with multiple LLM providers, making it a versatile tool for tasks like contract analysis, anomaly detection, and information retrieval.​
    Downloads: 0 This Week
    Last Update:
    See Project
  • 2
    Extractous

    Extractous

    Fast and efficient unstructured data extraction

    Extractous is a Rust-based unstructured data extraction library focused on fast local parsing of documents and other content-heavy files. Its purpose is to extract text and metadata efficiently from formats such as PDF, Word, HTML, email archives, images, and more, without depending on external APIs or separate parsing servers. The project emphasizes performance and low memory usage, and its maintainers describe it as a local-first alternative to heavier extraction stacks. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 3
    MegaParse

    MegaParse

    File Parser optimised for LLM Ingestion with no loss

    MegaParse is a file parser optimized for Large Language Model (LLM) ingestion, ensuring no loss of information. It efficiently parses various document formats, such as PDFs, DOCX, and PPTX, converting them into formats ideal for processing by LLMs. This tool is essential for applications that require accurate and comprehensive data extraction from diverse document types.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 4
    OpenDataLoader PDF

    OpenDataLoader PDF

    PDF Parser for AI-ready data. Automate PDF accessibility

    OpenDataLoader PDF is an open-source document processing system designed to convert complex PDF files into structured, AI-ready formats such as Markdown, JSON, and HTML while preserving layout, hierarchy, and semantic meaning. It focuses on enabling downstream use cases like retrieval-augmented generation (RAG), knowledge extraction, and document intelligence pipelines by maintaining accurate reading order and spatial metadata through bounding boxes. The tool combines deterministic parsing methods with an optional hybrid AI-powered mode that improves extraction quality for difficult layouts such as multi-column documents, scanned files, and scientific papers. ...
    Downloads: 7 This Week
    Last Update:
    See Project
  • Enterprise-grade ITSM, for every business Icon
    Enterprise-grade ITSM, for every business

    Give your IT, operations, and business teams the ability to deliver exceptional services—without the complexity.

    Freshservice is an intuitive, AI-powered platform that helps IT, operations, and business teams deliver exceptional service without the usual complexity. Automate repetitive tasks, resolve issues faster, and provide seamless support across the organization. From managing incidents and assets to driving smarter decisions, Freshservice makes it easy to stay efficient and scale with confidence.
    Try it Free
  • 5
    PDFPatcher

    PDFPatcher

    A versatile toolkit for PDF manipulation

    PDFPatcher (aka “PDF补丁丁”) is a versatile toolkit for PDF manipulation—editing document metadata, bookmarks, page layout, content restrictions, rotation, compression, merging/splitting, image extraction, and more, all within an intuitive interface. Merge/split PDFs or images, preserve or add bookmarks, and set page dimensions. Batch style/color/target changes, regex/XPath search/replace, mid‑page positioning. Modify PDF metadata, page numbers, links, initial view mode, and remove open actions.
    Downloads: 57 This Week
    Last Update:
    See Project
  • 6
    nhentai

    nhentai

    A library for interacting with the nhentai API

    nhentai is a JavaScript and TypeScript library designed to interact with the nhentai API and retrieve doujinshi metadata and content information. It enables developers to programmatically access galleries, titles, tags, covers, and page URLs from the nhentai platform. The library supports both CommonJS and ES6 module imports, making it easy to integrate into different Node.js projects. Developers can use it to fetch specific doujin entries, explore associated metadata, and process gallery...
    Downloads: 57 This Week
    Last Update:
    See Project
  • 7
    Portable Executable Parser

    Portable Executable Parser

    lightweight Go package to parse, analyze and extract metadata

    Saferwall PE is a lightweight Go package for parsing, analyzing, and extracting metadata from Portable Executable (PE) binaries. Designed with malware analysis in mind, it is robust against malformed PE files and provides detailed insights into executable structures.​
    Downloads: 0 This Week
    Last Update:
    See Project
  • 8
    GROBID

    GROBID

    A machine learning software for extracting information

    ...All the usual publication metadata are covered (including DOI, PMID, etc.).
    Downloads: 3 This Week
    Last Update:
    See Project
  • 9
    KaraKeep

    KaraKeep

    A self-hostable bookmark-everything app

    ...Automatic fetching of link titles, descriptions, and images streamlines saving content without manual edits, while rule-based management lets users define customized workflows. With support for image OCR and structured data extraction, Karakeep functions as a flexible personal knowledge base for researchers, content creators, and heavy bookmarkers.
    Downloads: 2 This Week
    Last Update:
    See Project
  • AI-generated apps that pass security review Icon
    AI-generated apps that pass security review

    Stop waiting on engineering. Build production-ready internal tools with AI—on your company data, in your cloud.

    Retool lets you generate dashboards, admin panels, and workflows directly on your data. Type something like “Build me a revenue dashboard on my Stripe data” and get a working app with security, permissions, and compliance built in from day one. Whether on our cloud or self-hosted, create the internal software your team needs without compromising enterprise standards or control.
    Try Retool free
  • 10
    MinerU

    MinerU

    A high-quality tool for convert PDF to Markdown and JSON

    MinerU is an open-source, high-quality document extraction toolkit focused on converting PDFs (and other document formats) into structured Markdown and JSON. It leverages OCR and layout analysis to preserve semantic structure and metadata, ideal for research and data science workflows.
    Downloads: 10 This Week
    Last Update:
    See Project
  • 11
    NeMo Retriever Library

    NeMo Retriever Library

    Document content and metadata extraction microservice

    NeMo Retriever Library is a scalable microservice framework designed for extracting, structuring, and enriching content from documents to support downstream generative AI applications. It processes various document types by splitting them into components such as text, tables, charts, and images, and then applies OCR and contextual analysis to convert them into structured data formats. The system is built on NVIDIA NIM microservices, enabling high-performance parallel processing and efficient...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 12
    Zotero

    Zotero

    Tool to help you collect, organize, annotate, cite, and share research

    Zotero is a powerful, free, open-source research management application designed to help students, academics, and professionals collect, organize, annotate, cite, and share research sources and materials for papers, projects, or books. It can save web pages, PDFs, books, articles, and more with metadata, automatically extract bibliographic information, and organize items into collections and tag systems, while supporting notes and annotations directly alongside references. Zotero’s interface...
    Downloads: 18 This Week
    Last Update:
    See Project
  • 13
    Khazix Skills

    Khazix Skills

    Digital Life Kazik Open Source AI Skills Collection

    Khazix Skills project is an automation framework designed to transform GitHub repositories into structured, reusable AI agent skills. It acts as a pipeline that analyzes a repository’s metadata, extracts relevant information such as README content and commit hashes, and converts it into a standardized skill format that can be integrated into agent ecosystems. The system emphasizes lifecycle management by embedding versioning, traceability, and metadata directly into generated skill files,...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 14
    Docspell

    Docspell

    Assist in organizing your piles of documents

    Docspell is a personal document organizer. Or sometimes called a "Document Management System" (DMS). You'll need a scanner to convert your papers into files. Docspell can then assist in organizing the resulting mess. It can unify your files from scanners, emails, and other sources. It is targeted for home use, i.e. families, households, and also for smaller groups/companies. You can associate tags, set correspondent,s and lots of other predefined and custom metadata. If your documents are...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    Trafilatura

    Trafilatura

    Python & command-line tool to gather text on the Web

    Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text-processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to various commonly used formats. Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the noise caused by recurring elements (headers, footers, links/blogroll etc.) and second by including information such as author and date in order to make sense of the data. ...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 16
    Instaloader

    Instaloader

    Download pictures (or videos) along with their captions

    Instaloader is a mature open-source utility for downloading and archiving Instagram content along with rich metadata. It enables users to retrieve posts, stories, reels, highlights, profile pictures, and associated information such as captions, comments, timestamps, and geotags. The tool supports both public and permitted private content when proper authentication is provided, making it useful for research, digital archiving, and social media analysis. Instaloader can be run as a simple...
    Downloads: 24 This Week
    Last Update:
    See Project
  • 17
    CommunityScrapers

    CommunityScrapers

    This is a public repository containing scrapers

    Stash Community Scrapers is a large open-source collection of metadata extraction tools designed to work with the Stash media management platform, enabling automated scraping of content information from various online sources. The repository contains hundreds of scraper definitions written primarily in YAML and Python, each tailored to extract structured metadata such as titles, performers, tags, and media details from specific websites.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 18
    newspaper4k

    newspaper4k

    Python library for scraping and analyzing online news articles easily

    ...It is a continuation and active fork of the original newspaper3k library, which had stopped receiving updates, with the goal of keeping the ecosystem maintained while adding improvements and bug fixes. It provides developers with tools to automatically download web pages, extract the main article content, and collect associated metadata such as titles, authors, images, and publication dates. Newspaper4k also includes natural language processing capabilities that can generate summaries and identify keywords from extracted article text. Newspaper4k supports both single-article extraction and full news site processing, allowing users to build sources representing entire publications and iterate through their articles. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19
    X-osint

    X-osint

    Open source OSINT tool for gathering data on emails, phones, and IPs

    X-osint is an open source intelligence framework designed to collect and analyze publicly available information from multiple sources. It focuses on gathering useful and credible data about entities such as phone numbers, email addresses, and IP addresses using a range of automated OSINT techniques. It provides investigators and researchers with a centralized interface for running information-gathering tasks that would normally require multiple separate tools. X-osint can also perform...
    Downloads: 42 This Week
    Last Update:
    See Project
  • 20
    Dungbeetle

    Dungbeetle

    A distributed job server

    Dungbeetle is a metadata and data lineage tracking tool developed by Zerodha to map and visualize how data flows across systems. It helps teams maintain data transparency by tracking dependencies between databases, tables, and reports, offering a centralized view of data pipelines. Dungbeetle is designed to enhance observability and trust in analytics ecosystems.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    Bili23 Downloader

    Bili23 Downloader

    Cross platform GUI tool for downloading videos from Bilibili sites

    Bili23-Downloader is an open source desktop application designed for downloading video content from the Bilibili platform. It provides a graphical interface that allows users to download various types of media including user-uploaded videos, series episodes, movies, and other hosted content. It focuses on ease of use with a zero-configuration setup, making it accessible to both beginners and experienced users. It supports high performance downloads through multi-threading and includes resume...
    Downloads: 12 This Week
    Last Update:
    See Project
  • 22
    dude uncomplicated data extraction

    dude uncomplicated data extraction

    dude uncomplicated data extraction: A simple framework

    Dude is a very simple framework for writing web scrapers using Python decorators. The design, inspired by Flask, was to easily build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax. Dude is currently in Pre-Alpha. Please expect breaking changes. You can run your scraper from terminal/shell/command-line by supplying URLs, the output filename of your choice and the paths to your python scripts to dude scrape command.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 23
    Markdownify MCP Server

    Markdownify MCP Server

    Convert files and web content into clean, usable Markdown easily

    Markdownify MCP is a Model Context Protocol server that converts many types of files and web content into clean Markdown. It supports formats such as PDFs, images, audio with transcription, DOCX, XLSX, and PPTX, along with web sources like YouTube transcripts, Bing results, and general webpages. Markdownify MCP is designed to simplify content extraction and make data easier to read, share, and reuse in structured workflows. Developers can install dependencies, build, and run the server locally, then extend functionality by modifying its TypeScript-based tools and server logic. ...
    Downloads: 4 This Week
    Last Update:
    See Project
  • 24
    CL4R1T4S

    CL4R1T4S

    Archive of leaked AI system prompts and internal instruction sets

    CL4R1T4S is a public repository that collects and archives extracted system prompts, internal guidelines, and behavioral instructions used by various artificial intelligence models and agents. Its stated goal is to promote transparency by documenting the hidden prompt scaffolding that shapes how AI systems behave and respond to users. CL4R1T4S organizes these materials by company or product, with directories containing prompt files and related instructions for many well-known AI systems....
    Downloads: 3 This Week
    Last Update:
    See Project
  • 25
    paperless-gpt

    paperless-gpt

    Use LLMs and LLM Vision (OCR) to handle paperless-ngx

    paperless-gpt is an AI-powered extension for document management systems that enhances the capabilities of paperless-ngx by integrating large language models and vision-based OCR to automate document processing and organization. It is designed to transform scanned or uploaded documents into structured, searchable, and intelligently categorized data without requiring manual tagging or sorting. The system uses OCR combined with LLM reasoning to extract text, classify documents, and generate...
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • 3
  • 4
  • Next
MongoDB Logo MongoDB