Showing 13 open source projects for "data extraction"

View related business solutions
  • Gemini 3 and 200+ AI Models on One Platform Icon
    Gemini 3 and 200+ AI Models on One Platform

    Access Google's best plus Claude, Llama, and Gemma. Fine-tune and deploy from one console.

    Build generative AI apps with Vertex AI. Switch between models without switching platforms.
    Start Free
  • Powerful App Monitoring Without Surprise Bills Icon
    Powerful App Monitoring Without Surprise Bills

    AppSignal starts at $23/month with all features included. No overages, no hidden fees. 30-day free trial.

    Tired of monitoring tools that punish you for scaling? AppSignal offers transparent, predictable pricing with every feature unlocked on every plan. Track errors, monitor performance, detect anomalies, and manage logs across Ruby, Python, Node.js, and more. Trusted by developers since 2012 with free dev-to-dev support. No credit card required to start your 30-day trial.
    Try AppSignal Free
  • 1
    zpdf

    zpdf

    Zero-copy PDF text extraction library written in Zig

    zpdf is a high-performance PDF text extraction library written in Zig that focuses on speed, low overhead, and modern parsing techniques. It leans heavily on memory-mapped file reading and zero-copy patterns where possible, so it can scan large PDFs without repeatedly copying data around in memory. The library supports streaming extraction using efficient arena allocation, making it well suited for workloads that need to process big documents quickly or in batches. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 2
    Python-Spider

    Python-Spider

    Python3 web crawler practice

    Python-Spider is a repository intended to teach or provide examples for writing web spiders / crawlers in Python — part of a broader learning and resource collection by its author. The code and documentation are oriented toward beginners or intermediate learners who want to learn how to fetch, parse, and extract data from websites programmatically. As part of the author’s public learning-path repositories, python-spider likely includes examples of HTTP requests, HTML parsing, maybe...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 3
    Scrapy

    Scrapy

    A fast, high-level web crawling and web scraping framework

    Scrapy is a fast, open source, high-level framework for crawling websites and extracting structured data from these websites. Portable and written in Python, it can run on Windows, Linux, macOS and BSD. Scrapy is powerful, fast and simple, and also easily extensible. Simply write the rules to extract the data, and add new functionality if you wish without having to touch the core. Scrapy does the rest, and can be used in a number of applications. It can be used for data mining, monitoring...
    Downloads: 23 This Week
    Last Update:
    See Project
  • 4
    Parsera

    Parsera

    Lightweight library for scraping web-sites with LLMs

    Scrape data from any website with only a link and column descriptions. Parsera is a tool designed to scrape web content, specifically handling poorly structured or messy websites.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Our Free Plans just got better! | Auth0 Icon
    Our Free Plans just got better! | Auth0

    With up to 25k MAUs and unlimited Okta connections, our Free Plan lets you focus on what you do best—building great apps.

    You asked, we delivered! Auth0 is excited to expand our Free and Paid plans to include more options so you can focus on building, deploying, and scaling applications without having to worry about your security. Auth0 now, thank yourself later.
    Try free now
  • 5
    DrissionPage

    DrissionPage

    Python based web automation tool. Powerful and elegant

    DrissionPage is a Python-based automation framework that blends the capabilities of Selenium for browser automation with Requests-HTML for fast, headless web data extraction. It enables seamless switching between browser-controlled and headless HTTP sessions within the same interface. Ideal for web scraping, testing, and automation, DrissionPage is lightweight and highly efficient, offering more flexibility than standard Selenium or Requests usage alone.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 6
    Prompt Engineering Interactive Tutorial

    Prompt Engineering Interactive Tutorial

    Anthropic's Interactive Prompt Engineering Tutorial

    Prompt-eng-interactive-tutorial is a comprehensive, hands-on tutorial that teaches the craft of prompt engineering with Claude through guided, executable lessons. It starts with the anatomy of a good prompt and moves into techniques that deliver the “80/20” gains—separating instructions from data, specifying schemas, and setting evaluation criteria. The course leans heavily on realistic failure modes (ambiguity, hallucination, brittle instructions) and shows how to iteratively debug prompts the way you would debug code. Lessons include building prompts from scratch for common tasks like extraction, classification, transformation, and step-by-step reasoning, with checkpoints that let you compare your outputs against solid baselines. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 7
    txtai

    txtai

    Build AI-powered semantic search applications

    ...Innovation is happening at a rapid pace, models can understand concepts in documents, audio, images and more. Machine-learning pipelines to run extractive question-answering, zero-shot labeling, transcription, translation, summarization and text extraction. Cloud-native architecture that scales out with container orchestration systems (e.g. Kubernetes). Applications range from similarity search to complex NLP-driven data extractions to generate structured databases. The following applications are powered by txtai.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 8
    RAG Anything

    RAG Anything

    RAG-Anything: All-in-One RAG Framework

    RAG-Anything is an open-source unified framework that extends the Retrieval-Augmented Generation (RAG) paradigm to fully multimodal document and knowledge retrieval, enabling systems to ingest, parse, represent, and query rich content that includes text, images, tables, formulas, and other structured or visual elements. Traditional RAG systems are typically limited to text and cannot effectively work across heterogeneous document layouts, but RAG-Anything addresses this by modeling...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 9
    LangExtract

    LangExtract

    A Python library for extracting structured information

    ...The system excels at handling long documents using optimized chunking, multi-pass extraction, and parallel processing to ensure both high recall and structured consistency.
    Downloads: 0 This Week
    Last Update:
    See Project
  • MongoDB Atlas runs apps anywhere Icon
    MongoDB Atlas runs apps anywhere

    Deploy in 115+ regions with the modern database for every enterprise.

    MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
    Start Free
  • 10
    CNN for Image Retrieval
    cnn-for-image-retrieval is a research-oriented project that demonstrates the use of convolutional neural networks (CNNs) for image retrieval tasks. The repository provides implementations of CNN-based methods to extract feature representations from images and use them for similarity-based retrieval. It focuses on applying deep learning techniques to improve upon traditional handcrafted descriptors by learning features directly from data. The code includes training and evaluation scripts that...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 11
    pyhanlp

    pyhanlp

    Chinese participle

    ...It is commonly used for Chinese-language NLP tasks where you want production-grade tokenization and linguistic analysis, but still want the convenience of Python scripting. The project focuses on making HanLP’s capabilities accessible through a Python-friendly API surface, so you can integrate NLP steps into data pipelines, notebooks, and downstream ML or information-extraction code. In practice, it serves as a bridge layer: Python calls are translated into the corresponding HanLP operations, so you can keep your application logic in Python while relying on HanLP’s implementations. It is especially useful when you need a pragmatic “get results quickly” NLP layer for segmentation, tagging, entity extraction, parsing, or keyword-style tasks rather than experimenting with model training from scratch.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 12
    TextBlob

    TextBlob

    TextBlob is a Python library for processing textual data

    Simple, Pythonic, text processing, Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more. TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both. Supports word inflection (pluralization and singularization) and lemmatization,...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 13
    Lioness (Languages Interop Framework)
    Framework for making Windows applications that are one .exe file in AutoHotKey_L,C++,C#, VB.NET,Java,Groovy,Common Lisp,Nemerle,Ruby,Python,PHP,Lua,Tcl,Perl,Jint,S#,WSH VBScript,HTML/JavaScript/CSS,COM, PowerShell without compiling . For .NET 4.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • Next
MongoDB Logo MongoDB