data extraction free download

Showing 20 open source projects for "data extraction"

1

Hacks

A collection of hacks and one-off scripts

Hacks is a collection of experimental scripts, utilities, and one-off tools created to solve specific problems in security research, data processing, and automation. Rather than being a single cohesive application, it serves as a repository of practical command-line tools that can be used independently or combined into workflows. The scripts cover a wide range of tasks, including URL manipulation, parameter replacement, data extraction, and reconnaissance automation. ...

Downloads: 4 This Week

Last Update: 2 days ago
2

X-Crawl

Flexible Node.js AI-assisted crawler library

A high-performance web crawling and scraping framework for Node.js, designed for large-scale data extraction.

Downloads: 0 This Week

Last Update: 2025-04-06
3

zpdf

Zero-copy PDF text extraction library written in Zig

zpdf is a high-performance PDF text extraction library written in Zig that focuses on speed, low overhead, and modern parsing techniques. It leans heavily on memory-mapped file reading and zero-copy patterns where possible, so it can scan large PDFs without repeatedly copying data around in memory. The library supports streaming extraction using efficient arena allocation, making it well suited for workloads that need to process big documents quickly or in batches. ...

Downloads: 1 This Week

Last Update: 2026-02-01
4

Symfony DomCrawler

Eases DOM navigation for HTML and XML documents

Symfony DomCrawler is a PHP component that provides powerful tools for navigating and extracting data from HTML and XML documents. It allows developers to parse, filter, and manipulate web pages using CSS selectors and XPath expressions. DomCrawler is widely used for web scraping, testing, and processing structured content, and integrates well with other Symfony components like BrowserKit.

Downloads: 0 This Week

Last Update: 2026-02-26
5

Python-Spider

Python3 web crawler practice

Python-Spider is a repository intended to teach or provide examples for writing web spiders / crawlers in Python — part of a broader learning and resource collection by its author. The code and documentation are oriented toward beginners or intermediate learners who want to learn how to fetch, parse, and extract data from websites programmatically. As part of the author’s public learning-path repositories, python-spider likely includes examples of HTTP requests, HTML parsing, maybe...

Downloads: 0 This Week

Last Update: 2025-12-08
6

Parsera

Lightweight library for scraping websites with LLMs

Scrape data from any website with only a link and column descriptions. Parsera is a tool designed to scrape web content, specifically handling poorly structured or messy websites.

Downloads: 0 This Week

Last Update: 2025-10-08
7

Integrant

Micro-framework for data-driven architecture

Integrant is a minimalistic micro-framework for building applications following a data-driven architecture. It lets you define system components declaratively as configuration data and handles lifecycle actions (init, halt, resume) in dependency order, serving as a modern alternative to Component or Mount. Integrant was built as a reaction to fix some perceived weaknesses with Component. In Component, systems are created programmatically. Constructor functions are used to build records,...

Downloads: 0 This Week

Last Update: 2025-10-02
8

Prompt Engineering Interactive Tutorial

Anthropic's Interactive Prompt Engineering Tutorial

Prompt-eng-interactive-tutorial is a comprehensive, hands-on tutorial that teaches the craft of prompt engineering with Claude through guided, executable lessons. It starts with the anatomy of a good prompt and moves into techniques that deliver the “80/20” gains—separating instructions from data, specifying schemas, and setting evaluation criteria. The course leans heavily on realistic failure modes (ambiguity, hallucination, brittle instructions) and shows how to iteratively debug prompts the way you would debug code. Lessons include building prompts from scratch for common tasks like extraction, classification, transformation, and step-by-step reasoning, with checkpoints that let you compare your outputs against solid baselines. ...

Downloads: 1 This Week

Last Update: 2025-10-06
9

Article Extractor

Extracts the main article from a given URL with Node.js

A Node.js library for extracting main content from web articles, removing unnecessary clutter like ads and navigation elements.

Downloads: 0 This Week

Last Update: 2025-09-04
10

sharp

High performance Node.js image processing module

...Colour spaces, embedded ICC profiles and alpha transparency channels are all handled correctly. Lanczos resampling ensures quality is not sacrificed for speed. As well as image resizing, operations such as rotation, extraction, compositing and gamma correction are available. Most modern macOS, Windows and Linux systems running Node.js v10+ do not require any additional install or runtime dependencies. This module supports reading JPEG, PNG, WebP, AVIF, TIFF, GIF and SVG images. Output images can be in JPEG, PNG, WebP, AVIF and TIFF formats as well as uncompressed raw pixel data. ...

Downloads: 1 This Week

Last Update: 2025-11-06
11

LangExtract

A Python library for extracting structured information

...The system excels at handling long documents using optimized chunking, multi-pass extraction, and parallel processing to ensure both high recall and structured consistency.

Downloads: 3 This Week

Last Update: 7 days ago
12

DocWire SDK

Award-winning modern data processing SDK in C++20

DocWire SDK, a standout AI-driven data processing tool written in C++20, has received an award from SourceForge and strong backing from Microsoft. It handles nearly 100 file types, empowering efficient text extraction, web data extraction, and document analysis. For businesses, the shift to DocWire SDK signifies a leap forward. It promises comprehensive document format support and the ability to extract valuable insights from email boxes, databases, and websites using cutting-edge AI. ...

Downloads: 5 This Week

Last Update: 3 days ago
13

Specter

Clojure(Script)'s missing piece

Specter is a powerful Clojure (and ClojureScript) library that revolutionizes navigation and manipulation of deeply nested and recursive data structures through a flexible, high-performance API beyond what vanilla Clojure offers. Specter has an extremely simple core, just a single abstraction called "navigator". Queries and transforms are done by composing navigators into a "path" precisely targeting what you want to retrieve or change. Navigators can be composed with any other navigators,...

Downloads: 0 This Week

Last Update: 2025-08-19
14

Exifr

The fastest and most versatile JS EXIF reading library

Exifr is a fast and very versatile JavaScript EXIF reading library that works everywhere, parses everything and handles just about anything you throw at it. It can handle any input: buffers, url, <img> tag and more; .jpg, .tif, and .heic files; and TIFF (EXIF, GPS, etc.), XMP, ICC, IPTC, JFIF segments. It skips parsing tags you don’t need, and reads only the first few bytes. There’s no need to read the whole file to see if there’s an EXIF file in it, or extract all the data when you just...

Downloads: 0 This Week

Last Update: 2022-06-29
15

Duckling (Old)

Clojure library that parses text into structured data

Duckling (the "old" archived version) is a natural language processing library in Clojure for parsing text into structured data, specifically recognizing quantities such as dates, times, durations, measurements, and currencies in free-form text. To use Duckling in your project, you need just two functions: load! to load the default configuration, and parse to parse a string. See our blog post announcement for more...

Downloads: 0 This Week

Last Update: 2025-09-24
16

Enlive

Selector-based templating and transformation system for Clojure

Enlive is a Clojure library for HTML templating, transformation, and scraping, supporting composable manipulation of HTML/XML in a functional style. It allows selecting, transforming, and generating HTML fragments using CSS selectors, and supports server-side template composition, dynamic pages, and content rewriting. By default selector-transformation pairs are run sequentially. When you know that several transformations are independent, you can now specify (as an optimization) to process...

Downloads: 0 This Week

Last Update: 2025-09-24
17

iText®, a JAVA PDF library

PDF Library for Developers

iText is an open-source PDF library available for Java and .NET (C#). iText allows you to effortlessly generate and manipulate standards-compliant PDF documents with a powerful and feature-rich SDK. With iText, you can create archivable and accessible PDFs, split and merge documents, fill and flatten forms, digitally sign documents, and more. iText add-ons enable additional functionality, such as PDF creation from HTML templates, secure redaction, OCR, and much more. The latest...

Downloads: 184 This Week

Last Update: 2024-06-01
18

Deeplearning-papernotes

Summaries and notes on Deep Learning research papers

Deeplearning-papernotes is an implementation of Convolutional Neural Networks for sentence and text classification in TensorFlow, based on a well-known research paper that applies CNN architectures to natural language processing tasks with strong performance in sentiment analysis and similar classification problems. The repository provides the complete network definition, including an embedding layer to convert words into dense representations, convolution and max-pooling layers to extract...

Downloads: 0 This Week

Last Update: 2026-02-12
19

CMIS Input plugin for Pentaho

Allows querying Content Management Systems that use the CMIS standard.

...Imagine using the extracted information for statistical purposes, for creating reports and, more generally, for analysing your document archives in ways not possible with the tools currently available. All this is possible within the Pentaho Suite, the open source business intelligence platform for extracting and analysing structured and semi-structured data. The CMIS Input plugin for Pentaho Data Integration (Kettle) was designed and developed with this goal in mind: it allows querying Content Management Systems that use the CMIS interoperability standard. Once extracted, the data can be stored, analysed, and presented in customized reports published in various formats for the end user (PDF, Excel, etc.).

Downloads: 0 This Week

Last Update: 2014-11-09
20

TextBlob

TextBlob is a Python library for processing textual data

Simple, Pythonic text processing: sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more. TextBlob provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and translation. TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both. Supports word inflection (pluralization and singularization) and lemmatization,...

Downloads: 0 This Week

Last Update: 2021-07-23