Search Results for "data extraction" - Page 4

Showing 386 open source projects for "data extraction"

1

LLM Scraper

Extract structured data from webpages using LLM-powered scraping

LLM Scraper is a TypeScript library designed to extract structured data from webpages using large language models. Instead of relying on fragile HTML selectors or manual parsing rules, the tool interprets webpage content with language models and converts it into structured data according to a defined schema. Developers can specify the data structure using tools such as Zod or JSON Schema, enabling the model to extract relevant information directly into typed objects. LLM Scraper integrates...

Downloads: 3 This Week

Last Update: 7 hours ago
See Project
2

GenAIScript

Automatable GenAI Scripting

JavaScript-ish environment with convenient tooling for file ingestion, prompt development, and structured data extraction. A Microsoft tool that generates AI-powered text based on prompts, useful for content creation and automation.

Downloads: 0 This Week

Last Update: 2025-09-26
See Project
3

Kor

LLM

This is a half-baked prototype that “helps” you extract structured data from text using LLMs. Specify the schema of what should be extracted and provide some examples. Kor will generate a prompt, send it to the specified LLM and parse out the output. You might even get results back.

Downloads: 0 This Week

Last Update: 2024-07-20
See Project
4

AI Powered Knowledge Graph Generator

AI Powered Knowledge Graph Generator

AI-Powered Knowledge Graph is an open-source project focused on building knowledge graph systems that integrate artificial intelligence and machine learning to represent complex relationships between data entities. Knowledge graphs organize information as networks of nodes and relationships, allowing applications to analyze connections between concepts, datasets, or real-world entities. By incorporating AI techniques such as natural language processing and semantic reasoning, the project...

Downloads: 1 This Week

Last Update: 2026-03-06
See Project
5

NLP-Knowledge-Graph

Research and application of technologies such as natural language processing

...The project aims to help researchers and developers understand how structured knowledge representations can enhance language processing systems. It includes curated materials covering key topics such as knowledge graph construction, entity recognition, relation extraction, graph embeddings, and semantic reasoning. By combining NLP techniques with graph-based data models, knowledge graphs allow systems to represent complex relationships between entities and improve tasks such as question answering, information retrieval, and recommendation systems. The repository aggregates research papers, technical articles, tutorials, and open-source tools related to these areas.

Downloads: 0 This Week

Last Update: 2026-03-06
See Project
6

VLMEvalKit

Open-source evaluation toolkit of large multi-modality models (LMMs)

...The toolkit provides a unified framework that allows researchers and developers to evaluate multimodal models across a wide range of datasets and standardized benchmarks with minimal setup. Instead of requiring complex data preparation pipelines or multiple repositories for each benchmark, the system enables evaluation through simple commands that automatically handle dataset loading, model inference, and metric computation. VLMEvalKit supports generation-based evaluation methods, allowing models to produce textual responses to visual inputs while measuring performance through techniques such as exact matching or language-model-assisted answer extraction.

Downloads: 0 This Week

Last Update: 2026-03-05
See Project
7

nw_wrld

nw_wrld is an event-driven sequencer for triggering visuals

nw_wrld is a procedurally generated world-building engine tailored for game developers and interactive storytellers who want to craft rich, random yet coherent environments without hand-crafting every detail. It uses noise functions and modular terrain algorithms to generate expansive maps, diverse biomes, and layered features like rivers, mountain ranges, forests, and resource nodes. The system is designed to be extensible, letting developers plug in new generation rules or tweak parameters...

Downloads: 1 This Week

Last Update: 2026-04-02
See Project
8

Spider

High-performance Rust web crawler and scraper for large-scale data

...Spider can operate concurrently across many pages, allowing it to gather large datasets in a short period of time. Spider also provides mechanisms for subscribing to crawl events so developers can process page data such as URLs, status codes, or HTML content as it is discovered. It supports advanced capabilities such as headless browser rendering, background crawling tasks, and configurable rules that control crawl depth or ignored paths. These capabilities make the project suitable for building search indexers, data extraction pipelines, and SEO analysis tools.

Downloads: 2 This Week

Last Update: 2026-03-31
See Project
9

broom

Convert statistical analysis objects from R into tidy format

broom is part of the tidymodels ecosystem that converts statistical model outputs (e.g. from lm, glm, t.test, lme4, etc.) into tidy tibbles — standardized data frames — using functions tidy(), glance(), and augment(). These are easier to manipulate, visualize, and report programmatically.

Downloads: 0 This Week

Last Update: 2025-12-03
See Project
10

Dendrite

Tools to build web AI agents that can authenticate

Dendrite Python SDK is a toolkit for building web AI agents that can authenticate, interact with, and extract data from any website, facilitating web automation tasks.

Downloads: 0 This Week

Last Update: 2025-01-29
See Project
11

MegaParse

File Parser optimised for LLM Ingestion with no loss

...It efficiently parses various document formats, such as PDFs, DOCX, and PPTX, converting them into formats ideal for processing by LLMs. This tool is essential for applications that require accurate and comprehensive data extraction from diverse document types.

Downloads: 0 This Week

Last Update: 2025-02-14
See Project
12

DotnetSpider

Lightweight .NET framework for fast web crawling and data scraping

DotnetSpider is a web crawling and data extraction framework built on the .NET Standard platform. It is designed to help developers create efficient and scalable crawlers for collecting structured data from websites. It provides a high-level API that simplifies the process of defining spiders, managing requests, and extracting content from web pages. Developers can create custom spiders by extending base classes and configuring pipelines that handle downloading, parsing, and storing collected data. ...

Downloads: 0 This Week

Last Update: 2026-03-10
See Project
13

Obsidian Visual Skills Pack

Generate Canvas, Excalidraw, and Mermaid diagrams from text

LLM-TLDR is a Python-based tool designed to dramatically reduce the amount of code a large language model needs to read by extracting the essential structure and context from a codebase and presenting only the most relevant parts to the model. Traditional approaches often dump entire files into a model’s context, which quickly exceeds token limits; LLM-TLDR instead indexes project structure, traces dependencies, and summarizes code in a way that preserves semantic relevance while shrinking...

Downloads: 0 This Week

Last Update: 2026-02-12
See Project
14

ClimateTools.jl

Climate science package for Julia

Climate analysis tools in Julia. ClimateTools.jl is a collection of commonly used tools in climate science. Basics of climate field analysis are covered, with some forays into exploratory techniques associated with climate scenario design. The package aims to ease the typical steps of analyzing climate model outputs and gridded datasets (support for weather stations is a work in progress). Climate indices and bias correction functions are coded to leverage the use of multiple threads....

Downloads: 0 This Week

Last Update: 2026-04-22
See Project
15

Article Extractor

Extract the main article from a given URL with Node.js

A Node.js library for extracting main content from web articles, removing unnecessary clutter like ads and navigation elements.

Downloads: 0 This Week

Last Update: 2025-09-04
See Project
16

wombat

Lightweight Ruby DSL for scraping structured data from web pages

Wombat is a lightweight web crawling and scraping library written in Ruby that focuses on extracting structured data from web pages using a concise domain-specific language (DSL). It is designed to simplify the process of defining how information should be collected from HTML documents without requiring large amounts of scraping boilerplate code. Developers can declare the data fields they want and specify selectors or rules for retrieving them, allowing Wombat to parse and return structured...

Downloads: 0 This Week

Last Update: 2026-04-07
See Project
17

paperless-gpt

Use LLMs and LLM Vision (OCR) to handle paperless-ngx

paperless-gpt is an AI-powered extension for document management systems that enhances the capabilities of paperless-ngx by integrating large language models and vision-based OCR to automate document processing and organization. It is designed to transform scanned or uploaded documents into structured, searchable, and intelligently categorized data without requiring manual tagging or sorting. The system uses OCR combined with LLM reasoning to extract text, classify documents, and generate...

Downloads: 2 This Week

Last Update: 2026-03-19
See Project
18

Docling

Get your documents ready for gen AI

Docling is an open-source document processing toolkit built to prepare diverse content types for modern generative AI and data workflows. The project focuses on converting and parsing many document formats into a unified structured representation that downstream systems can easily consume. It supports advanced PDF understanding, including layout detection, table extraction, and reading order analysis, enabling high-fidelity document intelligence pipelines.

Downloads: 2 This Week

Last Update: 14 hours ago
See Project
19

HeadlessX

The undetected self-hosted browser automation platform

...The tool can perform tasks such as HTML extraction, screenshot generation, content parsing, and search result scraping while appearing like a normal user browser. Because it is self-hosted, organizations can run the platform on their own infrastructure to maintain privacy and control over automation workflows.

Downloads: 0 This Week

Last Update: 2026-03-25
See Project
20

ESPnet

End-to-end speech processing toolkit

ESPnet is a comprehensive end-to-end speech processing toolkit covering a wide spectrum of tasks, including automatic speech recognition (ASR), text-to-speech (TTS), speech translation (ST), speech enhancement, speaker diarization, and spoken language understanding. It uses PyTorch as its deep learning engine and adopts a Kaldi-style data processing pipeline for features, data formats, and experimental recipes. This combination allows researchers to leverage modern neural architectures while...

Downloads: 2 This Week

Last Update: 2026-04-22
See Project
21

DINOv3

Reference PyTorch implementation and models for DINOv3

DINOv3 is the third-generation iteration of Meta’s self-supervised visual representation learning framework, building upon the ideas from DINO and DINOv2. It continues the paradigm of learning strong image representations without labels using teacher–student distillation, but introduces a simplified and more scalable training recipe that performs well across datasets and architectures. DINOv3 removes the need for complex augmentations or momentum encoders, streamlining the pipeline while...

Downloads: 20 This Week

Last Update: 2026-03-30
See Project
22

Gooo

Toolkit for developing web applications in Vue, Templ, and Go

...The project emphasizes simplicity and flexibility, enabling users to integrate its components into scripts or larger systems. While not as feature-heavy as enterprise frameworks, it serves as a foundation for experimentation and rapid prototyping in data extraction or automation tasks. Its design reflects a developer-centric approach, prioritizing extensibility and ease of modification over polished interfaces.

Downloads: 0 This Week

Last Update: 2026-03-17
See Project
23

Dungbeetle

A distributed job server

Dungbeetle is a metadata and data lineage tracking tool developed by Zerodha to map and visualize how data flows across systems. It helps teams maintain data transparency by tracking dependencies between databases, tables, and reports, offering a centralized view of data pipelines. Dungbeetle is designed to enhance observability and trust in analytics ecosystems.

Downloads: 0 This Week

Last Update: 2025-06-11
See Project
24

NeMo Curator

Scalable data preprocessing and curation toolkit for LLMs

NeMo Curator is a Python library specifically designed for fast and scalable dataset preparation and curation for large language model (LLM) use cases such as foundation model pretraining, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT). It greatly accelerates data curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline...

Downloads: 0 This Week

Last Update: 2026-02-23
See Project
25

BrowserOS

Agentic browser; privacy-first alternative to ChatGPT Atlas

BrowserOS is an open-source, agentic web browser built on a Chromium base that integrates AI agents directly into the browsing experience. Rather than just doing standard browsing, it places AI intelligence at the core: you can connect your own API keys (e.g., for OpenAI, Anthropic, or Google Gemini) or run local models (e.g., via Ollama) so that your browsing data and automation stay on your machine; privacy and control are emphasized throughout. The interface remains familiar to users of...

Downloads: 19 This Week

Last Update: 2026-04-08
See Project