metadata extraction tool free download

Showing 36 open source projects for "metadata extraction tool"

1

Trafilatura

Python & command-line tool to gather text on the Web

Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text-processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims to stay handy and modular: no database is required, and the output can be converted to various commonly used formats.

Downloads: 0 This Week

Last Update: 2024-12-03
See Project
2

dude uncomplicated data extraction

dude uncomplicated data extraction: A simple framework

Dude is a very simple framework for writing web scrapers using Python decorators. Its design, inspired by Flask, aims to let you build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax. Dude is currently in Pre-Alpha. Please expect breaking changes. You can run your scraper from the terminal by passing URLs, an output filename of your choice and the paths to your Python scripts to the dude scrape command.

Downloads: 0 This Week

Last Update: 2024-03-02
See Project
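The decorator-based, Flask-like style described for Dude can be illustrated with a small standard-library sketch. This is not Dude's API: the names select, scrape and _HANDLERS are hypothetical, and regular expressions stand in for the selectors a real scraping framework would use.

```python
import re

# Hypothetical registry of (compiled pattern, handler) pairs, filled by @select.
_HANDLERS = []


def select(pattern):
    """Register the decorated function as the handler for a regex pattern."""
    compiled = re.compile(pattern)

    def decorator(func):
        _HANDLERS.append((compiled, func))
        return func  # leave the function usable on its own

    return decorator


@select(r'<a\s+href="([^"]+)"')
def link(match):
    # Called once per anchor tag found in the page.
    return {"url": match.group(1)}


@select(r"<h1>([^<]+)</h1>")
def heading(match):
    # Called once per top-level heading found in the page.
    return {"heading": match.group(1).strip()}


def scrape(html):
    """Run every registered handler over the HTML, in registration order."""
    results = []
    for compiled, func in _HANDLERS:
        for match in compiled.finditer(html):
            results.append(func(match))
    return results


if __name__ == "__main__":
    page = '<h1> News </h1><a href="/a">A</a><a href="/b">B</a>'
    print(scrape(page))
```

The point of the pattern is that each extraction rule is a plain function, and registering it takes one decorator line, much as Flask registers routes.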
3

KaraKeep

A self-hostable bookmark-everything app

...Automatic fetching of link titles, descriptions, and images streamlines saving content without manual edits, while rule-based management lets users define customized workflows. With support for image OCR and structured data extraction, Karakeep functions as a flexible personal knowledge base for researchers, content creators, and heavy bookmarkers.

Downloads: 0 This Week

Last Update: 2026-02-22
See Project
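Fetching a link's title, description and image, as described for Karakeep above, generally comes down to reading a page's title and meta tags. The following is a minimal standard-library sketch of that idea, not Karakeep's implementation; the class name, function name and choice of fields are assumptions made for illustration.

```python
from html.parser import HTMLParser


class PageMetadataParser(HTMLParser):
    """Collect the <title> text and a few <meta> tags from an HTML page."""

    # Assumed set of fields of interest; real tools typically read many more.
    WANTED = {"description", "og:title", "og:description", "og:image"}

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
            self._title_parts = []
        elif tag == "meta":
            # Open Graph tags use "property"; classic tags use "name".
            key = attrs.get("property") or attrs.get("name")
            content = attrs.get("content")
            if key in self.WANTED and content:
                self.metadata.setdefault(key, content.strip())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
            title = "".join(self._title_parts).strip()
            if title:
                # Keep the first title seen in the document.
                self.metadata.setdefault("title", title)

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_metadata(html):
    """Return a dict of the metadata fields found in the HTML string."""
    parser = PageMetadataParser()
    parser.feed(html)
    parser.close()
    return parser.metadata


SAMPLE = """<html><head>
<title> Example page </title>
<meta name="description" content="A short summary.">
<meta property="og:image" content="https://example.com/cover.png">
</head><body><p>Hello</p></body></html>"""

if __name__ == "__main__":
    print(extract_metadata(SAMPLE))
```

A production tool would also need to download the page, handle character encodings and fall back to other sources when these tags are missing.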
4

Bili23 Downloader

Cross platform GUI tool for downloading videos from Bilibili sites

Bili23-Downloader is an open source desktop application designed for downloading video content from the Bilibili platform. It provides a graphical interface that allows users to download various types of media including user-uploaded videos, series episodes, movies, and other hosted content. It focuses on ease of use with a zero-configuration setup, making it accessible to both beginners and experienced users. It supports high performance downloads through multi-threading and includes resume...

Downloads: 3 This Week

Last Update: 2026-03-10
See Project
5

news-please

Python tool for crawling and extracting structured data from news sites

news-please is an open source news crawler and information extraction tool designed to collect and structure articles from online news websites. It provides an integrated pipeline that crawls news sites, retrieves article pages, and extracts structured information such as headlines, authors, publication dates, and article text. news-please can recursively follow internal links and read RSS feeds to gather both recent and archived articles from a news outlet when given only the root URL of a site. ...

Downloads: 0 This Week

Last Update: 3 days ago
See Project
6

MDCx

Movie metadata scraper and organizer for media libraries and NFO

MDCx is an open source media metadata scraping and organization tool designed to automate the process of collecting detailed information for movie files. It retrieves metadata from multiple online sources and applies it to local media collections, helping users maintain structured and well-organized libraries. MDCx can download information such as titles, cast data, artwork, and other metadata, then generate standardized NFO files compatible with media management systems. ...

Downloads: 3 This Week

Last Update: 2026-03-10
See Project
7

CommunityScrapers

This is a public repository containing scrapers

Stash Community Scrapers is a large open-source collection of metadata extraction tools designed to work with the Stash media management platform, enabling automated scraping of content information from various online sources. The repository contains hundreds of scraper definitions written primarily in YAML and Python, each tailored to extract structured metadata such as titles, performers, tags, and media details from specific websites.

Downloads: 0 This Week

Last Update: 5 days ago
See Project
8

newspaper4k

Python library for scraping and analyzing online news articles easily

...It is a continuation and active fork of the original newspaper3k library, which had stopped receiving updates, with the goal of keeping the ecosystem maintained while adding improvements and bug fixes. It provides developers with tools to automatically download web pages, extract the main article content, and collect associated metadata such as titles, authors, images, and publication dates. Newspaper4k also includes natural language processing capabilities that can generate summaries and identify keywords from extracted article text. Newspaper4k supports both single-article extraction and full news site processing, allowing users to build sources representing entire publications and iterate through their articles. ...

Downloads: 0 This Week

Last Update: 2026-03-11
See Project
9

WebHarvest - web data extraction tool

Web data extraction (web data mining, web scraping) tool. It leverages well-proven XML and text processing technologies to easily extract useful data from arbitrary web pages.

14 Reviews

Downloads: 3 This Week

Last Update: 2025-10-27
See Project
10

diskover-community

Open source file indexing & storage analytics powered by Elasticsearch

Diskover Community Edition is an open source file system indexing and storage analytics platform designed to help organizations understand and manage large volumes of file data. It crawls file systems and indexes metadata using Elasticsearch, enabling fast search, analysis, and organization of files stored across different storage systems. It allows administrators and users to explore file structures, monitor storage usage, and gain insights into how data is distributed across infrastructure. By indexing file metadata from sources such as local file systems, network shares like NFS and SMB, and cloud storage, the tool provides a centralized way to analyze heterogeneous storage environments. ...

Downloads: 0 This Week

Last Update: 2026-03-11
See Project
11

watercrawl

AI-ready web crawler that extracts and structures website content

...WaterCrawl supports customizable extraction rules so users can focus only on relevant elements while ignoring unnecessary page components. WaterCrawl also offers real-time monitoring capabilities, allowing users to track crawling progress, performance metrics, and errors during large data collection jobs. Developers can integrate the tool into applications through a REST API and multiple client SDKs, enabling automated data pipelines and AI data preparation workflows.

Downloads: 0 This Week

Last Update: 2026-03-11
See Project
12

CyberScraper 2077

A Powerful web scraper powered by LLM | OpenAI, Gemini & Ollama

CyberScraper 2077 is not just another web scraping tool – it's a glimpse into the future of data extraction. Born from the neon-lit streets of a cyberpunk world, this AI-powered scraper uses OpenAI, Gemini and LocalLLM Models to slice through the web's defenses, extracting the data you need with unparalleled precision and style.

Downloads: 7 This Week

Last Update: 2026-01-20
See Project
13

HeadlessX

The undetected self-hosted browser automation platform

...It is built using modern technologies including Node.js, Next.js, TypeScript, and Playwright, and uses a specialized browser engine called Camoufox based on Firefox. One of the platform’s goals is to bypass common bot-detection systems by implementing advanced fingerprint spoofing and stealth techniques. The tool can perform tasks such as HTML extraction, screenshot generation, content parsing, and search result scraping while appearing like a normal user browser. Because it is self-hosted, organizations can run the platform on their own infrastructure to maintain privacy and control over automation workflows.

Downloads: 0 This Week

Last Update: 2026-03-25
See Project
14

Weibo Crawler

Python crawler for collecting and downloading Sina Weibo user data

weibo-crawler is a Python-based data collection tool designed to retrieve information from Sina Weibo user accounts. It automates the process of gathering posts, user profile details, and engagement metrics from one or more target accounts. weibo-crawler can extract comprehensive information about users, including profile attributes such as nickname, follower count, following count, and account metadata.

Downloads: 1 This Week

Last Update: 3 days ago
See Project
15

Symfony Panther

A browser testing and web crawling library for PHP and Symfony

Symfony Panther is a browser testing and web scraping tool that allows developers to interact with websites programmatically. It uses headless Chrome or Firefox to automate browser tasks, making it suitable for end-to-end testing and data extraction. Panther integrates well with Symfony and PHPUnit, allowing developers to write comprehensive tests for web applications.

Downloads: 0 This Week

Last Update: 2026-01-08
See Project
16

Social-Analyzer

API, CLI, and Web App for analyzing and finding a person's profile

Social Analyzer is an open source OSINT tool that helps investigators discover and analyze a person’s presence across a very large number of social media platforms. It provides a unified API, CLI, and web interface capable of scanning hundreds or thousands of sites for username matches and related metadata. The project includes modular detection and analysis components that users can enable depending on their investigative needs.

Downloads: 7 This Week

Last Update: 2026-03-02
See Project
17

Interactsh

An OOB interaction gathering server and client library

Interactsh is an open-source tool for detecting out-of-band interactions, designed to find vulnerabilities that cause external interactions. The Interactsh CLI client requires go1.17+ to install successfully. Running interactsh-client with the -sf, -session-file flag stores and reads the current session information in a user-defined file, which is useful for resuming the same session to poll the interactions even after the client has been stopped or closed. Running the interactsh-client in...

Downloads: 2 This Week

Last Update: 2026-03-10
See Project
18

videodl

Lightweight Python tool for downloading videos from many platforms

Videodl is a lightweight video downloader implemented entirely in Python that allows users to retrieve videos from a wide range of online media platforms. It focuses on providing a fast and simple way to parse video pages and download media files, often prioritizing high-definition versions without watermarks when available. It supports numerous video platforms across both Chinese and international streaming ecosystems, enabling users to fetch content from many popular services through a...

Downloads: 10 This Week

Last Update: 2 days ago
See Project
19

ffsend

Easily and securely share files from the command line

Easily and securely share files and directories from the command line through a safe, private and encrypted link using a single simple command. Files are shared using the Send service and may be up to 1GB. Others are able to download these files with this tool, or through their web browser. All files are always encrypted on the client, and secrets are never shared with the remote host. An optional password may be specified, and a default file lifetime of 1 (up to 20) download or 24 hours is...

Downloads: 3 This Week

Last Update: 2025-02-04
See Project
20

Browserless

The headless Chrome/Chromium driver on top of Puppeteer

Browserless is an open-source headless browser automation library and service built on top of Puppeteer that simplifies the process of running and scaling Chromium-based browser tasks in production environments. It provides a high-level API for interacting with headless Chrome, allowing developers to perform operations such as generating PDFs, capturing screenshots, extracting text or HTML, and automating web navigation. The project is designed to act as a production-ready abstraction layer...

Downloads: 0 This Week

Last Update: 2026-03-17
See Project
21

S3cmd

Command line tool for managing Amazon S3 and CloudFront services

...Lots of features and options have been added to S3cmd since its very first release in 2008. We recently counted more than 60 command-line options, including multipart uploads, encryption, incremental backup, s3 sync, ACL and metadata management, S3 bucket size, bucket policies, and more!

Downloads: 2 This Week

Last Update: 2023-12-12
See Project
22

mlscraper

ML-based HTML scraper that learns extraction rules from examples

...This approach simplifies web scraping tasks by shifting the focus from rule-writing to example-based training. Internally, the project processes HTML documents, identifies relevant elements in the DOM, and builds extraction logic based on statistical or heuristic analysis of the training samples. The result is a developer-oriented tool that aims to automate common scraping workflows.

Downloads: 2 This Week

Last Update: 1 day ago
See Project
23

RED HAWK

All-in-one reconnaissance and vulnerability scanning toolkit for sites

...RED HAWK includes utilities for performing DNS lookups, port scans, subdomain discovery, and reverse IP analysis, giving users a comprehensive view of a target environment. In addition to vulnerability detection, RED HAWK offers crawling features that gather links and metadata from websites to support deeper reconnaissance.

Downloads: 0 This Week

Last Update: 3 days ago
See Project
24

hordes

WordPress plugin that curates a list of links with titles, icons, and categories.

Creates a front-end submission form that allows logged-in users to add links or bookmarks of their favorite websites. Just copy and paste your address into the link text box and then add a title. Hordes will do the rest. There are many options, including the ability to show all the data about your link or to show only the title and the link, which is handy for fitting more links on the page when the title and the link alone are enough to identify them. If you need to search for...

Downloads: 0 This Week

Last Update: 2020-08-13
See Project
25

LymPHOS2

LymPHOS2 Web-App

LymPHOS2 is a web-based Application at www.LymPHOS.org containing peptidic and protein sequences and spectrometric information on the PhosphoProteome of human T-Lymphocytes. - Nguyen, TD., Vidal-Cortes, O., Gallardo, Ó., Abian, J., Carrascal, M., LymPHOS 2.0: an update of a phosphosite database of primary human T cells. Database 2015, 2015. DOI: 10.1093/database/bav115 - Carrascal, M., Ovelleiro, D., Casas, V., Gay, M., Abian, J., Phosphorylation analysis of primary human T lymphocytes...

1 Review

Downloads: 0 This Week

Last Update: 2020-07-03
See Project