Showing 36 open source projects for "metadata extraction tool"

View related business solutions
  • Gemini 3 and 200+ AI Models on One Platform Icon
    Gemini 3 and 200+ AI Models on One Platform

    Access Google's best plus Claude, Llama, and Gemma. Fine-tune and deploy from one console.

    Build generative AI apps with Vertex AI. Switch between models without switching platforms.
    Start Free
  • Go from Code to Production URL in Seconds Icon
    Go from Code to Production URL in Seconds

    Cloud Run deploys apps in any language instantly. Scales to zero. Pay only when code runs.

    Skip the Kubernetes configs. Cloud Run handles HTTPS, scaling, and infrastructure automatically. Two million requests free per month.
    Try it free
  • 1
    Trafilatura

    Trafilatura

    Python & command-line tool to gather text on the Web

    Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text-processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to various commonly used formats.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 2
    dude uncomplicated data extraction

    dude uncomplicated data extraction

    dude uncomplicated data extraction: A simple framework

    Dude is a very simple framework for writing web scrapers using Python decorators. The design, inspired by Flask, was to easily build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax. Dude is currently in Pre-Alpha. Please expect breaking changes. You can run your scraper from terminal/shell/command-line by supplying URLs, the output filename of your choice and the paths to your python scripts to dude scrape command.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 3
    KaraKeep

    KaraKeep

    A self-hostable bookmark-everything app

    ...Automatic fetching of link titles, descriptions, and images streamlines saving content without manual edits, while rule-based management lets users define customized workflows. With support for image OCR and structured data extraction, Karakeep functions as a flexible personal knowledge base for researchers, content creators, and heavy bookmarkers.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 4
    Bili23 Downloader

    Bili23 Downloader

    Cross platform GUI tool for downloading videos from Bilibili sites

    Bili23-Downloader is an open source desktop application designed for downloading video content from the Bilibili platform. It provides a graphical interface that allows users to download various types of media including user-uploaded videos, series episodes, movies, and other hosted content. It focuses on ease of use with a zero-configuration setup, making it accessible to both beginners and experienced users. It supports high performance downloads through multi-threading and includes resume...
    Downloads: 3 This Week
    Last Update:
    See Project
  • Train ML Models With SQL You Already Know Icon
    Train ML Models With SQL You Already Know

    BigQuery automates data prep, analysis, and predictions with built-in AI assistance.

    Build and deploy ML models using familiar SQL. Automate data prep with built-in Gemini. Query 1 TB and store 10 GB free monthly.
    Try Free
  • 5
    news-please

    news-please

    Python tool for crawling and extracting structured data from news site

    news-please is an open source news crawler and information extraction tool designed to collect and structure articles from online news websites. It provides an integrated pipeline that crawls news sites, retrieves article pages, and extracts structured information such as headlines, authors, publication dates, and article text. news-please can recursively follow internal links and read RSS feeds to gather both recent and archived articles from a news outlet when given only the root URL of a site. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 6
    MDCx

    MDCx

    Movie metadata scraper and organizer for media libraries and NFO

    MDCx is an open source media metadata scraping and organization tool designed to automate the process of collecting detailed information for movie files. It retrieves metadata from multiple online sources and applies it to local media collections, helping users maintain structured and well-organized libraries. MDCx can download information such as titles, cast data, artwork, and other metadata, then generate standardized NFO files compatible with media management systems. ...
    Downloads: 3 This Week
    Last Update:
    See Project
  • 7
    CommunityScrapers

    CommunityScrapers

    This is a public repository containing scrapers

    Stash Community Scrapers is a large open-source collection of metadata extraction tools designed to work with the Stash media management platform, enabling automated scraping of content information from various online sources. The repository contains hundreds of scraper definitions written primarily in YAML and Python, each tailored to extract structured metadata such as titles, performers, tags, and media details from specific websites.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 8
    newspaper4k

    newspaper4k

    Python library for scraping and analyzing online news articles easily

    ...It is a continuation and active fork of the original newspaper3k library, which had stopped receiving updates, with the goal of keeping the ecosystem maintained while adding improvements and bug fixes. It provides developers with tools to automatically download web pages, extract the main article content, and collect associated metadata such as titles, authors, images, and publication dates. Newspaper4k also includes natural language processing capabilities that can generate summaries and identify keywords from extracted article text. Newspaper4k supports both single-article extraction and full news site processing, allowing users to build sources representing entire publications and iterate through their articles. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 9
    WebHarvest - web data extraction tool
    Web data extraction (web data mining, web scraping) tool. It leverages well proved XML and text processing techologies in order to easely extract useful data from arbitrary web pages.
    Downloads: 3 This Week
    Last Update:
    See Project
  • Fully Managed MySQL, PostgreSQL, and SQL Server Icon
    Fully Managed MySQL, PostgreSQL, and SQL Server

    Automatic backups, patching, replication, and failover. Focus on your app, not your database.

    Cloud SQL handles your database ops end to end, so you can focus on your app.
    Try Free
  • 10
    diskover-community

    diskover-community

    Open source file indexing & storage analytics powered by Elasticsearch

    Diskover Community Edition is an open source file system indexing and storage analytics platform designed to help organizations understand and manage large volumes of file data. It crawls file systems and indexes metadata using Elasticsearch, enabling fast search, analysis, and organization of files stored across different storage systems. It allows administrators and users to explore file structures, monitor storage usage, and gain insights into how data is distributed across infrastructure. By indexing file metadata from sources such as local file systems, network shares like NFS and SMB, and cloud storage, the tool provides a centralized way to analyze heterogeneous storage environments. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 11
    watercrawl

    watercrawl

    AI-ready web crawler that extracts and structures website content

    ...WaterCrawl supports customizable extraction rules so users can focus only on relevant elements while ignoring unnecessary page components. WaterCrawl also offers real-time monitoring capabilities, allowing users to track crawling progress, performance metrics, and errors during large data collection jobs. Developers can integrate the tool into applications through a REST API and multiple client SDKs, enabling automated data pipelines and AI data preparation workflows.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 12
    CyberScraper 2077

    CyberScraper 2077

    A Powerful web scraper powered by LLM | OpenAI, Gemini & Ollama

    CyberScraper 2077 is not just another web scraping tool – it's a glimpse into the future of data extraction. Born from the neon-lit streets of a cyberpunk world, this AI-powered scraper uses OpenAI, Gemini and LocalLLM Models to slice through the web's defenses, extracting the data you need with unparalleled precision and style.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 13
    HeadlessX

    HeadlessX

    The undetected self-hosted browser automation platform

    ...It is built using modern technologies including Node.js, Next.js, TypeScript, and Playwright, and uses a specialized browser engine called Camoufox based on Firefox. One of the platform’s goals is to bypass common bot-detection systems by implementing advanced fingerprint spoofing and stealth techniques. The tool can perform tasks such as HTML extraction, screenshot generation, content parsing, and search result scraping while appearing like a normal user browser. Because it is self-hosted, organizations can run the platform on their own infrastructure to maintain privacy and control over automation workflows.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 14
    Weibo Crawler

    Weibo Crawler

    Python crawler for collecting and downloading Sina Weibo user data

    weibo-crawler is a Python-based data collection tool designed to retrieve information from Sina Weibo user accounts. It automates the process of gathering posts, user profile details, and engagement metrics from one or more target accounts. weibo-crawler can extract comprehensive information about users, including profile attributes such as nickname, follower count, following count, and account metadata.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 15
    Symfony Panther

    Symfony Panther

    A browser testing and web crawling library for PHP and Symfony

    Symfony Panther is a browser testing and web scraping tool that allows developers to interact with websites programmatically. It uses headless Chrome or Firefox to automate browser tasks, making it suitable for end-to-end testing and data extraction. Panther integrates well with Symfony and PHPUnit, allowing developers to write comprehensive tests for web applications.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 16
    Social-Analyzer

    Social-Analyzer

    API, CLI, and Web App for analyzing and finding a person's profile

    Social Analyzer is an open source OSINT tool that helps investigators discover and analyze a person’s presence across a very large number of social media platforms. It provides a unified API, CLI, and web interface capable of scanning hundreds or thousands of sites for username matches and related metadata. The project includes modular detection and analysis components that users can enable depending on their investigative needs.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 17
    Interactsh

    Interactsh

    An OOB interaction gathering server and client library

    Interactsh is an open-source tool for detecting out-of-band interactions. It is a tool designed to detect vulnerabilities that cause external interactions. Interactsh Cli client requires go1.17+ to install successfully. interactsh-client with -sf, -session-file flag can be used store/read the current session information from user defined file which is useful to resume the same session to poll the interactions even after the client gets stopped or closed. Running the interactsh-client in...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 18
    videodl

    videodl

    Lightweight Python tool for downloading videos from many platforms

    Videodl is a lightweight video downloader implemented entirely in Python that allows users to retrieve videos from a wide range of online media platforms. It focuses on providing a fast and simple way to parse video pages and download media files, often prioritizing high-definition versions without watermarks when available. It supports numerous video platforms across both Chinese and international streaming ecosystems, enabling users to fetch content from many popular services through a...
    Downloads: 10 This Week
    Last Update:
    See Project
  • 19
    ffsend

    ffsend

    Easily and securely share files from the command line

    Easily and securely share files and directories from the command line through a safe, private and encrypted link using a single simple command. Files are shared using the Send service and may be up to 1GB. Others are able to download these files with this tool, or through their web browser. All files are always encrypted on the client, and secrets are never shared with the remote host. An optional password may be specified, and a default file lifetime of 1 (up to 20) download or 24 hours is...
    Downloads: 3 This Week
    Last Update:
    See Project
  • 20
    Browserless

    Browserless

    The headless Chrome/Chromium driver on top of Puppeteer

    Browserless is an open-source headless browser automation library and service built on top of Puppeteer that simplifies the process of running and scaling Chromium-based browser tasks in production environments. It provides a high-level API for interacting with headless Chrome, allowing developers to perform operations such as generating PDFs, capturing screenshots, extracting text or HTML, and automating web navigation. The project is designed to act as a production-ready abstraction layer...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    S3cmd

    S3cmd

    Command line tool for managing Amazon S3 and CloudFront services

    ...Lots of features and options have been added to S3cmd, since its very first release in 2008.... we recently counted more than 60 command-line options, including multipart uploads, encryption, incremental backup, s3 sync, ACL and Metadata management, S3 bucket size, bucket policies, and more!
    Downloads: 2 This Week
    Last Update:
    See Project
  • 22
    mlscraper

    mlscraper

    ML-based HTML scraper that learns extraction rules from examples

    ...This approach simplifies web scraping tasks by shifting the focus from rule-writing to example-based training. Internally, the project processes HTML documents, identifies relevant elements in the DOM, and builds extraction logic based on statistical or heuristic analysis of the training samples. The result is a developer-oriented tool that aims to automate common scraping workflows.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 23
    RED HAWK

    RED HAWK

    All-in-one reconnaissance and vulnerability scanning toolkit for sites

    ...RED HAWK includes utilities for performing DNS lookups, port scans, subdomain discovery, and reverse IP analysis, giving users a comprehensive view of a target environment. In addition to vulnerability detection, RED HAWK offers crawling features that gather links and metadata from websites to support deeper reconnaissance.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 24
    hordes

    hordes

    WordPress Plug curates list of links with titles icons and categories.

    Creates a front side submitted form that allows logged in user to add links or bookmarks of their favorite websites. Just copy and paste your address into the link text box and add then a title. Hordes will do the rest. There are many options including the ability to show all the data about your link, or to only show the title and the link which is handy to show more links on the page when you know what they are by only looking at the title and the link. If you need to search for...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 25
    LymPHOS2

    LymPHOS2

    LymPHOS2 Web-App

    LymPHOS2 is a web-based Application at www.LymPHOS.org containing peptidic and protein sequences and spectrometric information on the PhosphoProteome of human T-Lymphocytes. - Nguyen, TD., Vidal-Cortes, O., Gallardo, Ó., Abian, J., Carrascal, M., LymPHOS 2.0: an update of a phosphosite database of primary human T cells. Database 2015, 2015. DOI: 10.1093/database/bav115 - Carrascal, M., Ovelleiro, D., Casas, V., Gay, M., Abian, J., Phosphorylation analysis of primary human T lymphocytes...
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • Next
MongoDB Logo MongoDB