Web Scrapers for Linux

View 10 business solutions
  • Our Free Plans just got better! | Auth0 by Okta Icon
    Our Free Plans just got better! | Auth0 by Okta

    With up to 25k MAUs and unlimited Okta connections, our Free Plan lets you focus on what you do best—building great apps.

    You asked, we delivered! Auth0 is excited to expand our Free and Paid plans to include more options so you can focus on building, deploying, and scaling applications without having to worry about your secuirty. Auth0 now, thank yourself later.
    Try free now
  • Bright Data - All in One Platform for Proxies and Web Scraping Icon
    Bright Data - All in One Platform for Proxies and Web Scraping

    Say goodbye to blocks, restrictions, and CAPTCHAs

    Bright Data offers the highest quality proxies with automated session management, IP rotation, and advanced web unlocking technology. Enjoy reliable, fast performance with easy integration, a user-friendly dashboard, and enterprise-grade scaling. Powered by ethically-sourced residential IPs for seamless web scraping.
    Get Started
  • 1
    Firecrawl

    Firecrawl

    Turn entire websites into LLM-ready markdown or structured data

    Crawl and convert any website into LLM-ready markdown or structured data. Built by Mendable.ai and the Firecrawl community. Includes powerful scraping, crawling, and data extraction capabilities. Firecrawl is an API service that takes a URL, crawls it, and converts it into clean markdown or structured data. We crawl all accessible subpages and give you clean data for each. No sitemap is required.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 2
    Gerapy

    Gerapy

    Distributed Crawler Management Framework Based on Scrapy

    Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Scrapyd-Client, Scrapyd-API, Django and Vue.js. Someone who has worked as a crawler with Python may use Scrapy. Scrapy is indeed a very powerful crawler framework. It has high crawling efficiency and good scalability. It is basically a necessary tool for developing crawlers using Python. If you use Scrapy as a crawler, then of course we can use our own host to crawl when crawling, but when the crawl is very large, we can’t run the crawler on our own machine, a good one. The method is to deploy Scrapy to a remote server for execution. At this time, you might use Scrapyd. With it, we only need to install Scrapyd on the remote server and start the service. We can deploy the Scrapy project we wrote. Go to the remote host. In addition, Scrapyd provides a variety of operations API, which gives you free control over the operation of the Scrapy project.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 3
    JSSoup

    JSSoup

    JavaScript + BeautifulSoup = JSSoup

    I'm a fan of Python library BeautifulSoup. It's feature-rich and very easy to use. But when I am working on a small react-native project, and I tried to find a HTML parser library like BeautifulSoup, I failed. So I want to write a HTML parser library that can be so easy to use just like BeautifulSoup in Javascript. JSSoup uses tautologistics/node-htmlparser as HTML dom parser, and creates a series of BeautifulSoup like API on top of it. JSSoup supports both node and react-native. JSSoup tries to use the same interfaces as BeautifulSoup so BeautifulSoup user can use JSSoup seamlessly. However, JSSoup uses Javascript's camelCase naming style instead of Python's underscore naming style.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 4
    MechanicalSoup

    MechanicalSoup

    A Python library for automating interaction with websites

    A Python library for automating interaction with websites. MechanicalSoup automatically stores and sends cookies, follows redirects, and can follow links and submit forms. It doesn't do JavaScript. MechanicalSoup was created by M Hickford, who was a fond user of the Mechanize library. Unfortunately, Mechanize was incompatible with Python 3 until 2019 and its development stalled for several years. MechanicalSoup provides a similar API, built on Python giants Requests (for HTTP sessions) and BeautifulSoup (for document navigation). Since 2017 it is a project actively maintained by a small team.
    Downloads: 1 This Week
    Last Update:
    See Project
  • Conversational AI for fast and friendly customer care | watsonx Assistant Icon
    Conversational AI for fast and friendly customer care | watsonx Assistant

    Get started on your generative AI journey

    IBM watsonx Assistant is a next-gen conversational AI solution—it that empowers a broader audience that includes non-technical business users, anyone in your organization to effortlessly build generative AI Assistants that deliver frictionless self-service experiences to customers across any device or channel, help boost employee productivity, and scale across your business.
    Learn More
  • 5
    PHPScraper

    PHPScraper

    A universal web-util for PHP

    PHPScraper is a universal web-scraping util for PHP, built with simplicity in mind. The goal is to make xPath Selectors optional and avoid the commonly needed boilerplate code. Just create an instance of PHPScraper, go to a website, and start collecting data. All scraping functionality can be accessed either as a function call or a property call. For example, the title can be accessed in two ways. Many common use cases are covered already. You can find prepared extractors for various HTML tags, including interesting attributes. You can filter and combine these to your needs. In some cases there is an option to get a simple or detailed version. PHPScraper can assist in collecting feeds such as RSS feeds, sitemap.xml-entries and static search indexes. This can be useful when deciding on the next page to crawl or building up a list of pages on a website.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 6
    Web Scraping for Laravel

    Web Scraping for Laravel

    Laravel adapter for Roach, the complete web scraping toolkit for PHP

    This is the Laravel adapter for Roach, the complete web scraping toolkit for PHP. Easily integrate Roach into any Laravel application. The Laravel adapter mostly provides the necessary container bindings for the various services Roach uses, as well as making certain configuration options available via a config file. The Laravel adapter of Roach registers a few Artisan commands to make out development experience as pleasant as possible. Roach ships with an interactive shell (often called Read-Evaluate-Print-Loop, or Repl for short) which makes prototyping our spiders a breeze. We can use the provided roach:shell command to launch a new Repl session.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 7
    X-RAY

    X-RAY

    The next web scraper, see through the <html> noise

    Supports strings, arrays, arrays of objects, and nested object structures. The schema is not tied to the structure of the page you're scraping, allowing you to pull the data in the structure of your choosing. The API is entirely composable, giving you great flexibility in how you scrape each page. Paginate through websites, scraping each page. X-ray also supports a request delay and a pagination limit. Scraped pages can be streamed to a file, so if there's an error on one page, you won't lose what you've already scraped. Start on one page and move to the next easily. The flow is predictable, following a breadth-first crawl through each of the pages. X-ray has support for concurrency, throttles, delays, timeouts and limits to help you scrape any page responsibly. Swap in different scrapers depending on your needs. Currently supports HTTP and PhantomJS driver drivers. In the future, I'd like to see a Tor driver for requesting pages through the Tor network.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 8
    changedetection.io

    changedetection.io

    The best free open source website change detection and restock service

    Loved by smart shoppers, data journalists, research engineers, data scientists, security researchers, and more. From simply monitoring website pages that have a change (such as watching prices, and restocking notifications), to deep inspection such as PDF text support, JSON and XML monitoring, and extensive text triggers. Monitor out-of-stock products and get alerts when those products are back in stock, get restock alerts via Discord, Slack, email, and many other platforms. Using the browser steps configuration, add basic steps before performing change detection, such as logging into websites, adding a product to a cart, accepting cookie logins, entering dates, and refining searches. Monitor and track PDF file changes, and know when a PDF file has text changes. Know when your favourite product is on sale, or other special deals are announced before anyone else. Detect and monitor changes in JSON API responses.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 9
    finvizfinance

    finvizfinance

    Finviz analysis python library

    finvizfinance is a package that collects financial information from FinViz website. Stock charts, fundamental & technical information, insider information and stock news. Forex charts and performance. Crypto charts and performance. Screener and Group provide data frames for comparing stocks according to different filters and trading signals. Getting information (fundament, description, outer rating, stock news, inside trader) of an individual stock.
    Downloads: 1 This Week
    Last Update:
    See Project
  • PRTG Network Monitor | Making the lives of sysadmins easier Icon
    PRTG Network Monitor | Making the lives of sysadmins easier

    Stay ahead of IT infrastructure issues

    PRTG Network Monitor is an all-inclusive monitoring software solution developed by Paessler. Equipped with an easy-to-use, intuitive interface with a cutting-edge monitoring engine, PRTG Network Monitor optimizes connections and workloads as well as reduces operational costs by avoiding outages while saving time and controlling service level agreements (SLAs). The solution is packed with specialized monitoring features that include flexible alerting, cluster failover solution, distributed monitoring, in-depth reporting, maps and dashboards, and more.
    Learn More
  • 10
    rvest

    rvest

    Simple web scraping for R

    rvest helps you scrape (or harvest) data from web pages. It is designed to work with magrittr to make it easy to express common web scraping tasks, inspired by libraries like beautiful soup and RoboBrowser. If you’re scraping multiple pages, I highly recommend using rvest in concert with polite. The polite package ensures that you’re respecting the robots.txt and not hammering the site with too many requests.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 11
    serverless-chrome

    serverless-chrome

    Run headless Chrome/Chromium on AWS Lambda

    Serverless Chrome contains everything you need to get started running headless Chrome on AWS Lambda (possibly Azure and GCP Functions soon). The aim of this project is to provide the scaffolding for using Headless Chrome during a serverless function invocation. Serverless Chrome takes care of building and bundling the Chrome binaries and making sure Chrome is running when your serverless function executes. In addition, this project also provides a few example services for common patterns (e.g. taking a screenshot of a page, printing to PDF, some scraping, etc.). Why? Because it's neat. It also opens up interesting possibilities for using the Chrome DevTools Protocol (and tools like Chromeless or Puppeteer) in serverless architectures and doing testing/CI, web-scraping, pre-rendering, etc. You must configure your AWS credentials either by defining AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environmental variables, or using an AWS profile.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 12
    A function-testing, performance-measuring, site-mirroring, web spider that is widely portable and capable of using scenarios to process a wide range of web transactions, including ssl and forms.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 13
    ItSucks
    This project is a java web spider (web crawler) with the ability to download (and resume) files. It is also highly customizable with regular expressions and download templates. All backend functionalities are also available in a separate library.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 14
    A Java implementation of a flexible and extensible web spider engine. Optional modules allow functionality to be added (searching dead links, testing the performance and scalability of a site, creating a sitemap, etc ..
    Downloads: 11 This Week
    Last Update:
    See Project
  • 15
    WebSPHINX is a web crawler (robot, spider) Java class library, originally developed by Robert Miller of Carnegie Mellon University. Multithreaded, tollerant HTML parsing, URL filtering and page classification, pattern matching, mirroring, and more.
    Leader badge
    Downloads: 4 This Week
    Last Update:
    See Project
  • 16
    Web Spider, Web Crawler, Email Extractor

    Web Spider, Web Crawler, Email Extractor

    Free Extracts Emails, Phones and custom text from Web using JAVA Regex

    In Files there is WebCrawlerMySQL.jar which supports MySql Connection Please follow this link to get latest version https://sourceforge.net/projects/web-spider-web-crawler-extract/ Free Web Spider & Crawler. Extracts Information from Web by parsing millions of pages. Store data into Derby OR MySQL Database and data are not being lost after force closing the spider. - Free Web Spider , Parser, Extractor, Crawler - Extraction of Emails , Phones and Custom Text from Web - Export to Excel File - Data Saved into Derby Database - Written in Java Cross Platform See also Free Email Sender in this link: https://sourceforge.net/projects/gitst-free-email-ender/ Please install Microsoft OpenJDK to start the application https://www.microsoft.com/openjdk
    Downloads: 2 This Week
    Last Update:
    See Project
  • 17
    Easyspider - Distributed Web Crawler

    Easyspider - Distributed Web Crawler

    Easy Spider is a distributed Perl Web Crawler Project from 2006

    Easy Spider is a distributed Perl Web Crawler Project from 2006. It features code from crawling webpages, distributing it to a server and generating xml files from it. The client site can be any computer (Windows or Linux) and the Server stores all data. Websites that use EasySpider Crawling for Article Writing Software: https://www.artikelschreiber.com/en/ https://www.unaique.net/en/ https://www.unaique.com/ https://www.artikelschreiber.com/marketing/ https://www.paraphrasingtool1.com/ https://www.artikelschreiben.com/ https://buzzerstar.com/ https://easyperlspider.sourceforge.io/ http://artikelschreiber.net/ http://sebastianenger.com/ http://unaique.de/ http://unaique.org/ It is fun to look at some code that is few years ago and to see how one has improved himself. If you want to write text automatically try https://www.artikelschreiber.com/en/ or https://www.unaique.net/en/!
    Downloads: 3 This Week
    Last Update:
    See Project
  • 18

    Web Crawler Security Tool

    A web crawler oriented to information security.

    Last update on tue mar 26 16:25 UTC 2012 The Web Crawler Security is a python based tool to automatically crawl a web site. It is a web crawler oriented to help in penetration testing tasks. The main task of this tool is to search and list all the links (pages and files) in a web site. The crawler has been completely rewritten in v1.0 bringing a lot of improvements: improved the data visualization, interactive option to download files, increased speed in crawling, exports list of found files into a separated file (useful to crawl a site once, then download files and analyse them with FOCA), generate an output log in Common Log Format (CLF), manage basic authentication and more! Many of the old features has been reimplemented and the most interesting one is the capability of the crawler to search for directory indexing.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 19
    Catbird Linux

    Catbird Linux

    Linux for content creation, web scraping, coding, and data analysis.

    Catbird Linux is an operating system built for media creation, web scraping, and software coding. It is the daily driver you want for retrieving data, making videos or podcasts, and making software tools to automate the repetitive tasks. It is ready for work in Python, Lua, and Go languages, with numerous packages for web scraping or downloading data via API calls. Using Catbird Linux, it is possible to accomplish in depth stock market analysis, track weather trends, follow social media sentiment, or do other tasks in data science. The system is programmer friendly, ready for creating and running the tools you use to measure and understand your world. In addition to search and GPT tools, you have what you need to take notes, write reports or presentations, record and edit audio or video. Under the hood, the system is tuned to be fast and responsive on modest equipment, with a real time kernel and lightweight tiling / tabbing window manager.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 20

    dorker-py

    Descubre archivos, rutas escondidas realizando busquedas avanzadas

    Dorking Google - Dorker Py Descubre archivos, rutas escondidas realizando busquedas avanzadas (ES) Discover files, hidden paths by performing advanced searches (EN)
    Downloads: 3 This Week
    Last Update:
    See Project
  • 21
    A multi-threaded web spider that finds free porn thumbnail galleries by visiting a list of known TGPs (Thumbnail Gallery Posts). It optionally downloads the located pictures and movies. TGP list is included. Public domain perl script running on Linux.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 22
    ScrapBot 1.40 64bits

    ScrapBot 1.40 64bits

    Task automation software for accessing and manipulating website data.

    ScrapBot is a task automation software that allows you to access, authenticate, extract, and insert data on any website. The software utilizes JavaScript to execute tasks, eliminating the need for server or additional software installations. The system can control the accessed webpage through JavaScript, and the entire navigation can be viewed in the program window. The main.js script runs in a separate frame from the navigation frame but can access all page content without any restrictions.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 23

    WebCollector

    WebCollector is an open source web crawler framework based on Java.

    WebCollector is an open source web crawler framework based on Java.It provides some simple interfaces for crawling the Web,you can setup a multi-threaded web crawler in less than 5 minutes. Github: https://github.com/CrawlScript/WebCollector Demo: https://github.com/CrawlScript/WebCollector/blob/master/YahooCrawler.java
    Downloads: 2 This Week
    Last Update:
    See Project
  • 24
    anonme.sh

    anonme.sh

    anonymous tools [uncontinued]

    anonme.sh {bash script} V1.0 Operative Systems Suported: Linux Dependencies: slowloris macchanger decrypter.py description of the script * this script makes it easy tasks such as DoS attacks, change you MAC address, inject XSS on target website, file upload vulns, MD5 decrypter, webcrawler (scan websites for vulns) and we can use WGET to download files from target domain or retrieve the all website... tutorial:http://www.youtube.com/watch?v=PrlrBuioCMc
    Downloads: 2 This Week
    Last Update:
    See Project
  • 25

    Rapid Reference

    An extension that allows for hassle-free website citation/referencing.

    Please do not distribute with the goal of selling my program. How to attach to your Chrome/Edge/Brave etc Browser: 1. Download the extension(rapidreference.zip) 2. Extract 3. Go to chrome://extensions if on Chrome, or navigate to your extension management setting in your browser 4. Enable developer mode (usually top right) 5. Add unpacked extension 6. Choose the extracted extension's folder 7. There you go! How to use: 1. Start a session in the panel of the extension 2. Do required research/website navigation 3. You can check the preview of the cited references in the panel 4. You can copy the citation list to the clipboard through simply clicking the "Copy Citation" button 5. You're Done! Thanks for using my software 😊
    Downloads: 1 This Week
    Last Update:
    See Project