Web Scrapers for Linux

  • 1
    GOPA

    GOPA, a spider written in Golang, for Elasticsearch

    GOPA is a spider written in Golang for Elasticsearch. It is lightweight with a low footprint; memory requirements should be around 100 MB. It is easy to deploy, with no runtime or dependencies required, and easy to use, needing no programming or scripting ability, with out-of-the-box features. First, get it; there are two options: download the pre-built package or compile it yourself. Beyond Elasticsearch, Gopa doesn't require any other dependencies; simply run ./gopa to start the program. It is safe to press Ctrl+C to stop a running Gopa: Gopa will handle the rest, saving a checkpoint so that you can resume the job later.
  • 2
    GitGet

    Ever wanted to download only a part of a Git repository?

    Ever wanted to download only a part of a Git repository? Just paste the URL of the repo you want to download, then sit back and enjoy. This simple Java application uses web scraping to download only the files you need, saving you precious bandwidth and space. A rough sketch of the idea appears below.
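
    GitGet itself is a Java application, so the sketch below is only a hypothetical Python illustration of the technique it describes: scrape a listing page for file links, then fetch just the files you want. The URL and suffix filter are placeholders, not anything from GitGet.

    ```python
    # Illustrative only -- GitGet itself is a Java app. This shows the general
    # technique: scrape a listing page for file links, then download only
    # matching files. URL and suffix filter are hypothetical placeholders.
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    LISTING_URL = "https://example.com/repo/src/"  # placeholder listing page
    WANTED = ".py"                                 # only download these files

    soup = BeautifulSoup(requests.get(LISTING_URL, timeout=30).text, "html.parser")
    for link in soup.find_all("a", href=True):
        if not link["href"].endswith(WANTED):
            continue                               # skip files we don't need
        url = urljoin(LISTING_URL, link["href"])
        name = url.rsplit("/", 1)[-1]
        with open(name, "wb") as fh:
            fh.write(requests.get(url, timeout=30).content)
        print("saved", name)
    ```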
  • 3
    GoSpider

    GoSpider - Fast web spider written in Go

    GoSpider is a fast web spider written in Go. It offers fast web crawling; brute-forcing and parsing of sitemap.xml; robots.txt parsing; generating and verifying links from JavaScript files; a link finder; discovery of AWS S3 buckets and subdomains in response sources; URL collection from the Wayback Machine, Common Crawl, VirusTotal, and AlienVault; grep-friendly output formatting; Burp Suite input support; and crawling multiple sites in parallel.
  • 4
    Grab Framework Project

    Web Scraping Framework

    Grab is a Python framework for building web scrapers of various complexity, from simple five-line scripts to complex asynchronous crawlers processing millions of web pages. Grab provides two interfaces. The single request/response API lets you build a network request, perform it, and work with the received content, e.g. by interacting with the DOM tree of the HTML document; it is built on top of the urllib3 and lxml libraries. The Spider API is for building asynchronous crawlers: you write classes that define handlers for each type of network request, each handler can spawn new requests, and requests are processed concurrently by a pool of asynchronous sockets. A minimal sketch of both interfaces follows.
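
    A minimal sketch of the two interfaces described above, based on Grab's documented usage; the URLs and XPath expressions here are placeholders, so treat the details as illustrative rather than authoritative.

    ```python
    # A minimal sketch, assuming the Grab APIs described above; URLs and
    # XPath expressions are placeholders, not taken from the project's docs.
    from grab import Grab
    from grab.spider import Spider, Task

    # 1) The single request/response API: build a request, perform it,
    #    then query the received document's DOM tree via lxml-backed XPath.
    g = Grab()
    g.go("https://example.com/")
    print(g.doc.select("//title").text())

    # 2) The Spider API: handlers for each request type; a handler may
    #    spawn further Task objects, which are processed concurrently.
    class ExampleSpider(Spider):
        def task_generator(self):
            yield Task("page", url="https://example.com/")

        def task_page(self, grab, task):
            # Handler for tasks named "page": print every link found.
            for href in grab.doc.select("//a/@href"):
                print(href.text())

    ExampleSpider(thread_number=4).run()
    ```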
  • 5
    Harvest is a web indexing package. Originally designed for distributed indexing, it can form a powerful system for indexing both large and small web sites. It now also includes Harvest-NG, a highly efficient, modular, Perl-based web crawler.
  • 6
    HtmlClient provides an SGML/HTML/XHTML parser and connection client, making web spidering as easy for developers as actually surfing the web with a ready-made browser. Based on Apache's HttpClient.
  • 7
    J-Obey is a Java library/package that gives people writing their own crawlers a stable robots.txt parser. If you are writing a web crawler of some sort, you can use J-Obey to take the hassle out of writing a robots.txt parser/interpreter.
  • 8
    A web spider written in Java that can be used both as a stand-alone application and as the core of other applications that build on its functionality.
  • 9
    A minimal Java web crawler.
  • 10
    JobFunnel

    Scrape job websites into a single spreadsheet with no duplicates.

    JobFunnel is an automated tool for scraping job postings into a single .csv file with no duplicates. You can search for jobs with YAML configuration files or by passing command-line arguments. By scraping and reviewing regularly, you can cut through the noise of even the busiest job markets. Run funnel with your settings YAML to populate your master CSV file with jobs from the available providers. JobFunnel can easily be automated to run nightly with crontab. If you have a job website you would like to write a scraper for, you are welcome to implement it; review the Base Scraper for implementation details. JobFunnel also supports scraping jobs from the same job website across locales and domains; if you are interested in adding support, you may only need to define session headers and domain strings, again per the Base Scraper.
  • 11
    Letterboxd Recommendations

    Scraping publicly-accessible Letterboxd data for movie recommendations

    This project scrapes publicly-accessible Letterboxd data and builds a movie recommendation model that can generate recommendations when given a Letterboxd username. A user's "star" ratings are scraped from their Letterboxd profile and mapped to numerical ratings from 1 to 10 (accounting for half stars). Their ratings are then combined with a sample of ratings from the 4,000 most active users on the site to build a collaborative filtering recommender using singular value decomposition (SVD). Every movie in the full dataset that the user has not rated is run through the model, and the titles with the highest predicted scores are returned. Due to constraints on time and computing power, the maximum sample size a user may select is 500,000 ratings, though the full dataset holds over five million ratings from the top 4,000 Letterboxd users alone. A sketch of this SVD pipeline appears below.
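
    The pipeline above maps neatly onto an off-the-shelf SVD implementation. The sketch below is a rough illustration using the scikit-surprise library, with hypothetical column names and toy data; it is not the project's actual code.

    ```python
    # A rough illustration of the SVD pipeline described above, using the
    # scikit-surprise library. DataFrame columns and sample data are
    # hypothetical; the real project's code may differ.
    import pandas as pd
    from surprise import SVD, Dataset, Reader

    # Scraped ratings: one row per (user, film, 1-10 rating).
    ratings = pd.DataFrame({
        "user": ["alice", "alice", "bob", "bob", "carol"],
        "film": ["parasite", "heat", "heat", "alien", "parasite"],
        "rating": [10, 8, 9, 7, 6],   # half stars doubled onto a 1-10 scale
    })

    reader = Reader(rating_scale=(1, 10))
    data = Dataset.load_from_df(ratings[["user", "film", "rating"]], reader)

    model = SVD()
    model.fit(data.build_full_trainset())

    # Predict scores for films the target user has not rated, then rank.
    seen = set(ratings.loc[ratings.user == "alice", "film"])
    candidates = set(ratings.film) - seen
    scored = {film: model.predict("alice", film).est for film in candidates}
    print(sorted(scored, key=scored.get, reverse=True))
    ```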
  • 12
    A web spider that, building on the URL APIs available for most web-based databases, maps web pages to two-dimensional FreeMind mind-maps. Mapp.it runs locally as a web application and uses a small-footprint CherryPy web server.
  • 13
    Methanol is a scriptable multi-purpose web crawling system with an extensible configuration system and speed-optimized architectural design. Methabot is the web crawler of Methanol.
  • 14

    Mowglee

    Mowglee - The Geo Crawler!

    Mowglee is a distributed, multi-threaded web crawler in Java based on asynchronous task execution. It is designed for geographic affinity and is highly modular.
  • 15
    NightCrawler is a multithreaded web spider which uses MIME types to download files.
  • 16
    Nomad is a tiny but efficient search engine and web crawler. It works very well for searching within a set of corporate websites on the Internet and/or an intranet's HTML documents or knowledge repositories.
  • 17
    NuzeBot

    Finds interesting news headlines.

    NuzeBot is a bot that finds the news you want to see. It can be configured to find the news that interests you and reject everything else, letting you view the most interesting headlines from many websites on one page.
  • 18
    We are integrating existing communication systems, including wikis, IRC, instant messaging, e-mail, and even static web sites. We write web scrapers and servers for managing events, IRC bots, logs, local names, templates, and groups.
  • 19

    PGBuild

    Compile your mobile web pages into mobile apps via build.phonegap.com

    PGBuild is a PhoneGap development system that automates the development process by connecting your CMS/web server with the online service [Phonegap Build](http://build.phonegap.com). PGBuild is essentially a web spider that makes off-line versions of web pages. The off-line version is zipped and sent to the PhoneGap Build service. The spider is controlled by a project file that sets the rules for the spider and the options for the PhoneGap Build service. You may create and manage your PhoneGap project source files manually on your web server, or use PGBuild to connect to a CMS and extract content. PGBuild is managed from a small widget that you may use yourself or integrate into a CMS.
  • 20
    Perl Web Scraping Project

    Web scraping (web harvesting or web data extraction) is data scraping used for extracting data from websites.[1] Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis. Scraping a web page involves fetching it and then extracting data from it.[1][2] Fetching is the downloading of a page (which a browser does when you view it), so web crawling is a main component of web scraping. Once a page is fetched, extraction can take place: the content may be parsed, searched, reformatted, its data copied into a spreadsheet, and so on. A minimal fetch-then-extract sketch follows.
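
    To make the fetch/extract distinction concrete, here is a minimal sketch (in Python rather than Perl, with a placeholder URL): the fetch step downloads the page, and the extract step parses the downloaded content for data.

    ```python
    # A minimal fetch-then-extract sketch of the process described above,
    # using only the Python standard library; the URL is a placeholder.
    import re
    from urllib.request import urlopen

    # Fetch: download the page, exactly as a browser does when you view it.
    html = urlopen("https://example.com/").read().decode("utf-8")

    # Extract: parse the fetched content for the data of interest --
    # here, the page title and every hyperlink target.
    title = re.search(r"<title>(.*?)</title>", html, re.S)
    links = re.findall(r'href="([^"]+)"', html)

    print(title.group(1) if title else "(no title)")
    print(links)
    ```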
  • 21

    Python Crawler Library

    Python Web Crawler Library

    A simple library for crawling the web. This library gives you the ability to create macros for crawling web sites and performing simple actions such as logging in; a rough illustration appears below.
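
    Since the library's own API isn't documented here, the sketch below only illustrates the kind of "log in, then crawl" macro described, using the requests and BeautifulSoup libraries; all URLs, form fields, and credentials are hypothetical.

    ```python
    # An illustrative "log in, then crawl" macro of the kind described,
    # built on requests. URLs, form fields, and credentials are hypothetical;
    # this is not the library's actual API.
    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()

    # "Log in" action: post credentials so later requests carry the session cookie.
    session.post("https://example.com/login",
                 data={"user": "demo", "password": "secret"})

    # Crawl action: visit pages reachable from the start page, one level deep.
    start = "https://example.com/"
    soup = BeautifulSoup(session.get(start).text, "html.parser")
    for a in soup.find_all("a", href=True):
        url = requests.compat.urljoin(start, a["href"])
        if url.startswith(start):              # stay on the same site
            print(url, session.get(url).status_code)
    ```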
  • 22
    Yet another web crawler? Yes, but this one uses the full power of regular expressions to accept or reject, examine or ignore, and save or refuse pages. You can also use MIME types to do all this. Powerful and flexible.
  • 23
    The purpose of SAWS is to facilitate the process of web scraping by 1) providing a pattern-specification mechanism on top of normal regular expressions, and 2) implementing a common matching algorithm that runs a specified pattern over a given source and returns any matches. A rough sketch of the idea appears below.
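
    The sketch below is a hypothetical Python rendering of the two ideas SAWS names (SAWS itself may specify patterns differently): a pattern-specification layer on top of regular expressions, plus one common routine that runs any such pattern over a source.

    ```python
    # A hypothetical rendering of the SAWS idea: named {placeholders}
    # compile down to named regex groups, and one common matching routine
    # runs any pattern over any source. Not SAWS's actual format.
    import re

    def compile_pattern(spec: str) -> re.Pattern:
        """Turn 'Price: {price}' into a regex with one named group per
        placeholder; in this toy version a placeholder matches one
        whitespace-free token."""
        escaped = re.escape(spec)
        return re.compile(re.sub(r"\\{(\w+)\\}", r"(?P<\1>\\S+)", escaped))

    def run_pattern(spec: str, source: str) -> list[dict]:
        """The common matching algorithm: run one pattern over a source."""
        return [m.groupdict() for m in compile_pattern(spec).finditer(source)]

    page = "Item: mouse Price: 9.99\nItem: keyboard Price: 24.50"
    print(run_pattern("Item: {name} Price: {price}", page))
    # -> [{'name': 'mouse', 'price': '9.99'}, {'name': 'keyboard', 'price': '24.50'}]
    ```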
  • 24
    SEO Macroscope

    SEO Macroscope is a website scanning tool to check your website

    The website broken-link scanner and technical SEO toolbox. SEO Macroscope for Microsoft Windows is a free and open-source website broken-link checking and scanning tool, with some technical SEO functionality for common website problems. Find broken links on your website, both internal and external. Report robots.txt statuses. Check and report canonical, hreflang, and other metadata problems. Perform simple, configurable technical SEO checks on titles and descriptions. Report the fastest/slowest pages. Export reports to Excel and CSV formats. Generate and export text and XML sitemaps from the crawled pages. Analyze redirect chains. Use custom filters to verify the presence/absence of tracking tags. Use CSS selectors, XPath queries, and regular expressions to scrape website data.
  • 25

    Scra.php

    Scrape anything!

    The ultimate customisable, YAML-ised web scraper for PHP. A sketch of the YAML-driven idea appears below.
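
    Scra.php itself is PHP and its YAML format isn't documented here, so the sketch below only illustrates the YAML-driven idea in Python: the scrape is described by a config file rather than by code. The config keys are hypothetical, not Scra.php's format.

    ```python
    # Illustrates the YAML-driven idea only; the config keys shown are
    # hypothetical and not Scra.php's actual format (Scra.php is PHP).
    import requests
    import yaml                      # PyYAML
    from bs4 import BeautifulSoup

    CONFIG = """
    url: https://example.com/
    fields:
      title: title
      links: a
    """

    spec = yaml.safe_load(CONFIG)
    soup = BeautifulSoup(requests.get(spec["url"]).text, "html.parser")

    # Each field maps a name to a CSS selector; collect the matching text.
    for name, selector in spec["fields"].items():
        print(name, [el.get_text(strip=True) for el in soup.select(selector)])
    ```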