Showing 23 open source projects for "gist web crawler"

View related business solutions
  • AI-powered service management for IT and enterprise teams Icon
    AI-powered service management for IT and enterprise teams

    Enterprise-grade ITSM, for every business

    Give your IT, operations, and business teams the ability to deliver exceptional services—without the complexity. Maximize operational efficiency with refreshingly simple, AI-powered Freshservice.
    Try it Free
  • Our Free Plans just got better! | Auth0 Icon
    Our Free Plans just got better! | Auth0

    With up to 25k MAUs and unlimited Okta connections, our Free Plan lets you focus on what you do best—building great apps.

    You asked, we delivered! Auth0 is excited to expand our Free and Paid plans to include more options so you can focus on building, deploying, and scaling applications without having to worry about your security. Auth0 now, thank yourself later.
    Try free now
  • 1
    Spatie Crawler

    Spatie Crawler

    An easy to use, powerful crawler implemented in PHP

    Spatie Crawler is a PHP library that allows developers to crawl websites and extract information efficiently. It can be used for web scraping, link checking, or automated testing of web pages. The library is simple to use and supports customizable crawling strategies, including controlling crawl depth and handling redirects. It’s suitable for building crawlers that navigate large or dynamically generated websites.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 2
    Heritrix

    Heritrix

    Internet Archive's open-source, web-scale, web crawler project

    Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt. Heritrix is designed to respect the robots.txt exclusion directives...
    Downloads: 6 This Week
    Last Update:
    See Project
  • 3
    miniblink49

    miniblink49

    Lighter, faster browser kernel of blink to integrate HTML UI in apps

    ... electron). Customize as you wish, simulate another browser environment. Perfect HTML5 support, friendly to various front-end libraries (support HTML5, and friendly to front framework). After turning off the cross-domain switch, you can use various cross-domain functions (support cross-domain). Headless mode, which greatly saves resources and is suitable for crawlers (headless mode, be suitable for Web Crawler).
    Downloads: 8 This Week
    Last Update:
    See Project
  • 4
    Bdash

    Bdash

    Simple SQL Client for lightweight data analysis

    Simple SQL Client for lightweight data analysis. You can share the result with gist. Supports MySQL, PostgreSQL (Amazon Redshift), SQLite3, Google BigQuery, Treasure Data, Amazon Athena. You can download and install from Web Site or Releases.
    Downloads: 1 This Week
    Last Update:
    See Project
  • Build apps or websites quickly on a fully managed platform Icon
    Build apps or websites quickly on a fully managed platform

    Get two million requests free per month.

    Run frontend and backend services, batch jobs, host LLMs, and queue processing workloads without the need to manage infrastructure.
    Try it for free
  • 5
    WebMagic

    WebMagic

    A scalable web crawler framework for Java

    WebMagic is a scalable crawler framework. It covers the whole lifecycle of crawler, downloading, url management, content extraction and persistent. It can simplify the development of a specific crawler. WebMagic is a simple but scalable crawler framework. You can develop a crawler easily based on it. WebMagic has a simple core with high flexibility, a simple API for html extracting. It also provides annotation with POJO to customize a crawler, and no configuration is needed. Some other features...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 6
    X-Crawl

    X-Crawl

    Flexible Node.js AI-assisted crawler library

    A high-performance web crawling and scraping framework for Node.js, designed for large-scale data extraction.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 7
    PureScript

    PureScript

    A strongly-typed language that compiles to JavaScript

    Compile to readable JavaScript and reuse existing JavaScript code easily. An extensive collection of libraries for development of web applications, web servers, apps and more. Excellent tooling and editor support with instant rebuilds. An active community with many learning resources. Build real-world applications using functional techniques and expressive types, such as: Algebraic data types and pattern matching. Row polymorphism and extensible records. Higher kinded types and type classes...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 8
    TorBot

    TorBot

    Dark Web OSINT Tool

    Contributions to this project are always welcome. To add a new feature fork the dev branch and give a pull request when your new feature is tested and complete. If its a new module, it should be put inside the modules directory. The branch name should be your new feature name in the format <Feature_featurename_version(optional)>. On Linux platforms, you can make an executable for TorBot by using the install.sh script. You will need to give the script the correct permissions using chmod +x...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 9
    Gerapy

    Gerapy

    Distributed Crawler Management Framework Based on Scrapy

    Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Scrapyd-Client, Scrapyd-API, Django and Vue.js. Someone who has worked as a crawler with Python may use Scrapy. Scrapy is indeed a very powerful crawler framework. It has high crawling efficiency and good scalability. It is basically a necessary tool for developing crawlers using Python. If you use Scrapy as a crawler, then of course we can use our own host to crawl when crawling, but when the crawl is very large, we can’t run...
    Downloads: 0 This Week
    Last Update:
    See Project
  • Sales CRM and Pipeline Management Software | Pipedrive Icon
    Sales CRM and Pipeline Management Software | Pipedrive

    The easy and effective CRM for closing deals

    Pipedrive’s simple interface empowers salespeople to streamline workflows and unite sales tasks in one workspace. Unlock instant sales insights with Pipedrive’s visual sales pipeline and fine-tune your strategy with robust reporting features and a personalized AI Sales Assistant.
    Try it for free
  • 10
    SiteOne Crawler (desktop app)

    SiteOne Crawler (desktop app)

    A free, feature-rich web analyzer and exporter/cloner you will love!

    A free in-depth website analyzer providing audits of security, performance, SEO, accessibility and other technical aspects. Available as a desktop application for Windows/macOS/Linux and as a CLI tool for advanced users and CI/CD processes. It also includes an offline web page exporter (website clone, mirror).
    Downloads: 0 This Week
    Last Update:
    See Project
  • 11
    Goutte

    Goutte

    Goutte, a simple PHP Web Scraper

    Goutte is a screen scraping and web crawling library for PHP. Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses. Goutte depends on PHP 7.1+. Add fabpot/goutte as a require dependency in your composer.json file. Create a Goutte Client instance (which extends Symfony\Component\BrowserKit\HttpBrowser). Make requests with the request() method. The method returns a Crawler object (Symfony\Component\DomCrawler\Crawler). To use your own HTTP settings, you may...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 12
    ReconSpider

    ReconSpider

    Most Advanced Open Source Intelligence (OSINT) Framework

    ... the capabilities of Wave, Photon and Recon Dog to do a comprehensive enumeration of attack surfaces. Reconnaissance is a mission to obtain information by various detection methods, about the activities and resources of an enemy or potential enemy, or geographic characteristics of a particular area. A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).
    Downloads: 16 This Week
    Last Update:
    See Project
  • 13
    CEF Python

    CEF Python

    Python bindings for the Chromium Embedded Framework (CEF)

    ... use cases for CEF. Use it as a modern HTML5 based rendering engine that can act as a replacement for classic desktop GUI frameworks. Think of it as Electron for Python. Embed a web browser widget in a classic Qt / GTK / wxPython desktop application. Use it for automated testing of web applications with more advanced capabilities than Selenium web browser automation due to CEF low level programming APIs.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 14
    ShadowSocksShare

    ShadowSocksShare

    Python ShadowSocks framework

    This project obtains the shared ss(r) account from the ss(r) shared website crawler, redistributes the account and generates a subscription link by parsing and verifying the account connectivity. Since Google plus will be closed on April 2, 2019, almost all the available accounts crawled before come from Google plus. So if you are building your own website, please keep an eye on the updates of this project and redeploy using the latest source code.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    Catberry

    Catberry

    Catberry is an isomorphic framework

    Catberry is an isomorphic framework for building universal front-end apps using components, Flux architecture and progressive rendering. Catberry builds a bundle for running the application in a browser as a Single Page Application. Cat-Components – similar to web-components but organized as directories, can be rendered on the server and published/installed as NPM packages. The entire architecture of the framework is built using the Service Locator pattern, which helps to manage module...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 16
    phoneutria
    A Java Web crawler: multi-threaded, scalable, with high performance, extensible and polite. It can be used to crawl and index any web or enterprise domain and is configurable through a XML configuration file.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    Node Crawler

    Node Crawler

    Web Crawler/Spider for NodeJS + server-side jQuery

    Most powerful, popular and production crawling/scraping package for Node, happy hacking.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 18

    sitecheck

    Modular web site spider for web developers.

    More than just a link checker, sitecheck is a website spider (also known as a crawler) which can assist with SEO by testing an entire site plus both inbound links from search engines and outbound links to other sites for the following issues: looping redirects (HTTP 301/302), broken links (HTTP 404), server errors (HTTP 500), spelling mistakes, low readability scores (using the Flesch Reading Ease test), missing/empty/duplicate meta tags, duplicate content, slow page speed, W3C validation...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 19
    Zoozle Search & Download Suchmaschine

    Zoozle Search & Download Suchmaschine

    Zoozle 2008 - 2010 Webpage, Tools and SQL Files

    Download search engine and directory with Rapidshare and Torrent - zoozle Download Suchmaschine All The files that run the World Leading German Download Search Engine in 2010 with 500 000 unique visitors a day - all the tools you need to set up a clone. Code Contains: - PHP Files for zoozle - Perl Crawler for gathering new content to database and all other cool tools i have created https://www.artikelschreiber.com/en/ https://www.unaique.net/en/ https://www.unaique.com/ https...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 20
    ** Guys I have built a much more powerful Fully Featured CMS system at: https://github.com/MacdonaldRobinson/FlexDotnetCMS Macs CMS is a Flat File ( XML and SQLite ) based AJAX Content Management System. It focuses mainly on the Edit In Place editing concept. It comes with a built in blog with moderation support, user manager section, roles manager section, SEO / SEF URL
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    Web-as-corpus tools in Java. * Simple Crawler (and also integration with Nutch and Heritrix) * HTML cleaner to remove boiler plate code * Language recognition * Corpus builder
    Downloads: 0 This Week
    Last Update:
    See Project
  • 22
    A toolkit for crawling information from web pages by combining different kinds of "actions". Actions are simple operations such as navigation to a specified url or extraction of text from the html. Also available is a graphic user interface.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 23
    LogCrawler is an ANT task for automatic testing of web applications. Using a HTTP crawler it visits all pages of a website and checks the server logfiles for errors. Use it as a "smoketest" with your CI system like CruiseControl.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • Next
Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.