Showing 5 open source projects for "gist web crawler"

View related business solutions
  • Keep company data safe with Chrome Enterprise Icon
    Keep company data safe with Chrome Enterprise

    Protect your business with AI policies and data loss prevention in the browser

    Make AI work your way with Chrome Enterprise. Block unapproved sites and set custom data controls that align with your company's policies.
    Download Chrome
  • Fully managed relational database service for MySQL, PostgreSQL, and SQL Server Icon
    Fully managed relational database service for MySQL, PostgreSQL, and SQL Server

    Focus on your application, and leave the database to us

    Cloud SQL manages your databases so you don't have to, so your business can run without disruption. It automates all your backups, replication, patches, encryption, and storage capacity increases to give your applications the reliability, scalability, and security they need.
    Try for free
  • 1
    WebMagic

    WebMagic

    A scalable web crawler framework for Java

    WebMagic is a scalable crawler framework. It covers the whole lifecycle of crawler, downloading, url management, content extraction and persistent. It can simplify the development of a specific crawler. WebMagic is a simple but scalable crawler framework. You can develop a crawler easily based on it. WebMagic has a simple core with high flexibility, a simple API for html extracting. It also provides annotation with POJO to customize a crawler, and no configuration is needed. Some other...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 2
    Heritrix

    Heritrix

    Internet Archive's open-source, web-scale, web crawler project

    Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.
    Downloads: 11 This Week
    Last Update:
    See Project
  • 3
    phoneutria
    A Java Web crawler: multi-threaded, scalable, with high performance, extensible and polite. It can be used to crawl and index any web or enterprise domain and is configurable through a XML configuration file.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 4
    Web-as-corpus tools in Java. * Simple Crawler (and also integration with Nutch and Heritrix) * HTML cleaner to remove boiler plate code * Language recognition * Corpus builder
    Downloads: 0 This Week
    Last Update:
    See Project
  • Axe Credit Portal - ACP- is axefinance’s future-proof AI-driven solution to digitalize the loan process from KYC to servicing, available as a locally hosted or cloud-based software. Icon
    Axe Credit Portal - ACP- is axefinance’s future-proof AI-driven solution to digitalize the loan process from KYC to servicing, available as a locally hosted or cloud-based software.

    Banks, lending institutions

    Founded in 2004, axefinance is a global market-leading software provider focused on credit risk automation for lenders looking to provide an efficient, competitive, and seamless omnichannel financing journey for all client segments (FI, Retail, Commercial, and Corporate.)
    Learn More
  • 5
    LogCrawler is an ANT task for automatic testing of web applications. Using a HTTP crawler it visits all pages of a website and checks the server logfiles for errors. Use it as a "smoketest" with your CI system like CruiseControl.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • Next