The archive-crawler project is building Heritrix: a flexible, extensible, robust, and scalable web crawler capable of fetching, archiving, and analyzing the full diversity and breadth of internet-accessible content.
The “Media Crawler” is an extensible Eclipse RCP-based desktop application that will crawl a given file system, extract metadata from files, map the metadata to internal schemas, and store it in a database. This project is ANDS-funded.
RiverGlass EssentialScanner is an open source web and file system crawler which indexes the text content of discovered files so they can be retrieved and analyzed. It provides simple scanner capabilities as part of larger enterprise search solutions.
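The scanner's core idea (walk a directory tree, read the text content of each file, and build an index so documents can be retrieved later) can be illustrated with a minimal Java sketch. The class and method names below are illustrative only, not EssentialScanner's actual API.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.Stream;

// Minimal sketch of a file-system text indexer: walks a directory tree,
// tokenizes readable text files, and builds an in-memory inverted index.
public class FileSystemIndexer {
    private final Map<String, Set<Path>> invertedIndex = new HashMap<>();

    public void indexTree(Path root) throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile)
                 .filter(p -> p.toString().endsWith(".txt"))
                 .forEach(this::indexFile);
        }
    }

    private void indexFile(Path file) {
        try {
            String content = Files.readString(file).toLowerCase();
            for (String token : content.split("\\W+")) {
                if (!token.isEmpty()) {
                    invertedIndex.computeIfAbsent(token, k -> new HashSet<>()).add(file);
                }
            }
        } catch (IOException ignored) {
            // Unreadable or binary file: skip it.
        }
    }

    public Set<Path> search(String term) {
        return invertedIndex.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) throws IOException {
        FileSystemIndexer indexer = new FileSystemIndexer();
        indexer.indexTree(Paths.get(args.length > 0 ? args[0] : "."));
        System.out.println(indexer.search("crawler"));
    }
}
```

A production scanner would persist the index and handle many more file formats; the in-memory map here only shows the crawl-then-index flow.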
Ex-Crawler is divided into three subprojects (crawler daemon, distributed GUI client, and a web search engine) which together provide a flexible and powerful search engine supporting distributed computing. More information: http://ex-crawler.sourceforge.net
It is a Java-based Extract-Transform-Load (ETL) tool with the following features: 1. It can move data from any source to any destination you can think of, for example from a web crawler to a database or filesystem. 2. It is multithreaded.
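A multithreaded source-to-sink ETL pipeline of this kind is typically built as a producer-consumer setup: one thread extracts records, a pool of workers transforms and loads them. The interfaces below are hypothetical and only sketch the pattern; they are not the project's actual API.

```java
import java.util.List;
import java.util.concurrent.*;

// Sketch of a multithreaded extract-transform-load pipeline: one thread extracts
// records from a source, a pool of workers transforms and loads them into a sink.
public class MiniEtl {
    interface Source { List<String> extract(); }
    interface Sink   { void load(String record); }

    private static final String POISON = "__END__";

    public static void run(Source source, Sink sink, int workers) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(1000);
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    for (String record = queue.take(); !record.equals(POISON); record = queue.take()) {
                        sink.load(record.trim().toUpperCase()); // stand-in "transform" step
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        for (String record : source.extract()) {
            queue.put(record);
        }
        for (int i = 0; i < workers; i++) {
            queue.put(POISON); // one poison pill per worker shuts the pool down cleanly
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }

    public static void main(String[] args) throws InterruptedException {
        run(() -> List.of(" alpha ", " beta ", " gamma "),
            record -> System.out.println("loaded: " + record),
            4);
    }
}
```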
This project is a Java web spider (web crawler) with the ability to download (and resume) files. It is also highly customizable with regular expressions and download templates. All backend functionality is also available as a separate library.
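Resumable downloads are normally implemented with an HTTP Range request: if a partial file already exists locally, the client asks the server to continue from the current file length. The following is a hedged sketch of that technique, not this library's actual interface.

```java
import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of a resumable download using an HTTP Range request: if a partial
// file exists locally, ask the server to send only the remaining bytes.
public class ResumableDownload {
    public static void download(String url, File target) throws IOException {
        long existing = target.exists() ? target.length() : 0;
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        if (existing > 0) {
            conn.setRequestProperty("Range", "bytes=" + existing + "-");
        }
        int status = conn.getResponseCode();
        // 206 Partial Content means the server honoured the range; otherwise start over.
        boolean append = status == HttpURLConnection.HTTP_PARTIAL;
        try (InputStream in = conn.getInputStream();
             OutputStream out = new FileOutputStream(target, append)) {
            byte[] buffer = new byte[8192];
            for (int n; (n = in.read(buffer)) != -1; ) {
                out.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Example URL and filename are placeholders.
        download("https://example.org/big-file.zip", new File("big-file.zip"));
    }
}
```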
An agent-based regional crawler strategy implementation: it gathers users' common needs and interests in a certain domain and crawls based on those interests, instead of crawling the web without any predefined order.
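The difference from undirected crawling can be sketched as a priority frontier that scores candidate URLs against a profile of user interests and always expands the highest-scoring one first. The scoring rule and names below are illustrative assumptions, not the project's strategy.

```java
import java.util.*;

// Sketch of an interest-driven crawl frontier: instead of visiting URLs in
// discovery order, score each URL against a profile of user interests and
// always expand the highest-scoring one first.
public class InterestFrontier {
    private final Set<String> interests;
    private final PriorityQueue<String> frontier;
    private final Set<String> seen = new HashSet<>();

    public InterestFrontier(Set<String> interests) {
        this.interests = interests;
        this.frontier = new PriorityQueue<>(Comparator.comparingInt(this::score).reversed());
    }

    // Crude relevance score: number of interest keywords appearing in the URL.
    private int score(String url) {
        int s = 0;
        for (String interest : interests) {
            if (url.toLowerCase().contains(interest)) s++;
        }
        return s;
    }

    public void offer(String url) {
        if (seen.add(url)) frontier.add(url);
    }

    public String next() {
        return frontier.poll();
    }

    public static void main(String[] args) {
        InterestFrontier f = new InterestFrontier(Set.of("climate", "energy"));
        f.offer("http://example.org/sports/scores");
        f.offer("http://example.org/climate/energy-report");
        f.offer("http://example.org/climate/news");
        System.out.println(f.next()); // the climate/energy page comes out first
    }
}
```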
This is a simple web crawler for Facebook (TM) written in Java. The crawler surfs public user pages (meaning you do not need to provide an account) to reconstruct the friendship graph for further study and analysis.
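Reconstructing a friendship graph from public pages boils down to a breadth-first traversal: visit a profile, record an edge for every friend link found, and enqueue unseen friends. The page-fetching step below is a placeholder, not the project's actual parser.

```java
import java.util.*;

// Sketch of friendship-graph reconstruction by breadth-first traversal of
// public profile pages. fetchFriendIds is a placeholder for the real
// page-download-and-parse step.
public class FriendGraphCrawler {
    // Adjacency list: profile id -> ids of friends discovered on its public page.
    private final Map<String, Set<String>> graph = new HashMap<>();

    // Placeholder: the real crawler downloads the public profile page and
    // extracts friend links with an HTML parser or regular expressions.
    private Set<String> fetchFriendIds(String profileId) {
        return Collections.emptySet();
    }

    public Map<String, Set<String>> crawl(String seedProfile, int maxProfiles) {
        Deque<String> queue = new ArrayDeque<>(List.of(seedProfile));
        Set<String> visited = new HashSet<>();
        while (!queue.isEmpty() && visited.size() < maxProfiles) {
            String current = queue.poll();
            if (!visited.add(current)) continue;
            Set<String> friends = fetchFriendIds(current);
            graph.put(current, friends);
            for (String friend : friends) {
                if (!visited.contains(friend)) queue.add(friend);
            }
        }
        return graph;
    }
}
```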
A Web crawler prototype designed to index pages of certain resource sharing platforms based on folksonomy tags. The results are displayed in an Excel spreadsheet.
MuSE-CIR is a Multigram-based Search Engine and Collaborative Information Retrieval system. Written in Java/JSP, it supports any JDBC-connectable database; it has been thoroughly tested only with Oracle XE, and somewhat with MySQL, running JSP on Apache Tomcat 5.5.
Web-as-corpus tools in Java.
* Simple Crawler (and also integration with Nutch and Heritrix)
* HTML cleaner to remove boilerplate code (see the sketch after this list)
* Language recognition
* Corpus builder
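One simple way to strip boilerplate from fetched HTML, as the cleaner above aims to do, is to drop the elements that rarely carry body text before extracting the remaining plain text. The sketch below uses the jsoup library purely as an assumption for illustration; the project may ship its own cleaner.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Rough boilerplate stripper: parse the HTML, drop elements that rarely carry
// body text (navigation, scripts, footers), then keep the remaining plain text.
public class BoilerplateStripper {
    public static String clean(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("script, style, nav, header, footer, aside, form").remove();
        return doc.body().text();
    }

    public static void main(String[] args) {
        String html = "<html><body><nav>Home | About</nav><p>Actual corpus text.</p>"
                    + "<footer>Copyright</footer></body></html>";
        System.out.println(clean(html)); // prints: Actual corpus text.
    }
}
```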
nxs crawler is a program that crawls the internet. The program generates random IP addresses and attempts to connect to the corresponding hosts; if a host answers, the result is saved in an XML file. After that the crawler disconnects... Additionally you can
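The loop described above (generate a random IPv4 address, attempt a connection, record answering hosts) can be sketched as follows. The port, timeout, and XML output format are invented for illustration and are not taken from the project.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Random;

// Sketch of the random-IP probing loop: pick a random IPv4 address, try to
// open a TCP connection on port 80, and record answering hosts as XML lines.
public class RandomIpProbe {
    public static void main(String[] args) {
        Random random = new Random();
        System.out.println("<hosts>");
        for (int i = 0; i < 100; i++) {
            String ip = (1 + random.nextInt(223)) + "." + random.nextInt(256) + "."
                      + random.nextInt(256) + "." + random.nextInt(256);
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(ip, 80), 500); // 500 ms timeout
                System.out.println("  <host ip=\"" + ip + "\" reachable=\"true\"/>");
            } catch (IOException e) {
                // No answer within the timeout: skip this address.
            }
        }
        System.out.println("</hosts>");
    }
}
```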
The Java Sitemap Parser can parse a website's Sitemap (http://www.sitemaps.org/). This is useful for web crawlers that want to discover URLs from a website that is using the Sitemap Protocol.
This project has been incorporated into crawler-commons (https://github.com/crawler-commons/crawler-commons) and is no longer being maintained.
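The Sitemap protocol itself is a small XML format: a <urlset> containing <url> entries, each with a <loc> element holding a page URL. What such a parser does for a crawler can be sketched with the JDK's DOM parser; this is a simplified illustration (it ignores sitemap index files and extensions) and is not the crawler-commons API.

```java
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Simplified sitemap reader: fetch a sitemap.xml and collect the <loc> URLs
// that a crawler would add to its frontier.
public class SimpleSitemapReader {
    public static List<String> readLocs(String sitemapUrl) throws Exception {
        List<String> urls = new ArrayList<>();
        try (InputStream in = new URL(sitemapUrl).openStream()) {
            Document doc = DocumentBuilderFactory.newInstance()
                                                 .newDocumentBuilder()
                                                 .parse(in);
            NodeList locs = doc.getElementsByTagName("loc");
            for (int i = 0; i < locs.getLength(); i++) {
                urls.add(locs.item(i).getTextContent().trim());
            }
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        // Example sitemap URL is a placeholder.
        readLocs("https://example.org/sitemap.xml").forEach(System.out::println);
    }
}
```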
A Java game that was developed for a class project. The original intention was to make it similar to Secret of Mana, but it became more of a dungeon crawler. (8/15/09) Development slowed over the summer; we should be resuming shortly.
The project aims to develop a system consisting of a crawler, a user interface, and a database that allows users to obtain research papers in PDF format from any domain and carry out analysis on them.
Retriever is a simple crawler packaged as a Java library that allows developers to collect and manipulate documents reachable over a variety of protocols (e.g. HTTP, SMB). You can easily crawl documents shared on a LAN, on the Web, and from many other sources.
The DeDuplicator is an add-on module (plug-in) for the web crawler Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls.
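Duplicate reduction across snapshot crawls is commonly done by comparing a digest of each newly fetched document against the digest recorded for the same URL in an earlier crawl, and skipping storage when they match. The following is a hedged sketch of that idea, not the DeDuplicator's actual interfaces.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Sketch of digest-based duplicate detection across snapshot crawls: a document
// is stored only if its content hash differs from the hash recorded for the
// same URL in a previous crawl.
public class SnapshotDeduplicator {
    private final Map<String, String> previousDigests = new HashMap<>(); // URL -> SHA-1 from last crawl

    public boolean isDuplicate(String url, byte[] content) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        StringBuilder hex = new StringBuilder();
        for (byte b : sha1.digest(content)) {
            hex.append(String.format("%02x", b));
        }
        String digest = hex.toString();
        boolean duplicate = digest.equals(previousDigests.get(url));
        previousDigests.put(url, digest); // remember for the next snapshot
        return duplicate;
    }

    public static void main(String[] args) throws Exception {
        SnapshotDeduplicator dedup = new SnapshotDeduplicator();
        byte[] page = "<html>unchanged page</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(dedup.isDuplicate("http://example.org/", page)); // false: first crawl
        System.out.println(dedup.isDuplicate("http://example.org/", page)); // true: unchanged on re-crawl
    }
}
```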
LogCrawler is an ANT task for automatic testing of web applications. Using an HTTP crawler, it visits all pages of a website and checks the server logfiles for errors. Use it as a "smoketest" with your CI system, such as CruiseControl.
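The smoketest idea (request every page, then scan the server log for errors and fail the build if any are found) can be sketched outside of ANT as below. The page list, log path, and log format are placeholders, not LogCrawler's configuration.

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Sketch of the smoketest idea: request a set of pages, then scan the server
// logfile for error entries and fail loudly if any are found.
public class SmokeTest {
    public static void main(String[] args) throws Exception {
        List<String> pages = List.of("http://localhost:8080/", "http://localhost:8080/about");
        for (String page : pages) {
            HttpURLConnection conn = (HttpURLConnection) new URL(page).openConnection();
            int status = conn.getResponseCode();
            if (status >= 400) {
                throw new IllegalStateException(page + " returned HTTP " + status);
            }
        }
        long errors = Files.lines(Paths.get("/var/log/myapp/server.log"))
                           .filter(line -> line.contains("ERROR") || line.contains("Exception"))
                           .count();
        if (errors > 0) {
            throw new IllegalStateException(errors + " error lines found in server log");
        }
        System.out.println("Smoketest passed: all pages reachable, no errors logged.");
    }
}
```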