Search engine and data mining applications and ClueWeb datasets.
The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software. These include the Indri search engine in C++, the Galago search engine research framework in Java, the RankLib learning-to-rank library, the ClueWeb09 and ClueWeb12 datasets, and the Sifaka data mining application.
Imgur Gallery Downloader
Users can now search Imgur for any phrase and ImgurDL/Loadur will automatically search for matching images. ImgurDL/Loadur will download the images while displaying the progress to the user.
Simple Porn Downloader is a tiny, all-Java application that uses a list of keywords and starting URLs to crawl web pages and branch out, searching for specific media extensions, which are downloaded and presented in an HTML page.
Desk.Now is a cross-platform Java client for the WhereIsNow WebService, which lets you find out where the latest version of a document is with just two clicks.
IRToolkit is an attempt to build and develop a generic search engine that integrates state-of-the-art Information Retrieval (IR) models. It can also compare the performance (in terms of precision, recall, index size, search response time and so on) of several open source IR applications. If you use the IRToolkit, please cite the following work: https://sites.google.com/site/dinhbaduy/bibtex#Dinh-Phdthesis-2012
A program for web search restricted to specific sites and periods of time, both defined by the user. It uses http://www.yandex.ru/ and http://www.google.ru. Searching is achieved by redirecting the search query to these search services. In other words, Bolter is a wrapper around existing search services. Visit http://vk.com/bolter_app for more info.
The goal of bookman is to implement a network-based service for managing and distributing bookmarks transparently from a central server to any bookman-enabled client software (currently focusing on Mozilla, IE and Opera).
Analysis and visualization of the social structure of "lastfm.de", based on user data, friends lists, groups, group members and musical neighbours.
Craigslist Scanner is an application that parses the HTML of the for-sale ads on craigslist to help you find the items you are looking for as soon as they come up. It can send you an e-mail whenever it finds a new item.
Fire.now is a Firefox plugin that automatically adds your documents to the WhereIsNow latest version discovery service. Every time you upload a document somewhere, Fire.now integrates the WhereIsNow keys into the file and adds its URL to WhereIsNow.
An application used to search various web-based genealogy sites simultaneously and review and analyse the data gathered.
IGLU is a Java class library designed to facilitate sharing of code among Artificial Intelligence/Information Retrieval researchers to illustrate how various problems can be solved in Java. It is developed and maintained by the IGLU Research Group.
A web crawler which uses regular expressions on text downloaded from a site.
The LEADERS toolkit is a generic toolset for creating an online environment that integrates EAD finding aids and EAC authority records with TEI transcripts and digitised images of archival material, suitable for a wide variety of archives.
OMax is a set of projects including a real estate crawler and a management system.
OpenSiteSearch is the new Open Source version of OCLC's original Java-based web application for building Z39.50 portals (i.e. virtual union catalogues). This project is specifically aimed at the library community.
The 221BoT (SherlockBoT) was created to address the high resource requirements of successful web crawling. It is a practical distributed web crawler. Blog: http://www.221BoT.BlogSpot.com Home Page: http://www.221BoT.com
TUSeKe: a supporting platform for Chinese text categorization technology research
The Web Search Ajax-like result Portal Framework Lib (WASP-lib) is based on an innovative concept: building a Web 2.0 style sortable index page from the original search result UI. It aims to make the normal web search result UI considerably more usable.
A hypertext browser written in Java which filters links (e.g. emails, docs or pics) out of HTML documents and paints them on screen in hierarchical order. Users get a quick overview of how a website is put together.
contentix - open source content management system
contentix is a CMS and a framework for developing any personalized browser-based application. It uses XML to store data in a media-neutral way and XSL to generate output. Check out the demo website in the downloads.
Analysis and interactive visualization of a web-based community. Supports different focuses on the given social network to present community groups to the user. Specific information about each member is also provided.
eXhaustive is search software that crawls the Internet to answer a specific query. It needs to run for between 1 hour and 1 day, and in return gives the user highly pertinent results plus an analysis of all the data downloaded (tonality / related words / ... )
Simple application for downloading pictures from Zerochan.net
Simple Java application for downloading high-quality pictures from Zerochan.net. You can find images by size or by tag. It's simple. And flat. All you need to do is download the .jar file and run it with the Oracle JVM (or any other JVM that supports image decoding).
WebCommic (newest picture) > PDF converter (history)
Geek & Poke Atom / nichtlustig.de > PDF converter (versioning style)