Search engine and data mining applications and ClueWeb datasets.
The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software, including the Indri search engine in C++, the Galago search engine research framework in Java, the RankLib learning-to-rank library, the ClueWeb09 and ClueWeb12 datasets, and the Sifaka data mining application.
PDFBox is a Java PDF Library. This project will allow access to all of the components in a PDF document. More PDF manipulation features will be added as the project matures. This ships with a utility to take a PDF document and output a text file.
The stuff here has no documentation and some of it may never be completed. This is my playground, use at your own risk.
Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems (file systems, web sites, mail boxes, ...) and the file formats (documents, images, ...) occurring in these systems.
PHP Crawler is a simple website search script for small-to-medium websites. The only requirements are PHP and MySQL; no shell access is required.
This was a UI course project in which we built an interface prototype of an online travel reservation system. The service was meant to revolutionize the travel industry in several ways for occasional travelers as well as for large businesses.
WebExtractor360 is a free and open source web data extractor. It uses Regular Expressions to find, extract and scrape internet data quickly and easily.
Project moved to GitHub! https://github.com/carrot2/carrot2 Carrot2 is an Open Source Search Results Clustering Engine. It can automatically organize small collections of documents, e.g. search results, into thematic categories. Carrot2 integrates very well with both Open Source and proprietary search engines.
WebExtractor360 is a free and open source web data extractor. It allows you to extract Images, Phrases, URLs (Links), URLs (Keywords), Emails, Phone, Fax and ANY other information on the web by specifying a Regular Expression. See http://www.webextractor
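As a rough illustration of regular-expression-based extraction of the kind described above, here is a minimal Python sketch. The patterns, the `extract` helper, and the sample HTML are all assumptions made for this example; they are deliberately simple and are not WebExtractor360's own expressions or code.

```python
import re

# Hypothetical, intentionally simple patterns (not taken from WebExtractor360).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def extract(pattern, text):
    """Return the unique matches of `pattern` in `text`, in order of first appearance."""
    seen = []
    for match in pattern.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


# Made-up input standing in for a downloaded page.
sample_html = (
    '<p>Contact <a href="mailto:info@example.com">info@example.com</a> '
    'or visit <a href="http://example.com/about">our page</a>.</p>'
)

emails = extract(EMAIL_RE, sample_html)
urls = extract(URL_RE, sample_html)
print(emails, urls)
```

Real-world pages usually need stricter patterns, or an HTML parser for links, since simple expressions like these miss edge cases.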
Retrieve Google Search results, cached web pages and other services using this Java client.
Monitors web pages for changes and emails the differences to subscribers. Supports user accounts and registration. PHP/MySQL.
A collection of DokuWiki plugins that spatially enable a wiki. Currently available: openlayersmap (a map) and geotag (ways of geotagging a page).
XmlTvProducer for PHP is an extensible engine that grabs TV/radio listings from websites and produces XMLTV output. Data distribution for TV-Browser is included. The primary focus is on Slovak and Czech channels, but development is open to anybody.
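The change-detection step of a page monitor like this can be sketched with Python's standard `difflib` module. The `page_diff` helper and the sample page text are hypothetical. Fetching the page and emailing subscribers are left out, and the listed project itself is PHP, so this only illustrates the idea.

```python
import difflib


def page_diff(old_text, new_text):
    """Return unified-diff lines describing how a page changed (empty list if unchanged)."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


# Made-up snapshots of the same page taken at two different times.
old = "Welcome\nOpening hours: 9-5\n"
new = "Welcome\nOpening hours: 10-6\n"

changes = page_diff(old, new)
if changes:
    # A real monitor would send this text to subscribers instead of printing it.
    print("\n".join(changes))
```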
FT3's goal is to add full text index support to sqlite3 databases.
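To show what a full-text index layered on SQLite can look like, here is a minimal Python sketch that keeps an inverted index in ordinary tables. The schema and the `add_document` and `search` helpers are assumptions for illustration; nothing here reflects FT3's actual design or API.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE postings (term TEXT, doc_id INTEGER);
    CREATE INDEX postings_term ON postings (term);
    """
)


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def add_document(body):
    """Store a document and record one posting per distinct term."""
    doc_id = conn.execute("INSERT INTO docs (body) VALUES (?)", (body,)).lastrowid
    conn.executemany(
        "INSERT INTO postings VALUES (?, ?)", [(t, doc_id) for t in tokenize(body)]
    )
    return doc_id


def search(query):
    """Return ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    placeholders = ",".join("?" * len(terms))
    rows = conn.execute(
        f"SELECT doc_id FROM postings WHERE term IN ({placeholders}) "
        "GROUP BY doc_id HAVING COUNT(DISTINCT term) = ? ORDER BY doc_id",
        (*terms, len(terms)),
    )
    return [row[0] for row in rows]


first = add_document("SQLite is a small embedded database")
second = add_document("Full text search over a database")
print(search("database"), search("full text"))
```

The sketch does no ranking, stemming, or phrase matching; it only demonstrates the term-to-document lookup that a full-text index provides.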
A drop-in framework for adding tagging (folksonomy) capabilities to existing applications
QZARCH - Quick free-text search. The project aims to deliver a lightweight, file-based free-text search engine that Java-based websites can adopt easily. The features include: - Search for one or more keywords in the content of one or more files -
A robust, feature-rich, multi-threaded CLI web spider written in Java, using Apache Commons HttpClient v3.0. ASpider downloads any files matching your given MIME types from a website. By default it also tries to match email addresses with a regular expression, logging all results using log4j.
BMW (Bags of Multiple Words) is a project based on Lucene 2.0 that tries to work with query-term dependency. BMW offers a simple method that can be applied to several standard ranking functions to exploit a simple type of term dependency.
A web app for creating a repository of pictures (our focus is birds). Users submit pictures with a wizard that generates RDF descriptors. Submissions are forwarded to admins for approval. Instances will export the RDF so that repositories may cooperate.
Blogometro is a simplified implementation of "Blodgex" that tracks weblog links and updates.
BlueBox is a PHP/MySQL-powered search engine. It can be installed on any web server without special permissions; only FTP and database management rights are required. BlueBox remains very fast even with more than 1'000'000 pages scanned.
The Cornell Web Lab Collaboration Server is a suite of tools and services for GUI-based extraction, analysis and sharing of archived web data. See http://weblab.infosci.cornell.edu/ and http://www.cs.cornell.edu/~weigel for details about the project.
CoverYourASP.com - complete Active Server Pages source (JScript) for this popular web site. Includes full membership system, diary, online db admin, banner ad system and loads more.
Port of the Google sitemap generator, from Python to Csharp aka C-Sharp aka C# aka .NET aka dotNet.
DVDWeb is a web service that provides organization, search, and lookup services through a JAX-RPC API. Searches can be run against the built-in DB (the user's private list of DVDs, keyed by UPC code) or against other Internet sites such as IMDb or Yahoo.