The JSearch Project aims to provide the internet with a Java-based generic interface for search engines. It consists of a core interface, search engine adaptors, a sort/merge module and a JSP-based GUI.
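The sort/merge step mentioned above can be illustrated with a minimal sketch. This is not JSearch's actual API (which is Java-based); the `Result` shape, the `merge_results` function, the two stand-in engines and the example URLs are all hypothetical, chosen only to show how results from several adaptors might be deduplicated and ranked.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical result shape: (url, score). Not JSearch's actual types.
Result = Tuple[str, float]
Adaptor = Callable[[str], List[Result]]


def merge_results(query: str, adaptors: List[Adaptor]) -> List[Result]:
    """Query every adaptor, keep the best score per URL, sort by score descending."""
    best: Dict[str, float] = {}
    for adaptor in adaptors:
        for url, score in adaptor(query):
            if url not in best or score > best[url]:
                best[url] = score
    # Ties on score are broken alphabetically by URL for a stable order.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Two stand-in "engines" returning fixed results (assumed data).
def engine_a(query: str) -> List[Result]:
    return [("http://example.org/a", 0.9), ("http://example.org/b", 0.4)]


def engine_b(query: str) -> List[Result]:
    return [("http://example.org/b", 0.7), ("http://example.org/c", 0.2)]


merged = merge_results("java search", [engine_a, engine_b])
for url, score in merged:
    print(f"{score:.1f}  {url}")
```

Real engines rarely return comparable scores, so a practical merge module would likely need some form of score normalization before ranking.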
Relational storage for tagged documents
Restad is an indexing and querying tool for tagged documents. It uses a relational database for storage and querying. See the latest news on the blog: https://sourceforge.net/p/restad/blog/ The first Ruby prototype can be found at: https://github.com/ymoreau/Restad
Triplify provides a building block for the semantification of Web applications. Triplify is a small plugin for Web applications, which converts database content into RDF or JSON feeds and provides a Linked Data interface.
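As a rough illustration of turning database content into RDF, the sketch below maps rows of a relational table to N-Triples lines. It is not Triplify's implementation or configuration format; the base URI, the `vocab/` predicate scheme, the `post` table and the `rows_to_ntriples` helper are assumptions made for the example.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical base URI for subjects and predicates


def rows_to_ntriples(conn, table, key):
    """Emit one N-Triples line per non-key, non-NULL column of each row."""
    # The table name is interpolated directly: only safe for trusted names.
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    lines = []
    for row in cur:
        record = dict(zip(columns, row))
        subject = f"<{BASE}{table}/{record[key]}>"
        for col, value in record.items():
            if col == key or value is None:
                continue
            # Escape backslashes, quotes and newlines for a plain literal.
            literal = (
                str(value)
                .replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
            )
            lines.append(f'{subject} <{BASE}vocab/{col}> "{literal}" .')
    return lines


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO post VALUES (1, 'Hello')")
triples = rows_to_ntriples(conn, "post", "id")
print("\n".join(triples))
```

Every value is emitted as a plain string literal here; a fuller mapping would type literals and link foreign keys to other resources.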
XmlTvProducer for PHP is an extensible engine to grab TV/radio listings from websites and produce XMLTV output. Data distribution for TV-Browser is included. The primary focus is on Slovak and Czech channels, but development is open to anybody.
ALTSE is an alternative search engine technology. It can index up to a couple million Web pages.
The Anywhere Location Search allows location searches using a wide range of inputs (address, city/state, ZIP code, search string, IP address, landmark name, etc.).
The BeeGram library is a portable open source search engine toolkit written in C. BeeGram provides a number of building blocks for the construction of powerful general-purpose text-based search tools.
Book management system with a web service, written in PHP.
CatMDServices is a Web application for describing and searching web services by means of metadata. Developed by IAAA (Univ. of Zaragoza) and GeoSpatiumLab S.L., sponsored by IGN Spain. Technical details: Java, GWT, XML, multiplatform, multilingual.
Crawl-By-Example runs a crawl that classifies the processed pages by subject and finds the best pages according to examples provided by the operator. Crawl-By-Example is a plugin for the Heritrix crawler and was developed as part of the GSoC06 program.
DOSE: a distributed platform for semantic elaboration that provides semantic services such as automatic annotation of web resources at the document substructure level, semantic search facilities, semantic annotation storage and retrieval.
Data Fountains is an automated collection building system of benefit to Internet portals, digital libraries and library catalogs. Web crawlers find new resources. Text extractors/classifiers create metadata, descriptions, rich full-text. C++.
dCrawler (Distributed Crawler) alias D-HarvestMan (Distributed HarvestMan) is a distributed Web crawler implemented in the Python programming language. dCrawler is developed on top of the existing open source Web crawler named HarvestMan.
Search engine that gives full control over the search result. The user can search by category and then combine previous search results to build complex search results, without the need for an advanced query language.
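One way to picture combining previous search results without a query language is as set operations over saved result sets. The sketch below assumes that model; the category names, document ids and the `combine` helper are hypothetical and do not come from the project.

```python
# Saved results of earlier category searches (assumed data).
saved = {
    "cat:python": {"doc1", "doc2", "doc3"},
    "cat:web": {"doc2", "doc3", "doc4"},
    "cat:archived": {"doc3"},
}


def combine(left, op, right):
    """Combine two saved result sets with a named operation."""
    if op == "and":
        return left & right   # documents in both sets
    if op == "or":
        return left | right   # documents in either set
    if op == "not":
        return left - right   # documents in left but not in right
    raise ValueError(f"unknown operation: {op}")


# Python AND web, excluding archived documents.
step1 = combine(saved["cat:python"], "and", saved["cat:web"])
result = combine(step1, "not", saved["cat:archived"])
print(sorted(result))
```

Each intermediate result is itself a set, so it can be saved and reused in later combinations.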
Event-driven federated search platform to aggregate search results from distributed content providers.
An extensible framework and user interface for combining various structured search and document clustering techniques.
FlexibleShare provides FlexSpaces Alfresco document management, workflow and search in pods with a dashboard-style UI, plus added Flex UI pods (wiki, blog, discussions, calendar, doc lib) for an Alfresco Share back-end. It is based on FlexibleDashboard and supports pluggable pod modules for BI/charting/reporting, etc. Available as an AIR version with desktop file drag/drop, an in-browser version, and a mobile (Android and iOS) version. Downloads and source are now only at http://code.google.com/p/flexibleshare/ Developed by Integrated Semantics: http://integratedsemantics.com blog: http://integratedsemantics.org
Fast local file search using Lucene, HTMLParser and Highlighter. Now supports Chinese.
GImageSpider (GIS) is an image spider with two abilities. GIS can search the web through image search engines to find images. GIS can also act as an image spider that crawls an arbitrary site within your constraints and finds images.
A collection of Java Servlets relating to searching. Use of these servlets should make future transitions between search appliances less painful as well as simplify the query parameters.
Retrieve Google Search results, cached web pages and other services using this Java client.
HORUS is a system for knowledge acquisition, hypothesis generation, inference and learning. It is an interactive Internet environment accessible to a diverse community of users (on a public-access or membership basis). See also the UMKAILASH project for more.
XPath HTML parser
HXPath is a command-line tool for extracting data from HTML documents. HXPath can select subtrees, like the standard xpath tool, but can also read contents and attributes and output them in a bash-friendly format. HTML Tidy and HTTP/HTTPS GET are built in too.
Harvestman is a context-aware metasearch engine which functions as a universal information gatherer and data mining system for the internet.