Provides a robust and efficient implementation of n-gram based classifiers for Java. N-gram algorithms have proven surprisingly good at tasks like guessing the language or encoding of an arbitrary text file, and there are many more applications.
EasyGIS simplifies GIS data management, sharing, and publishing. REST interfaces (json, html views). Lucene based FTS searches. Thematic maps, business cartography. Integration with external GIS data providers - Google, OSM.
nxs crawler is a program that crawls the internet. It generates random IP addresses and attempts to connect to the hosts. If a host answers, the result is saved in an XML file. After that, the crawler disconnects... Additionally you can
This is an ***old archive*** of tools developed for facilitating the use of Creative Commons licenses and metadata. --- For the most up to date representation of any of the projects listed here, please see: http://creativecommons.org/project/Developer.
AIS (Associative Indexing Service) is an application for storing bookmarks and memos and for indexing large (lifetime) archives, giving fast future access to the data by (personalized) keywords. In other words, it is an extension of human associative memory :)
JeCARS (Java Extendable Contents And Rights System) is a RESTful webservice which delivers pluggable output formats, e.g. Atom feeds or HTML.
Third party applications can be plugged in.
A JCR (JSR-170) repository (Jackrabbit) is used for storage.
A Semantic Web implementation using a native XML database as backend storage. It includes a Java compiler from SPARQL to XQuery built with Jena, and XQuery scripts for the native XML database Sedna (http://modis.ispras.ru/sedna/).
iVia is an Internet subject portal or virtual library system. It is a hybrid expert-built and machine-built collection creation and management system: resources can be crawled, and metadata and selected full text can be automatically generated or extracted.
The Cornell Web Lab Collaboration Server is a suite of tools and services for GUI-based extraction, analysis and sharing of archived web data. See http://weblab.infosci.cornell.edu/ and http://www.cs.cornell.edu/~weigel for details about the project.
Narrows the search results produced by popular Internet search engines by letting you add extra filtering conditions, such as requiring certain words to be present or excluded.
The Javen library is a framework for developing C++ applications simply, with an API similar to the Java library. The Hawk search engine is a software platform used to build vertical search products more easily for medium-sized companies or end users.
A fusion of several open-source libraries and a web application to parse and filter RSS feeds, as well as generate RSS feeds based on user-defined search terms.
Contineo is a Web-based Document Management System (DMS). Features: folder organization, document versioning, bulk import, import from mailbox. NOTE: this project has been DISCONTINUED in favor of LogicalDOC http://sourceforge.net/projects/logicaldoc
FlixFinder: a TiVo & Netflix marriage. Automatically finds and schedules upcoming movies in cable/satellite listings based on your Netflix queue. Now a Greasemonkey script. (The original project is deprecated since the TV listings are no longer available.)
S3B - Social Semantic Search and Browsing - is middleware that provides a set of search and browsing components that can be used in J2EE web applications to deliver user-oriented features based on semantic descriptions and social networking.
DLESE (Digital Library for Earth System Education) is a community-supported digital library dedicated to the collection, enhancement, and distribution of materials that facilitate learning about the Earth. Sponsored by the US National Science Foundation.
The complete suggestions framework for Java, supporting single- and multi-field suggest, a Java suggest box, client/server with Hessian or JSON-RPC, a GWT AJAX suggest box, and phonetic plugins. Proven high performance for data sets > 1 million.
Info Hub is an open source, web-based data/information repository and search engine. It supports browsing and keyword search of documents and outside links. It is a great solution for project-related data management.
vbullmin is a data-mining bot for vBulletin boards. It can retrieve all forums, topics, posts, and users from a vBulletin board and export these values using the phpBB2 database schema. It is a sample for machine learning and uses patterns to extract the data.
FathomFive is a classification-aware, Lucene-powered spidering and indexing solution, written in pure Java. It supports a variety of content types and provides an easy-to-use admin interface and a customisable search interface. It spiders from HTTP and OAI.
Simple Porn Downloader is a tiny, all-Java application that uses a list of keywords and starting URLs to crawl web pages and branch out, searching for specific media extensions, which are downloaded and presented in an HTML page.
InfoCrawler allows you to crawl and index various types of documents, accessing data from various resources: intranets, public web sites, and local or remote file systems. For product information please see our website at http://www.infocrawler.org/
OpenOffice Search is a document indexer and search engine for OpenOffice documents. It is Java-based, so it will run on any J2SE enabled platform and uses an embedded Derby (Cloudscape) database.
Visualization of the contact network and user data from the popular business network XING.com. The web-based software can be used by any registered XING user.