An object-relational mapping (ORM) library for Java
Hibernate is an Object/Relational Mapping (ORM) tool. It is widely used in Java applications and implements the Java Persistence API. Hibernate ORM enables developers to more easily write applications whose data outlives the application process. As an ORM framework, Hibernate is concerned with data persistence as it applies to relational databases (via JDBC).
Digital Library Software
Greenstone is a complete digital library creation, management and distribution package created and distributed by the New Zealand Digital Library Project. There are two major versions of the software. Greenstone 3 is under active development, and is recommended for download. We also provide maintenance releases for its forerunner, Greenstone 2. Featured download not what you're looking for? Click "Browse all files" to access binaries and source releases of both versions.
Search engine and data mining applications and ClueWeb datasets.
The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software, including the Indri search engine in C++, the Galago search engine research framework in Java, the RankLib learning-to-rank library, the ClueWeb09 and ClueWeb12 datasets, and the Sifaka data mining application.
An open source search engine with RESTful API and crawlers
OpenSearchServer is a powerful, enterprise-class search engine program. Using the web user interface, the crawlers (web, file, database, etc.), and the client libraries (REST API, Ruby, Rails, Node.js, PHP, Perl), you can quickly and easily integrate advanced full-text search capabilities into your application: full-text search with basic semantics, join queries, boolean queries, facets and filters, document (PDF, Office, etc.) indexing, web scraping, etc. OpenSearchServer runs on Windows and Linux/Unix/BSD.
PDFBox is a Java PDF Library. This project will allow access to all of the components in a PDF document. More PDF manipulation features will be added as the project matures. This ships with a utility to take a PDF document and output a text file.
Project moved to GitHub! https://github.com/carrot2/carrot2 Carrot2 is an Open Source Search Results Clustering Engine. It can automatically organize small collections of documents, e.g. search results, into thematic categories. Carrot2 integrates very well with both Open Source and proprietary search engines.
The MangaStream Downloader is an open source application written in Java for managing and downloading manga from the sites mangastream.com and mangafox.me. It is released under the GNU GPL and uses an open source HTML parser, TagSoup. Follow the project page on Facebook for updates: https://www.facebook.com/MangastreamDownloader
Archive your personal history
ResCarta Toolkit offers an open source solution for creating, storing, viewing, and searching digital collections. Applications in the toolkit let users create and edit metadata, convert data to the open standard ResCarta format, and index and host collections.
NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces.
A search application to watch and download movies and TV shows
A federated search desktop application to read about, preview, watch, and download movie and television titles that are being shared online.
The Wikipedia Miner toolkit provides simplified access to Wikipedia. This open encyclopedia represents a vast, constantly evolving multilingual database of concepts and semantic relations; a promising resource for NLP and related research.
Regain is a Java search engine based on Jakarta Lucene. It indexes and searches files in many formats (HTML, XML, doc(x), xls(x), ppt(x), oo, PDF, RTF, mp3, mp4, Java). A TagLibrary eases integrating search results into your JSP-based web page.
cpDetector is a proxy for codepage detection of documents. It delegates to multiple instances that try to detect the codepage by different techniques. It ships with a command line executable that sorts documents by codepage.
TouchGraph provides a set of interfaces for graph visualization using force-based layout and focus+context techniques. For now only older code is available, but we are planning to release new versions as well.
A Java implementation of a flexible and extensible web spider engine. Optional modules allow functionality to be added (searching dead links, testing the performance and scalability of a site, creating a sitemap, etc.).
Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems (file systems, web sites, mail boxes, ...) and the file formats (documents, images, ...) occurring in these systems.
Free tool that extracts emails, phone numbers, and custom text from the Web using Java regex
The Files section contains WebCrawlerMySQL.jar, which supports MySQL connections. Please follow this link to get the latest version: https://sourceforge.net/projects/web-spider-web-crawler-extract/ Free Web Spider & Crawler. Extracts information from the Web by parsing millions of pages. Stores data in a Derby or MySQL database, so data is not lost if the spider is force-closed. - Free web spider, parser, extractor, crawler - Extraction of emails, phone numbers, and custom text from the Web - Export to Excel file - Data saved into a Derby database - Written in Java, cross-platform See also Free Email Sender: https://sourceforge.net/projects/gitst-free-email-ender/
Free tool that extracts emails, phone numbers, and custom text from the Web using Java regex
The Files section contains WebCrawlerMySQL.jar, which supports MySQL connections. Free Web Spider & Crawler. Extracts information from the Web by parsing millions of pages. Stores data in a Derby database, so data is not lost if the spider is force-closed. - Free web spider, parser, extractor, crawler - Extraction of emails, phone numbers, and custom text from the Web - Export to Excel file - Data saved into a Derby or MySQL database - Written in Java, cross-platform See also Free Email Sender: https://sourceforge.net/projects/gitst-free-email-ender/
Classifier4J is a Java library that provides an API for automatic classification of text. The default (and only current) implementation of this API is a Bayesian classifier. This library can be used for multiple purposes - as a spam filter or a blog cl
OpenEphyra is an open framework for question answering (QA). It retrieves answers to natural language questions from the Web and other sources. Visit http://www.ephyra.info/ for more details and information on joining this open research initiative.
The "Netiquette abolishment project"! Replace content RATING by content POSITIONING. This project's main goal is to create an online 'real place': It will work like a 3D visualisation software: you select your interests by getting close to them,
This is the official collaborative development environment of the Large Knowledge Collider (LarKC), a platform for massive distributed reasoning that aims to remove the scalability barriers of currently existing reasoning systems for the Semantic Web
Digital Library Search Engine
SeerSuite is an application toolkit for digital libraries and search engines, e.g., CiteSeerX. CiteSeerX has moved to GitHub; please get the latest code from: https://github.com/SeerLabs/CiteSeerX
Imgur Gallery Downloader
Users can now search Imgur for any phrase and ImgurDL/Loadur will automatically search for matching images. ImgurDL/Loadur will download the images while displaying the progress to the user.
Web Search by the people, for the people
YaCy is a free search engine that anyone can use to search the internet (www and ftp) or to create a search portal for others (internet or intranet). The scale of YaCy is limited only by the number of users, and it can index billions of web pages. In P2P mode it is fully decentralized: all users of the search engine network are equal, and it is not possible for anyone to censor the content of the distributed index.