An open source search engine with a RESTful API and crawlers
OpenSearchServer is a powerful, enterprise-class search engine. Using the web user interface, the crawlers (web, file, database, etc.), and the client libraries (REST API, Ruby, Rails, Node.js, PHP, Perl), you can quickly and easily integrate advanced full-text search capabilities into your application: full-text search with basic semantics, join queries, boolean queries, facets and filters, document (PDF, Office, etc.) indexing, web scraping, etc. OpenSearchServer runs on Windows and Linux/Unix/BSD.
Web data extraction (web data mining, web scraping) tool. It leverages well-proven XML and text processing technologies to easily extract useful data from arbitrary web pages.
Free: Extracts Emails, Phones and Custom Text from the Web Using Java Regex
The Files section contains WebCrawlerMySQL.jar, which supports MySQL connections. Follow this link to get the latest version: https://sourceforge.net/projects/web-spider-web-crawler-extract/
Free Web Spider & Crawler. Extracts information from the Web by parsing millions of pages. Data is stored in a Derby or MySQL database and is not lost if the spider is force-closed.
- Free Web spider, parser, extractor, crawler
- Extraction of emails, phones, and custom text from the Web
- Export to Excel file
- Data saved into Derby database
- Written in Java, cross-platform
See also Free Email Sender: https://sourceforge.net/projects/gitst-free-email-ender/
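As a rough illustration of regex-based extraction of emails and phone numbers from page text, here is a minimal Java sketch. It is not the project's code: the class name `ContactExtractor` and both patterns are assumptions chosen for illustration, and the patterns are deliberately simple, so they will miss or over-match some real-world formats.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the listed project.
public class ContactExtractor {
    // Simplified email pattern (assumption): local part, "@", domain, dot, TLD of 2+ letters.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Simplified phone pattern (assumption): optional "+", a digit, at least six
    // digits/separators, and a final digit.
    private static final Pattern PHONE =
        Pattern.compile("\\+?\\d[\\d\\s().-]{6,}\\d");

    // Collects every non-overlapping match of the pattern in the text, in order.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static List<String> emails(String text) {
        return findAll(EMAIL, text);
    }

    public static List<String> phones(String text) {
        return findAll(PHONE, text);
    }

    public static void main(String[] args) {
        String page = "Contact: sales@example.com or call +1 555-010-9999.";
        System.out.println(emails(page));
        System.out.println(phones(page));
    }
}
```

A custom-text extractor could reuse `findAll` with a user-supplied `Pattern`.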
Imgur Gallery Downloader
Users can now search Imgur for any phrase and ImgurDL/Loadur will automatically search for matching images. ImgurDL/Loadur will download the images while displaying the progress to the user.
Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems (file systems, web sites, mail boxes, ...) and the file formats (documents, images, ...) occurring in these systems.
Classifier4J is a Java library that provides an API for automatic classification of text. The default (and only current) implementation of this API is a Bayesian classifier. This library can be used for multiple purposes - as a spam filter or a blog cl
A torrent search engine plugin for the Azureus/Vuze bittorrent platform.
OpenEphyra is an open framework for question answering (QA). It retrieves answers to natural language questions from the Web and other sources. Visit http://www.ephyra.info/ for more details and information on joining this open research initiative.
DOSE: a distributed platform for semantic elaboration that provides semantic services such as automatic annotation of web resources at the document substructure level, semantic search facilities, semantic annotation storage and retrieval.
Java API for creating Rich Site Summary (RSS) feed files. Created for people who want to create RSS files from within their applications but don't want to get into the nitty gritty of working out XML specs.
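To show what creating an RSS file from within an application involves, here is a minimal Java sketch that builds an RSS 2.0 document by hand. It does not use the listed library's API: the class `SimpleRssFeed` and its methods are hypothetical, and only the three required channel elements (title, link, description) plus item title and link are emitted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class for illustration; not the API of the listed library.
public class SimpleRssFeed {
    private final String title;
    private final String link;
    private final String description;
    private final List<String[]> items = new ArrayList<>(); // each entry: {title, link}

    public SimpleRssFeed(String title, String link, String description) {
        this.title = title;
        this.link = link;
        this.description = description;
    }

    public SimpleRssFeed addItem(String itemTitle, String itemLink) {
        items.add(new String[] {itemTitle, itemLink});
        return this;
    }

    // Escapes the characters that are unsafe in XML element content.
    // "&" must be replaced first so that later entities are not double-escaped.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public String toXml() {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<rss version=\"2.0\">\n<channel>\n");
        sb.append("<title>").append(escape(title)).append("</title>\n");
        sb.append("<link>").append(escape(link)).append("</link>\n");
        sb.append("<description>").append(escape(description)).append("</description>\n");
        for (String[] item : items) {
            sb.append("<item>\n");
            sb.append("<title>").append(escape(item[0])).append("</title>\n");
            sb.append("<link>").append(escape(item[1])).append("</link>\n");
            sb.append("</item>\n");
        }
        sb.append("</channel>\n</rss>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        String xml = new SimpleRssFeed("News", "http://example.com/", "Example feed")
            .addItem("Q&A session", "http://example.com/qa")
            .toXml();
        System.out.print(xml);
    }
}
```

A dedicated library would typically also handle dates, encodings, and the optional channel and item elements that this sketch leaves out.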
Digital Library Search Engine
SeerSuite is an application toolkit for digital libraries and search engines; i.e., CiteSeerX. CiteSeerX has moved to GitHub, please get the latest code from: https://github.com/SeerLabs/CiteSeerX
Command-line application written in Java for automating downloads and filtering the contents of downloaded files. jDownloader uses a simple script file to configure the downloading and filtering processes.
Oxyus is an open source search engine written in 100% Java that aims to make it easy to add a search button to your website. Oxyus uses Apache Lucene for indexing, Quartz for scheduling, and other interesting software products.
EasyGIS simplifies GIS data management, sharing, and publishing. REST interfaces (JSON, HTML views). Lucene-based full-text search (FTS). Thematic maps, business cartography. Integration with external GIS data providers - Google, OSM.
Web based RSS Search Engine that learns user preferences to return results. Demo available at http://ec2-50-16-215-243.compute-1.amazonaws.com/
A Solr-based Semantic MediaWiki store
Dr. Michael Kay: "Saxon 8.7 is the first release to be released simultaneously by Saxonica on the Java and .NET platforms." MDP: Mission accomplished! Saxon for the .NET platform from Saxonica is now available and supported via the http://saxon.sf.net
This package contains different tools to add NLP capabilities to Lucene 4.x (it has been tested with Lucene versions 4.6.x to 4.8.1). Although it was originally developed for German, it is mostly language independent. It allows the user to lemmatize words to be indexed, to weight terms by their parts of speech (e.g. weighting nouns more heavily than pronouns), and to add synonyms taken from GermaNet or from a list you provide to the search index, thereby increasing the recall of Lucene.
Framework for text mining, data integration and data analysis. Keywords: ontology and graph alignment, relation mining, warehouse, semantic database integration, bioinformatics, systems biology, microarray, Java.
RSS EXTRACTOR is a Java library that generates RSS newsfeeds from the RSS web feeds of multiple websites. It extracts the best newsfeed entries and produces an RSS file that merges newsfeed entries from several websites.
This is an ***old archive*** of tools developed for facilitating the use of Creative Commons licenses and metadata. --- For the most up to date representation of any of the projects listed here, please see: http://creativecommons.org/project/Developer.
NICE is a high-speed open source FTP search engine written in 100% Java. It requires no database and runs on any web container such as Tomcat. It uses Struts, Lucene, and Quartz and provides a dynamic AJAX-based web interface and control panel.
The Informa library provides a convenient Java API for handling news channels and metadata about them. Different syntax formats (RSS 0.91, 1.0, 2.0 and Atom 0.3, 1.0) for feeds are supported. Also support for channel information descriptions (OPML) avail
Narrows search results produced by popular Internet search engines by letting you add extra filtering conditions, such as requiring that certain words be present or that certain words be excluded.
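The kind of post-filtering described above can be sketched in Java as follows. This is an assumption about how such a filter might work, not the project's actual implementation: the class `ResultFilter` is hypothetical, results are treated as plain strings, and matching is a case-insensitive substring test rather than true word matching.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not taken from the listed project.
public class ResultFilter {
    // Keeps only results that contain every required term and none of the excluded terms.
    // Matching is a case-insensitive substring check, so "java" also matches "javascript".
    public static List<String> filter(List<String> results,
                                      List<String> required,
                                      List<String> excluded) {
        return results.stream()
            .filter(result -> {
                String text = result.toLowerCase(Locale.ROOT);
                boolean hasAllRequired = required.stream()
                    .allMatch(w -> text.contains(w.toLowerCase(Locale.ROOT)));
                boolean hasNoExcluded = excluded.stream()
                    .noneMatch(w -> text.contains(w.toLowerCase(Locale.ROOT)));
                return hasAllRequired && hasNoExcluded;
            })
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> results = List.of(
            "Java crawler tutorial", "Python crawler", "Java coffee beans");
        System.out.println(filter(results, List.of("java"), List.of("coffee")));
    }
}
```

Empty `required` and `excluded` lists leave the results unchanged, since `allMatch` and `noneMatch` both return true on an empty stream.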
Roosster.org is a personal "on-demand" search engine. This means it indexes only the items/entries/files/URLs you explicitly tell it to index and provides full-text search over the indexed items.