Project moved to GitHub!
https://github.com/carrot2/carrot2
Carrot2 is an Open Source Search Results Clustering Engine. It can automatically organize small collections of documents, e.g. search results, into thematic categories. Carrot2 integrates very well with both Open Source and proprietary search engines.
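Carrot2 ships its own clustering algorithms (such as Lingo and STC); as a rough illustration of the idea of organizing search results into thematic categories, here is a toy grouper that buckets snippets by their first non-stopword keyword. This is not Carrot2's API, only a sketch of thematic grouping.

```java
import java.util.*;

// Toy illustration of search-result clustering: group snippets by their first
// non-stopword keyword. NOT Carrot2's API; just sketches thematic grouping.
public class KeywordClusterer {
    private static final Set<String> STOP = Set.of("the", "a", "of", "and", "for", "in");

    // Assign each snippet to the cluster named after its first non-stopword.
    public static Map<String, List<String>> cluster(List<String> snippets) {
        Map<String, List<String>> clusters = new TreeMap<>();
        for (String s : snippets) {
            String key = Arrays.stream(s.toLowerCase().split("\\W+"))
                    .filter(w -> !w.isEmpty() && !STOP.contains(w))
                    .findFirst().orElse("misc");
            clusters.computeIfAbsent(key, k -> new ArrayList<>()).add(s);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String> results = List.of(
                "java web crawler tutorial",
                "java swing examples",
                "python scripting guide");
        cluster(results).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Real algorithms like Lingo cluster on label quality and term co-occurrence rather than a single keyword, but the input/output shape (documents in, labeled groups out) is the same.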
WebCollector is an open-source web crawler framework based on Java. It provides simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
GitHub:
https://github.com/CrawlScript/WebCollector
Demo:
https://github.com/CrawlScript/WebCollector/blob/master/YahooCrawler.java
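The "multi-threaded crawler in minutes" pattern that frameworks like WebCollector manage for you boils down to a shared visited set and a frontier queue worked by a thread pool. The sketch below runs that loop over an in-memory link graph standing in for HTTP fetching; it is not WebCollector's API, just the underlying pattern.

```java
import java.util.*;
import java.util.concurrent.*;

// Minimal multi-threaded crawl loop over an in-memory link graph. The graph
// stands in for HTTP fetching; this is NOT WebCollector's API, only the
// frontier/visited pattern such frameworks manage for you.
public class MiniCrawler {
    public static Set<String> crawl(Map<String, List<String>> graph, String seed, int threads)
            throws InterruptedException {
        Set<String> visited = ConcurrentHashMap.newKeySet();
        BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
        visited.add(seed);
        frontier.add(seed);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                String url;
                try {
                    // Poll with a timeout so workers exit once the frontier stays empty.
                    while ((url = frontier.poll(200, TimeUnit.MILLISECONDS)) != null) {
                        for (String out : graph.getOrDefault(url, List.of())) {
                            if (visited.add(out)) {  // true only on first discovery
                                frontier.add(out);
                            }
                        }
                    }
                } catch (InterruptedException ignored) { }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return visited;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, List<String>> web = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c", "d"));
        System.out.println(crawl(web, "a", 4));  // visits a, b, c, d
    }
}
```

A production framework adds politeness delays, URL normalization, and persistence on top of this loop.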
NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces.
Simple Java application for downloading high-quality pictures from Zerochan.net.
You can find images by size or a tag. It's simple. And flat.
All you need to do: download the .jar file and run it with the Oracle JVM
(or any other JVM that supports image decoding).
Auto Rescanning - Search Terms - Regularly Updated With New Features
...Check the features section and be sure to let me know if you want a feature added.
Coming Soon:
- Wiki, explaining in depth how to use it more quickly (although it's already pretty simple to use)
- Ability to download the whole thread, not just images
- Better multithreading
- Ability to use proxies
- Sort images downloaded from searches into folders
- Keep original image names
- More responsive GUI
Be sure to let me know if you want any other features.
SSWAP (Simple Semantic Web Architecture and Protocol; pronounced "swap") is an architecture, protocol, and platform that uses reasoning to semantically integrate disparate data and services on the web. Running live at http://sswap.info.
Oneline provides a simple, high-performance platform for EC2, JDBC, Hadoop, S3, Solr, Flex, XSLT, J2EE, Windows Mobile SMS, blogs, Yahoo Email, Google Email, and many more platform components.
GHIRL is the Graph-based Heterogeneous Information Representation Language: a Java library for representing, querying, and navigating graph- or network-based data structures.
JavaPub is a one-click-install BibTeX publications portal based on a simple Java codebase. It features a drag-and-drop uploader module to upload BibTeX files and a module that generates the HTML index and entry pages for publication listings.
Web-as-corpus tools in Java.
* Simple Crawler (and also integration with Nutch and Heritrix)
* HTML cleaner to remove boilerplate
* Language recognition
* Corpus builder
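A common approach to the boilerplate-removal step listed above is a text-density heuristic: keep lines whose ratio of visible text to markup is high and drop tag-heavy navigation lines. The sketch below is a toy stand-in under that assumption, not the project's actual cleaner.

```java
import java.util.*;

// Sketch of the text-density heuristic often used for boilerplate removal:
// keep lines whose ratio of text to markup is high, drop tag-heavy lines.
// A toy stand-in for a real HTML cleaner, not this project's code.
public class BoilerplateFilter {
    // Fraction of a line's characters that remain after stripping tags.
    static double textDensity(String line) {
        String text = line.replaceAll("<[^>]*>", "").trim();
        return line.isEmpty() ? 0 : (double) text.length() / line.length();
    }

    // Return the tag-stripped text of lines whose density meets the threshold.
    public static List<String> clean(List<String> htmlLines, double threshold) {
        List<String> kept = new ArrayList<>();
        for (String line : htmlLines) {
            if (textDensity(line) >= threshold) {
                kept.add(line.replaceAll("<[^>]*>", "").trim());
            }
        }
        return kept;
    }
}
```

Navigation menus score low (mostly tags and short anchors) while article paragraphs score high, which is why this simple ratio works surprisingly well for web-as-corpus cleaning.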
Other spiders have a limited link depth, follow links in a fixed (non-randomized) order, or are combined with heavyweight indexing machines. This spider has no link-depth limit and randomizes which URL is checked next for new URLs.
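The randomized-frontier idea described above can be sketched as a frontier that hands back a uniformly random discovered-but-unvisited URL instead of following FIFO (breadth-first) order. This is an illustrative data structure under that assumption, not this project's implementation.

```java
import java.util.*;

// Sketch of a randomized crawl frontier: the next URL to expand is drawn
// uniformly at random from the discovered-but-unvisited set, instead of
// FIFO order or a depth limit. Illustrative only.
public class RandomFrontier {
    private final List<String> frontier = new ArrayList<>();
    private final Set<String> seen = new HashSet<>();
    private final Random rng;

    public RandomFrontier(long seed) { this.rng = new Random(seed); }

    public void offer(String url) {
        if (seen.add(url)) frontier.add(url);  // ignore URLs already seen
    }

    // Swap-remove a random element: O(1), and deliberately order-free.
    public String next() {
        if (frontier.isEmpty()) return null;
        int i = rng.nextInt(frontier.size());
        Collections.swap(frontier, i, frontier.size() - 1);
        return frontier.remove(frontier.size() - 1);
    }

    public boolean isEmpty() { return frontier.isEmpty(); }
}
```

Because selection is random rather than depth-ordered, no depth limit is needed: every discovered URL eventually has the same chance of being expanded.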
Java/Swish-e bridge. This application is built around a simple API and a Web container to provide access to the search facility (via web services) and management/indexing (web app).
Retriever is a simple crawler packed as a Java library that allows developers to collect and manipulate documents reachable by a variety of protocols (e.g. http, smb). You'll easily crawl documents shared in a LAN, on the Web, and many other sources.
Simple Porn Downloader is a tiny all Java based application that uses a list of keywords and starting urls to crawl webpages and branch out searching for specific media extensions which are downloaded and presented in an html page.
Command line application written in Java useful for automation of downloading process and filtering contents of downloaded files. jDownloader uses simple script file to configure downloading and filtering processes.
JxtASK is a P2P system that is aimed to search, download, and share academic content hosted on websites that join the JxtASK community. Joining is simple: site admins must generate (even automatically) an XML catalog which describes the files.
SearchSite is intended to support out-of-the-box search for small to medium websites, bridging the gap between simple PHP/Perl scripts at one extreme and something like Nutch, which is intended to deal with millions of pages, at the other.
Krakatoa is a search engine for your desktop with simple and advanced search capabilities. It will search on any keyword, exact phrase, or file, within a domain or site. Fast search-engine switching for better results.
Lude is an XML-RPC Lucene Daemon written in Java. Clients in any environment can create indexes, add/update/delete documents, and query the index through a simple XML-RPC API.
The goal of this project is to develop a fast, simple, robust and fully JCR (JSR-170) compliant Content Repository on top of a number of RDBMS.
A dual-licensed CMS, Mosaïka-CMS, will be developed on top of this repository by Logyka Technologies.