html source extractor free download

Showing 35 open source projects for "html source extractor"

1

WebHarvest - web data extraction tool

Web data extraction (web data mining, web scraping) tool. It leverages well-proven XML and text processing technologies to easily extract useful data from arbitrary web pages.

14 Reviews

Downloads: 4 This Week

Last Update: 2025-10-27
See Project
2

OpenSearchServer Search Engine

An open source search engine with RESTful API and crawlers

OpenSearchServer is a powerful, enterprise-class search engine program. Using the web user interface, the crawlers (web, file, database, etc.) and the client libraries (REST API, Ruby, Rails, Node.js, PHP, Perl), you can quickly and easily integrate advanced full-text search capabilities into your application: full-text search with basic semantics, join queries, boolean queries, facets and filters, document (PDF, Office, etc.) indexing, web scraping, etc. OpenSearchServer runs on...

31 Reviews

Downloads: 4 This Week

Last Update: 2018-08-26
See Project
3

cpDetector

cpDetector is a proxy for codepage detection of documents. It delegates to multiple instances that try to detect the codepage by different techniques. A command-line executable is shipped that sorts documents by codepage.

Downloads: 3 This Week

Last Update: 2018-04-05
See Project
4

JuniCoder

JuniCoder is a Java project that uses Unicode as the basis for decoding and encoding formats that invented workarounds to express characters not covered by ASCII. Decoders translate those workarounds to Unicode; encoders translate Unicode back into them.

Downloads: 0 This Week

Last Update: 2018-03-25
See Project
5

MangaStream Downloader

The MangaStream Downloader is an open source application written in Java for managing and downloading manga from the sites mangastream.com and mangafox.me. It is released under the GNU GPL license and uses an open source HTML parser, TagSoup. Follow the project page on Facebook for updates: https://www.facebook.com/MangastreamDownloader

3 Reviews

Downloads: 1 This Week

Last Update: 2017-12-08
See Project
6

eXtensible Text Framework (XTF)

Framework for search and display of heterogeneous document collections.

...Please visit https://github.com/cdlib/xtf for the latest updates. Obsolete Description: The eXtensible Text Framework (XTF) is an architecture that supports searching across collections of heterogeneous textual data (XML, PDF, HTML, text, and more), and the presentation of results and documents in a highly configurable manner. Includes highly customized versions of the proven open-source components Lucene and Saxon.

Downloads: 6 This Week

Last Update: 2019-07-29
See Project
7

CyberNeko HTML Parser

NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces.

17 Reviews

Downloads: 0 This Week

Last Update: 2015-04-17
See Project
8

Geoportal Server

Geoportal Server is a standards-based, open source product that enables discovery and use of geospatial resources including data and services.

18 Reviews

Downloads: 5 This Week

Last Update: 2025-02-13
See Project
9

regain

Regain is a Java search engine based on Jakarta Lucene. It indexes and searches files in many formats (HTML, XML, doc(x), xls(x), ppt(x), oo, PDF, RTF, mp3, mp4, Java). A tag library eases integrating search results into your JSP-based web page.

13 Reviews

Downloads: 10 This Week

Last Update: 2014-07-30
See Project
10

TestEl

TestEl is a Java-based learning analyzer for HTML (and possibly other) structured documents. It can be trained to detect structures in such documents and renders hits in XML.

1 Review

Downloads: 0 This Week

Last Update: 2014-06-09
See Project
11

Repoutil

A simple repository utility that generates an HTML page listing the contents of a directory.

Downloads: 0 This Week

Last Update: 2015-12-26
See Project
12

JavaPub

JavaPub is a one-click-install BibTeX publications portal based on a simple Java codebase. It features a drag-and-drop uploader module for BibTeX files and a module that generates the HTML index and entry pages for publication listings.

Downloads: 0 This Week

Last Update: 2013-04-11
See Project
13

JavaWAC

Web-as-corpus tools in Java:
* Simple crawler (and also integration with Nutch and Heritrix)
* HTML cleaner to remove boilerplate code
* Language recognition
* Corpus builder

Downloads: 0 This Week

Last Update: 2013-04-19
See Project
14

indexed webstats - solr

Webstats Solr is an attempt to make Apache access logs easier to data-mine. By adding a powerful search engine (Solr) as a backend and using JavaScript and HTML, and maybe PHP, I hope to make AWStats outdated.

Downloads: 0 This Week

Last Update: 2013-04-24
See Project
15

EasyGIS

EasyGIS simplifies GIS data management, sharing, and publishing. REST interfaces (JSON, HTML views). Lucene-based full-text searches. Thematic maps, business cartography. Integration with external GIS data providers: Google, OSM.

Downloads: 0 This Week

Last Update: 2013-04-17
See Project
16

HttpFinder

HttpFinder is a web content searching tool. It looks for text content that matches a given regular expression in HTML pages, scripts, etc. Navigation is driven by a second regular expression that describes which links to visit.

Downloads: 0 This Week

Last Update: 2015-12-01
See Project
17

JeCARS

JeCARS (Java Extendable Contents And Rights System) is a RESTful web service that delivers pluggable output formats, e.g. Atom feeds or HTML. Third-party applications can be plugged in. A JCR (JSR-170) repository (Jackrabbit) is used for storage.

Downloads: 1 This Week

Last Update: 2013-04-09
See Project
18

RSS EXTRACTOR

RSS EXTRACTOR is a Java library for generating RSS newsfeeds from the RSS web feeds of multiple websites. It extracts the best newsfeed entries and produces an RSS file that merges entries from several websites.

Downloads: 0 This Week

Last Update: 2013-04-05
See Project
19

Keyword Extractor

This project is aimed at extracting keywords from documents, either as files or on the Internet. It applies a sophisticated keyword ranking algorithm to extract the most relevant keywords for a document and is also capable of finding similar documents in

Downloads: 0 This Week

Last Update: 2013-03-21
See Project
20

Simple Porn Downloader

Simple Porn Downloader is a tiny, all-Java application that uses a list of keywords and starting URLs to crawl web pages and branch out, searching for specific media extensions, which are downloaded and presented in an HTML page.

Downloads: 5 This Week

Last Update: 2014-07-25
See Project
21

WebNews Crawler

WebNews Crawler is a specialized web crawler (spider, fetcher) designed to acquire and clean news articles from RSS and HTML pages. It can do a site-specific extraction to extract only the actual news content, filtering out the advertising and other cruft.

Downloads: 0 This Week

Last Update: 2013-04-23
See Project
22

RDF AutoPilot

Generates RDF and RDFS ontology documents automatically from HTML pages once given a set of rules.

Downloads: 0 This Week

Last Update: 2016-08-07
See Project
23

JaWiki

JaWiki is a Java wiki with a file-based database to manage the content. The content is stored in XML files in the file system. An HTML frontend allows users to edit the content via a browser. A standalone server is also included.

Downloads: 0 This Week

Last Update: 2015-08-06
See Project
24

JLinkCheck

JLinkCheck is an Ant task written in Java for checking links in websites. It does not just check a single page, but crawls a whole site like a spider, generating a report in XML and (X)HTML. JReptator will be its successor, with many more features.

Downloads: 0 This Week

Last Update: 2016-04-26
See Project
25

webnavigator

The Navigator project aims to support automated gathering of dynamic information from third-party web sites, using their web interface to post queries and gather replies. Navigator is written in OS-independent Java.

Downloads: 1 This Week

Last Update: 2013-03-21
See Project
