Crawl websites, sync to vector databases, and power RAG applications. Pre-built integrations for LLM pipelines and AI assistants.
Build data pipelines that feed your AI models and agents without managing infrastructure. Crawl any website, transform content, and push directly to your preferred vector store. Use 10,000+ tools for RAG applications, AI assistants, and real-time knowledge bases. Monitor site changes, trigger workflows on new data, and keep your AIs fed with fresh, structured information. Cloud-native, API-first, and free to start until you need to scale.
Try for free
Atera all-in-one platform IT management software with AI agents
Ideal for internal IT departments or managed service providers (MSPs)
Atera’s AI agents don’t just assist, they act. From detection to resolution, they handle incidents and requests instantly, taking your IT management from automated to autonomous.
DPRK pull is a script that pulls the English language North Korean news articles from the KCNA website and puts them into one file for reading by a Text to Speech program.
Picks up text from a web page using a html template.
A java html picker - text extractor
Picks up text from a web page using a html template. Useful if you have regularly data to extract from the same site. You may use the same url or you may build urls having parameters. These parameters are fetch from a text file.
Wiko, the wiki compiler, compiles wiki like files into html and LaTeX, combining easy wiki syntax, your preferred non-web text editor and svn/cvs control to write static webs, cientific articles or even blogs.
Multi-connection command line tool to download Internet sites. Similar to wget and cURL, but it manages up to 50 parallel links. Main features are: recursive fetching, Metalink retrieving, segmented download and image filtering by width and height.
Html Assembler is a static site generator. It automatically integrates page content such as text and photos in a modifiable page template creating a complete set of html files ready for upload to your site.
HXPath is a command line tool useful to extract data from HTML documents. HXPath can select sub trees, like the standard xpath tool, but is also able to read contents and attributes and output them in a bash friendly format. HTML Tidy and HTTP/HTTPS get are built in too.
VNC for use with the BrowserMob Selenium JavaScript Validator. This tool is made available for users of BrowserMob FREE Website Monitoring and Load Testing. The BrowserMob Local Validation Service can be downloaded from https://browsermob.com/tools.
The aw script is written so that you can browse web sites
through the command line by specifying where to look at in a
concise manner. It can also be used to make an excerpt of
web sites.
OrangeHRM provides a world-class HRIS experience and offers everything you and your team need to be that HR hero you know that you are.
Give your HR team the tools they need to streamline administrative tasks, support employees, and make informed decisions with the OrangeHRM free and open source HR software.
APHID is an easy-to-install, easy-to-use DocBook environment. APHID transforms source documents (text or XML) into multiple output formats (HTML, PDF, HTML Help, etc.). APHID is a derivative work of eDE (http://www.e-novative.de).
HTTP functional and non-functional (load and performance) toolkit based on jython/grinder (http://grinder.sf.net) ...includes capabilities to support: SOA services, REST, json/xml encoding, AES and WS security ... and a stub to collect requests
No Latte is an interpreter for a variation of the Latte language (cf. http://www.latte.org/) for writing XHTML documents in a functional-programming style---LaTeX sensibilities with LISP semantics.
SYMPLiK RANGEHOOD is a Javadoc-like tool for Oracle database. This pure-Java program "sucks up" data dictionary and object source code from database and generate document for Tables, Views, Triggers, Packages, Procedures, Functions, and others.
reminder455 — cli remind application. it shows table, which displays how many days left to fixed events. Output to console or xhtml. run under linux or windows.
now here: https://github.com/plastex/plastex
plasTeX is a Python-based LaTeX document processing framework. It gives DOM-like access to a LaTeX document, as well as the ability to generate mulitple output formats (e.g. HTML, DocBook, tBook, etc.).
Vilistextum is a small and fast HTML to text converter. It is fault-tolerant and deals well with badly-formed or otherwise quirky HTML. It has full support for different character sets (e.g. Unicode).
Program converts HTML pages into LaTeX format. Own mappings between HTML tags and character entities can be defined. CSS formatting properties are also supported (including colours). Implemented in Java.
HTMLtools includes several Java HTML tools for preparing Web pages. The HTMLtools program automates batch conversion of tab-delimited spreadsheet text files to HTML Web-page files, file & table editing, keyword mapping, templates, and more.
Escm is a text file processor that expands embedded Scheme expressions in the input file. Unlike other text-embedded language processors, escm does not depend on a particular Scheme implementation---you can use your favorite Scheme.
An all-in-one authentication with mysql as backend. Features: - Howto/Document - user info - libnss-mysql - pam-mysql - usersql - pdbsql (samba) - radius-mysql - mail