Join/Login
Business Software
Open Source Software
For Vendors
Blog
About
More

For Vendors Help Create Join Login

Business Software

Open Source Software

SourceForge Podcast

Resources

Articles
Case Studies
Blog

Menu

Help
Create
Join
Login

Home
Open Source Software
Artificial Intelligence Software
Search Results

Search Results for "html source extractor"

x

Sort By:

Relevance

Clear All Filters

OS

Linux 34
Windows 32
Mac 31
More...
BSD 8
ChromeOS 8
Mobile Operating Systems 2

Category

Artificial Intelligence 38
Software Development 3
Business 2
Database 2
Formats and Protocols 2
Multimedia 2
Internet 1

License

OSI-Approved Open Source 37

Translations

German 2
Chinese (Simplified) 1
Chinese (Traditional) 1
English 1
More...
Japanese 1
Spanish 1

Programming Language

Python 38
JavaScript 3
TypeScript 2
Unix Shell 2
C 1
More...
Ruby 1

Status

Production/Stable 3
Alpha 1

Showing 38 open source projects for "html source extractor"

View related business solutions

Artificial Intelligence Python Clear Filters & Widen Search

MongoDB Atlas runs apps anywhere
Deploy in 115+ regions with the modern database for every enterprise.

MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.

Start Free
$300 in Free Credit Towards Top Cloud Services
Build VMs, containers, AI, databases, storage—all in one place.

Start your project in minutes. After credits run out, 20+ products include free monthly usage. Only pay when you're ready to scale.

Get Started
1

Video-subtitle-extractor

A GUI tool for extracting hard-coded subtitle (hardsub) from videos

Video hard subtitle extraction, generate srt file. There is no need to apply for a third-party API, and text recognition can be implemented locally. A deep learning-based video subtitle extraction framework, including subtitle region detection and subtitle content extraction. A GUI tool for extracting hard-coded subtitles (hardsub) from videos and generating srt files. Use local OCR recognition, no need to set up and call any API, and do not need to access online OCR services such as Baidu...

1 Review

Downloads: 68 This Week

Last Update: 2026-04-05
See Project
2

BudouX

Standalone, small, language-neutral

Standalone. Small. Language-neutral. BudouX is the successor to Budou, the machine learning-powered line break organizer tool. It is standalone. It works with no dependency on third-party word segmenters such as Google cloud natural language API. It is small. It takes only around 15 KB including its machine learning model. It's reasonable to use it even on the client-side. It is language-neutral. You can train a model for any language by feeding a dataset to BudouX’s training...

Downloads: 5 This Week

Last Update: 2026-03-29
See Project
3

Text Generation Web UI

Oobabooga - The definitive Web UI for local AI, with powerful features

A gradio web UI for running Large Language Models like LLaMA, llama.cpp, GPT-J, Pythia, OPT, and GALACTICA. Dropdown menu for switching between models. Notebook mode that resembles OpenAI's playground. Chat mode for conversation and role playing. Instruct mode compatible with Alpaca and Open Assistant formats. Nice HTML output for GPT-4chan. Markdown output for GALACTICA, including LaTeX rendering. Custom chat characters. Advanced chat features (send images, get audio responses with TTS)....

Downloads: 78 This Week

Last Update: 2026-04-07
See Project
4

PasteMD

Paste Markdown and AI responses into Word Excel instantly fast

PasteMD is a lightweight desktop utility designed to streamline the process of transferring formatted content from the clipboard into office applications such as Word, WPS, and Excel. It primarily targets users who frequently copy content from AI chat tools or web pages and encounter formatting issues, especially with Markdown, tables, and LaTeX formulas. PasteMD operates from the system tray and monitors clipboard content, automatically converting Markdown or HTML into properly formatted...

Downloads: 12 This Week

Last Update: 2026-03-18
See Project
Gemini 3 and 200+ AI Models on One Platform
Access Google's best plus Claude, Llama, and Gemma. Fine-tune and deploy from one console.

Build generative AI apps with Vertex AI. Switch between models without switching platforms.

Start Free
5

Screenshot to Code

A neural network that transforms a design mock-up into static websites

Screenshot-to-code is a tool or prototype that attempts to convert UI screenshots (e.g., of mobile or web UIs) into code representations, likely generating layouts, HTML, CSS, or markup from image inputs. It is part of a research/proof-of-concept domain in UI automation and image-to-UI code generation. Mapping visual design to code constructs. Code/UI layout (HTML, CSS, or markup). Examples/demo scripts showing “image UI code”.

Downloads: 2 This Week

Last Update: 2025-09-26
See Project
6

Automatic text summarizer

Module for automatic summarization of text documents and HTML pages

Sumy is an automatic text summarization library that provides multiple algorithms for extracting key content from documents and articles. Simple library and command line utility for extracting summary from HTML pages or plain texts. The package also contains a simple evaluation framework for text summaries. Implemented summarization methods are described in the documentation. I also maintain a list of alternative implementations of the summarizers in various programming languages.

Downloads: 1 This Week

Last Update: 2026-02-14
See Project
7

Pandas Profiling

Create HTML profiling reports from pandas DataFrame objects

pandas-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is handy yet a little basic for exploratory data analysis. pandas-profiling extends pandas DataFrame with df.profile_report(), which automatically generates a standardized univariate and multivariate report for data understanding. High correlation warnings, based on different correlation metrics (Spearman, Pearson, Kendall, Cramér’s V, Phik). Most common categories (uppercase, lowercase,...

Downloads: 3 This Week

Last Update: 2026-01-13
See Project
8

BruteForceAI

Advanced LLM-powered brute-force tool combining AI intelligence

BruteForceAI is an open-source security testing tool that applies large language models to the analysis of login forms and authentication flows in web applications. At a high level, the project uses AI to inspect HTML content, identify the relevant form elements, and automate selector discovery so that a tester does not need to hand-map every field before evaluation.

Downloads: 6 This Week

Last Update: 2026-03-09
See Project
9

ScrapeGraphAI

Python scraper based on AI

Extracting content from websites and local documents using LLM. ScrapeGraphAI is a web scraping python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.). Just say which information you want to extract and the library will do it for you.

Downloads: 14 This Week

Last Update: 5 days ago
See Project
Go from Code to Production URL in Seconds
Cloud Run deploys apps in any language instantly. Scales to zero. Pay only when code runs.

Skip the Kubernetes configs. Cloud Run handles HTTPS, scaling, and infrastructure automatically. Two million requests free per month.

Try it free
10

MolmoWeb

Open multimodal web agent built by Ai2

...Unlike traditional automation tools that rely on structured HTML parsing or predefined APIs, MolmoWeb operates directly from screenshots of web pages, interpreting visual content in the same way a human user would. This approach allows it to generalize across different websites without requiring site-specific integrations, making it highly adaptable to diverse web environments.

Downloads: 2 This Week

Last Update: 4 days ago
See Project
11

DB-GPT

Revolutionizing Database Interactions with Private LLM Technology

DB-GPT is an experimental open-source project that uses localized GPT large models to interact with your data and environment. With this solution, you can be assured that there is no risk of data leakage, and your data is 100% private and secure.

Downloads: 13 This Week

Last Update: 2026-03-27
See Project
12

PyKEEN

A Python library for learning and evaluating knowledge graph embedding

PyKEEN (Python KnowlEdge EmbeddiNgs) is a Python package designed to train and evaluate knowledge graph embedding models (incorporating multi-modal information). PyKEEN is a Python package for reproducible, facile knowledge graph embeddings. PyKEEN has a function pykeen.env() that magically prints relevant version information about PyTorch, CUDA, and your operating system that can be used for debugging. If you’re in a Jupyter Notebook, it will be pretty-printed as an HTML table.

Downloads: 7 This Week

Last Update: 2025-04-24
See Project
13

Shapash

Explainability and Interpretability to Develop Reliable ML models

Shapash is a Python library dedicated to the interpretability of Data Science models. It provides several types of visualization that display explicit labels that everyone can understand. Data Scientists can more easily understand their models, share their results and easily document their projects in an HTML report. End users can understand the suggestion proposed by a model using a summary of the most influential criteria.

Downloads: 6 This Week

Last Update: 2026-01-30
See Project
14

Label Studio

Label Studio is a multi-type data labeling and annotation tool

...Configurable label formats let you customize the visual interface to meet your specific labeling needs. Support for multiple data types including images, audio, text, HTML, time-series, and video.

Downloads: 29 This Week

Last Update: 2026-03-13
See Project
15

Unstructured.IO

Open source libraries and APIs to build custom preprocessing pipelines

The unstructured library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of unstructured revolve around streamlining and optimizing the data processing workflow for LLMs. unstructured modular bricks and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and is efficient in transforming unstructured data into structured outputs.

Downloads: 5 This Week

Last Update: 13 hours ago
See Project
16

DocsGPT

Private AI platform for agents, enterprise search and RAG pipelines

DocsGPT is an open-source AI platform for deploying private RAG pipelines, AI agents, and enterprise search on your own infrastructure. Connect any data source (PDFs, DOCX, CSV, Excel, HTML, audio, GitHub, databases, URLs) and get accurate, hallucination-free answers with source citations. Choose your LLM: OpenAI, Anthropic, Google Gemini, or local models.

Downloads: 4 This Week

Last Update: 5 hours ago
See Project
17

Chandra

OCR model for complex documents with layout-aware structured outputs

Chandra is an advanced OCR model designed to extract and structure information from complex documents such as tables, forms, handwritten notes, and mathematical content. It focuses on preserving full document layout, meaning that extracted text is accompanied by positional metadata like bounding boxes for each element. Chandra supports multiple output formats including Markdown, HTML, and JSON, making it suitable for downstream processing and integration into data pipelines. It is capable of...

Downloads: 0 This Week

Last Update: 2026-03-18
See Project
18

LlamaParse

Parse files for optimal RAG

LlamaParse is a GenAI-native document parser that can parse complex document data for any downstream LLM use case (RAG, agents). Load in 160+ data sources and data formats, from unstructured, and semi-structured, to structured data (API's, PDFs, documents, SQL, etc.) Store and index your data for different use cases. Integrate with 40+ vector stores, document stores, graph stores, and SQL db providers.

Downloads: 4 This Week

Last Update: 2026-02-13
See Project
19

StyleTTS 2

Towards Human-Level Text-to-Speech through Style Diffusion

StyleTTS2 is a state-of-the-art text-to-speech system that aims for human-level naturalness by combining style diffusion, adversarial training, and large speech language models. It extends the original StyleTTS idea by introducing a style diffusion model that can sample rich, realistic speaking styles conditioned on reference speech, allowing highly expressive and diverse prosody. The architecture uses a two-stage training process and leverages an auxiliary speech language model to guide...

Downloads: 5 This Week

Last Update: 2025-11-28
See Project
20

MCP UI

SDK for building interactive UI components over MCP for AI tools

mcp-ui is a software development kit designed to bring interactive user interface capabilities to applications built on the Model Context Protocol (MCP). It enables developers to create rich, dynamic UI components that can be delivered from an MCP server and rendered seamlessly by a compatible client. Instead of returning only text responses, tools can provide structured UI resources such as HTML or remote-rendered components, allowing more engaging and functional interactions. mcp-ui...

Downloads: 0 This Week

Last Update: 2026-03-18
See Project
21

shuyuan

Reading book source

shuyuan is a project oriented around reading and knowledge consumption, especially targeting large-scale text content such as books, articles, or educational material. The name suggests “academy” or “study hall,” and the tool aims to help users ingest, organize, and manage reading content — possibly offering features like text parsing, annotation, metadata generation, translation, or storage for later reference. The repository is set up to support document ingestion, indexing, and maybe some...

Downloads: 0 This Week

Last Update: 2025-11-28
See Project
22

E2M

E2M converts various file types (doc, docx, epub, html, htm, url

E2M is a SourceForge mirror of the e2m open-source project, which focuses on providing tools or services designed to convert or process content between different formats or systems. Projects with similar naming conventions typically emphasize automation workflows where input data from one environment is transformed into another representation or output structure. The mirrored repository allows users to access the project’s codebase independently from its original hosting platform while...

Downloads: 0 This Week

Last Update: 2026-03-09
See Project
23

HunyuanOCR

OCR expert VLM powered by Hunyuan's native multimodal architecture

HunyuanOCR is an open-source, end-to-end OCR (optical character recognition) Vision-Language Model (VLM) developed by Tencent‑Hunyuan. It’s designed to unify the entire OCR pipeline, detection, recognition, layout parsing, information extraction, translation, and even subtitle or structured output generation, into a single model inference instead of a cascade of separate tools. Despite being fairly lightweight (about 1 billion parameters), it delivers state-of-the-art performance across a...

Downloads: 0 This Week

Last Update: 6 days ago
See Project
24

LangChain Extract

Did you say you like data?

LangChain Extract is an open-source reference application designed to demonstrate how large language models can be used to extract structured data from unstructured text and document files. The project implements a lightweight web service that allows developers to define extraction schemas and apply them to various sources such as plain text, HTML, or PDF documents.

Downloads: 1 This Week

Last Update: 2026-03-09
See Project
25

LexiFinder

AI-powered semantic indexing: automating the creation of book indexes

...Both interfaces share the same underlying engine and support the same features. Supported input formats are PDF, DOCX, and ODT. The index can be exported as plain text, JSON, CSV, or HTML.

Downloads: 4 This Week

Last Update: 2026-03-04
See Project

Previous
You're on page 1
2
Next

Related Searches

video to srt

video text ocr

video subtitle extractor

video-subtitle-extractor

subtitle

video-subtitle-extract

video subtitle offline

remove watermark from video

remove hardcoded subtitles

mkv subtitle extractor

Related Categories

Artificial Intelligence

Software Development

Business

Database

Formats and Protocols

SourceForge

Create a Project
Open Source Software
Business Software
Top Downloaded Projects

Company

About
Team
SourceForge Headquarters
1320 Columbia Street Suite 310
San Diego, CA 92101
+1 (858) 422-6466

Resources

Support
Site Documentation
Site Status
SourceForge Reviews

© 2026 Slashdot Media. All Rights Reserved.

Terms Privacy Opt Out Advertise