Name | Modified | Size | Downloads / Week |
---|---|---|---|
siteone-crawler-v1.0.9-win-x64.zip | 2025-06-08 | 88.9 MB | |
siteone-crawler-v1.0.9-linux-arm64.tar.gz | 2025-06-08 | 28.9 MB | |
siteone-crawler-v1.0.9-linux-x64.tar.gz | 2025-06-08 | 29.7 MB | |
siteone-crawler-v1.0.9-macos-arm64.tar.gz | 2025-06-08 | 27.3 MB | |
siteone-crawler-v1.0.9-macos-x64.tar.gz | 2025-06-08 | 28.2 MB | |
README.md | 2025-06-08 | 2.9 kB | |
v1.0.9 source code.tar.gz | 2025-06-08 | 38.5 MB | |
v1.0.9 source code.zip | 2025-06-08 | 38.5 MB | |
Totals: 8 Items | 280.0 MB | 1 |
This release introduces a powerful new Website to Markdown converter, allowing you to export entire websites into clean, single or multiple Markdown files, which is ideal for AI context or documentation purposes. We've also added the ability to start crawling directly from a `sitemap.xml` file and significantly enhanced the Offline Website Exporter with more granular control and better handling of international characters. Numerous new command-line options have been added for greater flexibility in crawling, filtering, and reporting, alongside many other improvements and bug fixes.
New Features
- Website to Markdown Converter: A major new feature to convert entire websites into clean Markdown files, replacing the previous dependency on `html2markdown`.
- Single-File Markdown Export: Use `--markdown-export-single-file` to combine all website content into a single, organized Markdown file, with smart removal of duplicate headers/footers.
- Crawl from Sitemap: You can now provide a URL to a `sitemap.xml` or sitemap index file directly to the `--url` parameter to crawl all listed URLs.
- Video Gallery in HTML Report: The HTML report now includes a gallery of all found videos, with lazy loading and an interactive player.
- Custom DNS Resolution: Added the `--resolve` option (like `curl`) to provide custom IP addresses for specific domains and ports.
- XPath and RegEx in Extra Columns: Enhance custom data extraction with support for XPath 1.0 and Regular Expressions in the `--extra-columns` option.
- Max Crawl Depth: Control the crawling scope with the new `--max-depth` parameter for limiting how deep the crawler goes (for pages, not assets).
- Customizable HTML Reports: Use `--html-report-options` to select which sections to include in the final HTML report.
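As a rough illustration of how several of the new options might be combined, here is a minimal Python sketch that only assembles a command line and prints it; nothing is executed. The executable name `./crawler`, the `--flag=value` form, the file name `site.md`, and the integer depth are assumptions made for this example. The `--resolve` value follows the `host:port:address` shape that `curl` uses, since the option is described as curl-like. The helper `build_crawl_command` is hypothetical and not part of the project.

```python
import shlex


def build_crawl_command(start_url, markdown_file=None, max_depth=None,
                        resolve=None, executable="./crawler"):
    """Assemble an argument list for the crawler; nothing is run here.

    The executable name and the value formats are assumptions for
    illustration. Check the project's own documentation for the exact syntax.
    """
    args = [executable, f"--url={start_url}"]
    if markdown_file is not None:
        # Assumed: the option takes the path of the combined Markdown file.
        args.append(f"--markdown-export-single-file={markdown_file}")
    if max_depth is not None:
        # Assumed: an integer depth limit (applies to pages, not assets).
        args.append(f"--max-depth={max_depth}")
    for host, port, ip in (resolve or []):
        # curl-style mapping: host:port:address
        args.append(f"--resolve={host}:{port}:{ip}")
    return args


# Start from a sitemap, limit depth, and pin the domain to a local address.
cmd = build_crawl_command(
    "https://example.com/sitemap.xml",
    markdown_file="site.md",
    max_depth=2,
    resolve=[("example.com", 443, "127.0.0.1")],
)
print(shlex.join(cmd))
```

The printed line can be reviewed before being pasted into a shell, or the list can be passed to `subprocess.run` once the value formats have been confirmed.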
Improvements
- Offline Website Exporter:
  - New `--offline-export-remove-unwanted-code` option to automatically strip analytics, cookie consents, and other non-essential scripts.
  - New `--offline-export-no-auto-redirect-html` flag to prevent the creation of meta-refresh redirect files.
  - Better handling of file paths with UTF-8 characters.
- URL Transformations: Added `--transform-url` to internally change request URLs, useful for crawling sites that serve content from a different domain (e.g., a local instance).
- Loop Protection: New `--max-non200-responses-per-basename` option to prevent getting stuck in loops with dynamically generated error pages.
- Timezone Support: Set a `--timezone` for all dates and times displayed in reports and used in exported filenames.
- Smarter Image Analysis: The WebP analysis will no longer report missing WebP images if more optimized AVIF alternatives are already present.
- LICENSE: Switched to MIT: The project license has been changed to the more permissive MIT license.
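To show how some of the options listed under Improvements might be layered onto an existing invocation, the following Python sketch appends them to a base argument list without running anything. The helper `add_crawl_safeguards` is hypothetical. The value formats are assumptions: a timezone name such as `Europe/Prague`, an integer limit for `--max-non200-responses-per-basename`, and a valueless form for `--offline-export-no-auto-redirect-html`. The `./crawler` executable name is likewise assumed.

```python
def add_crawl_safeguards(args, timezone=None, max_non200_per_basename=None,
                         no_auto_redirect_html=False):
    """Return a copy of args extended with optional settings.

    All value formats below are assumptions for illustration only.
    """
    extended = list(args)  # copy, so the caller's list is left untouched
    if timezone is not None:
        # Assumed: a timezone name used for report dates and filenames.
        extended.append(f"--timezone={timezone}")
    if max_non200_per_basename is not None:
        # Assumed: an integer cap that stops loops on generated error pages.
        extended.append(
            f"--max-non200-responses-per-basename={max_non200_per_basename}"
        )
    if no_auto_redirect_html:
        # Assumed: a valueless flag that skips meta-refresh redirect files.
        extended.append("--offline-export-no-auto-redirect-html")
    return extended


base = ["./crawler", "--url=https://example.com/"]  # assumed executable name
cmd = add_crawl_safeguards(
    base,
    timezone="Europe/Prague",
    max_non200_per_basename=5,
    no_auto_redirect_html=True,
)
print(" ".join(cmd))
```

Returning a new list keeps the base command reusable, so the same starting point can be extended differently for separate crawls.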