AH CRAWLER

Description

AhCrawler is a search engine and an analysis tool for your website.

It is free, open source software released under the GNU General Public License (GNU GPL) version 3.

👤 Author: Axel Hahn
🧾 Source: https://github.com/axelhahn/ahcrawler/
📜 License: GNU GPL 3.0
📗 Docs: see https://www.axel-hahn.de/docs/ahcrawler/

⚠️ Important notice:
In version v0.156 the file structure was changed.
--> See "Upgrade to v0.156" in the docs.


It is written in PHP and consists of
- crawler (spider) and indexer
- search for your website
- website analyzer with
  - ssl certificate check
  - saved cookies
  - http response header check
  - linkchecker (http status check of all links, css, images, ...)

It runs on PHP 7.3 and higher (up to PHP 8.1).
It uses PDO to store the indexed data. So far SQLite and MySQL have been tested.

This is not a version 1.x yet ... let me do some more work :-)

Screenshot

Screenshot: backend

Installation

see the docs https://www.axel-hahn.de/docs/ahcrawler/get_started.htm

Features

  • Free software and Open Source.
  • You can install it on your own server.
  • All data stays under your control.
  • You have full control over the age of the checked content: after fixing errors, rerun the indexer and get fresh results immediately.
  • Multi-language support (backend and frontend)
  • Built-in web updater

Spider

  • respects exclude rules in
      • robots.txt
      • X-Robots-Tag http header
      • meta robots values noindex, nofollow
      • rel=nofollow in links
  • additional include and exclude rules with regex
  • multiple simultaneous requests
  • rebuild the full index or update a single url (e.g. triggered by a cms)
  • uses HTTP/2 (if possible)
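
The exclude logic listed above can be sketched in a few lines. This is only an illustration in Python, not AhCrawler's own code (which is written in PHP): the function names, the `example-bot` user agent, the sample robots.txt and the regex rules are all assumptions made for the example. It covers robots.txt plus the additional include and exclude regex rules; the header, meta and rel=nofollow checks are left out.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robots(text):
    """Parse robots.txt content that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def may_crawl(url, robots, include, exclude, agent="example-bot"):
    """Return True if the url passes robots.txt and the regex rules."""
    # 1. robots.txt must allow the url for our (hypothetical) user agent.
    if not robots.can_fetch(agent, url):
        return False
    # 2. If include rules exist, at least one of them must match.
    if include and not any(re.search(p, url) for p in include):
        return False
    # 3. No exclude rule may match.
    return not any(re.search(p, url) for p in exclude)
```

With `robots = build_robots(ROBOTS_TXT)`, a call such as `may_crawl(url, robots, [r"^https://example\.com/"], [r"\.pdf$"])` rejects urls below `/private/`, urls on other hosts and urls ending in `.pdf`.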

Search for your website

  • search with OR or AND
  • search in a given language (requires the lang attribute in your html tags)
  • search in a given subfolder only
  • several methods for predefined forms or for a fully customized form
  • stores users' search terms for statistics
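
To illustrate what AND and OR matching with a subfolder filter mean, here is a minimal sketch in Python. It is not how AhCrawler queries its database: the `search` function, its parameters and the sample `PAGES` data are hypothetical, and plain substring matching stands in for a real index.

```python
def search(pages, query, mode="AND", subdir=""):
    """Return the sorted urls whose text matches the query terms.

    pages  -- dict mapping url path to page text (hypothetical sample data)
    mode   -- "AND": every term must occur, "OR": at least one term
    subdir -- only urls starting with this path prefix are searched
    """
    terms = query.lower().split()
    if not terms:
        return []
    combine = all if mode.upper() == "AND" else any
    return sorted(
        url
        for url, text in pages.items()
        if url.startswith(subdir) and combine(t in text.lower() for t in terms)
    )


PAGES = {
    "/docs/install.htm": "Install the crawler with PHP and sqlite",
    "/docs/search.htm": "Search form for the crawler index",
    "/blog/news.htm": "News about PHP releases",
}
```

With this sample data, `search(PAGES, "crawler php", "AND")` returns only `/docs/install.htm`, while the same query with `"OR"` returns all three pages.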

Website analyzer

  • check of http response headers for
      • unknown headers
      • unwanted headers
      • security headers
  • check ssl certificate (if your website uses https)
  • show cookies stored during crawling and following links
  • show website errors and warnings based on http status code (a.k.a. linkchecker)
    for all links, images, css, javascripts, media, ... including hints on what to do for each status code
  • for a given url: show where it is used and where it links to,
    displaying redirects (30x status in the response header) as a cascade
  • overview of all website items (pages, js, css, media) with filters by
      • http status code
      • mime type
      • place (internal or external item)
  • multiple website support within a single installation
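
The three header checks named above (unknown, unwanted and security headers) can be pictured as simple set operations. The Python sketch below is an assumption about the general idea, not AhCrawler's implementation: the header lists are deliberately short examples, and `check_headers` is a hypothetical name.

```python
# Example lists only; a real checker would know many more headers.
SECURITY = {
    "strict-transport-security",
    "content-security-policy",
    "x-content-type-options",
}
UNWANTED = {"server", "x-powered-by"}
KNOWN = SECURITY | UNWANTED | {
    "content-type",
    "content-length",
    "date",
    "cache-control",
}


def check_headers(headers):
    """Classify the response header names of a single http response."""
    names = {name.lower() for name in headers}
    return {
        "unknown": sorted(names - KNOWN),
        "unwanted": sorted(names & UNWANTED),
        "missing_security": sorted(SECURITY - names),
    }
```

For a response that sends `X-Powered-By` and a custom `X-Foo` header but only `Strict-Transport-Security` from the security list, the result reports `x-foo` as unknown, `x-powered-by` as unwanted and the two other security headers as missing.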