NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces.
Spondulas is a browser emulator and parser designed to retrieve web pages for hunting malware. It supports generation of browser user agents, GET/POST requests, and SOCKS5 proxies. It can also parse HTML files sent via e-mail. Monitor mode checks a website at intervals to discover changes in DNS or content over time. Autolog mode creates an investigation file that documents redirection chains. Retrieved web pages are parsed for links, which are reported to an output file. More information is available on the wiki.
Checks given webpages for backlinks and scans for image links and keywords.
Uses DOM-based methods to scan for backlinks, which are more robust than simple text scanning (for example, they ignore commented-out source code).
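The difference between structural and plain-text scanning can be illustrated with a minimal Python sketch. It is not the tool's own code: it uses the standard library's event-based `html.parser` as a stand-in for a full DOM, and the names `BacklinkFinder`, `find_backlinks`, and the `example.org` host are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class BacklinkFinder(HTMLParser):
    """Collects href values of <a> tags that point at a target host."""

    def __init__(self, target_host):
        super().__init__()
        self.target_host = target_host.lower()
        self.backlinks = []

    def handle_starttag(self, tag, attrs):
        # Only real anchor elements reach this method; markup inside
        # HTML comments is delivered to handle_comment() instead.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and urlparse(href).hostname == self.target_host:
            self.backlinks.append(href)


def find_backlinks(html, target_host):
    """Return all links in the parsed markup that point at target_host."""
    finder = BacklinkFinder(target_host)
    finder.feed(html)
    finder.close()
    return finder.backlinks


page = """
<p><a href="http://example.org/page">live link</a></p>
<!-- <a href="http://example.org/old">commented-out link</a> -->
"""

# A plain text search sees the host twice; the parser reports one live link.
print(page.count("example.org"))
print(find_backlinks(page, "example.org"))
```

A naive substring search would count the commented-out link as a backlink, whereas the parser-based scan reports only the anchor that a browser would actually render.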
AWStats Enterprise Manager is a tool for managing AWStats configuration creation and log file processing in a multi-server environment. The script is designed to pull the web server logs from every server and parse them with AWStats.
This is a self-contained Perl program that parses common web log files (it understands the various fields and can search over them) to help webmasters, or anyone else, go through their logs and pull out various bits of information.
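As a rough illustration of field-aware log searching, here is a short Python sketch. It assumes the Apache Common Log Format; the program's actual supported formats and options are not described here, and the names `CLF`, `parse_line`, and `search` are hypothetical.

```python
import re

# Common Log Format: host ident user [time] "request" status size
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Split one log line into named fields, or return None if it does not match."""
    match = CLF.match(line)
    return match.groupdict() if match else None


def search(lines, **criteria):
    """Yield parsed records whose fields equal every given criterion."""
    for line in lines:
        record = parse_line(line)
        if record and all(record.get(k) == v for k, v in criteria.items()):
            yield record


log = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326',
    '192.168.0.7 - - [10/Oct/2000:13:56:01 -0700] "GET /missing.html HTTP/1.0" 404 209',
]

# Find every request that ended in a 404.
for hit in search(log, status="404"):
    print(hit["host"], hit["request"])
```

Because each line is split into named fields first, a query such as `status="404"` cannot accidentally match a response size or a URL that happens to contain the same digits.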
The SimpleRDF/XSL template simplifies RDF/XML sources as much as possible to allow easy processing. The SimpleRDF/PHP5 parser takes advantage of SimpleRDF/XSL and has an extremely simple API. You can parse any RDF/XML-compatible document (incl. RSS) and much more...
PScrape is a Perl module with functions to
<ol>
<li>parse text files for useful data by using regular expressions and</li>
<li>write the resulting data into a file as tab-separated values, useful for insertion into an SQL database.</li>
</ol>
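The two steps listed above can be sketched in Python as follows. This is an illustration of the idea only, not PScrape's Perl interface; the function name `scrape_to_tsv` and the sample data are hypothetical.

```python
import csv
import io
import re


def scrape_to_tsv(text, pattern, out):
    """Write the capture groups of every matching line to `out` as TSV.

    Returns the number of rows written.
    """
    regex = re.compile(pattern)
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    rows = 0
    for line in text.splitlines():
        match = regex.search(line)
        if match:
            # One capture group per output column.
            writer.writerow(match.groups())
            rows += 1
    return rows


sample = "name=alice age=30\nnoise\nname=bob age=25\n"
buf = io.StringIO()
written = scrape_to_tsv(sample, r"name=(\w+) age=(\d+)", buf)
print(written)
print(buf.getvalue(), end="")
```

Passing an open file instead of the `io.StringIO` buffer would produce a tab-separated file that most SQL databases can bulk-load.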
LogAnal is a quick hack to parse Apache Log Files and produce graphical and textual web server statistics.
Works in incremental mode only. Supports Templates for the output HTML, as well as localization (defaults to English).
This project provides a generic way to create a back office for managing news.
Based on an XML description of the tables, the Add, Modify, Delete... pages are generated dynamically. In the same way the package can parse XML files in the same way as