Methabot Web Crawler / News: Recent posts

Methanol/1.7.0 Released!

Methabot is a highly configurable and scriptable web crawler. Methanol is a distributed web crawling system built around Methabot. This project is aimed at providing advanced web and data crawling tools, both as a large-scale web crawling system and as command line tools.

We are proud to release the first version of Methabot featuring the Methanol web crawling system. This release features the two new server daemons mn-masterd and mn-slaved, along with the new client daemon mb-client. These three together form a Methanol system, so you can now build your own customized distributed web crawling system, with all the features and goodies of libmetha integrated!...

Posted by Emil Romanus 2009-06-23

Methabot/1.6.0.1 and lmm_mysql-1.0.0

Methabot is a scriptable multi-purpose FTP and web crawler with an extensible configuration system and speed-optimized architectural design.

lmm_mysql is a Methabot module providing JavaScript-MySQL bindings to parser functions.

We are happy to announce the release of Methabot/1.6.0.1 and lmm_mysql-1.0.0. This is the first release of lmm_mysql as a separate package, and Methabot now has a much better module interface....

Posted by Emil Romanus 2009-02-23

Methabot/1.6.0 Released!

Methabot is a scriptable multi-purpose FTP and web crawler with an extensible configuration system and speed-optimized architectural design.

Here comes another feature-packed release of Methabot. This release brings JavaScript-MySQL support, parser chaining, and better URL filtering with support for robots.txt.
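As a rough illustration of what robots.txt-based URL filtering involves, here is a minimal sketch using Python's standard `urllib.robotparser` module. It is not Methabot's own implementation; the `allowed_urls` helper, the "Methabot" user-agent string and the example.com URLs are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, user_agent, urls):
    """Return only the URLs that the given robots.txt text permits."""
    parser = RobotFileParser()
    # Parse the rules from a string instead of fetching them over HTTP.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical rules and URLs, for illustration only.
ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"
CANDIDATES = [
    "http://example.com/index.html",
    "http://example.com/private/report.html",
]

if __name__ == "__main__":
    for url in allowed_urls(ROBOTS_TXT, "Methabot", CANDIDATES):
        print(url)
```

A crawler would typically fetch and cache the robots.txt of each host once, then run every candidate URL for that host through a check like this before queueing it.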

Don't forget to check out the project website and wiki at http://bithack.se/projects/methabot/...

Posted by Emil Romanus 2009-02-21

Methabot/1.5.0 Released!

Methabot is a speed-optimized, scriptable and highly configurable web, FTP and local file system crawler.

Methabot/1.5.0 is now ready for release. This version contains many improvements, new features and bugfixes. Don't forget to check out the project's website!

Changes and new features:
* Support for reading initial buffer from stdin
* --type and --base-url command line options added, along with the initial_filetype option in configuration files
* Cookies and DNS info are now properly shared between workers when running multithreaded
* Added some example usage commands to --examples
* Big improvements to the inter-thread communication, now faster and more organized
* Added support for 'init' functions to scripts. Read more about init functions at http://bithack.se/projects/methabot/docs/e4x/init_functions.html
* libmetha no longer freezes when doing multiple concurrent HTTP HEAD requests. The freezes were caused by a bug in libcurl, which is now fixed. Some workarounds have been added to libmetha to prevent the freezes from occurring with the defective libcurl versions as well.
* Support for older libcurl versions 7.17.x and 7.16.x
* New information is available in the "this" object of JavaScript parsers: the content type and the transfer status code. Read more at http://bithack.se/projects/methabot/docs/e4x/this.html
* --verbose option replaced with --silent, since verbose mode is now default
* Initial support for FTP crawling and the ftp_dir_url crawler option
* Depth limiting is now crawler-specific
* Added the command line options --crawler and --filetype
* Support for extending and overriding already defined crawlers and filetypes
* Support for the copy keyword in configuration files
* Support for dynamically switching the active crawler. This lets you crawl different websites in completely different ways in one crawling session. Read more about crawler switching at http://bithack.se/projects/methabot/docs/crawler_switching.html
* libev version upgrade to 3.51
* The include directive in configuration files now makes sure the included configuration file hasn't already been loaded, to prevent include-loops and multiple filetype/crawler definitions.
* Various SpiderMonkey garbage collection fixes, libmetha does not crash anymore when cleaning up after a multithreaded session
* Added some extra information to the --info option
* The 'external' option is now fixed and enabled again
* New option --spread-workers
* New libmetha API function lmetha_global_setopt() allows changing the global error/message/warning reporter
* Added initial implementation of a test suite for developers
* Better error reporting when loading configuration files
* Bugfix when an HTTP server didn't return a Content-Type header after a HEAD request
* Bugfix when sorting URLs after multiple HTTP HEAD requests
* Bugfix in the HTML-to-XML converter when the HTML page did not have an <html> tag
* Bugfix, the extless-url option did not work
* Bugfix, the HTML-to-XML converter no longer chokes on byte-order marks or other text before the actual HTML
* Bugfix, prevented libmetha from trying to access URLs of protocols that are not supported
* Bugfix when shutting down after an error.
* Bugfix, unresolvable URLs did not break out of the retry loop after three retries
* Very experimental and unstable support for Win32, mainly intended for developers...
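
The include-loop protection mentioned in the list above can be sketched roughly as follows. This is only an illustration of the general technique (remembering which files have already been loaded), not libmetha's actual code; the `load_config` helper, the `include "file"` line syntax and the file names are hypothetical.

```python
import os


def load_config(path, read_file, loaded=None):
    """Return the lines of `path`, expanding each included file at most once.

    `read_file` is any callable that maps a path to the file's text.
    """
    if loaded is None:
        loaded = set()
    key = os.path.normpath(path)
    if key in loaded:
        # Already loaded: skip it to avoid include-loops and duplicate
        # definitions.
        return []
    loaded.add(key)

    lines = []
    for line in read_file(key).splitlines():
        stripped = line.strip()
        # Hypothetical directive syntax, assumed for this sketch only.
        if stripped.startswith("include "):
            target = stripped[len("include "):].strip().strip('"')
            target = os.path.join(os.path.dirname(key), target)
            lines.extend(load_config(target, read_file, loaded))
        else:
            lines.append(line)
    return lines
```

With two in-memory files that include each other, for example `{"a.conf": 'include "b.conf"\nx = 1', "b.conf": 'include "a.conf"\ny = 2'}`, calling `load_config("a.conf", files.__getitem__)` terminates and yields each non-include line exactly once.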

Posted by Emil Romanus 2009-01-15

Methabot/1.4.1 Released!

Time for a bugfix release. This release fixes lots of build-time errors related to SpiderMonkey on various systems.

Changes:
* Configure could not find jsapi.h on some systems; this should be fixed now.
* Configuration files are now able to modify crawler and filetype flags, added the options 'external' and 'external_peek'
* Bugfix, Methabot would sometimes crash when cleaning up empty URLs after multiple HTTP HEAD requests
* Fixed a crash that occurred when running synchronously.
* Build system include fix when jsconfig.h could not be found.

Posted by Emil Romanus 2009-01-02

Methabot/1.4.0 Released!

Methabot is a speed-optimized, highly configurable web, FTP and local file system crawler.

After a long time of hardcore programming, Methabot/1.4.0 is finally ready for release. You will need libcurl and SpiderMonkey installed on your system to be able to compile Methabot.

New features:
* Completely new architectural design
* Filetype parser scripting through JavaScript/E4X
* Multithreading is now a primary concept
* HTTP HEAD requests are now done asynchronously in a separate thread using curl and libev
* Support for "peeking" at external URLs
* The Methabot Project has been split up into several subprojects, chiefly the command line tool, which uses the web crawling library libmetha as its backend.
* Initial work on the distributed web crawling system Methanol....

Posted by Emil Romanus 2008-12-24

Methabot/1.4.0 Under Development!

A completely new version of Methabot is under development. This new version is an almost complete rewrite of Methabot with a cleaner and more organized modular design.

The version number is bumped from 0.3.1 to 1.4.0 mainly due to the following new features:
* The configuration system is replaced by a scripting language named mcl (Metha Configuration Language). This language concentrates on string manipulation, HTML code handling (enumeration, iteration, etc.) and, last but not least, crawler and child crawler scheduling and behaviour management.
* The module system is replaced by a much more extensible version, allowing for example to keep statistics or rely on data in external databases.
* Methabot is split into three parts:
.. * libmetha, a web crawling library
.. * mcl, a compiler and VM for mcl
.. * methabot, the command line tool and daemon...

Posted by Emil Romanus 2008-03-12

Where's Methabot/0.3.2?

You might have noticed that I didn't release version 0.3.2 this month as planned. This is because I am busy implementing a web crawling solution based on Methabot for a paying customer. I don't have time to run the latest version through each test required before release.

Methabot is of course still in heavy development, and its code base, available through subversion, is gaining many benefits from the challenges of implementing this specialized solution....

Posted by Emil Romanus 2007-10-26

Methabot/0.3.1 Released!

We're happy to announce the second release in the modular Methabot series -- Methabot/0.3.1!

New Features:
* Proxy server support
* Cookie handling support
* Option to reduce memory usage

Bugfixes:
* Compilation issues on FreeBSD
* Support for libcurl version <= 7.15.3
* Various tiny issues fixed

Posted by Emil Romanus 2007-09-19

Methabot/0.3.0 Released!

We're happy to announce the release of Methabot/0.3.0 -- Modular Methabot.

This release also features our first Methabot installer for Win32.

New Features:
* Module/Plugin system through shared libraries
* Documentation updates in the Wiki, info on modules
* Two example modules using C and C++ added to src/modules/
* Lots of new functionality for the UMEX system
* Basic implementation of URLRewrite
* Added more file extensions to default configuration files
* 'mode' option finally implemented
* 'thread-result' option added
* '$current_match' filetype option added
* 'num-pipelines' option added
* 'ordered-search' option added...

Posted by Emil Romanus 2007-08-17

Methabot/0.3.0 soon to come!

The next version of Methabot, modular Methabot, will be released within a month. This version contains big changes, so I decided to jump straight to the 0.3.0 version number.

New features coming with Methabot/0.3.0:
* File parser modularity through dynamic libraries (so, dll, dylib)
* URL Matching Expressions extended
* Many new configuration options and some new configuration file directives...

Posted by Emil Romanus 2007-08-04

Methabot/0.2.8 Released!

We're happy to announce the second public release of Methabot!

New features since Methabot/0.2.7:
* FTP Crawling support
* Widened plugin system and modularity
* Various new options (-C, --run)
* Botnet Fetching Callback
* Documentation updates with architectural design graphs and small details on plugin systems

Changes since Methabot/0.2.7:
* Botnet login is done before everything else if botnet was specified...

Posted by Emil Romanus 2007-07-17

Methabot/0.2.7 Released!

Methabot is a speed-optimized, highly configurable web and local file system crawler.

This is the first public release not marked as a development snapshot. After a year without releases, it's time for the public to start testing, speeding up the bug hunting and possibly the addition of new features.

Features available in this release:
* Complete, new and very capable configuration system
* Full HTTP crawling support
* Partial local file system support
* Methanol Web Crawling System support (login, upload and logout)
* Aggressive threading
* Automated downloading w/ progress bar
* Partial support for the new URL Matching Expression system (still undocumented)...

Posted by Emil Romanus 2007-06-30