WebScarab (mainly the spider, I guess).
We could combine:
TagSoup (or maybe HTMLParser) to generate SAX events
DOMHandler to build a DOM from the SAX events
Rhino to execute any script elements we encounter
with some kind of pushback InputStream to allow script
elements to write into the document so that any
additions would also be parsed by the SAX handler.
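The pushback idea can be sketched with the JDK's own java.io.PushbackInputStream: when a script element emits markup (e.g. via document.write), the bytes are "unread" onto the stream so the SAX parser consumes them as if they had been part of the original page. This is a minimal stand-alone sketch — the class and method names are illustrative, not WebScarab or TagSoup APIs, and the "</script>" check stands in for the real parser callback.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class PushbackSketch {

    // Read the page byte by byte; when we pass the end of a script
    // element, unread the script's output so it is parsed next.
    static String parseWithInjection(String html, String injected) {
        PushbackInputStream in = new PushbackInputStream(
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)), 1024);
        StringBuilder seen = new StringBuilder();
        try {
            int c;
            boolean pushed = false;
            while ((c = in.read()) != -1) {
                seen.append((char) c);
                // Stand-in for "the parser just finished a <script>
                // element and the script wrote some markup".
                if (!pushed && seen.toString().endsWith("</script>")) {
                    in.unread(injected.getBytes(StandardCharsets.UTF_8));
                    pushed = true;
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return seen.toString();
    }

    public static void main(String[] args) {
        System.out.println(parseWithInjection(
                "<p>a</p><script>/*...*/</script><p>b</p>",
                "<p>written</p>"));
    }
}
```

The injected markup lands between the script element and the rest of the page, which is exactly where document.write output belongs.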
It seems like the best way to do this is to process
pages asynchronously, and keep a "waiting list" when a
page has a dependency on some other URL, e.g. a child
frame or an external script.
Processing of the page would be suspended until those
pending URLs had been retrieved. This means that we
may have multiple pages being processed at one time.
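The waiting list could be something like the sketch below: each suspended page tracks its outstanding URLs, and when a URL arrives we resume any page whose dependency set is now empty. Pure JDK, with illustrative names (pages and URLs are just Strings here).

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class WaitingList {
    // page -> URLs it is still waiting for
    private final Map<String, Set<String>> pending = new HashMap<>();

    // Suspend a page until all of the given URLs have been fetched.
    public void suspend(String page, Collection<String> urls) {
        pending.put(page, new HashSet<>(urls));
    }

    // Called when a URL has been retrieved; returns the pages that
    // now have no outstanding dependencies and may resume.
    public List<String> retrieved(String url) {
        List<String> runnable = new ArrayList<>();
        Iterator<Map.Entry<String, Set<String>>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Set<String>> e = it.next();
            e.getValue().remove(url);
            if (e.getValue().isEmpty()) {
                runnable.add(e.getKey());
                it.remove();
            }
        }
        return runnable;
    }
}
```

Several pages can sit in the map at once, which is the "multiple pages being processed at one time" case above.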
Finally, once there are no more "pending resources",
fire any "on*" events and monitor for things that
provide a URL,
e.g. document.location.href, window.open(), etc.
We would probably have to provide a wrapper for
"document" that implements the non-standard things like
document.write().
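Independent of the Rhino host-object plumbing, the wrapper's job can be sketched as a plain stub: every property or call that can carry a URL is funnelled through one recording method, and document.write collects markup destined for the pushback stream. Method names mirror the DOM, but this is an illustrative stub, not Rhino's Scriptable machinery.

```java
import java.util.ArrayList;
import java.util.List;

public class DocumentWrapper {
    private final List<String> discovered = new ArrayList<>();
    private final StringBuilder written = new StringBuilder();

    // document.location.href = url
    public void setLocationHref(String url) { recordUrl(url); }

    // window.open(url, name)
    public void open(String url, String name) { recordUrl(url); }

    // document.write: in the real thing the markup would be unread
    // onto the parser's input stream; here we just collect it.
    public void write(String markup) { written.append(markup); }

    private void recordUrl(String url) { discovered.add(url); }

    public List<String> discoveredUrls() { return discovered; }
    public String writtenMarkup() { return written.toString(); }
}
```

Anything the script assigns or opens ends up in discoveredUrls(), ready to be queued by the spider.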
It's a big project, but could be fun!