Firstly I wanted to say thanks for this awesome project! It really saved me when doing my dissertation.
I'm using it in a program, similar to Dapper, that my supervisor wanted to use to extract information from HTML. This means I'm running it in large batch processes that each handle tens of thousands of documents stored on disk.
Performance is generally fine, but on some documents it pauses for a very long time (more than 30 seconds), and on inspection these documents don't seem overly large or complicated.
When I kill the process while it's paused, I get the following output: "Unsupported Node Type:117", which I think comes from the JavaContentSink.cpp file.
Basically I was wondering if there's anything I should watch out for when using the parser in large batch processes?
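For context, one workaround I could imagine is wrapping each parse in a timeout so a single slow document doesn't stall the whole batch. This is only a minimal sketch: `BatchTimeout` and `runWithTimeout` are hypothetical names of my own, not part of the parser, and it assumes the parse call can be wrapped in a `Runnable`. Since the parser runs native code, cancelling may not actually stop a stuck parse; it only lets the batch move on and record which document was slow.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of the parser's API.
final class BatchTimeout {

    /**
     * Runs one task (e.g. parsing a single document) on a worker thread.
     * Returns true if it finished within timeoutMillis, false if it
     * timed out or threw an exception.
     */
    static boolean runWithTimeout(Runnable task, long timeoutMillis)
            throws InterruptedException {
        // Daemon thread so a stuck parse cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });
        Future<?> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the thread; native code may ignore it.
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            // The task itself threw; treated as a failed document here.
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

In a batch loop this would be called once per file, logging the file name whenever it returns false so the slow documents can be collected and inspected afterwards.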