I need to parse several HTML pages in a sequential manner. The straightforward way to do this is to create a new Parser object per page. Will this be expensive in terms of memory consumption and performance? Another way is to use the setURL() method from a Parser object. Will this have much better performance? Are there cleaner ways to accomplish the task of parsing many HTML pages?
The overhead of creating a new Parser, as opposed to calling setURL() on an existing one, is the construction of the PrototypicalNodeFactory with all its prototypes. It's not huge, but probably noticeable over hundreds or thousands of pages.
The setURL() method is about the only reusability that has been designed into the parser. By creating your own Source class and using Parser.reset() you may be able to avoid some other allocations, but it's probably not worth it.
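For what it's worth, here is a rough sketch of the reuse pattern: one parser object is created once and then pointed at each page in turn, so any per-parser setup cost is paid a single time. To keep it runnable without the library on the classpath, it uses a hypothetical PageParser interface and a toy FakeParser (neither is part of HTMLParser, nor is parseAll). With the real library, the setURL() call would go to the existing Parser object, and the parse step would be whatever parse call you already use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ReuseSketch {
    /** Hypothetical stand-in for the two parser calls discussed above. */
    interface PageParser {
        void setURL(String url);   // point the existing parser at a new page
        List<String> parse();      // return whatever the parse produces
        // Assumption: the real library's calls throw a checked exception,
        // which is left out here to keep the sketch short.
    }

    /** Parses every URL with ONE parser instance instead of one per page. */
    static List<List<String>> parseAll(PageParser parser, List<String> urls) {
        List<List<String>> results = new ArrayList<>();
        for (String url : urls) {
            parser.setURL(url);          // reuse: no new parser is constructed here
            results.add(parser.parse());
        }
        return results;
    }

    /** Toy implementation for demonstration only; it does no real HTML parsing. */
    static class FakeParser implements PageParser {
        static int constructed = 0;      // counts how many parsers were created
        private String url;

        FakeParser() { constructed++; }

        public void setURL(String url) { this.url = url; }

        public List<String> parse() {
            return Collections.singletonList("parsed:" + url);
        }
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://example.com/a.html",
                "http://example.com/b.html");
        List<List<String>> out = parseAll(new FakeParser(), urls);
        System.out.println(out.size() + " pages parsed, "
                + FakeParser.constructed + " parser(s) constructed");
    }
}
```

The alternative, constructing a new parser inside the loop, is simpler to reason about and is probably fine for a handful of pages. The reuse loop only starts to pay off at the hundreds-or-thousands scale mentioned above.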