Hi everyone,
I'm trying to parse a very large number of web pages in order to extract the text, but some of them do not contain any, as the parser tells me by printing a message to the console. Is there a way to catch these messages when they are printed, and also to handle those URLs properly?
What I do is create a parser with a well-formed URL, like this:
parser = new Parser(url);
And for some of those urls the message appears.
Thanks in advance for your help!
You could try setting the request header fields of the HTTP request so that it accepts only text.
See the documentation on org.htmlparser.http.ConnectionManager.setDefaultRequestProperties.
However, this relies on the answering server being properly configured; otherwise you will still get servers sending image files marked as text, and so on.
One way would be to prefetch the URL yourself and examine the MIME type, then just skip non-text pages.
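A rough sketch of that prefetch idea, using only the JDK and none of the HTMLParser classes. The class name MimePrefetch, the helper names, and the 5-second timeouts are made up for illustration, and it assumes plain http/https URLs. It still trusts whatever Content-Type the server sends, so the caveat about misconfigured servers applies here too.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Locale;

public class MimePrefetch {

    /** Returns true when a Content-Type header value names a text/* media type. */
    static boolean isTextContentType(String contentType) {
        if (contentType == null) {
            return false; // no header at all: treat as non-text
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.startsWith("text/");
    }

    /** Sends a HEAD request and reports whether the server labels the resource as text. */
    static boolean looksLikeText(String url) throws IOException {
        // Assumption: the URL is http or https, so the cast below is valid.
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try {
            conn.setRequestMethod("HEAD"); // ask for headers only, no body
            conn.setConnectTimeout(5000);  // illustrative timeouts, in milliseconds
            conn.setReadTimeout(5000);
            return isTextContentType(conn.getContentType());
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Each command-line argument is treated as a URL to check.
        for (String url : args) {
            if (looksLikeText(url)) {
                System.out.println("parse: " + url);
            } else {
                System.out.println("skip:  " + url);
            }
        }
    }
}
```

Only the URLs that pass the check would then be handed to new Parser(url); the skipped ones can be logged or collected for whatever special treatment they need.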
The best way is probably to extend the ConnectionMonitor interface to return a status indicating whether to continue or not. The checking could then be done in-line by implementing the interface.
I look forward to your code patch submission ;)