Well, I have been using HTMLParser and I see that the new version ships with six more packages. As I read it, some of these are for the bean, some are for logging, and junit, I think, is for testing.
So if I want to use it in a project, is it enough to just include htmlparser.jar, htmllexer.jar and commons-logging.jar?
Can I exclude checkstyle-all, fit and thumbelina?
And what are fit and thumbelina used for? And why junit, when I don't need the tests and could use a separate JUnit anyway?
If you are only using the Lexer, you only need htmllexer.jar.
If you are doing parsing, you only need htmlparser.jar, which includes the classes from htmllexer.jar.
The rest of the lib directory contents are used for development.
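For example, a lexer-only program like the following should compile and run with nothing but htmllexer.jar on the classpath. This is a minimal sketch, assuming the Lexer(String) constructor and the nextNode() loop of the 1.x lexer API:
import org.htmlparser.Node;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.util.ParserException;
public class LexerOnly
{
    public static void main (String[] args) throws ParserException
    {
        // lex a fragment directly, without any parser-level classes
        Lexer lexer = new Lexer ("<html><body>Hello <b>world</b></body></html>");
        for (Node node = lexer.nextNode (); null != node; node = lexer.nextNode ())
            System.out.println (node);
    }
}
The same program built around the Parser class instead would need htmlparser.jar, and nothing else from the lib directory.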
Thanks so much, and I have another question:
I use the 1.5 integration build.
I run a visitor such as ObjectFindingVisitor to find links, and that works, but when I then run another visitor on the same parser to fix the links, such as UrlModifyingVisitor, it fails to print the modified result. The result is always empty.
It seems the first visitor consumes all the text from the parser, so the second pass sees only an empty HTML page.
Short of writing a custom visitor, is there another way to extract the links first and the text afterwards?
Another question:
I parse many pages in a row, but when the parser hits a page that cannot be retrieved, it freezes, so the remaining pages can't be parsed.
After running visitAllNodesWith() the parser will have exhausted the input stream and needs to be told to start from the beginning again for the next pass with a visitor:
parser.reset ();
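For instance, the two-pass case from the question might look like this. This is a minimal sketch using the stock ObjectFindingVisitor and TextExtractingVisitor; substitute your own visitors as needed:
import org.htmlparser.tags.LinkTag;
import org.htmlparser.visitors.ObjectFindingVisitor;
import org.htmlparser.visitors.TextExtractingVisitor;
// first pass: find the links
ObjectFindingVisitor links = new ObjectFindingVisitor (LinkTag.class);
parser.visitAllNodesWith (links);
System.out.println (links.getCount () + " links found");
// rewind to the start of the page before the second pass
parser.reset ();
// second pass: extract the text
TextExtractingVisitor text = new TextExtractingVisitor ();
parser.visitAllNodesWith (text);
System.out.println (text.getExtractedText ());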
If you are going to do this a lot of times, or want to see your changes to the nodes rather than a completely new set of nodes each time, you'll need to collect the nodes into a NodeList first and then run the visitors over the NodeList, like so:
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.NodeList;
// get the list of nodes
NodeList list = new NodeList ();
for (NodeIterator i = parser.elements (); i.hasMoreNodes (); )
    list.add (i.nextNode ());
// apply visitor 1
visitor1.beginParsing ();
for (NodeIterator i = list.elements (); i.hasMoreNodes (); )
    i.nextNode ().accept (visitor1);
visitor1.finishedParsing ();
// apply visitor 2
visitor2.beginParsing ();
for (NodeIterator i = list.elements (); i.hasMoreNodes (); )
    i.nextNode ().accept (visitor2);
visitor2.finishedParsing ();
...
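Once the visitors have run over the same NodeList, you can get the (now modified) markup back from the list itself. Assuming the toHtml() method on NodeList:
// serialize the modified nodes back to HTML
System.out.println (list.toHtml ());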
Regarding freezing, if you are using a recent Sun JVM, you can set the connect and read timeouts. This is done once in your mainline before you start getting pages:
System.setProperty ("sun.net.client.defaultReadTimeout", "7000");
System.setProperty ("sun.net.client.defaultConnectTimeout", "7000");
The numbers "7000" are the timeout in milliseconds, which you may need to adjust depending on the expected latency.
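If you are on Java 5 or later, an alternative is to set the timeouts on the individual connection before handing it to the parser. A sketch, assuming the Parser(URLConnection) constructor, with example.com as a placeholder URL:
import java.net.URL;
import java.net.URLConnection;
import org.htmlparser.Parser;
URLConnection connection = new URL ("http://example.com/").openConnection ();
connection.setConnectTimeout (7000); // milliseconds
connection.setReadTimeout (7000);
Parser parser = new Parser (connection);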
Great, thanks! I feel more comfortable working with HTMLParser now.
And just one small question: :)
Does the Parser maintain a connection to the server, or does it just grab the source and close the connection?
If it's the second case, is there a way to reuse the connection to the server?
Again, thank you very much for the answers. I find HTMLParser is the best library for HTML processing.
The underlying stream is not closed, but it is exhausted after a parse. So it's spent and of no use.
The URLConnection can be obtained from:
parser.getConnection ();
You could try to refetch the data by calling getInputStream() again, but I'm not sure it would work.
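If calling getInputStream() again doesn't yield the data, the fallback is simply to open a fresh connection to the same URL. A sketch, assuming the Parser(URLConnection) constructor:
import java.net.URLConnection;
import org.htmlparser.Parser;
URLConnection connection = parser.getConnection ();
// the old stream is spent, so open a brand-new connection to the same URL
Parser fresh = new Parser (connection.getURL ().openConnection ());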