
Using HTMLParser (newbie help)

Help
Tu Nam
2004-06-01
2004-06-05
  • Tu Nam

    Tu Nam - 2004-06-01

    I have used HTMLParser before, and I see that the new version ships with six more packages. As I understand it, some of them are for beans and some are for logging. And junit? I think that one is for testing.
    If I want to use it in a project, is it enough to include just htmlparser.jar, htmllexer.jar and commons-logging.jar?
    Can I exclude checkstyle-all, fit and thumbelina?
    And what are fit and thumbelina used for? Do I need the bundled junit if I don't need to run the tests, since I can use a separate JUnit?

     
    • Derrick Oswald

      Derrick Oswald - 2004-06-02

      If you are only using the Lexer, you only need htmllexer.jar.
      If you are doing parsing, you only need htmlparser.jar, which includes the classes from htmllexer.jar.
      The rest of the lib directory contents are used for development.

       
    • Tu Nam

      Tu Nam - 2004-06-03

      Thanks so much. I have another question:
      I am using the 1.5 integration build.
      I run a visitor such as ObjectFindingVisitor to find links, and that works. But when I then run another visitor on the parser to fix the links, such as UrlModifyingVisitor, it fails to print the modified result. The result is always empty.
      It seems the first visitor extracts all the text from the parser, so the second one only sees an empty HTML page.
      Short of writing a custom visitor, is there another way to extract the links first and then extract the text?
      Another question:
      I parse many times to fetch web pages, but if the parser has to parse a page that cannot be retrieved, it freezes, so the other pages cannot be parsed.

       
      • Derrick Oswald

        Derrick Oswald - 2004-06-03

        After running visitAllNodesWith() the parser will have exhausted the input stream, so it needs to be told to start from the beginning again before the next pass with a visitor:
            parser.reset ();

        If you are going to do this a lot of times, or want to see your changes to the nodes, rather than a completely new set of nodes each time, you'll need to collect the nodes into a NodeList first and then run the visitor over the nodelist like so:

        // get the list of nodes
        NodeList list = new NodeList ();
        for (NodeIterator i = parser.elements(); i.hasMoreNodes(); )
            list.add (i.nextNode ());
        // apply visitor 1
        visitor1.beginParsing ();
        for (NodeIterator i = list.elements (); i.hasMoreNodes();)
            i.nextNode ().accept (visitor1);
        visitor1.finishedParsing ();
        // apply visitor 2
        visitor2.beginParsing ();
        for (NodeIterator i = list.elements (); i.hasMoreNodes();)
            i.nextNode ().accept (visitor2);
        visitor2.finishedParsing ();
        ...

        Regarding freezing, if you are using a recent Sun JVM, you can set the connect and read timeouts. This is done once in your mainline before you start getting pages:
                System.setProperty ("sun.net.client.defaultReadTimeout", "7000");
                System.setProperty ("sun.net.client.defaultConnectTimeout", "7000");

        The numbers "7000" are the timeout in milliseconds, which you may need to adjust depending on the expected latency.
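
        As an alternative to the JVM-wide properties above, here is a minimal sketch that sets the timeouts on a single connection. It assumes Java 5 or later, where URLConnection.setConnectTimeout() and setReadTimeout() are available. The class name TimeoutSketch and the method openWithTimeouts are hypothetical names used only for illustration, and how the connection is then handed to the parser is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper, not part of HTMLParser.
public class TimeoutSketch {

    // Creates a URLConnection with explicit timeouts (milliseconds).
    // openConnection() does not contact the server yet; the timeouts
    // take effect when the connection is later used.
    public static URLConnection openWithTimeouts(String url, int connectMs, int readMs) {
        try {
            URLConnection connection = new URL(url).openConnection();
            connection.setConnectTimeout(connectMs); // limit on establishing the connection
            connection.setReadTimeout(readMs);       // limit on each blocking read
            return connection;
        } catch (IOException e) {
            // Covers malformed URLs as well as I/O failures.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection c = openWithTimeouts("http://example.com/", 7000, 7000);
        System.out.println(c.getConnectTimeout() + " ms connect, " + c.getReadTimeout() + " ms read");
    }
}
```

        Unlike the system properties, these settings apply only to the one connection, so different pages could be given different limits.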

         
    • Tu Nam

      Tu Nam - 2004-06-04

      Thanks a lot! I feel more comfortable working with HTMLParser now.
      Just one more small question: :)
      Does the Parser maintain a connection to the server, or does it just grab the source and close the connection?
      In the second case, is there a way to reuse the connection to the server?
      Again, thank you very much for the answers. I think HTMLParser is the best library for HTML processing.

       
      • Derrick Oswald

        Derrick Oswald - 2004-06-05

        The underlying stream is not closed, but it is exhausted after a parse.  So it's spent and of no use.
        The URLConnection can be obtained from:
            parser.getConnection ();
        You could try to refetch the data by calling getInputStream() again, but I'm not sure it would work.
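
        Since a spent stream cannot be rewound, one workaround (a different technique from refetching, and only a sketch) is to read the page into memory once and hand out a fresh in-memory stream for every pass. The class BufferedPageSketch and its methods are hypothetical names for illustration. It uses only java.io and assumes the page is small enough to hold in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of HTMLParser.
public class BufferedPageSketch {
    private final byte[] bytes;

    private BufferedPageSketch(byte[] bytes) {
        this.bytes = bytes;
    }

    // Drains the given stream once and keeps a copy of its contents.
    public static BufferedPageSketch readFully(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new BufferedPageSketch(out.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Each call returns an independent stream positioned at the start.
    public ByteArrayInputStream newStream() {
        return new ByteArrayInputStream(bytes);
    }

    public int size() {
        return bytes.length;
    }
}
```

        The stream from connection.getInputStream() would be passed to readFully() once, and each later pass would start from newStream() without going back to the server.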

         
