HTMLParser Tests from inside a firewall

Help
2004-07-29
2013-04-27
  • Henry Sukendro

    Henry Sukendro - 2004-07-29

    Hi.

    I found HTMLParser quite useful for extracting
    data from websites. The problem is that it looks like
    you can't use it from inside a firewall.

    For example, when I ran one of the included example
    programs (StringExtractor) and pointed it to
    an external web address, it returned:
    org.htmlparser.util.ParserException: Connection
    refused: connect

    Does HTMLParser contain a special class that
    would handle developers whose machines are not
    directly connected to the Internet (proxy
    scenario)?

    Thank you.

     
    • Derrick Oswald

      Derrick Oswald - 2004-07-30

      There is no support for this at present.
      You should add the use of proxy connections to the RFE (request for enhancement) list.

       
      • loomax

        loomax - 2006-06-21

        Proxy support should definitely be added.

         
        • Derrick Oswald

          Derrick Oswald - 2006-06-21

          Proxy host, port, user and password support has been added to the ConnectionManager class in the http package.

           
          • Miguel

            Miguel - 2009-08-14

            Hi! I'm trying to extract some text from a web page, but I'm unable to because I receive a 500 error code. Recently the proxy server changed its URL from proxy.something.com and now uses an auto-configuration URL (something like http://proxy.something.com:8080/). What I'm doing is:

            Parser.getConnectionManager().setProxyHost("proxy.something.com");
            Parser.getConnectionManager().setProxyPort(8080);
            Parser.getConnectionManager().setProxyUser("myusername");
            Parser.getConnectionManager().setProxyPassword("mypassword");

            try {
                Parser parser = new Parser("http://www.google.com");

                StringBean stringBean = new StringBean();

                parser.visitAllNodesWith(stringBean);

                stringBean.setLinks(false);
                stringBean.setReplaceNonBreakingSpaces(true);
                stringBean.setCollapse(true);
                stringBean.setURL("http://www.google.com");

                String s = stringBean.getStrings();

            } catch (ParserException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }

            Thanks for your help!

             
            • Derrick Oswald

              Derrick Oswald - 2009-08-15

              Error codes in the 500 range mean the server is aware that it has encountered a problem or error:

              501 Not Implemented:
              The server doesn't support the functions required for fulfilling the request. This might occur if a server-side include were called on a server that doesn't support that function.

              502 Bad Gateway:
              The server, while acting as a gateway or proxy, received an invalid response from an upstream server.

              503 Service Unavailable:
              The server is unable to handle the request due to maintenance or a temporary overload of the server.

              504 Gateway Timeout:
              The server, while acting as a gateway or proxy, did not receive a timely response from an upstream server.

              505 HTTP Version Not Supported:
              The server does not support the HTTP version that was used to make the request.

              Does the proxy work with your browser?
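
To find out which of these codes the proxy actually returns, a small check using only the JDK can help. This is a minimal sketch, not part of HTMLParser: the class name `ProxyStatusSketch` and both method names are hypothetical, and the proxy host and port are whatever your network uses.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;

// Hypothetical helper, for illustration only.
final class ProxyStatusSketch {

    // Maps the status codes listed above to their standard reason phrases.
    static String describe(int status) {
        switch (status) {
            case 500: return "Internal Server Error";
            case 501: return "Not Implemented";
            case 502: return "Bad Gateway";
            case 503: return "Service Unavailable";
            case 504: return "Gateway Timeout";
            case 505: return "HTTP Version Not Supported";
            default:
                return (status >= 500 && status <= 599)
                        ? "Other server error"
                        : "Not a 5xx status";
        }
    }

    // Sends a HEAD request through the given HTTP proxy and returns the status code.
    static int fetchStatus(String url, String proxyHost, int proxyPort) throws IOException {
        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxyHost, proxyPort));
        HttpURLConnection connection =
                (HttpURLConnection) URI.create(url).toURL().openConnection(proxy);
        try {
            connection.setRequestMethod("HEAD");
            return connection.getResponseCode();
        } finally {
            connection.disconnect();
        }
    }
}
```

Calling `ProxyStatusSketch.describe(ProxyStatusSketch.fetchStatus(url, host, port))` with your own values shows whether the failure comes from the proxy itself (502, 504) or from something else.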

               
    • Matt Ruby

      Matt Ruby - 2004-07-30

      You could use Commons httpclient to connect through your proxy and pull the response.
      http://jakarta.apache.org/commons/httpclient/

      Once you have the response, make sure it's HTML, then pass the string returned by getResponseBodyAsString() to the parser. There are some great tutorials and example code on the httpclient site.

      It might be a good idea to use something like this in a future version of HTMLParser.

      Good luck!

      Matt Ruby
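
The same fetch-then-parse idea can be sketched without Commons httpclient, using the JDK's own `HttpURLConnection` instead. This is only an illustration under assumptions: the class `ProxyFetchSketch` and its methods are hypothetical names, the body is assumed to be UTF-8, and handing the result to HTMLParser (for example through `Parser.createParser(html, "UTF-8")`, if your version provides it) is left to the caller.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, for illustration only.
final class ProxyFetchSketch {

    // Returns true when the Content-Type header names an HTML media type.
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        String lower = contentType.toLowerCase(Locale.ROOT);
        return lower.startsWith("text/html") || lower.startsWith("application/xhtml+xml");
    }

    // Reads the whole stream and decodes it as UTF-8 (an assumption about the page).
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Fetches the page through the proxy and returns its body only if it is HTML.
    static String fetchHtml(String url, String proxyHost, int proxyPort) throws IOException {
        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxyHost, proxyPort));
        HttpURLConnection connection =
                (HttpURLConnection) URI.create(url).toURL().openConnection(proxy);
        try (InputStream in = connection.getInputStream()) {
            String contentType = connection.getContentType();
            if (!isHtml(contentType)) {
                throw new IOException("Not HTML: " + contentType);
            }
            return readAll(in);
        } finally {
            connection.disconnect();
        }
    }
}
```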

       
    • Marek Nazarko

      Marek Nazarko - 2004-09-14

      import java.net.URL;
      import java.net.URLConnection;
      import java.util.Properties;
      import org.htmlparser.Parser;

      // JVM-wide proxy settings.
      Properties systemSettings = System.getProperties();
      systemSettings.put("proxySet", "true");
      systemSettings.put("proxyHost", "127.0.0.1");
      systemSettings.put("proxyPort", "8118");
      System.setProperties(systemSettings);
      URL adres = new URL("http://...");
      URLConnection polaczenie = adres.openConnection();
      // Proxy credentials, sent as HTTP Basic authentication.
      String password = "user:pass";
      String auth = "Basic " + new sun.misc.BASE64Encoder().encode(password.getBytes());
      polaczenie.setRequestProperty("Proxy-Authorization", auth);
      // Hand the pre-configured connection to the parser.
      Parser parser = new Parser();
      parser.setConnection(polaczenie);
      It works even with a proxy that demands authorization.
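
On newer JDKs, `sun.misc.BASE64Encoder` is no longer available, so the header in the snippet above may need to be built differently. A minimal sketch using `java.util.Base64` (Java 8 and later) and a per-connection `java.net.Proxy` follows. The class `ProxyAuthSketch` and its method names are hypothetical, and whether your proxy accepts Basic credentials sent this way is an assumption to verify.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, for illustration only.
final class ProxyAuthSketch {

    // Builds the value of a Proxy-Authorization header for HTTP Basic authentication.
    static String basicAuthHeader(String user, String password) {
        String credentials = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    // Creates a connection routed through the proxy; nothing is sent until it is used.
    static URLConnection openThroughProxy(String url, String proxyHost, int proxyPort,
                                          String user, String password) throws IOException {
        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxyHost, proxyPort));
        URLConnection connection = URI.create(url).toURL().openConnection(proxy);
        connection.setRequestProperty("Proxy-Authorization", basicAuthHeader(user, password));
        return connection;
    }
}
```

The returned connection can be passed to `parser.setConnection(...)` in the same way as `polaczenie` above. Using a `Proxy` object keeps the setting local to one connection instead of changing system properties for the whole JVM.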

       
      • Derrick Oswald

        Derrick Oswald - 2004-09-14

        This capability was added on September 1 and will be available in the next integration release:

        Implemented:
        RFE #1017249    HTML Client Doesn't Support Cookies but will follow redirect
        RFE #1010586    Add support for password protected URL
        and RFE #1000739    Add support for proxy scenario

        A new http package has been added, the primary class being
        ConnectionManager, which handles proxies, passwords
        and cookies.
        Some testing is still needed.

         
