Hi.
I found HTMLParser to be quite useful for extracting data from websites. The problem is that it looks like you can't use it from inside a firewall.
For example, when I ran one of the included example programs (StringExtractor) and pointed it at an external web address, it returned:
org.htmlparser.util.ParserException: Connection refused: connect
Does HTMLParser contain a special class that would help developers whose machines are not directly connected to the Internet (the proxy scenario)?
Thank you.
There is no support for this at present.
You should add proxy connection support to the RFE (request for enhancement) list.
Proxy support should definitely be added.
Proxy host, port, user, and password support has been added in the ConnectionManager class in the http package.
Hi! I'm trying to extract some text from a web page, but I'm unable to do that because I receive a 500 error code. Recently the proxy server changed its URL from proxy.something.com, and now it uses an auto-configuration URL (something like http://proxy.something.com:8080/). What I'm doing is:
// Configure the proxy before fetching anything:
Parser.getConnectionManager().setProxyHost("proxy.something.com");
Parser.getConnectionManager().setProxyPort(8080);
Parser.getConnectionManager().setProxyUser("myusername");
Parser.getConnectionManager().setProxyPassword("mypassword");
try {
    Parser parser = new Parser("http://www.google.com");
    // Configure the bean before visiting, so the options take effect:
    StringBean stringBean = new StringBean();
    stringBean.setLinks(false);
    stringBean.setReplaceNonBreakingSpaces(true);
    stringBean.setCollapse(true);
    parser.visitAllNodesWith(stringBean);
    // No need for stringBean.setURL() here; that would trigger a second,
    // separate fetch of the page. The visit above already collected the text.
    String s = stringBean.getStrings();
} catch (ParserException e) {
    e.printStackTrace();
}
Thanks for your help!
500-series error codes mean that the server is aware it has encountered a problem or error:
501 Not Implemented:
The server doesn't support the functionality required to fulfill the request. This might occur if a server-side include were called on a server that doesn't support that function.
502 Bad Gateway:
The server, while acting as a gateway or proxy, received a bad request from an upstream server.
503 Service Unavailable:
The server is unable to handle the request due to maintenance or a temporary overload of the server.
504 Gateway Timeout:
The server, while acting as a gateway or proxy, did not receive a timely response from an upstream server.
505 HTTP Version Not Supported:
The server does not support the HTTP version that was used to make the request.
Does the proxy work with your browser?
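One way to see exactly which of these codes the server is returning is to read the status line directly with java.net before involving the parser at all. A minimal sketch (the URL is a placeholder; for 5xx responses getResponseCode() still returns the numeric code rather than throwing):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class CheckStatus {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.google.com");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // The status code and reason phrase from the response's status line,
        // e.g. "200 OK", "502 Bad Gateway", "504 Gateway Timeout":
        int code = conn.getResponseCode();
        String reason = conn.getResponseMessage();
        System.out.println(code + " " + reason);
        conn.disconnect();
    }
}
```

If the code turns out to be 502 or 504, the problem is on the proxy/gateway side rather than in your parsing code.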
You could use Commons httpclient to connect through your proxy and pull the response.
http://jakarta.apache.org/commons/httpclient/
Once you have the response, make sure it's HTML, then pass it to the parser using getResponseBodyAsString(). There are some great tutorials and example code on the HttpClient site.
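A rough sketch of that approach, assuming Commons HttpClient 3.x (the proxy host, port, and credentials are placeholders, and the fetched string is handed to HTMLParser via Parser.createParser rather than a URL):

```java
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.UsernamePasswordCredentials;
import org.apache.commons.httpclient.auth.AuthScope;
import org.apache.commons.httpclient.methods.GetMethod;
import org.htmlparser.Parser;
import org.htmlparser.beans.StringBean;

public class ProxyFetch {
    public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient();
        // Route all requests through the proxy (placeholder host/port):
        client.getHostConfiguration().setProxy("proxy.something.com", 8080);
        // Proxy credentials, if the proxy requires them:
        client.getState().setProxyCredentials(
            new AuthScope("proxy.something.com", 8080),
            new UsernamePasswordCredentials("myusername", "mypassword"));
        GetMethod get = new GetMethod("http://www.google.com");
        try {
            int status = client.executeMethod(get);
            if (status == 200) {
                // Hand the already-fetched HTML to HTMLParser:
                String body = get.getResponseBodyAsString();
                Parser parser = Parser.createParser(body, "UTF-8");
                StringBean stringBean = new StringBean();
                parser.visitAllNodesWith(stringBean);
                System.out.println(stringBean.getStrings());
            }
        } finally {
            get.releaseConnection();
        }
    }
}
```

This keeps the networking (and therefore the proxy handling) entirely inside HttpClient, so HTMLParser never needs to open a connection itself.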
It might be a good idea to use something like this in a future version of HTMLParser.
Good luck!
Matt Ruby
// Route java.net connections through the proxy; the documented
// property names are "http.proxyHost" and "http.proxyPort"
// ("proxySet" is not actually read by the JDK):
System.setProperty("http.proxyHost", "127.0.0.1");
System.setProperty("http.proxyPort", "8118");
URL adres = new URL("http://...");
URLConnection polaczenie = adres.openConnection();
// Basic credentials for a proxy that requires authorization:
String password = "user:pass";
String auth = "Basic " + java.util.Base64.getEncoder().encodeToString(password.getBytes());
polaczenie.setRequestProperty("Proxy-Authorization", auth);
Parser parser = new Parser();
parser.setConnection(polaczenie);
It works even with a proxy that demands authorization.
This capability has been added September 1, and will be available in the next integration release:
Implemented:
RFE #1017249 HTML Client Doesn't Support Cookies but will follow redirect
RFE #1010586 Add support for password protected URL
and RFE #1000739 Add support for proxy scenario
A new http package has been added, the primary class being ConnectionManager, which handles proxies, passwords, and cookies.
Some testing is still needed.
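Putting the pieces from this thread together, usage of the new ConnectionManager might look like the sketch below. The four proxy setters appear earlier in this thread; the setters for password-protected URLs and cookie processing are assumptions about the new API, not confirmed method names:

```java
import org.htmlparser.Parser;
import org.htmlparser.beans.StringBean;
import org.htmlparser.http.ConnectionManager;

public class ConnectionManagerDemo {
    public static void main(String[] args) throws Exception {
        ConnectionManager manager = Parser.getConnectionManager();
        // Proxy settings (these four setters are shown earlier in the thread):
        manager.setProxyHost("proxy.something.com");
        manager.setProxyPort(8080);
        manager.setProxyUser("myusername");
        manager.setProxyPassword("mypassword");
        // Credentials for a password-protected URL and cookie handling;
        // these method names are assumptions about the new API:
        manager.setUser("siteuser");
        manager.setPassword("sitepassword");
        manager.setCookieProcessingEnabled(true);
        // All Parser connections now go through the configured manager:
        Parser parser = new Parser("http://www.google.com");
        StringBean stringBean = new StringBean();
        parser.visitAllNodesWith(stringBean);
        System.out.println(stringBean.getStrings());
    }
}
```

The design centralizes all connection concerns in one place, so existing code that constructs a Parser from a URL picks up proxy, password, and cookie support without changes.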