Hi.
I found HTMLParser to be quite useful for extracting data from websites. The problem is that it looks like you can't use it from inside a firewall.
For example, when I ran one of the included example programs (StringExtractor) and pointed it at an external web address, it returned:
org.htmlparser.util.ParserException: Connection refused: connect
Does HTMLParser contain a special class that would help developers whose machines are not directly connected to the Internet (the proxy scenario)?
Thank you.
There is no support for this at present.
You should add proxy connection support to the RFE (request for enhancement) list.
Proxy support should definitely be added.
Proxy host, port, user, and password support has been added in the ConnectionManager class in the http package.
Hi! I'm trying to extract some text from a web page, but I'm unable to do that because I receive a 500 error code. Recently the proxy server changed its URL from proxy.something.com, and now it uses an auto-configuration URL (something like http://proxy.something.com:8080/). What I'm doing is:
// Configure the proxy before fetching anything:
Parser.getConnectionManager().setProxyHost("proxy.something.com");
Parser.getConnectionManager().setProxyPort(8080);
Parser.getConnectionManager().setProxyUser("myusername");
Parser.getConnectionManager().setProxyPassword("mypassword");
try {
    Parser parser = new Parser("http://www.google.com");
    // Configure the bean before visiting, so the options take effect:
    StringBean stringBean = new StringBean();
    stringBean.setLinks(false);
    stringBean.setReplaceNonBreakingSpaces(true);
    stringBean.setCollapse(true);
    parser.visitAllNodesWith(stringBean);
    // No need for stringBean.setURL() here; that would trigger a second,
    // separate fetch of the page. The visit above already collected the text.
    String s = stringBean.getStrings();
} catch (ParserException e) {
    e.printStackTrace();
}
Thanks for your help!
500-series error codes mean that the server is aware it has encountered a problem or error:
501 Not Implemented:
The server doesn't support the functionality required to fulfill the request. This might occur if a server-side include were called on a server that doesn't support that function.
502 Bad Gateway:
The server, while acting as a gateway or proxy, received a bad request from an upstream server.
503 Service Unavailable:
The server is unable to handle the request due to maintenance or a temporary overload of the server.
504 Gateway Timeout:
The server, while acting as a gateway or proxy, did not receive a timely response from an upstream server.
505 HTTP Version Not Supported:
The server does not support the HTTP version that was used to make the request.
Does the proxy work with your browser?
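One way to see exactly which of these codes the server is returning is to read the status line directly with java.net before involving the parser at all. A minimal sketch (the URL is a placeholder; for 5xx responses getResponseCode() still returns the numeric code rather than throwing):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class CheckStatus {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.google.com");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // The status code and reason phrase from the response's status line,
        // e.g. "200 OK", "502 Bad Gateway", "504 Gateway Timeout":
        int code = conn.getResponseCode();
        String reason = conn.getResponseMessage();
        System.out.println(code + " " + reason);
        conn.disconnect();
    }
}
```

If the code turns out to be 502 or 504, the problem is on the proxy/gateway side rather than in your parsing code.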
You could use Commons httpclient to connect through your proxy and pull the response.
http://jakarta.apache.org/commons/httpclient/
Once you have the response, make sure it's HTML, then pass it to the parser using getResponseBodyAsString(). There are some great tutorials and example code on the HttpClient site.
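A rough sketch of that approach, assuming Commons HttpClient 3.x (the proxy host, port, and credentials are placeholders, and the fetched string is handed to HTMLParser via Parser.createParser rather than a URL):

```java
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.UsernamePasswordCredentials;
import org.apache.commons.httpclient.auth.AuthScope;
import org.apache.commons.httpclient.methods.GetMethod;
import org.htmlparser.Parser;
import org.htmlparser.beans.StringBean;

public class ProxyFetch {
    public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient();
        // Route all requests through the proxy (placeholder host/port):
        client.getHostConfiguration().setProxy("proxy.something.com", 8080);
        // Proxy credentials, if the proxy requires them:
        client.getState().setProxyCredentials(
            new AuthScope("proxy.something.com", 8080),
            new UsernamePasswordCredentials("myusername", "mypassword"));
        GetMethod get = new GetMethod("http://www.google.com");
        try {
            int status = client.executeMethod(get);
            if (status == 200) {
                // Hand the already-fetched HTML to HTMLParser:
                String body = get.getResponseBodyAsString();
                Parser parser = Parser.createParser(body, "UTF-8");
                StringBean stringBean = new StringBean();
                parser.visitAllNodesWith(stringBean);
                System.out.println(stringBean.getStrings());
            }
        } finally {
            get.releaseConnection();
        }
    }
}
```

This keeps the networking (and therefore the proxy handling) entirely inside HttpClient, so HTMLParser never needs to open a connection itself.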
It might be a good idea to use something like this in a future version of HTMLParser.
Good luck!
Matt Ruby
// Route java.net connections through the proxy; the documented
// property names are "http.proxyHost" and "http.proxyPort"
// ("proxySet" is not actually read by the JDK):
System.setProperty("http.proxyHost", "127.0.0.1");
System.setProperty("http.proxyPort", "8118");
URL adres = new URL("http://...");
URLConnection polaczenie = adres.openConnection();
// Basic credentials for a proxy that requires authorization:
String password = "user:pass";
String auth = "Basic " + java.util.Base64.getEncoder().encodeToString(password.getBytes());
polaczenie.setRequestProperty("Proxy-Authorization", auth);
Parser parser = new Parser();
parser.setConnection(polaczenie);
It works even with a proxy that demands authorization.
This capability has been added September 1, and will be available in the next integration release:
Implemented:
RFE #1017249 HTML Client Doesn't Support Cookies but will follow redirect
RFE #1010586 Add support for password protected URL
and RFE #1000739 Add support for proxy scenario
A new http package has been added, the primary class being ConnectionManager, which handles proxies, passwords, and cookies.
Some testing is still needed.
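Putting the pieces from this thread together, usage of the new ConnectionManager might look like the sketch below. The four proxy setters appear earlier in this thread; the setters for password-protected URLs and cookie processing are assumptions about the new API, not confirmed method names:

```java
import org.htmlparser.Parser;
import org.htmlparser.beans.StringBean;
import org.htmlparser.http.ConnectionManager;

public class ConnectionManagerDemo {
    public static void main(String[] args) throws Exception {
        ConnectionManager manager = Parser.getConnectionManager();
        // Proxy settings (these four setters are shown earlier in the thread):
        manager.setProxyHost("proxy.something.com");
        manager.setProxyPort(8080);
        manager.setProxyUser("myusername");
        manager.setProxyPassword("mypassword");
        // Credentials for a password-protected URL and cookie handling;
        // these method names are assumptions about the new API:
        manager.setUser("siteuser");
        manager.setPassword("sitepassword");
        manager.setCookieProcessingEnabled(true);
        // All Parser connections now go through the configured manager:
        Parser parser = new Parser("http://www.google.com");
        StringBean stringBean = new StringBean();
        parser.visitAllNodesWith(stringBean);
        System.out.println(stringBean.getStrings());
    }
}
```

The design centralizes all connection concerns in one place, so existing code that constructs a Parser from a URL picks up proxy, password, and cookie support without changes.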