Attached is a patch that adds a new constructor to
HtmlDocument that takes an HttpDoc as an
argument. The patch also includes a bug fix to the
String content constructor of HtmlDocument.
Problem Statement: The code required to convert an
HttpDoc into an HtmlDocument is rather complex (due to
the charset encoding issue) and was contained within
WebRobot. As a result, applications that use HttpTool
directly rather than through WebRobot had to reimplement this logic.
Proposed Solution: Move the HttpDoc to
HtmlDocument conversion logic from WebRobot into a
new constructor on HtmlDocument that accepts an
HttpDoc.
Benefit: The logic required to determine the charset
encoding by parsing the "Content-Type" header is now
accessible to code outside of WebRobot.
Drawback: HtmlDocument becomes aware of the existence of
HttpDoc. An alternative would be to move this logic into
a static factory method contained in a utility class.
However, the constructor approach is more intuitive and
direct, and since HTTP and HTML are closely related anyway,
there is little reason to keep the two classes unaware of each
other.
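For comparison, here is a rough sketch of what the static factory alternative could look like. The classes HttpDocStub, HtmlDocStub and DocumentConverter are hypothetical stand-ins invented for this illustration; they are not the real HttpDoc or HtmlDocument, and the real classes carry more state than shown here.

```java
final class FactoryAlternativeSketch {

    // Hypothetical stand-in for the HTTP-side document (not the real HttpDoc).
    static final class HttpDocStub {
        final byte[] content;
        final String charset; // null when the header names no charset

        HttpDocStub(byte[] content, String charset) {
            this.content = content;
            this.charset = charset;
        }
    }

    // Hypothetical stand-in for the HTML-side document. It never mentions
    // HttpDocStub, which is the decoupling the alternative would buy.
    static final class HtmlDocStub {
        final byte[] content;
        final String charset;

        HtmlDocStub(byte[] content, String charset) {
            this.content = content;
            this.charset = charset;
        }
    }

    // Utility class that owns the conversion, so only it knows both types.
    static final class DocumentConverter {
        private DocumentConverter() {
            // static methods only
        }

        static HtmlDocStub toHtml(HttpDocStub doc) {
            return new HtmlDocStub(doc.content, doc.charset);
        }
    }

    public static void main(String[] args) {
        HttpDocStub http = new HttpDocStub(new byte[] {60, 112, 62}, "ISO-8859-1");
        HtmlDocStub html = DocumentConverter.toHtml(http);
        System.out.println(html.charset);
    }
}
```

The cost is one more class for callers to discover, which is why the patch goes with the constructor instead.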
Implementation: Attached is a patch file which moves
the HttpDoc to HtmlDocument logic from WebRobot into
a constructor on HtmlDocument.
**Important Additional Notes: The attached patch also
includes a change (with comments) to the String
content constructor in HtmlDocument. The old
implementation would have failed when non-ASCII
characters were included in the content String. The
new implementation addresses this issue by using the JDK
provided String.getBytes(), which encodes the characters
with the platform's default encoding rather than merely
truncating each 16-bit char into an 8-bit byte.
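To illustrate the difference, here is a small standalone sketch. It is not code from the patch: the truncate method only mimics what narrowing each char to a byte does, and the round trip uses an explicit UTF-8 charset (my assumption, chosen so the result does not depend on the machine), whereas the patch itself calls getBytes() with the platform default.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class CharTruncationDemo {

    // Mimics the old behaviour: keep only the low 8 bits of each 16-bit char.
    static byte[] truncate(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) s.charAt(i);
        }
        return out;
    }

    // Encodes with an explicit charset and decodes with the same one.
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two characters above U+00FF

        // The high byte of each char is gone, so the text cannot be recovered.
        System.out.println(Arrays.toString(truncate(text)));

        // Encoding through a charset that covers these characters is lossless.
        System.out.println(roundTrip(text).equals(text));
    }
}
```

Note that getBytes() without an argument is only lossless when the platform default encoding can represent every character in the String, so passing the encoding explicitly may still be worth considering.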
Logged In: YES
user_id=711761
I noticed you just made a change to WebRobot (with respect
to checking for null on doc managers), which makes the patch
I submitted outdated. Rather than create a new patch file, I
figured I'd just post the code here so you can consider the
submission more easily:
In file WebRobot.java, replaced:
<     // solving encoding problem
<     // HtmlDocument htmlDoc = new HtmlDocument(u, doc.getContent());
<     HtmlDocument htmlDoc = null;
<     HttpHeader contentTypeHeader = doc.getHeader("Content-type");
<     if (contentTypeHeader != null) {
<       String contentType = contentTypeHeader.getValue();
<       int index = contentType.toLowerCase().indexOf("charset=");
<       if (index > 0) {
<         htmlDoc = new HtmlDocument(u, doc.getContent(), contentType.substring(index+8));
<       } else {
<         htmlDoc = new HtmlDocument(u, doc.getContent());
<       }
<     } else {
<       htmlDoc = new HtmlDocument(u, doc.getContent());
<     }
With:
>     HtmlDocument htmlDoc = new HtmlDocument(doc);
In file HtmlDocument.java added the following:
  /**
   * Initializes an HTML document from a String. Converts the string to
   * bytes using the default encoding of the platform.
   *
   * @param url the URL of this document. Needed for link extraction.
   * @param content some HTML text
   */
  public HtmlDocument(URL url, String content) {
    // Note: An alternative might be to specify the encoding so there is no
    // ambiguity. However, this is likely unnecessary since the HTML parser
    // will simply turn around and decode this back into a String when
    // creating the InputStreamReader.
    this(url);
    this.content = content.getBytes();
  }
  /**
   * Creates a new HtmlDocument using the contents of the HttpDoc.
   *
   * @param doc the HttpDoc to read when creating the HtmlDocument.
   */
  public HtmlDocument(HttpDoc doc) {
    this(doc.getURL(), doc.getContent(), extractCharEncoding(doc));
  }
  /**
   * Reads the "charset=" section of the "Content-Type" header to determine
   * the character encoding used in the document. If the charset isn't
   * set in the header, null is returned.
   *
   * @param doc The document for which to determine the character encoding.
   * @return the character encoding from the "Content-Type" header or null
   *         if the encoding isn't specified.
   */
  private static String extractCharEncoding(HttpDoc doc) {
    HttpHeader contentTypeHeader = doc.getHeader("Content-type");
    if (contentTypeHeader != null) {
      String contentType = contentTypeHeader.getValue();
      int index = contentType.toLowerCase().indexOf("charset=");
      // indexOf returns -1 when "charset=" is absent; 0 is a valid match.
      if (index >= 0) {
        // Skip past "charset=" (8 characters) and drop surrounding whitespace.
        return contentType.substring(index + 8).trim();
      }
    }
    return null;
  }
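One thing the substring approach does not handle is a charset value that is quoted or followed by further parameters. Below is a standalone sketch of a stricter parse, in case it is useful. It works on a plain header value String rather than on HttpDoc/HttpHeader, and the class name, method name and regular expression are my own assumptions, not part of the patch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContentTypeCharset {

    // Matches charset=value or charset="value", case-insensitively, and stops
    // at a quote, semicolon or whitespace so trailing parameters are excluded.
    private static final Pattern CHARSET =
        Pattern.compile("\\bcharset\\s*=\\s*\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type value, or null if none is given.
    static String extract(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("text/html; charset=ISO-8859-1"));
        System.out.println(extract("text/html; charset=\"utf-8\"; boundary=x"));
        System.out.println(extract("text/html"));
    }
}
```

Whether this extra strictness is worth it depends on how varied the servers being crawled are; the simpler version is probably fine for well-formed headers.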
Updated Patch File