
#3 Refactoring: new HtmlDocument(httpDocument);

open
nobody
None
5
2003-11-06
2003-11-06
No

Attached is a patch that adds a new constructor to
HtmlDocument which takes an HttpDoc as its
argument. The patch also includes a bug fix to the
String content constructor of HtmlDocument.

Problem Statement: The code required to convert an
HttpDoc into an HtmlDocument is rather complex (due to
the charset encoding issue) and was contained within
WebRobot. As a result, applications that use HttpTool
directly, rather than through WebRobot, had to
reimplement this logic.

Proposed Solution: Refactor the HttpDoc to
HtmlDocument conversion logic out of WebRobot and into
a new constructor on HtmlDocument that accepts an
HttpDoc.

Benefit: The logic required to determine the charset
encoding by parsing the "Content-Type" header is now
accessible to code outside of WebRobot.
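
As a rough illustration of the header parsing described above, here is a
standalone sketch. It is not the code in the patch: the class name
ContentTypeCharsetSketch and the method extractCharset are hypothetical, and
the method takes the raw header value as a String because HttpDoc and
HttpHeader are not reproduced here. Unlike the simple "everything after
charset=" lookup, it also drops trailing parameters and surrounding quotes,
which is an assumption about what callers would want.

```java
import java.util.Locale;

/** Hypothetical helper (not part of the patch): reads the charset from a Content-Type value. */
class ContentTypeCharsetSketch {

    /** Returns the charset named in the header value, or null if none is present. */
    static String extractCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Case-insensitive search for the "charset=" parameter.
        int index = contentType.toLowerCase(Locale.ROOT).indexOf("charset=");
        if (index < 0) {
            return null;
        }
        String value = contentType.substring(index + "charset=".length());
        // Drop any parameters that follow the charset, e.g. "; foo=bar".
        int semicolon = value.indexOf(';');
        if (semicolon >= 0) {
            value = value.substring(0, semicolon);
        }
        // Strip whitespace and optional surrounding double quotes.
        value = value.trim();
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        return value.isEmpty() ? null : value;
    }

    public static void main(String[] args) {
        System.out.println(extractCharset("text/html; charset=ISO-8859-1"));
        System.out.println(extractCharset("text/html; charset=\"utf-8\"; foo=bar"));
        System.out.println(extractCharset("text/html"));
    }
}
```

A null result means the header did not specify a charset, so the caller can
fall back to whatever default it already uses.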

Drawback: HtmlDocument becomes aware of the existence of
HttpDoc. An alternative would be to move this logic into
a static factory method in a utility class.
However, the constructor approach is more intuitive and
direct, and since HTTP and HTML are closely related anyway,
there is little reason to keep the two classes unaware of
each other.
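
For comparison, the static factory alternative mentioned above might look
roughly like the sketch below. Everything in it is hypothetical: HttpDocStub
and HtmlDocumentStub are simplified stand-ins for the real classes (with a
String URL instead of java.net.URL), and HtmlDocuments.fromHttpDoc is an
invented name. The point is only that the document class stays unaware of the
HTTP class, at the cost of an extra utility class.

```java
/** Stand-in for HttpDoc; the real class is not reproduced here. */
class HttpDocStub {
    final String url;
    final byte[] content;
    final String contentType; // raw "Content-Type" header value, may be null

    HttpDocStub(String url, byte[] content, String contentType) {
        this.url = url;
        this.content = content;
        this.contentType = contentType;
    }
}

/** Stand-in for HtmlDocument; it has no reference to HttpDocStub. */
class HtmlDocumentStub {
    final String url;
    final byte[] content;
    final String encoding; // null means "not specified"

    HtmlDocumentStub(String url, byte[] content, String encoding) {
        this.url = url;
        this.content = content;
        this.encoding = encoding;
    }
}

/** Hypothetical utility class that holds the conversion logic. */
final class HtmlDocuments {
    private HtmlDocuments() {}

    /** Static factory: builds the HTML document from the HTTP document. */
    static HtmlDocumentStub fromHttpDoc(HttpDocStub doc) {
        return new HtmlDocumentStub(doc.url, doc.content, charsetOf(doc.contentType));
    }

    // Same simple lookup as the WebRobot code: everything after "charset=".
    private static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        int index = contentType.toLowerCase().indexOf("charset=");
        return index > 0 ? contentType.substring(index + 8) : null;
    }

    public static void main(String[] args) {
        HttpDocStub doc = new HttpDocStub(
                "http://example.com/", new byte[0], "text/html; charset=UTF-8");
        System.out.println(fromHttpDoc(doc).encoding);
    }
}
```

With the constructor approach the call site is new HtmlDocument(doc); with
the factory it would be something like HtmlDocuments.fromHttpDoc(doc).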

Implementation: Attached is a patch file which moves
the HttpDoc to HtmlDocument logic from WebRobot into
a constructor on HtmlDocument.

**Important Additional Notes: The attached patch also
includes a change (with comments) to the String
content constructor in HtmlDocument. The old
implementation would have failed when foreign/non-ASCII
characters were included in the content String. The
new implementation addresses this issue by using the
JDK-provided String.getBytes(), which encodes the
characters using the platform's default charset
rather than merely truncating each 16-bit char to an
8-bit byte.
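
The difference can be seen in a small sketch. The truncate method below is an
assumed reconstruction of the old behaviour (casting each char to a byte),
not the actual old code, and StringToBytesSketch is a hypothetical class
name. The sketch also passes an explicit UTF-8 charset so the result does not
depend on the platform, whereas the patch itself calls getBytes() with no
argument.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class: lossy char truncation versus String.getBytes. */
class StringToBytesSketch {

    /** Lossy conversion: keeps only the low 8 bits of each 16-bit char. */
    static byte[] truncate(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) s.charAt(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String content = "\u4e2d"; // a single CJK character, U+4E2D

        // Truncation discards the high byte 0x4E and keeps only 0x2D.
        byte[] truncated = truncate(content);
        String fromTruncated = new String(truncated, StandardCharsets.ISO_8859_1);
        System.out.println(fromTruncated.equals(content)); // false

        // Encoding with a real charset round-trips the character.
        byte[] encoded = content.getBytes(StandardCharsets.UTF_8);
        String fromEncoded = new String(encoded, StandardCharsets.UTF_8);
        System.out.println(fromEncoded.equals(content)); // true
    }
}
```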

Discussion

  • Doug Bateman

    Doug Bateman - 2003-11-06
    • summary: New Constructor: new HtmlDocument(httpDocument); --> Refactoring: new HtmlDocument(httpDocument);
     
  • Doug Bateman

    Doug Bateman - 2003-11-06

    Logged In: YES
    user_id=711761

    I noticed you just made a change to WebRobot (checking
    for null on doc managers), which makes the patch I
    submitted outdated. Rather than create a new patch file, I
    figured I'd just post the code here so you can consider the
    submission more easily:

    In File WebRobot.java Replaced:
    < // solving encoding problem
    < // HtmlDocument htmlDoc = new HtmlDocument(u, doc.getContent());
    < HtmlDocument htmlDoc = null;
    < HttpHeader contentTypeHeader = doc.getHeader("Content-type");
    < if (contentTypeHeader != null) {
    <   String contentType = contentTypeHeader.getValue();
    <   int index = contentType.toLowerCase().indexOf("charset=");
    <   if (index > 0) {
    <     htmlDoc = new HtmlDocument(u, doc.getContent(), contentType.substring(index+8));
    <   } else {
    <     htmlDoc = new HtmlDocument(u, doc.getContent());
    <   }
    < } else {
    <   htmlDoc = new HtmlDocument(u, doc.getContent());
    < }

    With:
    > htmlDoc = new HtmlDocument(doc);

    In file HtmlDocument.java added the following:

    /**
     * Initializes an HTML document from a String. Converts the string
     * to bytes using the default encoding of the platform.
     *
     * @param url the URL of this document. Needed for link extraction.
     * @param content some HTML text
     */
    public HtmlDocument(URL url, String content) {
      // Note: An alternative might be to specify the encoding so
      // there is no ambiguity. However, this is likely unnecessary
      // since the HTML parser will simply turn around and decode this
      // back into a String when creating the InputStreamReader.
      this(url);
      this.content = content.getBytes();
    }

    /**
     * Creates a new HtmlDocument using the contents of the HttpDoc.
     *
     * @param doc the HttpDoc to read when creating the HtmlDocument.
     */
    public HtmlDocument(HttpDoc doc) {
      this(doc.getURL(), doc.getContent(), extractCharEncoding(doc));
    }

    /**
     * Reads the "Content-Type" header's "charset=" section to determine
     * the character encoding used in the document. If the charset isn't
     * set in the header, null is returned.
     *
     * @param doc The document for which to determine the character encoding.
     * @return the character encoding from the "Content-Type" header, or null
     *         if the encoding isn't specified.
     */
    private static String extractCharEncoding(HttpDoc doc) {
      HttpHeader contentTypeHeader = doc.getHeader("Content-type");
      if (contentTypeHeader != null) {
        String contentType = contentTypeHeader.getValue();
        int index = contentType.toLowerCase().indexOf("charset=");
        if (index > 0) {
          String encoding = contentType.substring(index+8);
          return encoding;
        }
      }
      return null;
    }

     
  • Doug Bateman

    Doug Bateman - 2003-11-06

    Updated Patch File

     
