HTML Parser / Discussion / Help: parsing HTML docs that specify a charset

Arjohn Kampman - 2004-06-14

Hi all,

I'm having difficulty correctly parsing HTML documents that specify a character set. I'm currently using the following code to parse HTML streams (still on version 1.3):

Parser parser = new Parser(new NodeReader(reader, 8192));
parser.setFeedback(new HtmlFeedback());
parser.registerScanners();
HtmlToTextVisitor visitor = new HtmlToTextVisitor();
parser.visitAllNodesWith(visitor);

Clearly, supplying a Reader to the NodeReader enforces a character encoding, which is problematic if the HTML document specifies a different one. Browsing through Parser.java, I saw some code that recreates the Reader when a charset is encountered in the stream, but this only seems to work when a URLConnection is supplied to the parser. Is there any way to use this functionality when no URLConnection object is available? More specifically, I'm looking for a parse method to which I can supply an InputStream and a default charset. Is anything like this available in 1.3 or newer?

Thanks,

Arjohn
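
The behaviour asked about above (start from a default charset, then switch to the one the document declares) can be approximated with the standard library alone. The following is only a rough sketch, not HTML Parser's own code: the names `CharsetSniffer`, `detectCharset`, `openReader` and the 8 KiB `SNIFF_LIMIT` are hypothetical, and the regular expression is a simplification that assumes the declaration appears near the start of the stream in an ASCII-compatible encoding.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTML Parser.
final class CharsetSniffer {
    // Assumption: a charset declaration, if present, sits in the first 8 KiB.
    static final int SNIFF_LIMIT = 8192;

    // Matches both content="text/html; charset=UTF-8" and charset="utf-8".
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset name, or defaultCharset if none is found or supported. */
    static String detectCharset(byte[] head, int length, String defaultCharset) {
        // ISO-8859-1 maps each byte to exactly one char, so ASCII markup stays readable.
        String text = new String(head, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(text);
        if (m.find()) {
            String name = m.group(1);
            try {
                if (Charset.isSupported(name)) {
                    return name;
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: fall through to the default.
            }
        }
        return defaultCharset;
    }

    /** Opens a Reader on the stream using the declared charset, else defaultCharset (must not be null). */
    static Reader openReader(InputStream in, String defaultCharset) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(in, SNIFF_LIMIT);
        buffered.mark(SNIFF_LIMIT);
        byte[] head = new byte[SNIFF_LIMIT];
        int total = 0;
        int n;
        // Read until the sniff buffer is full or the stream ends.
        while (total < SNIFF_LIMIT
                && (n = buffered.read(head, total, SNIFF_LIMIT - total)) > 0) {
            total += n;
        }
        // Rewind so the returned Reader sees the document from its first byte.
        buffered.reset();
        return new InputStreamReader(buffered, detectCharset(head, total, defaultCharset));
    }
}
```

A call such as `CharsetSniffer.openReader(stream, "ISO-8859-1")` would then yield a Reader that could be handed to the NodeReader in the snippet above. Sniffing before any character decoding happens avoids having to recreate the Reader midway through the stream.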

- Derrick Oswald - 2004-06-14
  
  What I think you want is available in versions 1.4 and 1.5 of the parser. There is a Page class that takes a stream and a charset:
  
      /**
       * Construct a page from a stream encoded with the given charset.
       * @param stream The source of bytes.
       * @param charset The encoding used.
       * If null, defaults to the <code>DEFAULT_CHARSET</code>.
       * @exception UnsupportedEncodingException If the given charset is not supported.
       */
      public Page (InputStream stream, String charset)
  
  The Page can be given to the Lexer in its constructor, and the Lexer can be provided to the Parser in its constructor, although you may only need a Lexer, depending on what you want to get out of it.
  
  - Arjohn Kampman - 2004-06-15
    
    Hi Derrick,
    
    I tried version 1.4.1 and your suggestion seems to be working. Thanks a lot.
    
    Cheers,
    
    Arjohn
    