Hi all,
I'm having difficulty correctly parsing HTML documents that specify a character set. Currently, I'm using the following code to parse HTML streams (still using version 1.3):
Parser parser = new Parser(new NodeReader(reader, 8192));
parser.setFeedback(new HtmlFeedback());
parser.registerScanners();
HtmlToTextVisitor visitor = new HtmlToTextVisitor();
parser.visitAllNodesWith(visitor);
Clearly, supplying a Reader to the NodeReader enforces a character encoding, which is a problem if the HTML document specifies a different encoding. Browsing through Parser.java, I saw some code that recreates the Reader in use when a charset is encountered in the stream, but this only seems to work when a URLConnection is supplied to the parser. Is there any way to use this functionality when no URLConnection object is available? More specifically, I'm looking for a parse method to which I can supply an InputStream and a default charset. Is anything like this available in 1.3 or newer?
Thanks,
Arjohn
What I think you want is available in versions 1.4 and 1.5 of the parser. There is a Page class with a constructor that takes a stream and a charset:
/**
* Construct a page from a stream encoded with the given charset.
* @param stream The source of bytes.
* @param charset The encoding used.
* If null, defaults to the <code>DEFAULT_CHARSET</code>.
* @exception UnsupportedEncodingException If the given charset is not supported.
*/
public Page (InputStream stream, String charset)
The Page can be given to the Lexer in its constructor, and the Lexer can be provided to the Parser in its constructor, although you may only need a Lexer, depending on what you want to get out of it.
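For anyone who has to stay on 1.3, the idea of decoding with a default charset and re-decoding once the document names its own can be sketched with the JDK alone. This is only an illustration, not how the parser implements it: the class CharsetSniffer and its methods are hypothetical names, the whole stream is buffered in memory, and the document is assumed to use an ASCII-compatible encoding so that the meta tag is readable on the first pass.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the HTML Parser library.
public class CharsetSniffer {

    // Finds charset=... inside a <meta> tag, e.g. content="text/html; charset=UTF-8".
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Buffers the whole stream so the bytes can be decoded more than once. */
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Returns the charset named in a meta tag, or the default if none is usable. */
    public static Charset detect(byte[] bytes, Charset defaultCharset) {
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives this pass.
        int probeLength = Math.min(bytes.length, 4096);
        String probe = new String(bytes, 0, probeLength, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(probe);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: keep the default.
            }
        }
        return defaultCharset;
    }

    /** Decodes the stream with the declared charset, else with the given default. */
    public static String decode(InputStream in, String defaultCharset) throws IOException {
        byte[] bytes = readAll(in);
        Charset charset = detect(bytes, Charset.forName(defaultCharset));
        return new String(bytes, charset);
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head>"
                + "<body>caf\u00e9</body></html>";
        InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        // The default is ISO-8859-1, but the meta tag should win.
        System.out.println(decode(in, "ISO-8859-1"));
    }
}
```

The resulting String can be wrapped in a StringReader and handed to the NodeReader as in the original code. Buffering everything is the simplest way to allow a second decoding pass, but it will not suit very large documents.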
Hi Derrick,
I tried version 1.4.1 and your suggestion seems to be working. Thanks a lot.
Cheers,
Arjohn