[Htmlparser-user] Dealing with *repeated* EncodingChangeException
From: Subramanya S. <sa...@cs...> - 2006-03-23 19:06:13
Hello everyone,

My name is Subbu (Subramanya Sastry). For one of my projects, I had been using Swing's built-in parser and had managed to set up workarounds for its inadequacies (mostly a result of it being based on HTML 3.2). I had looked at HTMLParser a few months back, but since everything was working fine with the Swing parser, and since I didn't want to ship another library with the application, I hadn't switched over.

For various reasons, including the fact that I am making my application multilingual, I decided to check out HTMLParser last week. However, I quickly ran into problems because of EncodingChangeException, and this was on plain-old "English content" HTML files. I scouted around and read about the "parser.reset()" trick. However, that didn't solve my problem: even after the reset, the same exception was thrown at the same place.

When I looked into the HTML, I noticed that the publishers had *TWO* content-type meta tags:

    <meta http-equiv="Content-Type" content="text/html">

and a little while later

    <meta content="text/html; charset=UTF-8" http-equiv="Content-Type">

The presence of these multiple meta tags renders the reset useless, because the parser trips on the second meta tag each time! (Check out pages on http://www.economictimes.com for this kind of HTML.)

I couldn't think of a workaround for this, so I went back to the Swing parser, which lets me ignore character-set changes and so sidesteps the problem. Since I knew of no easy way of telling HTMLParser to ignore char-set changes, I couldn't use HTMLParser. But after racking my brain for a while, I went through the source code of HTMLParser and finally hit upon the solution/hack while reading the Javadoc for "PrototypicalNodeFactory"!

I saw on the user mailing list that a couple of times people have run into this problem of not being able to parse HTML even after resetting the parser.
So, I am sharing this in the interest of those who might run into this problem in the future.

The solution/hack is as follows: on the third attempt (that is, after the reset has also failed), I simply unregister the meta tag from the PrototypicalNodeFactory, which means neither of the above meta tags gets parsed as a MetaTag. But since the parser has already picked up the UTF-8 encoding, the entire file is parsed as UTF-8. Obviously, this is not a bullet-proof solution, but it helps me get through several HTML files which were otherwise getting rejected.

Code snippet below:

---------------------------------------------------------------------
import org.htmlparser.Parser;
import org.htmlparser.PrototypicalNodeFactory;
import org.htmlparser.tags.MetaTag;
import org.htmlparser.util.EncodingChangeException;
import org.htmlparser.util.ParserException;

// Install a node factory that does not build MetaTag nodes,
// so that char set changes declared in meta tags are ignored!
private static void IgnoreCharSetChanges(Parser p)
{
    PrototypicalNodeFactory factory = new PrototypicalNodeFactory();
    factory.unregisterTag(new MetaTag());
    p.setNodeFactory(factory);
}

// Parse up to three times: as-is, after a reset, and finally with
// meta tags ignored. Returns the encoding the parser ended up with.
private static String ParseNow(Parser p, MyVisitor visitor)
    throws ParserException
{
    try {
        System.out.println("START encoding is " + p.getEncoding());
        p.visitAllNodesWith(visitor);
    }
    catch (EncodingChangeException e) {
        try {
            // Second attempt: the usual reset trick.
            System.out.println("Caught you! CURRENT encoding is " + p.getEncoding());
            visitor.Init();   // reset the visitor's state before re-parsing
            p.reset();
            p.visitAllNodesWith(visitor);
        }
        catch (EncodingChangeException e2) {
            // Third attempt: keep the current encoding and ignore meta tags.
            System.out.println("CURRENT encoding is " + p.getEncoding());
            System.out.println("--- CAUGHT you yet again! IGNORE meta tags now! ---");
            visitor.Init();
            p.reset();
            IgnoreCharSetChanges(p);
            p.visitAllNodesWith(visitor);
        }
    }
    System.out.println("ENCODING IS " + p.getEncoding());
    return p.getEncoding();
}
---------------------------------------------------------------------

If, in future versions of HTMLParser, the MetaTag class starts doing other important things besides setting the text encoding, then a new class could be derived from the existing MetaTag class whose "doSemanticAction()" simply ignores char-set changes for "content-type" meta tags and calls super.doSemanticAction() for the others ...

If there are gotchas in this technique, I would appreciate feedback on that front too!

Thanks,
Best,
Subbu.
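
For illustration, the difference between the two meta tags quoted in this message can be seen with a small standalone sketch. It is plain Java with no HTMLParser dependency; the class MetaCharsetSketch and the helper charsetOf are hypothetical names made up for this example, and the code is not how HTMLParser itself reads the attribute. It only shows that the first tag's content value names no charset while the second names UTF-8.

```java
import java.util.Locale;
import java.util.Optional;

public class MetaCharsetSketch {

    // Hypothetical helper (not part of HTMLParser): returns the charset named
    // in a Content-Type value such as "text/html; charset=UTF-8", if any.
    public static Optional<String> charsetOf(String content) {
        for (String part : content.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Drop optional quotes around the charset name.
                value = value.replace("\"", "").replace("'", "");
                if (!value.isEmpty()) {
                    return Optional.of(value);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // content values of the two meta tags from the message
        System.out.println(charsetOf("text/html"));                 // prints Optional.empty
        System.out.println(charsetOf("text/html; charset=UTF-8"));  // prints Optional[UTF-8]
    }
}
```

A page carrying both tags therefore first says nothing about its charset and only later declares UTF-8, which is the situation the workaround above is meant to get past.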