[Htmlparser-user] Dealing with repeated EncodingChangeException

Hello everyone,

My name is Subbu (Subramanya Sastry).  For one of my projects, I had been
using Swing's built-in HTML parser and had managed to set up workarounds for
its inadequacies (mostly a result of its being based on HTML 3.2).
I had looked at HTMLParser a few months back, but I didn't switch over then:
everything was working fine for me with the Swing parser, and staying with it
meant I didn't have to ship another library with the application.

But, for various reasons, including the fact that I am adding multilingual
support to my application, I decided to check out HTMLParser last week.
However, I quickly ran into problems because of EncodingChangeException, and
this was on plain old English-content HTML files.  I scouted around and read
about the "parser.reset()" trick.  However, that didn't solve my problem:
even after the reset, the same exception was thrown at the same place.  When I
looked into the HTML, I noticed that the publishers had *TWO* content-type
meta tags
   <meta http-equiv="Content-Type" content="text/html">
and a little while later
   <meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
The presence of these multiple meta tags renders the resetting useless
because the parser will trip on the second meta tag each time!
(Check out pages on http://www.economictimes.com for this kind of HTML)
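
A quick way to check whether a page has this problem is to scan it for
Content-Type meta tags before handing it to the parser.  The sketch below is
only a rough, regex-based diagnostic using the plain JDK; it is not part of
HTMLParser, and the class name MetaCharsetScan and its method
declaredCharsets are made up for illustration.  It assumes each meta tag
contains no ">" inside its attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: lists the charset declared by each Content-Type meta tag.
public class MetaCharsetScan {
    // A whole <meta ...> tag (regex approximation, not a real HTML parser).
    private static final Pattern META =
        Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // http-equiv="Content-Type", with or without quotes.
    private static final Pattern HTTP_EQUIV =
        Pattern.compile("http-equiv\\s*=\\s*[\"']?content-type[\"']?",
                        Pattern.CASE_INSENSITIVE);
    // charset=NAME inside the content attribute.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)",
                        Pattern.CASE_INSENSITIVE);

    /** One entry per Content-Type meta tag: its charset, or "" if it names none. */
    public static List<String> declaredCharsets(String html) {
        List<String> found = new ArrayList<>();
        Matcher tag = META.matcher(html);
        while (tag.find()) {
            String text = tag.group();
            if (!HTTP_EQUIV.matcher(text).find()) {
                continue; // some other meta tag (keywords, description, ...)
            }
            Matcher cs = CHARSET.matcher(text);
            found.add(cs.find() ? cs.group(1) : "");
        }
        return found;
    }

    public static void main(String[] args) {
        String html =
            "<meta http-equiv=\"Content-Type\" content=\"text/html\">\n"
          + "<meta content=\"text/html; charset=UTF-8\" http-equiv=\"Content-Type\">";
        List<String> charsets = declaredCharsets(html);
        // More than one entry means the page declares its content type twice.
        System.out.println(charsets.size() + " Content-Type meta tag(s): " + charsets);
    }
}
```

If the list has more than one entry, the page is a candidate for the
"reset doesn't help" behaviour described above.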

I couldn't think of a workaround for this, so I went back to the Swing
parser, which lets me ignore character-set changes and so sidesteps the above
problem.  Since I knew of no easy way of telling HTMLParser to ignore charset
changes, I couldn't use HTMLParser.  But, after racking my brain for a while,
I went through the source code of HTMLParser and finally hit upon the
solution/hack while reading the Javadoc for "PrototypicalNodeFactory"!

I saw on the user mailing list that a couple of times, people have run into
this problem of not being able to parse HTML even after resetting the parser.  
So, I am sharing this in the interest of those who might run into this problem
in the future.

The solution/hack is as follows: on the third attempt, I simply unregister
the meta tag from the PrototypicalNodeFactory, which means neither of the
above meta tags is handled as a MetaTag any more, so neither can trigger a
charset change.  But, since the parser has already picked up the UTF-8
encoding, the entire file will be parsed with UTF-8 encoding.  Obviously, this
is not a bullet-proof solution, but it helps me get through several HTML
files which were otherwise getting rejected.

Code snippet below:
---------------------------------------------------------------------
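   // Imports needed at the top of the enclosing source file:
   import org.htmlparser.Parser;
   import org.htmlparser.PrototypicalNodeFactory;
   import org.htmlparser.tags.MetaTag;
   import org.htmlparser.util.EncodingChangeException;
   import org.htmlparser.util.ParserException;

   // Install a node factory that no longer knows about MetaTag,
   // so that char set changes declared in meta tags are ignored.
   private static void IgnoreCharSetChanges(Parser p)
   {
      PrototypicalNodeFactory factory = new PrototypicalNodeFactory ();
      factory.unregisterTag(new MetaTag());
      p.setNodeFactory (factory);
   }

   // MyVisitor is my own NodeVisitor subclass; Init() clears its state
   // so that a re-parse starts from scratch.
   private static String ParseNow(Parser p, MyVisitor visitor) throws ParserException
   {
      try {
         // First attempt: parse with whatever encoding the parser starts with.
         System.out.println("START encoding is " + p.getEncoding());
         p.visitAllNodesWith(visitor);
      }
      catch (EncodingChangeException e) {
         try {
            // Second attempt: the usual reset-and-reparse trick.
            System.out.println("Caught you! CURRENT encoding is " + p.getEncoding());
            visitor.Init();
            p.reset();
            p.visitAllNodesWith(visitor);
         }
         catch (EncodingChangeException e2) {
            // Third attempt: reset again, but ignore meta tags this time.
            System.out.println("CURRENT encoding is " + p.getEncoding());
            System.out.println("--- CAUGHT you yet again! IGNORE meta tags now! ---");
            visitor.Init();
            p.reset();
            IgnoreCharSetChanges(p);
            p.visitAllNodesWith(visitor);
         }
      }
      System.out.println("ENCODING IS " + p.getEncoding());
      return p.getEncoding();
   }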
---------------------------------------------------------------------

If, in future versions of HTMLParser, the MetaTag class starts doing other
important things besides setting the text encoding, then a new class could be
derived from the existing MetaTag class whose "doSemanticAction()" code
simply ignores charset changes for "content-type" meta tags and calls
super.doSemanticAction() for others ...

If there are gotchas in this technique, I would appreciate feedback on
that front too!

Thanks,

Best,
Subbu.
