Re: [Htmlparser-user] Non-charset=utf-8 Characters in Parsed Text?

Thank you Derrick,

That worked perfectly.

On Dec 12, 2007 10:52 PM, Derrick Oswald <der...@ro...> wrote:

> You can create a class extending org.htmlparser.tags.MetaTag and
> overriding doSemanticAction() to do nothing.
> Register this with an org.htmlparser.PrototypicalNodeFactory that you
> assign to your parser, as described here:
> http://htmlparser.sourceforge.net/faq.html#composite
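
For readers of the archive, the suggestion above might look roughly like the following. This is only a sketch, not code posted in the thread: the class names EncodingIgnoringMetaTag and IgnoreMetaCharsetExample are made up for illustration, and it assumes the htmlparser jar is on the classpath and the page is reachable. It relies on the stock MetaTag switching the page encoding from its doSemanticAction() method, so an empty override leaves the encoding set on the parser in place.

```java
import org.htmlparser.Parser;
import org.htmlparser.PrototypicalNodeFactory;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.MetaTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Hypothetical class name: a META tag that never changes the page encoding.
class EncodingIgnoringMetaTag extends MetaTag {
    @Override
    public void doSemanticAction() throws ParserException {
        // Intentionally empty: skip the charset handling done by MetaTag.
    }
}

class IgnoreMetaCharsetExample {
    public static void main(String[] args) throws ParserException {
        // URL taken from the original question further down the thread.
        Parser parser = new Parser(
            "http://www.examiner.com/a-1097821~County_plan_could_double_neighborhood_enforcement.html");

        // Replace the stock META prototype with the do-nothing subclass.
        PrototypicalNodeFactory factory = new PrototypicalNodeFactory();
        factory.registerTag(new EncodingIgnoringMetaTag());
        parser.setNodeFactory(factory);

        // Force the encoding the page actually appears to use.
        parser.setEncoding("windows-1252");

        NodeList divs = parser.parse(new HasAttributeFilter("id", "article_main"));
        System.out.println("nodes=" + divs.size());
        System.out.println("encoding=" + parser.getEncoding());
    }
}
```

The factory has to be assigned before parse() is called, since the prototypes are consulted as tags are encountered.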
>
>
> ----- Original Message ----
> From: Jeffery Brewer <jef...@gm...>
> To: htmlparser user list <htm...@li...>
> Sent: Wednesday, December 12, 2007 9:31:16 PM
> Subject: Re: [Htmlparser-user] Non-charset=utf-8 Characters in Parsed Text?
>
> Thanks Karsten,
>
> I have now read the FAQ and have spent some time trying to solve my
> problem. I'm learning a lot more about the parser but haven't solved my
> problem yet.
>
> The pages I'm trying to read have a meta tag setting the encoding to
> UTF-8...
>     <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
>
> but they are obviously using a different character set (they shouldn't
> be!).
>
> If I copy the page and modify the tag for windows-1252 encoding...
>
>     <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
>
> and parse the page, I can recover the characters and convert them.
>
> Likewise, if I omit that meta tag and set the parser for windows-1252
> encoding I can also recover the characters and convert them.
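
As an illustration of the "recover and convert" step mentioned above, here is a minimal sketch that uses only the JDK. It is not the poster's code: the class and method names are hypothetical, and the set of characters mapped to ASCII is an assumption, since the thread does not say which conversion was applied.

```java
import java.nio.charset.Charset;

class Cp1252Normalizer {
    // Decode raw bytes as windows-1252 instead of the UTF-8 the page claims.
    static String decodeCp1252(byte[] bytes) {
        return new String(bytes, Charset.forName("windows-1252"));
    }

    // Map common typographic punctuation to plain ASCII equivalents.
    static String toAscii(String text) {
        return text
            .replace('\u2018', '\'')  // left single quote
            .replace('\u2019', '\'')  // right single quote / apostrophe
            .replace('\u201C', '"')   // left double quote
            .replace('\u201D', '"')   // right double quote
            .replace('\u2013', '-')   // en dash
            .replace('\u2014', '-');  // em dash
    }

    public static void main(String[] args) {
        // Bytes 0x93, 0x92 and 0x94 are curly quotes in windows-1252.
        byte[] raw = {(byte) 0x93, 'i', 't', (byte) 0x92, 's', (byte) 0x94};
        System.out.println(toAscii(decodeCp1252(raw)));  // prints "it's" (with the double quotes)
    }
}
```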
>
> But if I set the parser for windows-1252 encoding and then have it parse
> the page from the website, the parser reads the utf-8 encoding tag and
> automatically parses the page using utf-8 encoding.
>
> In other words, if I do this...
>
>      Parser parser = new Parser("http://www.examiner.com/a-1097821~County_plan_could_double_neighborhood_enforcement.html");
>      parser.setEncoding("windows-1252");
>      System.out.println("encoding=" + parser.getEncoding());
>      NodeList divNodeList = parser.parse(new HasAttributeFilter("id", "article_main"));
>      System.out.println("encoding=" + parser.getEncoding());
>
>
> it prints out
> encoding=windows-1252
> encoding=UTF-8
>
> I wonder if it's possible to have the parser ignore the meta tag, or if
> it's somehow possible to alter or delete the meta tag before the site is
> parsed or if there is a better approach?
>
>
>
> On Dec 12, 2007 12:28 AM, Karsten Ohme <wid...@t-...> wrote:
>
> > Jeffery Brewer wrote:
> > > I'm running into an issue where I'm getting question mark characters in
> > > place of quotes, apostrophes, hyphens, etc.
> >
> > Have you read the FAQ?
> >
> > http://htmlparser.sourceforge.net/faq.html
> >
> > The "Why am I getting an EncodingChangeException?" entry should help with
> > handling character encoding issues. If the web page does not contain an
> > encoding hint, let the parser fetch the web site for you; the HTTP header
> > may contain the correct encoding, in which case it is used. If the web
> > site is offline, set the correct encoding in the parser. Does this help?
> >
> > Regards,
> > Karsten
> >
> > >
> > > I know this has to do with the website using characters outside those
> > > defined by the specification. Is there a way to correct this in the
> > > > htmlparser? I started trying to do a simple character replacement on the
> > > > parsed text, but whenever I do an "(int) string.charAt(n)" for any special
> > > > character I'm getting a 65533, and if I do a
> > > > "Character.getNumericValue(string.charAt(n))" I'm getting a -1, so I'm
> > > > assuming I'm far too far "downstream" to fix the problem.
> > >
> > > > Also I've just been using the Parser.parse method to return nodelists and
> > > > have been working my way through the documents that way rather than try any
> > > > of the other htmlparser features (which may already account for this??).
> > >
> > > > Thanks in advance for any help. I'm really enjoying working with the parser
> > > > and thanks to everyone who built this thing.
> > >
> > > Jeff
> > >
> > >
> > >
> > >
> > ------------------------------------------------------------------------
> > >
> > >
> > -------------------------------------------------------------------------
> > > SF.Net email is sponsored by:
> > > Check out the new SourceForge.net Marketplace.
> > > It's the best place to buy or sell services for
> > > just about anything Open Source.
> > > http://sourceforge.net/services/buy/index.php
> > >
> > >
> > >
> > ------------------------------------------------------------------------
> > >
> > > _______________________________________________
> > > Htmlparser-user mailing list
> > > Htm...@li...
> > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user
> >
> >
> >
> > -------------------------------------------------------------------------
> > SF.Net email is sponsored by:
> > Check out the new SourceForge.net Marketplace.
> > It's the best place to buy or sell services for
> > just about anything Open Source.
> > http://sourceforge.net/services/buy/index.php
> > _______________________________________________
> > Htmlparser-user mailing list
> > Htm...@li...
> > https://lists.sourceforge.net/lists/listinfo/htmlparser-user
> >
>
>
>
> -------------------------------------------------------------------------
> SF.Net email is sponsored by:
> Check out the new SourceForge.net Marketplace.
> It's the best place to buy or sell services
> for just about anything Open Source.
>
> http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace
> _______________________________________________
> Htmlparser-user mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
>
>