Re: [Htmlparser-user] Non-charset=utf-8 Characters in Parsed Text?
From: Jeffery B. <jef...@gm...> - 2007-12-15 13:58:08
Thank you Derrick,

That worked perfectly.

On Dec 12, 2007 10:52 PM, Derrick Oswald <der...@ro...> wrote:

> You can create a class extending org.htmlparser.tags.MetaTag and
> overriding doSemanticAction() to do nothing.
> Register this with an org.htmlparser.PrototypicalNodeFactory that you
> assign to your parser, as described here:
> http://htmlparser.sourceforge.net/faq.html#composite
>
> ----- Original Message ----
> From: Jeffery Brewer <jef...@gm...>
> To: htmlparser user list <htm...@li...>
> Sent: Wednesday, December 12, 2007 9:31:16 PM
> Subject: Re: [Htmlparser-user] Non-charset=utf-8 Characters in Parsed Text?
>
> Thanks Karsten,
>
> I have now read the FAQ and have spent some time trying to solve my
> problem. I'm learning a lot more about the parser but haven't solved my
> problem yet.
>
> The pages I'm trying to read have a meta tag setting the encoding to
> UTF-8...
>
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
>
> but they are obviously using a different character set (they shouldn't
> be!).
>
> If I copy the page and modify the tag for windows-1252 encoding...
>
> <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
>
> and parse the page, I can recover the characters and convert them.
>
> Likewise, if I omit that meta tag and set the parser for windows-1252
> encoding I can also recover the characters and convert them.
>
> But if I set the parser for windows-1252 encoding and then have it parse
> the page from the website, the parser reads the utf-8 encoding tag and
> automatically parses the page using utf-8 encoding.
>
> In other words, if I do this...
>
> Parser parser = new Parser("http://www.examiner.com/a-1097821~County_plan_could_double_neighborhood_enforcement.html");
> parser.setEncoding("windows-1252");
> System.out.println("encoding=" + parser.getEncoding());
> NodeList divNodeList = parser.parse(new HasAttributeFilter("id", "article_main"));
> System.out.println("encoding=" + parser.getEncoding());
>
> it prints out
>
> encoding=windows-1252
> encoding=UTF-8
>
> I wonder if it's possible to have the parser ignore the meta tag, or if
> it's somehow possible to alter or delete the meta tag before the site is
> parsed, or if there is a better approach?
>
> On Dec 12, 2007 12:28 AM, Karsten Ohme <wid...@t-...> wrote:
>
> > Jeffery Brewer wrote:
> > > I'm running into an issue where I'm getting question mark characters
> > > in place of quotes, apostrophes, hyphens, etc.
> >
> > Have you read the FAQ?
> >
> > http://htmlparser.sourceforge.net/faq.html
> >
> > The "Why am I getting an EncodingChangeException?" entry should help
> > with handling character encoding issues. If the web page does not
> > contain an encoding hint, let the parser fetch the web site for you;
> > maybe the HTTP header contains the correct encoding, in which case it
> > is used. If the web site is offline, set the correct encoding in the
> > parser. Does this help?
> >
> > Regards,
> > Karsten
> >
> > > I know this has to do with the website using characters outside those
> > > defined by the specification. Is there a way to correct this in the
> > > htmlparser? I started trying to do a simple character replacement on
> > > the parsed text, but whenever I do an "(int) string.charAt(n)" for any
> > > special character I'm getting a 65533, and if I do a
> > > "Character.getNumericValue(string.charAt(n))" I'm getting a -1, so I'm
> > > assuming I'm far too far "downstream" to fix the problem.
> > >
> > > Also I've just been using the Parser.parse method to return nodelists
> > > and have been working my way through the documents that way rather
> > > than try any of the other htmlparser features (which may already
> > > account for this??).
> > >
> > > Thanks in advance for any help. I'm really enjoying working with the
> > > parser and thanks to everyone who built this thing.
> > >
> > > Jeff
> > >
> > > -------------------------------------------------------------------------
> > > SF.Net email is sponsored by:
> > > Check out the new SourceForge.net Marketplace.
> > > It's the best place to buy or sell services for
> > > just about anything Open Source.
> > > http://sourceforge.net/services/buy/index.php
> > > _______________________________________________
> > > Htmlparser-user mailing list
> > > Htm...@li...
> > > https://lists.sourceforge.net/lists/listinfo/htmlparser-user
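For anyone finding this thread later: the 65533 and -1 values reported in the original question are what you would expect when windows-1252 bytes are decoded as UTF-8. The snippet below is only a rough sketch using the plain JDK, not htmlparser itself. The class name, helper names, and sample bytes are made up for illustration. It decodes the same four bytes both ways to show why the replacement has to happen before decoding rather than on the parsed text.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; the name and the sample bytes are illustrative only.
class EncodingMismatchDemo {

    // The text "It’s" encoded as windows-1252: 0x92 is the curly apostrophe there.
    static byte[] sampleBytes() {
        return new byte[] { 0x49, 0x74, (byte) 0x92, 0x73 };
    }

    // Decode raw bytes with the named charset. The String constructor replaces
    // malformed input with the charset's replacement string (U+FFFD for UTF-8).
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] raw = sampleBytes();

        // Wrong charset: a lone 0x92 is not valid UTF-8, so it becomes U+FFFD.
        String asUtf8 = decode(raw, "UTF-8");
        char bad = asUtf8.charAt(2);
        System.out.println((int) bad);                       // 65533
        System.out.println(Character.getNumericValue(bad));  // -1

        // Right charset: the same byte maps to U+2019, the right single quote.
        String asCp1252 = decode(raw, "windows-1252");
        System.out.println((int) asCp1252.charAt(2));        // 8217
    }
}
```

Once the byte has been turned into U+FFFD the original value is gone, which is why a character replacement on the parsed text cannot recover it and why stopping the meta tag from switching the encoding is the fix that worked here.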