Thread: Re: [Htmlparser-user] Non-charset=utf-8 Characters in Parsed Text?
From: Derrick O. <der...@ro...> - 2007-12-13 03:52:37
You can create a class extending org.htmlparser.tags.MetaTag and overriding doSemanticAction() to do nothing. Register this with an org.htmlparser.PrototypicalNodeFactory that you assign to your parser, as described here: http://htmlparser.sourceforge.net/faq.html#composite

----- Original Message ----
From: Jeffery Brewer <jef...@gm...>
To: htmlparser user list <htm...@li...>
Sent: Wednesday, December 12, 2007 9:31:16 PM
Subject: Re: [Htmlparser-user] Non-charset=utf-8 Characters in Parsed Text?

Thanks Karsten,

I have now read the FAQ and have spent some time trying to solve my problem. I'm learning a lot more about the parser but haven't solved my problem yet.

The pages I'm trying to read have a meta tag setting the encoding to UTF-8...

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

but they are obviously using a different character set (they shouldn't be!).

If I copy the page and modify the tag for windows-1252 encoding...

    <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

and parse the page, I can recover the characters and convert them.

Likewise, if I omit that meta tag and set the parser for windows-1252 encoding, I can also recover the characters and convert them.

But if I set the parser for windows-1252 encoding and then have it parse the page from the website, the parser reads the utf-8 encoding tag and automatically parses the page using utf-8 encoding.

In other words, if I do this...

    Parser parser = new Parser("http://www.examiner.com/a-1097821~County_plan_could_double_neighborhood_enforcement.html");
    parser.setEncoding("windows-1252");
    System.out.println("encoding=" + parser.getEncoding());
    NodeList divNodeList = parser.parse(new HasAttributeFilter("id", "article_main"));
    System.out.println("encoding=" + parser.getEncoding());

it prints out

    encoding=windows-1252
    encoding=UTF-8

I wonder if it's possible to have the parser ignore the meta tag, or if it's somehow possible to alter or delete the meta tag before the site is parsed, or if there is a better approach?

On Dec 12, 2007 12:28 AM, Karsten Ohme <wid...@t-...> wrote:

Jeffery Brewer wrote:
> I'm running into an issue where I'm getting question mark characters in
> place of quotes, apostrophes, hyphens, etc.

Have you read the FAQ?

http://htmlparser.sourceforge.net/faq.html

The "Why am I getting an EncodingChangeException?" entry should help with handling character encoding issues. If the web page does not contain an encoding hint, let the parser fetch the web site for you; the HTTP header may contain the correct encoding, in which case it is used. If the web site is offline, set the correct encoding in the parser. Does this help?

Regards,
Karsten

> I know this has to do with the website using characters outside those
> defined by the specification. Is there a way to correct this in the
> htmlparser? I started trying to do a simple character replacement on the
> parsed text, but whenever I do an "(int) string.charAt(n)" for any special
> character I'm getting a 65533, and if I do a
> "Character.getNumericValue(string.charAt(n))" I'm getting a -1, so I'm
> assuming I'm far too far "downstream" to fix the problem.
>
> Also, I've just been using the Parser.parse method to return nodelists and
> have been working my way through the documents that way rather than trying
> any of the other htmlparser features (which may already account for this??).
>
> Thanks in advance for any help. I'm really enjoying working with the parser
> and thanks to everyone who built this thing.
>
> Jeff
>
> -------------------------------------------------------------------------
> SF.Net email is sponsored by:
> Check out the new SourceForge.net Marketplace.
> It's the best place to buy or sell services for
> just about anything Open Source.
> http://sourceforge.net/services/buy/index.php
> _______________________________________________
> Htmlparser-user mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
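
The 65533 and -1 values reported in the quoted message are consistent with windows-1252 bytes being decoded as UTF-8: 65533 is U+FFFD, the replacement character a UTF-8 decoder emits for malformed input. The following is a minimal sketch of that effect using only the Java standard library, not the HTML Parser itself. The class name, the helper decode, and the sample bytes are hypothetical, chosen for illustration.

```java
import java.nio.charset.Charset;

// Hypothetical demo class: shows what happens when windows-1252 bytes
// are decoded with the wrong charset (UTF-8) versus the right one.
final class MislabeledCharsetDemo {

    // Decode raw bytes with the named charset. Malformed input is
    // replaced by the decoder's replacement character, not rejected.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // In windows-1252, 0x93 and 0x94 are the curly double quotes.
        // As UTF-8 they are stray continuation bytes, i.e. malformed.
        byte[] raw = { (byte) 0x93, 'H', 'i', (byte) 0x94 };

        String asUtf8 = decode(raw, "UTF-8");
        String as1252 = decode(raw, "windows-1252");

        System.out.println((int) asUtf8.charAt(0));                      // 65533 (U+FFFD)
        System.out.println(Character.getNumericValue(asUtf8.charAt(0))); // -1
        System.out.println((int) as1252.charAt(0));                      // 8220 (U+201C)
    }
}
```

Once the text has been decoded to U+FFFD the original byte is lost, which is why a character replacement on the parsed text cannot recover it; the charset has to be corrected before decoding.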
From: Jeffery B. <jef...@gm...> - 2007-12-15 13:58:08
Thank you Derrick,

That worked perfectly.

On Dec 12, 2007 10:52 PM, Derrick Oswald <der...@ro...> wrote:
> You can create a class extending org.htmlparser.tags.MetaTag and
> overriding doSemanticAction () to do nothing.
> Register this with a org.htmlparser.PrototypicalNodeFactory you assign to
> your parser as described here <http://htmlparser.sourceforge.net/faq.html#composite>.
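
The approach Derrick describes might look roughly like the sketch below. It is not code posted in the thread. It assumes the HTML Parser jar is on the classpath and that PrototypicalNodeFactory.registerTag and Parser.setNodeFactory behave as in the 1.6 API. The class names IgnoreMetaCharsetSketch and InertMetaTag are hypothetical; the URL, the filter, and the windows-1252 setting are taken from Jeffery's snippet.

```java
import org.htmlparser.Parser;
import org.htmlparser.PrototypicalNodeFactory;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.MetaTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Hypothetical sketch of the suggestion in this thread; requires the
// third-party HTML Parser library (org.htmlparser) on the classpath.
final class IgnoreMetaCharsetSketch {

    // A MetaTag whose semantic action does nothing, so a charset declared
    // in <meta http-equiv="Content-Type" ...> no longer changes the encoding.
    static class InertMetaTag extends MetaTag {
        @Override
        public void doSemanticAction() throws ParserException {
            // Intentionally empty: ignore the page's declared charset.
        }
    }

    public static void main(String[] args) throws ParserException {
        Parser parser = new Parser(
            "http://www.examiner.com/a-1097821~County_plan_could_double_neighborhood_enforcement.html");

        // Register the inert prototype so it replaces the default META handling.
        PrototypicalNodeFactory factory = new PrototypicalNodeFactory();
        factory.registerTag(new InertMetaTag());
        parser.setNodeFactory(factory);

        // Force the encoding the page actually appears to use.
        parser.setEncoding("windows-1252");

        NodeList divNodeList = parser.parse(new HasAttributeFilter("id", "article_main"));
        System.out.println("encoding=" + parser.getEncoding());
        System.out.println("matched nodes=" + divNodeList.size());
    }
}
```

The prototype is registered before parsing begins because the factory only affects tags created after it is assigned to the parser.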