htmlparser-user Mailing List for HTML Parser (Page 22)
Brought to you by:
derrickoswald
From: Derrick O. <der...@ro...> - 2008-01-07 02:49:56
|
You could make your own as described here: http://htmlparser.sourceforge.net/faq.html#composite

----- Original Message ----
From: HAMMER_SHI <dmr...@gm...>
To: htm...@li...
Sent: Sunday, January 6, 2008 8:54:35 PM
Subject: Re: [Htmlparser-user] Htmlparser-user Digest, Vol 18, Issue 3

Hi all, I want to extract the Font tags from an HTML page, but I cannot find any Font or FontTag class in the org.htmlparser.tags package. How can I do this? Thanks |
From: HAMMER_SHI <dmr...@gm...> - 2008-01-07 01:54:42
|
Hi all, I want to extract the Font tags from an HTML page, but I cannot find any Font or FontTag class in the org.htmlparser.tags package. How can I do this? Thanks |
From: Derrick O. <der...@ro...> - 2008-01-06 17:40:21
|
The Page class has a constructor taking an InputStream and an encoding. You can make an InputStream from a byte array, for example. You need to have stored the encoding somewhere to reconstitute the bytes correctly. The Parser constructor taking a Lexer constructed from a Page would be what you want.

----- Original Message ----
From: cash cash <ca...@ya...>
To: htmlparser user list <htm...@li...>
Sent: Sunday, January 6, 2008 2:19:23 AM
Subject: [Htmlparser-user] How if source is in byte form

Dear HTMLParser community, We adapted a web crawler which stores crawled web pages in byte form. Can HTMLParser take bytes as input and do filtering on HTML tags? Thank you. |
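The reply above comes down to two things: wrap the stored bytes in an InputStream, and keep the charset name alongside them. A minimal sketch of that round trip follows, using only the Java standard library. The class name StoredPage and its fields are hypothetical, and the sketch stops at a Reader instead of calling into HTMLParser; per the reply, the same stream and encoding would be handed to the Page constructor, and the Page to a Lexer and then a Parser.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical holder for what a crawler would need to keep per page:
// the raw bytes plus the name of the encoding they were written in.
final class StoredPage {
    final byte[] bytes;
    final String charsetName;

    StoredPage(byte[] bytes, String charsetName) {
        this.bytes = bytes;
        this.charsetName = charsetName;
    }

    // An InputStream over the stored bytes. This is the kind of stream the
    // reply suggests passing to Page together with the encoding.
    InputStream openStream() {
        return new ByteArrayInputStream(bytes);
    }

    // Decode the stored bytes back to text using the stored encoding.
    String decode() {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(openStream(), Charset.forName(charsetName))) {
            char[] buffer = new char[1024];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>caf\u00e9</p></body></html>";
        StoredPage page = new StoredPage(html.getBytes(StandardCharsets.UTF_8), "UTF-8");
        // True when the stored encoding matches the one used to produce the bytes.
        System.out.println(page.decode().equals(html));
    }
}
```

If the stored charset name does not match the bytes, the decoded text will differ from the original, which is why the encoding has to be saved at crawl time.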
From: cash c. <ca...@ya...> - 2008-01-06 07:19:30
|
Dear HTMLParser community, We adapted a web crawler which stores crawled web pages in byte form. Can HTMLParser take bytes as input and do filtering on HTML tags? Thank you. |
From: Derrick O. <der...@ro...> - 2007-12-30 19:08:03
|
No, there is no support for evaluating Javascript.

----- Original Message ----
From: Adi's Gmail <adi...@gm...>
To: htm...@li...
Sent: Sunday, December 30, 2007 12:16:56 PM
Subject: [Htmlparser-user] Javascript

Dear HTMLParser community,

We know that Javascript can be used to dynamically load and modify a web page. Suppose I have such code in an HTML file:

    this is a <script>document.write("text");</script>

Could the full text "this is a text" be extracted instead of "this is a document.write("text");" or "this is a"? This means that Javascript is executed first and then the text is extracted from the page. Does HTMLParser support this operation?

I have searched the HTMLParser documentation but found no mention of this. I would really appreciate your reply. Thank you. |
From: Adi's G. <adi...@gm...> - 2007-12-30 17:17:06
|
Dear HTMLParser community,

We know that Javascript can be used to dynamically load and modify a web page. Suppose I have such code in a HTML file:

    this is a <script>document.write("text");</script>

Could the full text "this is a text" be extracted instead of "this is a document.write("text");" or "this is a" ?
This means that Javascript is executed first and then the text is extracted from the page. Does HTMLParser support this operation?

I have tried searching HTMLParser documentation but none mention about this. I would really appreciate your reply. Thank you. |
From: Jeffery B. <jef...@gm...> - 2007-12-15 13:58:08
|
Thank you Derrick, That worked perfectly.

On Dec 12, 2007 10:52 PM, Derrick Oswald <der...@ro...> wrote:
> You can create a class extending org.htmlparser.tags.MetaTag and
> overriding doSemanticAction () to do nothing.
> Register this with a org.htmlparser.PrototypicalNodeFactory you assign to
> your parser as described here <http://htmlparser.sourceforge.net/faq.html#composite>. |
From: Derrick O. <der...@ro...> - 2007-12-13 03:52:37
|
You can create a class extending org.htmlparser.tags.MetaTag and overriding doSemanticAction () to do nothing. Register this with an org.htmlparser.PrototypicalNodeFactory you assign to your parser as described here: http://htmlparser.sourceforge.net/faq.html#composite

----- Original Message ----
From: Jeffery Brewer <jef...@gm...>
To: htmlparser user list <htm...@li...>
Sent: Wednesday, December 12, 2007 9:31:16 PM
Subject: Re: [Htmlparser-user] Non-charset=utf-8 Characters in Parsed Text? |
From: Jeffery B. <jef...@gm...> - 2007-12-13 02:59:16
|
Thanks Karsten,

I have now read the FAQ and have spent some time trying to solve my problem. I'm learning a lot more about the parser but haven't solved my problem yet.

The pages I'm trying to read have a meta tag setting the encoding to UTF-8...

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

but they are obviously using a different character set (they shouldn't be!).

If I copy the page and modify the tag for windows-1252 encoding...

    <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

and parse the page, I can recover the characters and convert them. Likewise, if I omit that meta tag and set the parser for windows-1252 encoding, I can also recover the characters and convert them.

But if I set the parser for windows-1252 encoding and then have it parse the page from the website, the parser reads the utf-8 encoding tag and automatically parses the page using utf-8 encoding. In other words, if I do this...

    Parser parser = new Parser("http://www.examiner.com/a-1097821~County_plan_could_double_neighborhood_enforcement.html");
    parser.setEncoding("windows-1252");
    System.out.println("encoding=" + parser.getEncoding());
    NodeList divNodeList = parser.parse(new HasAttributeFilter("id", "article_main"));
    System.out.println("encoding=" + parser.getEncoding());

it prints out

    encoding=windows-1252
    encoding=UTF-8

I wonder if it's possible to have the parser ignore the meta tag, or if it's somehow possible to alter or delete the meta tag before the site is parsed, or if there is a better approach?

On Dec 12, 2007 12:28 AM, Karsten Ohme <wid...@t-...> wrote:
> Have you read the FAQ?
> http://htmlparser.sourceforge.net/faq.html |
From: Karsten O. <wid...@t-...> - 2007-12-12 05:29:40
|
Jeffery Brewer wrote:
> I'm running into an issue where I'm getting question mark characters in
> place of quotes, apostrophes, hyphens, etc.

Have you read the FAQ? http://htmlparser.sourceforge.net/faq.html

The "Why am I getting an EncodingChangeException?" entry should help with character encoding issues. If the web page does not contain an encoding hint, let the parser fetch the web site for you; the HTTP header may contain the correct encoding, in which case it is used. If the web site is offline, set the correct encoding in the parser. Does this help?

Regards,
Karsten |
From: Jeffery B. <jef...@gm...> - 2007-12-11 23:46:38
|
I'm running into an issue where I'm getting question mark characters in place of quotes, apostrophes, hyphens, etc. I know this has to do with the website using characters outside those defined by the specification. Is there a way to correct this in the htmlparser? I started trying to do a simple character replacement on the parsed text, but whenever I do an "(int) string.charAt(n)" for any special character I'm getting a 65533, and if I do a "Character.getNumericValue(string.charAt(n))" I'm getting a -1, so I'm assuming I'm far too far "downstream" to fix the problem.

Also, I've just been using the Parser.parse method to return nodelists and have been working my way through the documents that way rather than trying any of the other htmlparser features (which may already account for this??).

Thanks in advance for any help. I'm really enjoying working with the parser and thanks to everyone who built this thing.

Jeff |
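The 65533 reported above is U+FFFD, the Unicode replacement character, which Java's decoders substitute for bytes that are invalid in the charset being used. The sketch below shows how that can arise with only the standard library, assuming the page bytes are actually windows-1252 but are decoded as UTF-8. The class and method names are hypothetical and no HTMLParser calls are involved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of HTMLParser.
final class ReplacementCharDemo {
    // "I'm" written with a windows-1252 curly apostrophe (byte 0x92).
    static final byte[] SAMPLE = { 'I', (byte) 0x92, 'm' };

    // Decoding with the wrong charset: 0x92 on its own is not valid UTF-8,
    // so the decoder substitutes U+FFFD for it.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decoding with the charset the bytes were actually written in.
    static String decodeAsWindows1252(byte[] bytes) {
        return new String(bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        String wrong = decodeAsUtf8(SAMPLE);
        String right = decodeAsWindows1252(SAMPLE);
        System.out.println((int) wrong.charAt(1));                      // 65533
        System.out.println(Character.getNumericValue(wrong.charAt(1))); // -1
        System.out.println(Integer.toHexString(right.charAt(1)));       // 2019
    }
}
```

Once a byte has been replaced by U+FFFD, the original value is gone from the string, so a character replacement on the parsed text cannot recover it. The fix has to happen where the bytes are decoded.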
From: Subramanya S. <sa...@cs...> - 2007-12-11 23:32:34
|
Thank you very much! That fixed it! Is there a list of such flags and what they do? This is the second such flag I had to set in recent times to handle the wild HTML that exists out there. I might as well figure out what they do and set as many as needed right now since I am parsing all kinds of HTML found on the web. -S. > I believe you want to set the static member for strict remark parsing > to false: > org.htmlparser.lexer.Lexer.STRICT_REMARKS = false; > |
From: Derrick O. <der...@ro...> - 2007-12-11 23:21:30
|
I believe you want to set the static member for strict remark parsing to false:

    org.htmlparser.lexer.Lexer.STRICT_REMARKS = false;

----- Original Message ----
From: Subramanya Sastry <sa...@cs...>
To: htmlparser user list <htm...@li...>
Sent: Tuesday, December 11, 2007 5:02:08 PM
Subject: [Htmlparser-user] scanning / parsing bug?

For this url, http://www.washingtonpost.com/wp-dyn/content/article/2007/12/10/AR2007121001600.html (and maybe other washington post urls), I wonder if HTML Parser is running into a bug. The HTML source for this page has the following block of HTML in the middle ..

    <!---------------- End New Comments Box ------------------>
    <div class="sidebarhack"><b></b></div>
    ....
    ....
    </div>
    <!-- sphereit end -->
    <br clear="all">

The parser is ignoring all content from the start of the line 'End New Comments Box' till 'sphereit end' ... I wonder if this is because of the lack of a space before the '-->' closing comment string in the first line ... I tested the code by adding a space manually at that point, and sure enough, the block of HTML in the middle is correctly recognized.

Is there a workaround for this? I am also willing to download the source code and incorporate a fix, if necessary.

Thanks, Subbu. |
From: Subramanya S. <sa...@cs...> - 2007-12-11 22:02:42
|
For this url, http://www.washingtonpost.com/wp-dyn/content/article/2007/12/10/AR2007121001600.html (and maybe other washington post urls), I wonder if HTML Parser is running into a bug. The HTML source for this page has the following block of HTML in the middle .. <!---------------- End New Comments Box ------------------> <div class="sidebarhack"><b></b></div> .... .... </div> <!-- sphereit end --> <br clear="all"> The parser is ignoring all content from the start of the line 'End New Comments Box' till 'sphereit end' ... I wonder if this is because of the lack of a space before the '-->' closing comment string in the first line ... I tested the code by adding a space manually at that point, and sure enough, the block of HTML in the middle is correctly recognized. Is there a workaround for this? I am also willing to download the source code and incorporate a fix, if necessary. Thanks, Subbu. |
From: Derrick O. <der...@ro...> - 2007-12-07 13:15:53
|
The accept() method is used by the visitor pattern, not the filter paradigm. It's not clear which string you are trying to 'not match'. The <p> tag has no string. Maybe you mean the string between <p> tags, or if you've made the <p> tag composite, then maybe its children. You should probably just add more filtering clauses, e.g.

    parser.extractAllNodesThatMatch(new AndFilter (new TagNameFilter("p"), ...));

I would suggest you try the FilterBuilder application to build up your filter.

----- Original Message ----
From: "at...@gm..." <at...@gm...>
To: htmlparser user list <htm...@li...>
Sent: Thursday, December 6, 2007 2:09:07 PM
Subject: [Htmlparser-user] TagNameFilter

Hi, I need some help with the TagNameFilter. I have a function to get all the p tags out of an HTML document.

    NodeList nl = parser.extractAllNodesThatMatch(new TagNameFilter("p"));

But now I want to filter from the NodeList all entries that do not match a special string. I guess the key would be the accept() function, but I'm unsure how to implement it (the string compare etc. is clear, but not the usage of accept() + Tag.class). Furthermore, I have problems with duplicate entries because of nested p tags.

Thanks, Alex |
From: Tom H. <thj...@ri...> - 2007-12-06 19:25:20
|
Hi Derrick,

That doesn't sound too bad at all to me. I created a single utility function that handles that for any tag I need to inject:

    void setEndTag(Tag tag) {
        TagNode endTag = new TagNode();
        endTag.setTagName("/" + tag.getTagName());
        tag.setEndTag(endTag);
    }

That works like a charm!

Thanks, Tom

Derrick Oswald wrote:
> You will need to add your own end tag to the script tag you are
> injecting. I believe it's something like this:
> TagNode end = new TagNode ();
> end.setTagName ("/SCRIPT");
> script.setEndTag (end)
>
> I guess this could be made much easier. |
From: <at...@gm...> - 2007-12-06 19:10:43
|
Hi,

I need some help with the TagNameFilter. I have a function that gets all the p tags out of an HTML document:

    NodeList nl = parser.extractAllNodesThatMatch(new TagNameFilter("p"));

Now I want to remove from the NodeList all entries that do not match a special string. I guess the key would be the accept() function, but I'm unsure how to implement it (the string comparison itself is clear, but not the usage of accept() together with Tag.class).

Furthermore, I have problems with duplicate entries because of nested p tags.

Thanks,
Alex |
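The question above comes down to two things: combining a tag-name test with a text test in one accept() method, and not reporting a nested match a second time. The sketch below shows that pattern with small stand-in types (FilterSketch, Node, Filter are hypothetical names, not HTML Parser classes). In HTML Parser itself the corresponding interface should be NodeFilter with a single accept(Node) method, and AndFilter can combine two filters, but check the Javadoc of your version before relying on that.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in types for illustration only; these are NOT HTML Parser's classes. */
final class FilterSketch {

    /** Minimal tree node: a tag name, its own text, and child nodes. */
    static final class Node {
        final String name;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String name, String text) {
            this.name = name;
            this.text = text;
        }

        /** Appends a child and returns this node so calls can be chained. */
        Node add(Node child) {
            children.add(child);
            return this;
        }

        /** Text of this node followed by the text of all descendants. */
        String plainText() {
            StringBuilder sb = new StringBuilder(text);
            for (Node c : children) {
                sb.append(c.plainText());
            }
            return sb.toString();
        }
    }

    /** A filter is just one accept() decision per node. */
    interface Filter {
        boolean accept(Node node);
    }

    static Filter tagName(String name) {
        return n -> n.name.equalsIgnoreCase(name);
    }

    static Filter containsText(String needle) {
        return n -> n.plainText().contains(needle);
    }

    static Filter and(Filter a, Filter b) {
        return n -> a.accept(n) && b.accept(n);
    }

    /**
     * Collects matches depth-first. The walk does not descend into a node
     * that matched, so a matching p nested inside a matching p is skipped.
     */
    static void collect(Node node, Filter filter, List<Node> out) {
        if (filter.accept(node)) {
            out.add(node);
            return;
        }
        for (Node c : node.children) {
            collect(c, filter, out);
        }
    }

    public static void main(String[] args) {
        Node body = new Node("body", "")
            .add(new Node("p", "price: 10 ").add(new Node("p", "price inside")))
            .add(new Node("p", "nothing here"))
            .add(new Node("div", "").add(new Node("p", "price: 20")));

        List<Node> hits = new ArrayList<>();
        collect(body, and(tagName("p"), containsText("price")), hits);
        for (Node hit : hits) {
            System.out.println(hit.plainText());
        }
    }
}
```

Stopping at the outermost match is only one way to remove the duplicates; keeping the innermost match instead would mean rejecting any node that has a matching descendant.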
From: Derrick O. <der...@ro...> - 2007-12-06 11:44:04
|
You will need to add your own end tag to the script tag you are injecting. I believe it's something like this:

    TagNode end = new TagNode ();
    end.setTagName ("/SCRIPT");
    script.setEndTag (end);

I guess this could be made much easier.

----- Original Message ----
From: Tom Hjellming <thj...@ri...>
To: htm...@li...
Sent: Thursday, December 6, 2007 3:19:22 AM
Subject: [Htmlparser-user] Transformation limitations?

I'm experimenting with the HtmlParser library to see if I can use it to transform webpages. One thing I'm trying is to see if I can inject some javascript into the HTML page.

My test app uses the PrototypicalNodeFactory to register some overridden tags like MyHeadTag and MyBodyTag (which derive from the HeadTag and BodyTag classes respectively) and then I run the parser. I then locate the MyHeadTag object found during the parsing and do the following:

    ScriptTag script = new ScriptTag();
    script.setAttribute("SRC", "blah.js");
    script.setLanguage("javascript");

    NodeList childNodes = headTag.getChildren();
    childNodes.add(script);

I then loop through the parser-generated listHtmlNodes calling toHtml() on each node and appending the result in a StringBuffer.

But looking at the resulting StringBuffer contents, I see that the <script> tag is not terminated with a </script>:

    <html>
    <head>
    <title>Testing...</title>
    <SCRIPT LANGUAGE=javascript SRC="blah.js">
    </head>
    <body>
    <p>Testing...</p>
    </body>
    </html>

All the other tags that were in the original HTML file with end tags are fine. It is just the newly injected ScriptTag that is not properly terminated.

This happens with any "container" tag I try to insert into the parser-generated "DOM" tree.

Does anyone know why? Any hints on how to fix this? Is this an unreasonable thing to do with HtmlParser?

thanks,
Tom |
From: Tom H. <thj...@ri...> - 2007-12-06 08:19:30
|
I'm experimenting with the HtmlParser library to see if I can use it to transform webpages. One thing I'm trying is to see if I can inject some javascript into the HTML page.

My test app uses the PrototypicalNodeFactory to register some overridden tags like MyHeadTag and MyBodyTag (which derive from the HeadTag and BodyTag classes respectively) and then I run the parser. I then locate the MyHeadTag object found during the parsing and do the following:

    ScriptTag script = new ScriptTag();
    script.setAttribute("SRC", "blah.js");
    script.setLanguage("javascript");

    NodeList childNodes = headTag.getChildren();
    childNodes.add(script);

I then loop through the parser-generated listHtmlNodes calling toHtml() on each node and appending the result in a StringBuffer.

But looking at the resulting StringBuffer contents, I see that the <script> tag is not terminated with a </script>:

    <html>
    <head>
    <title>Testing...</title>
    <SCRIPT LANGUAGE=javascript SRC="blah.js">
    </head>
    <body>
    <p>Testing...</p>
    </body>
    </html>

All the other tags that were in the original HTML file with end tags are fine. It is just the newly injected ScriptTag that is not properly terminated.

This happens with any "container" tag I try to insert into the parser-generated "DOM" tree.

Does anyone know why? Any hints on how to fix this? Is this an unreasonable thing to do with HtmlParser?

thanks,
Tom |
From: Derrick O. <der...@ro...> - 2007-11-24 00:23:57
|
Printing out the top-level node generates the entire HTML again. NodeLists also understand toHtml(). So for a parse like

    NodeList list = parser.parse (null);

the entire page is printed out with:

    System.out.println (list.toHtml ());

or

    for (int i = 0; i < list.size (); i++)
        System.out.println (list.elementAt (i).toHtml ());

You should look at the toHtml() method of CompositeTag if you don't want it to print the nested tags.

----- Original Message ----
From: Randy Paries <rtp...@gm...>
To: htm...@li...
Sent: Friday, November 23, 2007 3:55:06 PM
Subject: [Htmlparser-user] does someone have a simple example of printing out all nodes of a document

Hello,

This should not be as hard as I am making it, but I have a brain lock.

I need a simple function to parse an HTML file and print out each of its nodes. I am especially having problems with nested nodes, like tables within divs within divs.

It needs to print everything, so that I could take the output and display the same HTML page, but it needs to go down to the level where there are no children. I hope this makes sense.

Thanks for any help someone can give me,
Randy |
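For the "go down to the level where there are no children" part of the question, the usual shape is a depth-first walk that emits each node's start tag, then its children, then its end tag, so that the pieces concatenate back into the page. Below is a minimal sketch of that idea using a stand-in tree (TreeWalkSketch and its Node are hypothetical, not HTML Parser types). With the real library, the per-node pieces would presumably come from each node's children list and the tag's end tag, which is an assumption to verify against the Javadoc.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in tree for illustration only; not HTML Parser's Node or NodeList. */
final class TreeWalkSketch {

    static final class Node {
        final String open;   // start tag, or the raw text of a text node
        final String close;  // end tag, or "" when the node has none
        final List<Node> children = new ArrayList<>();

        Node(String open, String close) {
            this.open = open;
            this.close = close;
        }

        /** Appends a child and returns this node so calls can be chained. */
        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    static Node tag(String name) {
        return new Node("<" + name + ">", "</" + name + ">");
    }

    static Node text(String s) {
        return new Node(s, "");
    }

    /**
     * Visits every node depth-first. Each node is printed on its own
     * indented line, and its pieces are appended to html in document order.
     */
    static void walk(Node node, int depth, StringBuilder html) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append("  ");
        }
        System.out.println(indent + node.open);

        html.append(node.open);
        for (Node child : node.children) {
            walk(child, depth + 1, html);
        }
        html.append(node.close);
    }

    public static void main(String[] args) {
        // Tables within divs within divs, as in the question.
        Node page = tag("div").add(tag("div").add(tag("table").add(text("cell"))));
        StringBuilder html = new StringBuilder();
        walk(page, 0, html);
        System.out.println(html);
    }
}
```

The indented listing shows every node individually, while the StringBuilder holds the reassembled markup; keeping the two outputs separate avoids printing nested content twice.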
From: Randy P. <rtp...@gm...> - 2007-11-24 00:07:22
|
More details: using the lexer will not help me, because I need fully formed tags.

Thanks again

On Nov 23, 2007 5:55 PM, Randy Paries <rtp...@gm...> wrote:
> Hello,
>
> This should not be as hard as I am making it, but I have a brain lock.
>
> I need a simple function to parse an HTML file and print out each of its nodes. I am especially having problems with nested nodes, like tables within divs within divs.
>
> It needs to print everything, so that I could take the output and display the same HTML page, but it needs to go down to the level where there are no children. I hope this makes sense.
>
> Thanks for any help someone can give me,
> Randy |
From: Randy P. <rtp...@gm...> - 2007-11-23 23:55:07
|
Hello,

This should not be as hard as I am making it, but I have a brain lock.

I need a simple function to parse an HTML file and print out each of its nodes. I am especially having problems with nested nodes, like tables within divs within divs.

It needs to print everything, so that I could take the output and display the same HTML page, but it needs to go down to the level where there are no children. I hope this makes sense.

Thanks for any help someone can give me,
Randy |
From: Derrick O. <der...@ro...> - 2007-11-23 23:45:31
|
String value = tag.getAttribute("<name>");

----- Original Message ----
From: Ali <to...@ya...>
To: htm...@li...
Sent: Friday, November 23, 2007 11:29:07 AM
Subject: [Htmlparser-user] Attribute of a tag

Hi, how can I get an attribute of a tag? Thanks |
From: Ali <to...@ya...> - 2007-11-23 19:29:16
|
Hi, how can I get an attribute of a tag? Thanks |
From: Jurgen V. <ri...@pl...> - 2007-11-23 18:07:46
|
Thanks, that worked.

Jurgen

Derrick Oswald wrote:
> You should be able to use the Page.setBaseUrl (String base) method to set the URL used as a prefix for relative links, i.e.
>     parser.getLexer ().getPage ().setBaseUrl ("http://yadda.yadda");
>
> ----- Original Message ----
> From: Jurgen Voorneveld <j.e...@st...>
> To: htm...@li...
> Sent: Friday, November 23, 2007 11:13:33 AM
> Subject: [Htmlparser-user] Link Location resolving
>
> List,
>
> I've recently started using htmlparser as part of a web-spidering tool that I have written, and I've run into a small problem. My spider downloads files from web servers using HttpClient from the Apache Commons project. These files are then stored locally in a temporary location. If a file contains HTML, it is then parsed by htmlparser.
>
> During parsing, the parser resolves relative links to other files by adding the location of the file to the relative link, which of course completely screws up the links. Is there any way to turn this feature off, or some way of telling the parser that the location of the data is not where it gets the data from?
>
> thanks
> Jurgen Voorneveld |
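To see why the base URL matters here, it may help to resolve the same relative link against two different bases using only the JDK. The sketch below uses java.net.URI rather than HTML Parser; the example.com address and the temporary file path are made-up values, and RelativeLinks is a hypothetical helper name. It only illustrates the resolution step that Page.setBaseUrl influences and does not show how the library implements it.

```java
import java.net.URI;

/** Illustration with the JDK only; the URLs and paths below are invented examples. */
final class RelativeLinks {

    /** Resolves a link found in a page against the given base URL. */
    static String resolve(String baseUrl, String link) {
        return URI.create(baseUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String link = "images/logo.png";

        // Base taken from where the spider stored its temporary copy:
        // the link ends up pointing into the local temp directory.
        System.out.println(resolve("file:/tmp/spider/page123.html", link));

        // Base set to the address the page was downloaded from:
        // the link points back at the original server.
        System.out.println(resolve("http://example.com/docs/index.html", link));
    }
}
```

For a spider that stores pages locally, this suggests remembering each page's original URL alongside the temporary file so that it can be handed to the parser as the base before links are extracted.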