htmlparser-user Mailing List for HTML Parser (Page 24)
Brought to you by:
derrickoswald
From: Ali <to...@ya...> - 2007-11-11 14:14:06
|
Hi everyone!

I am a new user of HTML Parser and I have a problem: I want to POST a form request and then parse the resulting HTML, but so far I have failed to do so.

Ali |
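The question above received no reply on this page. As a rough sketch only, the POST itself can be done with plain JDK classes before any parsing happens. Nothing here is htmlparser API: the class name `FormPostSketch`, its methods, and the field names are made up for illustration. The library's Parser is understood to accept an already opened URLConnection, but check the Javadoc of your version before relying on that.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of htmlparser.
class FormPostSketch {

    // Percent-encodes one name or value for a form body.
    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Builds an application/x-www-form-urlencoded body from name/value pairs.
    static String encodeForm(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(encode(field.getKey())).append('=').append(encode(field.getValue()));
        }
        return body.toString();
    }

    // Opens the connection and sends the form body; the response is left unread
    // so that the caller (for example a parser) can consume it.
    static HttpURLConnection post(URL target, String body) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) target.openConnection();
        connection.setRequestMethod("POST");
        connection.setDoOutput(true);
        connection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = connection.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        return connection;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("user", "ali");
        fields.put("query", "html parser");
        System.out.println(encodeForm(fields));
    }
}
```

The `post` method is not exercised by `main` because it needs a live server; only the body encoding is shown running.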
From: Derrick O. <der...@ro...> - 2007-10-29 13:15:11
|
All HTTP access is through the org.htmlparser.http.ConnectionManager class.
It has request properties, proxy support, cookies and basic authentication.
I would say you need redirection processing enabled and cookie support enabled besides your username and password,
because usually these sites rely on cookies after a redirection.

----- Original Message ----
From: "dwi...@un..." <dwi...@un...>
To: htmlparser-user@lists.sourceforge.net
Sent: Monday, October 29, 2007 7:53:22 AM
Subject: [Htmlparser-user] Authentification

Hi,

I would like to connect to a website that requires authentication and then
process the HTML code. Unfortunately I haven't found a way to authenticate
myself as a user via the htmlparser API. Did I overlook something, or what do
I need to do?

Any help is greatly appreciated!

Cheers,
Daniel

_______________________________________________
Htmlparser-user mailing list
H...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user |
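For readers unsure what "basic authentication" amounts to on the wire, here is a minimal JDK-only sketch of the request header it produces. It deliberately does not call ConnectionManager; the class name `BasicAuthSketch` and the credentials are illustrative assumptions (the credentials are the well-known example pair from the HTTP Basic authentication RFC).

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of htmlparser.
class BasicAuthSketch {

    // Basic authentication sends "user:password", Base64-encoded,
    // in the Authorization request header.
    static String basicAuthHeader(String user, String password) {
        String credentials = user + ":" + password;
        byte[] raw = credentials.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        // Prints: Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
        System.out.println("Authorization: " + basicAuthHeader("Aladdin", "open sesame"));
    }
}
```

Since the header is only encoded, not encrypted, it is usually combined with HTTPS. Sites that use a login form rather than basic authentication need the cookie and redirection handling mentioned in the reply instead.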
From: Daniel W. <wic...@un...> - 2007-10-29 11:58:05
|
Hi,

I would like to connect to a website that requires authentication (basic authentication, for example) and then process the HTML code. Unfortunately I haven't found a way to authenticate myself as a user via the htmlparser API. Did I overlook something, or what do I need to do?

Any help is greatly appreciated!

Cheers,
Daniel |
From: Eugeny N D. <bo...@re...> - 2007-10-29 11:56:15
|
On Mon, Oct 29, 2007 at 12:53:22PM +0100, dwi...@un... wrote:
>
> Hi,
>
> I would like to connect to a website that requires authentication and then
> process the HTML code. Unfortunately I haven't found a way to authenticate
> myself as a user via the htmlparser API. Did I overlook something, or what do I need
> to do?
>
> Any help is greatly appreciated!

Take a look at Jakarta Commons: its HttpClient project does everything needed for connection and authentication, and much more, if you need to deal with the HTTP protocol.

--
Eugene N Dzhurinsky |
From: <dwi...@un...> - 2007-10-29 11:53:29
|
Hi,

I would like to connect to a website that requires authentication and then process the HTML code. Unfortunately I haven't found a way to authenticate myself as a user via the htmlparser API. Did I overlook something, or what do I need to do?

Any help is greatly appreciated!

Cheers,
Daniel |
From: Derrick O. <der...@ro...> - 2007-10-26 02:21:06
|
You might want to try setting the ScriptScanner STRICT member to false:

public class ScriptScanner
    extends
        CompositeTagScanner
{
    /**
     * Strict parsing of CDATA flag.
     * If this flag is set true, the parsing of script is performed without
     * regard to quotes. This means that erroneous script such as:
     * <pre>
     * document.write("</script>");
     * </pre>
     * will be parsed in strict accordance with appendix
     * B.3.2 Specifying non-HTML data of the
     * HTML 4.01 Specification (http://www.w3.org/TR/html4/) and
     * hence will be split into two or more nodes. Correct javascript would
     * escape the ETAGO:
     * <pre>
     * document.write("<\/script>");
     * </pre>
     * If true, CDATA parsing will stop at the first ETAGO ("</") no matter
     * whether it is quoted or not. If false, balanced quotes (either single or
     * double) will shield an ETAGO. Because of the possibility of quotes within
     * single or multiline comments, these are also parsed. In most cases,
     * users prefer non-strict handling since there is so much broken script
     * out in the wild.
     */
    public static boolean STRICT = true;


----- Original Message ----
From: Subramanya Sastry <sastry...@cs...>
To: htmlparser user list <htm...@li...urceforge.net>
Sent: Thursday, October 25, 2007 2:00:56 PM
Subject: [Htmlparser-user] parsing bug?


I am writing to check whether what I am observing is a parsing bug and,
if so, whether there are any known workarounds.

When javascript is being parsed, my NodeVisitor's visitTag method gets
called at the start. As expected, all starting HTML tags within the
javascript itself are ignored, since they are part of the javascript and
not the HTML. However, at the first closing tag encountered within the
javascript code (even within strings), the parser triggers a close script
tag and my visitor's visitEndTag method gets called.

For example, with a "</textarea>" string within the javascript, I get 2
successive calls to visitEndTag, first with the SCRIPT tag and next with
the TEXTAREA tag. Or, with a "</he" + "ad>" string within javascript, I
get 2 successive calls to visitEndTag, first with the SCRIPT tag and
next with the TEXTAREA tag.

As an example, test the 2 urls:

1. http://youtube.com/?v=qf4tdOKKWic
2. http://www.deccanherald.com/Content/Oct202007/city2007102031594.asp

Any leads to fix this are much appreciated.

Thanks,
Subbu. |
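The difference between strict and non-strict handling described in the Javadoc above can be illustrated without the library. The following is a simplified sketch, not the ScriptScanner implementation: the class `EtagoSketch` and method `findScriptEnd` are invented for illustration, and unlike the real scanner it ignores comments entirely.

```java
// Hypothetical illustration of the STRICT flag's effect; not htmlparser code.
class EtagoSketch {

    /**
     * Returns the index of the "</" that would end the script text, or -1.
     * strict:     the first "</" wins, whether it is quoted or not.
     * non-strict: a "</" inside a single- or double-quoted string is skipped.
     * Simplification: comments and regex literals are not recognised.
     */
    static int findScriptEnd(String text, boolean strict) {
        char quote = 0; // 0 means "not inside a string literal"
        for (int i = 0; i < text.length() - 1; i++) {
            char c = text.charAt(i);
            if (!strict && quote != 0) {
                if (c == '\\') {
                    i++;       // skip the escaped character
                } else if (c == quote) {
                    quote = 0; // string literal closed
                }
                continue;
            }
            if (!strict && (c == '"' || c == '\'')) {
                quote = c;     // string literal opened
                continue;
            }
            if (c == '<' && text.charAt(i + 1) == '/') {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String script = "document.write(\"</script>\");</script>";
        System.out.println("strict ends at     " + findScriptEnd(script, true));
        System.out.println("non-strict ends at " + findScriptEnd(script, false));
    }
}
```

In strict mode the end is found inside the quoted string, which is the premature SCRIPT end tag reported in the quoted message; in non-strict mode the quoted "</" is shielded and the real end tag is found.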
From: Subramanya S. <sa...@cs...> - 2007-10-25 18:01:16
|
I am writing to check whether what I am observing is a parsing bug and, if so, whether there are any known workarounds.

When javascript is being parsed, my NodeVisitor's visitTag method gets called at the start. As expected, all starting HTML tags within the javascript itself are ignored, since they are part of the javascript and not the HTML. However, at the first closing tag encountered within the javascript code (even within strings), the parser triggers a close script tag and my visitor's visitEndTag method gets called.

For example, with a "</textarea>" string within the javascript, I get 2 successive calls to visitEndTag, first with the SCRIPT tag and next with the TEXTAREA tag. Or, with a "</he" + "ad>" string within javascript, I get 2 successive calls to visitEndTag, first with the SCRIPT tag and next with the TEXTAREA tag.

As an example, test the 2 urls:

1. http://youtube.com/?v=qf4tdOKKWic
2. http://www.deccanherald.com/Content/Oct202007/city2007102031594.asp

Any leads to fix this are much appreciated.

Thanks,
Subbu. |
From: Lucas S. <lsa...@gm...> - 2007-10-20 14:03:36
|
Hello everyone.

It seems a simple task, but I couldn't figure out how to do it yet. What I want to do is get HTML code from a database and check whether all the image paths and link paths are relative or absolute; if they are relative I'll turn them into absolute ones. So an <a href="test.html"> would become an <a href="http://www.marx.com.br/test.html">.

I tried to do it in two ways:

1 - First way

I first created the parser:

    Parser parse = new Parser();
    parse.setInputHTML(html.toString());

where the html variable is a StringBuffer. After that I created a filter and searched for it with the parser:

    NodeFilter filtro = new NodeClassFilter(LinkTag.class);
    lista = parse.extractAllNodesThatMatch(filtro);

Then I iterated over the lista variable and changed each link:

    aux = link.getLink();
    if (aux != null && aux.indexOf("http://") == -1) {
        aux = "http://www.marx.com.br/" + aux;
    }
    link.setLink(aux);

2 - Second way

I did the same thing as in the first way, but besides getting all the links, I created a NodeVisitor that checks whether the tag is a link and changes it, and I called the method parse.visitAllNodesWith(visitante);

Up to this point everything works just fine. My problem is: how do I retrieve the final HTML that I have changed?

Thanks,

--
Lucas Sanabio
Sun Certified Programmer for the Java 2 Platform
Java Consultant - Marx Tecnologia
URL: http://www.marx.com.br |
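One detail in the message above is worth a sketch: testing for the substring "http://" misses https links and treats any link that merely contains "http://" somewhere (for instance in a query string) as absolute. A possible alternative, using only java.net.URI from the JDK, is shown below. The class name `LinkResolverSketch` is an illustrative assumption, and the sketch covers only the link rewriting step; how to get the modified HTML back out of the parser is not answered in this thread.

```java
import java.net.URI;

// Hypothetical helper, not part of htmlparser.
class LinkResolverSketch {

    // Resolves a link against a base URL. Links that are already absolute
    // are returned unchanged; relative ones are joined with the base.
    // Throws IllegalArgumentException if either string is not a valid URI.
    static String toAbsolute(String baseUrl, String link) {
        return URI.create(baseUrl).resolve(link.trim()).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.marx.com.br/";
        System.out.println(toAbsolute(base, "test.html"));
        System.out.println(toAbsolute(base, "/img/logo.png"));
        System.out.println(toAbsolute(base, "https://example.org/a.png"));
    }
}
```

The returned string could then be passed to link.setLink(...) in place of the concatenation used in the original code.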
From: Derrick O. <der...@ro...> - 2007-10-19 02:15:05
|
Once you have the paragraph tag (assuming it is defined to be a composite tag) you should be able to pass the NodeList from getChildren() through a filter via extractAllNodesThatMatch(new TagNameFilter("A")) to extract all the link nodes. Then process the links from the resulting list and extract their text using getLinkText().

----- Original Message ----
From: M Mencke <me...@gm...>
To: htm...@li...
Sent: Thursday, October 18, 2007 9:40:48 AM
Subject: [Htmlparser-user] extract flickr tags

Hello,

I have installed and run the HTML parser and I can get it to extract all of the text from the following web page:

http://www.flickr.com/photos/mariposa-de-amor/tags/

(mariposa-de-amor is just an example, this could be any user name).

But I want to extract ONLY the 150 most popular tags. In other words, I want to output a plain text file with the user's 150 most popular tags, like this:

35faves a aberdeen abigfave anawesomeshot aplusphoto athousandwords avianexcellence baby band beach beautiful bedouin belis birds blogthis blueribbonwinner bravo bridge bw castle child children church cindrel city clouds clova cluj colourartaward concert copii criket .............................................

The page source is like this:

<p id="TagCloud">
    <a href="/photos/mariposa-de-amor/tags/35faves/" style="font-size: 14px;">35faves</a>
    &nbsp;<a href="/photos/mariposa-de-amor/tags/a/" style="font-size: 14px;">a</a>
    <a href="/photos/mariposa-de-amor/tags/aberdeen/" style="font-size: 15px;">aberdeen</a>
</p>

So I want only the words 35faves, aberdeen, .........

I could try to get the text inside <p id="TagCloud"> and </p>, but that is not effective because I extract " <a href............ " and I don't want this additional text in my output.

I have tried this using Trimtags and it doesn't work for me. Does anybody know of an existing method with which I could do this, or can you offer any advice? All I want is to extract a list of one user's tags from a flickr (or any other) webpage.

Thanks a lot,
Myriam |
From: M M. <me...@gm...> - 2007-10-18 13:40:53
|
Hello,

I have installed and run the HTML parser and I can get it to extract all of the text from the following web page:

http://www.flickr.com/photos/mariposa-de-amor/tags/

(mariposa-de-amor is just an example, this could be any user name).

But I want to extract ONLY the 150 most popular tags. In other words, I want to output a plain text file with the user's 150 most popular tags, like this:

35faves a aberdeen abigfave anawesomeshot aplusphoto athousandwords avianexcellence baby band beach beautiful bedouin belis birds blogthis blueribbonwinner bravo bridge bw castle child children church cindrel city clouds clova cluj colourartaward concert copii criket .............................................

The page source is like this:

<p id="TagCloud">
    <a href="/photos/mariposa-de-amor/tags/35faves/" style="font-size: 14px;">35faves</a>
    <a href="/photos/mariposa-de-amor/tags/a/" style="font-size: 14px;">a</a>
    <a href="/photos/mariposa-de-amor/tags/aberdeen/" style="font-size: 15px;">aberdeen</a>
</p>

So I want only the words 35faves, aberdeen, .........

I could try to get the text inside <p id="TagCloud"> and </p>, but that is not effective because I extract " <a href............ " and I don't want this additional text in my output.

I have tried this using Trimtags and it doesn't work for me. Does anybody know of an existing method with which I could do this, or can you offer any advice? All I want is to extract a list of one user's tags from a flickr (or any other) webpage.

Thanks a lot,
Myriam |
From: MitchH <m2...@mi...> - 2007-10-12 14:17:52
|
Martin Sturm <msturm10 <at> gmail.com> writes:

>
> Hello,
>
> I'm using HTMLParser for extracting text from a HTML page in order to
> index it using a full text search engine.
> During the testing phase, I discovered that some web pages are not
> parsed correctly by HTMLParser. One of these webpages is for example
> http://www.microsoft.com.
> I think the problem is that according to the HTTP headers, the
> encoding is in UTF-8, but in HTML META tags this is changed to UTF-16.
> This can be handled by catching the EncodingChangeException, but this
> doesn't prevent the textual content of the site from being interpreted
> incorrectly.
>

The microsoft site contains the following snippet:

<head><META http-equiv="Content-Type" content="text/html; charset=utf-16">
<title>Microsoft Corporation</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

It seems they change the content encoding just for the title (god knows why).
The second change, back to UTF-8, causes things to fall over.

I found a fix at
http://osdir.com/ml/parsers.htmlparser.user/2006-03/msg00033.html |
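To make the conflict described above concrete, here is a small JDK-only sketch that pulls the charset parameter out of a Content-Type value, so the HTTP header and a META tag can be compared. It is an illustration under assumptions, not how the parser detects encoding changes: the class name `CharsetParamSketch` and its method are invented for this example.

```java
import java.util.Locale;

// Hypothetical helper, not part of htmlparser.
class CharsetParamSketch {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=utf-16". Returns the name in upper case,
    // or null when no charset parameter is present.
    static String charsetOf(String contentType) {
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).replace("\"", "").trim();
                return value.toUpperCase(Locale.ROOT);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String fromHeader = charsetOf("text/html; charset=utf-8");  // HTTP header
        String fromMeta = charsetOf("text/html; charset=utf-16");   // first META tag
        System.out.println(fromHeader + " vs " + fromMeta
                + " -> conflict: " + !fromHeader.equals(fromMeta));
    }
}
```

A page like the one quoted declares two different charsets in its own META tags, so whichever one is honoured, part of the document is decoded with the wrong assumption unless the parse is restarted.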
From: Ted C. <ne...@ya...> - 2007-09-29 02:12:44
|
hmm... seems there is no hope for embedded objects or applets :)
Does anyone have a clue about extracting images referenced in CSS/JS? Is <img> the only tag (other than <object>, <embed> or <applet>) in HTML that can embed images?

----- Original Message ----
From: Derrick Oswald <der...@ro...>
To: htmlparser user list <htm...@li...>
Sent: Friday, September 28, 2007 2:49:03 PM
Subject: Re: [Htmlparser-user] best practice to ex

<embed>random object that shows images</embed>

<applet>that shows images</applet>

----- Original Message ----
From: Ted Chen <ne...@ya...>
To: htm...@li...
Sent: Thursday, September 27, 2007 11:49:25 PM
Subject: [Htmlparser-user] best practice to ex

Hi,

I'm just wondering what's the best way to extract all images visible on a web page. Besides getting "img" nodes, is there anything else I need to pay attention to?
An HTML author can supposedly also use JavaScript to place images. Is it possible to extract these images? (I guess probably not :)
Or maybe CSS backgrounds? How can I get a handle on that?
Are there any other possibilities using html/javascript/css/anything else?

Your insights appreciated.

Thanks,
Ted |
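The CSS part of the question is not answered on this page. As a rough sketch under stated assumptions, image references in CSS text (for example the contents of a style attribute or a style element, once you have them as a string) can be found by looking for url(...) tokens. This uses only java.util.regex; the class `CssImageSketch` and its method are invented for illustration, the pattern is a simplification that does not handle escaped characters inside url(...), and images placed by JavaScript at run time are out of reach for this approach.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of htmlparser.
class CssImageSketch {

    // Matches url(x), url('x') and url("x"); group 2 is the URL itself.
    private static final Pattern CSS_URL =
            Pattern.compile("url\\(\\s*(['\"]?)(.*?)\\1\\s*\\)", Pattern.CASE_INSENSITIVE);

    // Returns every URL referenced through url(...) in the given CSS text,
    // in order of appearance.
    static List<String> cssImageUrls(String css) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = CSS_URL.matcher(css);
        while (matcher.find()) {
            urls.add(matcher.group(2));
        }
        return urls;
    }

    public static void main(String[] args) {
        String css = "body { background: url('img/bg.png') no-repeat; }"
                + " h1 { background-image: URL(logo.gif); }";
        System.out.println(cssImageUrls(css));
    }
}
```

Not every url(...) is an image (fonts and imported stylesheets use the same syntax), so the result would still need filtering, for instance by file extension.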
From: Derrick O. <der...@ro...> - 2007-09-28 21:49:11
|
<embed>random object that shows images</embed>

<applet>that shows images</applet>

----- Original Message ----
From: Ted Chen <nehcd...@ya...>
To: htm...@li...
Sent: Thursday, September 27, 2007 11:49:25 PM
Subject: [Htmlparser-user] best practice to ex

Hi,

I'm just wondering what's the best way to extract all images visible on a web page. Besides getting "img" nodes, is there anything else I need to pay attention to?

An HTML author can supposedly also use JavaScript to place images. Is it possible to extract these images? (I guess probably not :)

Or maybe CSS backgrounds? How can I get a handle on that?

Are there any other possibilities using html/javascript/css/anything else?

Your insights appreciated.

Thanks,
Ted |
From: Ted C. <ne...@ya...> - 2007-09-28 03:49:32
|
Hi,

I'm just wondering what's the best way to extract all images visible on a web page. Besides getting "img" nodes, is there anything else I need to pay attention to?
An HTML author can supposedly also use JavaScript to place images. Is it possible to extract these images? (I guess probably not :)
Or maybe CSS backgrounds? How can I get a handle on that?
Are there any other possibilities using html/javascript/css/anything else?

Your insights appreciated.

Thanks,
Ted |
From: Madhur K. T. <mad...@gm...> - 2007-09-27 12:57:39
|
That is cool - I've been using HtmlParser for about 2 years now. I've done pretty basic things, though. The deepest I've been in HtmlParser's code is to find out how CompositeTagScanner worked - that was the time when I wanted to remove all the virtual end tags that the parser added to the content.

I think the developer community option seems to be more exciting. I'll check the site for more details on joining the dev side and will get back to you.
So, does the project have a set of induction docs?

Thanks,

Derrick Oswald wrote:
> Yes it's possible and allowed... it's open source.
> Two obvious choices...
> You can join the developer community and have access to subversion,
> or you can submit a patch (an overlay of files and changes) to be
> incorporated.
> Your choice.
>
> ----- Original Message ----
> From: Madhur Kumar Tanwani <mad...@gm...>
> To: htmlparser user list <htm...@li...>
> Sent: Thursday, September 27, 2007 8:08:37 AM
> Subject: Re: [Htmlparser-user] Parsing Partial HTML text
>
> :)
> So what does it take for an outsider like me - I mean off the developer
> list - to work on that? Is that possible? Or allowed?
>
> Derrick Oswald wrote:
> >
> > This has been on the Request For Enhancement list for a couple of
> > years now.
> > RFE #888158 block & inline tag differentiation
> > <http://sourceforge.net/tracker/index.php?func=detail&aid=888158&group_id=24399&atid=381402>
> > RFE #1395202 support for well known tags like b,i,u,iframe
> > <http://sourceforge.net/tracker/index.php?func=detail&aid=1395202&group_id=24399&atid=381402>
> >
> > Adding the equivalent of composite tags isn't too difficult, but each
> > tag has semantics that are more subtle.
> > Like for example the MAP and AREA example would need accessors like
> > MapTag.getName() and AreaTag.getShape() and AreaTag.getCoords().
> >
> > ... it's just a little code
> >
> > ----- Original Message ----
> > From: Madhur Kumar Tanwani <mad...@gm...>
> > To: htmlparser user list <htm...@li...>
> > Sent: Thursday, September 27, 2007 12:22:16 AM
> > Subject: Re: [Htmlparser-user] Parsing Partial HTML text
> >
> > Hi Derrick,
> > I understand that the parser wraps unknown/unregistered tags in generic
> > tag classes. But do you think, with so many normal tags not registered
> > with HtmlParser, it would be time to get them incorporated in the set of
> > tags the node factory registers? I mean, the HTML tags - MAP, AREA,
> > TBODY, THEAD etc... are very common in HTML content and I think it
> > would be good to have them registered, so that people can work on it.
> >
> > Would like to hear your comments on the same.
> > Thanks,
> >
> > Derrick Oswald wrote:
> > >
> > > The tbody tag you are getting is a generic tag - because the parser
> > > doesn't know about tbody.
> > > Hence it has no children because it is not a composite node.
> > > You can make your own tbody composite node as described here:
> > > http://htmlparser.sourceforge.net/faq.html#composite
> > >
> > >
> > > ----- Original Message ----
> > > From: "mic...@no..." <mic...@no...>
> > > To: htm...@li...
> > > Sent: Wednesday, September 26, 2007 10:30:45 AM
> > > Subject: [Htmlparser-user] Parsing Partial HTML text
> > >
> > >
> > > I am having trouble parsing html tagged text. It seems that I can
> > > retrieve a node but that element does not have the child nodes as
> > > expected.
> > >
> > > String table =
> > >     "<tbody>\n" +
> > >     "<tr>\n" +
> > >     "<td><span>brain_normal_GSM80627</span></td>\n" +
> > >     "<td><span>normal</span></td>\n" +
> > >     "<td><span>cerebral cortex</span></td>\n" +
> > >     "<td><span>brain</span></td>\n" +
> > >     "</tr>\n" +
> > >     "</tbody>\n";
> > >
> > > Parser parser = new Parser(new Lexer(table));
> > > try {
> > >     Node tBodyNode = parser.extractAllNodesThatMatch(new
> > >         TagNameFilter("tbody")).elementAt(0);
> > >     System.out.println(tBodyNode.getChildren()); // Prints null <---------------
> > > } catch (ParserException e) {
> > >     e.printStackTrace(); //To change body of catch statement use File | Settings | File Templates.
> > > }
> > >
> > > Does HTML Parser not handle text input or partial html files well?
> > >

--
Madhur Kumar Tanwani <http://madhurtanwani.googlepages.com>
Gebo <http://feeds.feedburner.com/%7Er/Gebo/%7E6/1>

****************************************************************
* Imagine if every Thursday your shoes exploded if you tied them the
*usual way*.
This happens to us all the time with *computers*, and nobody thinks of
complaining. Johnson
**************************************************************** |
From: Derrick O. <der...@ro...> - 2007-09-27 12:27:24
|
Yes, it's possible and allowed... it's open source.
Two obvious choices:
You can join the developer community and have access to Subversion,
or you can submit a patch (an overlay of files and changes) to be incorporated.
Your choice.

----- Original Message ----
From: Madhur Kumar Tanwani <mad...@gm...>
To: htmlparser user list <htmlparser-user@lists.sourceforge.net>
Sent: Thursday, September 27, 2007 8:08:37 AM
Subject: Re: [Htmlparser-user] Parsing Partial HTML text

:)
So what does it take for an outsider like me - I mean off the developer list - to work on that? Is that possible? Or allowed?
|
From: Madhur K. T. <mad...@gm...> - 2007-09-27 12:06:34
|
:)
So what does it take for an outsider like me - I mean off the developer list - to work on that? Is that possible? Or allowed?

Derrick Oswald wrote:
> This has been on the Request For Enhancement list for a couple of years now.
> RFE #888158 block & inline tag differentiation
> <http://sourceforge.net/tracker/index.php?func=detail&aid=888158&group_id=24399&atid=381402>
> RFE #1395202 support for well known tags like b,i,u,iframe
> <http://sourceforge.net/tracker/index.php?func=detail&aid=1395202&group_id=24399&atid=381402>
>
> Adding the equivalent of composite tags isn't too difficult, but each tag has semantics that are more subtle.
> Like for example the MAP and AREA example would need accessors like MapTag.getName() and AreaTag.getShape() and AreaTag.getCoords().
>
> ... it's just a little code

--
Madhur Kumar Tanwani <http://madhurtanwani.googlepages.com>
Gebo <http://feeds.feedburner.com/%7Er/Gebo/%7E6/1>

* Imagine if every Thursday your shoes exploded if you tied them the *usual way*.
This happens to us all the time with *computers*, and nobody thinks of complaining. Johnson
|
From: Derrick O. <der...@ro...> - 2007-09-27 12:03:36
|
This has been on the Request For Enhancement list for a couple of years now.
RFE #888158 block & inline tag differentiation
RFE #1395202 support for well known tags like b,i,u,iframe

Adding the equivalent of composite tags isn't too difficult, but each tag has semantics that are more subtle.
For example, MAP and AREA would need accessors like MapTag.getName(), AreaTag.getShape() and AreaTag.getCoords().

... it's just a little code

----- Original Message ----
From: Madhur Kumar Tanwani <mad...@gm...>
To: htmlparser user list <htmlparser-u...@li...>
Sent: Thursday, September 27, 2007 12:22:16 AM
Subject: Re: [Htmlparser-user] Parsing Partial HTML text

Hi Derrick,
I understand that the parser wraps unknown/unregistered tags in generic tag classes. But do you think, with so many normal tags not registered with HtmlParser, it would be time to get them incorporated in the set of tags the node factory registers? I mean, HTML tags such as MAP, AREA, TBODY, THEAD etc. are very common in HTML content and I think it would be good to have them registered, so that people can work on them.

Would like to hear your comments on the same.
Thanks,
|
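To make the point about accessors concrete, here is a minimal, self-contained sketch of what a dedicated AREA tag class might expose. It does not use HTML Parser's own classes: AreaTagSketch and its constructor are hypothetical names invented for illustration, and only the accessor names getShape() and getCoords() come from the message above. It assumes coords holds plain comma-separated pixel values.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for a dedicated AREA tag class; not part of HTML Parser. */
class AreaTagSketch {
    private final Map<String, String> attributes = new HashMap<String, String>();

    /** Copies the raw attributes, lower-casing names because HTML attribute names are case-insensitive. */
    AreaTagSketch(Map<String, String> rawAttributes) {
        for (Map.Entry<String, String> e : rawAttributes.entrySet()) {
            attributes.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
    }

    /** Returns the shape keyword, falling back to "rect", the HTML 4 default. */
    String getShape() {
        String shape = attributes.get("shape");
        if (shape == null || shape.trim().isEmpty()) {
            return "rect";
        }
        return shape.trim().toLowerCase(Locale.ROOT);
    }

    /**
     * Returns the coords as integers, or an empty array when the attribute is absent.
     * Assumption: plain pixel values only, so a value like "50%" throws NumberFormatException.
     */
    int[] getCoords() {
        String coords = attributes.get("coords");
        if (coords == null || coords.trim().isEmpty()) {
            return new int[0];
        }
        String[] parts = coords.split(",");
        int[] values = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Integer.parseInt(parts[i].trim());
        }
        return values;
    }
}
```

The subtlety Derrick mentions shows up even in this small case: a sensible getShape() has to know the default shape, and getCoords() has to decide what to do with malformed or percentage values. A generic tag can sidestep both by returning raw attribute strings.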
From: Madhur K. T. <mad...@gm...> - 2007-09-27 04:20:12
|
Hi Derrick,
I understand that the parser wraps unknown/unregistered tags in generic tag classes. But do you think, with so many normal tags not registered with HtmlParser, it would be time to get them incorporated in the set of tags the node factory registers? I mean, HTML tags such as MAP, AREA, TBODY, THEAD etc. are very common in HTML content and I think it would be good to have them registered, so that people can work on them.

Would like to hear your comments on the same.
Thanks,

Derrick Oswald wrote:
> The tbody tag you are getting is a generic tag - because the parser doesn't know about tbody.
> Hence it has no children because it is not a composite node.
> You can make your own tbody composite node as described here:
> http://htmlparser.sourceforge.net/faq.html#composite

--
Madhur Kumar Tanwani <http://madhurtanwani.googlepages.com>
|
From: Derrick O. <der...@ro...> - 2007-09-27 01:15:32
|
The tbody tag you are getting is a generic tag - because the parser doesn't know about tbody.
Hence it has no children because it is not a composite node.
You can make your own tbody composite node as described here: http://htmlparser.sourceforge.net/faq.html#composite

----- Original Message ----
From: "mic...@no..." <mic...@no...>
To: htm...@li...
Sent: Wednesday, September 26, 2007 10:30:45 AM
Subject: [Htmlparser-user] Parsing Partial HTML text

I am having trouble parsing html tagged text. It seems that I can retrieve a node but that element does not have the child nodes as expected.

        String table =
                "<tbody>\n" +
                "<tr>\n" +
                "<td><span>brain_normal_GSM80627</span></td>\n" +
                "<td><span>normal</span></td>\n" +
                "<td><span>cerebral cortex</span></td>\n" +
                "<td><span>brain</span></td>\n" +
                "</tr>\n" +
                "</tbody>\n";

        Parser parser = new Parser(new Lexer(table));
        try {
            Node tBodyNode = parser.extractAllNodesThatMatch(new TagNameFilter("tbody")).elementAt(0);
            System.out.println(tBodyNode.getChildren()); // Prints null <---------------
        } catch (ParserException e) {
            e.printStackTrace(); //To change body of catch statement use File | Settings | File Templates.
        }

Does HTML Parser not handle text input or partial html files well?
|
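The generic-versus-composite distinction in the reply above can be illustrated with a small toy model. This is a sketch under stated assumptions, not HTML Parser's implementation: SketchNode, TagTreeSketch and parse() are hypothetical names, the tokenizer is a simplistic regex, and the set of "composite" names stands in for whatever tags the real node factory has registered. The FAQ linked above describes the actual mechanism.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy node: children stays null unless the tag name was registered as composite. */
class SketchNode {
    final String name;          // upper-cased tag name, "/NAME" for a stray end tag, or "#text"
    final String text;          // text content for "#text" nodes, otherwise null
    List<SketchNode> children;  // null for generic (non-composite) tags

    SketchNode(String name, String text) {
        this.name = name;
        this.text = text;
    }
}

/** Toy parser: only names in the composites set collect the nodes that follow them. */
class TagTreeSketch {
    private static final Pattern TOKEN =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)");

    static List<SketchNode> parse(String html, Set<String> composites) {
        List<SketchNode> top = new ArrayList<SketchNode>();
        Deque<SketchNode> open = new ArrayDeque<SketchNode>(); // currently open composite tags
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(3) != null) {                       // text between tags
                if (!m.group(3).trim().isEmpty()) {
                    add(top, open, new SketchNode("#text", m.group(3)));
                }
                continue;
            }
            String name = m.group(2).toUpperCase();
            if (m.group(1).isEmpty()) {                     // start tag
                SketchNode node = new SketchNode(name, null);
                add(top, open, node);
                if (composites.contains(name)) {
                    node.children = new ArrayList<SketchNode>();
                    open.push(node);
                }
            } else {                                        // end tag
                boolean isOpen = false;
                for (SketchNode n : open) {
                    if (n.name.equals(name)) { isOpen = true; break; }
                }
                if (isOpen) {
                    // Close the matching composite and anything left open inside it.
                    while (!open.pop().name.equals(name)) { }
                } else {
                    // A generic tag's end tag is just another sibling node.
                    add(top, open, new SketchNode("/" + name, null));
                }
            }
        }
        return top;
    }

    private static void add(List<SketchNode> top, Deque<SketchNode> open, SketchNode node) {
        if (open.isEmpty()) {
            top.add(node);
        } else {
            open.peek().children.add(node);
        }
    }
}
```

Calling TagTreeSketch.parse(html, emptySet) on a tbody fragment yields a flat list in which the TBODY node has children == null, mirroring the null printed in the question. Passing a set containing "TBODY", "TR" and "TD" yields a single TBODY node with the rows nested inside it, which is the effect of making tbody a composite node.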