htmlparser-user Mailing List for HTML Parser (Page 27)
Brought to you by:
derrickoswald
From: <pos...@or...> - 2007-08-11 13:09:30
|
This is an automatically generated Delivery Status Notification. Unable to deliver message to the following recipients, due to being unable to connect successfully to the destination mail server. htm...@li... |
From: Derrick O. <der...@ro...> - 2007-08-07 22:16:23
|
Hi,

The HasAttributeFilter should have worked... at least enough to extract all links with the Id attribute:
    new AndFilter (new TagNameFilter ("A"), new HasAttributeFilter ("Id"))

That said, there isn't a "HasAttributeRegexFilter" that would match an attribute value pattern,
although it has been discussed on the dev forum - or was that the LinkRegexFilter?

What you need is a combination of the HasAttributeFilter and the RegexFilter, where the exact equality test in the accept() method of HasAttributeFilter is replaced by the pattern matching code from the RegexFilter. Something like this:

    /**
     * Accept tags with a certain attribute.
     * @param node The node to check.
     * @return <code>true</code> if the node has the attribute
     * (and value if that is being checked too), <code>false</code> otherwise.
     */
    public boolean accept (Node node)
    {
        Tag tag;
        Attribute attribute;
        String string;
        Matcher matcher;
        boolean ret;

        ret = false;
        if (node instanceof Tag)
        {
            tag = (Tag)node;
            attribute = tag.getAttributeEx (mAttribute);
            ret = null != attribute;
            if (ret && (null != mValue))
            {
                string = attribute.getValue ();
                matcher = mPattern.matcher (string);
                switch (mStrategy)
                {
                    case MATCH:
                        ret = matcher.matches ();
                        break;
                    case LOOKINGAT:
                        ret = matcher.lookingAt ();
                        break;
                    case FIND:
                    default:
                        ret = matcher.find ();
                        break;
                }
            }
        }

        return (ret);
    }

Derrick

----- Original Message ----
From: Mark Goking <Mar...@as...>
To: htm...@li...urceforge.net
Sent: Tuesday, August 7, 2007 5:19:29 AM
Subject: [Htmlparser-user] Parsing for links

Hi all

I used the filterbean class to extract only tags with links <a href>

However I wish to only retrieve links that have an id attribute with
value that starts with string test_

I don't see any method in the api that lets you do a search for the id's
value that acts like a String's indexOf() method.

What would be the filters needed for this operation? Even though ive
added attributes to the LinkTag to search for id=value attribute, it
still wont work.

Thanks
Chester

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
Htmlparser-user mailing list
Htmlparser-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/htmlparser-user
|
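The three matching strategies in the reply above behave differently for a "starts with test_" requirement, which a small stand-alone sketch may make clearer. It uses only java.util.regex and plain strings in place of attribute values; the class name AttributeValueMatch and its Strategy enum are hypothetical and are not part of HTML Parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not an HTML Parser class) comparing the three Matcher strategies. */
final class AttributeValueMatch {
    enum Strategy { MATCH, LOOKINGAT, FIND }

    /** Applies the pattern to an attribute value using the chosen strategy. */
    static boolean accept(String value, Pattern pattern, Strategy strategy) {
        if (value == null) {
            return false; // attribute missing or present without a value
        }
        Matcher matcher = pattern.matcher(value);
        switch (strategy) {
            case MATCH:
                return matcher.matches();   // the whole value must match
            case LOOKINGAT:
                return matcher.lookingAt(); // the match must begin at the start of the value
            default:
                return matcher.find();      // the match may occur anywhere in the value
        }
    }

    public static void main(String[] args) {
        Pattern prefix = Pattern.compile("test_");
        System.out.println(accept("test_42", prefix, Strategy.LOOKINGAT));    // true
        System.out.println(accept("my_test_42", prefix, Strategy.LOOKINGAT)); // false
        System.out.println(accept("my_test_42", prefix, Strategy.FIND));      // true
        System.out.println(accept("test_42", prefix, Strategy.MATCH));        // false
    }
}
```

Under these assumptions, LOOKINGAT with the pattern test_ (or FIND with ^test_) is the closest fit for ids that begin with test_, while FIND with the unanchored pattern behaves like an indexOf() check.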
From: Volodymyr K. <Vol...@hz...> - 2007-08-07 13:45:54
|
fb38f230 |
From: Mark G. <Mar...@as...> - 2007-08-07 09:20:34
|
Hi all

I used the filterbean class to extract only tags with links <a href>

However I wish to only retrieve links that have an id attribute with value that starts with string test_

I don't see any method in the api that lets you do a search for the id's value that acts like a String's indexOf() method.

What would be the filters needed for this operation? Even though ive added attributes to the LinkTag to search for id=value attribute, it still wont work.

Thanks
Chester
|
From: Gee R. <ge...@ya...> - 2007-08-06 11:06:45
|
Hello!

I know Java but not HTML. I want to extract text files of some selected news from an Arabic newspaper. Can someone help me with how I can do this?

Best Regards- G. Raza
|
From: Bren <Br...@wi...> - 2007-08-01 16:30:42
|
From: dsd <Kha...@pr...> - 2007-08-01 05:45:06
|
From: Derrick O. <der...@ro...> - 2007-07-31 20:05:18
|
No, sorry, I can't do your job for you.

A standard Java InputStreamReader takes the encoding as a constructor argument. I suggest trying "UTF-8".

If you don't want to turn the file into a String first, the Page class in the lexer package has a similar constructor:
    Page (InputStream stream, String charset)
You can pass the page into the Lexer and thence on to the Parser with something like:
    new Parser (new Lexer (new Page (mystream, "UTF-8")));

----- Original Message ----
From: k <km...@re...>
To: htm...@li...
Sent: Tuesday, July 31, 2007 1:39:51 PM
Subject: [Htmlparser-user] Re :Re: Re :Re: Tag Nodes not getting recognized...Please Help

hi Derrick,

thanks very much again. I have tried with ISO-8859-1, but no luck. The original html file is with Unicode (probably UTF-8). I have tried many many ways....and I was not able to do it....

could you please once try htmlParser on my html file and advise me with any help...i know it takes your valuable time...it will be very helpful to me. I am attaching the file again.

Kumar.
|
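The InputStreamReader suggestion in the reply above can be tried without the parser at all. The following is a minimal sketch using only the JDK; the class name ExplicitCharsetRead and the sample bytes are made up for illustration, and it does not touch the Page or Lexer classes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: reads a whole stream as text using an explicitly chosen charset. */
final class ExplicitCharsetRead {
    static String readAll(InputStream stream, Charset charset) {
        StringBuilder text = new StringBuilder();
        // The charset is passed to the reader's constructor, as suggested in the reply.
        try (Reader reader = new InputStreamReader(stream, charset)) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Sample input: a small fragment encoded as UTF-8.
        byte[] bytes = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);
        // Decoding with the right charset restores the original text.
        System.out.println(readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
        // Decoding with the wrong charset turns the accented letter into two characters.
        System.out.println(readAll(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1));
    }
}
```

The resulting String could then be handed to parser.setInputHTML(), which the thread already mentions, so the parser never has to guess the encoding.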
From: Derrick O. <der...@ro...> - 2007-07-30 22:33:30
|
The ISO-8859-1 encoding contains ASCII, you might try that. If there aren't any funny characters in the file it should work OK.
|
From: k <km...@re...> - 2007-07-30 17:39:44
|
Hi All,

Sorry for repeating my request but please help me on how to parse HTML file with 'UTF-8' encoding charset as parser.setEncoding("UTF-8") was not working for me.

Thanks very much in advance.

Kumar.
|
From: k <km...@re...> - 2007-07-30 12:25:45
|
Thanks a ton Derrick, for your message, your help is highly appreciated.

I have tried earlier using parser.setEncoding("UTF-8"), but it was also not working as expected. Today I have tried getting the content of the file in a string using parser.setInputHTML(getContentsAsString(testFile)). But it also did not work.

The only way it worked is if I open the HTML file outside in TextPad, save it again with Encoding 'ANSI', and then run my code with this new file.

Could you please suggest a way that I can do the above using htmlParser or by any other means? I tried reading the file a line at a time and using the following for the conversion.

    byte[] stringBytesUTF = line.getBytes("UTF-8");
    ansiString = new String(stringBytesUTF, "ANSI")

But it seems ANSI is not a valid argument. Any advice in this respect is highly valuable to me.

Thanking You,
Kumar.
|
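The conversion attempted above encodes an already-decoded line and then decodes it again with a charset name Java does not know. Converting between encodings usually goes the other way round: decode the raw bytes with the charset they were written in, then encode the text with the target charset. A minimal sketch, assuming the source bytes really are UTF-8; the class name Transcode is hypothetical, and ISO-8859-1 stands in for the target because it is guaranteed to exist on every JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: converts bytes from one character encoding to another. */
final class Transcode {
    static byte[] convert(byte[] input, Charset from, Charset to) {
        // Step 1: decode the raw bytes using the encoding they were written in.
        String text = new String(input, from);
        // Step 2: encode the decoded text using the target encoding.
        return text.getBytes(to);
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8); // 5 bytes
        byte[] latin1 = convert(utf8, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length + " bytes in, " + latin1.length + " bytes out");
    }
}
```

What Windows editors label "ANSI" is often the windows-1252 code page; whether that applies to this particular file is a guess. Characters that the target charset cannot represent are replaced by getBytes(), typically with a question mark, so a round trip through a single-byte encoding can lose text.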
From: Subir B. <sub...@re...> - 2007-07-30 06:58:00
|
Hi
   I'm new to HtmlParser. I want to get the code snippet between body tag parsing an html file.

For Example :

<html>
 <head>
  <title>
   title
  </title>
 </head>
 <body>
  <div id="test">
   <P ALIGN='LEFT'>
    <FONT FACE="SolaimanLipi" SIZE="3" COLOR="#000000">হোম  </FONT>
    Testing Using HtmlParser
   </P>

  </div>
 </body>
</html>

From the above html i want the snippet below :

<div id="test">
 <P ALIGN='LEFT'>
  <FONT FACE="SolaimanLipi" SIZE="3" COLOR="#000000">হোম  </FONT>
  Testing Using HtmlParser
 </P>

</div>

Can anybody send me sample code for that ?

Thanks in advance..

Regards....

Subir
|
From: Derrick O. <der...@ro...> - 2007-07-28 20:48:45
|
It appears the file is unicode, probably UTF-8, so you'll need to get the contents as a string yourself, or try parser.setEncoding ("UTF-8") before performing the parse. Some operating systems support a byte order mark prefix (like 0xFEFF) within the file to identify such files as other than plain ascii.

----- Original Message ----
From: k <km...@re...>
To: htm...@li...
Sent: Saturday, July 28, 2007 8:12:19 AM
Subject: [Htmlparser-user] Tag Nodes not getting recognized...Please Help

Hi All,

First of all thanks very much for your precious time. I hope I will get help from here, as I have no other way.

For more than 2 days, I was trying to parse (and process all nodes) one of my HTML files using different parsers available. But I was not able to get the Tag Nodes list only for this particular HTML file.

When I tried to process this HTML file with HtmlParser, it was not detecting the TagNodes, it was just detecting the whole html page as one TextNode. But when I try with other simple HTML files, it does detect TagNodes.

Please kindly help me out from this issue. Not sure if my HTML file character set is different? Or should I choose any encoding options?

Here is my code. Also attached is my HTML file. It has images but I am not attaching them.

    parser = new Parser("atest.htm");
    for (NodeIterator i = parser.elements(); i.hasMoreNodes();){
        processMyNodes(i.nextNode());
    }

    static void processMyNodes (Node node) throws ParserException {
        if (node instanceof TextNode) {
            TextNode text = (TextNode)node;
            System.out.println (text.getText ());
        }
        if (node instanceof RemarkNode) {
            RemarkNode remark = (RemarkNode)node;
        } else if (node instanceof TagNode) {
            TagNode tag = (TagNode)node;
            NodeList nl = tag.getChildren ();
            if (null != nl)
                for (NodeIterator i = nl.elements (); i.hasMoreNodes (); )
                    processMyNodes (i.nextNode ());
        }
    }

Kumar.
|
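Since the reply above mentions a byte order mark, one way to check whether a file carries one is to look at its first few bytes. The sketch below is only an illustration under that assumption; the class name BomSniffer is hypothetical, and it recognises just the UTF-8 and UTF-16 marks.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: guesses a charset from a leading byte order mark, if there is one. */
final class BomSniffer {
    /** Returns the charset implied by the BOM, or null when the data starts with none. */
    static Charset detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE; // note: UTF-32LE also begins with FF FE
        }
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) { // mask to compare as unsigned
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xFF, (byte) 0xFE, '<', 0, 'p', 0, '>', 0};
        System.out.println(detect(withBom));                           // UTF-16LE
        System.out.println(detect("<p>plain</p>".getBytes(StandardCharsets.US_ASCII))); // null
    }
}
```

If the file turned out to be UTF-16 rather than UTF-8, that could plausibly explain a whole page being read as one text node, since every other byte would be zero when read as a single-byte encoding; this is a guess, as the attached file is not part of the archive.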
From: Madhur K. T. <mad...@gm...> - 2007-07-27 05:33:58
|
Yes you can. Have a look at the Parser <http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html> class in HTMLParser - particularly the ctor Parser(String resource) <http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#Parser%28java.lang.String%29> - which allows you to create a parser for any resource.

You can also have a look at the sample programs <http://htmlparser.sourceforge.net/samples.html>. Most of them work on URLs, but the Parser sample might be of interest to you.

HTH.

--
Madhur Kumar Tanwani <http://madhurtanwani.googlepages.com>
Gebo <http://feeds.feedburner.com/%7Er/Gebo/%7E6/1>

****************************************************************
* Knowledge is of two kinds :
* We know a subject ourselves, or
* We know where we can find information on it
* -- Samuel Johnson
****************************************************************
|
From: Adam D. <ada...@go...> - 2007-07-26 15:45:55
|
Hi all, I am trying from my java program to parse an HTML file, but I was wondering if I could use HTML parser to do that? if yes can you please guide me to the right direction because to me it seems that it only works with URLs. Thank you for your time Adam |
From: Derrick O. <der...@ro...> - 2007-07-25 23:24:32
|
You can add your own tags to handle tags that are not automatically nested as described in the FAQ: http://htmlparser.sourceforge.net/faq.html#composite
|
From: <a....@un...> - 2007-07-25 16:37:26
|
Hi,

I don't understand how I can set up a node filter. I have to parse an HTML file that contains a <pre> tag, but the library's filter doesn't recognize it: there is text between <pre> and </pre>, but the filter doesn't find it.

Can someone help me?

Bye,
Antonio
|
From: Alberto <da...@li...> - 2007-07-16 09:10:35
|
Hi everyone! I have to start a project: the goal is to analyze an HTML page (tags and attributes) and then possibly make some modifications (adding tags, modifying attributes...). Is the "html parser" the right choice for my project? Thanks for your answers and suggestions! Bye, Alberto |
From: Derrick O. <der...@ro...> - 2007-07-11 15:28:09
|
Cinzia,

If you have the complete list of nodes in a node list and are using this to filter, you can find your node just by stepping through the list. So let's say your list is all_nodes, and your table filter code is something like this:

    NodeList tables = all_nodes.extractAllNodesThatMatch (new MyTableFilter ());

then you can step through the nodes in all_nodes until you reach the table, and the prior node is the one you want:

    Node previous = null;
    Node target = tables.elementAt (0); // or cycle through the results
    for (int i = 0; i < all_nodes.size (); i++)
    {
        Node candidate = all_nodes.elementAt (i);
        if (candidate == target)
            break;
        else
            previous = candidate;
    }

Here, previous is the node before your table, or null if the table is the first node in the list.

Derrick

----- Original Message ----
From: "c....@ar..." <c....@ar...>
To: htm...@li...
Sent: Wednesday, July 11, 2007 10:53:23 AM
Subject: [Htmlparser-user] No previous sibling

Thanks Derrick! It worked! But now I have another question. I think that if my node doesn't have a parent node, the previousSibling and nextSibling methods do not work. Is that right? This is my problem: as I said, I have no html, head, or body tag. I have some table tags, but I need to access the previous row (or node) just before the table. I filtered the page with the TagNameFilter("table"), but I do not have any parent node. What can I do? Thanks again! Cinzia |
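The walk Derrick describes does not depend on the library's node types, so it can be tried out on any ordered list. The following is a minimal, library-free sketch: the class name PreviousNodeDemo and the helper previousOf are hypothetical, and plain strings stand in for parsed nodes. One deliberate difference from the loop in the message is that this version returns null when the target never appears, instead of leaving the last element in previous.

```java
import java.util.Arrays;
import java.util.List;

public class PreviousNodeDemo {

    /**
     * Returns the element immediately before target in all, compared by
     * identity (==) as in the loop from the message. Returns null when
     * target is the first element or does not occur in the list at all.
     * (Hypothetical helper, not part of HTML Parser.)
     */
    public static <T> T previousOf(List<T> all, T target) {
        T previous = null;
        for (T candidate : all) {
            if (candidate == target) {
                return previous;
            }
            previous = candidate;
        }
        return null; // target never seen: no meaningful predecessor
    }

    public static void main(String[] args) {
        // Plain strings stand in for parsed nodes in document order.
        String heading = "<h2>";
        String text = "Totals";
        String table = "<table>";
        List<String> allNodes = Arrays.asList(heading, text, table);

        System.out.println(previousOf(allNodes, table));   // the node just before the table
        System.out.println(previousOf(allNodes, heading)); // null, nothing precedes it
    }
}
```

With real parser output the same loop would run over the flat node list, which is why it works even when the table has no parent node.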
From: <c....@ar...> - 2007-07-11 14:56:28
|
Thanks Derrick! It worked! But now I have another question. I think that if my node doesn't have a parent node, the previousSibling and nextSibling methods do not work. Is that right? This is my problem: as I said, I have no html, head, or body tag. I have some table tags, but I need to access the previous row (or node) just before the table. I filtered the page with the TagNameFilter("table"), but I do not have any parent node. What can I do? Thanks again! Cinzia |
From: Madhur K. T. <mad...@gm...> - 2007-07-11 05:08:42
|
One way to do it is to implement a custom NodeVisitor. This lets you capture all remark nodes in the input and work on them. Here is a small example:

    public class RemarkNodeVisitor extends NodeVisitor
    {
        public void visitRemarkNode (Remark remark)
        {
            // do your stuff here; remark is the actual remark node in the input
        }
    }

The NodeVisitor javadoc <http://htmlparser.sourceforge.net/javadoc/org/htmlparser/visitors/NodeVisitor.html> has the details of the class. HTH.

c....@ar... wrote:
> Hi,
> I'm new to this list so first of all... hello!
> Anyway, sorry for my english but it isn't my mother tongue.
>
> I have an html page but with no html, head nor body tag. I have only the
> code (that will be put in another page's body). In this page I have some
> html comments which I need to reach and to parse.
> For example:
>
> <!-- TEMPLATE TIPO 1 TABELLA id=28 version=8 -->
>
> I need to get all these comments and access to their attributes. Which
> kind of filter should I use? Or how can I do?
> Thank you!
>
> Cinzia

--
Madhur Kumar Tanwani <http://madhurtanwani.googlepages.com> |
From: Derrick O. <der...@ro...> - 2007-07-11 00:30:14
|
Hi Cinzia,

There is no filter specifically for remark nodes, so you'll need to make your own. Start with an example like TagNameFilter.java and change the class name and the accept() method to return true for remarks. Something like this should work:

    public boolean accept (Node node)
    {
        return (node instanceof Remark);
    }

Using that filter should give you all the comments in the page:

    NodeList remarks = parser.parse (new MyRemarkFilter ());

Then the trouble begins. Remark nodes do not parse the content, so you will only be able to get at the entire contents with the getText() method. Then you either have to parse the text yourself for the 'attributes' or pervert the code that handles attribute parsing in the Lexer class to do the attribute parsing. One way to do that would be to enclose the text from the remark in a fake html tag and parse that. Something like this might work:

    parser = new Parser ("<html " + remark_text + " >");
    Tag tag = (Tag) parser.parse (null).elementAt (0);
    Vector attributes = tag.getAttributesEx ();

Derrick

----- Original Message ----
From: "c....@ar..." <c....@ar...>
To: htm...@li...
Sent: Tuesday, July 10, 2007 8:06:57 AM
Subject: [Htmlparser-user] Retrieving html comments

Hi, I'm new to this list, so first of all... hello! Anyway, sorry for my English, but it isn't my mother tongue. I have an HTML page with no html, head, or body tag. I have only the code (that will be put in another page's body). In this page I have some HTML comments which I need to reach and parse. For example: <!-- TEMPLATE TIPO 1 TABELLA id=28 version=8 --> I need to get all these comments and access their attributes. Which kind of filter should I use? Or how else can I do it? Thank you! Cinzia |
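For the first option mentioned in the reply, parsing the remark text yourself, a small regular expression may be enough when the comments follow the name=value shape of the example. This is a sketch under stated assumptions, not HTML Parser code: the class RemarkAttributes and its parse method are hypothetical, the input is assumed to be the string returned by getText() on a remark, and the grammar (a name, an equals sign, then a quoted or bare value) is a guess based on the single sample comment in the thread. Bare words such as TEMPLATE or TABELLA are ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RemarkAttributes {

    // name=value, where value is double-quoted, single-quoted, or a bare token.
    // (Assumed grammar; adjust if the real comments look different.)
    private static final Pattern PAIR =
        Pattern.compile("([A-Za-z_][\\w-]*)\\s*=\\s*(\"[^\"]*\"|'[^']*'|[^\\s\"']+)");

    /** Collects name=value pairs from remark text, in order of appearance. */
    public static Map<String, String> parse(String remarkText) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(remarkText);
        while (m.find()) {
            String value = m.group(2);
            // Strip one pair of surrounding quotes, if present.
            if (value.length() >= 2
                    && (value.charAt(0) == '"' || value.charAt(0) == '\'')
                    && value.charAt(value.length() - 1) == value.charAt(0)) {
                value = value.substring(1, value.length() - 1);
            }
            attributes.put(m.group(1), value);
        }
        return attributes;
    }

    public static void main(String[] args) {
        // Text of the sample comment from the thread, without the <!-- --> markers.
        String text = " TEMPLATE TIPO 1 TABELLA id=28 version=8 ";
        System.out.println(parse(text)); // prints {id=28, version=8}
    }
}
```

Compared with wrapping the text in a fake tag, this keeps only the pairs that actually carry a value, at the cost of maintaining the pattern by hand.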
From: <c....@ar...> - 2007-07-10 12:09:43
|
Hi, I'm new to this list, so first of all... hello! Anyway, sorry for my English, but it isn't my mother tongue. I have an HTML page with no html, head, or body tag. I have only the code (that will be put in another page's body). In this page I have some HTML comments which I need to reach and parse. For example: <!-- TEMPLATE TIPO 1 TABELLA id=28 version=8 --> I need to get all these comments and access their attributes. Which kind of filter should I use? Or how else can I do it? Thank you! Cinzia |