htmlparser-user Mailing List for HTML Parser (Page 25)
Brought to you by: derrickoswald
Archive (messages per month):

Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2001 | | | | | | | | | | | (1) | |
2002 | (7) | | (9) | (50) | (20) | (47) | (37) | (32) | (30) | (11) | (37) | (47) |
2003 | (31) | (70) | (67) | (34) | (66) | (25) | (48) | (43) | (58) | (25) | (10) | (25) |
2004 | (38) | (17) | (24) | (25) | (11) | (6) | (24) | (42) | (13) | (17) | (13) | (44) |
2005 | (10) | (16) | (16) | (23) | (6) | (19) | (39) | (15) | (40) | (49) | (29) | (41) |
2006 | (28) | (24) | (52) | (41) | (31) | (34) | (22) | (12) | (11) | (11) | (11) | (4) |
2007 | (39) | (13) | (16) | (24) | (13) | (12) | (21) | (61) | (31) | (13) | (32) | (15) |
2008 | (7) | (8) | (14) | (12) | (23) | (20) | (9) | (6) | (2) | (7) | (3) | (2) |
2009 | (5) | (8) | (10) | (22) | (85) | (82) | (45) | (28) | (26) | (50) | (8) | (16) |
2010 | (3) | (11) | (39) | (56) | (80) | (64) | (49) | (48) | (16) | (3) | (5) | (5) |
2011 | (13) | | (1) | (7) | (7) | (7) | (7) | (8) | | (6) | (2) | |
2012 | (5) | | (3) | (3) | (4) | (8) | (1) | (5) | (10) | (3) | (2) | (4) |
2013 | (4) | (2) | (7) | (7) | (6) | (7) | (3) | | (1) | | | |
2014 | | (2) | (1) | | (3) | (1) | | | (1) | (4) | (2) | (4) |
2015 | (4) | (2) | (8) | (7) | (6) | (7) | (3) | (1) | (1) | (4) | (3) | (4) |
2016 | (4) | (6) | (9) | (9) | (6) | (1) | (1) | | | (1) | (1) | (1) |
2017 | | (1) | (3) | (1) | | (1) | (2) | (3) | (6) | (3) | (2) | (5) |
2018 | (3) | (13) | (28) | (5) | (4) | (2) | (2) | (8) | (2) | (1) | (5) | (1) |
2019 | (8) | (1) | | (1) | (4) | | (1) | | | | (2) | (2) |
2020 | | | (1) | (1) | (1) | (2) | (1) | (1) | (1) | | (1) | (1) |
2021 | (3) | (2) | (1) | (1) | (2) | (1) | (2) | (1) | | | | |
2022 | | | | (1) | (1) | (1) | | (1) | | | | |
2023 | (2) | | | | | | | (1) | | | | |
2024 | (2) | | | | | | | | | | | |
2025 | | | | | | (1) | | | | | | |
From: Ishmael R. <sak...@gm...> - 2007-09-26 16:58:19
Hello out there
From: <mic...@no...> - 2007-09-26 14:31:01
I am having trouble parsing HTML-tagged text. It seems that I can retrieve a node, but that element does not have the child nodes I expect.

    String table = "<tbody>\n" +
            "<tr>\n" +
            "<td><span>brain_normal_GSM80627</span></td>\n" +
            "<td><span>normal</span></td>\n" +
            "<td><span>cerebral cortex</span></td>\n" +
            "<td><span>brain</span></td>\n" +
            "</tr>\n" +
            "</tbody>\n";
    Parser parser = new Parser(new Lexer(table));
    try {
        Node tBodyNode = parser.extractAllNodesThatMatch(new TagNameFilter("tbody")).elementAt(0);
        System.out.println(tBodyNode.getChildren()); // Prints null <---------------
    } catch (ParserException e) {
        e.printStackTrace();
    }

Does HTML Parser not handle text input or partial HTML files well?
From: Derrick O. <der...@ro...> - 2007-09-26 11:54:42
Rupanu,

I'm not sure where your problem lies. The exception was raised because the encoding of the stream didn't agree with the stated contents of the HTML within it. The code in ConnectionManager that opens a disk file - URLConnection openConnection (String string) - uses the override - URLConnection openConnection (URL url) - with the url being the file name prefixed by "file://localhost".

So it's up to the JVM and operating system to figure out the encoding of the text file on disk. Apparently, the file was not written with the correct encoding bytes at the beginning of the file or something, so this couldn't be figured out and it was opened with ISO-8859-1 instead of UTF-8 encoding.

To fix it, the text file of HTML needs to be written differently, or you need to open it differently, using perhaps your own stream passed to the Page constructor.

Derrick
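Derrick's suggestion above - opening the file yourself with the right encoding before the parser sees it - can be sketched with plain JDK streams. The helper below is a hypothetical illustration (the name BomAwareReader and its API are not part of HTML Parser): it peeks at the first bytes for a UTF-8 byte-order mark and, if one is present, consumes it and selects UTF-8; otherwise it falls back to a default charset. The resulting Reader could then feed your own parsing pipeline.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class BomAwareReader {
    // Peek at the first bytes of the stream. If they are the UTF-8
    // byte-order mark (EF BB BF), consume it and decode as UTF-8;
    // otherwise push the bytes back and use the caller's default charset.
    static Reader open(InputStream in, String defaultCharset) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        int b;
        while (n < 3 && (b = pin.read()) != -1) {
            head[n++] = (byte) b;
        }
        if (n == 3 && (head[0] & 0xFF) == 0xEF
                   && (head[1] & 0xFF) == 0xBB
                   && (head[2] & 0xFF) == 0xBF) {
            return new InputStreamReader(pin, "UTF-8"); // BOM already consumed
        }
        if (n > 0) {
            pin.unread(head, 0, n); // not a BOM: give the bytes back
        }
        return new InputStreamReader(pin, defaultCharset);
    }
}
```

Deciding the charset up front like this sidesteps the mid-stream EncodingChangeException, since the parser never has to switch encodings after the fact.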
From: Rupanu R. <rup...@ya...> - 2007-09-26 06:07:33
Hello there,

Well, I copied and pasted the code you gave, but there seems to be an issue with encoding. I am trying to read from a non-Unicode htm/html file, extract its contents and write them into a text file. Here's the code:

    String inputfile = args[0];
    Parser parser = new Parser (inputfile);
    StringBean sb = new StringBean ();
    parser.visitAllNodesWith (sb);
    String content = sb.getStrings();
    String outputfilename = "E:\\outputfile.txt";
    OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(outputfilename)); //, "UTF8"
    osw.write(content);
    osw.close();

and here is the exception I get:

    org.htmlparser.util.EncodingChangeException: character mismatch (new: [0xfeff] != old: [0xefï]) for encoding change from ISO-8859-1 to UTF-8 at character offset 0

However, I then wrote the following code, which served my purpose to some extent. But could you please explain what the issue was there and how I can handle the encoding of an htm/html file (offline/saved on my hard drive)?

    StringExtractor strext = new StringExtractor(input);
    String content = strext.extractStrings(false);

    String outputfilename = "output.txt";
    OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(outputfilename), "UTF8");
    osw.write(content);
From: Derrick O. <der...@ro...> - 2007-09-25 22:38:59
You should be able to get the parent node and from there navigate past the original node.
If it was as simple as you say, this should work...

    gender_node = my_heading_node.getParent ().getChildren ().elementAt (1); // 0 is the header
    address_node = my_heading_node.getParent ().getChildren ().elementAt (3);

You will need to watch out for extraneous whitespace and other nodes like <br/>.
From: Angelo C. <ang...@ya...> - 2007-09-25 10:16:56
Hi,

I have some lines like the following. I can use:

    a1 = nl.extractAllNodesThatMatch(new TagNameFilter("h1"), true);

to reach <h1></h1>, and the gender and address are always below the <h1> line. How do I access the two lines right after the <h1> tag? Thanks.

A.C.

    <td ><h1>Angelo</h1>
    Male<br />San Jose, California <br />
    </td>
From: Nic S. <oo...@gm...> - 2007-09-24 23:10:21
Visit the old website; links in the old website work.

On 9/24/07, Derrick Oswald <der...@ro...> wrote:
> Sorry for the broken link. My bad. It's under 'old'
> <http://htmlparser.sourceforge.net/old/javadoc/org/htmlparser/parserapplications/LinkExtractor.html>.
From: Derrick O. <der...@ro...> - 2007-09-24 11:54:19
Sorry for the broken link. My bad. It's under 'old'.

It actually only pointed at the JavaDoc for the StringExtractor class.
Basically, it did this:

    Parser parser = new Parser (<url from command line>);
    StringBean sb = new StringBean ();
    parser.visitAllNodesWith (sb);
    System.out.println (sb.getStrings());
From: Rupanu R. <rup...@ya...> - 2007-09-24 08:14:04
Hello there,

I just got started and was looking for the sample code given on the website, but the link for the String extractor doesn't seem to be working. It would be great if someone could provide me an alternate link or send me an email with the Text Extractor code.

Thanks and regards
Rupanu Pal
From: Derrick O. <der...@ro...> - 2007-09-17 11:48:00
The NodeList class has elementAt() and size() methods.

    for (int i = 0; i < list.size(); i++)
        node = list.elementAt(i);

You'll need to cast them to LinkTag nodes and then use getLink().
From: Mattia T. <mat...@gm...> - 2007-09-17 11:44:59
Hi,

try this: in a new class, after importing:

    import org.htmlparser.Node;
    import org.htmlparser.Parser;
    import org.htmlparser.tags.*;
    import org.htmlparser.util.*;
    import org.htmlparser.visitors.ObjectFindingVisitor;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.Vector;

insert the following method:

    protected URL[] extractLinks(String url) throws ParserException {
        Parser parser;
        Vector vector;
        LinkTag link;
        URL[] ret;

        parser = new Parser(url);
        ObjectFindingVisitor visitor = new ObjectFindingVisitor(LinkTag.class);
        parser.visitAllNodesWith(visitor);
        Node[] nodes = visitor.getTags();
        vector = new Vector();
        for (int i = 0; i < nodes.length; i++)
            try {
                link = (LinkTag) nodes[i];
                System.out.println(link.getLink() + " " + link.getLinkText());
                vector.add(new URL(link.getLink()));
            } catch (MalformedURLException murle) {
                murle.printStackTrace();
            }
        ret = new URL[vector.size()];
        vector.copyInto(ret);
        return (ret);
    }

Hope this helps.

Cheers
Mattia
From: Nic S. <oo...@gm...> - 2007-09-17 03:56:26
Hi,

I created a NodeList which contains hyperlinks extracted from an HTML webpage. I need to be able to iterate through every single node and extract its href. Wondering if anyone can help me with:

1. how to iterate the nodes one by one
2. extracting the href

    NodeList URLs = ExtractHyperLinks(HTML);
    /*
     * at this stage we have all:
     * <A HREF="link1">something1</A>
     * <A HREF="link2">something2</A>
     * <A HREF="link3">something3</A>
     * <A HREF="link4">something4</A>
     * <A HREF="link5">something5</A>
     */
From: Derrick O. <der...@ro...> - 2007-09-12 12:14:20
This has been fixed in the trunk version of the subversion repository, but not yet released as a package, sorry.
See bug #1761484 tag.setAttribute() not compatible with <tag/>
From: Karsten O. <wid...@t-...> - 2007-09-12 09:19:04
Hello,

HTMLParser does not work as expected. If some XML-conforming tags like <br/> are closed immediately, the following happens when an attribute is added:

    <br /id="test">

I would instead expect this: <br id="test"/>

The attached test can be used to show the problem.

Regards,
Karsten

    import org.htmlparser.Node;
    import org.htmlparser.Parser;
    import org.htmlparser.nodes.TagNode;
    import org.htmlparser.util.NodeIterator;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;
    import org.junit.Test;

    public class HTMLParserBug {

        private final String invalid = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">"
                + "<html>" + "<head>"
                + "<meta http-equiv=\"content-type\" content=\"text/html; charset=ISO-8859-1\">"
                + "</head>" + "<body>"
                + "Text" + "<br/>" + "Text" + "</body>" + "</html>";

        @Test
        public void testClosingTag() {
            try {
                Parser parser = Parser.createParser(invalid, "ISO-8859-1");
                NodeIterator it = parser.elements();
                processNode(it);
            } catch (ParserException e) {
                e.printStackTrace();
            }
        }

        private static void processNode(NodeIterator it) throws ParserException {
            while (it.hasMoreNodes()) {
                Node node = it.nextNode();
                System.out.println(node);
                if (node instanceof TagNode) {
                    ((TagNode) node).setAttribute("id", "test");
                    System.out.println(node);
                    NodeList list = ((TagNode) node).getChildren();
                    if (list != null) {
                        processNode(list.elements());
                    }
                }
            }
        }
    }
From: william l. <wil...@ya...> - 2007-08-28 04:27:30
Just like in the following:

    <li>info</li>
    <li>info</li>
    <li>info</li>
    <li>info</li>

or

    <a title="taught" href="/index.html" rel="section">info</a> info
    <a title="research-led" href="/index.html" rel="section">info</a> info.

How can I extract this info in a loop? Thanks in advance!
From: Derrick O. <der...@ro...> - 2007-08-24 22:29:07
You probably have to hit the login page first.
Then use the same ConnectionManager to access the desired page.
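For readers without the Apache HttpClient dependency discussed later in this thread, the cookie-carrying part of this advice - log in once, then reuse the same session for the protected page - can also be sketched with the JDK's own java.net.CookieManager. This is a hypothetical outline, not HTML Parser's ConnectionManager API; the SessionCookies name is made up for illustration.

```java
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpCookie;
import java.net.URI;
import java.util.List;

public class SessionCookies {
    // Install a process-wide CookieManager. Once set, HttpURLConnection
    // stores cookies from responses (e.g. the login page) and replays
    // them on later requests to the same host.
    static CookieManager install() {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);
        return manager;
    }

    // Record a cookie by hand, as a login response's Set-Cookie header would.
    static void remember(CookieManager manager, String url, String name, String value) {
        manager.getCookieStore().add(URI.create(url), new HttpCookie(name, value));
    }

    // The cookies that would accompany a request to the given URL.
    static List<HttpCookie> cookiesFor(CookieManager manager, String url) {
        return manager.getCookieStore().get(URI.create(url));
    }
}
```

With the manager installed via CookieHandler.setDefault, later URL.openConnection() requests to the same host pick the stored cookies up automatically, which is the behavior the login sequence in this thread depends on.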
From: <mic...@Ta...> - 2007-08-24 14:36:12
Derrick,

I'm trying this approach (strictly HtmlParser) as well, but I can't get logged in. The array of URLs is that of a page shown when not logged in.

Can you suggest anything else?
Thanks,
Mick

    URL[] urlArray;

    ConnectionManager connectionManager = new ConnectionManager();
    url = new URL("www.someloginpage.com");
    connectionManager.openConnection(url);

    connectionManager.setRedirectionProcessingEnabled(true);
    connectionManager.setCookieProcessingEnabled(true);
    connectionManager.setUser(USER_NAME);
    connectionManager.setPassword(PASSWORD);

    // go to link with stuff
    url = new URL("a page beyond the login page");
    connectionManager.openConnection(url);
    linkBean.setConnection(connectionManager.openConnection(url));
    urlArray = linkBean.getLinks(); // get all links

> You might try setRedirectionProcessingEnabled(true). Often the first URL
> is only a gateway.
> Also, it's setCookieProcessingEnabled(true), not addCookies.
From: <mic...@Ta...> - 2007-08-24 13:35:25
Thanks for all the information!

In your first response to my post, in your code, you call these methods:

    getCookiesArrayList(client)
    client.getResponseHeaders()
    getRequestHeaders()

I removed these calls, but I guess now I need to use them. Can you post these methods?

> Ok. So this works for you, and you are in exactly the same situation I was in some months ago.
> I needed to parse some pages, and when I faced pages protected by login I used HttpClient. Sorry for confusing you a bit! :o)
>
> The site I'm working on requires 2 cookies to be sent to pages protected by login, so the first step is logging in with the HttpClient commons lib, taking the cookies and sending those inside the HTTP headers to the second page (the page protected by login), so I can enter and parse that second page.
>
> I think you have to investigate whether the site you are trying to enter has a particular system to determine if you are logged in or not. Try to see from Firefox: right-click on the page and select "Page info"; you will see headers, forms and so on. Let's see if there are cookies.
>
> Supposing you are logged in and you took a cookie that tells the site you are logged in, I use the HtmlParser in this way (similarly as you do in your code):
>
>     public Hashtable getMySecondPage(String url, ArrayList cookies, Header[] headers) {
>         logger.info("url: " + url);
>         try {
>             // I pass headers and cookies making the request so the site sees that I'm logged in
>             setUpConnectionManager(cookies, headers);
>             Parser parser = new Parser(url);
>             NodeList nodelist = parser.parse(null);
>             for (int i = 0; i < nodelist.size(); i++) {
>                 System.out.print(nodelist.toHtml());
>             }
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>         return null;
>     }
>
> If the site doesn't use cookies, try just making 2 requests sequentially: FIRST the login call, then the call to the page protected by login.
>
> Cheers
>
> Mattia
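As a side note on the login POST itself: the request body a form submission sends is just application/x-www-form-urlencoded text, which can be built with nothing but the JDK. A minimal sketch (the FormBody name is made up for illustration; it is not part of HTML Parser or HttpClient):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FormBody {
    // Build an application/x-www-form-urlencoded body from name/value
    // pairs, i.e. the bytes a browser sends when a login form is posted.
    static String encode(String[][] params) {
        StringBuilder body = new StringBuilder();
        try {
            for (String[] pair : params) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(pair[0], "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(pair[1], "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always available
        }
        return body.toString();
    }
}
```

Writing such a string to the output stream of a POST connection is equivalent to what the NameValuePair-based HttpClient code in this thread does under the hood.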