htmlparser-user Mailing List for HTML Parser (Page 25)
Brought to you by: derrickoswald
Archive (messages per month):

Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2001 | | | | | | | | | | | (1) | |
2002 | (7) | | (9) | (50) | (20) | (47) | (37) | (32) | (30) | (11) | (37) | (47) |
2003 | (31) | (70) | (67) | (34) | (66) | (25) | (48) | (43) | (58) | (25) | (10) | (25) |
2004 | (38) | (17) | (24) | (25) | (11) | (6) | (24) | (42) | (13) | (17) | (13) | (44) |
2005 | (10) | (16) | (16) | (23) | (6) | (19) | (39) | (15) | (40) | (49) | (29) | (41) |
2006 | (28) | (24) | (52) | (41) | (31) | (34) | (22) | (12) | (11) | (11) | (11) | (4) |
2007 | (39) | (13) | (16) | (24) | (13) | (12) | (21) | (61) | (31) | (13) | (32) | (15) |
2008 | (7) | (8) | (14) | (12) | (23) | (20) | (9) | (6) | (2) | (7) | (3) | (2) |
2009 | (5) | (8) | (10) | (22) | (85) | (82) | (45) | (28) | (26) | (50) | (8) | (16) |
2010 | (3) | (11) | (39) | (56) | (80) | (64) | (49) | (48) | (16) | (3) | (5) | (5) |
2011 | (13) | | (1) | (7) | (7) | (7) | (7) | (8) | | (6) | (2) | |
2012 | (5) | | (3) | (3) | (4) | (8) | (1) | (5) | (10) | (3) | (2) | (4) |
2013 | (4) | (2) | (7) | (7) | (6) | (7) | (3) | | (1) | | | |
2014 | | (2) | (1) | | (3) | (1) | | | (1) | (4) | (2) | (4) |
2015 | (4) | (2) | (8) | (7) | (6) | (7) | (3) | (1) | (1) | (4) | (3) | (4) |
2016 | (4) | (6) | (9) | (9) | (6) | (1) | (1) | | | (1) | (1) | (1) |
2017 | | (1) | (3) | (1) | | (1) | (2) | (3) | (6) | (3) | (2) | (5) |
2018 | (3) | (13) | (28) | (5) | (4) | (2) | (2) | (8) | (2) | (1) | (5) | (1) |
2019 | (8) | (1) | | (1) | (4) | | (1) | | | | (2) | (2) |
2020 | | | (1) | (1) | (1) | (2) | (1) | (1) | (1) | | (1) | (1) |
2021 | (3) | (2) | (1) | (1) | (2) | (1) | (2) | (1) | | | | |
2022 | | | | (1) | (1) | (1) | | (1) | | | | |
2023 | (2) | | | | | | | (1) | | | | |
2024 | (2) | | | | | | | | | | | |
2025 | | | | | | (1) | | | | | | |
From: Ishmael R. <sak...@gm...> - 2007-09-26 16:58:19
Hello out there
From: <mic...@no...> - 2007-09-26 14:31:01
I am having trouble parsing HTML-tagged text. It seems that I can retrieve a node, but that element does not have the child nodes I expect.

    String table = "<tbody>\n" +
            "<tr>\n" +
            "<td><span>brain_normal_GSM80627</span></td>\n" +
            "<td><span>normal</span></td>\n" +
            "<td><span>cerebral cortex</span></td>\n" +
            "<td><span>brain</span></td>\n" +
            "</tr>\n" +
            "</tbody>\n";
    Parser parser = new Parser(new Lexer(table));
    try {
        Node tBodyNode = parser.extractAllNodesThatMatch(new TagNameFilter("tbody")).elementAt(0);
        System.out.println(tBodyNode.getChildren()); // Prints null <---------------
    } catch (ParserException e) {
        e.printStackTrace();
    }

Does HTML Parser not handle text input or partial HTML files well?
From: Derrick O. <der...@ro...> - 2007-09-26 11:54:42
Rupanu,

I'm not sure where your problem lies. The exception was raised because the encoding of the stream didn't agree with the stated contents of the HTML within it. The code in ConnectionManager that opens a disk file - URLConnection openConnection (String string) - uses the override - URLConnection openConnection (URL url) - with the url being the file name prefixed by "file://localhost".

So it's up to the JVM and operating system to figure out the encoding of the text file on disk. Apparently, the file was not written with the correct encoding bytes at the beginning of the file or something, so this couldn't be figured out and it was opened with ISO-8859-1 instead of UTF-8 encoding.

To fix it, the text file of HTML needs to be written differently, or you need to open it differently, using perhaps your own stream passed to the Page constructor.

Derrick
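Derrick's suggestion above - opening the file yourself with the right encoding before the parser sees it - can be sketched with plain JDK streams. The helper below is a hypothetical illustration (the name BomAwareReader and its API are not part of HTML Parser): it peeks at the first bytes for a UTF-8 byte-order mark and, if one is present, consumes it and selects UTF-8; otherwise it falls back to a default charset. The resulting Reader could then feed your own parsing pipeline.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class BomAwareReader {
    // Peek at the first bytes of the stream. If they are the UTF-8
    // byte-order mark (EF BB BF), consume it and decode as UTF-8;
    // otherwise push the bytes back and use the caller's default charset.
    static Reader open(InputStream in, String defaultCharset) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        int b;
        while (n < 3 && (b = pin.read()) != -1) {
            head[n++] = (byte) b;
        }
        if (n == 3 && (head[0] & 0xFF) == 0xEF
                   && (head[1] & 0xFF) == 0xBB
                   && (head[2] & 0xFF) == 0xBF) {
            return new InputStreamReader(pin, "UTF-8"); // BOM already consumed
        }
        if (n > 0) {
            pin.unread(head, 0, n); // not a BOM: give the bytes back
        }
        return new InputStreamReader(pin, defaultCharset);
    }
}
```

Deciding the charset up front like this sidesteps the mid-stream EncodingChangeException, since the parser never has to switch encodings after the fact.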
From: Rupanu R. <rup...@ya...> - 2007-09-26 06:07:33
Hello there,

Well, I copied and pasted the code you gave, but there seems to be an issue with encoding. I am trying to read from a non-Unicode htm/html file, extract its contents and write them into a text file. Here's the code:

    String inputfile = args[0];
    Parser parser = new Parser (inputfile);
    StringBean sb = new StringBean ();
    parser.visitAllNodesWith (sb);
    String content = sb.getStrings();
    String outputfilename = "E:\\outputfile.txt";
    OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(outputfilename)); //, "UTF8"
    osw.write(content);
    osw.close();

and here is the exception I get:

    org.htmlparser.util.EncodingChangeException: character mismatch (new: [0xfeff] != old: [0xefï]) for encoding change from ISO-8859-1 to UTF-8 at character offset 0

However, I then wrote the following code, which served my purpose to some extent. But could you please explain what the issue was there and how I can handle the encoding of an htm/html file (offline/saved on my hard drive)?

    StringExtractor strext = new StringExtractor(input);
    String content = strext.extractStrings(false);

    String outputfilename = "output.txt";
    OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(outputfilename), "UTF8");
    osw.write(content);
From: Derrick O. <der...@ro...> - 2007-09-25 22:38:59
You should be able to get the parent node and from there navigate past the original node.
If it was as simple as you say, this should work...

    gender_node = my_heading_node.getParent ().getChildren ().elementAt (1); // 0 is the header
    address_node = my_heading_node.getParent ().getChildren ().elementAt (3);

You will need to watch out for extraneous whitespace and other nodes like <br/>.
From: Angelo C. <ang...@ya...> - 2007-09-25 10:16:56
Hi,

I have some lines like the following. I can use:

    a1 = nl.extractAllNodesThatMatch(new TagNameFilter("h1"), true);

to reach <h1></h1>, and the gender and address are always below the <h1> line. How do I access the two lines right after the <h1> tag? Thanks.

A.C.

    <td ><h1>Angelo</h1>
    Male<br />San Jose, California <br />
    </td>
From: Nic S. <oo...@gm...> - 2007-09-24 23:10:21
Visit the old website; links in the old website work.

On 9/24/07, Derrick Oswald <der...@ro...> wrote:
> Sorry for the broken link. My bad. It's under 'old'
> <http://htmlparser.sourceforge.net/old/javadoc/org/htmlparser/parserapplications/LinkExtractor.html>.
From: Derrick O. <der...@ro...> - 2007-09-24 11:54:19
Sorry for the broken link. My bad. It's under 'old'.

It actually only pointed at the JavaDoc for the StringExtractor class.
Basically, it did this:

    Parser parser = new Parser (<url from command line>);
    StringBean sb = new StringBean ();
    parser.visitAllNodesWith (sb);
    System.out.println (sb.getStrings());
From: Rupanu R. <rup...@ya...> - 2007-09-24 08:14:04
Hello there,

I just got started and was looking for the sample code given on the website, but the link for the String extractor doesn't seem to be working. It would be great if someone could provide me an alternate link or send me an email with the Text Extractor code.

Thanks and regards
Rupanu Pal
From: Derrick O. <der...@ro...> - 2007-09-17 11:48:00
The NodeList class has elementAt() and size() methods.

    for (int i = 0; i < list.size(); i++)
        node = list.elementAt(i);

You'll need to cast them to LinkTag nodes and then use getLink().
From: Mattia T. <mat...@gm...> - 2007-09-17 11:44:59
Hi,

try this: in a new class, after importing:

    import org.htmlparser.Node;
    import org.htmlparser.Parser;
    import org.htmlparser.tags.*;
    import org.htmlparser.util.*;
    import org.htmlparser.visitors.ObjectFindingVisitor;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.Vector;

insert the following method:

    protected URL[] extractLinks(String url) throws ParserException {
        Parser parser;
        Vector vector;
        LinkTag link;
        URL[] ret;

        parser = new Parser(url);
        ObjectFindingVisitor visitor = new ObjectFindingVisitor(LinkTag.class);
        parser.visitAllNodesWith(visitor);
        Node[] nodes = visitor.getTags();
        vector = new Vector();
        for (int i = 0; i < nodes.length; i++)
            try {
                link = (LinkTag) nodes[i];
                System.out.println(link.getLink() + " " + link.getLinkText());
                vector.add(new URL(link.getLink()));
            } catch (MalformedURLException murle) {
                murle.printStackTrace();
            }
        ret = new URL[vector.size()];
        vector.copyInto(ret);
        return (ret);
    }

Hope this helps.

Cheers
Mattia
From: Nic S. <oo...@gm...> - 2007-09-17 03:56:26
Hi,

I created a NodeList which contains hyperlinks extracted from an HTML webpage. I need to be able to iterate through every single node and extract its href. Wondering if anyone can help me with:

1. how to iterate the nodes one by one
2. extracting the href

    NodeList URLs = ExtractHyperLinks(HTML);
    /*
     * at this stage we have all:
     * <A HREF="link1">something1</A>
     * <A HREF="link2">something2</A>
     * <A HREF="link3">something3</A>
     * <A HREF="link4">something4</A>
     * <A HREF="link5">something5</A>
     */
From: Derrick O. <der...@ro...> - 2007-09-12 12:14:20
This has been fixed in the trunk version of the subversion repository, but not yet released as a package, sorry.
See bug #1761484 tag.setAttribute() not compatible with <tag/>
From: Karsten O. <wid...@t-...> - 2007-09-12 09:19:04
Hello,

HTMLParser does not work as expected. If some XML-conforming tags like <br/> are closed immediately, the following happens when an attribute is added:

    <br /id="test">

I would instead expect this: <br id="test"/>

The attached test can be used to show the problem.

Regards,
Karsten

    import org.htmlparser.Node;
    import org.htmlparser.Parser;
    import org.htmlparser.nodes.TagNode;
    import org.htmlparser.util.NodeIterator;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;
    import org.junit.Test;

    public class HTMLParserBug {

        private final String invalid = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">"
                + "<html>" + "<head>"
                + "<meta http-equiv=\"content-type\" content=\"text/html; charset=ISO-8859-1\">"
                + "</head>" + "<body>"
                + "Text" + "<br/>" + "Text" + "</body>" + "</html>";

        @Test
        public void testClosingTag() {
            try {
                Parser parser = Parser.createParser(invalid, "ISO-8859-1");
                NodeIterator it = parser.elements();
                processNode(it);
            } catch (ParserException e) {
                e.printStackTrace();
            }
        }

        private static void processNode(NodeIterator it) throws ParserException {
            while (it.hasMoreNodes()) {
                Node node = it.nextNode();
                System.out.println(node);
                if (node instanceof TagNode) {
                    ((TagNode) node).setAttribute("id", "test");
                    System.out.println(node);
                    NodeList list = ((TagNode) node).getChildren();
                    if (list != null) {
                        processNode(list.elements());
                    }
                }
            }
        }
    }
From: william l. <wil...@ya...> - 2007-08-28 04:27:30
Just like in the following:

    <li>info</li>
    <li>info</li>
    <li>info</li>
    <li>info</li>

or

    <a title="taught" href="/index.html" rel="section">info</a> info
    <a title="research-led" href="/index.html" rel="section">info</a> info.

How can I extract this info in a loop? Thanks in advance!
From: Derrick O. <der...@ro...> - 2007-08-24 22:29:07
You probably have to hit the login page first.
Then use the same ConnectionManager to access the desired page.
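For readers without the Apache HttpClient dependency discussed later in this thread, the cookie-carrying part of this advice - log in once, then reuse the same session for the protected page - can also be sketched with the JDK's own java.net.CookieManager. This is a hypothetical outline, not HTML Parser's ConnectionManager API; the SessionCookies name is made up for illustration.

```java
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpCookie;
import java.net.URI;
import java.util.List;

public class SessionCookies {
    // Install a process-wide CookieManager. Once set, HttpURLConnection
    // stores cookies from responses (e.g. the login page) and replays
    // them on later requests to the same host.
    static CookieManager install() {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);
        return manager;
    }

    // Record a cookie by hand, as a login response's Set-Cookie header would.
    static void remember(CookieManager manager, String url, String name, String value) {
        manager.getCookieStore().add(URI.create(url), new HttpCookie(name, value));
    }

    // The cookies that would accompany a request to the given URL.
    static List<HttpCookie> cookiesFor(CookieManager manager, String url) {
        return manager.getCookieStore().get(URI.create(url));
    }
}
```

With the manager installed via CookieHandler.setDefault, later URL.openConnection() requests to the same host pick the stored cookies up automatically, which is the behavior the login sequence in this thread depends on.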
From: <mic...@Ta...> - 2007-08-24 14:36:12
Derrick,

I'm trying this approach (strictly HtmlParser) as well, but I can't get logged in. The array of URLs is that of a page shown when not logged in.

Can you suggest anything else?
Thanks,
Mick

    URL[] urlArray;

    ConnectionManager connectionManager = new ConnectionManager();
    url = new URL("www.someloginpage.com");
    connectionManager.openConnection(url);

    connectionManager.setRedirectionProcessingEnabled(true);
    connectionManager.setCookieProcessingEnabled(true);
    connectionManager.setUser(USER_NAME);
    connectionManager.setPassword(PASSWORD);

    // go to link with stuff
    url = new URL("a page beyond the login page");
    connectionManager.openConnection(url);
    linkBean.setConnection(connectionManager.openConnection(url));
    urlArray = linkBean.getLinks(); // get all links

> You might try setRedirectionProcessingEnabled(true). Often the first URL
> is only a gateway.
> Also, it's setCookieProcessingEnabled(true), not addCookies.
From: <mic...@Ta...> - 2007-08-24 13:35:25
Thanks for all the information!

In your first response to my post, in your code, you call these methods:

    getCookiesArrayList(client)
    client.getResponseHeaders()
    getRequestHeaders()

I removed these calls, but I guess now I need to use them. Can you post these methods?

> Ok. So this works for you, and you are in exactly the same situation I was in some months ago.
> I needed to parse some pages, and when I faced pages protected by login I used HttpClient. Sorry for confusing you a bit! :o)
>
> The site I'm working on requires 2 cookies to be sent to pages protected by login, so the first step is logging in with the HttpClient commons lib, taking the cookies and sending those inside the HTTP headers to the second page (the page protected by login), so I can enter and parse that second page.
>
> I think you have to investigate whether the site you are trying to enter has a particular system to determine if you are logged in or not. Try to see from Firefox: right-click on the page and select "Page info"; you will see headers, forms and so on. Let's see if there are cookies.
>
> Supposing you are logged in and you took a cookie that tells the site you are logged in, I use the HtmlParser in this way (similarly as you do in your code):
>
>     public Hashtable getMySecondPage(String url, ArrayList cookies, Header[] headers) {
>         logger.info("url: " + url);
>         try {
>             // I pass headers and cookies making the request so the site sees that I'm logged in
>             setUpConnectionManager(cookies, headers);
>             Parser parser = new Parser(url);
>             NodeList nodelist = parser.parse(null);
>             for (int i = 0; i < nodelist.size(); i++) {
>                 System.out.print(nodelist.toHtml());
>             }
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>         return null;
>     }
>
> If the site doesn't use cookies, try just making 2 requests sequentially: FIRST the login call, then the call to the page protected by login.
>
> Cheers
>
> Mattia
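As a side note on the login POST itself: the request body a form submission sends is just application/x-www-form-urlencoded text, which can be built with nothing but the JDK. A minimal sketch (the FormBody name is made up for illustration; it is not part of HTML Parser or HttpClient):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FormBody {
    // Build an application/x-www-form-urlencoded body from name/value
    // pairs, i.e. the bytes a browser sends when a login form is posted.
    static String encode(String[][] params) {
        StringBuilder body = new StringBuilder();
        try {
            for (String[] pair : params) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(pair[0], "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(pair[1], "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always available
        }
        return body.toString();
    }
}
```

Writing such a string to the output stream of a POST connection is equivalent to what the NameValuePair-based HttpClient code in this thread does under the hood.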