htmlparser-user Mailing List for HTML Parser (Page 34)

Brought to you by: derrickoswald

htmlparser-user — The user mailing list for users of the htmlparser library


Re: [Htmlparser-user] set URL for relative links

From: Garry H. <ga...@gm...> - 2006-09-11 17:07:28

Did you try my_parser.setURL("http://www.bar.com/"); ?

Just a thought.

Cheers,
Garry


[Htmlparser-user] set URL for relative links

From: jpdogg <jp...@gm...> - 2006-09-11 16:59:01

Hello,

I've cached some HTML pages in local files and would like to tell the
Parser object what the original URLs were so that it can correctly
interpret relative links.

As a simple example, say I do this:

Parser my_parser = new Parser("<html><img src='foo.jpg'></html>");

If I construct a filter to give me all of the ImageTags in this simple
document, I get one.  Unfortunately, it has the URL foo.jpg.  If I
know that this file was originally located at
http://www.bar.com/foo.html, how do I inform the parser module?  I
want it to be able to report that the above image is located at
http://www.bar.com/foo.jpg.

Thanks!
Jeff
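
For reference, the resolution step asked about above can be sketched with nothing but the JDK. This is not htmlparser API; it only shows how a relative link such as foo.jpg combines with the page's original URL. The class and method names (RelativeLinkSketch, resolve) are made up for illustration.

```java
import java.net.URI;

final class RelativeLinkSketch {

    /** Resolves a possibly relative link against the URL the page originally came from. */
    static String resolve(String pageUrl, String link) {
        // URI.resolve applies the standard relative-reference rules:
        // a bare file name replaces the last path segment of the base.
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String original = "http://www.bar.com/foo.html"; // where the cached page used to live
        System.out.println(resolve(original, "foo.jpg"));                     // sibling file
        System.out.println(resolve(original, "/img/a.png"));                  // root-relative
        System.out.println(resolve(original, "http://other.example/x.gif")); // already absolute
    }
}
```

If the parser cannot be told the original URL, applying a helper like this to each extracted image URL gives the same result after the fact.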

Re: [Htmlparser-user] Extract Data from Table Row Question.

From: andrew d. <and...@ho...> - 2006-09-07 18:06:16

Thank you for this; it was just what was needed.


Re: [Htmlparser-user] Extract Data from Table Row Question.

From: Derrick O. <Der...@Ro...> - 2006-09-07 11:50:47

Andrew,

You could use a filter on the row NodeList, something like:

    NodeList td_tags = TableList.extractAllNodesThatMatch (
        new AndFilter (new TagNameFilter ("TD"), new HasAttributeFilter ("class", "listi")));

Once you have the tags you can fetch their text contents with a StringBean:
    StringBean sb = new StringBean ();
    td_tags.visitAllNodesWith (sb);
    System.out.println (sb.getStrings () );

Derrick
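
To illustrate the idea behind combining a tag-name test with an attribute test, here is a minimal sketch that uses only the JDK's DOM parser and java.util.function.Predicate rather than htmlparser itself. The names AndFilterSketch, tagName, hasAttribute and textOfMatches are hypothetical, and the sample table is invented; it assumes the markup is well-formed, which real HTML often is not.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class AndFilterSketch {

    /** Matches elements by tag name, ignoring case. */
    static Predicate<Element> tagName(String name) {
        return e -> e.getTagName().equalsIgnoreCase(name);
    }

    /** Matches elements whose attribute has exactly the given value. */
    static Predicate<Element> hasAttribute(String attr, String value) {
        return e -> value.equals(e.getAttribute(attr));
    }

    /** Walks the whole tree and returns the trimmed text of every matching element. */
    static List<String> textOfMatches(String xml, Predicate<Element> filter) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            List<String> out = new ArrayList<>();
            collect(root, filter, out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("sample markup is not well-formed", e);
        }
    }

    private static void collect(Node node, Predicate<Element> filter, List<String> out) {
        if (node instanceof Element && filter.test((Element) node)) {
            out.add(node.getTextContent().trim());
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, filter, out); // recurse so cells nested inside rows are reached
        }
    }

    public static void main(String[] args) {
        String table = "<table id=\"listme\">"
                + "<tr><td class=\"listi\">alpha</td><td>skip</td></tr>"
                + "<tr><td class=\"listi\">beta</td></tr></table>";
        // Both conditions must hold, like an AND of two filters.
        System.out.println(textOfMatches(table, tagName("td").and(hasAttribute("class", "listi"))));
    }
}
```

The walk is recursive on purpose: the matching cells sit below the rows, so a test applied only to the top-level nodes would miss them.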


[Htmlparser-user] Extract Data from Table Row Question.

From: andrew d. <and...@ho...> - 2006-09-06 11:01:25

Hello all, and thanks for looking at my question.

I am still new to Java and HtmlParser. I have a series of web pages stored
offline that I need to process. They are made up of tables. I can find the
table tags and then all the table rows, but the next bit is stumping me:
how do I read the TD values, or how do I check individual tags to see whether
there is more processing to do? (See the source example below.)

Many thanks for any help.


import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Class name assumed; the original post omitted the class header.
public class TableRowQuestion {

    public static void process(NodeList listx) {
        // Scan for "tr" tags and extract info
        NodeList TableList = listx.extractAllNodesThatMatch(new TagNameFilter("tr"));
        for (int x = 0; x < TableList.size(); x++) {
            // Process nodes or tags; this is what is stumping me:
            //   1. How do I read all TD nodes with, say, the format <TD class="listi">
            //      and get their values?
            //   2. Or how do I get individual tags for further processing?
        }
    }

    public static void main(String[] args) {
        try {
            Parser parser = new Parser("c:\\HtmlTest0002.htm");

            // Look for table tags
            NodeList list = parser.parse(new TagNameFilter("table"));
            for (int x = 0; x < list.size(); x++) {
                // Is it the right table?
                if (list.elementAt(x).toString().contains("listme")) {
                    // Get all children and process
                    process(list.elementAt(x).getChildren());
                }
            }
        } catch (ParserException ex) {
            ex.printStackTrace();
        }
    }
}

Re: [Htmlparser-user] How to parse the form tag input attributes when table tag is placed above the form tag

From: Ian M. <ian...@gm...> - 2006-08-30 16:09:08

Can you give a copy of the file that shows this problem?


Re: [Htmlparser-user] parsing XHTML and XML

From: Eugeny N D. <bo...@re...> - 2006-08-28 08:10:15

On Fri, Aug 25, 2006 at 09:56:48AM +0100, Ian Macfarlane wrote:
> If it's guaranteed to be valid XML, I'd use an XML parser instead.
> Java has one built in, or look into Xerces.

The thing is, I will get the document as input and I don't know which format
(HTML, XHTML or XML) it will be, so I'm looking for a common way to build a
DOM for all of them.

-- 
Eugene N Dzhurinsky

[Htmlparser-user] How to parse the form tag input attributes when table tag is placed above the form tag

From: Srinivas N <sn...@os...> - 2006-08-25 12:12:14

Hi all,

Please help me; this is very urgent.

I have HTML content with 48 input tags inside a form tag. The form contains
many table tags, and when formTag.getFormInputs() is called it returns all 48
inputs. But when the same content, including the form tag, is placed inside a
table tag, the parser returns only 14 input tags instead of the expected 48.

Please let me know whether the problem lies with the parser or with placing
the table tag above the form tag.

With regards,
Srinivas

Re: [Htmlparser-user] parsing XHTML and XML

From: Ian M. <ian...@gm...> - 2006-08-25 08:56:52

If it's guaranteed to be valid XML, I'd use an XML parser instead.
Java has one built in, or look into Xerces.

Ian
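
As a rough sketch of the "use the built-in XML parser" route, the snippet below asks the JDK's DocumentBuilder whether some text is well-formed XML, so a caller could fall back to a lenient HTML parser when it is not. The class and method names (XmlOrHtmlSketch, isWellFormedXml) are invented for this example, and the fallback itself is left out.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class XmlOrHtmlSketch {

    /** Returns true if the JDK's built-in XML parser accepts the text as well-formed XML. */
    static boolean isWellFormedXml(String text) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and ignores warnings, which keeps stderr quiet.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(text)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed: treat it as tag soup and use an HTML parser
        } catch (Exception e) {
            throw new IllegalStateException("XML parser unavailable or I/O failure", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedXml("<html><body><p>ok</p></body></html>"));
        System.out.println(isWellFormedXml("<html><body><p>unclosed<br></body></html>"));
    }
}
```

Well-formed XHTML and XML pass this check; typical HTML with unclosed tags does not, which is the case where an HTML-tolerant parser is still needed.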

On 8/23/06, Eugeny N Dzhurinsky <bo...@re...> wrote:
> Is it possible to parse XML documents as well as XHTML documents with
> htmlparser?
>
> --
> Eugene N Dzhurinsky
>

[Htmlparser-user] parsing XHTML and XML

From: Eugeny N D. <bo...@re...> - 2006-08-23 08:34:56

Is it possible to parse XML documents as well as XHTML documents with
htmlparser?

-- 
Eugene N Dzhurinsky

[Htmlparser-user] user stories

From: Derrick O. <Der...@Ro...> - 2006-08-10 02:59:48

Hi,

I would be interested to hear some real user stories.  The traffic on 
this list is pretty much all problems encountered - and solutions 
provided hopefully - but there must be a whole bunch of people who are 
using it for weird and wild projects without a problem.  After all there 
are 3000 downloads a month, and it's not that hard to use is it?

So how about it?  Tell us your success story or something small or large 
you are proud of accomplishing with htmlparser.

Derrick

[Htmlparser-user] (no subject)

From: lu d. <dom...@gm...> - 2006-08-09 02:38:54

Re: [Htmlparser-user] A strange question?

From: Derrick O. <Der...@Ro...> - 2006-08-08 20:37:07

Jesse,

The problem may be within the HtmlUtils.registerTags.
What does this do? What tags does it register?

The div tag filter will return multiple elements with the same text as
in the case of:
<div class='A'><div class='B'>the text</div></div>
will return a list containing two items:
1) <div class='A'><div class='B'>the text</div></div>
2) <div class='B'>the text</div>
which if you pass it to string extractor will return:
the textthe text

Derrick
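
The nesting effect can be reproduced without htmlparser. The sketch below uses the JDK's DOM parser on the same two-div snippet: collecting every div and concatenating each one's text repeats the inner text, while keeping only divs that have no div ancestor does not. NestedDivSketch and its method names are hypothetical, and the approach assumes well-formed markup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NestedDivSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("sample markup is not well-formed", e);
        }
    }

    /** Text of every div, nested or not: inner text is counted once per enclosing div. */
    static String textOfAllDivs(Document doc) {
        NodeList divs = doc.getElementsByTagName("div");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < divs.getLength(); i++) {
            sb.append(divs.item(i).getTextContent());
        }
        return sb.toString();
    }

    /** Text of outermost divs only: any div with a div ancestor is skipped. */
    static String textOfOutermostDivs(Document doc) {
        NodeList divs = doc.getElementsByTagName("div");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < divs.getLength(); i++) {
            if (!hasDivAncestor(divs.item(i))) {
                sb.append(divs.item(i).getTextContent());
            }
        }
        return sb.toString();
    }

    private static boolean hasDivAncestor(Node node) {
        for (Node p = node.getParentNode(); p != null; p = p.getParentNode()) {
            if (p instanceof Element && "div".equals(p.getNodeName())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Document doc = parse("<div class='A'><div class='B'>the text</div></div>");
        System.out.println(textOfAllDivs(doc));       // the textthe text
        System.out.println(textOfOutermostDivs(doc)); // the text
    }
}
```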


[Htmlparser-user] A strange question?

From: hpq852 <hp...@gm...> - 2006-08-08 16:19:16

Hi All, I've run into a very strange problem. My code is very simple:

public void doTest() throws Exception
{
    URL url = new URL("http://www.uume.com/play_CPRz8a2si4zK");
    InputStream in = url.openStream();
    BufferedReader br = new BufferedReader(new InputStreamReader(in, "GB2312"));
    String line = null;
    StringBuffer sb = new StringBuffer();
    while ((line = br.readLine()) != null)
    {
        sb.append(line);
        sb.append("\n");
    }
    extractText2(sb.toString());
}

public String extractText2(String inputHtml) throws Exception
{
    Parser parser = Parser.createParser(new String(inputHtml.getBytes(), "GB2312"), "GB2312");
    HtmlUtils.registerTags(parser);
    NodeFilter tagNameFilter = new TagNameFilter("div");
    NodeList nodeList = parser.extractAllNodesThatMatch(tagNameFilter);

    System.out.println(nodeList.toHtml());
    return null;
}
 
I just want to get all of the div tags, so I used a TagNameFilter, but the result printed to the console is strange: it includes many repeated div tags with the same content.
I have tried many times and always get the same result, and I really don't know the reason. Could you help me please?

Thanks and Best Regards
Jesse

Re: [Htmlparser-user] How to extract more than one tag by only once parsering?

From: Derrick O. <Der...@Ro...> - 2006-08-04 11:42:35

Jesse,

From your example, you can also get all the div tags at once and filter
on class in a secondary pass:

NodeList divs = nodelist.extractAllNodesThatMatch (new TagNameFilter ("DIV"));
// each lookup presumes there is only one matching div
Div div_a = (Div)divs.extractAllNodesThatMatch (new HasAttributeFilter ("class", "A")).elementAt (0);
Div div_b = (Div)divs.extractAllNodesThatMatch (new HasAttributeFilter ("class", "B")).elementAt (0);

and this may be faster than searching the entire page each time.

Derrick
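
The parse-once, query-many pattern can be sketched with the JDK alone. The example below swaps htmlparser's filters for javax.xml XPath expressions (a different technique, used here only because it ships with Java) and runs several queries against one parsed tree. ParseOnceSketch, parseOnce, select and the PAGE constant are invented for the example, and PAGE is a well-formed rendering of the snippet from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ParseOnceSketch {

    static final String PAGE =
          "<html><head><title>this is a test </title></head><body>"
        + "<div class=\"A\"><span class=\"B\"><a href=\"www.google.com\">google</a></span></div>"
        + "<div class=\"C\"><div class=\"D\"><input type=\"text\" id=\"E\" value=\"msn\"/></div></div>"
        + "<div class=\"C\"><div class=\"E\"><span class=\"B\"><input type=\"text\" id=\"E\" value=\"aol\"/>"
        + "<a href=\"www.live.com\">live</a></span></div></div>"
        + "</body></html>";

    /** Parses the markup a single time; every later query reuses this tree. */
    static Document parseOnce(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("sample markup is not well-formed", e);
        }
    }

    /** Runs one XPath query against the already-parsed tree; returns trimmed text values. */
    static List<String> select(Document doc, String expr) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent().trim());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("bad XPath expression: " + expr, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseOnce(PAGE); // the only parse
        System.out.println(select(doc, "//title"));
        System.out.println(select(doc, "//div[@class='A']//a"));
        System.out.println(select(doc, "//div[@class='A']//a/@href"));
        System.out.println(select(doc, "//input/@value"));
        System.out.println(select(doc, "//div[@class='C']//a"));
    }
}
```

Because each query names the tag or attribute it targets, every result is already tied to where it came from, which is what storing the values in separate DB columns needs.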


Re: [Htmlparser-user] How to extract more than one tag by only once parsering?

From: Ian M. <ian...@gm...> - 2006-08-04 10:42:24

As long as you keep the original reference to the NodeList created by
Parser.parse, and you haven't modified that NodeList, you should be
able to reuse it, I think.

Ian


[Htmlparser-user] How to extract more than one tag by only once parsering?

From: Jesse H. <hp...@gm...> - 2006-08-03 02:21:56

Hi All, I've run into a difficulty using the htmlparser library. An HTML
page contains many tags such as title, div, input, span and so on.
For example:

<title>this is a test </title>

//...... any other tags

<div class="A">
       <span class="B"><a href=" www.google.com ">google</a></span>
</div>

//...... any other tags

<div class="C">
       <div class="D"><input type="text" id="E" value="msn" /></div>
</div>

//...... any other tags

<div class="C">
       <div class="E"><span class="B"><input type="text" id="E" value="aol"
/><a href=" www.live.com ">live</a></span></div>
</div>

In this example the whole HTML may include many tags. If I want the content
'this is a test', I can use a TagNameFilter, but I have to parse the whole
HTML. If I then want 'google' or 'www.google.com', I have to parse the whole
HTML a second time, and to get 'msn', 'aol' and 'live' I may have to parse it
several more times. This gets me the content I need, but it may hurt
performance. Is there another way to do it? I could also use an OrFilter to
get the nodes, but then how can I tell which tag each piece of text came
from? I want to store the results in a DB and have no idea how to do that
while parsing the HTML only once (for the best performance). I would
appreciate your help. :-)

Thanks and Best Regards

Jesse

Re: [Htmlparser-user] Could you help me?

From: Derrick O. <Der...@Ro...> - 2006-07-31 04:52:07

Sorry, replied without thinking.
You can apply the StringBean directly to a node list:

Parser parser = new Parser ("http://yadda.yadda");
NodeList list = parser.parse (my_spiffo_DIV_finding_filter);
Div div = (Div)list.elementAt (0);
StringBean bean = new StringBean ();
div.getChildren ().visitAllNodesWith (bean);
System.out.println (bean.getStrings ());

Derrick


Re: [Htmlparser-user] Could you help me?

From: Derrick O. <Der...@Ro...> - 2006-07-31 04:47:16

Jesse,

The job breaks down into two tasks:
  1) get the outermost tag (your <div id="video_infobox_con"> tag) using
a filter you construct.
  2) use a StringBean as a visitor on that node and its children to
extract the text, like so:

Parser parser = new Parser ("http://yadda.yadda");
NodeList list = parser.parse (my_spiffo_DIV_finding_filter);
Div div = (Div)list.elementAt (0);
// now re-create the HTML and pass it into another Parser
// Note: for older versions you need to use setInputHtml()
Parser div_parser = new Parser (div.toHtml ());
StringBean bean = new StringBean ();
div_parser.visitAllNodesWith (bean);
System.out.println (bean.getStrings ());

Derrick

h pq wrote:

> Hi all, I have a question about parsing HTML content. The content
> contains many tags. If I want the text of a single kind of tag, such as
> LinkTag or TableTag, it is easy to use LinkRegexFilter or TagNameFilter;
> but if I want the content of more than one kind of tag, is there a
> filter chain? Maybe the following example explains what I mean:
>  
>  <div id="video_infobox_con">
>     ·add by:<span class="fcolor_03">2006.07.27 - 01:22</span><br />
>     ·Label: 
>                  <a href="search.do?q=%B0%CD%B6%FB%C4%E1%D1%C7%C4%E1" 
> class="lnk_04" target=_self><u>test_a</u></a>              
>               
>                  <a href="search.do?q=%D7%B4%D4%AA%D0%E3" 
> class="lnk_04" target=_self><u>test_b</u></a>              
>               
>                  <a href=" search.do?q=%C0%BA%C7%F2" class="lnk_04" 
> target=_self><u>test_c</u></a>              
>               
>                  <a href="search.do?q=%CC%E5%D3%FD" class="lnk_04" 
> target=_self><u>test_d</u></a>              
>               
>  </div>
> <input type="text" id="htmlurl" name="htmlurl" value='value_test'  />
>  
> there are four tags such as div, span, a ,input, and  all content in 
> these tags are what I need like 2006.07.27 - 01:22,  test_a,  test_b,  
>  test_c,  test_d and value_test
> How should I do?  Maybe I can parser the html for 4 times to get the 
> four tags' content, but I think it'll impact the proformance. Could 
> you help me ? Thank you very much.
>  
> Best Regards
> Jesse
>  
>
>------------------------------------------------------------------------
>
>-------------------------------------------------------------------------
>Take Surveys. Earn Cash. Influence the Future of IT
>Join SourceForge.net's Techsay panel and you'll get the chance to share your
>opinions on IT & business topics through brief surveys -- and earn cash
>http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
>
>------------------------------------------------------------------------
>
>_______________________________________________
>Htmlparser-user mailing list
>Htm...@li...
>https://lists.sourceforge.net/lists/listinfo/htmlparser-user
>  
>

[Htmlparser-user] Could you help me?

From: h p. <hp...@gm...> - 2006-07-31 03:35:57

Hi all, I have a question about parsing HTML content. The content contains
many tags. If I want the text of a single kind of tag, such as LinkTag or
TableTag, it's easy to use the LinkRegexFilter or TagNameFilter, but if I
want the content of more than one kind of tag, is there a filter chain?
The following example may explain what I mean more directly:

 <div id="video_infobox_con">
    ·add by:<span class="fcolor_03">2006.07.27 - 01:22</span><br />
    ·Label:
                 <a href="search.do?q=%B0%CD%B6%FB%C4%E1%D1%C7%C4%E1" class="lnk_04" target=_self><u>test_a</u></a>

                 <a href="search.do?q=%D7%B4%D4%AA%D0%E3" class="lnk_04" target=_self><u>test_b</u></a>

                 <a href="search.do?q=%C0%BA%C7%F2" class="lnk_04" target=_self><u>test_c</u></a>

                 <a href="search.do?q=%CC%E5%D3%FD" class="lnk_04" target=_self><u>test_d</u></a>

 </div>
<input type="text" id="htmlurl" name="htmlurl" value='value_test'  />

There are four tags (div, span, a, input), and I need the content of all of
them: 2006.07.27 - 01:22, test_a, test_b, test_c, test_d and value_test.
What should I do? I could parse the HTML four times to get the content of
the four tags, but I think that would hurt performance. Could you help me?
Thank you very much.

Best Regards
Jesse

Re: [Htmlparser-user] finding meta data

From: Derrick O. <Der...@Ro...> - 2006-07-30 12:12:21

Kavorka,

Maybe if you just want to remove the whole link, use something like:
   getParent ().getChildren ().remove (this);
in the doSemanticAction() override of your custom LinkTag class.
That will remove the current link tag from the enclosing parent tag by 
altering the children list.

Derrick


Re: [Htmlparser-user] finding meta data

From: kavorka <the...@gm...> - 2006-07-29 13:07:11

Hi Oswald,
Yes, I want to remove the text within <a></a>. I'll try to do what you
said, but I'm a newbie Java coder and I didn't understand it clearly. I
tried to override LinkTag so that it doesn't take the text within <a></a>,
but now myLinkTag doesn't find links. How can I get the text other than
that within <a></a>?
I'm sorry if I'm asking too much.
Thanks a lot,
Murat



Re: [Htmlparser-user] wrong link encoding?

From: Derrick O. <Der...@Ro...> - 2006-07-29 11:26:58

Eugeny,

Perhaps the web page is broken and has characters that can't be encoded 
by the encoding specified in the HTTP header or META tag.
Or perhaps those are lying and the real encoding is something else.
What does it look like in your browser? What encoding is it using to 
interpret it?
Use parser.setEncoding ("XXXXX"); to set the encoding before beginning 
the parse.
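
As an illustration of how a charset mismatch can produce question marks like the ones in the logged link, here is a minimal standard-library sketch. It does not use the htmlparser API, and the file name and the three charsets are assumptions chosen for the demonstration; the real page may be using different ones.

```java
import java.nio.charset.StandardCharsets;

/** Illustration only: how a charset mismatch can turn one letter into "??". */
class CharsetMismatchDemo {

    /**
     * Encodes the text as UTF-8, decodes those bytes as ISO-8859-1 (the
     * "wrong" charset), then forces the result into US-ASCII, where every
     * unmappable character is replaced by '?'.
     */
    static String mangle(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        byte[] ascii = misread.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    /** Decoding with the same charset used for encoding keeps the text intact. */
    static String roundTrip(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical file name ending in U+0173 (two bytes in UTF-8).
        String name = "istorik\u0173";
        System.out.println(mangle(name));                  // prints istorik??
        System.out.println(roundTrip(name).equals(name));  // prints true
    }
}
```

The single non-ASCII letter becomes two question marks because its two UTF-8 bytes are misread as two separate characters before being replaced. If something similar is happening here, setting the correct encoding before the parse should avoid it.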

Derrick

Eugeny N Dzhurinsky wrote:

>Hello!
>I'm trying to parse this page and extract all links there: 
>http://www.vu.lt/lt/naujienos/337/
>
>for some reason the link to PDF file looks like: 
>http://www.vu.lt/site_files/InfS/Naujienos/istorik??%20dienos.pdf
>
>which is wrong. It seems like some wrong charset was used?
>
>Here is part of my code which does the parsing:
>
>public LinkedList parseDocument(InputStream document, String encoding) {
>    try {
>	Lexer lexer = new Lexer(new Page(document, encoding));
>	String href;
>	try {
>	    lexer.reset();
>	    if (banner != null)
>		validateBanner(lexer);
>	    lexer.reset();
>	    Parser parser = new Parser(lexer);
>	    NodeList list = null;
>	    try {
>		list = parser
>			.extractAllNodesThatMatch(new InterestedTagsFilter());
>	    } catch (EncodingChangeException e) {
>		log.warn(e);
>		lexer.reset();
>		lexer.getPage().setEncoding(parser.getEncoding());
>		list = parser
>			.extractAllNodesThatMatch(new InterestedTagsFilter());
>	    }
>	    for (SimpleNodeIterator it = list.elements(); it.hasMoreNodes();) {
>		TagNode node = (TagNode) it.nextNode();
>		href = null;
>		if (LinkTag.class.equals(node.getClass())
>			&& validateLink((LinkTag) node)) {
>		    href = ((LinkTag) node).getLink();
>		} else if (ImageTag.class.equals(node.getClass())
>			|| FrameTag.class.equals(node.getClass())) {
>		    href = node.getAttribute("src");
>		} else if (TitleTag.class.equals(node.getClass())) {
>		    title = ((TitleTag) node).getTitle();
>		} else if (BaseHrefTag.class.equals(node.getClass())) {
>		    try {
>			baseTag = getBaseURL(new URI(((BaseHrefTag) node)
>				.getBaseUrl(), false));
>		    } catch (URIException e2) {
>		    }
>		} else if (MetaTag.class.equals(node.getClass())
>			&& "refresh".equalsIgnoreCase(((MetaTag) node)
>				.getHttpEquiv())) {
>		    String URL = ((MetaTag) node).getMetaContent();
>		    if (URL != null && URL.length() > 0) {
>			String arr[] = URL.split("URL=");
>			if (arr != null && arr.length == 2)
>			    href = arr[1];
>		    }
>		}
>		if (href != null && href.length() > 0) {
>		    if (log.isDebugEnabled())
>------->		log.debug(href);		<-----------
>		    results.add(getURL(StringEscapeUtils
>			    .unescapeHtml(getEscapedURL(href.trim()))));
>		}
>	    }
>	    this.encoding = parser.getEncoding();
>	    if (log.isDebugEnabled())
>		log.debug(this.encoding);
>	} catch (ParserException e1) {
>	    log.error(e1, e1);
>	}
>    } catch (UnsupportedEncodingException e) {
>	log.error(e, e);
>    }
>    return results;
>}
>
>And on marked line application logs
>/site_files/InfS/Naujienos/istorik??%20dienos.pdf
>
>what could be wrong there?
>
>  
>

Re: [Htmlparser-user] Issue for lexer modification

From: Derrick O. <Der...@Ro...> - 2006-07-29 11:18:57

Xue-Feng,

There are many examples of collecting the parsed nodes in a NodeList, 
modifying them and printing the list.
Something like this should work.

NodeList list = parser.parse (null);
// pass true to also search the children of each node
NodeList text = list.extractAllNodesThatMatch (new NodeClassFilter 
(TextNode.class), true);
// modify the text items in the text list
System.out.println (list.toHtml ());
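
To show why printing the page gives the old text while printing the node list gives the new text, here is a toy model in plain Java. The classes ToyPage, ToyNode and ToyDemo are hypothetical and are not part of htmlparser; the sketch only assumes that a page keeps the original source while each node carries its own edits, which matches the behavior reported but is not taken from the library source.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a page: it only ever holds the original source text. */
class ToyPage {
    private final String source;

    ToyPage(String source) { this.source = source; }

    String getText() { return source; }

    String getText(int start, int end) { return source.substring(start, end); }
}

/** Toy stand-in for a node: a span of the page plus an optional replacement. */
class ToyNode {
    private final ToyPage page;
    private final int start;
    private final int end;
    private String replacement; // null until setText() is called

    ToyNode(ToyPage page, int start, int end) {
        this.page = page;
        this.start = start;
        this.end = end;
    }

    void setText(String text) { replacement = text; }

    /** Returns the replacement if one was set, else the original span. */
    String toHtml() {
        return replacement != null ? replacement : page.getText(start, end);
    }
}

class ToyDemo {
    /** Serializes from the nodes, so replacements are included. */
    static String toHtml(List<ToyNode> nodes) {
        StringBuilder out = new StringBuilder();
        for (ToyNode node : nodes) {
            out.append(node.toHtml());
        }
        return out.toString();
    }

    /** Splits the 10-character source "<b>old</b>" by hand into tag, text, tag. */
    static List<ToyNode> split(ToyPage page) {
        List<ToyNode> nodes = new ArrayList<>();
        nodes.add(new ToyNode(page, 0, 3));   // <b>
        nodes.add(new ToyNode(page, 3, 6));   // old
        nodes.add(new ToyNode(page, 6, 10));  // </b>
        return nodes;
    }

    public static void main(String[] args) {
        ToyPage page = new ToyPage("<b>old</b>");
        List<ToyNode> nodes = split(page);
        nodes.get(1).setText("new");
        System.out.println(page.getText());  // prints <b>old</b>
        System.out.println(toHtml(nodes));   // prints <b>new</b>
    }
}
```

In this model the edit lives only in the node, so the output has to be produced from the nodes rather than from the page buffer.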

Derrick

Xue-Feng Yang wrote:

>I am trying to modify the TextNodes from a lexer with 
>TextNode.setText(String). Then I tried to print the
>lexer's page with
>
> Page toPage=lexer.getPage();
> String toString=toPage.getText();
> System.out.println(toString);
>
>The page was unchanged.
> 
>Does anyone have an idea how to modify a lexer or simply
>an HTML page?
>
>Thanks,
>
>  
>

Re: [Htmlparser-user] finding meta data

From: Derrick O. <Der...@Ro...> - 2006-07-29 11:14:28

Murat,

I'm not sure what you mean by 'pure' text.
The stringextractor program uses the StringBean under the hood.
It only collects text which would be presented in a browser - or at 
least it's supposed to.
The stringextractor program has an option (-links) to output the links 
within angle brackets. Make sure this is not used.
If you want to remove text within <a></a> pairs you will need to 
override the default LinkTag to not do this and register it with the 
PrototypicalNodeFactory.

Derrick

kavorka wrote:

> Hi Oswald,
> I have another question. In HTMLPARSER, is it possible to extract only 
> the text in the web page? The stringextractor program also extracts 
> the link text in the page; I want to extract "pure" text. Can I do it?
> Thanks,
> Murat
>

790 messages have been excluded from this view by a project administrator.
