htmlparser-user Mailing List for HTML Parser (Page 34)
Brought to you by:
derrickoswald
From: jpdogg <jp...@gm...> - 2006-09-11 16:59:01
|
Hello,

I've cached some HTML pages in local files and would like to tell the Parser object what the original URLs were so that it can correctly interpret relative links. As a simple example, say I do this:

    Parser my_parser = new Parser("<html><img src='foo.jpg'></html>");

If I construct a filter to give me all of the ImageTags in this simple document, I get one. Unfortunately, it has the URL foo.jpg. If I know that this file was originally located at http://www.bar.com/foo.html, how do I inform the parser module? I want it to be able to report that the above image is located at http://www.bar.com/foo.jpg.

Thanks!
Jeff |
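The question above boils down to resolving a relative link against the URL the page was originally fetched from. The thread does not show how to tell htmlparser about that URL, so the following is only a minimal sketch of the resolution step itself, done after extraction with the JDK's java.net.URI. The class name RelativeLinkSketch and the method toAbsolute are made up for illustration and are not part of htmlparser.

```java
import java.net.URI;

public class RelativeLinkSketch {

    /** Resolves a possibly relative link against the page's original URL. */
    static String toAbsolute(String originalPageUrl, String link) {
        return URI.create(originalPageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        // The original location of the cached page and the src value of its image tag.
        String originalPageUrl = "http://www.bar.com/foo.html";
        System.out.println(toAbsolute(originalPageUrl, "foo.jpg"));
        // prints http://www.bar.com/foo.jpg
    }
}
```

A link that is already absolute comes back unchanged, so the same call can be applied to every extracted URL.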
From: andrew d. <and...@ho...> - 2006-09-07 18:06:16
|
Thank you for this it was just what was needed.. |
From: Derrick O. <Der...@Ro...> - 2006-09-07 11:50:47
|
Andrew,

You could use a filter on the row NodeList, something like:

    NodeList td_tags = TableList.extractAllNodesThatMatch (
        new AndFilter (new TagNameFilter ("TD"), new HasAttributeFilter ("class", "listi")));

Once you have the tags you can fetch their text contents with a StringBean:

    StringBean sb = new StringBean ();
    td_tags.visitAllNodesWith (sb);
    System.out.println (sb.getStrings ());

Derrick |
From: andrew d. <and...@ho...> - 2006-09-06 11:01:25
|
Hello all, and thanks for looking at my question.

I am still new to Java and HtmlParser. I have a series of web pages stored offline that I need to process. They are made up of tables. I can find the table tag and then all the table rows, but the next bit is stumping me: how do I read the TD values, or how do I check individual tags to see if there is more processing to do (see the source example below)?

Many thanks for any help.

    public static void process(NodeList listx)
    {
        // Scan for "tr" tags and extract info
        NodeList TableList = listx.extractAllNodesThatMatch(new TagNameFilter("tr"));
        for (int x = 0; x < TableList.size(); x++)
        {
            // Process nodes or tags; this is what is stumping me:
            // 1. How do I read all TD nodes with, say, the format <TD class="listi"> and get their value?
            // 2. Or how do I get individual tags for further processing?
        }
    }

    public static void main(String[] args) {
        try {
            parser = new Parser("c:\\HtmlTest0002.htm");

            // Look for the table tag
            list = parser.parse (new TagNameFilter("table"));
            for (int x = 0; x < list.size(); x++)
            {
                // Is it the right table?
                if (list.elementAt(x).toString().contains("listme"))
                {
                    // Get all children and process
                    process(list.elementAt(x).getChildren());
                }
            }
        } catch (ParserException ex) {
            ex.printStackTrace();
        }
    }
} |
From: Ian M. <ian...@gm...> - 2006-08-30 16:09:08
|
Can you give a copy of the file that shows this problem? |
From: Eugeny N D. <bo...@re...> - 2006-08-28 08:10:15
|
On Fri, Aug 25, 2006 at 09:56:48AM +0100, Ian Macfarlane wrote:
> If it's guaranteed to be valid XML, I'd use an XML parser instead.
> Java has one built in, or look into Xerces.

The thing is, I will get the document as input, and I don't know which of the formats (HTML, XHTML or XML) it will be, so I'm looking for a common way to build a DOM for these formats.

--
Eugene N Dzhurinsky |
From: Srinivas N <sn...@os...> - 2006-08-25 12:12:14
|
Hi all,

Please help me, it is very urgent.

I have HTML content consisting of 48 input tags in a form tag. When formTag.getFormInputs() is called, it returns a count of 48, even though there are many table tags inside the form tag. But when the same content, including the form tag, is placed inside a table tag, the parser parses only up to 14 input tags and does not return the expected count of 48.

Please let me know whether the problem is with the parser or with the way the table tag is placed above the form tag.

With regards,
Srinivas |
From: Ian M. <ian...@gm...> - 2006-08-25 08:56:52
|
If it's guaranteed to be valid XML, I'd use an XML parser instead. Java has one built in, or look into Xerces.

Ian |
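For reference, the XML parser built into Java that the reply above mentions can be used roughly as follows. This is a minimal sketch using the JDK's javax.xml.parsers API; the class name XmlDomSketch and its helper methods are made up for illustration. It only accepts well-formed XML or XHTML, so ordinary tag-soup HTML is rejected and would still need a lenient parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XmlDomSketch {

    /** Builds a DOM from well-formed XML or XHTML text; fails on tag soup. */
    static Document parse(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    /** Counts the elements with the given tag name anywhere in the document. */
    static int countElements(String markup, String tagName) {
        return parse(markup).getElementsByTagName(tagName).getLength();
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>one</p><p>two</p></body></html>";
        System.out.println(countElements(xhtml, "p")); // prints 2
    }
}
```

One possible design, when the input format is unknown, is to try this strict path first and fall back to an HTML parser when it throws.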
From: Eugeny N D. <bo...@re...> - 2006-08-23 08:34:56
|
Is it possible to parse XML documents as well as XHTML documents with htmlparser? -- Eugene N Dzhurinsky |
From: Derrick O. <Der...@Ro...> - 2006-08-10 02:59:48
|
Hi, I would be interested to hear some real user stories. The traffic on this list is pretty much all problems encountered - and solutions provided hopefully - but there must be a whole bunch of people who are using it for weird and wild projects without a problem. After all there are 3000 downloads a month, and it's not that hard to use is it? So how about it? Tell us your success story or something small or large you are proud of accomplishing with htmlparser. Derrick |
From: lu d. <dom...@gm...> - 2006-08-09 02:38:54
|
From: Derrick O. <Der...@Ro...> - 2006-08-08 20:37:07
|
Jesse,

The problem may be within HtmlUtils.registerTags. What does this do? What tags does it register?

The div tag filter will return multiple elements with the same text. For example:

    <div class='A'><div class='B'>the text</div></div>

will return a list containing two items:

    1) <div class='A'><div class='B'>the text</div></div>
    2) <div class='B'>the text</div>

which, if you pass it to a string extractor, will return:

    the textthe text

Derrick |
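The duplication described above can be reproduced without the library. The sketch below uses a tiny hypothetical Tag class (a stand-in, not an htmlparser type) to show why collecting every matching tag at every depth repeats nested text, and how keeping only the outermost matches avoids it. All names here are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class NestedMatchSketch {

    /** Hypothetical stand-in for a parsed tag; not an htmlparser class. */
    static final class Tag {
        final String name;
        final String ownText;
        final List<Tag> children = new ArrayList<>();

        Tag(String name, String ownText) {
            this.name = name;
            this.ownText = ownText;
        }

        Tag add(Tag child) {
            children.add(child);
            return this;
        }

        /** Text of this tag plus the text of everything nested inside it. */
        String allText() {
            StringBuilder sb = new StringBuilder(ownText);
            for (Tag child : children) {
                sb.append(child.allText());
            }
            return sb.toString();
        }
    }

    /** Every tag with the given name, including ones nested inside other matches. */
    static List<Tag> collectAll(Tag root, String name) {
        List<Tag> out = new ArrayList<>();
        collect(root, name, false, out);
        return out;
    }

    /** Only the outermost matches: the search does not descend into a match. */
    static List<Tag> collectOutermost(Tag root, String name) {
        List<Tag> out = new ArrayList<>();
        collect(root, name, true, out);
        return out;
    }

    private static void collect(Tag node, String name, boolean outermostOnly, List<Tag> out) {
        boolean matches = node.name.equals(name);
        if (matches) {
            out.add(node);
            if (outermostOnly) {
                return;
            }
        }
        for (Tag child : node.children) {
            collect(child, name, outermostOnly, out);
        }
    }

    static String joinText(List<Tag> tags) {
        StringBuilder sb = new StringBuilder();
        for (Tag tag : tags) {
            sb.append(tag.allText());
        }
        return sb.toString();
    }

    /** Models <div class='A'><div class='B'>the text</div></div> inside a body. */
    static Tag sample() {
        return new Tag("body", "").add(new Tag("div", "").add(new Tag("div", "the text")));
    }

    public static void main(String[] args) {
        System.out.println(joinText(collectAll(sample(), "div")));       // prints the textthe text
        System.out.println(joinText(collectOutermost(sample(), "div"))); // prints the text
    }
}
```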
From: hpq852 <hp...@gm...> - 2006-08-08 16:19:16
|
Hi all,

I have run into a very strange problem. My code is very simple, as follows:

    public void doTest() throws Exception
    {
        URL url = new URL("http://www.uume.com/play_CPRz8a2si4zK");
        InputStream in = url.openStream();
        BufferedReader br = new BufferedReader(new InputStreamReader(in, "GB2312"));
        String line = null;
        StringBuffer sb = new StringBuffer();
        while ((line = br.readLine()) != null)
        {
            sb.append(line);
            sb.append("\n");
        }
        extractText2(sb.toString());
    }

    public String extractText2(String inputHtml) throws Exception
    {
        Parser parser = Parser.createParser(new String(inputHtml.getBytes(),"GB2312"), "GB2312");
        HtmlUtils.registerTags(parser);
        NodeFilter tagNameFilter = new TagNameFilter("div");
        NodeList nodeList = parser.extractAllNodesThatMatch(tagNameFilter);

        System.out.println(nodeList.toHtml());
        return null;
    }

I just want to get all of the div tags, so I used a TagNameFilter, but the result I get in the console is strange: it includes many repeated div tags with the same content. I have tried many times and always get the same result, and I really don't know the reason. Could you help me please?

Thanks and best regards,
Jesse |
From: Derrick O. <Der...@Ro...> - 2006-08-04 11:42:35
|
Jesse,

From your example, you can also get all the div tags at once and filter on class in a secondary pass:

    NodeList divs = nodelist.extractAllNodesThatMatch (new TagNameFilter ("DIV"));
    // presuming there is only one of each
    Div div_a = (Div)divs.extractAllNodesThatMatch (new HasAttributeFilter ("class", "A")).elementAt (0);
    Div div_b = (Div)divs.extractAllNodesThatMatch (new HasAttributeFilter ("class", "B")).elementAt (0);

and this may be faster than searching the entire page each time.

Derrick |
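The idea of parsing once and then narrowing the result in a secondary pass can be shown in plain Java. The sketch below is an illustration only: Tag is a hypothetical flat record of an already-parsed element, not an htmlparser class, and the sample values are loosely based on the page fragment from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SecondaryPassSketch {

    /** Hypothetical record of an already-parsed element; not an htmlparser class. */
    static final class Tag {
        final String name;
        final String cssClass; // may be null when the element has no class attribute
        final String text;

        Tag(String name, String cssClass, String text) {
            this.name = name;
            this.cssClass = cssClass;
            this.text = text;
        }
    }

    /** First pass: keep only the elements with the given tag name. */
    static List<Tag> withName(List<Tag> tags, String name) {
        return tags.stream()
                .filter(t -> t.name.equalsIgnoreCase(name))
                .collect(Collectors.toList());
    }

    /** Secondary pass: narrow an existing list by class attribute. */
    static List<Tag> withClass(List<Tag> tags, String cssClass) {
        return tags.stream()
                .filter(t -> cssClass.equals(t.cssClass))
                .collect(Collectors.toList());
    }

    /** Stands in for the result of a single parse of the page. */
    static List<Tag> parsedOnce() {
        return Arrays.asList(
                new Tag("title", null, "this is a test"),
                new Tag("div", "A", "google"),
                new Tag("div", "C", "msn"),
                new Tag("div", "C", "aol live"));
    }

    public static void main(String[] args) {
        List<Tag> page = parsedOnce();          // the expensive step, done once
        List<Tag> divs = withName(page, "DIV"); // reused by every later lookup
        System.out.println(withClass(divs, "A").get(0).text); // prints google
        System.out.println(withClass(divs, "C").size());      // prints 2
    }
}
```

Because each lookup works on the small in-memory list rather than the source HTML, the page itself is never parsed more than once.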
From: Ian M. <ian...@gm...> - 2006-08-04 10:42:24
|
As long as you keep the original reference to the NodeList created by Parser.parse, and you haven't modified that NodeList, you should be able to reuse it, I think.

Ian |
From: Jesse H. <hp...@gm...> - 2006-08-03 02:21:56
|
Hi all,

While using the htmlparser library I have run into a difficulty. An HTML page contains many tags, such as title, div, input, span and so on. For example:

    <title>this is a test </title>
    //...... any other tags
    <div class="A">
        <span class="B"><a href=" www.google.com ">google</a></span>
    </div>
    //...... any other tags
    <div class="C">
        <div class="D"><input type="text" id="E" value="msn" /></div>
    </div>
    //...... any other tags
    <div class="C">
        <div class="E"><span class="B"><input type="text" id="E" value="aol" /><a href=" www.live.com ">live</a></span></div>
    </div>

In this example the whole page may include many tags. If I want to get the content 'this is a test', I can use a TagNameFilter, but I have to parse the whole page. If I want to get 'google' or 'www.google.com', I have to parse the whole page a second time, and to get 'msn', 'aol' and 'live' I may have to parse it several more times. This way I can get the content I need, but it may hurt performance. Is there any other way to do that? Maybe I could use an OrFilter to get the nodes, but then how can I identify which tag a given text matched? If I want to store the results in a DB, I have no idea how to do that while parsing the HTML only once (for the best performance). I beg your help. :-)

Thanks and best regards,
Jesse |
From: Derrick O. <Der...@Ro...> - 2006-07-31 04:52:07
|
Sorry, replied without thinking. You can apply the StringBean directly to a node list:

    Parser parser = new Parser ("http://yadda.yadda");
    NodeList list = parser.parse (my_spiffo_DIV_finding_filter);
    Div div = (Div)list.elementAt (0);
    StringBean bean = new StringBean ();
    div.getChildren ().visitAllNodesWith (bean);
    System.out.println (bean.getStrings ());

Derrick |
From: Derrick O. <Der...@Ro...> - 2006-07-31 04:47:16
|
Jesse,

The job breaks down into two tasks:
  1) get the outermost tag (your <div id="video_infobox_con"> tag) using a filter you construct.
  2) use a StringBean as a visitor on that node and its children to extract the text, like so:

    Parser parser = new Parser ("http://yadda.yadda");
    NodeList list = parser.parse (my_spiffo_DIV_finding_filter);
    Div div = (Div)list.elementAt (0);
    // now re-create the HTML and pass it into another Parser
    parser = new Parser (div.toHtml ()); // Note: for older versions you need to use setInputHtml()
    StringBean bean = new StringBean ();
    parser.visitAllNodesWith (bean);
    System.out.println (bean.getStrings ());

Derrick |
From: h p. <hp...@gm...> - 2006-07-31 03:35:57
|
Hi all,

I have a question about parsing HTML content. The content contains many tags. If I want to get the text of a single tag such as a LinkTag or TableTag, it's very easy to use the LinkRegexFilter or TagNameFilter, but if I want to get the content of more than one tag, is there a filter chain? Maybe the following example will explain what I mean more directly:

    <div id="video_infobox_con">
    ·add by:<span class="fcolor_03">2006.07.27 - 01:22</span><br />
    ·Label:
    <a href="search.do?q=%B0%CD%B6%FB%C4%E1%D1%C7%C4%E1" class="lnk_04" target=_self><u>test_a</u></a>
    <a href="search.do?q=%D7%B4%D4%AA%D0%E3" class="lnk_04" target=_self><u>test_b</u></a>
    <a href="search.do?q=%C0%BA%C7%F2" class="lnk_04" target=_self><u>test_c</u></a>
    <a href="search.do?q=%CC%E5%D3%FD" class="lnk_04" target=_self><u>test_d</u></a>
    </div>
    <input type="text" id="htmlurl" name="htmlurl" value='value_test' />

There are four kinds of tags here (div, span, a and input), and I need all of the content in them: 2006.07.27 - 01:22, test_a, test_b, test_c, test_d and value_test. How should I do this? Maybe I could parse the HTML four times to get the four tags' content, but I think that would hurt performance. Could you help me? Thank you very much.

Best regards,
Jesse |
From: Derrick O. <Der...@Ro...> - 2006-07-30 12:12:21
|
Kavorka,

Maybe if you just want to remove the whole link, use something like:

    getParent ().getChildren ().remove (this);

in the doSemanticAction() override of your custom LinkTag class. That will remove the current link tag from the enclosing parent tag by altering the children list.

Derrick |
From: kavorka <the...@gm...> - 2006-07-29 13:07:11
|
Hi Oswald,

Yes, I want to remove the text within <a></a>. I'll try to do what you said, but I'm a newbie Java coder and didn't understand it clearly. I tried to override LinkTag so that it does not take the text within <a></a>, but now myLinkTag doesn't find links. How can I take the text other than <a></a>? If I ask too much, I'm sorry.

Thanks a lot,
Murat

On 7/29/06, Derrick Oswald <Der...@ro...> wrote:
> Murat,
>
> I'm not sure what you mean by 'pure' text.
> The stringextractor program uses the StringBean under the hood.
> It only collects text which would be presented in a browser - or at
> least it's supposed to.
> The stringextractor program has an option (-links) to output the links
> within angle brackets. Make sure this is not used.
> If you want to remove text within <a></a> pairs you will need to
> override the default LinkTag to not do this and register it with the
> PrototypicalNodeFactory.
>
> Derrick
>
> kavorka wrote:
>
> > Hi Oswald,
> > I have another question. In HTMLPARSER, is it possible to extract only
> > the text in the web page? The stringextractor program also extracts
> > link text in the page; I want to extract "pure" text. Can I do it?
> > Thanks,
> > Murat
|
From: Derrick O. <Der...@Ro...> - 2006-07-29 11:26:58
|
Eugeny,

Perhaps the web page is broken and has characters that can't be encoded by the encoding specified in the HTTP header or META tag. Or perhaps those are lying and the real encoding is something else.

What does it look like in your browser? What encoding is it using to interpret it?

Use parser.setEncoding ("XXXXX"); to set the encoding before beginning the parse.

Derrick

Eugeny N Dzhurinsky wrote:
>Hello!
>I'm trying to parse this page and extract all links there:
>http://www.vu.lt/lt/naujienos/337/
>
>for some reason the link to PDF file looks like:
>http://www.vu.lt/site_files/InfS/Naujienos/istorik??%20dienos.pdf
>
>which is wrong. It seems like some wrong charset was used?
>
>Here is part of my code which does the parsing:
>
>public LinkedList parseDocument(InputStream document, String encoding) {
>    try {
>        Lexer lexer = new Lexer(new Page(document, encoding));
>        String href;
>        try {
>            lexer.reset();
>            if (banner != null)
>                validateBanner(lexer);
>            lexer.reset();
>            Parser parser = new Parser(lexer);
>            NodeList list = null;
>            try {
>                list = parser
>                        .extractAllNodesThatMatch(new InterestedTagsFilter());
>            } catch (EncodingChangeException e) {
>                log.warn(e);
>                lexer.reset();
>                lexer.getPage().setEncoding(parser.getEncoding());
>                list = parser
>                        .extractAllNodesThatMatch(new InterestedTagsFilter());
>            }
>            for (SimpleNodeIterator it = list.elements(); it.hasMoreNodes();) {
>                TagNode node = (TagNode) it.nextNode();
>                href = null;
>                if (LinkTag.class.equals(node.getClass())
>                        && validateLink((LinkTag) node)) {
>                    href = ((LinkTag) node).getLink();
>                } else if (ImageTag.class.equals(node.getClass())
>                        || FrameTag.class.equals(node.getClass())) {
>                    href = node.getAttribute("src");
>                } else if (TitleTag.class.equals(node.getClass())) {
>                    title = ((TitleTag) node).getTitle();
>                } else if (BaseHrefTag.class.equals(node.getClass())) {
>                    try {
>                        baseTag = getBaseURL(new URI(((BaseHrefTag) node)
>                                .getBaseUrl(), false));
>                    } catch (URIException e2) {
>                    }
>                } else if (MetaTag.class.equals(node.getClass())
>                        && "refresh".equalsIgnoreCase(((MetaTag) node)
>                                .getHttpEquiv())) {
>                    String URL = ((MetaTag) node).getMetaContent();
>                    if (URL != null && URL.length() > 0) {
>                        String arr[] = URL.split("URL=");
>                        if (arr != null && arr.length == 2)
>                            href = arr[1];
>                    }
>                }
>                if (href != null && href.length() > 0) {
>                    if (log.isDebugEnabled())
>------->                log.debug(href); <-----------
>                    results.add(getURL(StringEscapeUtils
>                            .unescapeHtml(getEscapedURL(href.trim()))));
>                }
>            }
>            this.encoding = parser.getEncoding();
>            if (log.isDebugEnabled())
>                log.debug(this.encoding);
>        } catch (ParserException e1) {
>            log.error(e1, e1);
>        }
>    } catch (UnsupportedEncodingException e) {
>        log.error(e, e);
>    }
>    return results;
>}
>
>And on marked line application logs
>/site_files/InfS/Naujienos/istorik??%20dienos.pdf
>
>what could be wrong there?
|
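One way a literal ? can end up in a link, in line with Derrick's first suggestion, is that a character has no representation in the charset being used, so the encoder substitutes a replacement. The sketch below is JDK-only and is not HTML Parser code; the class name UnmappableCharDemo, the method roundTripLatin1 and the sample file name are hypothetical, and it is not known whether this is what actually happened with the page in question.

```java
import java.nio.charset.StandardCharsets;

class UnmappableCharDemo {
    /**
     * Encodes the text as ISO-8859-1 and decodes it again.
     * String.getBytes(Charset) replaces characters the charset cannot
     * represent with the charset's replacement byte, which is '?' here.
     */
    static String roundTripLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Hypothetical name with two letters outside ISO-8859-1 (U+0173, U+010D).
        String original = "\u0173\u010D.pdf";
        System.out.println(roundTripLatin1(original)); // prints "??.pdf"
    }
}
```

Characters that ISO-8859-1 can represent survive the round trip unchanged, so only the unmappable ones are lost, which matches the shape of the damaged URL above.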
From: Derrick O. <Der...@Ro...> - 2006-07-29 11:18:57
|
Xue-Feng,

There are many examples that collect the parsed nodes in a NodeList, modify them and print the list. Something like this should work:

NodeList list = parser.parse (null);
// pass true so that text nodes nested inside tags are found as well
NodeList text = list.extractAllNodesThatMatch (new NodeClassFilter (TextNode.class), true);
// modify the text items in the text list
System.out.println (list.toHtml ());

Derrick

Xue-Feng Yang wrote:
>I am trying to modify the TextNodes in a lexer by
>TextNode.setText(String). Then I tried to print the
>lexer by
>
> Page toPage=lexer.getPage();
> String toString=toPage.getText();
> System.out.println(toString);
>
>The page was unchanged.
>
>Does anyone have an idea how to modify a lexer or simply
>an html page?
>
>Thanks,
|
From: Derrick O. <Der...@Ro...> - 2006-07-29 11:14:28
|
Murat,

I'm not sure what you mean by 'pure' text.
The stringextractor program uses the StringBean under the hood. It only collects text which would be presented in a browser - or at least it's supposed to.
The stringextractor program has an option (-links) to output the links within angle brackets. Make sure this is not used.
If you want to remove text within <a></a> pairs you will need to override the default LinkTag to not do this and register it with the PrototypicalNodeFactory.

Derrick

kavorka wrote:
> Hi Oswald,
> I have another question. In HTMLPARSER, is it possible to extract only
> the text in the web page? The stringextractor program also extracts
> link text in the page; I want to extract "pure" text. Can I do it?
> Thanks,
> Murat
|
From: Eugeny N D. <bo...@re...> - 2006-07-28 21:30:11
|
Hello,

I'm trying to parse the page http://www.vu.lt/lt/naujienos/337/ but HtmlParser fails with this error:

ERROR org.htmlparser.util.EncodingChangeException: character mismatch (new: ? [0x2013] != old: [0xe2?]) for encoding change from ISO-8859-1 to UTF-8 at character offset 218
[junit] org.htmlparser.util.EncodingChangeException: character mismatch (new: ? [0x2013] != old: [0xe2?]) for encoding change from ISO-8859-1 to UTF-8 at character offset 218
[junit]     at org.htmlparser.lexer.InputStreamSource.setEncoding(InputStreamSource.java:280)
[junit]     at org.htmlparser.lexer.Page.setEncoding(Page.java:865)
[junit]     at org.htmlparser.tags.MetaTag.doSemanticAction(MetaTag.java:150)
[junit]     at org.htmlparser.scanners.TagScanner.scan(TagScanner.java:69)
[junit]     at org.htmlparser.scanners.CompositeTagScanner.scan(CompositeTagScanner.java:160)
[junit]     at org.htmlparser.util.IteratorImpl.nextNode(IteratorImpl.java:92)
[junit]     at org.htmlparser.Parser.extractAllNodesThatMatch(Parser.java:768)

It is thrown at the marked line:

Lexer lexer = new Lexer(new Page(document, encoding));
Parser parser = new Parser(lexer);
---->NodeList list = parser.extractAllNodesThatMatch(new InterestedTagsFilter());<----

I don't know the document's encoding in advance, so the encoding argument I pass is null.

Could somebody please advise?

--
Eugene N Dzhurinsky
|
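The two code points in the error message are consistent with the same bytes being read under two different charsets: U+2013 (en dash) is encoded in UTF-8 as the bytes E2 80 93, and the first of those bytes read as ISO-8859-1 is the character 0xE2. The JDK-only sketch below reproduces just that arithmetic; it does not use HTML Parser, the names EncodingMismatchDemo and firstChar are hypothetical, and it is an assumption that the page really contains an en dash at offset 218.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    /** UTF-8 encoding of U+2013 (EN DASH): three bytes, starting with 0xE2. */
    static final byte[] EN_DASH_UTF8 = {(byte) 0xE2, (byte) 0x80, (byte) 0x93};

    /** Returns the first character produced when the bytes are decoded with the charset. */
    static int firstChar(byte[] bytes, Charset charset) {
        return new String(bytes, charset).charAt(0);
    }

    public static void main(String[] args) {
        int asLatin1 = firstChar(EN_DASH_UTF8, StandardCharsets.ISO_8859_1);
        int asUtf8 = firstChar(EN_DASH_UTF8, StandardCharsets.UTF_8);
        // Same bytes, different characters depending on the charset used.
        System.out.printf("ISO-8859-1: 0x%x, UTF-8: 0x%x%n", asLatin1, asUtf8);
    }
}
```

Run as is, this prints "ISO-8859-1: 0xe2, UTF-8: 0x2013", the same pair of values reported as "old" and "new" in the exception.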