htmlparser-user Mailing List for HTML Parser (Page 79)
Brought to you by:
derrickoswald
From: mohammad a. <re...@em...> - 2003-04-16 10:29:21
|
I am having difficulties with web pages that use "windows-1252", "windows-1256" and "ISO-8859-1": they are not recognized by HTMLParser, which uses ISO instead, and that ruins the text.

Is there any way to convert all of these to true Unicode, such as UTF-8 encoding? I have tried several methods from java.nio.charset.* and also

    String a = new String(s.getBytes("ISO-8859-1"), "UTF-8")

without any luck. It seems that the text gets corrupted directly at the InputStream.

Please help!

Thanks,
rezamotori
|
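A note on the conversion attempt above: a Java String is already Unicode internally, so the usual fix is to decode the page bytes with the charset the page was really written in, not to re-decode them as UTF-8. The following is a minimal sketch, assuming the text was wrongly decoded as ISO-8859-1 (which keeps every byte intact) and that the real charset is known. The class and method names (CharsetRepair, redecode) are hypothetical and are not part of HTML Parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of HTML Parser. */
class CharsetRepair {

    /**
     * Recovers text that was decoded as ISO-8859-1 although the page was
     * really written in another charset such as windows-1252.
     */
    static String redecode(String misdecoded, String actualCharset) {
        // ISO-8859-1 maps each byte to the code point with the same value,
        // so encoding with it gives back the original page bytes.
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        // Decode those bytes with the charset the page really uses.
        return new String(raw, Charset.forName(actualCharset));
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String original = "caf\u00e9 \u20ac5";            // e-acute and euro sign
        byte[] pageBytes = original.getBytes(cp1252);     // bytes as the server sent them
        String wrong = new String(pageBytes, StandardCharsets.ISO_8859_1);

        System.out.println(redecode(wrong, "windows-1252").equals(original));
        // Decoding the same bytes as UTF-8 cannot work: they are not valid
        // UTF-8, so the decoder substitutes U+FFFD replacement characters.
        String asUtf8 = new String(pageBytes, StandardCharsets.UTF_8);
        System.out.println(asUtf8.indexOf('\uFFFD') >= 0);
    }
}
```

This only helps when the wrong decoding was lossless, as ISO-8859-1 is. When the raw stream is available, wrapping it in new InputStreamReader(in, charset) with the right charset avoids the problem altogether.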
From: Somik R. <so...@ya...> - 2003-04-14 00:06:47
|
Hi Folks,

This week's release contains:

Integration Build 1.3 - 20030413
--------------------------------
[1] Reimplemented StringBean as a NodeVisitor; testStringBeanListener now succeeds
[2] Implemented feature request 702541 (tags created by CompositeTagScanner now have startLine and endLine information in their TagData)
[3] Modified ScriptScanner to allow for subclassing and fixed a minor bug
[4] Re-architected CompositeTagScanner
[5] Fixed tag scanning bugs and OOM exceptions connected to #4

Thanks to Derrick Oswald for his work on the StringBean and Marc Novakowski for his work on adding line number support.

In this release, the CompositeTagScanner has been totally redesigned using principles of Evolutionary Design (ED). The new code is dramatically simpler and easier to understand. I tried ED on assertXmlEquals last week and had the same results. The out-of-memory bugs have been fixed; it will be good to have some feedback.

For those who might be interested in ED: before taking each step, I took a snapshot and put it in CVS. Many thanks to Josh Kerievsky for teaching me how to do ED.

Cheers,
Somik
|
From: Somik R. <so...@ya...> - 2003-04-12 00:15:05
|
> I read the documentation as well as saw the sample
> program on how to make visitors. But I am still
> unclear as to what a visitor exactly does. In my
> case I need to visit the string node before and after
> the link node, if there are any. So how do I write a
> visitor for that?

A visitor basically visits all the nodes. When a certain node is visited, the appropriate visit method is automatically called. If you've used the Apache parsers (Xerces), it follows the same model, often known as the SAX model of parsing. In patterns terminology we call it an Internal Iterator (check the book Design Patterns by Erich Gamma et al.).

To solve your problem, you will have to work hard and create an algorithm that uses the visitor and maintains flags, etc. (Do not get confused: this is not about logic in the parser, and no flags need to be set there. I am talking about logic you have to develop on your own.)

To speed up your learning, at least try the sample programs that use a visitor. Get hold of a nice debugger and step through a sample program right from the beginning to get a clear understanding (or read the code, which many people find simpler).

Regards,
Somik
|
From: vihang d. <vih...@ya...> - 2003-04-11 23:44:44
|
Hi Somik,

I read the documentation and looked at the sample program on how to make visitors, but I am still unclear as to what exactly a visitor does. In my case I need to visit the string node before and after the link node, if there are any. So how do I write a visitor for that?

Also, I need to retrieve the entire string as well as all the links in a page. Do I need to define two parsers every time to do that? I tried using one parser but it doesn't work. However, using two parsers for the same page worked.

Vihang

--- Somik Raha <so...@ya...> wrote:
> My suggestion is: dont register any scanners except
> for the link scanner. Create your own algo that saves
> prev nodes and next string nodes the way you want.
> (You can get 10 words out of each from the string nodes)
>
> You might find this easy to do in the visitors - check
> the documentation on how to write visitors -
> http://htmlparser.sourceforge.net/docs/
>
> Regards,
> Somik
>
> --- vihang dalal <vih...@ya...> wrote:
> > Hi everyone,
> > I am using the HTML parser to build a focussed web
> > crawler. I need to extract the text that is around
> > links to do the focussed crawling...for eg i need to
> > know what are the 10 words before and after the given
> > link.. Can somebody tell me how do i go about it ??
> >
> > Thanx
> >
> > Vihang
> >
> > _______________________________________________
> > Htmlparser-user mailing list
> > Htm...@li...
> > https://lists.sourceforge.net/lists/listinfo/htmlparser-user
|
From: Somik R. <so...@ya...> - 2003-04-11 15:35:15
|
Hi Linz,

> Basically I am trying to convert or re-write html
> into 'slightly' formatted plain text. (Wiki formatting
> in fact, something like "tag conversion")

That's interesting. I recently put out a WikiGrabber, which does something similar: the idea was to use a Wiki for creating documentation, and convert that to static html. Check http://sourceforge.net/projects/wikigrabber
There are no releases yet, but the source is in CVS. The parser project itself uses WikiGrabber to create the doc bundle.

> So I want to change "<br>" in "\n" and "<p>" into "\n\n"
> If I find "<p>blah blah</p>" I just want to ignore
> the "</p>". <p> will sometimes have an end tag and sometimes not.
> If using the Visitor method is it suitable for doing
> a re-write of a document? Or is Visitor just for doing
> extractions of all the Links for example?

A visitor is very useful for this sort of work. Check the WikiGrabber code: it does similar things, modifying certain tags in certain ways while keeping the rest of the html intact. Basically, override the common visitXXX() methods and have them call toHtml and add the result to a StringBuffer. Your requirements can be easily met in visitTag(): check if it's br or p, and do the needful. There is also visitEndTag(), where you can ignore the tag (don't add it to your buffer) if it is </p>.

HTH.

Regards,
Somik
|
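The rewrite rule described above (newline for br, blank line for p, drop the closing p, pass everything else through) can be sketched without the library. This is only an illustration under stated assumptions: the onTag / onEndTag / onText methods stand in for the visitor callbacks mentioned in the reply, the regex tokenizer is a crude substitute for the real parser (it ignores comments and attribute values containing ">"), and the class name BrPRewriter is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; the three callbacks mimic a visitor's tag, end-tag and text methods. */
class BrPRewriter {
    // Group 1 is "/" for an end tag, group 2 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    private final StringBuilder out = new StringBuilder();

    /** Text between tags is copied unchanged. */
    void onText(String text) {
        out.append(text);
    }

    /** Opening tags: br and p become newlines, all others are kept as-is. */
    void onTag(String name, String html) {
        if (name.equalsIgnoreCase("br")) {
            out.append("\n");
        } else if (name.equalsIgnoreCase("p")) {
            out.append("\n\n");
        } else {
            out.append(html);
        }
    }

    /** End tags: a closing p is dropped, all others are kept as-is. */
    void onEndTag(String name, String html) {
        if (!name.equalsIgnoreCase("p")) {
            out.append(html);
        }
    }

    /** Walks the input in document order and fires the callbacks. */
    static String rewrite(String html) {
        BrPRewriter rewriter = new BrPRewriter();
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            rewriter.onText(html.substring(pos, m.start()));
            if (m.group(1).isEmpty()) {
                rewriter.onTag(m.group(2), m.group());
            } else {
                rewriter.onEndTag(m.group(2), m.group());
            }
            pos = m.end();
        }
        rewriter.onText(html.substring(pos));
        return rewriter.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("<p>one<br>two</p><p>three <b>bold</b>"));
    }
}
```

Because the closing p is simply skipped, the same code handles paragraphs with and without an end tag, which was the original concern.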
From: Lindsay S. <lin...@ho...> - 2003-04-11 08:24:36
|
Hi Somik,

(code below)

Basically I am trying to convert or re-write html into 'slightly' formatted plain text. (Wiki formatting in fact, something like "tag conversion")

So I want to change "<br>" into "\n" and "<p>" into "\n\n".
If I find "<p>blah blah</p>" I just want to ignore the "</p>". <p> will sometimes have an end tag and sometimes not.

If using the Visitor method, is it suitable for doing a re-write of a document? Or is Visitor just for doing extractions of all the Links, for example?

Thank you for any pointers.

Cheers
Linz

    /**
     * Convert html into wiki formatting.
     */
    public String parseHtmlIntoWiki (String html) throws ParserException
    {
        NodeReader reader = new NodeReader (new BufferedReader (new StringReader (html.toString ())), html.length ());
        Parser parser = new Parser (reader);
        parser.addScanner (new BRScanner (""));
        parser.addScanner (new PScanner (""));
        StringBuffer results = new StringBuffer ();
        for (NodeIterator i = parser.elements (); i.hasMoreNodes ();)
        {
            Node node = i.nextNode ();
            if (node instanceof BRTag)
            {
                // <br> becomes a single newline
                results.append ("\n");
                //System.out.println("BR");
            }
            else if (node instanceof PTag)
            {
                // <p> becomes a blank line
                results.append ("\n\n");
                //System.out.println("P");
            }
            else
            {
                // everything else is copied through unchanged
                results.append (node.toHTML ());
                System.out.println (node.toHTML ());
            }
        }
        return results.toString ();
    }

>>From: "Somik Raha" <so...@ya...>
>>Reply-To: htm...@li...
>>To: <htm...@li...>
>>Subject: Re: [Htmlparser-user] How to handle <p> tag?
>>Date: Thu, 10 Apr 2003 20:43:52 -0700
>>
>>Hi Linz
>> > How would I make a <p> scanner? Should I extend TagScanner or
>> > CompositeTagScanner?
>> >
>> > Sometimes <p> is a composite style tag and sometimes it is a single tag.
>>
>>Can you describe the problem that you're trying to solve ? It will be easier
>>to advise you if we have the whole picture.
>>Writing a scanner should not be the first option (when parsing html) - try
>>and see if a visitor does not solve the same problem easily.
>>
>>Making <p> a CompositeTagScanner is not a good idea at this time. The
>>CompositeTagScanner took a quantum leap (2 weeks back) and it didn't quite
>>make it - there are some important bugs that need tackling, with regard to
>>its handling of non-ended tags.
>>
>>Regards
>>Somik
|
From: Lindsay S. <lin...@ho...> - 2003-04-11 07:15:40
|
Hi Somik,

Basically I am trying to convert or re-write html into 'slightly' formatted plain text. (Wiki formatting in fact, something like "tag conversion")

So I want to change "<br>" into "\n" and "<p>" into "\n\n".
If I find "<p>blah blah</p>" I just want to ignore the "</p>". <p> will sometimes have an end tag and sometimes not.

If using the Visitor method, is it suitable for doing a re-write of a document? Or is Visitor just for doing extractions of all the Links, for example?

Thank you for any pointers.

Cheers
Linz

>From: "Somik Raha" <so...@ya...>
>Reply-To: htm...@li...
>To: <htm...@li...>
>Subject: Re: [Htmlparser-user] How to handle <p> tag?
>Date: Thu, 10 Apr 2003 20:43:52 -0700
>
>Hi Linz
> > How would I make a <p> scanner? Should I extend TagScanner or
> > CompositeTagScanner?
> >
> > Sometimes <p> is a composite style tag and sometimes it is a single tag.
>
>Can you describe the problem that you're trying to solve ? It will be easier
>to advise you if we have the whole picture.
>Writing a scanner should not be the first option (when parsing html) - try
>and see if a visitor does not solve the same problem easily.
>
>Making <p> a CompositeTagScanner is not a good idea at this time. The
>CompositeTagScanner took a quantum leap (2 weeks back) and it didn't quite
>make it - there are some important bugs that need tackling, with regard to
>its handling of non-ended tags.
>
>Regards
>Somik
|
From: Somik R. <so...@ya...> - 2003-04-11 03:42:37
|
Hi Linz

> How would I make a <p> scanner? Should I extend TagScanner or
> CompositeTagScanner?
>
> Sometimes <p> is a composite style tag and sometimes it is a single tag.

Can you describe the problem that you're trying to solve? It will be easier to advise you if we have the whole picture.
Writing a scanner should not be the first option (when parsing html). First try and see whether a visitor solves the same problem easily.

Making <p> a CompositeTagScanner is not a good idea at this time. The CompositeTagScanner took a quantum leap (2 weeks back) and it didn't quite make it: there are some important bugs that need tackling with regard to its handling of non-ended tags.

Regards
Somik
|
From: Lindsay S. <lin...@ho...> - 2003-04-10 16:25:11
|
Hi,

How would I make a <p> scanner? Should I extend TagScanner or CompositeTagScanner?

Sometimes <p> is a composite style tag and sometimes it is a single tag.

Any suggestions?

Cheers
Linz
|
From: Somik R. <so...@ya...> - 2003-04-09 18:49:45
|
My suggestion is: don't register any scanners except for the link scanner. Create your own algorithm that saves the previous nodes and the next string nodes the way you want. (You can get 10 words out of each from the string nodes.)

You might find this easy to do in the visitors. Check the documentation on how to write visitors:
http://htmlparser.sourceforge.net/docs/

Regards,
Somik

--- vihang dalal <vih...@ya...> wrote:
> Hi everyone,
> I am using the HTML parser to build a focussed web
> crawler. I need to extract the text that is around
> links to do the focussed crawling...for eg i need to
> know what are the 10 words before and after the given
> link.. Can somebody tell me how do i go about it ??
>
> Thanx
>
> Vihang
|
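The bookkeeping suggested above (remember the last few words, and keep feeding later words to links already seen) can be sketched independently of the parser. This is a minimal sketch under assumptions: onText and onLink are hypothetical callbacks that a visitor's string-node and link-tag methods would call in document order, and the class name LinkContextCollector is made up for illustration. Whether a link's own anchor text is passed to onText is left to the caller.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical collector; not part of HTML Parser. */
class LinkContextCollector {

    /** The words found immediately before and after one link. */
    static class Context {
        final String url;
        final List<String> before;
        final List<String> after = new ArrayList<>();

        Context(String url, List<String> before) {
            this.url = url;
            this.before = before;
        }
    }

    private final int window;
    private final Deque<String> recent = new ArrayDeque<>(); // last `window` words seen
    private final List<Context> open = new ArrayList<>();    // links still collecting "after" words
    private final List<Context> all = new ArrayList<>();     // every link, in document order

    LinkContextCollector(int window) {
        if (window < 1) {
            throw new IllegalArgumentException("window must be at least 1");
        }
        this.window = window;
    }

    /** Call for every text node, in document order. */
    void onText(String text) {
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            // Feed the word to every link that still needs following words.
            for (Context c : open) {
                c.after.add(word);
            }
            open.removeIf(c -> c.after.size() >= window);
            // Slide the window of preceding words.
            recent.addLast(word);
            if (recent.size() > window) {
                recent.removeFirst();
            }
        }
    }

    /** Call when a link is reached; snapshots the preceding words. */
    void onLink(String url) {
        Context c = new Context(url, new ArrayList<>(recent));
        open.add(c);
        all.add(c);
    }

    List<Context> results() {
        return all;
    }

    public static void main(String[] args) {
        LinkContextCollector collector = new LinkContextCollector(3);
        collector.onText("Read the full parser documentation at");
        collector.onLink("http://htmlparser.sourceforge.net/docs/");
        collector.onText("before writing your own scanner");
        for (Context c : collector.results()) {
            System.out.println(c.before + " " + c.url + " " + c.after);
        }
    }
}
```

A single pass is enough with this design, since both the running text and the links are seen by the same callbacks; a window of 10 gives the 10 words on each side asked for in the question.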
From: vihang d. <vih...@ya...> - 2003-04-09 17:00:11
|
Hi everyone,

I am using the HTML parser to build a focused web crawler. I need to extract the text that is around links to do the focused crawling. For example, I need to know what the 10 words before and after a given link are. Can somebody tell me how to go about it?

Thanks

Vihang
|
From: Philippe W. <pwe...@rt...> - 2003-04-08 15:47:41
|
I did it: bug #717342.
Thanks.

> Sounds like a bug, pls file a bug report.
> Regards,
> Somik
>
> --- Philippe WEYTENS <pwe...@rt...> wrote:
>> Well ..., this code used release 1.3-20030405 !!
>>
>> > You may want to try the latest htmlparser release 1.3-20030405.
>> > Somik fixed an OOM bug that might be related to this.
>> > Marc

--
===============================
Philippe WEYTENS <pwe...@rt...>
IT Technician, Programmer
Radio Television Belge Francophone (RTBF)
IT Department - room 4P16
52 Bld. Reyers, B-1044 Bruxelles
Tel: +32-2-737 3232
Fax: +32-2-737 3224
Web: http://www.rtbf.be
===============================
|
From: Somik R. <so...@ya...> - 2003-04-08 15:31:16
|
Sounds like a bug, please file a bug report.

Regards,
Somik

--- Philippe WEYTENS <pwe...@rt...> wrote:
> Well ..., this code used release 1.3-20030405 !!
>
> > You may want to try the latest htmlparser release
> > 1.3-20030405. Somik fixed an OOM bug that might be
> > related to this.
> > Marc
|
From: Philippe W. <pwe...@rt...> - 2003-04-08 06:38:30
|
Well ..., this code used release 1.3-20030405 !!

> You may want to try the latest htmlparser release 1.3-20030405. Somik fixed an OOM bug that might be related to this.
> Marc
|
From: Marc N. <ma...@ke...> - 2003-04-07 16:28:41
|
You may want to try the latest htmlparser release 1.3-20030405. Somik fixed an OOM bug that might be related to this.

Marc

-----Original Message-----
From: pw...@rt... [mailto:pw...@rt...]
Sent: Monday, April 07, 2003 7:16 AM
To: htm...@li...
Subject: [Htmlparser-user] Nested tables crashes the VM!

Dear Sir,

it seems that the package does not parse nested tables. As you try the following code you'll get a java.lang.OutOfMemoryError !
What should I do?
Thank you.

The code :

    import org.htmlparser.*;
    import org.htmlparser.visitors.*;
    import org.htmlparser.util.*;
    import org.htmlparser.scanners.*;
    import org.htmlparser.tags.*;

    public class StringExtractor {
        public static void main(String[] args) throws Exception {
            Parser parser = new Parser(args[0]);
            parser.addScanner(new TableScanner(parser));
            NodeIterator it = parser.elements();
            while(it.hasMoreNodes()) {
                System.out.println(it.nextNode());
            }
        }
    }

The html file tested :

    <HTML>
    <table border>
    <tr> <td>Head1</td> <td>Val1</td> </tr>
    <tr> <td>Head2</td> <td>Val2</td> </tr>
    <tr> <td>
    <table border>
    <tr> <td>table2 Head1</td> <td>table2 Val1</td> </tr>
    </table>
    </td> </tr>
    </BODY>
    </HTML>
|
From: <pw...@rt...> - 2003-04-07 14:16:51
|
Dear Sir,

It seems that the package does not parse nested tables. If you try the following code, you'll get a java.lang.OutOfMemoryError!
What should I do?
Thank you.

The code:

    import org.htmlparser.*;
    import org.htmlparser.visitors.*;
    import org.htmlparser.util.*;
    import org.htmlparser.scanners.*;
    import org.htmlparser.tags.*;

    public class StringExtractor {
        public static void main(String[] args) throws Exception {
            // args[0] is the location of the html file to parse
            Parser parser = new Parser(args[0]);
            parser.addScanner(new TableScanner(parser));
            NodeIterator it = parser.elements();
            while (it.hasMoreNodes()) {
                System.out.println(it.nextNode());
            }
        }
    }

The html file tested:

    <HTML>
    <table border>
    <tr> <td>Head1</td> <td>Val1</td> </tr>
    <tr> <td>Head2</td> <td>Val2</td> </tr>
    <tr> <td>
    <table border>
    <tr> <td>table2 Head1</td> <td>table2 Val1</td> </tr>
    </table>
    </td> </tr>
    </BODY>
    </HTML>
|
From: Philippe W. <pwe...@rt...> - 2003-04-07 14:06:35
|
Dear Sir,

It seems that the package does not parse nested tables. If you try the following code, you'll get a java.lang.OutOfMemoryError!
What should I do?
Thank you.

The code:

    import org.htmlparser.*;
    import org.htmlparser.visitors.*;
    import org.htmlparser.util.*;
    import org.htmlparser.scanners.*;
    import org.htmlparser.tags.*;

    public class StringExtractor {
        public static void main(String[] args) throws Exception {
            // args[0] is the location of the html file to parse
            Parser parser = new Parser(args[0]);
            parser.addScanner(new TableScanner(parser));
            NodeIterator it = parser.elements();
            while (it.hasMoreNodes()) {
                System.out.println(it.nextNode());
            }
        }
    }

The html file tested:

    <HTML>
    <table border>
    <tr> <td>Head1</td> <td>Val1</td> </tr>
    <tr> <td>Head2</td> <td>Val2</td> </tr>
    <tr> <td>
    <table border>
    <tr> <td>table2 Head1</td> <td>table2 Val1</td> </tr>
    </table>
    </td> </tr>
    </BODY>
    </HTML>
|
From: Somik R. <so...@ya...> - 2003-04-05 20:02:22
|
Hi Folks,

This week's integration release is out. From the change log:

Integration Build 1.3 - 20030405
--------------------------------
[1] Fixed bug 712888 (scanning nested custom tags)
[2] Redesigned assertXmlEquals()
[3] Fixed bug in Parser.removeScanner()
[4] Fixed unnecessary addition of ACTION attribute in Form tag
[5] Fixed Bullet scanner out of memory exception
[6] Replaced scanner HashTable with Map

Regards,
Somik
|
From: Sean_YZU90 <s9...@ma...> - 2003-04-04 08:32:35
|
Ohh..oh.. I found it! -_-

    import org.htmlparser.visitors.TagFindingVisitor;

Dear Sir:

I want to extract h1 and h2 tags and their texts. I followed the sample program below, but my JDK cannot resolve the symbol: class TagFindingVisitor.
What should I do? Which package should I import?

Thank you very much,
Sean

    Parser parser = new Parser(..);
    String [] tagsToBeFound = {"P","BR","MYTAG"};
    TagFindingVisitor visitor = new TagFindingVisitor(tagsToBeFound);
    parser.visitAllNodesWith(visitor);
    // First tag specified in search
    Node [] allPTags = visitor.getTags(0);
    // Second tag specified in search
    Node [] allBRTags = visitor.getTags(1);
    // Third tag specified in search
    Node [] allMyTags = visitor.getTags(2);
|
From: <s9...@ma...> - 2003-04-04 08:20:11
|
Dear Sir:

I want to extract h1 and h2 tags and their texts. I followed the sample program below, but my JDK cannot resolve the symbol: class TagFindingVisitor.
What should I do? Which package should I import?

Thank you very much,
Sean

    Parser parser = new Parser(..);
    String [] tagsToBeFound = {"P","BR","MYTAG"};
    TagFindingVisitor visitor = new TagFindingVisitor(tagsToBeFound);
    parser.visitAllNodesWith(visitor);
    // First tag specified in search
    Node [] allPTags = visitor.getTags(0);
    // Second tag specified in search
    Node [] allBRTags = visitor.getTags(1);
    // Third tag specified in search
    Node [] allMyTags = visitor.getTags(2);
|
From: Somik R. <so...@ya...> - 2003-04-03 03:26:00
|
Did you try http://htmlparser.sourceforge.net/docs/ ?
Let us know if the sample programs couldn't get you started.

Regards,
Somik

--- vihang dalal <vih...@ya...> wrote:
> HI I am completely new to HTML parsing. Can anyone
> suggest to me a good site where i can read get proper
> documentation about it. I m starting from scratch.
> I need a site that can exlain the meaning of all the
> terms as well ...like say NODE etc..
> thanx
> Vihang
|
From: vihang d. <vih...@ya...> - 2003-04-03 03:22:28
|
Hi,

I am completely new to HTML parsing. Can anyone suggest a good site where I can read proper documentation about it? I'm starting from scratch. I need a site that can explain the meaning of all the terms as well, like say NODE etc.

Thanks
Vihang
|
From: Somik R. <so...@ya...> - 2003-04-02 23:01:10
|
Hi Navid, others,

Have you looked at http://htmlparser.sourceforge.net/docs/index.php/UsingCookiesWithParser ?

Regards,
Somik

--- Joseph Robins <jmr...@tg...> wrote:
> Navid H.Langaroudi wrote:
> > Thank you Rich,
> > I hope it works.
> >
> > Navid
> >
> > --- Rich Williams <ri...@wi...> wrote:
> >
> >>What is needed is cookiejar functionality..
>
> I've been using jCookie (http://jcookie.sourceforge.net/) with much
> success. It wigs out on malformed cookies, instead of trying to handle
> them gracefully, but in most cases, it works very well.
|
From: Joseph R. <jmr...@tg...> - 2003-04-02 22:55:32
|
Navid H.Langaroudi wrote:
> Thank you Rich,
> I hope it works.
>
> Navid
>
> --- Rich Williams <ri...@wi...> wrote:
>
>>What is needed is cookiejar functionality..

I've been using jCookie (http://jcookie.sourceforge.net/) with much success. It wigs out on malformed cookies, instead of trying to handle them gracefully, but in most cases, it works very well.

_____________________________________________________________
Joe Robins                      Tel: 212-918-5057
Thaumaturgix, Inc.              Fax: 212-918-5001
19 W. 44th St., 18th Floor      Email: jmr...@tg...
New York, NY 10036              http://www.tgix.com

thau'ma-tur-gy, n. the working of miracles.
|
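For readers on a current JDK, the cookie jar idea discussed in this thread can be sketched with java.net.CookieManager. Note the assumptions: that class was added in Java 6, so it was not available when these messages were written, and it is a different library from the jCookie package recommended above. The host, the cookie value and the class name CookieJarSketch are made up for illustration; only the cookie name MSCSProfile comes from the thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of "accept cookies, then give them back". */
class CookieJarSketch {

    /**
     * Stores the cookie from one response and returns the Cookie header
     * values the jar would send with the next request to the same URI.
     */
    static List<String> roundTrip(URI uri, String setCookieHeader) {
        CookieManager jar = new CookieManager(); // in-memory store, default policy
        try {
            // Accept: hand the response headers to the jar.
            Map<String, List<String>> response = new HashMap<>();
            response.put("Set-Cookie", Arrays.asList(setCookieHeader));
            jar.put(uri, response);

            // Give: ask the jar which cookies belong on a request to this URI.
            Map<String, List<String>> request =
                    jar.get(uri, new HashMap<String, List<String>>());
            List<String> cookies = request.get("Cookie");
            return cookies == null ? Collections.<String>emptyList() : cookies;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URI page = URI.create("http://www.example.com/product.asp");
        System.out.println(roundTrip(page, "MSCSProfile=abc123; path=/"));
    }
}
```

In a real crawler the jar is usually installed once with CookieHandler.setDefault(new CookieManager()), after which HttpURLConnection stores and sends cookies automatically, so a server has less reason to fall back to putting the session token in the URL.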
From: Navid H.L. <na...@ya...> - 2003-04-02 22:44:40
|
Thank you Rich,
I hope it works.

Navid

--- Rich Williams <ri...@wi...> wrote:
> What is needed is cookiejar functionality: something that will give and
> accept cookies when making requests. There is no other way around it. Many
> sites use cookies to deter spidering.
>
> rw
>
> ----- Original Message -----
> From: "Navid H.Langaroudi" <na...@ya...>
> To: <htm...@li...>
> Sent: Wednesday, April 02, 2003 3:45 PM
> Subject: Re: [Htmlparser-user] Integration Release 1.3-20030330 is out
>
> > Hi Somik and everybody else,
> > Things are really going fast and interesting here. It
> > is a great job. I hope once my program is completed, I
> > can share it with others.
> >
> > Well, I faced a new problem yesterday. It may not be
> > very much related to HTMLParser, but I would appreciate it
> > if anyone could give me a hint.
> >
> > My program uses HTMLParser classes to access sites and
> > extract all urls, and then in another run, using those
> > urls, it extracts data from the pages at those urls.
> >
> > There is this site which uses Microsoft Commerce Server
> > 2000 and attaches the cookie to the url if the request is
> > not from a browser, something like this:
> >
> > http://www.shoemall.com/product.asp?family%5Fid=2543&type=0&cat%5Fid=0&MSCSProfile=61E4CECF7275066FD87B9817DA5865CBE5EA506A04C53D8558451EC3D02BB577327CA398F52348946BD1631D503EA92FF120A8E45A336FAD8E7E4E31B1356470B79DDD041A4F98A5B403FC86D8A52985761A9F6CEA80
> >
> > And when I try to access the same page with the same url,
> > every time I get a different page!
> >
> > Can anybody tell me why this is so, and how I can
> > change my java program to avoid it or receive the
> > correct page?
> >
> > I am also using
> > connectionnew.setRequestProperty
> > ("User-Agent","Mozilla/3.0(Windows NT 4.0; U) Opera
> > 6.0 [en]");
> >
> > but still this does not help!
> >
> > Thank you
> > Navid
|