htmlparser-user Mailing List for HTML Parser (Page 33)
Brought to you by:
derrickoswald
From: Derrick O. <Der...@Ro...> - 2006-11-06 13:09:00
The StringBean is a NodeVisitor, so it can be applied to a NodeList to extract the text from a child list. I guess it's up to you to remove stuff you don't want.

Dave wrote:
> I could not find the method: ExtractTextFromChildrenOf(), which class?
> Does the text extracted include "Good Morning" or "Description"? I
> want the text after the heading (Description) only.

From: Ian M. <ian...@gm...> - 2006-11-06 09:07:54
One thing you could do is make a PRE tag class, if one does not currently exist. Ideally, you would then submit it to the project ;-)

Ian

On 11/4/06, Dave <jav...@ya...> wrote:
> I tried Sibling and HasChild filters, it does not work. Also I noticed that
> <pre> is not treated as a tag.

From: Dave <jav...@ya...> - 2006-11-06 03:12:53
Hi,

    Parser parser = new Parser();
    parser.setResource("http://web-site");
    ...
    NodeList nodes = parser.extractAllNodesThatMatch(filter);
    NodeList nodes1 = parser.extractAllNodesThatMatch(filter);

The first call is correct and returns the right node list, but the second call with the same filter returned null. I need to use the same parser multiple times without re-parsing the same page; parser.reset() will re-parse the same page. What should I do?

Thanks for help.
david

From: Dave <jav...@ya...> - 2006-11-05 09:32:01
Hi Derrick,

Thanks for help.

> ... find the H3 node (with Description as the contents),
> ... get its parent
> ... and extract all text from the parent's children (after the Heading)
> ExtractTextFromChildrenOf (HasSibling (And (TagName(H3), HasChild (String(Description)))))

I could not find the method: ExtractTextFromChildrenOf(), which class? Does the text extracted include "Good Morning" or "Description"? I want the text after the heading (Description) only.

Thanks!
Dave

From: Derrick O. <Der...@Ro...> - 2006-11-04 21:52:20
Dave,

PRE has not been added as a tag because it very often is not closed by the /PRE. You can create your own "PRE" tag class derived from CompositeTag, and register it with a PrototypicalNodeFactory you give to the parser.

To answer your previous question about filters for:

    <div>Good Morning</div>
    <h3>Description</h3>
    <pre>
    Text to extract Line1
    Text to extract Line2
    </pre>
    <div>Good Morning</div>

... find the H3 node (with Description as the contents),
... get its parent
... and extract all text from the parent's children (after the Heading)

so it would be something like

    ExtractTextFromChildrenOf (HasSibling (And (TagName(H3), HasChild (String(Description)))))

This is a lot easier to construct with the FilterBuilder application.

... or alternatively I had thought of making a 'TriggerFilter' that would set a member flag when its subordinate filter went true, and after that would always return true because the flag was set... but then this member would need to be reset or you would need to build the filter fresh for each parse.

Derrick

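The 'TriggerFilter' idea described above can be sketched without the library, assuming plain strings as stand-ins for nodes. TriggerPredicate, itemsAfter and the sample node strings are hypothetical names made up for this illustration; they are not part of htmlparser. One choice made here that the message does not specify: the item that fires the trigger is itself rejected, so only what follows the heading is accepted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/**
 * Illustrative stand-in for the 'TriggerFilter' idea; not an htmlparser class.
 * It wraps a subordinate predicate and latches to true once that predicate
 * has matched, until reset() is called.
 */
class TriggerPredicate<T> implements Predicate<T> {
    private final Predicate<T> trigger;
    private boolean fired = false;

    TriggerPredicate(Predicate<T> trigger) {
        this.trigger = trigger;
    }

    @Override
    public boolean test(T item) {
        if (fired) {
            return true;          // flag already set: accept everything from here on
        }
        if (trigger.test(item)) {
            fired = true;         // latch, but do not accept the triggering item itself
        }
        return false;
    }

    /** Clears the flag; needed before reusing the same instance on another parse. */
    void reset() {
        fired = false;
    }

    /** Returns the items that come after the first item matching the trigger. */
    static List<String> itemsAfter(List<String> items, Predicate<String> trigger) {
        TriggerPredicate<String> filter = new TriggerPredicate<>(trigger);
        List<String> accepted = new ArrayList<>();
        for (String item : items) {
            if (filter.test(item)) {
                accepted.add(item);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList(
                "div: Good Morning",
                "h3: Description",
                "pre: Text to extract Line1 / Line2",
                "div: Good Morning");
        System.out.println(itemsAfter(nodes, s -> s.equals("h3: Description")));
    }
}
```

Because the flag never clears on its own, the trailing "Good Morning" div is accepted too, so the caller would still have to cut the result off at the end of the wanted block.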
From: Dave <jav...@ya...> - 2006-11-04 11:52:33
    <pre>
    text1
    </pre>

    <table>
    <tr><td>text2</td><tr>
    </table>

>parse http://web-site table
show the whole table structure

>parse http://web-site pre
show the tag "pre" only, no text inside the pre tag.

It seems that pre is not treated as the parent node of "text1".

Is this a bug?

Thanks!

From: Dave <jav...@ya...> - 2006-11-04 09:52:24
I am new to htmlparser.

For example,

    <div>Good Morning</div>
    <h3>Description</h3>
    <pre>
    Text to extract Line1
    Text to extract Line2
    </pre>
    <div>Good Morning</div>

My question:

How to extract

    Text to extract Line1
    Text to extract Line2

after "Description" using filters?

I tried Sibling and HasChild filters, it does not work. Also I noticed that <pre> is not treated as a tag.

Thanks for help!
Dave

From: Derrick O. <Der...@Ro...> - 2006-10-27 11:32:55
Kit,

I'm not aware of any limiting values. Are you just running out of memory or is there a specific exception? Maybe try something like -Xmx256M on the java command line to up the memory for the java heap.

Derrick

ktm...@be... wrote:
> I can't retrieve more than 256k when downloading a page via StringBean. I'm using htmlparser version 1.6. Is this limit configurable?

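To check whether a flag such as -Xmx256M actually took effect, the running JVM can report the heap ceiling it is working with. This is only a small diagnostic sketch using the standard Runtime API; HeapInfo and maxHeapMegabytes are names made up for the example, and it says nothing about whether memory is really the cause of the 256k cutoff.

```java
class HeapInfo {
    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Run as, for example: java -Xmx256M HeapInfo
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

The reported figure can differ slightly from the value given on the command line, depending on the JVM and garbage collector in use.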
From: <ktm...@be...> - 2006-10-27 02:54:54
I can't retrieve more than 256k when downloading a page via StringBean. I'm using htmlparser version 1.6. Is this limit configurable?

Thanks.
Kit

From: Jan H. <jan...@gm...> - 2006-10-17 20:06:09
Derrick,

thanks for your advice!

I tried setEncoding(), but then I get ParserExceptions. I also tried my code with all kinds of public pages, all using ISO-8859-1, but whenever they have characters specific to ISO-8859-1 (as I mentioned, for example the lower and upper quotation marks) I have problems.

I debugged the code in Eclipse. This is how I retrieve the link text:

<code>
LinkTag linkTag = (LinkTag) linkNode;
String linktext = linkTag.getLinkText();
</code>

The method linkTag.getLinkText() returns the text with little "boxes" instead of the quotation marks (I can't put them in this plain-text mail). So it seems like the getLinkText() method returns these characters wrongly encoded?

Thanks and regards
Jan Hempel

Derrick Oswald wrote:
> It may be that the site is lying (in the HTTP header or even in the META
> tag of the page) and it really is in another encoding - maybe UTF-8 already.
> Try setEncoding() on the Parser before asking for nodes or filtering.

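One possible explanation for the "boxes", offered only as an assumption about the pages involved: the low and high quotation marks do not exist in ISO-8859-1, but they do in windows-1252 (bytes 0x84 and 0x93). A page labelled ISO-8859-1 that really contains windows-1252 bytes decodes those two bytes to invisible control characters. The sketch below uses only the JDK; QuoteRepair and redecodeAsWindows1252 are hypothetical names, and the byte array stands in for page content.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class QuoteRepair {
    /**
     * Re-decodes text that was read as ISO-8859-1 although its bytes were
     * really windows-1252. ISO-8859-1 maps every byte to one char, so the
     * original bytes can be recovered without loss.
     */
    static String redecodeAsWindows1252(String misread) {
        byte[] raw = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // 0x84 and 0x93 are the low and high quotation marks in windows-1252.
        byte[] pageBytes = {(byte) 0x84, 'Q', 'u', 'o', 't', 'e', (byte) 0x93};

        // What a parser would produce if it trusted an ISO-8859-1 label.
        String misread = new String(pageBytes, StandardCharsets.ISO_8859_1);
        String repaired = redecodeAsWindows1252(misread);

        // A Java String has no encoding of its own; UTF-8 is chosen only
        // when the text is turned into bytes for storage.
        byte[] utf8 = repaired.getBytes(StandardCharsets.UTF_8);

        System.out.println(repaired.equals("\u201EQuote\u201C"));
        System.out.println(utf8.length + " bytes as UTF-8");
    }
}
```

This also suggests why the getBytes("ISO-8859-1") followed by new String(bytes, "UTF-8") attempt cannot help: it reinterprets single Latin-1 bytes as UTF-8 sequences, which turns any byte above 0x7F into a replacement character. If the assumption holds, telling the parser the real encoding up front would be cleaner than repairing strings afterwards.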
From: Derrick O. <Der...@Ro...> - 2006-10-16 12:13:46
Jan,

It may be that the site is lying (in the HTTP header or even in the META tag of the page) and it really is in another encoding - maybe UTF-8 already. Try setEncoding() on the Parser before asking for nodes or filtering.

Derrick

From: Jan H. <jan...@gm...> - 2006-10-15 20:47:02
Hi guys,

I'm trying to parse a website which is encoded in ISO-8859-1. I need to store extracted link-texts in UTF-8 format.

My code looks like this:

<code>
Parser myParser = new Parser();
myParser.setURL(url);

// I created a filter named "myLinkFilter" which filters LinkNodes
NodeList myLinkNodeList = myParser.parse(myLinkFilter);

Node myLinkNode = myLinkNodeList.elementAt(0);

LinkTag linkTag = (LinkTag) myLinkNode;

String linkText = linkTag.getLinkText();
</code>

The problem now is, that certain characters (like the lower quotation marks: „Quote“) are converted to question marks.

So I tried a coding like this:

<code>
String isoString = linkTag.getLinkText();
String utf8String = null;

try
{
    byte[] stringBytesISO = isoString.getBytes("ISO-8859-1");
    utf8String = new String(stringBytesISO, "UTF-8");
}
catch (UnsupportedEncodingException e)
{
    // do something...
}
</code>

But this still returns question marks in the utf8String.
Any ideas what I need to change?

Thanks and regards
Jan Hempel

From: Derrick O. <Der...@Ro...> - 2006-10-10 11:58:17
The Parser class has a setEncoding() method that can be used. Unfortunately the parser used by the SiteCapturer is not exposed publicly, so you will need to add accessors and rebuild the library or subclass from SiteCapturer and override process() to set the encoding before proceeding.

Derrick

Ke Deng wrote:
> How to resolve this problem? Is there a way to set correct charset
> before capture?

From: Derrick O. <Der...@Ro...> - 2006-10-10 11:52:02
It looks like a bug in the ParserUtils.getLinks() method, so you should file this as a bug. You may be able to fix it by inserting a guard around the cast to CompositeTag:

    CompositeTag jStartTag = (CompositeTag)links.elementAt(j);

and rebuilding the library. Something like:

    if (links.elementAt(j) instanceof CompositeTag)

Derrick

Kevin R. Gutch wrote:
> htmlRow = pu.trimTags(htmlRow, new String[] { "a" ,"img"}, false, false);
>
> Caused by: java.lang.ClassCastException: org.htmlparser.tags.ImageTag
>     at org.htmlparser.util.ParserUtils.getLinks(ParserUtils.java:1167)

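The guard suggested above follows a general pattern: test with instanceof before casting, and skip whatever does not fit. The sketch below shows only that pattern, using made-up stand-in classes (Tag, Composite) rather than the library's own types; it is not the actual code of ParserUtils.getLinks().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GuardedCastDemo {
    // Hypothetical stand-ins for the library's node classes; illustration only.
    static class Tag {
        final String name;

        Tag(String name) {
            this.name = name;
        }
    }

    static class Composite extends Tag {
        Composite(String name) {
            super(name);
        }
    }

    /** Returns only the nodes that really are Composite, skipping the rest. */
    static List<Composite> compositesOnly(List<Tag> nodes) {
        List<Composite> kept = new ArrayList<>();
        for (Tag node : nodes) {
            if (node instanceof Composite) {   // guard: the cast below cannot fail
                kept.add((Composite) node);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // An unguarded cast of the "img" entry would throw ClassCastException.
        List<Tag> nodes = Arrays.asList(new Composite("a"), new Tag("img"));
        for (Composite c : compositesOnly(nodes)) {
            System.out.println(c.name);
        }
    }
}
```

Skipping the non-matching nodes silently is a design choice; whether those nodes should be skipped or handled some other way inside the library is for the bug report to settle.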
From: Jeffrey B. <jb...@cs...> - 2006-10-09 15:38:27
> I would like to parse several pages of a website using the same session along.

My understanding is that this is the default behavior. Just be sure to use the same HTMLParser object (that is, don't reinitialize) and you'll keep the same cookies and other state information.

-Jeff

From: Ke D. <la...@ya...> - 2006-10-09 10:13:21
Hi,

I use htmlparser v1.6. After using SiteCapturer to download a site, I found that the charset of the pages had changed: if a page's charset is not ISO-8859-1 but a multi-byte charset, the page captured by htmlparser contains a lot of garbled text.

How can I resolve this problem? Is there a way to set the correct charset before capture?

Regards,
Karl.

From: <gui...@fr...> - 2006-10-09 06:58:46
Hello,

I would like to parse several pages of a website using the same session along.

How is it possible with HTMLParser ?

Thank you.

Best regards,
Guillaume

From: Kevin R. G. <kg...@pr...> - 2006-10-05 13:44:28
Hello,

I am trying to do the following:

    htmlRow = pu.trimTags(htmlRow, new String[] { "a", "img" }, false, false);

However, I am receiving this error.

    Caused by: java.lang.ClassCastException: org.htmlparser.tags.ImageTag
        at org.htmlparser.util.ParserUtils.getLinks(ParserUtils.java:1167)
        at org.htmlparser.util.ParserUtils.trimTags(ParserUtils.java:984)
        at com.protech.util.Table2Excel.exportHtmlTableAsExcel(Table2Excel.java:192)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

The error only occurs on the "img" string. If I remove it everything works fine.

How can I solve this issue?

Thanks in advance.

From: <ric...@gm...> - 2006-09-15 11:51:27
I will try that... thanks Derrick!

José R. Zim

From: Derrick O. <Der...@Ro...> - 2006-09-15 02:26:14
<area> tags are non-standard.
You would need to create your own "AREA" tag class derived from CompositeTag, and register it with a PrototypicalNodeFactory you give to the parser.

From: <ric...@gm...> - 2006-09-14 13:55:49
hi! i'd like to extract from an html the addresses of the links which are included in the area tags. i'm already parsing the document and it's going ok. just the <area href=...> are ignored. how can i do that??

cheers!

From: Jeffrey B. <jb...@cs...> - 2006-09-11 22:28:08
Thanks Derrick,

Your suggestion worked perfectly!

-Jeff

From: Derrick O. <Der...@Ro...> - 2006-09-11 22:18:46
|
I believe you need to use setBaseUrl on the Page object:

parser.getLexer ().getPage ().setBaseUrl ("http://www.bar.com");

Jeffrey Bigham wrote:

>On 9/11/06, Garry Huang <ga...@gm...> wrote:
>
>>Did you try my_parser.setURL("http://www.bar.com/"); ?
>
>Yeah, I tried that.
>
>If it's inserted before I call extractAllNodesThatMatch(img_filter);
>then http://www.bar.com is downloaded. If it's called after then the
>relative links aren't fixed.
>
>It's possible that there's something subtle with the ordering that I
>could change, but I couldn't get it to work and it seems like it would
>be a hack...
>
>Thanks for the suggestion though.
>
>-Jeff
|
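For reference, the result Jeff is after (foo.jpg reported as http://www.bar.com/foo.jpg when the page came from http://www.bar.com/foo.html) is ordinary relative-URL resolution against a base. The sketch below shows that resolution using only java.net.URI from the JDK. It does not call HTML Parser at all, and the class name BaseUrlSketch and the helper resolve are made up for illustration. It only shows what a correctly set base URL should yield for each link.

```java
import java.net.URI;

// Hypothetical helper, not part of HTML Parser: plain JDK URL resolution.
class BaseUrlSketch {

    // Resolve a possibly relative link against the URL the page originally came from.
    static String resolve(String baseUrl, String link) {
        return URI.create(baseUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        // Original location of the cached page, taken from the question in this thread.
        String base = "http://www.bar.com/foo.html";

        // Relative link: the last path segment of the base is replaced.
        System.out.println(resolve(base, "foo.jpg"));

        // Absolute link: returned unchanged (other.example is a placeholder host).
        System.out.println(resolve(base, "http://other.example/x.gif"));
    }
}
```

Passing the full original page URL as the base, rather than only the host, keeps resolution correct for pages that live in a subdirectory.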
From: Jeffrey B. <jb...@cs...> - 2006-09-11 18:21:05
|
On 9/11/06, Garry Huang <ga...@gm...> wrote:
> Did you try my_parser.setURL("http://www.bar.com/"); ?

Yeah, I tried that.

If it's inserted before I call extractAllNodesThatMatch(img_filter);
then http://www.bar.com is downloaded. If it's called after then the
relative links aren't fixed.

It's possible that there's something subtle with the ordering that I
could change, but I couldn't get it to work and it seems like it would
be a hack...

Thanks for the suggestion though.

-Jeff

> Just a thought.
>
> Cheers,
> Garry
|
From: Garry H. <ga...@gm...> - 2006-09-11 17:07:28
|
Did you try my_parser.setURL("http://www.bar.com/"); ?

Just a thought.

Cheers,
Garry

On Sep 12, 2006, at 12:58 AM, jpdogg wrote:

> Hello,
>
> I've cached some HTML pages in local files and would like to tell the
> Parser object what the original URLs were so that it can correctly
> interpret relative links.
>
> As a simple example, say I do this:
>
> Parser my_parser = new Parser("<html><img src='foo.jpg'></html>");
>
> If I construct a filter to give me all of the ImageTags in this simple
> document, I get one. Unfortunately, it has the URL foo.jpg. If I
> know that this file was originally located at
> http://www.bar.com/foo.html, how do I inform the parser module? I
> want it to be able to report that the above image is located at
> http://www.bar.com/foo.jpg.
>
> Thanks!
> Jeff
|