htmlparser-user Mailing List for HTML Parser (Page 23)
Brought to you by:
derrickoswald
From: Derrick O. <der...@ro...> - 2007-11-23 17:33:23
|
You should be able to use the Page.setBaseUrl (String base) method to set the URL used as a prefix for relative links, i.e.

    parser.getLexer ().getPage ().setBaseUrl ("http://yadda.yadda");

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user |
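To see why the base URL matters here, a minimal JDK-only sketch may help. It does not use HTMLParser at all; it only shows how the same relative link resolves differently depending on which base it is resolved against. The class name, the temporary file path and the link are made up for illustration.

```java
import java.net.URI;

// Illustration only: plain java.net.URI, not HTMLParser's Page class.
class BaseUrlSketch {

    // Resolve a (possibly relative) link against the given base URL.
    static String resolve(String base, String link) {
        return URI.create(base).resolve(link).toString();
    }

    public static void main(String[] args) {
        String link = "images/logo.gif";
        // Base taken from where the spider stored the page (hypothetical path).
        System.out.println(resolve("file:/tmp/spider/page1234.html", link));
        // Base taken from where the page was originally downloaded.
        System.out.println(resolve("http://yadda.yadda/docs/index.html", link));
    }
}
```

The first call yields a link into the temporary directory, which is the symptom described in the question; the second is what setting the base URL explicitly is meant to achieve.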
From: Jurgen V. <j.e...@st...> - 2007-11-23 16:13:00
|
List,

I've recently started using htmlparser as part of a web-spidering tool that I have written, and I've run into a small problem. My spider downloads files from web servers using HttpClient from the Apache Commons project. These files are then stored locally in a temporary location. If a file contains HTML, it is then parsed by htmlparser. During parsing, the parser resolves relative links to other files by prepending the location of the local file to the relative link, which of course completely screws up the links. Is there any way to turn this feature off, or some way of telling the parser that the location of the data is not where it gets the data from?

thanks
Jurgen Voorneveld |
From: Derrick O. <der...@ro...> - 2007-11-21 23:21:54
|
The test (nl.extractAllNodesThatMatch(objFilter,false).size() > 0) appears to be redundant if you are going to recurse through all nodes anyway. If you get rid of that clause and its else, and instead recurse on the else of the (subnode instanceof ObjectTag) test, you should get all nodes. |
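The control flow suggested in this reply can be sketched without HTMLParser. The `Node` class below is a hypothetical stand-in for the library's node types, and the `<img>` placeholder stands in for the real ImageTag; the only point is the shape of the recursion: replace a matching node, otherwise emit the tag and recurse into its children, so nothing nested is skipped.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: Node is a hypothetical stand-in, not HTMLParser's Node interface.
class ReplaceWhileRecursing {

    static final class Node {
        final String name;                        // tag name, or null for a text node
        final String text;                        // content of a text node
        final List<Node> children = new ArrayList<>();

        private Node(String name, String text) {
            this.name = name;
            this.text = text;
        }

        static Node tag(String name, Node... kids) {
            Node n = new Node(name, null);
            for (Node k : kids) {
                n.children.add(k);
            }
            return n;
        }

        static Node textNode(String text) {
            return new Node(null, text);
        }
    }

    // Serialize the tree, swapping every <object> subtree for an <img> placeholder.
    static void render(Node node, StringBuilder out) {
        if (node.name == null) {                  // text node: copy it through
            out.append(node.text);
        } else if (node.name.equals("object")) {  // the tag to replace, at any depth
            out.append("<img src=\"placeholder.gif\">");
        } else {                                  // any other tag: emit it and recurse
            out.append('<').append(node.name).append('>');
            for (Node child : node.children) {
                render(child, out);
            }
            out.append("</").append(node.name).append('>');
        }
    }

    static String toHtml(Node root) {
        StringBuilder out = new StringBuilder();
        render(root, out);
        return out.toString();
    }

    public static void main(String[] args) {
        Node doc = Node.tag("div",
                Node.tag("table", Node.tag("tr", Node.tag("td", Node.tag("object")))),
                Node.textNode("after"));
        System.out.println(toHtml(doc));
    }
}
```

Because every non-matching tag recurses unconditionally, a table inside a table is handled the same way as a top-level one.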
From: Randy P. <rtp...@gm...> - 2007-11-21 18:32:14
|
Hello,

I am trying to replace an <object> tag with an <img> tag while reading an HTML file and sending the HTML to an editor. It was working great unless the object tag was inside a <div> or nested inside some other tag. So on the mailing list I found a recursive function example, which I have modified. It is kind of working. As I go through the nodes, if a node is not an <object> tag I just add its node.toHtml() to the StringBuffer object that I am passing to the editor. If it is an object tag and a Flash object, I need to replace the object tag with an image tag. I am finding the object tag and replacing it OK, but as I traverse the document I am not adding everything to the StringBuffer object. For example, tables inside tables are not being added, and probably other tags within tags. Can anyone see something obvious that I am doing wrong?

thanks

    StringBuffer retStrBuf = new StringBuffer();

    public void parse (Parser parser, NodeIterator i) throws ParserException {
        Node node;
        Node subnode;
        if( i == null ) {
            i = parser.elements();
        }
        while( i.hasMoreNodes() ){
            node = i.nextNode();
            NodeList nl=null;
            if(node != null) {
                nl = node.getChildren();
                if ( nl != null ){
                    if (nl.extractAllNodesThatMatch(objFilter,false).size() > 0) {
                        NodeIterator x = nl.elements();
                        while ( x.hasMoreNodes() ){
                            subnode = x.nextNode();
                            if ( subnode instanceof ObjectTag){
                                ObjectTag ObjTag = (ObjectTag) subnode;
                                if ( ObjTag.getAttribute("classid") != null && ObjTag.getAttribute("classid").equals("clsid:d27cdb6e-ae6d-11cf-96b8-444553540000") ){
                                    //remove of verbose stuff
                                    ImageTag it = new ImageTag();
                                    //remove of somemore verbose stuff
                                    //System.out.println("--->"+it.toHtml());
                                    retStrBuf.append(it.toHtml());
                                }
                            }else{
                                //System.out.println("--->"+subnode.toHtml());
                                retStrBuf.append(subnode.toHtml());
                            }
                        }
                    }else{
                        parse(parser,nl.elements());
                    }
                }else{
                    //System.out.println(node.toHtml());
                    retStrBuf.append(node.toHtml());
                }
            }//end of null node
        }//end of while
    }//end of function |
From: Derrick O. <der...@ro...> - 2007-11-17 01:27:46
|
I don't know of an easy way. Just create a new NodeList with what you want (from the old children list and any new nodes) and set it as the new child list with setChildren (NodeList children). |
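The "build a new child list" idea can be sketched with a plain java.util.List standing in for NodeList (an assumption made for illustration; the helper name `withReplaced` is hypothetical and not part of HTMLParser). The replacement ends up at the same index the original occupied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: java.util.List stands in for HTMLParser's NodeList.
class ReplaceChildSketch {

    // Build a new child list in which every occurrence of 'target' is swapped
    // for 'replacement', keeping the original order and positions.
    static <T> List<T> withReplaced(List<T> children, T target, T replacement) {
        List<T> result = new ArrayList<>(children.size());
        for (T child : children) {
            result.add(child.equals(target) ? replacement : child);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList("<p>", "<object>", "<br>");
        System.out.println(withReplaced(children, "<object>", "<img>"));
    }
}
```

With the real library, the resulting list would then be handed to setChildren on the parent, as described in the reply.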
From: Randy P. <rtp...@gm...> - 2007-11-16 23:58:24
|
Hello,

I am parsing an HTML doc. When I find a particular tag (<object>.....</object>), I need to replace it with an image tag. I can see how to remove a child with children.remove(c), and there is a prepend, but I want to replace it in the same place in the tree. Any ideas?

thanks |
From: Randy P. <rtp...@gm...> - 2007-11-16 22:40:41
|
Thanks, that was it. |
From: Derrick O. <der...@ro...> - 2007-11-16 22:18:36
|
I think it's case sensitive - upper case that is. Try:

    private static final String[] mIds = new String[] {"PARAM"};

and:

    String [] tagsToBeFound = {"PARAM"}; |
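Why a lower-case id never matches can be pictured with a small lookup table. This is a sketch under the assumption, implied by the fix above, that parsed tag names are upper-cased before the lookup; the registry class and handler names are hypothetical and are not HTMLParser's implementation.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustration only: a hypothetical registry, not PrototypicalNodeFactory.
class TagRegistrySketch {
    private final Map<String, String> handlers = new HashMap<>();

    // Register a handler under the id exactly as given (no case folding).
    void register(String id, String handler) {
        handlers.put(id, handler);
    }

    // Look up by the parsed tag name, upper-cased first; fall back to a generic handler.
    String lookup(String parsedName) {
        return handlers.getOrDefault(parsedName.toUpperCase(Locale.ROOT), "Tag");
    }

    public static void main(String[] args) {
        TagRegistrySketch registry = new TagRegistrySketch();
        registry.register("param", "FlashParams");
        System.out.println(registry.lookup("param")); // falls back to the generic handler
        registry.register("PARAM", "FlashParams");
        System.out.println(registry.lookup("param")); // now finds the registered handler
    }
}
```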
From: Randy P. <rtp...@gm...> - 2007-11-16 21:21:21
|
From an HTML page, I am trying to get the param name and value pairs (the HTML snippet is below). I can find the ObjectTag. When I do

    NodeList flashchildren = ObjTag.getChildren();

all the flashchildren are of type Tag. So I thought I would make my own tag:

    public class FlashParams extends CompositeTag{
        private static final String[] mIds = new String[] {"param"};
        //bunch of other stuff deleted for this email
    }

Then I wrote this little test function:

    Parser parser = new Parser();
    parser.setInputHTML(ObjTag.toHtml());
    PrototypicalNodeFactory factory = new PrototypicalNodeFactory ();
    factory.registerTag (new FlashParams());
    parser.setNodeFactory (factory);
    String [] tagsToBeFound = {"param"};
    TagFindingVisitor visitor = new TagFindingVisitor (tagsToBeFound);
    parser.visitAllNodesWith (visitor);
    Node[] fpp = visitor.getTags(0);

but all the fpp nodes are of type "Tag". Any suggestions as to what I may be missing or doing wrong?

thanks

====================================================================================================
<object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000"
codebase="http://fpdownload.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=7,0,0,0"
id="SimpleGallery_AlbumSelect" name="SimpleGallery_AlbumSelect"
width="435" height="400" align="top">
<param name="play" value="false" />
<param name="loop" value="false" />
<param name="menu" value="false" />
<param name="quality" value="high" />
<param name="wmode" value="transparent" />
<param name="flashvars"
value="HTMLDirectory=kam&node=http://192.168.10.50:8080&path=/k/kam/"
/>
<embed type="application/x-shockwave-flash" play="false" loop="false"
menu="false" quality="high" wmode="transparent"
flashvars="HTMLDirectory=kam&node=http://192.168.10.50:8080&path=/k/kam/"
id="SimpleGallery_AlbumSelect" name="SimpleGallery_AlbumSelect"
src="/flashgadgets/SimpleGallery.swf" align="top" bgcolor="#333333"
width="435" height="400"></embed></object>
==================================================================================================== |
From: Derrick O. <der...@ro...> - 2007-11-16 12:15:46
|
Method 1 uses the ConnectionManager class which does some conditioning of the connection besides proxies - which I assume you aren't using. The code looks like this:

    HttpURLConnection http;

    if (getRedirectionProcessingEnabled ())
        http.setInstanceFollowRedirects (false);

    // set the fixed request properties
    properties = getRequestProperties ();
    if (null != properties)
        for (enumeration = properties.keys (); enumeration.hasMoreElements ();)
        {
            key = (String)enumeration.nextElement ();
            value = (String)properties.get (key);
            http.setRequestProperty (key, value);
        }

    // set the proxy name and password
    if ((null != getProxyUser ()) && (null != getProxyPassword ()))
    {
        auth = getProxyUser () + ":" + getProxyPassword ();
        encoded = encode (auth.getBytes("ISO-8859-1"));
        http.setRequestProperty ("Proxy-Authorization", "Basic " + encoded);
    }

    // set the URL name and password
    if ((null != getUser ()) && (null != getPassword ()))
    {
        auth = getUser () + ":" + getPassword ();
        encoded = encode (auth.getBytes("ISO-8859-1"));
        http.setRequestProperty ("Authorization", "Basic " + encoded);
    }

    if (getCookieProcessingEnabled ())
        // set the cookies based on the url
        addCookies (http);

Of these, it's probably the request properties that are supplied by default that change the returned page (unless you're doing something else different yourself). The default request properties are only two:

    "User-Agent", "HTMLParser/2.0"
    "Accept-Encoding", "gzip, deflate"

You can add these to your own URLConnection and see if that changes the returned page. |
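The experiment proposed in this reply, adding the two default request properties to your own URLConnection, might look like the sketch below. It uses only the JDK; the class and method names and the example URL are hypothetical, and `java.util.Base64` is swapped in for the library's own `encode` helper quoted above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustration only: mimics the two default request properties quoted above.
class RequestPropertiesSketch {

    // Create (but do not yet connect) a URLConnection carrying the same
    // User-Agent and Accept-Encoding values that ConnectionManager supplies.
    static URLConnection open(String urlString) {
        try {
            URLConnection connection = new URL(urlString).openConnection();
            connection.setRequestProperty("User-Agent", "HTMLParser/2.0");
            connection.setRequestProperty("Accept-Encoding", "gzip, deflate");
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Build the value of a Basic "Authorization" header, as in the quoted code,
    // using java.util.Base64 in place of the library's encode method.
    static String basicAuth(String user, String password) {
        String auth = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(auth.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        URLConnection connection = open("http://example.com/");
        System.out.println(connection.getRequestProperty("User-Agent"));
        System.out.println(basicAuth("user", "pass"));
    }
}
```

Note that once a client advertises gzip or deflate, the server may send a compressed body, so code that reads the stream directly has to be prepared to decompress it.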
From: Marcel <ta...@gm...> - 2007-11-16 05:33:49
|
Hi,

I used htmlparser to parse certain web pages, and I found something weird about Parser's two constructors. Say I have a urlString:

----------- method 1 ------------
    Parser parser = new Parser(urlString);

----------- method 2 ------------
    URL url = new URL(urlString);
    Parser parser = new Parser(url.openConnection());

These two methods get different page contents for the same urlString. Does anybody know the reason? What is the difference between those two constructors?

Thanks
-marcel |
From: James M. <jam...@a-...> - 2007-11-16 00:27:19
|
I found the solution to my problem!

    lNodes = lDocumentNodeList.extractAllNodesThatMatch(new TagNameFilter ("BODY"),true);

The BODY tag was buried underneath another element, and by default the boolean recursive flag is set to false, meaning that nested elements will not be returned. After setting this flag -- the second parameter -- to true, the problem was resolved and I was able to retrieve my content. Hope this helps someone!

--
James Mortensen
A-CTI Development Team |
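The effect of the recursive flag can be shown on a toy tree. This is a sketch, not HTMLParser code: `Node` and `extract` are hypothetical stand-ins that only mimic the behaviour described above, where a non-recursive search looks at the top-level nodes and a recursive one also descends into children.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: a toy tree, not HTMLParser's NodeList or TagNameFilter.
class RecursiveFilterSketch {

    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name, Node... kids) {
            this.name = name;
            children.addAll(Arrays.asList(kids));
        }
    }

    // Collect nodes whose name matches; descend into children only when
    // 'recursive' is true.
    static List<Node> extract(List<Node> nodes, String name, boolean recursive) {
        List<Node> found = new ArrayList<>();
        for (Node node : nodes) {
            if (node.name.equals(name)) {
                found.add(node);
            }
            if (recursive) {
                found.addAll(extract(node.children, name, true));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // The document's only top-level node is HTML; BODY is nested inside it.
        List<Node> top = Arrays.asList(new Node("HTML", new Node("HEAD"), new Node("BODY")));
        System.out.println(extract(top, "BODY", false).size());
        System.out.println(extract(top, "BODY", true).size());
    }
}
```

The non-recursive call finds nothing because BODY is not a top-level node, which matches the size of 0 reported earlier in the thread.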
From: James M. <jam...@a-...> - 2007-11-16 00:16:35
|
For more clarification, here is what I tried:

    Parser lParser = new Parser();
    try {
        lParser.setInputHTML(pHTML); //as instructed in the JavaDocs
    } catch(ParserException e) {
        mLogger.info("getContent():: Caught ParsingException...");
    }
    NodeList lDocumentNodeList;
    NodeList lNodes;
    try {
        lDocumentNodeList = lParser.parse (null); //I want to start with the entire document
        lNodes = lDocumentNodeList.extractAllNodesThatMatch (new TagNameFilter ("BODY")); //I want the BODY tag
        mLogger.info("lNodes.size() = " + lNodes.size()); //Using Log4J, I see that the size returned is 0 when it should be 1.
        if(lNodes.size() > 0) { //none of this code executes because size = 0
            String lText = lNodes.toString(); //I'm not sure if I'm doing this right or not, but until the NodeList problem is resolved I can't troubleshoot it
            String lasString = lNodes.asString();
            mLogger.info("lTExt = " + lText);
            mLogger.info("lasString = " + lasString);
        }
    } catch (ParserException e) {
        mLogger.info("ResponseParser:: Parsing exception caught.");
    }

Thanks again for your help.

--
James Mortensen
A-CTI Development Team |
From: James M. <jam...@a-...> - 2007-11-16 00:08:14
|
Hello,

I'm trying to pull the body content from an HTML String using your parsing utilities. The problem I'm having is not how to GET the HTML. I have the HTML stored in a String. I am using Web Services, and the content that I need is provided to me via third-party code as a String object. Therefore, I need your parser to take HTML as a String object, parse it for the body tag, and return the innerHTML of the body tag as a String.

Below is the content that I retrieve in a String object:

    <html><head></head>
    <body>Hello World</body>
    </html>

    String myHTML = myWebServices.getHTMLContent(); //this returns the above HTML in a String object
    .... ... ..
    //this is the missing piece, which is how to load the HTML into the parser and return the innerHTML of the BODY tag.
    ... ....
    String bodyContent = //This is the "Hello World" text that I'm looking for so that I can use it without the HTML.

The FAQ does not appear to address this question. Thanks in advance for your help in clearing up these issues.

James Mortensen

--
James Mortensen
A-CTI Development Team |
From: Derrick O. <der...@ro...> - 2007-11-13 11:58:12
|
You probably want the StringBean. The main() method of StringBean is an example of its use. |
From: cash c. <ca...@ya...> - 2007-11-13 06:07:44
|
Hi all,

I am new to htmlparser. I have downloaded it and tried a few examples. However, I am having a problem knowing the "correct way" to achieve my goal. I'm looking for a way to extract the body content from a web page, excluding all script sections. For example, given the following text

    <html>
    <head><title>title</title>
    <style> css style </style>
    </head>
    <body>
    Hello world
    <?php phpinfo() ?>
    </body>

the correct code should extract only "Hello world". Can anyone help me with this? Thanks in advance.

____________________________________________________________________________________
Be a better sports nut! Let your teams follow you with Yahoo Mobile. Try it now. http://mobile.yahoo.com/sports;_ylt=At9_qDKvtAbMuh1G1SQtBI7ntAcJ |
From: Derrick O. <der...@ro...> - 2007-11-12 22:18:28
|
Hmmm, works for me.
By default it puts the file in a .htmlparser directory in your home directory (e.g. C:\Documents and Settings\Derrick\.htmlparser).
Did you look there? |
From: Ali <to...@ya...> - 2007-11-12 17:47:53
|
Thanks for the help, but when I create a filter using FilterBuilder it doesn't generate the Java code for me with the simple save option. Please guide me on how I can generate the Java code.

__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com |
From: Derrick O. <der...@ro...> - 2007-11-12 15:58:18
|
There is a filter mechanism built into the HTML Parser.
Look into the FilterBuilder application, which will generate code to select a specific node, based on your input via a GUI. |
From: Derrick O. <der...@ro...> - 2007-11-12 15:56:14
|
It doesn't seem like htmlparser.jar is being found. You might need to add its path. The -classpath option has to come before the class name, and the path should also include the current directory so that WhoIs itself is still found:

  java -classpath ".;C:\path_to_jar\htmlparser.jar" WhoIs yahoo.com |
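A NoClassDefFoundError like the one reported in this thread usually means the JVM never saw the jar. The `java` launcher treats everything after the main class name as arguments to the program, so a `-classpath` placed after `WhoIs yahoo.com` is passed to `WhoIs` rather than to the JVM. The following is a minimal diagnostic sketch, not part of HTML Parser: the class name `ClasspathCheck` is hypothetical, and it only uses JDK calls to print the effective classpath and report whether a given class can be loaded.

```java
class ClasspathCheck {
    /** Returns true if the named class is visible to this class's loader. */
    static boolean isLoadable(String className) {
        try {
            // 'false' means: locate the class but do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Shows what the JVM actually received as its classpath.
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        // Default to the class named in the stack trace from this thread.
        String name = args.length > 0 ? args[0] : "org.htmlparser.beans.StringBean";
        System.out.println(name + (isLoadable(name) ? " found" : " NOT found"));
    }
}
```

Running it as `java -classpath ".;C:\path_to_jar\htmlparser.jar" ClasspathCheck` (with the real jar location substituted) should report the class as found; running it with the option after the class name should show a classpath that does not contain the jar.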
From: Derrick O. <der...@ro...> - 2007-11-12 15:54:34
|
As the example says:

  The following sample program illustrates the principles using a StringBean, but the same code could be used with a Parser by replacing the last three lines in the try block with:

    parser = new Parser ();
    parser.setConnection (connection);
    // ... do parser operations

Parser operations could be using a filter, visiting all nodes, examining each tag in the returned NodeList, etc. Your question is too nebulous to answer in more detail. |
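The POST half of this exchange can be sketched with JDK classes alone. This is an illustration under stated assumptions, not the FAQ's code: the class name `FormPost`, the helper names, and the form fields `q` and `lang` are all hypothetical. It builds a form-encoded body and sends it over an `HttpURLConnection`; the resulting connection is the object that, per the reply above, would be handed to `parser.setConnection (connection)`.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class FormPost {
    /** Encodes fields as application/x-www-form-urlencoded, e.g. a=1&b=2. */
    static String encodeForm(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    /** Sends the body as a POST and returns the connection, ready to be read. */
    static HttpURLConnection openPost(URL url, String body) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("POST");
        connection.setDoOutput(true); // required before writing a request body
        connection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = connection.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        return connection;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> fields = new LinkedHashMap<>(); // hypothetical form fields
        fields.put("q", "html parser");
        fields.put("lang", "en");
        String body = encodeForm(fields);
        System.out.println(body);
        if (args.length > 0) { // only touch the network when a URL is supplied
            HttpURLConnection connection = openPost(URI.create(args[0]).toURL(), body);
            // With htmlparser.jar on the classpath, the thread's suggestion is:
            //   parser = new Parser (); parser.setConnection (connection);
            System.out.println("HTTP " + connection.getResponseCode());
        }
    }
}
```

Keeping the encoding step separate from the network step makes the body easy to inspect before anything is sent, which helps when a server rejects the form.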
From: Ali <to...@ya...> - 2007-11-12 05:49:59
|
Is there any way to directly select a tag using some sort of addressing, e.g. "/html/body/h1" to select the H1 tag inside an HTML file, using HTML Parser? Thanks |
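The addressing style in this question, `/html/body/h1`, is XPath syntax. The reply on the list points to HTML Parser's own filter mechanism instead. For comparison only, here is a sketch of that path evaluated with the JDK's `javax.xml.xpath`. It is not HTML Parser's API, and it assumes the input is well-formed XHTML with no namespace declaration, which much real-world HTML is not. The class name `PathSelect` and the sample markup are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class PathSelect {
    /** Returns the text of the first node matched by the XPath expression. */
    static String selectText(String xhtml, String path) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // evaluate(String, Object) yields the string value of the first match,
            // or an empty string when nothing matches.
            return xpath.evaluate(path, doc);
        } catch (Exception e) {
            // Malformed markup or a bad expression ends up here.
            throw new IllegalArgumentException("Could not evaluate " + path, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Title</h1><p>text</p></body></html>";
        System.out.println(selectText(page, "/html/body/h1")); // prints "Title"
    }
}
```

For pages that are not well-formed, a tolerant parser such as HTML Parser with its filters remains the more practical route.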
From: Ali <to...@ya...> - 2007-11-12 05:44:04
|
I am unable to run the code from faq.html. I am getting the following errors:

  C:\project>java WhoIs yahoo.com -classpath htmlparser.jar
  Exception in thread "main" java.lang.NoClassDefFoundError: org/htmlparser/beans/StringBean
          at WhoIs.<init>(WhoIs.java:55)
          at WhoIs.main(WhoIs.java:80)

  C:\project>java WhoIs yahoo.com -classpath "htmlparser.jar"
  Exception in thread "main" java.lang.NoClassDefFoundError: org/htmlparser/beans/StringBean
          at WhoIs.<init>(WhoIs.java:55)
          at WhoIs.main(WhoIs.java:80)

Could someone kindly tell me what the problem is? |
From: Ali <to...@ya...> - 2007-11-12 05:04:42
|
Hi everyone! Thanks, but there is nothing about parsing in that example. Could someone kindly extend that example for me by including HTML parsing in it? Ali |
From: Derrick O. <der...@ro...> - 2007-11-12 01:38:49
|
The FAQ has an example of how to use POST: http://htmlparser.sourceforge.net/faq.html#post

----- Original Message ----
From: Ali <to...@ya...>
To: htm...@li...
Sent: Sunday, November 11, 2007 9:13:57 AM
Subject: [Htmlparser-user] Need help

Hi everyone! I am a new user of HTML Parser. I want to POST a form request and then parse the resultant HTML, but so far I have failed to do so. Ali |