htmlparser-user Mailing List for HTML Parser (Page 45)
Brought to you by:
derrickoswald
From: Madhur K. T. <mad...@gm...> - 2005-12-07 12:21:24
|
Hi,

I'm facing a problem using HTMLParser 1.6 (integration release) to parse an HTML document, described below. I'm using the getNextSibling and getPreviousSibling methods from the new Node interface to go backward and forward from a text node.

The snippet of the HTML page causing the problem is here (a table tag inside a body tag):

><body>
><TABLE WIDTH="651" CELLPADDING="0" CELLSPACING="0" BORDER="0"> <TR VALIGN="TOP"> <TD BGCOLOR="#FFFFFF" ALIGN="LEFT"> <FONT face="helvetica, arial" size="1">
><IMG SRC="http://www.comics.com/comics/dilbert/daily_dilbert/images/bullet2.gif" WIDTH="14" HEIGHT="11" ALT="" BORDER="0">
><A HREF="https://members.comics.com/members/registration/showDilbertLogin.do?aid=1" target="_blank"> Unsubscribe </A>/
><A HREF="https://members.comics.com/members/registration/showDilbertLogin.do?aid=1" target="_blank"
>> Modify </A></FONT></TD></TR></TABLE></body>

The code that I am using is as follows (in my custom visitor class):

>public void visitStringNode(Text string) {
>    if(string.getText().contains("Unsubscribe")) {
>        Node prevSibling = string; //.getPreviousSibling();
>        while(prevSibling != null) {
>            System.out.println("Prev Sibling " + prevSibling);
>            prevSibling = prevSibling.getPreviousSibling();
>        }
>
>        Node nextSibling = string;
>        while(nextSibling != null) {
>            System.out.println("Next Sibling " + nextSibling);
>            nextSibling = nextSibling.getNextSibling();
>        }
>    }
>}

However, the output that is seen when the code runs is as follows:

>String : Unsubscribe
>Prev Sibling Txt (389[3,100],402[3,113]): Unsubscribe
>Next Sibling Txt (389[3,100],402[3,113]): Unsubscribe

I expected that the parser would treat the <A> tag and the <IMG> just before the text "Unsubscribe" as siblings and would return those. Could you please tell me where I'm going wrong? Or is it that the parser is not correctly getting the siblings?

Thanks,
--
Madhur Kumar Tanwani
"If opportunity knocks only once then build more doors"......
|
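One possible reading of the output above, offered as an assumption rather than a confirmed diagnosis: in a parse tree the text "Unsubscribe" would be the only child of its <A> tag, so it has no siblings of its own, while the <IMG> and the second <A> would be siblings of the enclosing <A>. The sketch below illustrates that shape with a hypothetical TinyNode class (it is not HTMLParser's Node, and the tree is a simplified version of the snippet that leaves out the "/" text between the links).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal tree node; it only mimics parent/sibling links and is not org.htmlparser.Node. */
final class TinyNode {
    final String label;
    TinyNode parent;
    final List<TinyNode> children = new ArrayList<>();

    TinyNode(String label) { this.label = label; }

    /** Appends a child and returns this node so calls can be chained. */
    TinyNode add(TinyNode child) {
        child.parent = this;
        children.add(child);
        return this;
    }

    /** Sibling at the given offset (-1 = previous, +1 = next), or null if there is none. */
    TinyNode sibling(int offset) {
        if (parent == null) return null;
        int i = parent.children.indexOf(this) + offset;
        return (i >= 0 && i < parent.children.size()) ? parent.children.get(i) : null;
    }

    /** Builds FONT -> [IMG, A -> [text], A -> [text]] and returns the "Unsubscribe" text node. */
    static TinyNode sample() {
        TinyNode text = new TinyNode("Unsubscribe");
        TinyNode font = new TinyNode("FONT");
        font.add(new TinyNode("IMG"))
            .add(new TinyNode("A").add(text))
            .add(new TinyNode("A").add(new TinyNode("Modify")));
        return text;
    }

    public static void main(String[] args) {
        TinyNode text = sample();
        System.out.println(text.sibling(-1));              // null: the text is the only child of its A
        System.out.println(text.parent.sibling(-1).label); // IMG
        System.out.println(text.parent.sibling(+1).label); // A
    }
}
```

If the real tree has this shape, stepping up to the parent first and then asking for its previous and next siblings would reach the <IMG> and the neighbouring link.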
From: Ian M. <ian...@gm...> - 2005-12-07 12:11:23
|
I think the best way to explain it is with a diagram, so I've attached a quick one I've just written.

In the tree in the diagram, when currently at node a, the method I've written knows that the next node in the tree (breadth-first) is b. Depth-first (which I've not written but is easy to do from the breadth-first one) it would instead be c.

Likewise, if we are currently at node b and we want the previous node, using breadth-first search it is a, and using depth-first it would be d. Getting the previous node is not going to be anywhere near as common as getting the next node.

I'm afraid I'm not terribly familiar with how the filters in HTMLParser work as I've not really used them in my code, so I probably can't compare it directly. What I needed in my project was a way to traverse the complete document tree regardless of Tag type, run a comparison over each node based on the contents of an XML config file, and then do the same for its children and children's children with a defined recursion limit (which in my case was to be breadth-first), so for these reasons I couldn't easily use the existing filters.

Yes, I imagine, if it were to be included, that it would form part of a new class that accepts Nodes as parameters rather than being in the Node class itself, and as an alternative to the existing filters rather than a replacement for them.

By the way, the method also works on sub-trees, but I'm not sure if it works on a complete document where there is more than one root node (e.g. doctype and html, though you could definitely just pass it the html tag). It could probably be modified to take a NodeList instead, or now I think about it you could easily just create a new Node, set the document NodeList as its children, and pass that.

Ian Macfarlane

On 12/6/05, Derrick Oswald <der...@ro...> wrote:
> How is the search different than a filter?
> Would it be better integrated as an alternative to the normal filter
> processing.
|
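The attached diagram is not preserved in the archive, so the following is only a sketch of the two orders being contrasted, not Ian's code. It assumes a hypothetical node type N and a tree shape (root with children a and b, a with children c and d) chosen because it is consistent with the description: after a comes b breadth-first and c depth-first, and before b comes a breadth-first and d depth-first. Both traversals are non-recursive.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class Traversal {
    /** Hypothetical node type for illustration; not HTMLParser's Node. */
    static final class N {
        final String name;
        final List<N> kids = new ArrayList<>();
        N(String name, N... kids) {
            this.name = name;
            for (N k : kids) this.kids.add(k);
        }
    }

    /** Non-recursive breadth-first order using a FIFO queue. */
    static List<String> breadthFirst(N root) {
        List<String> out = new ArrayList<>();
        Deque<N> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            N n = queue.remove();
            out.add(n.name);
            queue.addAll(n.kids);   // children wait behind everything already queued
        }
        return out;
    }

    /** Non-recursive depth-first (pre-order) using a LIFO stack. */
    static List<String> depthFirst(N root) {
        List<String> out = new ArrayList<>();
        Deque<N> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            N n = stack.pop();
            out.add(n.name);
            // push children in reverse so the leftmost child is visited first
            for (int i = n.kids.size() - 1; i >= 0; i--) stack.push(n.kids.get(i));
        }
        return out;
    }

    /** Assumed tree: root -> (a -> (c, d), b). */
    static N sample() {
        return new N("root", new N("a", new N("c"), new N("d")), new N("b"));
    }

    public static void main(String[] args) {
        System.out.println(breadthFirst(sample())); // [root, a, b, c, d]
        System.out.println(depthFirst(sample()));   // [root, a, c, d, b]
    }
}
```

A "next node" or "previous node" lookup as described in the message could then be read off as the neighbour of the current node in the chosen order, at the cost of materialising the list first.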
From: Madhur K. T. <mad...@gm...> - 2005-12-07 12:11:02
|
Ian Macfarlane wrote:

>The next/previous sibling methods indeed solve this particular problem.

Hey thanks, this is great news. I did see that this release was available, but did not download it since it was not marked as stable.

>Would the project be interested in this code? I do have to check
>first, but I don't want to go asking my manager unless it's actually
>wanted :)

Thanks a lot for the offer, but I think I'll manage with the functionality that HTMLParser is providing for the moment. However, if needed, I'll get back to you. Thanks a lot.

OK... Now I'm facing a different problem with the parsing. I'll post something in a new thread.

Thanks,
--
__________________________
Madhur Kumar Tanwani
mad...@gm...
Ph.: 0253-5614792.
__________________________
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ A bus station is where a bus stops. A train station
+ is where train stops. On my desk, I have a work station...
+ What more can I say !
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
|
From: Daniel C. <dc...@fi...> - 2005-12-07 08:24:28
|
Thanks Derrick, now it works well.

Derrick Oswald wrote:

> I'm not sure if this is getting through (I keep getting errors
> reported from qmail), so I'm sending it again. Sorry if this is a repeat.
>
>Daniel,
>
>The StringBean class implements NodeVisitor, so use that instead of the
>TextExtractingVisitor.
>
>    String parsed;
>    Parser p = new Parser();
>    p.setInputHTML(data);
>    StringBean visitor = new StringBean();
>    p.visitAllNodesWith(visitor);
>    parsed = visitor.getStrings();
>
>Derrick
|
From: Derrick O. <der...@ro...> - 2005-12-07 00:38:09
|
How is the search different than a filter?
Would it be better integrated as an alternative to the normal filter processing?

Ian Macfarlane <ian...@gm...> wrote:

The next/previous sibling methods indeed solve this particular problem.

This reminds me, for the internal company project I was working on that used HTMLParser, I also wrote a breadth-first search algorithm that would fetch the next node in the complete tree (or a section of the tree) based on a breadth-first algorithm. I imagine it's probably not considered a proprietary part of the project I was working on, although I would probably have to clear it first with the company before releasing it to the project as it was one of the more complex parts and took me a large percentage of the development time.

It's non-recursive. I've not written a depth-first version (I feel both would need to be included as they may be needed in different situations) but that is easier than the breadth-first one, nor have I got a previous-node method, but I feel that could also be useful.

Would the project be interested in this code? I do have to check first, but I don't want to go asking my manager unless it's actually wanted :)

Ian Macfarlane

On 12/6/05, Derrick Oswald wrote:
> The latest Integration Build
> (http://sourceforge.net/forum/forum.php?forum_id=510668)
> has functionality to do this...
> get first/last child, previous/next sibling from Ian Macfarlane
>
> Madhur Kumar Tanwani wrote:
> Hi all,
> I'm implementing a custom parser for a HTML page. I'm using HTMLParser
> 1.5 to assist me in the same and it's great to use.
>
> There is this requirement (should I say so?), that I have which would
> greatly enhance my processing time and make coding for the same very easy :-
>
> - when I arrive on a Text node, I search the text for some predefined
> strings / content. In case this matches, I need to modify a link in
> close proximity of this Text node.
> - clearly, it is easy to find all nodes coming AFTER this string node.
> - however, finding nodes that were parsed just before this text node,
> for now requires me to maintain a record of the past tags and use the
> previous information as and when required.
> - I think, if there was some interface (like NodeIterator which provides
> the nextNode() function to get to the next node), which allows the
> parser / visitor to move to the previous node (something like
> prevNode()) would make the requirement mentioned above very simple to solve.
>
> I do not claim that the requirement is a genuine or critical one, but
> definitely it's a good addition to the parser power.
>
> In case you have already faced this problem (though my searches in the
> archives returned no results), please do direct me to any sources to
> solve the same.
>
> Any pointers to any solution hints would be greatly appreciated,
> Thanks,
>
> --
> Madhur Kumar Tanwani
> "Nothing is impossible..I do nothing"
|
From: Ian M. <ian...@gm...> - 2005-12-06 22:47:31
|
The next/previous sibling methods indeed solve this particular problem.

This reminds me, for the internal company project I was working on that used HTMLParser, I also wrote a breadth-first search algorithm that would fetch the next node in the complete tree (or a section of the tree) based on a breadth-first algorithm. I imagine it's probably not considered a proprietary part of the project I was working on, although I would probably have to clear it first with the company before releasing it to the project as it was one of the more complex parts and took me a large percentage of the development time.

It's non-recursive. I've not written a depth-first version (I feel both would need to be included as they may be needed in different situations) but that is easier than the breadth-first one, nor have I got a previous-node method, but I feel that could also be useful.

Would the project be interested in this code? I do have to check first, but I don't want to go asking my manager unless it's actually wanted :)

Ian Macfarlane

On 12/6/05, Derrick Oswald <der...@ro...> wrote:
> The latest Integration Build
> (http://sourceforge.net/forum/forum.php?forum_id=510668)
> has functionality to do this...
> get first/last child, previous/next sibling from Ian Macfarlane
>
> Madhur Kumar Tanwani <mad...@gm...> wrote:
> Hi all,
> I'm implementing a custom parser for a HTML page. I'm using HTMLParser
> 1.5 to assist me in the same and it's great to use.
>
> There is this requirement (should I say so?), that I have which would
> greatly enhance my processing time and make coding for the same very easy :-
>
> - when I arrive on a Text node, I search the text for some predefined
> strings / content. In case this matches, I need to modify a link in
> close proximity of this Text node.
> - clearly, it is easy to find all nodes coming AFTER this string node.
> - however, finding nodes that were parsed just before this text node,
> for now requires me to maintain a record of the past tags and use the
> previous information as and when required.
> - I think, if there was some interface (like NodeIterator which provides
> the nextNode() function to get to the next node), which allows the
> parser / visitor to move to the previous node (something like
> prevNode()) would make the requirement mentioned above very simple to solve.
>
> I do not claim that the requirement is a genuine or critical one, but
> definitely it's a good addition to the parser power.
>
> In case you have already faced this problem (though my searches in the
> archives returned no results), please do direct me to any sources to
> solve the same.
>
> Any pointers to any solution hints would be greatly appreciated,
> Thanks,
>
> --
> Madhur Kumar Tanwani
> "Nothing is impossible..I do nothing"
|
From: Derrick O. <der...@ro...> - 2005-12-06 16:44:11
|
I'm not sure if this is getting through (I keep getting errors reported from qmail), so I'm sending it again. Sorry if this is a repeat.

Daniel,

The StringBean class implements NodeVisitor, so use that instead of the TextExtractingVisitor.

    String parsed;
    Parser p = new Parser();
    p.setInputHTML(data);
    StringBean visitor = new StringBean();
    p.visitAllNodesWith(visitor);
    parsed = visitor.getStrings();

Derrick
|
From: Derrick O. <der...@ro...> - 2005-12-06 16:29:16
|
The latest Integration Build (http://sourceforge.net/forum/forum.php?forum_id=510668) has functionality to do this...
get first/last child, previous/next sibling from Ian Macfarlane

Madhur Kumar Tanwani <mad...@gm...> wrote:

Hi all,
I'm implementing a custom parser for a HTML page. I'm using HTMLParser 1.5 to assist me in the same and it's great to use.

There is this requirement (should I say so?), that I have which would greatly enhance my processing time and make coding for the same very easy :-

- when I arrive on a Text node, I search the text for some predefined strings / content. In case this matches, I need to modify a link in close proximity of this Text node.
- clearly, it is easy to find all nodes coming AFTER this string node.
- however, finding nodes that were parsed just before this text node, for now requires me to maintain a record of the past tags and use the previous information as and when required.
- I think, if there was some interface (like NodeIterator which provides the nextNode() function to get to the next node), which allows the parser / visitor to move to the previous node (something like prevNode()) would make the requirement mentioned above very simple to solve.

I do not claim that the requirement is a genuine or critical one, but definitely it's a good addition to the parser power.

In case you have already faced this problem (though my searches in the archives returned no results), please do direct me to any sources to solve the same.

Any pointers to any solution hints would be greatly appreciated,
Thanks,

--
Madhur Kumar Tanwani
"Nothing is impossible..I do nothing"
|
From: Madhur K. T. <mad...@gm...> - 2005-12-06 10:50:42
|
Hi all,

I'm implementing a custom parser for an HTML page. I'm using HTMLParser 1.5 to assist me with it, and it's great to use.

There is a requirement (should I say so?) that, if met, would greatly improve my processing time and make the coding very easy:

- When I arrive on a Text node, I search the text for some predefined strings / content. In case this matches, I need to modify a link in close proximity of this Text node.
- Clearly, it is easy to find all nodes coming AFTER this string node.
- However, finding nodes that were parsed just before this text node currently requires me to maintain a record of the past tags and use the previous information as and when required.
- I think that some interface (like NodeIterator, which provides the nextNode() function to get to the next node) which allows the parser / visitor to move to the previous node (something like prevNode()) would make the requirement mentioned above very simple to solve.

I do not claim that the requirement is a genuine or critical one, but it would definitely be a good addition to the parser's power.

In case you have already faced this problem (my searches in the archives returned no results), please do direct me to any sources to solve it.

Any pointers to any solution hints would be greatly appreciated.

Thanks,
--
Madhur Kumar Tanwani
"Nothing is impossible..I do nothing"
|
From: Pepelu l. <ag...@ya...> - 2005-12-04 17:48:31
|
Thanks a lot, this has been quite useful. I think I'll try both.

Rajat Sharma <rs...@ai...> wrote:

Use Jakarta's Commons HttpClient and use HtmlParser to do the rest. Way simpler.

-----Original Message-----
From: htm...@li... [mailto:htm...@li...] On Behalf Of Fairy Eneried
Sent: Tuesday, October 25, 2005 12:45 PM
To: htm...@li...
Subject: Re: [Htmlparser-user] Managing session with htmlparser

Just make a program to control it. Look at the code of programs like "The Grinder" and "Jakarta JMeter". Here are the links...
http://grinder.sourceforge.net/
http://jakarta.apache.org/jmeter/
good luck (-_^)
|
From: Daniel C. <dc...@fi...> - 2005-12-02 12:00:57
|
Thanks for your reply. I wrote this code:

    String parsed;
    Parser p = new Parser();
    p.setInputHTML(data);
    TextExtractingVisitor visitor = new TextExtractingVisitor();
    p.visitAllNodesWith(visitor);
    parsed = visitor.getExtractedText();

The result I obtain is good, except that in this case the JavaScript code is not omitted, whereas in the StringExtractor example it is. What do I need to do to get the same result?

Thanks for everything.

Daniel Cortes wrote:

> Hi everybody, I choose HTMLParser in my indexation of html documents
> to Lucene. How can I doif I have the code of html pages like and
> String in my BD, and I want to obtain all the good information
> (without css, tags,..)? I see the execution of StringExtractor and it
> works good ( this is that I want to obtain of my String that contains
> html, but it works by a URL or a file.
> Thanks for any reply, and excuse if my question is solved before in
> the mailling list.
|
From: Derrick O. <Der...@Ro...> - 2005-12-01 00:47:00
|
Use parser.setInputHTML (String).

Daniel Cortes wrote:

> Hi everybody, I choose HTMLParser in my indexation of html documents
> to Lucene. How can I doif I have the code of html pages like and
> String in my BD, and I want to obtain all the good information
> (without css, tags,..)? I see the execution of StringExtractor and it
> works good ( this is that I want to obtain of my String that contains
> html, but it works by a URL or a file.
> Thanks for any reply, and excuse if my question is solved before in
> the mailling list.
|
From: Daniel C. <dc...@fi...> - 2005-11-30 17:09:16
|
Hi everybody,

I chose HTMLParser for indexing HTML documents with Lucene. What should I do if I have the HTML source of the pages as a String in my database, and I want to obtain all the useful content (without CSS, tags, ...)? I ran StringExtractor and it works well (its output is what I want to obtain from my String containing HTML), but it works from a URL or a file.

Thanks for any reply, and apologies if my question has already been answered on the mailing list.
|
From: Rahul J. <rj...@ya...> - 2005-11-28 05:59:17
|
Hi,

I am using the following code to read and parse from a URL.

    String url = "http://www.os4depot.net/?function=showfile&file=game/roleplaying/cwmmoria.lha"; // example url
    StringExtractor se = new StringExtractor (url);
    content = se.extractStrings(false);

If the file at the URL is very large (several dozen MB), then I get java.lang.OutOfMemoryError: Java heap space. I can increase the memory using -Xmx, but I do not want to parse such a large file. Is there a way to skip files above some pre-defined size, or to break off reading such a file after some pre-defined time? If there isn't, then I think we can look at such options for future versions.

Thanks,
Rahul.

PS: HTML parser is a great API package and I appreciate the efforts put in by the developers!
|
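One workaround that does not depend on any parser feature is to fetch the bytes yourself with a cap and only hand the content to the parser if it fits. The sketch below shows just the capped read using the standard library; the class and method names (BoundedRead, readAtMost, fits) are made up for illustration, and the limit of 1024 bytes is an arbitrary example value. When the server reports a size, checking java.net.URLConnection.getContentLength() before reading may allow skipping even earlier, though it returns -1 when the length is unknown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class BoundedRead {
    /** Reads the stream fully, or throws as soon as more than maxBytes would be buffered. */
    static byte[] readAtMost(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            if (out.size() + n > maxBytes) {
                throw new IOException("content exceeds " + maxBytes + " bytes, skipping");
            }
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Convenience wrapper for the demo: true if the data can be read within the limit. */
    static boolean fits(byte[] data, int maxBytes) {
        try {
            readAtMost(new ByteArrayInputStream(data), maxBytes);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(fits(new byte[100], 1024));  // true
        System.out.println(fits(new byte[5000], 1024)); // false
    }
}
```

In real use the InputStream would come from the URL connection, and the resulting bytes would be decoded to a String and passed to the parser (for example with the setInputHTML method mentioned elsewhere on this page).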
From: Rahul J. <rj...@ya...> - 2005-11-27 19:57:09
|
P.S.: The reason why I want to know this is that even the 'unparseable' files are actually 'parsed' by the StringExtractor.extractStrings method without throwing an Exception, but the contents don't make sense (similar to opening the file in Notepad). The idea is to skip such files.

Thanks!

--- Rahul Joshi <rj...@ya...> wrote:

> Hi,
>
> For files like PDF, PS, DOC, etc. which are not
> HTML/XML/plain text, and cannot be parsed by HTML
> parser, is there a way to know that a file is of such
> type i.e., unparseable?
>
> One can always check for the extension but does the
> parser provide any method to check this? Or is there
> any other simpler way?
>
> Thanks,
> Rahul.
>
> --- Derrick Oswald <Der...@Ro...> wrote:
>
> > No. It's for html/xml only.
> >
> > prince prakash wrote:
> >
> > > hi friends,
> > > i need possible to extract text from the power point
> > > files,word file.Is it possible to extract text from the those files
> > > with html parser?if yes please give me the solution or how to extract
> > > from the from the ppt,ms word files.i would be very thankful if u
> > > provide the solution.
> > > regards,
> > > prakash.
|
From: Rahul J. <rj...@ya...> - 2005-11-27 19:48:26
|
Hi,

For files like PDF, PS, DOC, etc., which are not HTML/XML/plain text and cannot be parsed by HTML parser, is there a way to know that a file is of such a type, i.e., unparseable?

One can always check the extension, but does the parser provide any method to check this? Or is there any other simpler way?

Thanks,
Rahul.

--- Derrick Oswald <Der...@Ro...> wrote:

> No. It's for html/xml only.
>
> prince prakash wrote:
>
> > hi friends,
> > i need possible to extract text from the power point
> > files,word file.Is it possible to extract text from the those files
> > with html parser?if yes please give me the solution or how to extract
> > from the from the ppt,ms word files.i would be very thankful if u
> > provide the solution.
> > regards,
> > prakash.
|
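Besides the file extension, one check that works independently of the parser is the Content-Type reported by the server, which java.net.URLConnection.getContentType() exposes. The sketch below is only an assumption about what "parseable" should mean here (HTML, XHTML, XML and plain text); the class and method names are hypothetical, and servers can mislabel content, so treat it as a heuristic rather than a guarantee.

```java
import java.util.Locale;

final class ContentTypeCheck {
    /** True if a Content-Type header value looks like HTML, XML or plain text. */
    static boolean looksParseable(String contentType) {
        if (contentType == null) return false; // unknown type: treated as not parseable here
        // drop parameters such as "; charset=UTF-8" and normalise case
        String type = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return type.equals("text/html")
            || type.equals("text/plain")
            || type.equals("application/xhtml+xml")
            || type.equals("text/xml")
            || type.equals("application/xml");
    }

    public static void main(String[] args) {
        System.out.println(looksParseable("text/html; charset=UTF-8")); // true
        System.out.println(looksParseable("application/pdf"));          // false
        System.out.println(looksParseable("application/msword"));       // false
    }
}
```

A crawler could call this on the header value before handing the URL to the parser and skip anything that returns false.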
From: Derrick O. <Der...@Ro...> - 2005-11-23 12:58:35
|
You might want to look at the org.htmlparser.sax package, which provides a thin XML veneer over the parser.

Ian Macfarlane wrote:

>Please don't post multiple times. All the people here are volunteers
>and may not be able to answer questions quickly.
>
>With regards to your particular question, I'm afraid this isn't
>exactly a simple task, as the two languages are different. What you
>probably want to do is try converting HTML to XHTML as XHTML is a
>subset of XML and a valid XHTML document is also therefore valid XML.
>However, as I said, it's not entirely a simple task, as XHTML is much
>stricter than HTML is.
>
>I'm not 100% sure that the HTMLParser project is the best tool to use
>for this particular job, but I'll leave that to the people here who
>know the project's full capabilities better than myself as I'm only
>familiar with a subsection of the project.
>
>Ian
>
>On 11/20/05, grace <k65...@ms...> wrote:
>
>>I want to know coversion HTML to XML.
>>Thanks!
|
From: Ian M. <ian...@gm...> - 2005-11-22 12:32:30
|
Please don't post multiple times. All the people here are volunteers and may not be able to answer questions quickly. With regards to your particular question, I'm afraid this isn't exactly a simple task, as the two languages are different. What you probably want to do is try converting HTML to XHTML as XHTML is a subset of XML and a valid XHTML document is also therefore valid XML. However, as I said, it's not entirely a simple task, as XHTML is much stricter than HTML is. I'm not 100% sure that the HTMLParser project is the best tool to use for this particular job, but I'll leave that to the people here who know the project's full capabilities better than myself as I'm only familiar with a subsection of the project. Ian On 11/20/05, grace <k65...@ms...> wrote: > > I want to know coversion HTML to XML. > Thanks! |
From: grace <k65...@ms...> - 2005-11-20 20:28:48
|
I want to know how to convert HTML to XML. Thanks! |
From: Derrick O. <Der...@Ro...> - 2005-11-19 00:28:12
|
The link you're trying to access is the wiki, and it's been broken for a while... sorry.

You probably want to get the correct table (or the only table) using a filter (see the FilterBuilder program -- bin/filterbuilder), and then traverse the table:

    import org.htmlparser.filters.TagNameFilter;
    import org.htmlparser.nodes.TextNode;
    import org.htmlparser.tags.TableColumn;
    import org.htmlparser.tags.TableRow;
    import org.htmlparser.tags.TableTag;
    import org.htmlparser.util.NodeList;

    NodeList tables = parser.parse (new TagNameFilter ("TABLE"));
    for (int i = 0; i < tables.size (); i++)
    {
        TableTag table = (TableTag)tables.elementAt (i);
        TableRow[] rows = table.getRows ();
        for (int j = 0; j < rows.length; j++)
        {
            TableColumn[] columns = rows[j].getColumns ();
            for (int k = 0; k < columns.length; k++)
            {
                // works if the text is the only content of the cell; otherwise you have to be smarter
                TextNode text = (TextNode)columns[k].getChildren ().elementAt (0);
                System.out.println (text.getText ());
            }
        }
    }

Derrick |
From: grace <k65...@ms...> - 2005-11-18 20:44:11
|
I would like to see sample code that converts HTML to XML. Thanks! |
From: natasha s <mai...@ya...> - 2005-11-18 18:46:44
|
Hi,

This is Natasha. I need some help with the HTML Parser.

I was able to install it and get started with a small program. My project involves reading a web page that is the result of a search, parsing the HTML page, and getting the data within the TD tags.

<TD width = "80">2003-01-02</TD> <TD width = "80">12:51:16</TD> <TD width = "130">user</TD><TD width = "*">directory</TD><TD width = "200" >filename</TD><TD width = "100">version</TD<input type="checkbox"</TD></TR></TBODY></TABLE>....

So I need to retrieve the following values:

2003-01-02
12:51:16
user
directory
filename
version
The value of the checkbox field (whether checked or not)

I am trying to go through the JavaDoc, but it is very confusing as it does not have sample programs, and the sample program link is not working.

Does anyone have any quick start guides, or even some pointers as to how to accomplish this?

The following link does not work and fails with the following message.

lib/WikiDB/backend/PearDB.php:32: Fatal[256]: Can't connect to database: wikidb_backend_mysql: fatal database error
DB Error: connect failed
( [nativecode=Can't connect to MySQL server on 'mysql.sourceforge.net' (111)] ** mysql://htmlparser:XXX...@my.../htmlparser)

http://htmlparser.sourceforge.net/docs/

Any help or guidance is appreciated.

Thank you.
Natasha |
From: Derrick O. <Der...@Ro...> - 2005-11-08 23:29:49
|
You probably need to set some request parameters so the site thinks you are a browser. Either use ConnectionManager.setRequestProperties (), or manually create your URLConnection and pass it to the StringBean:

    import java.net.HttpURLConnection;
    import java.net.URL;
    import org.htmlparser.beans.StringBean;

    URL url = new URL ("http://yadda");
    HttpURLConnection connection = (HttpURLConnection)url.openConnection ();
    connection.setDoOutput (true);
    connection.setDoInput (true);
    connection.setUseCaches (false);
    // more or less of these may be required
    // see Request Header Definitions: http://www.ietf.org/rfc/rfc2616.txt
    connection.setRequestProperty ("Accept-Charset", "*");
    connection.setRequestProperty ("Referer", "http://Nadda");
    StringBean bean = new StringBean ();
    bean.setConnection (connection);
    String mText = bean.getStrings (); |
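Since "more or less of these may be required", it can help to keep the headers in one map and confirm they were attached before handing the connection to anything else. The sketch below uses only java.net; the class name BrowserLikeRequest, the open method, and the User-Agent string are made up for illustration. Note that setting headers only changes what the server is told about the client; no script on the page is executed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of HTMLParser.
final class BrowserLikeRequest {

    /** Creates (but does not connect) a connection carrying the given request headers. */
    static HttpURLConnection open(String address, Map<String, String> headers) {
        try {
            // openConnection() only builds the object; nothing is sent yet.
            HttpURLConnection connection =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
            connection.setUseCaches(false);
            for (Map.Entry<String, String> header : headers.entrySet()) {
                connection.setRequestProperty(header.getKey(), header.getValue());
            }
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        // Placeholder values; use whatever the target site actually expects.
        headers.put("User-Agent", "Mozilla/5.0 (compatible; ExampleFetcher/1.0)");
        headers.put("Accept-Charset", "*");
        headers.put("Referer", "http://Nadda");

        HttpURLConnection connection = open("http://yadda", headers);
        // Reading the properties back works before any network traffic happens.
        for (String name : headers.keySet()) {
            System.out.println(name + ": " + connection.getRequestProperty(name));
        }
    }
}
```

The returned connection can then be passed to StringBean.setConnection as in the message above.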
From: R, A. K. \(Cognizant\) <Ana...@co...> - 2005-11-08 16:25:36
|
Hi,

When I run the string extractor on a particular site, the only content extracted is a message asking me to enable JavaScript.

I am running the string extractor from Java code. Please let me know how to enable JavaScript from Java code, or whether there is a java command-line option (like the -D option) that enables scripts when running the program.

- Anand |
From: Rajat S. <rs...@ai...> - 2005-11-08 15:04:59
|
Thanks, I will try it and hopefully it works out fine. Thanks, Derrick and others.

-----Original Message-----
From: htm...@li... [mailto:htm...@li...] On Behalf Of Derrick Oswald
Sent: Monday, November 07, 2005 5:59 PM
To: htm...@li...
Subject: Re: [Htmlparser-user] How to get Parse read a stream.

The Page class has a constructor that could be useful:

    public Page (InputStream stream, String charset)

The page is then passed into the Lexer constructor and the Lexer passed into the Parser constructor.

Rajat Sharma wrote:
> Hi Guys,
>
> I am using httpClient to get the http data from a webServer. I am using inputStream to get the response. This inputStream data is being written into a local file which is then read by the html Parser.
>
> I don't want to write the local file for the html parser. Is there a way I can use \ redirect the input Stream object.
>
> Could someone help.
>
> Thanks,
> Rajat |