htmlparser-user Mailing List for HTML Parser (Page 10)
Brought to you by:
derrickoswald
From: Derrick O. <der...@gm...> - 2012-01-05 15:19:12
|
Hi Steve,

The HTTP header information can be inspected using the ConnectionMonitor and/or ConnectionManager classes, but there are often misconfigured or malicious web servers that report one MIME type in the header while serving up a different type in the content.

Derrick

On Wed, Jan 4, 2012 at 8:51 PM, Stefan Schindler <sch...@gm...> wrote:
> Hi,
> I was wondering, if there is the possibility to check, IF the file to
> inspect is a html file (and not, for instance, pdf).
>
> Greets,
> Steve
|
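Because the header and the content can disagree, one option is to check both. The following is a minimal sketch using only the JDK, not HTML Parser's ConnectionMonitor or ConnectionManager; the class name `HtmlSniffer` and both method names are made up for illustration, and the content check is a rough heuristic rather than a full MIME sniffer.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper: decide whether a response is HTML from its header and its first bytes. */
class HtmlSniffer {

    /** True if the Content-Type header value names an HTML media type. */
    static boolean headerSaysHtml(String contentType) {
        if (contentType == null) {
            return false; // no header at all: cannot confirm HTML
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String type = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return type.equals("text/html") || type.equals("application/xhtml+xml");
    }

    /** True if the first bytes of the body look like markup rather than, say, a PDF. */
    static boolean bodyLooksLikeHtml(byte[] head) {
        String start = new String(head, StandardCharsets.ISO_8859_1)
                .trim().toLowerCase(Locale.ROOT);
        if (start.startsWith("%pdf-")) {
            return false; // PDF files begin with the "%PDF-" signature
        }
        return start.startsWith("<!doctype html") || start.contains("<html");
    }
}
```

With a plain `java.net.URLConnection`, the header value would come from `getContentType()` and the first few hundred bytes from `getInputStream()`; treating the resource as HTML only when both checks agree is the conservative choice.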
From: Derrick O. <der...@gm...> - 2012-01-05 15:15:22
|
Hi Ido,

The project is not very active any more. Version 2.1 was an upgrade for people building with Maven, but had no substantial changes. It's a remarkably stable project: there were over 60,000 downloads last year and only 8 opened tickets.

It seems you have already started editing the code locally. If you upload the patch (you seem to have a specific line in a specific file) to the patches area, it can be tracked and others can benefit from your effort. In the fullness of time it may be incorporated into a release. Alternatively, if you log a bug with the failing test case, that could also help.

http://sourceforge.net/tracker/?group_id=24399

Derrick

On Wed, Jan 4, 2012 at 4:29 PM, Ido Barav <ido...@sy...> wrote:
> I'm trying to use stringbean to extract text from a short html.
>
> I have the following problem:
> When looking at an html that starts with 1 letter in one paragraph, and
> then it ends and another paragraph starts, then a CR is not added.
> I think the carriagereturn adding function has a bug there (It should be an
> || instead of the second &&).
>
> My questions are:
> 1. Is the project still active? I've seen a 2.1 version hidden
> somewhere, but can't see any update on the sourceforge update. (I don't want
> to start installing patches and editing the code locally).
> 2. I actually wish to read an html and when encountering a text tag,
> extract the text from it, while using the text editing capabilities of
> StringBean. Is there any good way to do this?
>
> Thanks,
> Ido
|
From: Stefan S. <sch...@gm...> - 2012-01-04 19:51:22
|
Hi, I was wondering, if there is the possibility to check, IF the file to inspect is a html file (and not, for instance, pdf). Greets, Steve |
From: Ido B. <ido...@sy...> - 2012-01-04 15:45:14
|
I'm trying to use stringbean to extract text from a short html.

I have the following problem:
When looking at an html that starts with 1 letter in one paragraph, and then it ends and another paragraph starts, then a CR is not added.
I think the carriagereturn adding function has a bug there (It should be an || instead of the second &&).

My questions are:
1. Is the project still active? I've seen a 2.1 version hidden somewhere, but can't see any update on the sourceforge update. (I don't want to start installing patches and editing the code locally).
2. I actually wish to read an html and when encountering a text tag, extract the text from it, while using the text editing capabilities of StringBean. Is there any good way to do this?

Thanks,
Ido
|
From: Jessop, I. R <isa...@hp...> - 2011-10-17 21:04:35
|
The onclick event triggers JavaScript, in this case a function called redirectUrl, and passes it a reference to the HTML element, in this case the image tag.

My guess (and it is only a guess, as I don't have the source of the page you are parsing) is that this function handles all the image clicks on the page and uses the reference passed to determine what URL to redirect to.

In order for you to "perform the onclick" you would need to parse out the JavaScript function, determine what action it would take (URL redirect) when this image is clicked, and then do the redirect yourself.

Isaac Jessop

From: tubin gen [mailto:fac...@gm...]
Sent: Monday, October 17, 2011 1:51 PM
To: htm...@li...
Subject: [Htmlparser-user] performing onclick

I was using HTML Parser to parse some HTML. My HTML has an image; here is the HTML:

<img src="Repository/Movie%20Section/Telugu%20Movies/Gundamma%20G-Gundamma%20Gari%20Krishnulu%20VCD_T.jpg" id="Movies_dlMovies_ctl14_imgMovieImage" class="imgStyle" onclick="javascript:return redirectUrl(this);" alt="Gundamma Gari Krishnulu">

This img tag has an onclick function, so when I click the image a new page is opened whose URL is not in the HTML but is generated by the function. Using htmlparser, can I perform the onclick on this image? If not, what library should I use to perform the onclick?
|
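The first step described above, finding out which JavaScript function the handler calls, can be sketched with plain regular expressions. This is only an illustration under assumptions: `OnclickExtractor` and its methods are hypothetical names, the pattern handles a single tag with a double-quoted onclick attribute, and it does not execute any JavaScript. Inside HTML Parser the attribute value would more likely be read from the tag node itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pull the onclick handler and the called function name out of one tag. */
class OnclickExtractor {

    // Captures the text between the double quotes of an onclick attribute.
    private static final Pattern ONCLICK =
            Pattern.compile("onclick\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Captures the first identifier that is directly followed by an opening parenthesis.
    private static final Pattern CALL = Pattern.compile("([A-Za-z_$][\\w$]*)\\s*\\(");

    /** Returns the raw handler text, or null if the tag has no onclick attribute. */
    static String handlerOf(String tagHtml) {
        Matcher m = ONCLICK.matcher(tagHtml);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the name of the first function called in the handler, or null if none. */
    static String functionNameOf(String handler) {
        Matcher m = CALL.matcher(handler);
        return m.find() ? m.group(1) : null;
    }
}
```

Knowing the function name only tells you where to look; working out the target URL still means reading that function's source in the page's scripts, as the reply explains.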
From: tubin g. <fac...@gm...> - 2011-10-17 20:51:18
|
I was using HTML Parser to parse some HTML. My HTML has an image; here is the HTML:

<img src="Repository/Movie%20Section/Telugu%20Movies/Gundamma%20G-Gundamma%20Gari%20Krishnulu%20VCD_T.jpg" id="Movies_dlMovies_ctl14_imgMovieImage" class="imgStyle" onclick="javascript:return redirectUrl(this);" alt="Gundamma Gari Krishnulu">

This img tag has an onclick function, so when I click the image a new page is opened whose URL is not in the HTML but is generated by the function. Using htmlparser, can I perform the onclick on this image? If not, what library should I use to perform the onclick?
|
From: Asutosh P. <as...@gm...> - 2011-10-04 05:32:37
|
Hi, zhouyang

I don't know what your requirement is, but I used HTML Parser to get the body content of an HTML file. The code sample is below. Hope it will help you.

******************************************************************
    public static String getBodyOfResumeAsText(String path) {
        final String METHOD_NAME = "getBodyOfResumeAsText :";
        String plainText = "";
        NodeFilter filter = null;
        Parser parser = null;
        try {
            parser = new Parser(path);
            // Keep only the <body> element and take its text.
            filter = new TagNameFilter ("body");
            plainText = parser.parse(filter).asString();
            // Turn every whitespace character into a space, then collapse runs of spaces.
            plainText = plainText.replaceAll("\\r\\n|\\r|\\n|\\s|\\s+", " ");
            plainText = plainText.replaceAll(" {2,}", " ");
            logger.debug(CLASS_NAME + METHOD_NAME + ":generating plainText for :" + path);
            logger.debug(CLASS_NAME + METHOD_NAME + ":plainText :" + plainText);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return plainText;
    }
******************************************************************

2011/10/3 <zho...@si...>
> Hello,
>
> There is a sentence "Although some example programs are provided that
> may be useful as they stand..." on HTML Parser home page. But I can't
> find the example programs in the Web site HTML Parser. Could you send a link
> or the examples' src to me? Thank you very much.
>
> I'm a Chinese student and my English is poor, if there is something
> wrong in my email, please forgive me. I will try my best to improve my
> English.
>
> I think HTML Parser is great, it gives me so much help. Thank you very
> much again.
>
> Zhou Yang
>
> Oct 3 2011

Thanks & Regards
Asutosh
|
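The two replaceAll calls in the sample above first turn every whitespace character into a space and then collapse runs of spaces. A small JDK-only sketch, with the hypothetical class name `WhitespaceNormalizer`, shows that a single `\s+` replacement yields the same string for ordinary input, which may be easier to read.

```java
/** Hypothetical helper comparing two ways of collapsing whitespace to single spaces. */
class WhitespaceNormalizer {

    /** The two-step approach from the posted sample. */
    static String twoStep(String s) {
        return s.replaceAll("\\r\\n|\\r|\\n|\\s|\\s+", " ")
                .replaceAll(" {2,}", " ");
    }

    /** One pass: every run of whitespace becomes exactly one space. */
    static String oneStep(String s) {
        return s.replaceAll("\\s+", " ");
    }
}
```

Neither version trims the ends, so leading or trailing whitespace survives as a single space; a final `trim()` would remove it if that matters.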
From: Derrick O. <der...@gm...> - 2011-10-03 19:00:44
|
Just check for:

    public static void main (String[] args)

signatures in the source code.

2011/10/3 <zho...@si...>
> Hello,
>
> There is a sentence "Although some example programs are provided that
> may be useful as they stand..." on HTML Parser home page. But I can't
> find the example programs in the Web site HTML Parser. Could you send a link
> or the examples' src to me? Thank you very much.
>
> I'm a Chinese student and my English is poor, if there is something
> wrong in my email, please forgive me. I will try my best to improve my
> English.
>
> I think HTML Parser is great, it gives me so much help. Thank you very
> much again.
>
> Zhou Yang
>
> Oct 3 2011
|
From: <zho...@si...> - 2011-10-03 13:37:27
|
Hello,

There is a sentence "Although some example programs are provided that may be useful as they stand..." on HTML Parser home page. But I can't find the example programs in the Web site HTML Parser. Could you send a link or the examples' src to me? Thank you very much.

I'm a Chinese student and my English is poor, if there is something wrong in my email, please forgive me. I will try my best to improve my English.

I think HTML Parser is great, it gives me so much help. Thank you very much again.

Zhou Yang

Oct 3 2011
|
From: Derrick O. <der...@gm...> - 2011-08-18 18:40:42
|
Did you try the StringBean? Same code except:

    StringBean visitor = new StringBean ();
    parser.visitAllNodesWith (visitor);
    String textInPage = visitor.getStrings ();

Or you can use some of the other facilities (for example, it will make its own parser if you don't want to), as shown in the mainline:

    StringBean sb = new StringBean ();
    sb.setLinks (false);
    sb.setReplaceNonBreakingSpaces (true);
    sb.setCollapse (true);
    sb.setURL (args[0]);
    System.out.println (sb.getStrings ());

On Wed, Aug 17, 2011 at 10:25 PM, ernest cronin <ern...@gm...> wrote:
> Hi,
>
> I have been trying to use the parser for some time and I have been unable
> to get it to do exactly what I want, which is to gather only the plaintext
> without javascript or style stuff. Here is the code I've been running:
>
> public class Test
> {
>     public static void main (String[] args)
>     {
>         try
>         {
>             Parser parser = new Parser (args[0]);
>             TextExtractingVisitor visitor = new TextExtractingVisitor();
>             parser.visitAllNodesWith(visitor);
>             String textInPage = visitor.getExtractedText();
>             System.out.println(textInPage);
>         }
>         catch (ParserException pe)
>         {
>             pe.printStackTrace ();
>         }
>     }
> }
>
> I could really use some help with this!
>
> Thanks,
> Ernest
|
From: ernest c. <ern...@gm...> - 2011-08-17 20:25:40
|
Hi,

I have been trying to use the parser for some time and I have been unable to get it to do exactly what I want, which is to gather only the plaintext without javascript or style stuff. Here is the code I've been running:

    public class Test
    {
        public static void main (String[] args)
        {
            try
            {
                Parser parser = new Parser (args[0]);
                TextExtractingVisitor visitor = new TextExtractingVisitor();
                parser.visitAllNodesWith(visitor);
                String textInPage = visitor.getExtractedText();
                System.out.println(textInPage);
            }
            catch (ParserException pe)
            {
                pe.printStackTrace ();
            }
        }
    }

I could really use some help with this!

Thanks,
Ernest
|
From: Derrick O. <der...@gm...> - 2011-08-08 18:11:34
|
I don't think it's possible to help without a stack trace. Are you sure you are checking for null if there are no links returned?

On Mon, Aug 8, 2011 at 4:08 PM, Krishna Arjun <kri...@gm...> wrote:
> hi,
>
> this is regarding java.lang.NullPointerException
>
> I am extracting URLs using LinkBean:
>
>     LinkBean lb = new LinkBean ();
>     lb.setURL ("http://www.puszta.pl");
>     URL[] urls = lb.getLinks ();
>
> Instead of "http://www.puszta.pl" I am giving input from a DB. I repeatedly
> execute the above code to extract the URLs of each website name from the DB.
> It runs well for around 1500 inputs, but beyond that it throws
> java.lang.NullPointerException.
>
> I have been trying to fix this problem for the last week without success. I
> would be grateful if you could provide a solution.
>
> Thanks indeed,
|
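The null check asked about above can be sketched as follows. This assumes, as the NullPointerException in the thread suggests but does not prove, that the link lookup can hand back null instead of an empty array when a page cannot be fetched or parsed; `LinkGuard` and its methods are hypothetical names and use only the JDK.

```java
import java.net.URL;

/** Hypothetical guard for link arrays that may be null. */
class LinkGuard {

    /** Returns the array unchanged, or an empty array when the lookup produced null. */
    static URL[] orEmpty(URL[] links) {
        return links != null ? links : new URL[0];
    }

    /** Counts links without risking a NullPointerException on a failed lookup. */
    static int countLinks(URL[] links) {
        return orEmpty(links).length;
    }
}
```

In the loop from the thread this would read `for (URL url : LinkGuard.orEmpty(lb.getLinks ()))`, so a site that yields no result is skipped rather than aborting a run over many inputs. Logging which input produced the null would also provide the context requested in the reply.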
From: Krishna A. <kri...@gm...> - 2011-08-08 14:15:31
|
Marcin <bigger <at> op.pl> writes:
>
> Dear Derrick,
>
> > >I get the following error:
> > >
> > >org.htmlparser.util.EncodingChangeException: character mismatch (new: ? != old: ¬)
> > >for encoding change from ISO-8859-2 to ISO-8859-1 at character offset 4162
> > >Output from LinkExtractor example.
> > >
> > >If I'll try-catch it I won't get any resoult. What can I do with it?
>
> > The exception is thrown because some of the nodes already given out are
> > in error. You can try a second time after discarding the information
> > you've gained so far, like StringBean does:
>
> Thank you for answer but I it's no good solution :( Please try LinkBean
> example with that code:
>
> import java.net.URL;
> import org.htmlparser.beans.LinkBean;
>
> public class LinkDemo
> {
>     public static void main (String[] args)
>     {
>         LinkBean lb = new LinkBean ();
>         lb.setURL ("http://www.puszta.pl");
>         URL[] urls = lb.getLinks ();
>         for (int i = 0; i < urls.length; i++)
>             System.out.println (urls[i]);
>     }
> }
>
> Exception in thread "main" java.lang.NullPointerException
>     at LinkDemo.main(LinkDemo.java:11)
>
> I can deal with that page with low level lexer but there must by a way to
> extract links from pages with mixed up encodings with NodeVisitor. Is it?
>
> Greets,
> B

hi,

this is regarding java.lang.NullPointerException

I am extracting URLs using LinkBean:

    LinkBean lb = new LinkBean ();
    lb.setURL ("http://www.puszta.pl");
    URL[] urls = lb.getLinks ();

Instead of "http://www.puszta.pl" I am giving input from a DB. I repeatedly execute the above code to extract the URLs of each website name from the DB. It runs well for around 1500 inputs, but beyond that it throws java.lang.NullPointerException.

I have been trying to fix this problem for the last week without success. I would be grateful if you could provide a solution.

Thanks indeed,
|
From: Duh ¨ <edu...@ho...> - 2011-08-01 21:24:36
|
Hello,

I've been trying to set up and use the SiteCapturer with proxy settings. To do so I use this:

    ConnectionManager manager = new ConnectionManager ();
    manager.setProxyHost("...");
    manager.setProxyPort(8080);
    manager.setProxyUser("...");
    manager.setProxyPassword("...");
    mParser.setConnectionManager(manager);

But all I've got so far is this message:

    org.htmlparser.util.ParserException: Connection timed out: connect;
    java.net.ConnectException: Connection timed out: connect

How should I proceed to use the SiteCapturer application with a proxy?

Thanks
|
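A "Connection timed out: connect" error can mean the proxy host and port themselves are unreachable from the machine running the code. One way to rule that out, independent of HTML Parser, is a plain socket probe. The sketch below uses only the JDK; `ProxyProbe`, its method names, and the host and timeout values in the usage note are assumptions for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.Socket;

/** Hypothetical helper for checking proxy settings outside of HTML Parser. */
class ProxyProbe {

    /** Builds a JDK HTTP proxy description without doing a DNS lookup. */
    static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    /** True if a TCP connection to host:port succeeds within the timeout. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException | IllegalArgumentException e) {
            // Unknown host, refused or timed-out connection, or invalid port.
            return false;
        }
    }
}
```

For example, `ProxyProbe.canConnect("proxy.example", 8080, 3000)` returning false would point at the network or the proxy address rather than at the ConnectionManager configuration.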
From: Derrick O. <der...@gm...> - 2011-07-31 13:51:21
|
Using the FilterBuilder tool <http://htmlparser.sourceforge.net/samples.html> is a good way to play with filters. Using that for a minute I got this code, which fetches your storybook text:

    import org.htmlparser.*;
    import org.htmlparser.filters.*;
    import org.htmlparser.beans.*;
    import org.htmlparser.util.*;

    public class StorytextFilter
    {
        public static void main (String args[])
        {
            // Match <div> tags ...
            TagNameFilter filter0 = new TagNameFilter ();
            filter0.setName ("DIV");
            // ... whose id attribute has the value "storytext".
            HasAttributeFilter filter1 = new HasAttributeFilter ();
            filter1.setAttributeName ("id");
            filter1.setAttributeValue ("storytext");
            NodeFilter[] array0 = new NodeFilter[2];
            array0[0] = filter0;
            array0[1] = filter1;
            AndFilter filter2 = new AndFilter ();
            filter2.setPredicates (array0);
            NodeFilter[] array1 = new NodeFilter[1];
            array1[0] = filter2;
            FilterBean bean = new FilterBean ();
            bean.setFilters (array1);
            if (0 != args.length)
            {
                bean.setURL (args[0]);
                System.out.println (bean.getNodes ().toHtml ());
            }
            else
                System.out.println ("Usage: java -classpath .;htmlparser.jar;htmllexer.jar StorytextFilter <url>");
        }
    }

Then you can apply the StringBean to the NodeList using the visitor pattern.

2011/7/30 Jan Sokołowski <net...@gm...>
> Thanks for answering! However, I'm afraid it didn't help me much :(
>
> So, all I've changed in the code is the nodeFilter object (now
> constructed as new AndFilter(new TagNameFilter("div"), new
> HasAttributeFilter("storytext"));)
> Then, I do the
>     for(NodeIterator e = parser.elements(); e.hasMoreNodes();){
>         e.nextNode().collectInto(nodeList, nodeFilter);
>     }
>
> And according to nodeList.toNodeArray().length, there are no matching
> nodes.
>
> Therefore, I don't have anything to pass to anything you've said, not
> to mention I don't know, for example, what a StringBean is (that
> means, I've read the javadoc on your page, but I don't have the
> foggiest idea how to use it there) (And why couldn't I use the
> toPlainTextString() method? I'd like to get the inner HTML of div
> without removing any tags there, which StringBean removes, as I've
> noticed, unless I've misunderstood it) :(
> I'd be very thankful if you could elaborate more on what should I do
> there to make it work, please.
>
> By the way, how do I respond to the posts on that mailing list? I
> can't find the response option anywhere?
|
From: Jan S. <net...@gm...> - 2011-07-30 20:21:07
|
Thanks for answering! However, I'm afraid it didn't help me much :(

So, all I've changed in the code is the nodeFilter object (now constructed as new AndFilter(new TagNameFilter("div"), new HasAttributeFilter("storytext"));)
Then, I do the

    for(NodeIterator e = parser.elements(); e.hasMoreNodes();){
        e.nextNode().collectInto(nodeList, nodeFilter);
    }

And according to nodeList.toNodeArray().length, there are no matching nodes.

Therefore, I don't have anything to pass to anything you've said, not to mention I don't know, for example, what a StringBean is (that means, I've read the javadoc on your page, but I don't have the foggiest idea how to use it there). (And why couldn't I use the toPlainTextString() method? I'd like to get the inner HTML of the div without removing any tags there, which StringBean removes, as I've noticed, unless I've misunderstood it.) :(
I'd be very thankful if you could elaborate more on what I should do there to make it work, please.

By the way, how do I respond to the posts on that mailing list? I can't find the response option anywhere.
|
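A likely reason for the empty result: if the single-argument HasAttributeFilter constructor takes an attribute name (the separate setAttributeName and setAttributeValue setters elsewhere in this thread suggest name and value are distinct), then HasAttributeFilter("storytext") looks for an attribute called storytext, which the div does not have. The distinction can be sketched without the library, modelling a tag's attributes as a map; `AttributeMatch` and its methods are hypothetical stand-ins, not HTML Parser classes.

```java
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical stand-in for "has attribute" filters over a tag's attribute map. */
class AttributeMatch {

    /** Matches when an attribute with this name is present, whatever its value. */
    static Predicate<Map<String, String>> hasAttribute(String name) {
        return attrs -> attrs.containsKey(name);
    }

    /** Matches when the named attribute is present and has exactly this value. */
    static Predicate<Map<String, String>> hasAttribute(String name, String value) {
        return attrs -> value.equals(attrs.get(name));
    }
}
```

For a div with id="storytext" and class="storytext", the name-only predicate for "storytext" rejects the tag, while the name-and-value predicate for ("id", "storytext") accepts it. In HTML Parser terms that would correspond to filtering on the attribute name id with the value storytext.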
From: Derrick O. <der...@gm...> - 2011-07-30 06:14:22
|
You should maybe filter for

    new AndFilter (new TagNameFilter("div"), new HasAttributeFilter("storytext"))

and then pass the resulting (single) node to the StringBean for extracting the text:

    nodelist.visitAllNodesWith (stringbean)

The contents of the string bean after that should be the text you're looking for.

2011/7/29 Jan Sokołowski <net...@gm...>
> I've got a small problem there, and I'd like to ask you to help me, please.
> Ok, so I'm trying to use HTMLParser in my project, and there's the problem.
> Example page that I'm trying to process:
> http://www.fanfiction.net/s/7229512/1/A_Horse_With_No_Name
>
> Looking at the source code, there's a div with id and class
> 'storytext' within a div with id and class 'storytextp', and there's a
> lot of <p> tags within the 'storytext' div. I want to extract the
> contents of that 'storytext' div to plain text string.
> That's what I'm trying to do:
>     NodeList nodeList = new NodeList();
>     NodeFilter nodeFilter = new AndFilter(new TagNameFilter("div"),
>             new HasChildFilter(new TagNameFilter("p")));
>
>     for(NodeIterator e = parser.elements(); e.hasMoreNodes();){
>         e.nextNode().collectInto(nodeList, nodeFilter);
>     }
>
>     System.out.println(nodeList.toNodeArray().length);
>
>     for(Node node : nodeList.toNodeArray()){
>         System.out.println(node.toPlainTextString());
>     }
>
> The result? Length of nodeList.toNodeArray is equal to zero.
> Therefore, it means that I'm screwing something up there. I also tried
> using RegexFilter("storytext"), but this isn't working anyway.
> The question is, how should I do it?
> Please, help, I've been trying to run it past the last week :p
|
From: Jan S. <net...@gm...> - 2011-07-29 07:44:41
|
I've got a small problem there, and I'd like to ask you to help me, please.
Ok, so I'm trying to use HTMLParser in my project, and there's the problem.
Example page that I'm trying to process:
http://www.fanfiction.net/s/7229512/1/A_Horse_With_No_Name

Looking at the source code, there's a div with id and class 'storytext' within a div with id and class 'storytextp', and there's a lot of <p> tags within the 'storytext' div. I want to extract the contents of that 'storytext' div to plain text string.
That's what I'm trying to do:

    NodeList nodeList = new NodeList();
    NodeFilter nodeFilter = new AndFilter(new TagNameFilter("div"),
            new HasChildFilter(new TagNameFilter("p")));

    for(NodeIterator e = parser.elements(); e.hasMoreNodes();){
        e.nextNode().collectInto(nodeList, nodeFilter);
    }

    System.out.println(nodeList.toNodeArray().length);

    for(Node node : nodeList.toNodeArray()){
        System.out.println(node.toPlainTextString());
    }

The result? Length of nodeList.toNodeArray is equal to zero. Therefore, it means that I'm screwing something up there. I also tried using RegexFilter("storytext"), but this isn't working anyway.
The question is, how should I do it?
Please, help, I've been trying to run it past the last week :p
|
From: Derrick O. <der...@gm...> - 2011-06-30 13:48:15
|
Hi Chris,

Although you might find it difficult to work with, the way the tags and text are returned makes sense. To avoid splitting the word, the program would need to make a judgement call regarding which text is a word and belongs together, and which of the tags to ignore.

You can probably get what you want by using something like the StringBean class, which extracts just the text. I think you'll find the output of the StringBean will have the word "northern" rather than separating it, i.e. it doesn't inject any whitespace. You can use similar code in your program to paste the text back together.

Derrick

On Thu, Jun 30, 2011 at 11:39 AM, Chris Bamford <cba...@mi...> wrote:
> Hi there,
>
> I use Aperture to extract text which runs Htmlparser when processing HTML.
> My question relates to the handling of presentation tags such as <u>, <b>,
> <i> when embedded within words - for example:
>
> <html><body><u>north</u>ern</body></html>
>
> What I would expect is that I should be delivered the word "northern" - but
> instead I get two tokens: "north" and "ern", which is clearly wrong in this
> context.
> It seems that Htmlparser is replacing tags with whitespace - why is this?
>
> Thanks for any help.
>
> - Chris
|
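The difference between pasting text pieces together directly and separating them with whitespace can be shown with a deliberately naive JDK-only sketch. It strips tags with a regular expression, which is not how HTML Parser or StringBean work internally; `InlineTagText` and its methods are hypothetical names used only to illustrate the effect on the example from the question.

```java
/** Hypothetical illustration: joining text around tags with and without whitespace. */
class InlineTagText {

    /** Removes tags and joins the surrounding text directly. */
    static String joinWithoutSpace(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    /** Replaces each tag with a space, which splits words that contain inline tags. */
    static String joinWithSpace(String html) {
        return html.replaceAll("<[^>]*>", " ").trim();
    }
}
```

For `<html><body><u>north</u>ern</body></html>` the first method yields `northern` and the second `north ern`. A consumer that wants whole words has to join adjacent text nodes itself when the tag between them is an inline one such as `<u>`, `<b>` or `<i>`.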