htmlparser-user Mailing List for HTML Parser (Page 40)
Brought to you by:
derrickoswald
From: abhishek m. <mis...@gm...> - 2006-03-29 19:54:24
|
Hi all, I need to parse an HTML string and render it using SWT. Any pointers would be appreciated. Is there any way to store constructs in a map format using HTML Parser? I believe I would have to use an event-driven mechanism. I don't want to reinvent the wheel and use SAX to do that. If there is an easy way to do it, please help. Thanks, abhi |
From: Antony S. <ant...@gm...> - 2006-03-29 19:42:17
|
Hi, In my code I have an HTTP fetcher that puts received content into a file for a set of URLs. The content includes all the headers received (the complete stream of data for a request from the server). At a later time I parse it using the HTML parser. This part of the code seems to work, in the sense that I am able to extract links from those pages using the parser. My question is: how do I get hold of the HTTP status code, specifically when it is a 302 kind, and then get hold of the new location? Thanks, -Antony |
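Because the stored files begin with the raw response headers, the status code and any Location header can be recovered with plain string handling before the body reaches the parser. A minimal sketch in standard Java (the RawResponse class and its methods are illustrative, not part of HTML Parser; it assumes the headers end at the first blank line):

```java
// Sketch: pull the status code and Location header out of a raw
// HTTP response that was saved to disk headers-and-all.
public class RawResponse {
    public static int statusCode(String raw) {
        // The status line looks like "HTTP/1.1 302 Found".
        String firstLine = raw.split("\r?\n", 2)[0];
        return Integer.parseInt(firstLine.split(" ")[1]);
    }

    public static String location(String raw) {
        // Scan header lines (everything before the first blank line).
        String headers = raw.split("\r?\n\r?\n", 2)[0];
        for (String line : headers.split("\r?\n")) {
            int colon = line.indexOf(':');
            if (colon > 0 && line.substring(0, colon).equalsIgnoreCase("Location"))
                return line.substring(colon + 1).trim();
        }
        return null;
    }
}
```

A stored 302 response would then yield its status code from statusCode() and the redirect target to fetch next from location().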
From: Derrick O. <Der...@Ro...> - 2006-03-29 12:52:26
|
You may need to 'POST' to the login form using the ConnectionManager with your credentials. See the doc-comments for src/org/htmlparser/tests/ParserTest.testPOST() for an example. ?? wrote: > I want to parse a web page that need to log in.so I use the wiki > example but can not work. the cookie expired when the browser shut down. > Can you tell me how to handle this situation. > > -- > Best Regards. > > Xiaodong Han > MSN:hx...@ho... <mailto:MSN:hx...@ho...> |
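The ParserTest.testPOST() example Derrick points to ships with the library's test sources. As a rough standalone illustration of the same idea, a login POST with plain java.net.HttpURLConnection might look like the sketch below — the URL and form field names are invented for the example, and the session cookie handling is only indicated in comments:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class LoginPost {
    // Form-encode name/value pairs as application/x-www-form-urlencoded.
    public static String formEncode(String[][] fields) {
        StringBuilder sb = new StringBuilder();
        for (String[] f : fields) {
            if (sb.length() > 0) sb.append('&');
            sb.append(URLEncoder.encode(f[0], StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(f[1], StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Not called below: opens a real connection when invoked.
    public static int postForm(String urlStr, String[][] fields) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlStr).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        byte[] body = formEncode(fields).getBytes(StandardCharsets.UTF_8);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        // A session cookie, if any, arrives in the Set-Cookie response header;
        // replay it on later requests, or install a java.net.CookieHandler.
        return conn.getResponseCode();
    }

    public static void main(String[] args) {
        // Demonstrate only the encoding step (no network access).
        System.out.println(formEncode(new String[][] {
            {"username", "me"}, {"password", "secret"}
        }));
    }
}
```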
From: <xia...@gm...> - 2006-03-29 07:39:38
|
I want to parse a web page that needs a login, so I used the wiki example, but it does not work: the cookie expired when the browser shut down. Can you tell me how to handle this situation? -- Best Regards. Xiaodong Han MSN:hx...@ho... |
From: Derrick O. <Der...@Ro...> - 2006-03-28 12:59:48
|
The ConnectionManager's openConnection(URL) method may be useful. OwenM wrote: >I'm hoping someone can point to which part of htmlparser I should >use to write a simple links validator. I can use LinkExtractor to >get a list of links on a page, but what technique is best to attempt >to follow a link, and get a success, or failure in the case of invalid >links? > >many thanks, >Owen. > > > > > > |
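One way to follow each extracted link, sketched with standard java.net classes rather than HTML Parser's ConnectionManager (the class and helper names here are invented for illustration): issue a HEAD request per link so the page body is not downloaded, then classify the response code.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class LinkChecker {
    // 2xx and 3xx are treated as "the link resolves"; 4xx/5xx (and our
    // -1 sentinel for unreachable hosts) count as broken.
    public static boolean isAlive(int status) {
        return status >= 200 && status < 400;
    }

    // Issue a HEAD request so we don't fetch the whole page.
    public static int check(String link) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(link).openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            return conn.getResponseCode();
        } catch (Exception e) {
            return -1; // malformed URL, unreachable host, timeout...
        }
    }
}
```

Feeding each URL from LinkExtractor through check() and testing isAlive() on the result gives the success/failure report the validator needs.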
From: OwenM <ow...@ow...> - 2006-03-28 10:37:33
|
I'm hoping someone can point to which part of htmlparser I should use to write a simple links validator. I can use LinkExtractor to get a list of links on a page, but what technique is best to attempt to follow a link, and get a success, or failure in the case of invalid links? many thanks, Owen. |
From: Derrick O. <Der...@Ro...> - 2006-03-26 16:42:13
|
Andy, You might want to start with the StringBean (see bin/stringextractor example application), and extend it to handle the <p> tags specially. I believe if you override visitTag (Tag tag) and check for <P> before calling super.visitTag (Tag tag), you can get the strings you want. Derrick Andy Wickson wrote: > Hi, > I am stuggling with what I assume should be a simple operation. > The html file I am parsing has each line ending with a <P> tag (not > </P> as you might expect). > There is a random amount of bold tags in each line - I am interested > in the text in each line without the tags - one String for each <P> tag. > > Are there any decent examples anywhere apart from the ones on the > htmlparser site? > > Thanks, > Andy > > |
From: Andy W. <an...@aw...> - 2006-03-26 09:35:27
|
Hi, I am struggling with what I assume should be a simple operation. The HTML file I am parsing has each line ending with a <P> tag (not </P> as you might expect). There is a random number of bold tags in each line - I am interested in the text in each line without the tags - one String for each <P> tag. Are there any decent examples anywhere apart from the ones on the htmlparser site? Thanks, Andy |
From: Derrick O. <Der...@Ro...> - 2006-03-25 18:50:32
|
You will probably need to modify your regular expression to match one or more whitespace characters between the day, month and year. v.sudhakarreddy ch wrote: > Hi, > iam using Regular expression filter to extract dates from a html > document. When i extract dates in > the format like 23/4/2004 , 21 march 2005 etc.. using following > regular expression Regex filter is not working. iam also giving the > code here. > > try > { > Parser parser = new Parser ("sample.html"); > RegexFilter filter = new RegexFilter > ("([1-3][0-9]?)(th|rd|st|nd)?,? [\\s|-|/] > (jan|feb|mar|april|may|jun|jul|aug|sep|oct|nov|dec|january|february|march|april|may|june|july|august|september|october|november|december|[0-9][1-9]?),? > [\\s|-|/] ([0-9]|[0-9]) ([0-9]{2})? ,?"); > NodeList list = parser.extractAllNodesThatMatch (filter); > int i=0; > while(i<list.size()){ > System.out.println("date->" + (i+1)); > String str = ((Node)list.elementAt(i)).toPlainTextString(); > i++; > System.out.println(str + "-"); > } > } > catch (ParserException e) > { e.printStackTrace (); } > > code of sample.html is.. > > <html> <head></head> <body> > <b><font color=brown>Important Dates</font></b> > <ul> > <li>Last date to apply for registration and travel support: <b>21 March 2005</b> > <li>Notification regarding registration request: <b>23 March 2005</b> > </ul> > <hr> > </body> </html> > > can anyone tell me what was wrong with above code? The above regular > expression worked correctly to extract the dates from simple text file.. > > by > sudhakar |
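The whitespace point is easy to demonstrate with java.util.regex directly. The simplified pattern below is a stand-in of my own, not the original filter's regex: it uses a character class with + so that one or more spaces, slashes or hyphens may separate the day, month and year, which is what lets it match both "21 March 2005" and "23/4/2004".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateRegexDemo {
    // Simplified date pattern: day, one-or-more separator characters
    // (whitespace, '/' or '-'), month (name or number), separators again,
    // then a 2- or 4-digit year.
    static final Pattern DATE = Pattern.compile(
        "(\\d{1,2})[\\s/-]+(\\w+)[\\s/-]+(\\d{2,4})");

    // Return the first date-like substring found, or null.
    public static String firstDate(String text) {
        Matcher m = DATE.matcher(text);
        return m.find() ? m.group() : null;
    }
}
```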
From: v.sudhakarreddy c. <sud...@gm...> - 2006-03-25 13:27:28
|
Hi, I am using the regular expression filter to extract dates from an HTML document. When I extract dates in formats like 23/4/2004, 21 march 2005 etc. using the following regular expression, the Regex filter is not working. I am also giving the code here. try { Parser parser = new Parser ("sample.html"); RegexFilter filter = new RegexFilter ("([1-3][0-9]?)(th|rd|st|nd)?,? [\\s|-|/] (jan|feb|mar|april|may|jun|jul|aug|sep|oct|nov|dec|january|february|march|april|may|june|july|august|september|october|november|december|[0-9][1-9]?),? [\\s|-|/] ([0-9]|[0-9]) ([0-9]{2})? ,?"); NodeList list = parser.extractAllNodesThatMatch (filter); int i=0; while(i<list.size()){ System.out.println("date->" + (i+1)); String str = ((Node)list.elementAt(i)).toPlainTextString(); i++; System.out.println(str + "-"); } } catch (ParserException e) { e.printStackTrace (); } code of sample.html is.. <html> <head></head> <body> <b><font color=brown>Important Dates</font></b> <ul> <li>Last date to apply for registration and travel support: <b>21 March 2005</b> <li>Notification regarding registration request: <b>23 March 2005</b> </ul> <hr> </body> </html> can anyone tell me what is wrong with the above code? The above regular expression worked correctly to extract the dates from a simple text file.. by sudhakar |
From: abhishek m. <mis...@gm...> - 2006-03-24 00:57:55
|
Hi All, I need to use this parser in an event-driven way. I was wondering how we can parse a string and render it using SWT (sort of my own stripped-down browser). All I need is sequential text-to-token matching. I don't know if someone can help me with this. I would be really grateful. Thanks, Abhi |
From: Antony S. <ant...@gm...> - 2006-03-23 23:38:11
|
Hi Subramanya, My solution to the same problem is Parser parser = new Parser(urlOb.openConnection()); NodeList nl = null; for (int i = 0; i < 4; i++) { // 4 decoding tries max try { nl = parser.parse(null); } catch (EncodingChangeException e) { String s = parser.getEncoding(); // use detected encoding log.fine("restarting parse with " + s + " for " + url); continue; } break; } If yours is better I'd like to use it. I hope someone who knows better can tell. I am sort of stuck doing a bunch of other things at this time and not paying attention to this particular issue. Is mine buggy because I don't call reset? -Antony Sequeira > Code snippet below: > --------------------------------------------------------------------- > private static void IgnoreCharSetChanges(Parser p) > { > PrototypicalNodeFactory factory = new PrototypicalNodeFactory (); > factory.unregisterTag(new MetaTag()); > // Unregister meta tag so that char set changes are ignored! > p.setNodeFactory (factory); > } > > private static String ParseNow(Parser p, MyVisitor visitor) throws org.htmlparser.util.ParserException > { > try { > System.out.println("START encoding is " + p.getEncoding()); > p.visitAllNodesWith(visitor); > } > catch (org.htmlparser.util.EncodingChangeException e) { > try { > System.out.println("Caught you! CURRENT encoding is " + p.getEncoding()); > visitor.Init(); > p.reset(); > p.visitAllNodesWith(visitor); > } > catch (org.htmlparser.util.EncodingChangeException e2) { > System.out.println("CURRENT encoding is " + p.getEncoding()); > System.out.println("--- CAUGHT you yet again! IGNORE meta tags now! ---"); > visitor.Init(); > p.reset(); > IgnoreCharSetChanges(p); > p.visitAllNodesWith(visitor); > } > } > System.out.println("ENCODING IS " + p.getEncoding()); > return p.getEncoding(); > } > --------------------------------------------------------------------- > > If, in future versions of HTMLParser, the MetaTag class starts doing other important things besides setting text encoding, then a new class could be derived from the existing MetaTag class whose "doSemanticAction()" code simply ignores char set changes for "content-type" meta tags and calls super.doSemanticAction for others ... > > If there are gotchas in this technique, I would appreciate feedback on that front too! > > Thanks, > > Best, > Subbu. |
From: Subramanya S. <sa...@cs...> - 2006-03-23 19:06:13
|
Hello everyone, My name is Subbu (Subramanya Sastry). For one of my projects, I had been using the Swing inbuilt parser and had managed to set up workarounds to deal with its inadequacies (mostly because of it being based on HTML 3.2). Anyway, I had looked at HTMLParser a few months back, but since all was working fine for me with the Swing parser, I hadn't switched over to HTMLParser, and also because I didn't have to ship another library with the application. But, for various reasons, including the fact that I am multi-lingualizing my application, I decided to check out HTMLParser last week. However, I quickly ran into problems because of EncodingChangeException -- and this was on plain-old "English content" HTML files. I scouted around and read about the "parser.reset()" trick. However, that didn't solve my problem because even after reset, the same exception was being thrown at the same place. When I looked into the HTML, I noticed that the publishers had *TWO* content-type meta tags <meta http-equiv="Content-Type" content="text/html"> and a little while later <meta content="text/html; charset=UTF-8" http-equiv="Content-Type"> The presence of these multiple meta tags renders the resetting useless because the parser will trip on the second meta tag each time! (Check out pages on http://www.economictimes.com for this kind of HTML) I couldn't think of a work-around for this, and so, I reverted back to the Swing parser which allows me to ignore character-set changes, which helps me deal with the above problem. Since I knew of no easy way of telling HTMLParser to ignore char-set changes, I couldn't use HTMLParser. But, after racking my head for a while, I finally went through the source code of HTMLParser, and then finally hit the solution/hack when going through the Javadoc for "PrototypicalNodeFactory"! I saw on the user mailing list that a couple of times, people have run into this problem of not being able to parse HTML even after resetting the parser.
So, I am sharing this in the interest of those who might run into this problem in the future. The solution/hack is as follows: I simply unregistered the meta tag from the PrototypicalNodeFactory the third time around, which means both the above meta tags won't get parsed. But, since the parser has already picked up the UTF-8 encoding, the entire file will be parsed with UTF-8 encoding. Obviously, this is not a bullet-proof solution, but this helps me get through several HTML files which were otherwise getting rejected. Code snippet below: --------------------------------------------------------------------- private static void IgnoreCharSetChanges(Parser p) { PrototypicalNodeFactory factory = new PrototypicalNodeFactory (); factory.unregisterTag(new MetaTag()); // Unregister meta tag so that char set changes are ignored! p.setNodeFactory (factory); } private static String ParseNow(Parser p, MyVisitor visitor) throws org.htmlparser.util.ParserException { try { System.out.println("START encoding is " + p.getEncoding()); p.visitAllNodesWith(visitor); } catch (org.htmlparser.util.EncodingChangeException e) { try { System.out.println("Caught you! CURRENT encoding is " + p.getEncoding()); visitor.Init(); p.reset(); p.visitAllNodesWith(visitor); } catch (org.htmlparser.util.EncodingChangeException e2) { System.out.println("CURRENT encoding is " + p.getEncoding()); System.out.println("--- CAUGHT you yet again! IGNORE meta tags now! ---"); visitor.Init(); p.reset(); IgnoreCharSetChanges(p); p.visitAllNodesWith(visitor); } } System.out.println("ENCODING IS " + p.getEncoding()); return p.getEncoding(); } --------------------------------------------------------------------- If, in future versions of HTMLParser, the MetaTag class starts doing other important things besides setting text encoding, then a new class could be derived from the existing MetaTag class whose "doSemanticAction()" code simply ignores char set changes for "content-type" meta tags and calls super.doSemanticAction for others ... If there are gotchas in this technique, I would appreciate feedback on that front too! Thanks, Best, Subbu. |
From: Ian M. <ian...@gm...> - 2006-03-22 10:20:03
|
I'd recommend exporting the data into CSV format, which is really easy to write to and can be read by Excel. On 3/22/06, @ java <jav...@ya...> wrote: > > Hi , > > Iam have to read html table data into excel file > > I downloaded HtmlParser latest version . > > But how to get table data into excel sheet > > Plz Tell me some clases and interfaces to solve this > > or else plz send some sample examples worked on html parsers > > Byeeeeeeeeeeeeeee > > Bhavya |
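The CSV side of Ian's suggestion amounts to joining cell text with commas and quoting any field that contains a comma, quote or newline. A minimal sketch in plain Java (the class and method names are invented; the HTML-table extraction itself is omitted):

```java
import java.util.List;

public class CsvWriter {
    // Quote a field per the usual CSV convention: wrap in double quotes
    // when it contains a comma, quote, or newline; double any quotes inside.
    public static String escape(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // One table row -> one CSV line.
    public static String row(List<String> cells) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < cells.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(escape(cells.get(i)));
        }
        return sb.toString();
    }
}
```

Writing one such line per table row to a .csv file produces something Excel opens directly.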
From: @ j. <jav...@ya...> - 2006-03-22 09:50:59
|
Hi, I have to read HTML table data into an Excel file. I downloaded the latest HtmlParser version, but how do I get the table data into an Excel sheet? Please tell me some classes and interfaces to solve this, or else please send some sample examples that work with HTML parsers. Byeeeeeeeeeeeeeee Bhavya |
From: Wen <log...@ya...> - 2006-03-22 07:59:51
|
Hi Derrick, Thank you for your reply. It does improve the speed. Thanks a lot. wen |
From: Derrick O. <Der...@Ro...> - 2006-03-21 12:27:59
|
try this: parser.extractAllNodesThatMatch (new TagNameFilter ("FORM")) Each of the form tags you get has a getFormInputs() method. 苏李亮 wrote: >Dear members, > > I want to extract the form tag in html page,and extract the input tag in the form, >and read the attribute of input tag, the attribute's value.Thank you! > > > |
From: Derrick O. <Der...@Ro...> - 2006-03-21 12:24:28
|
Wen, I'm not sure it would be faster but... If you don't care about nesting or other types of nodes, you can supply the LinkTag as the only prototype for the node factory: PrototypicalNodeFactory factory = new PrototypicalNodeFactory (new LinkTag ()); Parser parser = new Parser (); parser.setNodeFactory (factory); NodeFilter filter = new NodeClassFilter (LinkTag.class); for (20 documents) { parser.setURL (url); NodeList links = parser.extractAllNodesThatMatch (filter); for (int in = 0; in < links.size (); in++) ... In this way there will be no attempt at nesting the tags, so it should be faster. You also don't need to allocate a parser and filter within your loop. Derrick Wen wrote: > Hi, > > I'm using HTMLParser to parse a link that contains specific file type. > ex. pdf files. > It works fine but takes around 20 seconds to parse 20 websites. > I noticed except NodeFilter, LinkExtractor or LinkRegexFilter may be > able to achieve the same goal. > > Is there other ways to make the extraction process faster than the way > I'm using now? > > Here is my code: > for( 20 documents){ > parser = new Parser(url); > NodeFilter filter = new NodeClassFilter (LinkTag.class); > NodeList links = new NodeList (); > > for (NodeIterator e = parser.elements (); e.hasMoreNodes (); ) > e.nextNode ().collectInto (links, filter); > for (int in = 0; in < links.size (); in++) > { > LinkTag linkTag = (LinkTag)links.elementAt (in); > if(linkTag.getLink().endsWith(".PDF")){ > doSomething; > } > } > > Thanks in advanced. |
From: <asd...@gm...> - 2006-03-21 08:56:06
|
Dear members, I want to extract the form tags in an HTML page, extract the input tags in each form, and read the attributes of the input tags and their values. Thank you! |
From: Wen <log...@gm...> - 2006-03-21 04:27:16
|
Hi, I'm using HTMLParser to parse links that contain a specific file type, e.g. PDF files. It works fine but takes around 20 seconds to parse 20 websites. I noticed that besides NodeFilter, LinkExtractor or LinkRegexFilter may be able to achieve the same goal. Are there other ways to make the extraction process faster than the way I'm using now? Here is my code: for( 20 documents){ parser = new Parser(url); NodeFilter filter = new NodeClassFilter (LinkTag.class); NodeList links = new NodeList (); for (NodeIterator e = parser.elements (); e.hasMoreNodes (); ) e.nextNode ().collectInto (links, filter); for (int in = 0; in < links.size (); in++) { LinkTag linkTag = (LinkTag)links.elementAt (in); if(linkTag.getLink().endsWith(".PDF")){ doSomething; } } Thanks in advance. |
From: Derrick O. <Der...@Ro...> - 2006-03-19 13:28:33
|
Vishal, HTML Parser is not really the right tool for creating content -- it's a parser. Building a web page from scratch using the set of node constructors provided would be fairly tedious, and what's the point... you are just going to convert it to text anyway. But, to answer your questions, the end node needs to be set explicitly with something like: TagNode end = new TagNode (); end.setTagName ("/TABLE"); table_tag.setEndTag (end); and the children would need to be added separately with something like: body_tag.setChildren (new NodeList (new TextNode ("hi"))); or body_tag.getChildren ().add (new TextNode ("hi")); You can also bootstrap yourself into it by parsing fragments and then editing the pieces... parser.setInputHtml ("<body align=valign>hi</body>"); body_tag = parser.parse (null).elementAt (0); body_tag.setChildren (new NodeList (new TextNode ("hi"))); Derrick Vishal Monpara wrote: > Hi All, > > I want to process one html file and according to the content > extracted, I have to build another HTML file from scratch. I tried > hard to find out how to implement this feature, but I couldnt succeed. > If you know any online example, please forward the link to me. If you > have any sample file / sample code / idea of how to build it, please > please forward it to me. I tried some code like > > NodeList ls = new NodeList(new TableTag()); > System.out.println(ls.toHtml()); > > but this tag renders only "TABLE" and it is not showing any end tag > like "</table>". I tried TagNode.setText("<body > align=valign>hi</body>") and if I try "toHtml" function, it shows only > "<body align=valign>". > > Thanks in advance, > > Regards, > Vishal Monpara > |
From: Vishal M. <mon...@ho...> - 2006-03-19 02:01:44
|
Hi All, I want to process one HTML file and, according to the content extracted, I have to build another HTML file from scratch. I tried hard to find out how to implement this feature, but I couldn't succeed. If you know any online example, please forward the link to me. If you have any sample file / sample code / idea of how to build it, please please forward it to me. I tried some code like NodeList ls = new NodeList(new TableTag()); System.out.println(ls.toHtml()); but this tag renders only "TABLE" and it is not showing any end tag like "</table>". I tried TagNode.setText("<body align=valign>hi</body>") and if I try the "toHtml" function, it shows only "<body align=valign>". Thanks in advance, Regards, Vishal Monpara |
From: Derrick O. <Der...@Ro...> - 2006-03-18 22:13:40
|
A link is not text. Is the word "next" in a Text node within the link children? Then you need a HasChildFilter (with the recursive option if it's not a direct child). You also need an AndFilter because you have two conditions. NodeFilter linkFilter = new TagNameFilter ("A"); NodeFilter nextFilter = new StringFilter ("next"); NodeFilter andFilter = new AndFilter (linkFilter, new HasChildFilter (nextFilter, true)); NodeList next = links.extractAllNodesThatMatch (andFilter); This gives you the link node(s). If you actually want the text itself, use: new AndFilter (nextFilter, new HasParentFilter (linkFilter, true)); Cibi wrote: > How to extract specific link? for example "next" link. > > I use this and it return empty node: > > NodeFilter linkFilter = new TagNameFilter("A"); > NodeFilter nextFilter = new StringFilter("next"); > NodeList links = parser.extractAllNodesThatMatch(linkFilter); > NodeList next = links.extractAllNodesThatMatch(nextFilter); > > do I need to loop thought link notelist then compare each node? > > Thanks > > |
From: Cibi <c1...@ya...> - 2006-03-18 19:19:49
|
How to extract a specific link? For example, the "next" link. I use this and it returns an empty node: NodeFilter linkFilter = new TagNameFilter("A"); NodeFilter nextFilter = new StringFilter("next"); NodeList links = parser.extractAllNodesThatMatch(linkFilter); NodeList next = links.extractAllNodesThatMatch(nextFilter); Do I need to loop through the link NodeList and then compare each node? Thanks |
From: Ian M. <ian...@gm...> - 2006-03-17 10:13:49
|
Looks to me like nodelistOfLinks is null. First of all you have this line of code: NodeList nodelistOfLinks = null; Then the next line of code that mentions the nodelistOfLinks object is: nodelistOfLinks.add(listOfSpanTags.elementAt(i).getParent()); So you're trying to use a method on a null object. Instead of: NodeList nodelistOfLinks = null; You probably want: NodeList nodelistOfLinks = new NodeList(); Ian On 3/16/06, Riaz uddin <ru...@ya...> wrote: > The error is occurring at this statement: > > nodelistOfLinks.add(listOfSpanTags.elementAt(i).getParent()); > > Ian Macfarlane <ian...@gm...> wrote: > The stack trace will tell you which line the NullPointerException was > thrown on. Why don't you tell us which line it's occurring on? That > will help pin it down. > Ian > On 3/15/06, Riaz uddin wrote: > > Hi, > > I have attached the procedure below; now when I call this procedure it > > returns a null pointer exception in the add method. It was working fine when > > I had it in the main function, but it does not run when I created this > > procedure. I think I need some Java help on this; can someone suggest what I > > can do? > > > > public static NodeList extractLinkFromSpanTag(String url) throws > > ParserException > > { > > int i = 0; > > > > NodeList nodelistOfLinks = null; > > Parser parser = new Parser(url); > > // Step 2. Collecting Tags in a list. > > NodeList list = parser.parse (null); > > > > //news links are at the span tag (time), spanList stores the > > span tags > > // Step 3. Keep only the SPAN tags in spanList. > > NodeList listOfSpanTags = list.extractAllNodesThatMatch(new > > TagNameFilter ("SPAN"),true); > > > > while(i < listOfSpanTags.size()) > > { // Beginning While loop to extract links > > Span spanTag = > > (Span)listOfSpanTags.elementAt(i); > > // System.out.println(listOfSpanTags.size()); > > // We only need SPAN tags with attribute "class = > > 'recenttimedate'" > > // Move to the link in the span tag > > if(spanTag.getText().equals("span class=recenttimedate")) > > > > nodelistOfLinks.add(listOfSpanTags.elementAt(i).getParent()); > > i++; > > }// End of while loop to extract links > > while(i < nodelistOfLinks.size()) > > { > > System.out.println(nodelistOfLinks.elementAt(i)); > > i++; > > } > > > > return nodelistOfLinks; > > } |
> > NodeList listOfSpanTags =3D list.extractAllNodesThatMatch(new > > TagNameFilter ("SPAN"),true); > > > > while(i < listOfSpanTags.size()) > > { // Beginning While loop to extract links > > Span spanTag =3D > > (Span)listOfSpanTags.elementAt(i); > > // System.out.println(listOfSpanTags.size()); > > // We only need SPAN tags with attribute "class =3D > > 'recenttimedate'" > > // Move to the link in the span tag > > if(spanTag.getText().equals("span class=3Drecenttimedate")) > > > > nodelistOfLinks.add(listOfSpanTags.elementAt(i).getParent()); > > i++; > > }// End of while loop to extract links > > while(i < nodelistOfLinks.size()) > > { > > System.out.println(nodelistOfLinks.elementAt(i)); > > i++; > > } > > > > return nodelistOfLinks; > > } > > > > ________________________________ > > Yahoo! Mail > > Bring photos to life! New PhotoMail makes sharing a breeze. > > > > > > > > > ------------------------------------------------------- > This SF.Net email is sponsored by xPML, a groundbreaking scripting langua= ge > that extends applications into web and mobile media. Attend the live webc= ast > and join the prime developer group breaking into this new coding territor= y! > http://sel.as-us.falkag.net/sel?cmd=3Dlnk&kid=110944&bid$1720&dat=121642 > _______________________________________________ > Htmlparser-user mailing list > Htm...@li... > https://lists.sourceforge.net/lists/listinfo/htmlparser-user > > > > ________________________________ > Relax. Yahoo! Mail virus scanning helps detect nasty viruses! > > |