htmlparser-user Mailing List for HTML Parser (Page 83)
Brought to you by:
derrickoswald
From: Somik R. <so...@ya...> - 2003-02-26 06:44:02
Hi Bob,

Can you try this: get the data from the URL in question into a file (using a POST request), then try to parse the file. If it breaks, we will know why.

Regards,
Somik

----- Original Message -----
From: "Bob Lewis" <bob...@ya...>
To: <htm...@li...>
Sent: Tuesday, February 25, 2003 12:07 PM
Subject: Re: [Htmlparser-user] Malformed Input Exception

> I tried using the parser directly, as you suggested,
> and it seems to work. However, I need to be able to work
> with the URLConnection to set headers, cookies and
> send POST data.
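One way to follow this suggestion is to copy the raw response bytes to a file before any character decoding happens, so that the saved file is byte-for-byte what the server sent. The following is only a minimal sketch using the JDK: the names `SaveResponseSketch` and `saveTo` are made up for illustration, and it assumes a modern JDK with `java.nio.file`, which was not available when this thread was written.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of htmlparser.
final class SaveResponseSketch {

    /**
     * Copies the raw bytes of a response stream to the target file,
     * replacing the file if it already exists, and closes the stream.
     * Returns the number of bytes written.
     */
    static long saveTo(InputStream response, Path target) throws IOException {
        try (InputStream in = response) {
            // No Reader is involved, so no charset decoding can fail here.
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

With the connection from the quoted message, this might be called as `SaveResponseSketch.saveTo(urlConn.getInputStream(), Paths.get("response.html"))` after the POST data has been written; the file name is an arbitrary choice. The saved file can then be parsed on its own.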
From: Bob L. <bob...@ya...> - 2003-02-25 20:20:39
Sorry, there was a typo in my last message:

> while (parser.hasMoreNodes())
> {
>     // ... Do Something
> }

should be

    while (tags.hasMoreNodes())
    {
        // ... Do Something
    }

Also, on another note, if I try to initialize the parser directly, I am unable to work with the URLConnection. For example:

    HttpURLConnection urlConn = null;
    HTMLParser parser = new HTMLParser("http://somedomain/somepath");
    urlConn = (HttpURLConnection)parser.getConnection();
    urlConn.setDoInput(true);
    // ...

This code throws an exception because the HTTP request has already been made.

    Exception in thread "main" java.lang.IllegalAccessError: Already connected
        at java.net.URLConnection.setDoInput(URLConnection.java:677)
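The "Already connected" error reflects an ordering rule of `java.net.URLConnection`: setup calls such as `setDoInput` are only accepted before the connection is opened. The sketch below reproduces that rule without a network by using a `file:` URL. The class and method names are invented for illustration, and it assumes a current JDK, where the same condition is reported as an `IllegalStateException` rather than the `IllegalAccessError` shown above (both are caught).

```java
import java.io.IOException;
import java.net.URLConnection;
import java.nio.file.Path;

// Hypothetical demo class, not part of htmlparser.
final class AlreadyConnectedSketch {

    /** Returns true if setDoInput is still accepted after connect(). */
    static boolean canConfigureAfterConnect(Path existingFile) throws IOException {
        URLConnection conn = existingFile.toUri().toURL().openConnection();
        conn.setDoInput(true);   // accepted: the connection is not open yet
        conn.connect();          // from here on the connection is live
        try {
            conn.setDoInput(true);   // too late: setup must precede connect()
            return true;
        } catch (IllegalStateException | IllegalAccessError alreadyConnected) {
            return false;
        } finally {
            conn.getInputStream().close();   // release the underlying file
        }
    }
}
```

The practical consequence is that headers, cookies, and the POST flag have to be set on a connection that nothing else has opened yet.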
From: Bob L. <bob...@ya...> - 2003-02-25 20:07:38
I tried using the parser directly, as you suggested, and it seems to work. However, I need to be able to work with the URLConnection to set headers, cookies and send POST data.

Typically, this is what I'm doing:

    // create and initialize the URL connection
    HttpURLConnection urlConn = null;
    URL url = new URL("http://somedomain/somepath");
    urlConn = (HttpURLConnection)url.openConnection();
    urlConn.setDoInput(true);
    urlConn.setDoOutput(true);
    urlConn.setUseCaches(false);
    urlConn.setAllowUserInteraction(false);
    urlConn.setRequestMethod("POST");

    // ... usually many HTTP headers and cookie values set
    urlConn.setRequestProperty("someHeader", "someValue");
    urlConn.setRequestProperty("anotherHeader", "anotherValue");

    StringBuffer postData = new StringBuffer();
    // ... generate post data in buffer

    // send the post data
    PrintWriter printWriter = new PrintWriter(urlConn.getOutputStream());
    printWriter.println(postData.toString());
    printWriter.close();

    // parse the response
    HTMLEnumeration tags = parser.elements();

    while (parser.hasMoreNodes())
    {
        // ... Do Something
    }

This works fine on most URLs. I am normally able to execute the server-side web application, obtain and parse the HTML response. However, in the case of these two URLs, I get the MalformedInputException.

Is there something I'm missing?

Thanks,

Bob Lewis

--- Somik Raha <so...@ya...> wrote:
> Date: 2003-02-24 21:33
> Sender: somik
> Logged In: YES
> user_id=187944
>
> I ran the parser on these pages and it worked fine. Try
> runParser.bat http://www.flytango.com/en/index.html.
>
> It could be that you have initialized your URLConnection
> incorrectly. Have you tried using the parser directly, like:
>
> HTMLParser parser = new HTMLParser("http://www.flytango.com/en/index.html");
> for (NodeIterator i = parser.elements(); i.hasMoreNodes();) {
>     System.out.println(i.nextNode().toHtml());
> }
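The "generate post data in buffer" step is left open in the message. If the server expects an `application/x-www-form-urlencoded` body (an assumption, the thread does not say), the buffer could be filled roughly as follows. This is a sketch only: `PostDataSketch` and `formEncode` are hypothetical names, and the `URLEncoder.encode(String, Charset)` overload used here needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper, not part of htmlparser.
final class PostDataSketch {

    /** Joins the fields as key=value pairs separated by '&', URL-encoding each part. */
    static String formEncode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8))
                .append('=')
                .append(URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }
}
```

The resulting string would take the place of `postData.toString()` when writing to `urlConn.getOutputStream()`.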
From: Somik R. <so...@ya...> - 2003-02-24 18:29:52
Hi Bob,

Sounds like a bug. Can you file a bug report at http://htmlparser.sourceforge.net?

Regards,
Somik
From: Somik R. <so...@ya...> - 2003-02-24 18:12:00
I was trying to integrate the latest parser with some existing projects at work, and of course I had to modify the code to use the new API. I have a suggestion, as I know many of you will be facing the same issue.

I use Eclipse, and I hope most of you use a decent IDE that supports refactoring. Get the parser source into your IDE and let all your other project code refer to it (that's how it is set up in my IDE). Then rename Parser to HTMLParser using your refactoring tool, so that your existing references resolve again. Rename it back to Parser, and the refactoring tool will update all your existing code automatically. Do this for the other renamed classes, such as HTMLNode/Node, and within minutes it should be done.

Regards,
Somik
From: Bob L. <bob...@ya...> - 2003-02-24 14:49:22
Hi,

I am trying to use htmlparser 1.3 to parse the HTML at http://www.flytango.com/en/taschedule.html and http://www.flytango.com/en/index.html. When I attempt to parse these pages, I get a sun.io.MalformedInputException:

    sun.io.MalformedInputException
        at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:105)
        at java.io.InputStreamReader.convertInto(InputStreamReader.java:132)
        at java.io.InputStreamReader.fill(InputStreamReader.java:181)
        at java.io.InputStreamReader.read(InputStreamReader.java:244)
        at java.io.BufferedReader.fill(BufferedReader.java:134)
        at java.io.BufferedReader.readLine(BufferedReader.java:294)
        at java.io.BufferedReader.readLine(BufferedReader.java:357)
        at org.htmlparser.HTMLReader.getNextLine(HTMLReader.java:139)
        at org.htmlparser.HTMLReader.readElement(HTMLReader.java:176)
        at org.htmlparser.util.HTMLEnumerationImpl.peek(HTMLEnumerationImpl.java:60)
        at org.htmlparser.util.HTMLEnumerationImpl.hasMoreNodes(HTMLEnumerationImpl.java:91)

Now, if I copy the source of these pages from a browser into a file and put them on my own webserver, I can parse them without any errors.

It's my guess that there is some strange control character in the source that is causing the exception, but I'm not entirely sure. Any suggestions? If it is a bad character, would it be possible to add code to HTMLReader that strips offending characters from the input stream?

Here is the code I am using to parse:

    DefaultHTMLParserFeedback feedback
        = new DefaultHTMLParserFeedback(DefaultHTMLParserFeedback.DEBUG);

    HTMLReader reader = null;
    HTMLParser parser = null;
    InputStreamReader isr
        = new InputStreamReader(urlConn.getInputStream());
    reader = new HTMLReader(isr, 8192);
    parser = new HTMLParser(reader, feedback);
    boolean inForm = false;

    parser.addScanner(new HTMLInputTagScanner());

    HTMLEnumeration tags = parser.elements();

    RequestParameters params = new RequestParameters();

    while (tags.hasMoreNodes())
    {
        ...
    }

Thanks,

Bob Lewis
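The stack trace shows the bytes being decoded by `ByteToCharUTF8`, and the code builds its `InputStreamReader` without naming an encoding, so the platform default was in effect. One possible explanation, which the thread never confirms, is that the pages are served in an encoding other than UTF-8. Under that assumption, the reader could be given an explicit charset taken from the response's Content-Type header. The sketch below uses made-up names (`CharsetSketch`, `charsetFrom`, `openReader`) and falls back to ISO-8859-1, which accepts every byte value; that fallback is a choice made here, not something from the thread.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of htmlparser.
final class CharsetSketch {

    /** Extracts the charset parameter from a Content-Type value, or returns the fallback. */
    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException unknownOrIllegalName) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    /** Wraps the stream in a Reader that decodes with the declared charset. */
    static Reader openReader(InputStream in, String contentType) {
        return new InputStreamReader(in, charsetFrom(contentType, StandardCharsets.ISO_8859_1));
    }
}
```

In the code above, the reader would then be created as `CharsetSketch.openReader(urlConn.getInputStream(), urlConn.getContentType())` instead of `new InputStreamReader(urlConn.getInputStream())`.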
From: Somik R. <so...@ya...> - 2003-02-24 06:15:44
Hi Folks,

This week's release is out. I've finally taken heed of all the feedback I had been receiving about the terrible naming convention, and have removed "HTML" from all class names. In addition, HTMLEnumeration is now NodeIterator and SimpleEnumeration is SimpleNodeIterator. HTMLParser is just Parser.

This is a big step, so to make it easy for everyone, there have been no major bug fixes that will require you to upgrade right away. I apologize in advance for the inconvenience caused - I hope you don't curse me too much for having to modify your programs. I had the option of doing it in stages, forcing you to modify some small thing in every release, or getting it over with in one sweep. I chose the latter because there were too many changes, and suffering over a long period of time didn't make sense. Hopefully, once you have migrated to the new names, you will appreciate not having to type "HTML" each time.

The BodyScanner contributed by Dhaval Udani is finally in (Dhaval - sorry for the delay).

The interesting part is that the documentation accompanying the package is now the latest one on the site - it has been ripped off a PhpWiki. I am thinking that the ripping program might be useful for those who wish to provide wiki content as offline documentation (any feedback on this is welcome).

From the change log:

Integration build 1.3 - 20030223
--------------------------------
[1] Modification of documentation packaging - the new documentation is actually produced by a tiny program that converts wiki pages into documentation (works with PhpWiki)
[2] Inclusion of BodyScanner, BodyTag
[3] HTMLVisitor is now NodeVisitor - and has an extra param to visit itself
[4] HTMLParser is now Parser. No class has HTML prefix anymore.
[5] HTMLEnumeration is now NodeIterator, SimpleEnumeration is SimpleNodeIterator

Regards,
Somik
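For anyone without a refactoring tool, the renames named in this thread could also be applied to source text mechanically. The sketch below is an illustration only: `RenameSketch` and `migrate` are invented names, the table holds just the five renames mentioned on this list (the announcement says every class lost its "HTML" prefix, so a real migration would need more entries), and a plain text replacement also touches comments and string literals.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical migration helper, not part of htmlparser.
final class RenameSketch {

    // Old name -> new name, limited to the renames mentioned in this thread.
    static final Map<String, String> RENAMES = new LinkedHashMap<>();
    static {
        RENAMES.put("HTMLParser", "Parser");
        RENAMES.put("HTMLEnumeration", "NodeIterator");
        RENAMES.put("SimpleEnumeration", "SimpleNodeIterator");
        RENAMES.put("HTMLVisitor", "NodeVisitor");
        RENAMES.put("HTMLNode", "Node");
    }

    /** Replaces whole-word occurrences of each old class name with its new name. */
    static String migrate(String source) {
        String result = source;
        for (Map.Entry<String, String> rename : RENAMES.entrySet()) {
            // \b keeps longer identifiers that merely start with an old name untouched.
            String wholeWord = "\\b" + Pattern.quote(rename.getKey()) + "\\b";
            result = result.replaceAll(wholeWord, Matcher.quoteReplacement(rename.getValue()));
        }
        return result;
    }
}
```

Because of the word boundaries, an identifier such as `HTMLParserException` is left alone until it gets its own entry in the table.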
From: Mohd-Taqiyuddin Z. <mt...@ec...> - 2003-02-23 14:47:17
Hi,

Sorry to bother you. I know that the input tag is inside the HTMLFormTag. However, when I try to parse this page with HTMLFormScanner

http://developer.java.sun.com/developer/Quizzes/jbasics1-1/

it reports an error and the process terminates. Below is my test code (just to see whether an HTMLFormTag exists in the page):

    public String extractStrings() throws HTMLParserException {
        HTMLParser parser = new HTMLParser(resource);
        parser.addScanner(new HTMLFormScanner(""));
        HTMLNode node;
        String check;
        StringBuffer results = new StringBuffer();
        for (HTMLEnumeration e = parser.elements(); e.hasMoreNodes();) {
            node = e.nextHTMLNode();
            if (node instanceof HTMLFormTag) { // check the existence of HTMLFormTag
                System.out.print(node.toString());
            }
            check = node.toPlainTextString();
            results.append(check);
        }
        return results.toString();
    }

It compiles, but a runtime error is printed to the console. Below is the error:

    ERROR: HTMLReader.readElement() : Error occurred while trying to decipher the tag using scanners
    at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
    Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">
    ERROR: HTMLReader.readElement() : Error occurred while trying to read the next element,
    at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
    Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">
    ERROR: Unexpected Exception occurred while reading http://developer.java.sun.com/developer/Quizzes/jbasics1-1/, in nextHTMLNode
    at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
    Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">
    org.htmlparser.util.HTMLParserException: Unexpected Exception occurred while reading http://developer.java.sun.com/developer/Quizzes/jbasics1-1/, in nextHTMLNode
    at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
    Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">;
    org.htmlparser.util.HTMLParserException: HTMLReader.readElement() : Error occurred while trying to read the next element,
    at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
    Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">;
    org.htmlparser.util.HTMLParserException: HTMLReader.readElement() : Error occurred while trying to decipher the tag using scanners
    at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
    Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">;
    org.htmlparser.util.HTMLParserException: HTMLTag.scan() : Error while scanning tag, tag contents = form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/", tagLine = <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">;
    org.htmlparser.util.HTMLParserException: HTMLFormScanner.scan() : Error while scanning the form tag, current line = <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">;
    java.lang.NullPointerException
        at org.htmlparser.HTMLParser.addScanner(HTMLParser.java:863)
        at org.htmlparser.scanners.HTMLFormScanner.scan(HTMLFormScanner.java:164)
        at org.htmlparser.scanners.HTMLTagScanner.createScannedNode(HTMLTagScanner.java:193)
        at org.htmlparser.tags.HTMLTag.scan(HTMLTag.java:266)
at org.htmlparser.HTMLReader.readElement(HTMLReader.java:193) at org.htmlparser.util.HTMLEnumerationImpl.peek (HTMLEnumerationImpl.java:60) at org.htmlparser.util.HTMLEnumerationImpl.hasMoreNodes (HTMLEnumerationImpl.java:91) at StringExtractor.extractStrings(StringExtractor.java:27) at StringExtractor.main(StringExtractor.java:49) there is two form in the page, one is for the searching part of the site and the other one is what i'm interested in that is form with questions. Please help me on this. Is this a bug? thank you. |
From: Somik R. <so...@ya...> - 2003-02-23 05:22:19
|
You could go through the docs at http://htmlparser.sourceforge.net/docs/index.php/LinkExtraction

Forms and frames are represented by HTMLFormTag and HTMLFrameTag. You could write your own visitor that collects form tags and string nodes and, on encountering a frame tag, opens a new parser object for the frame URL and visits it with the same visitor (probably a different object). Try out the programs on this page, and it should be easy. Feel free to post here if you face any problems.

Regards,
Somik

----- Original Message -----
From: "Mohd-Taqiyuddin Zalfan" <mt...@ec...>
To: <htm...@li...>
Sent: Saturday, February 22, 2003 10:44 AM
Subject: [Htmlparser-user] Harvester

> hi,
>
> I would like to write a program that can harvest certain information (mostly
> text) on the web page. Some of the web page requires feedback from the user
> (existence of <form> tag) to get more information on the page. Some of the
> page is just a plain text and some of the page is in frames. How can I wrote
> a single harvester that can harvest these three types of pages with one
> harvester code.
>
> below is the sample pages that I want to harvest. (harvest question and get
> the correct answers.)
>
> i) with the form: http://developer.java.sun.com/developer/Quizzes/jbasics1-1/
> ii) plain text: http://www.jchq.net/mockexams/exam3.htm
> iii) with frames: http://www.angelfire.com/or/abhilash/Main.html
>
> hope you can give me some advice on how to do this. thank you.
>
> _______________________________________________
> Htmlparser-user mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-user |
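The approach in the reply above (collect form tags and string nodes, and follow each frame into its own page) can be sketched without the library. This is a minimal sketch under stated assumptions: HarvestNode and Harvester are hypothetical stand-in names, not HTMLParser classes, and the in-memory page map stands in for opening a new parser on each frame URL.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a parsed node; kind is "form", "text" or "frame".
class HarvestNode {
    final String kind;
    final String value; // form action, text content, or frame URL

    HarvestNode(String kind, String value) {
        this.kind = kind;
        this.value = value;
    }
}

// Hypothetical harvester: one code path for form, plain-text and framed pages.
class Harvester {
    // Assumption: pages are already parsed and held in memory, keyed by URL.
    private final Map<String, List<HarvestNode>> pages;
    private final Set<String> seen = new HashSet<String>();
    final List<String> forms = new ArrayList<String>();
    final List<String> texts = new ArrayList<String>();

    Harvester(Map<String, List<HarvestNode>> pages) {
        this.pages = pages;
    }

    void harvest(String url) {
        if (!seen.add(url)) {
            return; // already visited: avoids loops between frames
        }
        List<HarvestNode> nodes = pages.get(url);
        if (nodes == null) {
            return; // unknown page: nothing to collect
        }
        for (HarvestNode node : nodes) {
            if ("form".equals(node.kind)) {
                forms.add(node.value);
            } else if ("text".equals(node.kind)) {
                texts.add(node.value);
            } else if ("frame".equals(node.kind)) {
                harvest(node.value); // stands in for "open a new parser for the frame URL"
            }
        }
    }
}
```

Tracking visited URLs is a design choice of this sketch, not something the reply asks for; it keeps two frames that reference each other from recursing forever.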
From: Mohd-Taqiyuddin Z. <mt...@ec...> - 2003-02-22 18:45:38
|
Hi,

I would like to write a program that can harvest certain information (mostly text) from web pages. Some of the pages require feedback from the user (they contain a <form> tag) to get more information. Some of the pages are just plain text, and some are in frames. How can I write a single harvester that handles these three types of pages with one code base?

Below are the sample pages that I want to harvest (harvest the questions and get the correct answers):

i) with a form: http://developer.java.sun.com/developer/Quizzes/jbasics1-1/
ii) plain text: http://www.jchq.net/mockexams/exam3.htm
iii) with frames: http://www.angelfire.com/or/abhilash/Main.html

I hope you can give me some advice on how to do this. Thank you. |
From: Somik R. <so...@ya...> - 2003-02-19 19:37:15
|
The last line of all mails on this list (including the one you sent) has the link to go to the mailing list admin interface, from which you can unsubscribe yourself.

Regards,
Somik

--- ChennaDulla <che...@go...> wrote:
> Thanks,
> Chenna Dulla,
> GoneHome Inc.
> 1278 SouthMain St.
> Canton, Ohio - 44720
> tel: 330-649-9258 (W)
> 440-605-1628 (R)
>
> _______________________________________________
> Htmlparser-user mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-user |
From: ChennaDulla <che...@go...> - 2003-02-19 16:44:40
|
Thanks, Chenna Dulla, GoneHome Inc. 1278 SouthMain St. Canton, Ohio - 44720 tel: 330-649-9258 (W) 440-605-1628 (R) |
From: Somik R. <so...@ya...> - 2003-02-19 16:41:50
|
setText() should not be used. We'll probably remove it from the API asap. Pls use setAttribute().

Regards,
Somik

----- Original Message -----
From: Aminudin Khalid
To: htm...@li...
Sent: Wednesday, February 19, 2003 1:41 AM
Subject: [Htmlparser-user] getText() and setText()

HTMLTag::getText() does work fine but setText() doesn't work ? Is it true ? If possible I wanna use setText(). |
From: Somik R. <so...@ya...> - 2003-02-19 16:41:12
|
> May I know what is the key for each attribute in <a> tag ?

Usually href, but you could get all the keys like this:

    for (Enumeration keys = tag.getAttributes().keys(); keys.hasMoreElements();) {
        String key = (String) keys.nextElement();
        String value = tag.getAttribute(key);
        // ...
    }

> I've been trying to modify HTML tags and attributes but it doesnt work
> pretty well

If you show your code and tell us what's not working, we might be able to help.

Regards,
Somik |
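The loop in the reply above can be tried without the parser by modelling the attribute table as a plain java.util.Hashtable. That is an assumption suggested by the keys()/nextElement() calls, not a confirmed detail of the library, and AttributeDump and collect are hypothetical names used only for this sketch.

```java
import java.util.Enumeration;
import java.util.Hashtable;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in: a tag's attribute table modelled as a Hashtable.
// This is not the HTMLParser API, only the same iteration pattern.
class AttributeDump {
    // Copies every key/value pair into a map sorted by key, so output order is stable.
    static Map<String, String> collect(Hashtable<String, String> attributes) {
        Map<String, String> sorted = new TreeMap<String, String>();
        for (Enumeration<String> keys = attributes.keys(); keys.hasMoreElements();) {
            String key = keys.nextElement();
            sorted.put(key, attributes.get(key));
        }
        return sorted;
    }
}
```

A Hashtable does not promise any iteration order, which is why the sketch sorts the keys before returning them.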
From: Aminudin K. <ami...@mi...> - 2003-02-19 09:43:30
|
HTMLTag::getText() works fine, but setText() doesn't seem to work. Is that right? If possible, I'd like to use setText(). |
From: Aminudin K. <ami...@mi...> - 2003-02-19 09:14:59
|
May I know what the key is for each attribute in an <a> tag? I've been trying to modify HTML tags and attributes, but it doesn't work very well. Thanks |
From: <wf...@ma...> - 2003-02-18 04:27:30
|
From: "Somik Raha" <so...@ya...>
To: <htm...@li...>
Subject: Re: [Htmlparser-user] Anyone around using htmlparser together with Lotus Domino?
Date: Sat, 15 Feb 2003 20:23:09 -0800

> Thats interesting - can you tell us how you are using the parser with Lotus
> Domino, and what your doubt is ?

Thank you for your reply, Somik.

Since Domino R6 things have changed a little; however, it will take some time until this release becomes widely accepted. So what I'm investigating is related to R5, which supports Java 1.1.8 natively. There are several things I'm investigating:

1) Referrer Spamming:
This is becoming increasingly popular since referrers can be tweaked so easily. The blogging scene often presents a list of recent referrers w/o any validation. This can trick webmasters and visitors into clicking spammed ones. I'm looking for a way to filter for valid references only. Using Domino one can retrieve an HTML page including a list of hyperlinks; however, a) performance is not impressive and b) this requires that a web interface database (perweb.nsf) is set up on the server. I'd prefer to use the HTMLParser class instead. This looks like a simple one.

2) HTML translation/validation/repair
Domino's proprietary rich text format dates back to the 80s when HTML wasn't a standard. Domino's rich-text capabilities are impressive, including nested interactive sections, features like hotspots, script-enabled buttons, tabbed forms and the like. For compatibility reasons Domino was web-enabled mainly not by downsizing this format to HTML's native capabilities but by adding a richtext-to-html task and adding a special URL syntax. Although displayed properly by browsers, the generated HTML is not clean, e.g. list tags are not closed, stuff like this. I'm investigating if HTMLParser could be used to do some automatic repair - content will be edited in Domino's RTF for convenience and the resulting HTML is parsed, corrected and separately stored for web delivery. I assume that to parse HTML forgivingly the parser needs to perform some stack correction, and I hope this can easily be used for HTML repair as well?

--
Mit freundlichen Grüßen / Kind regards
Wolfgang Flamme
wf...@ma...
Am Jungstück 32
55130 Mainz-Laubenheim
Tel.: +49 (6131) 8 74 02
Mobil: +49 163 25 43 166 |
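The "stack correction" idea raised in the message above can be illustrated with a small, self-contained sketch. It makes no claim about how HTMLParser itself works: TagRepair and repair are hypothetical names, the input is assumed to be already tokenised into tag names ("ul", "/ul"), and the only HTML-specific rule it knows is that a new "li" closes a directly preceding open "li".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of stack-based tag repair; not part of any library.
class TagRepair {
    // Takes tag tokens such as "ul", "li", "/ul" and returns a balanced sequence.
    static List<String> repair(List<String> tokens) {
        Deque<String> open = new ArrayDeque<String>(); // currently open tags, innermost first
        List<String> out = new ArrayList<String>();
        for (String token : tokens) {
            if (token.startsWith("/")) {
                String name = token.substring(1);
                if (!open.contains(name)) {
                    continue; // stray close tag with no matching open tag: drop it
                }
                // Close everything opened after the matching tag, then the tag itself.
                String top;
                do {
                    top = open.pop();
                    out.add("/" + top);
                } while (!top.equals(name));
            } else {
                // Assumed rule: a new list item implicitly closes the previous one.
                if ("li".equals(token) && "li".equals(open.peek())) {
                    out.add("/" + open.pop());
                }
                open.push(token);
                out.add(token);
            }
        }
        // Close whatever is still open at the end of the input.
        while (!open.isEmpty()) {
            out.add("/" + open.pop());
        }
        return out;
    }
}
```

For the unclosed list case mentioned in the message, the tokens ul, li, li with no closing tags come back as ul, li, /li, li, /li, /ul. Real HTML repair needs many more element-specific rules than this sketch contains.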
From: Aminudin K. <ami...@mi...> - 2003-02-18 01:04:07
|
You need the latest integration release. HTMLVisitor is not in version 1.2.

PS: make sure your class path is correct.

anumodh narayanan kutty wrote:

>> public class MyCustomizedVisitor extends HTMLVisitor {
>>     public MyCustomizedVisitor(HTMLParser parser) {
>>         super(true); // It's usually a good idea to perform recursion
>>         // Add the scanners you want.
>>         // This decouples your application from having to know which
>>         // scanners are required
>>         parser.addScanner(new HTMLLinkScanner(""));
>>         parser.addScanner(new HTMLImageScanner(""));
>>         // or add all scanners with registerScanners()
>>     }
>>
>>     public void visitTag(HTMLTag tag) {
>>         // Collect any tags you want
>>         // You can also do type checking like so:
>>         if (tag instanceof HTMLMetaTag) {
>>             // This tag is a meta tag
>>             HTMLMetaTag metaTag = (HTMLMetaTag) tag;
>>         }
>>     }
>> }
>
> *****************************************************************
> Hello Somik,
>
> Thanks for the information, but I couldn't find the HTMLVisitor class.
> Where is it located? Please let me know.
>
> regards
> ANUMODH |
From: anumodh n. k. <anu...@ho...> - 2003-02-18 00:50:02
|
> public class MyCustomizedVisitor extends HTMLVisitor {
>     public MyCustomizedVisitor(HTMLParser parser) {
>         super(true); // It's usually a good idea to perform recursion
>         // Add the scanners you want.
>         // This decouples your application from having to know which
>         // scanners are required
>         parser.addScanner(new HTMLLinkScanner(""));
>         parser.addScanner(new HTMLImageScanner(""));
>         // or add all scanners with registerScanners()
>     }
>
>     public void visitTag(HTMLTag tag) {
>         // Collect any tags you want
>         // You can also do type checking like so:
>         if (tag instanceof HTMLMetaTag) {
>             // This tag is a meta tag
>             HTMLMetaTag metaTag = (HTMLMetaTag) tag;
>         }
>     }
> }

*****************************************************************
Hello Somik,

Thanks for the information, but I couldn't find the HTMLVisitor class. Where is it located? Please let me know.

regards
ANUMODH |
From: Somik R. <so...@ya...> - 2003-02-17 19:06:46
|
> I need to get only "Title and snippets(summary)"
> from Google search page.

Write your own visitor. The visitor should override visitTag(). Check if the tag is an HTMLTitleTag, and if it is, you have the title contents. To get snippets, override visitStringNode() and collect all the string data. To clean them up, use HTMLParserUtils.removeEscapeCharacters().

Check http://htmlparser.sourceforge.net/docs/index.php/LinkExtraction (point 3) for an example of writing your own visitor.

Regards,
Somik |
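The shape of such a visitor can be sketched without the library. Everything here is a stand-in: SketchNode, SketchTitle, SketchText and SketchVisitor are hypothetical types that only mimic the visitTag()/visitStringNode() callbacks named in the reply, and the clean-up step assumes that removing "escape characters" means stripping tab, carriage-return and newline characters, which the thread does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the parser's node and visitor types.
abstract class SketchNode {
    abstract void accept(SketchVisitor visitor);
}

class SketchTitle extends SketchNode {
    final String title;
    SketchTitle(String title) { this.title = title; }
    void accept(SketchVisitor visitor) { visitor.visitTag(this); }
}

class SketchText extends SketchNode {
    final String text;
    SketchText(String text) { this.text = text; }
    void accept(SketchVisitor visitor) { visitor.visitStringNode(this); }
}

// Default callbacks do nothing; subclasses override what they need.
abstract class SketchVisitor {
    void visitTag(SketchNode tag) { }
    void visitStringNode(SketchText node) { }
}

// Collects the page title and every non-empty piece of text.
class TitleAndSnippetVisitor extends SketchVisitor {
    String title;
    final List<String> snippets = new ArrayList<String>();

    @Override
    void visitTag(SketchNode tag) {
        if (tag instanceof SketchTitle) {
            title = ((SketchTitle) tag).title;
        }
    }

    @Override
    void visitStringNode(SketchText node) {
        // Assumed clean-up: drop tabs and line breaks, then trim spaces.
        String cleaned = node.text.replaceAll("[\\r\\n\\t]", "").trim();
        if (cleaned.length() > 0) {
            snippets.add(cleaned);
        }
    }
}
```

Picking out only the search-result snippets, rather than all text on the page, would still need extra filtering that depends on the page layout.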
From: anumodh n. k. <anu...@ho...> - 2003-02-17 17:28:18
|
Hello there,

I am doing a project in "Intelligent Document Clustering" and I need to get only the "Title and snippets (summary)" from a Google search page. I used StringExtractor, but it returns all the contents of the page, which I then need to clean up to get each link and its corresponding snippet. Is there any method for getting this done with your code?

Waiting for your reply.

With best regards,
ANUMODH |
From: Somik R. <so...@ya...> - 2003-02-16 04:33:26
|
Hi Folks,

Integration release 1.3-20030215 is out. From the change log:

Integration build 1.3 - 20030215
--------------------------------
[1] Added HtmlScanner
[2] Removed Table, Div and Span from registry of scanners, can still be added individually
[3] Reference test directory of project home page to maybe cure some sporadic errors in BeanTest.
[4] Added setAttribute method
[5] Cleaned up HTMLNode interface (removed TYPE, getType() and print())

With HtmlScanner, you can now get the entire page - sort of a DOM model in a Html object. Useful for testing.

Regards,
Somik |
From: Somik R. <so...@ya...> - 2003-02-16 04:24:12
|
You should set the specific attribute of the tag that you wish to modify (setAttribute()), or all the attributes if you wish (setAttributes()). A tag is represented as a collection of key-value pairs, with the tag name stored under the key HTMLTag.TAGNAME.

Regards,
Somik

----- Original Message -----
From: "Aminudin Khalid" <ami...@mi...>
To: <htm...@li...>
Sent: Thursday, February 13, 2003 11:38 PM
Subject: [Fwd: [Htmlparser-user] Modifying Text field in HTML tag - (Machine Translation project)]

> I've posted this email before and then I thought I had solved it, but I
> was wrong. I haven't solved it.
>
> Your help is very much appreciated. Thanks |
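To make the key-value view of a tag in the reply above concrete, here is a minimal sketch. SketchTag is a hypothetical model, not the library's HTMLTag; the value chosen for its TAGNAME constant and the toHtml() rendering are assumptions made only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of "a tag is a collection of key-value pairs".
class SketchTag {
    // Assumed placeholder key for the tag name; the real constant's value is not given in the thread.
    static final String TAGNAME = "$TAGNAME$";

    // LinkedHashMap keeps insertion order, so the rendered attributes are predictable.
    private final Map<String, String> attributes = new LinkedHashMap<String, String>();

    SketchTag(String name) {
        attributes.put(TAGNAME, name);
    }

    void setAttribute(String key, String value) {
        attributes.put(key, value);
    }

    String getAttribute(String key) {
        return attributes.get(key);
    }

    // Rebuilds the tag text from the pairs: tag name first, then the attributes.
    String toHtml() {
        StringBuilder sb = new StringBuilder("<").append(attributes.get(TAGNAME));
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            if (TAGNAME.equals(entry.getKey())) {
                continue; // the name was already written
            }
            sb.append(' ').append(entry.getKey())
              .append("=\"").append(entry.getValue()).append('"');
        }
        return sb.append('>').toString();
    }
}
```

In this model, changing one attribute is a single map update, and the tag text is regenerated from the pairs rather than edited as a string.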
From: Somik R. <so...@ya...> - 2003-02-16 04:21:47
|
That's interesting - can you tell us how you are using the parser with Lotus Domino, and what your question is? Regards, Somik |
From: Aminudin K. <ami...@mi...> - 2003-02-14 07:32:03
|
I've posted this email before and then I thought I had solved it, but I was wrong. I haven't solved it. Your help is very much appreciated. Thanks |