Thread: [Htmlparser-user] Malformed Input Exception
Brought to you by:
derrickoswald
From: Bob L. <bob...@ya...> - 2003-02-24 14:49:22

Hi,

I am trying to use htmlparser 1.3 to parse the HTML at
http://www.flytango.com/en/taschedule.html and
http://www.flytango.com/en/index.html. When I attempt to parse these
pages, I get a MalformedInputException:

sun.io.MalformedInputException
    at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:105)
    at java.io.InputStreamReader.convertInto(InputStreamReader.java:132)
    at java.io.InputStreamReader.fill(InputStreamReader.java:181)
    at java.io.InputStreamReader.read(InputStreamReader.java:244)
    at java.io.BufferedReader.fill(BufferedReader.java:134)
    at java.io.BufferedReader.readLine(BufferedReader.java:294)
    at java.io.BufferedReader.readLine(BufferedReader.java:357)
    at org.htmlparser.HTMLReader.getNextLine(HTMLReader.java:139)
    at org.htmlparser.HTMLReader.readElement(HTMLReader.java:176)
    at org.htmlparser.util.HTMLEnumerationImpl.peek(HTMLEnumerationImpl.java:60)
    at org.htmlparser.util.HTMLEnumerationImpl.hasMoreNodes(HTMLEnumerationImpl.java:91)

Now, if I copy the source of these pages from a browser into a file and
put them on my own webserver, I can parse them without any errors.

My guess is that some strange control character in the source is
causing the exception, but I'm not entirely sure. Any suggestions? If it
is a bad character, would it be possible to add code to HTMLReader that
strips offending characters from the input stream?

Here is the code I am using to parse:

    DefaultHTMLParserFeedback feedback
        = new DefaultHTMLParserFeedback(DefaultHTMLParserFeedback.DEBUG);

    HTMLReader reader = null;
    HTMLParser parser = null;
    InputStreamReader isr
        = new InputStreamReader(urlConn.getInputStream());
    reader = new HTMLReader(isr, 8192);
    parser = new HTMLParser(reader, feedback);
    boolean inForm = false;

    parser.addScanner(new HTMLInputTagScanner());

    HTMLEnumeration tags = parser.elements();

    RequestParameters params = new RequestParameters();

    while (tags.hasMoreNodes())
    {
        ...
    }

Thanks,

Bob Lewis
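
The top frame of the trace, sun.io.ByteToCharUTF8, indicates that the bytes are being decoded as UTF-8, which is what an InputStreamReader created without a charset argument does when the platform default encoding is UTF-8. One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that the server sends the page in a single-byte encoding such as ISO-8859-1. The sketch below, written against current Java with hypothetical class and method names (nothing here is part of htmlparser), shows how such bytes fail strict UTF-8 decoding yet decode cleanly once the charset is named explicitly:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of htmlparser. */
class CharsetCheck {

    /** Strictly decodes the bytes; throws if they are not well-formed in the given charset. */
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    /** Returns true when the bytes decode without error in the given charset. */
    static boolean isValid(byte[] bytes, Charset charset) {
        try {
            decodeStrict(bytes, charset);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // "caf\u00e9" in ISO-8859-1 ends with the single byte 0xE9,
        // which is an incomplete multi-byte sequence when read as UTF-8.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("valid as UTF-8:      " + isValid(latin1, StandardCharsets.UTF_8));
        System.out.println("valid as ISO-8859-1: " + isValid(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

If this is indeed the cause, the equivalent change in the code above would be to pass the charset when wrapping the stream, for example new InputStreamReader(urlConn.getInputStream(), "ISO-8859-1"), using whatever charset the server declares in its Content-Type header.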
From: Somik R. <so...@ya...> - 2003-02-24 18:29:52

Hi Bob,

Sounds like a bug. Can you file a bug report at
http://htmlparser.sourceforge.net?

Regards,
Somik
From: Bob L. <bob...@ya...> - 2003-02-25 20:07:38

I tried using the parser directly, as you suggested, and it seems to
work. However, I need to be able to work with the URLConnection to set
headers, cookies and send POST data.

Typically, this is what I'm doing:

    // create and initialize the URL connection
    HttpURLConnection urlConn = null;
    URL url = new URL("http://somedomain/somepath");
    urlConn = (HttpURLConnection)url.openConnection();
    urlConn.setDoInput(true);
    urlConn.setDoOutput(true);
    urlConn.setUseCaches(false);
    urlConn.setAllowUserInteraction(false);
    urlConn.setRequestMethod("POST");

    // ... usually many HTTP headers and cookie values set
    urlConn.setRequestProperty("someHeader", "someValue");
    urlConn.setRequestProperty("anotherHeader", "anotherValue");

    StringBuffer postData = new StringBuffer();
    // ... generate post data in buffer

    // send the post data
    PrintWriter printWriter = new PrintWriter(urlConn.getOutputStream());
    printWriter.println(postData.toString());
    printWriter.close();

    // parse the response
    HTMLEnumeration tags = parser.elements();

    while (parser.hasMoreNodes())
    {
        // ... Do Something
    }

This works fine on most URLs. I am normally able to execute the
server-side web application, then obtain and parse the HTML response.
However, in the case of these two URLs, I get the
MalformedInputException.

Is there something I'm missing?

Thanks,

Bob Lewis

--- Somik Raha <so...@ya...> wrote:

> Date: 2003-02-24 21:33
> Sender: somik
> Logged In: YES
> user_id=187944
>
> I ran the parser on these pages and it worked fine. Try
> runParser.bat http://www.flytango.com/en/index.html.
>
> It could be that you have initialized your urlconnection
> incorrectly. Have you tried using the parser directly, like:
>
> HTMLParser parser = new HTMLParser
> ("http://www.flytango.com/en/index.html");
> for (NodeIterator i=parser.elements();i.hasMoreNodes();) {
>     System.out.println(i.nextNode().toHtml());
> }
From: Bob L. <bob...@ya...> - 2003-02-25 20:20:39

Sorry, there was a typo in my last message:

>     while (parser.hasMoreNodes())
>     {
>         // ... Do Something
>     }

should be

    while (tags.hasMoreNodes())
    {
        // ... Do Something
    }

Also, on another note, if I try to initialize the parser directly, I am
unable to work with the URLConnection. For example:

    HttpURLConnection urlConn = null;
    HTMLParser parser = new HTMLParser("http://somedomain/somepath");
    urlConn = (HttpURLConnection)parser.getConnection();
    urlConn.setDoInput(true);
    // ...

This code throws an exception because the HTTP request has already been
made.

    Exception in thread "main" java.lang.IllegalAccessError: Already connected
        at java.net.URLConnection.setDoInput(URLConnection.java:677)
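
The "Already connected" failure comes from java.net.URLConnection itself: its configuration setters may only be called before the connection is opened. The sketch below reproduces that rule in isolation. It uses a temporary file: URL so that it runs without a network, and the class and method names are hypothetical. On current JDKs the exception raised is an IllegalStateException rather than the IllegalAccessError reported by the older runtime in the trace above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical demo class; not part of htmlparser. */
class AlreadyConnectedDemo {

    /** Returns true if a setter called after connect() is rejected. */
    static boolean setterFailsAfterConnect() {
        try {
            Path tmp = Files.createTempFile("page", ".html");
            try {
                Files.write(tmp, "<html></html>".getBytes(StandardCharsets.UTF_8));
                URLConnection conn = tmp.toUri().toURL().openConnection();

                conn.setDoInput(true);  // allowed: the connection is not open yet
                conn.connect();         // from here on the configuration is frozen

                try {
                    conn.setDoInput(true);
                    return false;       // not expected on a standard JDK
                } catch (IllegalStateException e) {
                    return true;        // message is typically "Already connected"
                } finally {
                    conn.getInputStream().close(); // release the file before deleting it
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("setter rejected after connect: " + setterFailsAfterConnect());
    }
}
```

This is why headers, cookies and the request method have to be set on the connection before anything, including a parser constructed from a URL string, opens it.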
From: Somik R. <so...@ya...> - 2003-02-26 06:44:52

That sounds like a good feature request.
Derrick, what do you think?

Regards,
Somik

----- Original Message -----
From: "Bob Lewis" <bob...@ya...>
To: <htm...@li...>
Sent: Tuesday, February 25, 2003 12:20 PM
Subject: Re: [Htmlparser-user] Malformed Input Exception

> Also, on another note, if I try to initialize the
> parser directly, I am unable to work with the
> URLConnection. For example:
>
>     HttpURLConnection urlConn = null;
>     HTMLParser parser = new
>     HTMLParser("http://somedomain/somepath");
>     urlConn =
>     (HttpURLConnection)parser.getConnection();
>     urlConn.setDoInput(true);
>     // ...
>
> This code throws an exception because the HTTP request
> has already been made.
From: Mohd-Taqiyuddin Z. <mt...@ec...> - 2003-02-26 16:34:06

Hi,

I'm writing a harvester to collect the information in form tags. It
works fine on every HTML page I need to parse except for this URL:
http://developer.java.sun.com/developer/Quizzes/misc/earlyadopterjxta.html.
It seems that the page that gives the error does not have an end tag
for its form tag, and the parser keeps looking for that end tag. Is this
a bug? Do you know a solution that lets me still parse the page and
still get the Vector FormInput for further processing? I hope you can
help me with this. The generated error is below.

ERROR: HTMLReader.readElement() : Error occurred while trying to decipher the tag using scanners
    Tag being processed : FORM
    Current Tag Line : <form action="earlyadopterjxtaanswers.jsp" method="POST">
    at Line 690 : null
    Previous Line 689 : </HTML>
ERROR: HTMLReader.readElement() : Error occurred while trying to read the next element,
    at Line 690 : null
    Previous Line 689 : </HTML>
ERROR: Unexpected Exception occurred while reading http://developer.java.sun.com/developer/Quizzes/misc/earlyadopterjxta.html, in nextHTMLNode
    at Line 690 : null
    Previous Line 689 : </HTML>
org.htmlparser.util.ParserException: Unexpected Exception occurred while reading http://developer.java.sun.com/developer/Quizzes/misc/earlyadopterjxta.html, in nextHTMLNode
    at Line 690 : null
    Previous Line 689 : </HTML>
From: Somik R. <so...@ya...> - 2003-02-26 18:05:06

This is a known limitation. The problem is in guessing when a form tag
really should have ended. Can you suggest something, looking at the
page that failed?

Regards,
Somik
From: Mohd-Taqiyuddin Z. <mt...@ec...> - 2003-02-27 00:21:15

Hi there,

I think a form tag should end when the parser sees another form tag,
even though that is not an end tag. Another way of determining the end
of a form tag is to check whether the end of the HTML page has been
reached, by checking for the end tag of the html tag.

The reasoning is that a form tag consists of input tags, and the
important information about a form is its method, its action, and its
input tags. So when the parser first sees a form tag, it should parse
nodes until it sees the end tag of the form tag, another form tag, or
the end of the HTML document. That way we can logically group the
Vector of input tags and the other attributes with the appropriate form
tag (if there is more than one form tag).

I hope my explanation can help us improve htmlparser.

Thank you.
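
The rule proposed above can be sketched as a small grouping pass. This is only an illustration of the heuristic, not how htmlparser's scanners are implemented: the class name, the method name, and the representation of tags as plain strings ("form", "/form", "/html", anything else being form content) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of the proposed heuristic; not htmlparser code. */
class FormGrouping {

    /**
     * Groups the tags found inside each form. A form is considered closed by
     * "/form", by the start of another "form", by "/html", or by end of input.
     */
    static List<List<String>> groupForms(List<String> tags) {
        List<List<String>> forms = new ArrayList<>();
        List<String> current = null;                 // null while outside any form

        for (String tag : tags) {
            String name = tag.toLowerCase();
            if (name.equals("form")) {
                if (current != null) {
                    forms.add(current);              // implicit end of the previous form
                }
                current = new ArrayList<>();
            } else if (name.equals("/form") || name.equals("/html")) {
                if (current != null) {
                    forms.add(current);              // explicit end, or end of document
                }
                current = null;
            } else if (current != null) {
                current.add(tag);                    // content belonging to the open form
            }
        }
        if (current != null) {
            forms.add(current);                      // input ended with a form still open
        }
        return forms;
    }

    public static void main(String[] args) {
        // Two forms, neither of which has a closing tag.
        List<String> tags = Arrays.asList(
                "html", "form", "input:a", "input:b", "form", "input:c", "/html");
        System.out.println(groupForms(tags));
    }
}
```

With this rule the inputs of an unclosed form are still attributed to it, at the cost of possibly swallowing content that a browser would place outside the form.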
From: Somik R. <so...@ya...> - 2003-02-26 06:44:02
|
Hi Bob,

Can you try this: get the data from the URL in question into a file (using a POST request), then try to parse the file. If it breaks, we would know why.

Regards,
Somik

----- Original Message -----
From: "Bob Lewis" <bob...@ya...>
To: <htm...@li...>
Sent: Tuesday, February 25, 2003 12:07 PM
Subject: Re: [Htmlparser-user] Malformed Input Exception

> I tried using the parser directly, as you suggested,
> and it seems to work. However, I need to be able to work
> with the URLConnection to set headers, cookies and
> send POST data.
>
> Typically, this is what I'm doing:
>
>   // create and initialize the URL connection
>   HttpURLConnection urlConn = null;
>   URL url = new URL("http://somedomain/somepath");
>   urlConn = (HttpURLConnection)url.openConnection();
>   urlConn.setDoInput(true);
>   urlConn.setDoOutput(true);
>   urlConn.setUseCaches(false);
>   urlConn.setAllowUserInteraction(false);
>   urlConn.setRequestMethod("POST");
>
>   // ... usually many HTTP headers and cookie values set
>   urlConn.setRequestProperty("someHeader", "someValue");
>   urlConn.setRequestProperty("anotherHeader", "anotherValue");
>
>   StringBuffer postData = new StringBuffer();
>   // ... generate post data in buffer
>
>   // send the post data
>   PrintWriter printWriter = new PrintWriter(urlConn.getOutputStream());
>   printWriter.println(postData.toString());
>   printWriter.close();
>
>   // parse the response
>   HTMLEnumeration tags = parser.elements();
>
>   while (tags.hasMoreNodes())
>   {
>     // ... Do Something
>   }
>
> This works fine on most URLs. I am normally able to
> execute the server-side web application, obtain and
> parse the HTML response. However, in the case of
> these two URLs, I get the MalformedInputException.
>
> Is there something I'm missing?
>
> Thanks,
>
> Bob Lewis
>
> --- Somik Raha <so...@ya...> wrote:
>
> >Date: 2003-02-24 21:33
> >Sender: somik
> >Logged In: YES
> >user_id=187944
> >
> >I ran the parser on these pages and it worked fine. Try
> >runParser.bat http://www.flytango.com/en/index.html.
> >
> >It could be that you have initialized your urlconnection
> >incorrectly. Have you tried using the parser directly, like:
> >
> >HTMLParser parser = new HTMLParser
> >("http://www.flytango.com/en/index.html");
> >for (NodeIterator i=parser.elements();i.hasMoreNodes();) {
> >  System.out.println(i.nextNode().toHtml());
> >}
>
> > --- Somik Raha <so...@ya...> wrote:
> > Hi Bob,
> > Sounds like a bug.
> > Can you file a bug report at
> > http://htmlparser.sourceforge.net?
> >
> > Regards,
> > Somik
|
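The "save the response to a file" step suggested above can be sketched as follows. This is only a minimal illustration, not code from htmlparser: the class name ResponseDump and the method copyBytes are hypothetical. The point it tries to show is that the response is copied as raw bytes, with no Reader involved, so no character decoding can alter or reject the data before it reaches the file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of htmlparser.
class ResponseDump {

    // Copies every byte from in to out unchanged and returns the byte count.
    // No Reader is used, so no charset decoding takes place.
    static long copyBytes(InputStream in, OutputStream out) {
        byte[] buffer = new byte[8192];
        long total = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // Stand-in for a server response; 0xE9 is a byte that is valid in
        // ISO-8859-1 but is not a valid UTF-8 sequence on its own.
        byte[] body = { '<', 'p', '>', (byte) 0xE9, '<', '/', 'p', '>' };
        ByteArrayOutputStream saved = new ByteArrayOutputStream();
        long n = copyBytes(new ByteArrayInputStream(body), saved);
        System.out.println(n + " bytes copied");
    }
}
```

In the application discussed in this thread, the input would presumably be urlConn.getInputStream() and the output a FileOutputStream on some local file; both are left out here so the sketch runs without a network connection.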
From: Bob L. <bob...@ya...> - 2003-02-26 16:16:41
|
Hi,

I tried this, as you suggested, and received the same exception while reading the InputStream. That led me to discover that I was setting the wrong character set in the InputStreamReader. My app was erroneously using the system default character set (UTF-8 in this case), but the actual stream was using ISO-8859-1.

The getCharset and getCharacterSet methods in Parser are very useful here. You may want to consider making them static and public, or moving them to a utility class. That way they can be used by applications which construct their own Readers.

Thanks for the help,

Bob Lewis

--- Somik Raha <so...@ya...> wrote:
> Hi Bob,
> Can you try this - get the data from the url in question into a file
> (using a post request). Then try to parse the file. If it breaks, we would
> know why.
>
> Regards,
> Somik
|
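The fix described above (building the InputStreamReader with the character set the server actually declares, instead of the platform default) can be sketched roughly like this. It is an illustration under stated assumptions, not the parser's own getCharset logic: CharsetSniffer, charsetFromContentType, openReader and readAll are hypothetical names, and falling back to ISO-8859-1 when no charset is declared is a choice made for this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper, not part of htmlparser.
class CharsetSniffer {

    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=ISO-8859-1", or fallback if none is usable.
    static String charsetFromContentType(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim();
                // Some servers quote the value: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    if (!name.isEmpty() && Charset.isSupported(name)) {
                        return name;
                    }
                } catch (IllegalArgumentException e) {
                    // Illegal charset name: ignore it and use the fallback.
                }
            }
        }
        return fallback;
    }

    // Wraps the stream in a Reader that uses the declared charset.
    // Assumption for this sketch: ISO-8859-1 when nothing is declared.
    static Reader openReader(InputStream in, String contentType) {
        String name = charsetFromContentType(contentType, "ISO-8859-1");
        return new InputStreamReader(in, Charset.forName(name));
    }

    // Reads the whole Reader into a String (used only for the demo).
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "cafe" with an e-acute, encoded as ISO-8859-1 (0xE9).
        byte[] latin1 = { 'c', 'a', 'f', (byte) 0xE9 };
        Reader reader = openReader(new ByteArrayInputStream(latin1),
                                   "text/html; charset=ISO-8859-1");
        System.out.println(readAll(reader).length() + " characters decoded");
    }
}
```

With a live connection the call would look something like openReader(urlConn.getInputStream(), urlConn.getContentType()), and the resulting Reader would be handed to HTMLReader in place of the default-charset InputStreamReader. Pages that declare their charset only in a META tag are not covered by this sketch.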