Re: [Htmlparser-user] Malformed Input Exception
From: Bob L. <bob...@ya...> - 2003-02-26 16:16:41
Hi,

I tried this, as you suggested, and received the same Exception while
reading the InputStream. That led me to discover that I was setting the
wrong character set in the InputStreamReader. My app was erroneously
using the system default character set (UTF-8 in this case), but the
actual stream was using ISO-8859-1.

The getCharset and getCharacterSet methods in Parser are very useful
here. You may want to consider making them static and public, or moving
them to a utility class. That way they can be used by applications that
construct their own Readers.

Thanks for the help,

Bob Lewis

--- Somik Raha <so...@ya...> wrote:
> Hi Bob,
> Can you try this - get the data from the url in question into a file
> (using a post request). Then try to parse the file. If it breaks, we
> would know why.
>
> Regards,
> Somik
> ----- Original Message -----
> From: "Bob Lewis" <bob...@ya...>
> To: <htm...@li...>
> Sent: Tuesday, February 25, 2003 12:07 PM
> Subject: Re: [Htmlparser-user] Malformed Input Exception
>
> > I tried using the parser directly, as you suggested, and it seems
> > to work. However, I need to be able to work with the URLConnection
> > to set headers, cookies and send POST data.
> >
> > Typically, this is what I'm doing:
> >
> > //create and initialize the URL Connection
> > HttpURLConnection urlConn = null;
> > URL url = new URL("http://somedomain/somepath");
> > urlConn = (HttpURLConnection)url.openConnection();
> > urlConn.setDoInput(true);
> > urlConn.setDoOutput(true);
> > urlConn.setUseCaches(false);
> > urlConn.setAllowUserInteraction(false);
> > urlConn.setRequestMethod("POST");
> >
> > // ... usually many HTTP Headers and cookie values set
> > urlConn.setRequestProperty("someHeader", "someValue");
> > urlConn.setRequestProperty("anotherHeader", "anotherValue");
> >
> > StringBuffer postData = new StringBuffer();
> > // ... generate post data in buffer
> >
> > //Send the post data
> > PrintWriter printWriter = new PrintWriter(urlConn.getOutputStream());
> > printWriter.println(postData.toString());
> > printWriter.close();
> >
> > //parse the response
> > HTMLEnumeration tags = parser.elements();
> >
> > while (tags.hasMoreNodes())
> > {
> >     // ... Do Something
> > }
> >
> > This works fine on most URLs. I am normally able to execute the
> > server-side web application, obtain and parse the HTML response.
> > However, in the case of these two URLs, I get the
> > MalformedInputException.
> >
> > Is there something I'm missing?
> >
> > Thanks,
> >
> > Bob Lewis
> >
> > --- Somik Raha <so...@ya...> wrote:
> >
> > >Date: 2003-02-24 21:33
> > >Sender: somik
> > >Logged In: YES
> > >user_id=187944
> > >
> > >I ran the parser on these pages and it worked fine. Try
> > >runParser.bat http://www.flytango.com/en/index.html.
> > >
> > >It could be that you have initialized your urlconnection
> > >incorrectly. Have you tried using the parser directly, like:
> > >
> > >HTMLParser parser = new HTMLParser("http://www.flytango.com/en/index.html");
> > >for (NodeIterator i=parser.elements();i.hasMoreNodes();) {
> > >    System.out.println(i.nextNode().toHtml());
> > >}
> >
> > --- Somik Raha <so...@ya...> wrote:
> > > Hi Bob,
> > > Sounds like a bug.
> > > Can you file a bug report at http://htmlparser.sourceforge.net?
> > >
> > > Regards,
> > > Somik
> > > --- Bob Lewis <bob...@ya...> wrote:
> > > > Hi,
> > > >
> > > > I am trying to use htmlparser 1.3 to parse the HTML at
> > > > http://www.flytango.com/en/taschedule.html and
> > > > http://www.flytango.com/en/index.html. When I attempt to parse
> > > > these pages, I get sun.io.MalformedInputException:
> > > >
> > > > sun.io.MalformedInputException
> > > >   at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:105)
> > > >   at java.io.InputStreamReader.convertInto(InputStreamReader.java:132)
> > > >   at java.io.InputStreamReader.fill(InputStreamReader.java:181)
> > > >   at java.io.InputStreamReader.read(InputStreamReader.java:244)
> > > >   at java.io.BufferedReader.fill(BufferedReader.java:134)
> > > >   at java.io.BufferedReader.readLine(BufferedReader.java:294)
> > > >   at java.io.BufferedReader.readLine(BufferedReader.java:357)
> > > >   at org.htmlparser.HTMLReader.getNextLine(HTMLReader.java:139)
> > > >   at org.htmlparser.HTMLReader.readElement(HTMLReader.java:176)
> > > >   at org.htmlparser.util.HTMLEnumerationImpl.peek(HTMLEnumerationImpl.java:60)
> > > >   at org.htmlparser.util.HTMLEnumerationImpl.hasMoreNodes(HTMLEnumerationImpl.java:91)
> > > >
> > > > Now, if I copy the source of these pages from a browser into a
> > > > file and put them on my own webserver, I can parse them without
> > > > any errors.
> > > >
> > > > It's my guess that there is some strange control character in
> > > > the source that is causing the exception, but I'm not entirely
> > > > sure. Any suggestions? If it is a bad character, would it be
> > > > possible to add code to HTMLReader that strips offending
> > > > characters from the input stream?
> > > >
=== message truncated ===
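
For anyone who hits the same exception: the fix described at the top of this thread (decode the response with the charset the server declares, not the platform default) might look roughly like the sketch below. It uses only JDK classes, not the htmlparser API. The class and method names (CharsetFromContentType, charsetFrom, readAll) are made up for illustration, and falling back to ISO-8859-1 when no charset is declared is an assumption, not something stated in the thread. Note that a current JDK's InputStreamReader substitutes replacement characters for malformed input rather than throwing the old sun.io exception, so a wrong charset today tends to show up as garbled text.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetFromContentType {

    // Hypothetical helper (not part of htmlparser): returns the charset named in a
    // Content-Type value such as "text/html; charset=ISO-8859-1", or the fallback
    // when no usable charset parameter is present.
    public static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Reads the whole stream with an explicit charset instead of the platform default.
    // I/O failures are rethrown unchecked to keep the sketch short.
    public static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-ins for urlConn.getContentType() and urlConn.getInputStream().
        String contentType = "text/html; charset=ISO-8859-1";
        byte[] body = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);

        // Assumed fallback: ISO-8859-1 when the header names no charset.
        Charset charset = charsetFrom(contentType, StandardCharsets.ISO_8859_1);
        String html = readAll(new ByteArrayInputStream(body), charset);
        System.out.println(charset + " -> " + html);
    }
}
```

With a live connection, the two stand-ins in main would be replaced by urlConn.getContentType() and urlConn.getInputStream(), and the resulting Reader (or decoded text) handed to whatever parser is in use.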