Re: [Htmlparser-user] Malformed Input Exception


Hi,

I tried this, as you suggested, and received the same
exception while reading the InputStream.  That led me
to discover that I was setting the wrong character set
in the InputStreamReader.

My app was erroneously using the system default
character set (UTF-8 in this case), but the actual
stream was encoded in ISO-8859-1.
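
A minimal sketch of the mismatch described above: the same ISO-8859-1 bytes are decoded once with the matching charset and once as UTF-8. The class and method names are made up for illustration and are not part of htmlparser. On current JVMs an InputStreamReader built with the wrong charset usually substitutes replacement characters instead of throwing the MalformedInputException raised by the old sun.io converters, so this only demonstrates the decoding difference.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    // Reads the whole stream as text, decoding the bytes with the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 0xE9 is 'é' in ISO-8859-1; followed by '!' it is not valid UTF-8.
        byte[] latin1Bytes = {'c', 'a', 'f', (byte) 0xE9, '!'};

        String right = readAll(new ByteArrayInputStream(latin1Bytes),
                               StandardCharsets.ISO_8859_1);
        String wrong = readAll(new ByteArrayInputStream(latin1Bytes),
                               StandardCharsets.UTF_8);

        System.out.println("decoded as ISO-8859-1: " + right);
        System.out.println("decoded as UTF-8:      " + wrong);
        System.out.println("same text? " + right.equals(wrong));
    }
}
```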

The getCharset and getCharacterSet methods in Parser
are very useful here.  You may want to consider making
them static and public, or moving them to a Utility
class.  That way they can be used by applications
which construct their own Readers.
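
As a rough idea of what such a utility could look like for applications that build their own Readers, here is a sketch of a static helper that pulls the charset out of a Content-Type header value. It is a hypothetical stand-in, not the actual getCharset or getCharacterSet code from Parser; the class name, method name, and fallback parameter are all assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeCharset {
    private static final String KEY = "charset=";

    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=ISO-8859-1", or the fallback when the header is
    // missing, has no charset parameter, or names an unknown charset.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Parameter names are case-insensitive, so compare ignoring case.
            if (p.regionMatches(true, 0, KEY, 0, KEY.length())) {
                String name = p.substring(KEY.length()).trim();
                // Strip optional surrounding double quotes.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=ISO-8859-1",
                                           StandardCharsets.UTF_8));
        System.out.println(fromContentType("text/html", StandardCharsets.UTF_8));
    }
}
```

With a helper like this, the result of urlConn.getContentType() could be passed in and the returned Charset handed to the InputStreamReader constructor, instead of relying on the system default.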

Thanks for the help,

Bob Lewis

--- Somik Raha <so...@ya...> wrote:
> Hi Bob,
>     Can you try this: get the data from the URL in question into a file
> (using a POST request), then try to parse the file. If it breaks, we would
> know why.
> 
> Regards,
> Somik
> ----- Original Message -----
> From: "Bob Lewis" <bob...@ya...>
> To: <htm...@li...>
> Sent: Tuesday, February 25, 2003 12:07 PM
> Subject: Re: [Htmlparser-user] Malformed Input Exception
> 
> 
> >
> > I tried using the parser directly, as you suggested,
> > and it seems to work.  However, I need to be able to work
> > with the URLConnection to set headers, cookies and
> > send POST data.
> >
> > Typically, this is what I'm doing:
> >
> >     // create and initialize the URL connection
> >     HttpURLConnection urlConn = null;
> >     URL url = new URL("http://somedomain/somepath");
> >     urlConn = (HttpURLConnection) url.openConnection();
> >     urlConn.setDoInput(true);
> >     urlConn.setDoOutput(true);
> >     urlConn.setUseCaches(false);
> >     urlConn.setAllowUserInteraction(false);
> >     urlConn.setRequestMethod("POST");
> >
> >     // ... usually many HTTP headers and cookie values set
> >     urlConn.setRequestProperty("someHeader", "someValue");
> >     urlConn.setRequestProperty("anotherHeader", "anotherValue");
> >
> >     StringBuffer postData = new StringBuffer();
> >     // ... generate post data in buffer
> >
> >     // send the post data
> >     PrintWriter printWriter = new PrintWriter(urlConn.getOutputStream());
> >     printWriter.println(postData.toString());
> >     printWriter.close();
> >
> >     // parse the response
> >     HTMLEnumeration tags = parser.elements();
> >
> >     // iterate over the enumeration returned by the parser
> >     while (tags.hasMoreNodes())
> >     {
> >         // ... Do Something
> >     }
> >
> > This works fine on most URLs.  I am normally able to
> > execute the server-side web application, then obtain and
> > parse the HTML response.  However, in the case of
> > these two URLs, I get the MalformedInputException.
> >
> > Is there something I'm missing?
> >
> > Thanks,
> >
> > Bob Lewis
> >
> > --- Somik Raha <so...@ya...> wrote:
> >
> > >Date: 2003-02-24 21:33
> > >Sender: somik
> > >Logged In: YES
> > >user_id=187944
> > >
> > >I ran the parser on these pages and it worked fine. Try
> > >runParser.bat http://www.flytango.com/en/index.html.
> > >
> > >It could be that you have initialized your urlconnection
> > >incorrectly. Have you tried using the parser directly, like:
> > >
> > >HTMLParser parser = new HTMLParser("http://www.flytango.com/en/index.html");
> > >for (NodeIterator i = parser.elements(); i.hasMoreNodes();) {
> > >   System.out.println(i.nextNode().toHtml());
> > >}
> >
> > --- Somik Raha <so...@ya...> wrote:
> > > Hi Bob,
> > >   Sounds like a bug.
> > >   Can you file a bug report at
> > > http://htmlparser.sourceforge.net?
> > >
> > > Regards,
> > > Somik
> > > --- Bob Lewis <bob...@ya...> wrote:
> > > > Hi,
> > > >
> > > > > I am trying to use htmlparser 1.3 to parse the HTML at
> > > > > http://www.flytango.com/en/taschedule.html and
> > > > > http://www.flytango.com/en/index.html. When I attempt
> > > > > to parse these pages, I get a
> > > > > sun.io.MalformedInputException:
> > > >
> > > > > sun.io.MalformedInputException
> > > > >         at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:105)
> > > > >         at java.io.InputStreamReader.convertInto(InputStreamReader.java:132)
> > > > >         at java.io.InputStreamReader.fill(InputStreamReader.java:181)
> > > > >         at java.io.InputStreamReader.read(InputStreamReader.java:244)
> > > > >         at java.io.BufferedReader.fill(BufferedReader.java:134)
> > > > >         at java.io.BufferedReader.readLine(BufferedReader.java:294)
> > > > >         at java.io.BufferedReader.readLine(BufferedReader.java:357)
> > > > >         at org.htmlparser.HTMLReader.getNextLine(HTMLReader.java:139)
> > > > >         at org.htmlparser.HTMLReader.readElement(HTMLReader.java:176)
> > > > >         at org.htmlparser.util.HTMLEnumerationImpl.peek(HTMLEnumerationImpl.java:60)
> > > > >         at org.htmlparser.util.HTMLEnumerationImpl.hasMoreNodes(HTMLEnumerationImpl.java:91)
> > > >
> > > > > Now, if I copy the source of these pages from a
> > > > > browser into a file and put them on my own webserver,
> > > > > I can parse them without any errors.
> > > > >
> > > > > It's my guess that there is some strange control
> > > > > character in the source that is causing the exception,
> > > > > but I'm not entirely sure.  Any suggestions?  If it is
> > > > > a bad character, would it be possible to add code to
> > > > > HTMLReader that strips offending characters from the
> > > > > input stream?
> > > >
> 
=== message truncated ===
