Re: [Htmlparser-user] Malformed Input Exception
From: Bob L. <bob...@ya...> - 2003-02-25 20:20:39
Sorry, there was a typo in my last message:

> while (parser.hasMoreNodes())
> {
>     // ... Do Something
> }

should be

    while (tags.hasMoreNodes())
    {
        // ... Do Something
    }

Also, on another note, if I try to initialize the parser directly, I am
unable to work with the URLConnection. For example:

    HttpURLConnection urlConn = null;
    HTMLParser parser = new HTMLParser("http://somedomain/somepath");
    urlConn = (HttpURLConnection)parser.getConnection();
    urlConn.setDoInput(true);
    // ...

This code throws an exception because the HTTP request has already been made.

    Exception in thread "main" java.lang.IllegalAccessError: Already connected
        at java.net.URLConnection.setDoInput(URLConnection.java:677)

--- Bob Lewis <bob...@ya...> wrote:
>
> I tried using the parser directly, as you suggested, and it seems to
> work. However, I need to be able to work with the URLConnection to set
> headers, cookies and send POST data.
>
> Typically, this is what I'm doing:
>
>     //create and initialize the URL Connection
>     HttpURLConnection urlConn = null;
>     URL url = new URL("http://somedomain/somepath");
>     urlConn = (HttpURLConnection)url.openConnection();
>     urlConn.setDoInput(true);
>     urlConn.setDoOutput(true);
>     urlConn.setUseCaches(false);
>     urlConn.setAllowUserInteraction(false);
>     urlConn.setRequestMethod("POST");
>
>     // ... usually many HTTP Headers and cookie values set
>     urlConn.setRequestProperty("someHeader", "someValue");
>     urlConn.setRequestProperty("anotherHeader", "anotherValue");
>
>     StringBuffer postData = new StringBuffer();
>     // ... generate post data in buffer
>
>     //Send the post data
>     PrintWriter printWriter = new PrintWriter(urlConn.getOutputStream());
>     printWriter.println(postData.toString());
>     printWriter.close();
>
>     //parse the response
>     HTMLEnumeration tags = parser.elements();
>
>     while (parser.hasMoreNodes())
>     {
>         // ... Do Something
>     }
>
> This works fine on most URLs.
> I am normally able to execute the server-side web application, obtain
> and parse the HTML response. However, in the case of these two URLs, I
> get the MalformedInputException.
>
> Is there something I'm missing?
>
> Thanks,
>
> Bob Lewis
>
> --- Somik Raha <so...@ya...> wrote:
> >
> >Date: 2003-02-24 21:33
> >Sender: somik
> >Logged In: YES
> >user_id=187944
> >
> >I ran the parser on these pages and it worked fine. Try
> >runParser.bat http://www.flytango.com/en/index.html.
> >
> >It could be that you have initialized your urlconnection
> >incorrectly. Have you tried using the parser directly, like :
> >
> >HTMLParser parser = new HTMLParser
> >("http://www.flytango.com/en/index.html");
> >for (NodeIterator i=parser.elements();i.hasMoreNodes();) {
> >    System.out.println(i.nextNode().toHtml());
> >}
>
> --- Somik Raha <so...@ya...> wrote:
> > Hi Bob,
> > Sounds like a bug.
> > Can you file a bug report at
> > http://htmlparser.sourceforge.net?
> >
> > Regards,
> > Somik
> > --- Bob Lewis <bob...@ya...> wrote:
> > > Hi,
> > >
> > > I am trying to use htmlparser 1.3 to parse the HTML at
> > > http://www.flytango.com/en/taschedule.html and
> > > http://www.flytango.com/en/index.html.
> > > When I attempt to parse these pages, I get
> > > com.sun.io.MalformedInputException:
> > >
> > > sun.io.MalformedInputException
> > >     at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:105)
> > >     at java.io.InputStreamReader.convertInto(InputStreamReader.java:132)
> > >     at java.io.InputStreamReader.fill(InputStreamReader.java:181)
> > >     at java.io.InputStreamReader.read(InputStreamReader.java:244)
> > >     at java.io.BufferedReader.fill(BufferedReader.java:134)
> > >     at java.io.BufferedReader.readLine(BufferedReader.java:294)
> > >     at java.io.BufferedReader.readLine(BufferedReader.java:357)
> > >     at org.htmlparser.HTMLReader.getNextLine(HTMLReader.java:139)
> > >     at org.htmlparser.HTMLReader.readElement(HTMLReader.java:176)
> > >     at org.htmlparser.util.HTMLEnumerationImpl.peek(HTMLEnumerationImpl.java:60)
> > >     at org.htmlparser.util.HTMLEnumerationImpl.hasMoreNodes(HTMLEnumerationImpl.java:91)
> > >
> > > Now, if I copy the source of these pages from a browser into a
> > > file and put them on my own webserver, I can parse them without
> > > any errors.
> > >
> > > It's my guess that there is some strange control character in the
> > > source that is causing the exception, but I'm not entirely sure.
> > > Any suggestions? If it is a bad character, would it be possible to
> > > add code to HTMLReader that strips offending characters from the
> > > input stream?
> > >
> > > Here is the code I am using to parse:
> > >
> > >     DefaultHTMLParserFeedback feedback
> > >         = new DefaultHTMLParserFeedback(DefaultHTMLParserFeedback.DEBUG);
> > >
> > >     HTMLReader reader = null;
> > >     HTMLParser parser = null;
> > >     InputStreamReader isr
> > >         = new InputStreamReader(urlConn.getInputStream());
> > >     reader = new HTMLReader(isr, 8192);
> > >     parser = new HTMLParser(reader, feedback);
> > >     boolean inForm = false;
> > >
> > >     parser.addScanner(new HTMLInputTagScanner());
> > >
> > >     HTMLEnumeration tags = parser.elements();
> > >
> > >     RequestParameters params = new RequestParameters();
> > >
> > >     while (tags.hasMoreNodes())
> > >     {
> > >         ...
> > >     }
> > >
> > >
> > > Thanks,
> > >
> > > Bob Lewis
> > >
>

=== message truncated ===
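
On the question raised in the thread about stripping offending characters from the input stream: one possible approach, sketched here with plain JDK classes only, is to decode the response with a CharsetDecoder that replaces malformed byte sequences instead of throwing. This assumes a JDK that provides java.nio.charset (1.4 or later) and that the page is meant to be UTF-8. The class and method names below (LenientReaderSketch, lenientUtf8Reader, readAll) are hypothetical and are not part of htmlparser. It is a sketch of the idea, not a fix confirmed by the library authors.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class LenientReaderSketch {

    // Hypothetical helper: wraps a byte stream in a Reader whose decoder
    // substitutes the replacement character for malformed UTF-8 input
    // rather than raising an exception.
    public static Reader lenientUtf8Reader(InputStream in) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return new BufferedReader(new InputStreamReader(in, decoder));
    }

    // Hypothetical helper: drains a Reader into a String.
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // 0xE9 is "é" in ISO-8859-1 but is not a valid sequence on its own in UTF-8.
        byte[] page = {'o', 'k', (byte) 0xE9, ' ', 'd', 'o', 'n', 'e'};
        String text = readAll(lenientUtf8Reader(new ByteArrayInputStream(page)));
        // The stray byte is replaced; the rest of the text is still decoded.
        System.out.println(text);
    }
}
```

Since the code quoted above constructs HTMLReader from an InputStreamReader, a reader built this way from urlConn.getInputStream() could presumably be passed in its place. Whether that resolves the exception for these two URLs has not been verified here.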