Re: [Htmlparser-user] CDATA in script breakes parser?
From: Eugeny N D. <bo...@re...> - 2006-11-08 16:50:36
> Hi there, I found this page: http://www.katzenfinch.com/
> This page contains several links, but HtmlParser does not follow them. In
> general, after parsing it has only the head and meta tags available: no body
> tag with links, tables etc.
>
> Looks like a CDATA item inside JavaScript breaks things?
> Could somebody please advise?

I tried to use this code:

    import java.io.InputStream;
    import java.util.LinkedList;

    import org.apache.log4j.Logger;
    import org.xml.sax.Attributes;
    import org.xml.sax.ErrorHandler;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    import org.xml.sax.SAXParseException;
    import org.xml.sax.helpers.DefaultHandler;

    public class SAXHTMLParser extends DefaultHandler {

        private static Logger log = Logger.getLogger(SAXHTMLParser.class);

        public LinkedList parseDocument(InputStream document, String encoding) {
            try {
                org.xml.sax.XMLReader reader = org.xml.sax.helpers.XMLReaderFactory
                        .createXMLReader("org.htmlparser.sax.XMLReader");
                reader.setContentHandler(this);
                reader.setErrorHandler(new MyErrorHandler());
                reader.parse(new InputSource(document));
            } catch (Exception e) {
                log.error(e, e);
            }
            return new LinkedList();
        }

        /**
         * @see org.xml.sax.helpers.DefaultHandler#startElement(java.lang.String,
         *      java.lang.String, java.lang.String, org.xml.sax.Attributes)
         */
        public void startElement(String uri, String localName, String qName,
                Attributes attrs) throws SAXException {
            // if ("img".equalsIgnoreCase(qName) || "a".equalsIgnoreCase(qName)
            //         || "frame".equalsIgnoreCase(qName)
            //         || "title".equalsIgnoreCase(qName)
            //         || "base".equalsIgnoreCase(qName))
            log.debug(localName);
        }

        class MyErrorHandler implements ErrorHandler {

            /**
             * @see org.xml.sax.ErrorHandler#error(org.xml.sax.SAXParseException)
             */
            public void error(SAXParseException arg0) throws SAXException {
                log.error(arg0);
            }

            /**
             * @see org.xml.sax.ErrorHandler#fatalError(org.xml.sax.SAXParseException)
             */
            public void fatalError(SAXParseException arg0) throws SAXException {
                log.error(arg0);
            }

            /**
             * @see org.xml.sax.ErrorHandler#warning(org.xml.sax.SAXParseException)
             */
            public void warning(SAXParseException arg0) throws SAXException {
                log.error(arg0);
            }
        }
    }

and the results were:

    [main] DEBUG SAXHTMLParser - !DOCTYPE
    [main] DEBUG SAXHTMLParser - HTML
    [main] DEBUG SAXHTMLParser - HEAD
    [main] DEBUG SAXHTMLParser - TITLE
    [main] DEBUG SAXHTMLParser - META
    [main] DEBUG SAXHTMLParser - META
    [main] DEBUG SAXHTMLParser - STYLE
    [main] DEBUG SAXHTMLParser - SCRIPT

But if I switch to another SAX parser for HTML:

    org.xml.sax.XMLReader reader = org.xml.sax.helpers.XMLReaderFactory
            .createXMLReader("org.ccil.cowan.tagsoup.Parser");
    reader.setContentHandler(this);
    reader.setErrorHandler(new MyErrorHandler());
    reader.parse(new InputSource(document));

I see this:

    [main] DEBUG .SAXHTMLParser - html
    [main] DEBUG .SAXHTMLParser - head
    [main] DEBUG .SAXHTMLParser - title
    [main] DEBUG .SAXHTMLParser - meta
    [main] DEBUG .SAXHTMLParser - meta
    [main] DEBUG .SAXHTMLParser - style
    [main] DEBUG .SAXHTMLParser - script
    [main] DEBUG .SAXHTMLParser - body
    [main] DEBUG .SAXHTMLParser - div
    [main] DEBUG .SAXHTMLParser - table
    [main] DEBUG .SAXHTMLParser - tr
    [main] DEBUG .SAXHTMLParser - td
    [main] DEBUG .SAXHTMLParser - td
    [main] DEBUG .SAXHTMLParser - tr
    [main] DEBUG .SAXHTMLParser - td
    [main] DEBUG .SAXHTMLParser - td
    [main] DEBUG .SAXHTMLParser - p
    [main] DEBUG .SAXHTMLParser - strong
    [main] DEBUG .SAXHTMLParser - br
    [main] DEBUG .SAXHTMLParser - span
    [main] DEBUG .SAXHTMLParser - td
    [main] DEBUG .SAXHTMLParser - tr
    [main] DEBUG .SAXHTMLParser - td
    [main] DEBUG .SAXHTMLParser - img
    [main] DEBUG .SAXHTMLParser - td
    [main] DEBUG .SAXHTMLParser - img
    [main] DEBUG .SAXHTMLParser - td
    [main] DEBUG .SAXHTMLParser - img
    [main] DEBUG .SAXHTMLParser - div
    [main] DEBUG .SAXHTMLParser - img
    [main] DEBUG .SAXHTMLParser - div
    [main] DEBUG .SAXHTMLParser - div
    [main] DEBUG .SAXHTMLParser - a
    [main] DEBUG .SAXHTMLParser - img
    [main] DEBUG .SAXHTMLParser - div
    [main] DEBUG .SAXHTMLParser - noscript
    [main] DEBUG .SAXHTMLParser - div
    [main] DEBUG .SAXHTMLParser - a
    [main] DEBUG .SAXHTMLParser - img
    [main] DEBUG .SAXHTMLParser - script
    [main] DEBUG .SAXHTMLParser - div
    [main] DEBUG .SAXHTMLParser - a
    [main] DEBUG .SAXHTMLParser - img
    [main] DEBUG .SAXHTMLParser - div
    [main] DEBUG .SAXHTMLParser - a
    [main] DEBUG .SAXHTMLParser - img
    [main] DEBUG .SAXHTMLParser - div
    [main] DEBUG .SAXHTMLParser - a
    [main] DEBUG .SAXHTMLParser - img
    [main] DEBUG .SAXHTMLParser - div
    [main] DEBUG .SAXHTMLParser - a
    [main] DEBUG .SAXHTMLParser - img
    [main] DEBUG .SAXHTMLParser - div
    [main] DEBUG .SAXHTMLParser - a
    [main] DEBUG .SAXHTMLParser - img
    [main] DEBUG .SAXHTMLParser - div
    [main] DEBUG .SAXHTMLParser - a
    [main] DEBUG .SAXHTMLParser - img
    [main] DEBUG .SAXHTMLParser - div
    [main] DEBUG .SAXHTMLParser - a
    [main] DEBUG .SAXHTMLParser - img
    [main] DEBUG .SAXHTMLParser - div
    [main] DEBUG .SAXHTMLParser - a
    [main] DEBUG .SAXHTMLParser - img
    [main] DEBUG .SAXHTMLParser - script

So it looks like the implementation of the SAX parser in HtmlParser is a bit
buggy? Is it possible to plug a custom SAX parser into the HTMLParser library
somehow?

--
Eugene N Dzhurinsky
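
A comparison like the one above may be easier to repeat if the handler collects element names into a list and takes the XMLReader as a parameter, rather than logging from a hard-coded parser. The following is only a minimal sketch under stated assumptions: `ElementCollector`, `collect`, and `jdkReader` are hypothetical names that belong to neither library, and the JDK's built-in XML reader is used as a stand-in so the example runs without extra jars. It therefore needs well-formed input, unlike the HTML parsers discussed in the message.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records the name of every element a SAX reader reports.
class ElementCollector extends DefaultHandler {

    private final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attrs) {
        // Lower-case the name so that readers reporting "HTML" and "html"
        // produce comparable lists.
        names.add(qName.toLowerCase(Locale.ROOT));
    }

    // Parses the markup with whichever reader the caller supplies and
    // returns the element names in document order.
    public static List<String> collect(XMLReader reader, String markup)
            throws Exception {
        ElementCollector handler = new ElementCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.names;
    }

    // Stand-in reader from the JDK (an assumption made for this sketch only).
    public static XMLReader jdkReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }
}
```

With HtmlParser or TagSoup on the classpath, the reader passed to `collect` could presumably be created through `XMLReaderFactory.createXMLReader` with the class names quoted in the message, and the two resulting lists compared to find the first element at which they diverge.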