Hi,
I'm starting to use web-harvest, and a page I'm scraping uses RDF and Dublin Core. I've declared the namespaces in the XML declaration of the configuration file and in the root node of the output file, but I still get a SAXParseException.
Here is my configuration file:
<?xml version="1.0" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" encoding="UTF-8" ?>
<config charset="UTF-8">
...
<file action="write" path="data/articles.xml">
<![CDATA[ <articles xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" > ]]>
<loop item="article" index="i">
<list>
<xpath expression="/rdf:RDF/item">
<html-to-xml>
<http url="${url}"/>
</html-to-xml>
</xpath>
</list>
<body>
<xquery>
<xq-param name="article"><var name="article"/></xq-param>
<xq-expression><![CDATA[
let $title := data($article/dc:title)
let $link := data($item/link)
let $source := data($item/dc:source)
return
<article>
<title>{normalize-space($title)}</title>
<link>{normalize-space($link)}</link>
<source>{normalize-space($source)}</source>
</article>
]]></xq-expression>
</xquery>
</body>
</loop>
<![CDATA[ </articles> ]]>
And here is my error:
org.xml.sax.SAXParseException: The encoding declaration is required in the text declaration.
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:236)
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:215)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:386)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:316)
at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1438)
at com.sun.org.apache.xerces.internal.impl.XMLScanner.scanXMLDeclOrTextDecl(XMLScanner.java:488)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanXMLDeclOrTextDecl(XMLDocumentFragmentScannerImpl.java:710)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$XMLDeclDispatcher.dispatch(XMLDocumentScannerImpl.java:721)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:375)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:176)
at org.webharvest.definition.XmlParser.parse(Unknown Source)
at org.webharvest.definition.XmlNode.getInstance(Unknown Source)
at org.webharvest.definition.ScraperConfiguration.<init>(Unknown Source)
at org.webharvest.definition.ScraperConfiguration.<init>(Unknown Source)
at biblic.Main.main(Main.java:30)
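The trace shows the parser failing inside scanXMLDeclOrTextDecl, so the problem seems to lie in the first line of the configuration file rather than in the scraping logic. A minimal sketch to narrow this down, assuming a stock JAXP SAX parser and no web-harvest at all: it feeds the parser two tiny documents, one with an xmlns attribute inside the XML declaration (as in the configuration above) and one with the namespace moved onto the root element, where namespace declarations normally belong. The class name DeclCheck, the method parseError and both test strings are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of web-harvest.
class DeclCheck {
    // Namespace attribute placed inside the XML declaration, as in the config above.
    static final String BAD =
        "<?xml version=\"1.0\" xmlns:dc=\"http://purl.org/dc/elements/1.1/\" encoding=\"UTF-8\"?>"
        + "<config/>";

    // Declaration limited to version and encoding; namespace declared on the root element.
    static final String GOOD =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<config xmlns:dc=\"http://purl.org/dc/elements/1.1/\"/>";

    // Returns null if the document parses, otherwise the parser's error message.
    static String parseError(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler());
            return null;
        } catch (SAXException e) {
            return String.valueOf(e.getMessage());
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("BAD  -> " + parseError(BAD));
        System.out.println("GOOD -> " + parseError(GOOD));
    }
}
```

If the first document is rejected and the second parses, that would suggest the XML declaration should carry only version and encoding (and optionally standalone), with the xmlns attributes declared on an element instead. Whether web-harvest then resolves the rdf and dc prefixes in the xpath and xquery steps is a separate question this sketch does not cover.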
I look forward to reading your tips and advice,
pierre