Hello guys,

I have a mobile application already published in Apple's App Store.

This SPI client app uses a REST API on the server side to retrieve real-time information about bus arrivals at a specific bus stop.

The app had been working like a charm for 6 months.

The REST API uses WebHarvest to scrape the real-time information from a website (for instance: http://www.metlink.org.nz/stop/4912/departures).

A few days ago the HTML page scraped by my server-side code changed: the following line was added:

˜˜˜˜˜˜
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">

˜˜˜˜˜˜
Since then, my app has stopped working.

I know I can strip the line above using a regex, but I would like to know if there is a way to tell WebHarvest to disable XML validation. With validation disabled, I would not need to go through every configuration I have and rework my XPath expressions to use a regex that strips the line above.
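For reference, the regex workaround I am trying to avoid would look roughly like the sketch below in plain Java. The class and method names (`DoctypeStripper`, `stripDoctype`) are made up for illustration and are not part of WebHarvest, and the pattern assumes the DOCTYPE has no internal subset (no `[...]` section), which is true for the line above.

```java
import java.util.regex.Pattern;

public class DoctypeStripper {
    // Matches a DOCTYPE declaration (without an internal subset) plus any
    // whitespace that follows it, case-insensitively.
    private static final Pattern DOCTYPE =
            Pattern.compile("<!DOCTYPE[^>]*>\\s*", Pattern.CASE_INSENSITIVE);

    /** Returns the markup with the first DOCTYPE declaration removed, if any. */
    public static String stripDoctype(String markup) {
        return DOCTYPE.matcher(markup).replaceFirst("");
    }

    public static void main(String[] args) {
        // Hypothetical minimal page; only the DOCTYPE line is taken from the real site.
        String page = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/REC-html40/loose.dtd\">\n"
                + "<html><body/></html>";
        System.out.println(stripDoctype(page));
    }
}
```

It works on the string form of the page, so it would have to be applied before every XPath step, which is exactly the per-configuration change I would like to avoid.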

Here is my configuration file:
˜˜˜˜˜˜
<config charset="UTF-8">

<var-def name="pageContentStr">
    <html-to-xml>
        <http url="http://www.metlink.org.nz/stop/${stationID.toString()}/departures" />
    </html-to-xml>
</var-def>

<var-def name="serverTime">
    <xpath expression="/html/body/ul/li/span/text()">
        <var name="pageContentStr" />
    </xpath>
</var-def>

<var-def name="busRTI">
    <xpath expression="//tbody/tr[@data-code]/concat(td[1]/a[starts-with(@href,'timetables/')]/span/text(),'::',td[1]/a[starts-with(@href,'timetables/bus/')]/span/attribute::style,'::',td[2]/span/text(),'::',td[3]/span/text())">
        <var name="pageContentStr" />
    </xpath>
</var-def>

</config>
˜˜˜˜˜˜
The config file above works fine when I run it inside the WebHarvest GUI (weird). However, I get an error when running it inside my REST API. Here is the error:

˜˜˜˜˜˜
exception

org.springframework.web.util.NestedServletException: Request processing failed; nested exception is org.webharvest.exception.ScraperXPathException: Error parsing XPath expression (XPath = [/html/body/ul/li/span/text()])!
org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:948)
org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:827)
javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:812)
javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
root cause

org.webharvest.exception.ScraperXPathException: Error parsing XPath expression (XPath = [/html/body/ul/li/span/text()])!
org.webharvest.runtime.processors.XPathProcessor.execute(XPathProcessor.java:70)
org.webharvest.runtime.processors.BaseProcessor.run(BaseProcessor.java:115)
org.webharvest.runtime.processors.BodyProcessor.execute(BodyProcessor.java:25)
org.webharvest.runtime.processors.VarDefProcessor.execute(VarDefProcessor.java:59)
org.webharvest.runtime.processors.BaseProcessor.run(BaseProcessor.java:115)
org.webharvest.runtime.Scraper.execute(Scraper.java:166)
org.webharvest.runtime.Scraper.execute(Scraper.java:179)
com.didibaba.services.adapters.metlink.MetLinkAdapterImpl.scrapeBusesForStation(MetLinkAdapterImpl.java:147)
com.didibaba.services.adapters.metlink.MetLinkAdapterImpl.getStationBuses(MetLinkAdapterImpl.java:118)
com.didibaba.services.BusStationServiceImpl.getBusStationInfoByName(BusStationServiceImpl.java:80)
com.didibaba.web.controllers.BusStationController.getBusStationInfo(BusStationController.java:36)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
org.springframework.web.method.support.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:219)
org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:132)
org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:104)
org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandleMethod(RequestMappingHandlerAdapter.java:745)
org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:686)
org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:80)
org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:925)
org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:856)
org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:936)
org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:827)
javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:812)
javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
root cause

net.sf.saxon.trans.XPathException: org.xml.sax.SAXParseExceptionpublicId: -//W3C//DTD HTML 4.0 Transitional//EN; systemId: http://www.w3.org/TR/REC-html40/loose.dtd; lineNumber: 31; columnNumber: 3; The declaration for the entity "HTML.Version" must end with '>'.
net.sf.saxon.event.Sender.sendSAXSource(Sender.java:420)
net.sf.saxon.event.Sender.send(Sender.java:169)
net.sf.saxon.Configuration.buildDocument(Configuration.java:3346)
net.sf.saxon.Configuration.buildDocument(Configuration.java:3288)
net.sf.saxon.query.StaticQueryContext.buildDocument(StaticQueryContext.java:327)
org.webharvest.utils.XmlUtil.evaluateXPath(XmlUtil.java:77)
org.webharvest.runtime.processors.XPathProcessor.execute(XPathProcessor.java:68)
org.webharvest.runtime.processors.BaseProcessor.run(BaseProcessor.java:115)
org.webharvest.runtime.processors.BodyProcessor.execute(BodyProcessor.java:25)
org.webharvest.runtime.processors.VarDefProcessor.execute(VarDefProcessor.java:59)
org.webharvest.runtime.processors.BaseProcessor.run(BaseProcessor.java:115)
org.webharvest.runtime.Scraper.execute(Scraper.java:166)
org.webharvest.runtime.Scraper.execute(Scraper.java:179)
com.didibaba.services.adapters.metlink.MetLinkAdapterImpl.scrapeBusesForStation(MetLinkAdapterImpl.java:147)
com.didibaba.services.adapters.metlink.MetLinkAdapterImpl.getStationBuses(MetLinkAdapterImpl.java:118)
com.didibaba.services.BusStationServiceImpl.getBusStationInfoByName(BusStationServiceImpl.java:80)
com.didibaba.web.controllers.BusStationController.getBusStationInfo(BusStationController.java:36)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
org.springframework.web.method.support.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:219)
org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:132)
org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:104)
org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandleMethod(RequestMappingHandlerAdapter.java:745)
org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:686)
org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:80)
org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:925)
org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:856)
org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:936)
org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:827)
javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:812)
javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
root cause

org.xml.sax.SAXParseExceptionpublicId: -//W3C//DTD HTML 4.0 Transitional//EN; systemId: http://www.w3.org/TR/REC-html40/loose.dtd; lineNumber: 31; columnNumber: 3; The declaration for the entity "HTML.Version" must end with '>'.
com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1388)
com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanEntityDecl(XMLDTDScannerImpl.java:1562)
com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanDecls(XMLDTDScannerImpl.java:1964)
com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanDTDExternalSubset(XMLDTDScannerImpl.java:297)
com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1162)
com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1049)
com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:962)
com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:607)
com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:116)
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:489)
com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:835)
com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1210)
com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:568)
net.sf.saxon.event.Sender.sendSAXSource(Sender.java:396)
net.sf.saxon.event.Sender.send(Sender.java:169)
net.sf.saxon.Configuration.buildDocument(Configuration.java:3346)
net.sf.saxon.Configuration.buildDocument(Configuration.java:3288)
net.sf.saxon.query.StaticQueryContext.buildDocument(StaticQueryContext.java:327)
org.webharvest.utils.XmlUtil.evaluateXPath(XmlUtil.java:77)
org.webharvest.runtime.processors.XPathProcessor.execute(XPathProcessor.java:68)
org.webharvest.runtime.processors.BaseProcessor.run(BaseProcessor.java:115)
org.webharvest.runtime.processors.BodyProcessor.execute(BodyProcessor.java:25)
org.webharvest.runtime.processors.VarDefProcessor.execute(VarDefProcessor.java:59)
org.webharvest.runtime.processors.BaseProcessor.run(BaseProcessor.java:115)
org.webharvest.runtime.Scraper.execute(Scraper.java:166)
org.webharvest.runtime.Scraper.execute(Scraper.java:179)
com.didibaba.services.adapters.metlink.MetLinkAdapterImpl.scrapeBusesForStation(MetLinkAdapterImpl.java:147)
com.didibaba.services.adapters.metlink.MetLinkAdapterImpl.getStationBuses(MetLinkAdapterImpl.java:118)
com.didibaba.services.BusStationServiceImpl.getBusStationInfoByName(BusStationServiceImpl.java:80)
com.didibaba.web.controllers.BusStationController.getBusStationInfo(BusStationController.java:36)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
org.springframework.web.method.support.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:219)
org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:132)
org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:104)
org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandleMethod(RequestMappingHandlerAdapter.java:745)
org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:686)
org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:80)
org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:925)
org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:856)
org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:936)
org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:827)
javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:812)
javax.servlet.http.HttpServlet.service(HttpServlet.java:728)

˜˜˜˜˜˜
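
As far as I can tell from the trace, the XML parser is reading the external DTD named in the DOCTYPE (`loose.dtd`) and failing on it, since that file is an SGML DTD and not a valid XML one. To show what I mean by "disabling" this, here is a minimal standalone sketch using plain JAXP with the JDK's built-in Xerces parser. It is not a WebHarvest setting, the class name `NoExternalDtd` is made up, and the sample markup and time value are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NoExternalDtd {
    // Xerces-specific feature: when false, a non-validating parser
    // does not load the external DTD referenced by the DOCTYPE.
    private static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    /** Parses the given XML without fetching or reading its external DTD. */
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        factory.setFeature(LOAD_EXTERNAL_DTD, false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Invented sample document; only the DOCTYPE line matches the real page.
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/REC-html40/loose.dtd\">"
                + "<html><body><ul><li><span>10:15</span></li></ul></body></html>";
        Document doc = parse(xml);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

What I am looking for is an equivalent switch that I can set from the WebHarvest configuration or from the code that runs the scraper.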

Thanks in advance.

 

Last edit: Júlio Adrian M Van Helden 2014-08-21