Unable to scrape the website mill-max.com. Scraping other sites works. Please help!

Help
2013-09-26
  • tarandeep sawhney

    I am using WebHarvest for web scraping and have created a config file to scrape data from several sites. I can scrape another website successfully, but this particular website fails with an I/O exception. Could you please help me resolve this problem?

    Config.xml

    <config> <http url="https://www.mill-max.com"> </http> </config>

    ERROR Stack trace:

    D:\Web_Crawler\Source Code\Scrapper\utility>java -jar webharvest_all_2.jar config=../MillMax/Max.xml
    INFO ( ?:? ) - XML parsed in 33ms.
    INFO ( ?:? ) - VarDefProcessor starts processing...
    INFO ( ?:? ) - ConstantProcessor starts processing...
    INFO ( ?:? ) - ConstantProcessor processor executed in 0ms.
    INFO ( ?:? ) - VarDefProcessor processor executed in 3ms.
    INFO ( ?:? ) - HttpProcessor starts processing...
    INFO (HttpMethodDirector.java:439) - I/O exception (java.net.SocketException) caught when processing request: Software caused connection abort: recv failed
    INFO (HttpMethodDirector.java:445) - Retrying request
    INFO (HttpMethodDirector.java:439) - I/O exception (java.net.SocketException) caught when processing request: Software caused connection abort: recv failed
    INFO (HttpMethodDirector.java:445) - Retrying request
    INFO (HttpMethodDirector.java:439) - I/O exception (java.net.SocketException) caught when processing request: Software caused connection abort: recv failed
    INFO (HttpMethodDirector.java:445) - Retrying request
    Exception in thread "main" org.webharvest.exception.HttpException: IO error during HTTP execution for URL: https://www.mill-max.com
        at org.webharvest.runtime.web.HttpClientManager.execute(Unknown Source)
        at org.webharvest.runtime.processors.HttpProcessor.execute(Unknown Source)
        at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
        at org.webharvest.runtime.Scraper.execute(Unknown Source)
        at org.webharvest.runtime.Scraper.execute(Unknown Source)
        at CommandLine.main(Unknown Source)
    Caused by: java.net.SocketException: Software caused connection abort: recv failed
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:168)
        at java.net.SocketInputStream.read(SocketInputStream.java:121)
        at sun.security.ssl.InputRecord.readFully(InputRecord.java:312)
        at sun.security.ssl.InputRecord.read(InputRecord.java:350)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:861)
        at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1262)
        at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:680)
        at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:85)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at org.apache.commons.httpclient.HttpConnection.flushRequestOutputStream(HttpConnection.java:828)
        at org.apache.commons.httpclient.HttpMethodBase.writeRequest(HttpMethodBase.java:2116)
        at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1096)
        at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
        at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
        ... 6 more
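
The `Caused by` section of the trace passes through `SSLSocketImpl.performInitialHandshake`, which suggests the connection is being dropped while the TLS handshake is in progress rather than inside WebHarvest's own processing. One way to check this is to attempt the same handshake from plain Java, outside WebHarvest. The sketch below is only an illustration under assumptions: the class name `HandshakeProbe`, both method names, and the 10-second timeout are hypothetical and not part of WebHarvest or Commons HttpClient, and it assumes Java 7 or later for try-with-resources.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// Hypothetical diagnostic helper, not part of WebHarvest.
public class HandshakeProbe {

    /** Returns true if any frame in the cause chain is a method whose name mentions "handshake". */
    public static boolean failedDuringHandshake(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            for (StackTraceElement frame : t.getStackTrace()) {
                if (frame.getMethodName().toLowerCase().contains("handshake")) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Connects with the JVM's default TLS settings and reports the result as text. */
    public static String probe(String host, int port, int timeoutMillis) {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (Socket plain = new Socket()) {
            // Plain TCP connect first, so a network-level failure is reported separately.
            plain.connect(new InetSocketAddress(host, port), timeoutMillis);
            plain.setSoTimeout(timeoutMillis);
            // Layer TLS over the connected socket and force the handshake now.
            try (SSLSocket tls = (SSLSocket) factory.createSocket(plain, host, port, true)) {
                tls.startHandshake();
                return "OK: " + tls.getSession().getProtocol()
                        + " / " + tls.getSession().getCipherSuite();
            }
        } catch (IOException e) {
            String stage = failedDuringHandshake(e) ? "during handshake" : "before handshake";
            return "FAILED (" + stage + "): " + e;
        }
    }

    public static void main(String[] args) {
        // Host defaults to the site from this thread; pass another host to compare.
        String host = args.length > 0 ? args[0] : "www.mill-max.com";
        System.out.println(probe(host, 443, 10000));
    }
}
```

Running it once against mill-max.com and once against a site that scrapes successfully may show whether the failure is specific to the handshake with this server. If the probe also fails during the handshake, the cause is more likely the JVM's TLS settings or something on the network path than the WebHarvest config.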

     
  • Maciej Czapiewski

    Hi,

    Can you attach your configuration file, or the part of the configuration that is responsible for connecting to the https://www.mill-max.com site?

    Cheers,
    Maciej

     

    Last edit: Maciej Czapiewski 2013-09-26
    • tarandeep sawhney

      Thanks Maciej for your response.

      Please find below the configuration file that we are using to scrape the mill-max.com site.
      For now, this is a very basic config used to test scraping for this site.

      We have tried scraping other sites by changing the site URL below, and it works without this error, but it throws the error above for the mill-max.com site. Please help!

      <config>
          <file action="write" path="D:/milmax/source.xml">
              <xpath expression="//body">
                  <html-to-xml>
                      <http url="https://www.mill-max.com/"/>
                  </html-to-xml>
              </xpath>
          </file>
      </config>

       
    • tarandeep sawhney

      Hi Maciej, sorry to bother you.

      Did you get a chance to look into this issue, and do you have any thoughts on why it is occurring?

      thanks in advance

      regards
      tarandeep

       
  • Maciej Czapiewski

    Hi tarandeep,

    Unfortunately, not yet. Please give me a little more time, and I will try to give you a hint as soon as possible.

    Cheers,
    Maciej

     
    • tarandeep sawhney

      Sure Maciej, whenever you get time

      Looking forward to your inputs

      thanks in advance

      regards
      tarandeep

       
