I am using WebHarvest for web scraping and created a config file to scrape some data from sites, but I am getting an I/O exception. I can scrape another website successfully, but this particular website causes a problem. Could you please help me resolve this problem? The console output is below:
D:\Web_Crawler\Source Code\Scrapper\utility>java -jar webharvest_all_2.jar config=../MillMax/Max.xml
INFO ( ?:? ) - XML parsed in 33ms.
INFO ( ?:? ) - VarDefProcessor starts processing...
INFO ( ?:? ) - ConstantProcessor starts processing...
INFO ( ?:? ) - ConstantProcessor processor executed in 0ms.
INFO ( ?:? ) - VarDefProcessor processor executed in 3ms.
INFO ( ?:? ) - HttpProcessor starts processing...
INFO (HttpMethodDirector.java:439) - I/O exception (java.net.SocketException) caught when processing request: Software caused connection abort: recv failed
INFO (HttpMethodDirector.java:445) - Retrying request
INFO (HttpMethodDirector.java:439) - I/O exception (java.net.SocketException) caught when processing request: Software caused connection abort: recv failed
INFO (HttpMethodDirector.java:445) - Retrying request
INFO (HttpMethodDirector.java:439) - I/O exception (java.net.SocketException) caught when processing request: Software caused connection abort: recv failed
INFO (HttpMethodDirector.java:445) - Retrying request
Exception in thread "main" org.webharvest.exception.HttpException: IO error during HTTP execution for URL: https://www.mill-max.com
    at org.webharvest.runtime.web.HttpClientManager.execute(Unknown Source)
    at org.webharvest.runtime.processors.HttpProcessor.execute(Unknown Source)
    at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
    at org.webharvest.runtime.Scraper.execute(Unknown Source)
    at org.webharvest.runtime.Scraper.execute(Unknown Source)
    at CommandLine.main(Unknown Source)
Caused by: java.net.SocketException: Software caused connection abort: recv failed
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:168)
    at java.net.SocketInputStream.read(SocketInputStream.java:121)
    at sun.security.ssl.InputRecord.readFully(InputRecord.java:312)
    at sun.security.ssl.InputRecord.read(InputRecord.java:350)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:861)
    at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1262)
    at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:680)
    at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:85)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at org.apache.commons.httpclient.HttpConnection.flushRequestOutputStream(HttpConnection.java:828)
    at org.apache.commons.httpclient.HttpMethodBase.writeRequest(HttpMethodBase.java:2116)
    at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1096)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
    ... 6 more
Config.xml:
<config>
    <http url="https://www.mill-max.com">
    </http>
</config>
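The stack trace in the original post ends inside `SSLSocketImpl.performInitialHandshake`, which suggests the connection is dropped while the TLS handshake is still in progress, before any HTTP data is exchanged. One way to narrow this down is to attempt a bare TLS handshake from the same JVM, outside WebHarvest. The following is only a minimal sketch under that assumption: `HandshakeProbe`, `probe`, and `failedDuringHandshake` are hypothetical names introduced for illustration, not part of WebHarvest or Commons HttpClient, and port 443 is assumed because the URL uses `https`.

```java
import java.io.IOException;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// Hypothetical diagnostic helper; not part of WebHarvest.
public class HandshakeProbe {

    /** Returns true if any exception in the cause chain has a stack frame
     *  whose method name mentions "handshake". */
    public static boolean failedDuringHandshake(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            for (StackTraceElement frame : c.getStackTrace()) {
                if (frame.getMethodName().toLowerCase().contains("handshake")) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Opens a socket with the JVM's default TLS settings, forces the
     *  handshake, and reports the outcome as a one-line string. */
    public static String probe(String host, int port) {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket = (SSLSocket) factory.createSocket(host, port)) {
            socket.setSoTimeout(10000); // give up after 10 seconds of silence
            socket.startHandshake();
            return "OK: " + socket.getSession().getProtocol()
                    + " / " + socket.getSession().getCipherSuite();
        } catch (IOException e) {
            return "FAILED: " + e
                    + (failedDuringHandshake(e) ? " (during handshake)" : "");
        }
    }

    public static void main(String[] args) {
        // Host defaults to the site from the original post; port 443 is assumed.
        String host = args.length > 0 ? args[0] : "www.mill-max.com";
        System.out.println(probe(host, 443));
    }
}
```

If this probe fails in the same way, the problem is likely between the JVM's TLS settings (or the network path) and the server rather than in the WebHarvest configuration; if it succeeds, the difference is more likely in how the HTTP client inside WebHarvest sets up the connection.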
Hi,
Can you attach your configuration file, or the part of the configuration that is responsible for connecting to the https://www.mill-max.com site?
Cheers,
Maciej
Last edit: Maciej Czapiewski 2013-09-26
Thanks, Maciej, for your response.
Please find below the configuration file that we are using to scrape the mill-max.com site.
For now, this is a very basic config used for testing scraping of this site.
We have tried scraping other sites by changing the site URL below, and it works without this error, but it throws the above error for the mill-max.com site. Please help!
<config>
    <http url="https://www.mill-max.com">
    </http>
</config>
Hi Maciej, sorry for bothering you.
Did you get a chance to look into this issue, and do you have any thoughts on why it is occurring?
Thanks in advance.
Regards,
tarandeep
Hi tarandeep,
Unfortunately, not yet. Please give me more time, and I will try to give you a hint as soon as possible.
Cheers,
Maciej
Sure Maciej, whenever you get time.
Looking forward to your input.
Thanks in advance.
Regards,
tarandeep