While harvesting some pages, I found that the HTTP operation can just hang forever. When that happens, the subsequent pages never get a chance to be harvested. Is there any other way to work around this?
You can use the trick with a <script> configuration element to override the global HttpClient's settings
(as described at: https://sourceforge.net/projects/web-harvest/forums/forum/591299/topic/3903602).
However, setting just the timeouts is not enough, as the HTTP processor works in a protected mode:
whenever something goes wrong with the HTTP request, it tries to execute it once again (after a short delay).
By default the number of attempts is set to 5 and the delay between attempts is 10 seconds.
You can easily override these values with the HTTP processor's retry-attempts and retry-delay attributes (not covered in the documentation).
Example configuration might look as follows:
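(The original example did not survive in this thread; the following is a sketch of what such a configuration might look like, using the retry-attempts and retry-delay attributes mentioned above. The URL is a placeholder, and the delay is assumed to be in milliseconds.)

```xml
<config>
    <var-def name="page">
        <!-- retry-attempts / retry-delay are the undocumented HTTP processor
             attributes described above; retry-delay is assumed to be in ms -->
        <http url="http://example.com/some-page"
              retry-attempts="3"
              retry-delay="5000"/>
    </var-def>
</config>
```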
I was seeing this same issue and found that, by default, the HttpClient (Commons HttpClient 3.1) used by the HttpProcessor has no timeouts. I updated the HttpClientManager file (this is where the HttpClient is instantiated) to include timeouts of 10s.
Instead of setting timeouts in the script, I set them in the library file.
I added this to the constructor (right after line 78):
// clientParams is the HttpClientParams instance already in scope in the constructor
clientParams.setConnectionManagerTimeout(10000); // max wait for a connection from the manager (ms)
clientParams.setSoTimeout(10000);                // socket read timeout (ms)
HttpConnectionManagerParams connectionManagerParams = client.getHttpConnectionManager().getParams();
connectionManagerParams.setConnectionTimeout(10000); // TCP connect timeout (ms)
client.getHttpConnectionManager().setParams(connectionManagerParams);
This appears to have solved the problem.