#24 HTTP processor hangs sometimes

2.1.0
wont-fix
None
5
2025-09-06
2010-04-10
jim huang
No

While harvesting some pages, I found that the HTTP operation sometimes hangs forever. When that happens, the subsequent pages never get a chance to be harvested. Is there any way to work around this?

Log file attached.

Discussion

  • jim huang

    jim huang - 2010-04-10

    HttpHang Log

     
    • Robert Bala

      Robert Bala - 2012-11-15

      You can use a <script> element in the configuration to override the global HttpClient settings
      (as described at: https://sourceforge.net/projects/web-harvest/forums/forum/591299/topic/3903602)
      However, setting timeouts alone is not enough, because the HTTP processor works in protected mode:
      whenever something goes wrong with the HTTP request, it tries to execute it again (after a short delay).
      By default the number of attempts is 5 and the delay between attempts is 10 seconds.
      You can easily override these values with the HTTP processor's retry-attempts and retry-delay attributes (not covered in the documentation).
      An example configuration might look as follows:

      <?xml version="1.0" encoding="UTF-8" ?> 
      <config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://web-harvest.sourceforge.net/schema/1.0/config" xsi:schemaLocation="http://web-harvest.sourceforge.net/schema/1.0/config config.xsd">
          <script>
              http.client.params.soTimeout = 10000;
              http.client.params.connectionManagerTimeout = 2000;
              http.client.httpConnectionManager.params.connectionTimeout = 3000;
          </script>
          <var-def name="response">
              <http url="http://localhost/timeout.php" method="GET" retry-attempts="1" retry-delay="0"/> 
          </var-def>
      </config>
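The "protected mode" described above can be pictured as a plain retry loop. The following is only a sketch of that idea in Java, not Web-Harvest's actual implementation: the class `RetrySketch`, the methods `executeWithRetry` and `worstCaseMillis`, and their parameter names are hypothetical, and it assumes that retry-attempts counts the total number of tries. It also shows why timeouts alone may not be enough: with the defaults quoted above (5 attempts, 10 seconds delay) and a 10 second bound per attempt, a failing request could still block for roughly 90 seconds.

```java
import java.util.function.Supplier;

public class RetrySketch {

    /**
     * Runs the request up to maxAttempts times, sleeping delayMillis between
     * failed tries. Hypothetical helper for illustration, not Web-Harvest code.
     * Assumption: maxAttempts is the total number of tries, not extra retries.
     */
    static <T> T executeWithRetry(Supplier<T> request, int maxAttempts, long delayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.get();
            } catch (RuntimeException e) {
                lastFailure = e;
                // Only wait if another attempt will follow.
                if (attempt < maxAttempts) {
                    sleep(delayMillis);
                }
            }
        }
        // Every attempt failed: rethrow the last failure.
        throw lastFailure;
    }

    /**
     * Rough upper bound on blocking time when every attempt runs into a
     * timeout of perAttemptMillis (an assumed per-attempt bound).
     */
    static long worstCaseMillis(int maxAttempts, long perAttemptMillis, long delayMillis) {
        return maxAttempts * perAttemptMillis + (maxAttempts - 1) * delayMillis;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }

    public static void main(String[] args) {
        // Defaults quoted in the thread: 5 attempts, 10 s delay, 10 s timeout.
        System.out.println(worstCaseMillis(5, 10_000L, 10_000L)); // prints 90000
        // Values from the example configuration: one attempt, no delay.
        System.out.println(worstCaseMillis(1, 10_000L, 0L));      // prints 10000
    }
}
```

Under these assumptions, setting retry-attempts="1" and retry-delay="0" reduces the worst case to a single timeout period, which is what the example configuration aims for.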
      
       
  • Tristen

    Tristen - 2011-07-17

    I was seeing this same issue and found that, by default, the HttpClient (Commons HttpClient 3.1) used by the HttpProcessor has no timeouts. I updated the HttpClientManager file (this is where the HttpClient is instantiated) to include timeouts of 10 s.

    I used this post as my guide: https://sourceforge.net/projects/web-harvest/forums/forum/591299/topic/3903602

    Instead of setting timeouts in the script, I set them in the library file.

    I added this to the constructor (right after line 78):
    clientParams.setConnectionManagerTimeout(10000);
    clientParams.setSoTimeout(10000);
    HttpConnectionManagerParams connectionManagerParams = client.getHttpConnectionManager().getParams();
    connectionManagerParams.setConnectionTimeout(10000);
    client.getHttpConnectionManager().setParams(connectionManagerParams);

    This appears to have solved the problem.
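To illustrate why a missing socket timeout can hang a harvest, here is a small sketch that uses only the JDK's `java.net.Socket` rather than Commons HttpClient (a deliberate substitution so the example runs without extra libraries). The class `ReadTimeoutSketch` and the method `readTimesOut` are hypothetical names. A local server socket accepts the connection but never sends a byte; with a non-zero SO_TIMEOUT the blocked read is aborted, whereas a timeout of 0 would wait indefinitely.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutSketch {

    /**
     * Connects to a local server that never sends data and tries to read
     * one byte with the given SO_TIMEOUT (must be > 0 here, since 0 means
     * "wait forever"). Returns true if the read was aborted by a timeout.
     */
    static boolean readTimesOut(int soTimeoutMillis) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 lets the OS pick a free port; the server never calls
        // accept(), but the connection is still completed by the OS backlog.
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, silentServer.getLocalPort())) {
            client.setSoTimeout(soTimeoutMillis);
            InputStream in = client.getInputStream();
            try {
                in.read(); // blocks: the server never writes anything
                return false;
            } catch (SocketTimeoutException e) {
                return true; // the timeout ended the blocked read
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(500));
    }
}
```

The same principle is what the `setSoTimeout` and `setConnectionTimeout` calls above rely on: a bounded wait turns a silent hang into an exception that the caller can handle.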

     
  • Piotr Dyraga

    Piotr Dyraga - 2012-11-15
    • milestone: --> Backlog
     
  • Robert Bala

    Robert Bala - 2012-11-15
    • assigned_to: Robert Bala
    • milestone: Backlog --> 2.1.0rc1-RELEASE
     
  • Robert Bala

    Robert Bala - 2012-11-15
    • status: open --> wont-fix
     
