From: Stevi D. <ste...@gm...> - 2010-01-21 19:57:10
I'm having a hard time figuring out where to look for the cause of the following problem, so I'm trying to rule the various elements in or out.

My problem is this: very occasionally, a RESTEasy web service call delays for about 60 seconds between the time the code completes and the time the response is returned. In one recent example, the two service calls made by the request (which manage the database transactions, etc. using Spring and Hibernate) took 87 milliseconds, while the entire HTTP request took 60267 milliseconds. (FWIW, I'm using a tool called beet (http://beet.sourceforge.com/) to track the duration of the HTTP request and of my high-level service calls.)

We're using RESTEasy 1.2.1.GA (although I saw this with 1.1.GA as well) on a clustered WebSphere 6.1 server (4 nodes) that sits behind an IBM ODR Request Router. We're using Castor for marshalling/unmarshalling, with some custom logic for handling nested relationships. The calls this happens to usually return in under 500 milliseconds, and if the call is repeated, it works fine. I'm not seeing any specific pattern to when it happens: the occurrences aren't closely clustered, for example, and don't seem to be triggered by the high load we create with load-testing tools.

For additional context, we've been having issues with the ODR queuing requests for up to 60 seconds (as shown by comparing the Apache HTTP Server request log timestamps with the beet-logged timestamps for the same HTTP requests). And this only happens in our QA environment. I haven't yet seen it in our Test environment, which is supposed to be set up the same way, but we've consistently seen network issues in QA that don't happen in Test (although the fact that QA is usually under slightly more load may account for that). All of this makes me suspect a network problem, but I'm at a loss as to how to isolate the issue to prove or disprove this.
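To put numbers on the Apache-log-versus-beet comparison described above, here is a minimal Java sketch. It assumes the Apache access log uses the default `%t` timestamp (e.g. `[21/Jan/2010:19:57:10 -0500]`, which is the time the request was received and only has one-second resolution), that the app-side start time can be exported as an ISO-8601 string with an offset (beet's actual output format isn't given here, so that part is hypothetical), and that the Apache and WebSphere hosts' clocks are synchronized. `LogGap` and its methods are illustrative names, not part of beet, RESTEasy, or WebSphere.

```java
import java.time.Duration;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper for lining up Apache request-log timestamps with app-side timings.
class LogGap {
    // Apache's default %t format, e.g. "[21/Jan/2010:19:57:10 -0500]" (one-second resolution).
    private static final DateTimeFormatter APACHE_TIME =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    // Parses an Apache %t timestamp, with or without the surrounding brackets.
    static OffsetDateTime parseApache(String stamp) {
        String s = stamp.trim();
        if (s.startsWith("[") && s.endsWith("]")) {
            s = s.substring(1, s.length() - 1);
        }
        return OffsetDateTime.parse(s, APACHE_TIME);
    }

    // Milliseconds between Apache receiving the request and the app starting to handle it.
    // Assumes the app-side start time is ISO-8601 with an offset, e.g. "2010-01-21T19:58:10.250-05:00".
    static long queueDelayMillis(String apacheStamp, String appStartIso) {
        OffsetDateTime received = parseApache(apacheStamp);
        OffsetDateTime started = OffsetDateTime.parse(appStartIso);
        return Duration.between(received, started).toMillis();
    }

    // Time the request spent outside the measured service calls.
    static long unaccountedMillis(long totalRequestMillis, long serviceMillis) {
        return totalRequestMillis - serviceMillis;
    }
}
```

For the example above, `unaccountedMillis(60267, 87)` gives 60180 ms spent outside the service layer. A large `queueDelayMillis` would point at queuing in front of the application (e.g. the ODR), while a small one combined with a large unaccounted time would point at something after the service calls return.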
It feels almost like the reverse of the request router queuing the request: something is causing the response to just hang before completing. Amusingly, we haven't found any evidence that it ever happened during our performance testing.

If anybody has any suggestions for how to isolate the cause of this problem, it would be a huge help. Thanks in advance for any help!

Sincerely,
Stevi