I'm having a hard time figuring out where to look for the cause of the
following problem, so I'm trying to rule in/out the various elements.
My problem is this: very occasionally, a RESTEasy web service call
stalls for about 60 seconds between the time my code completes and
the time the response is returned.
In one recent example, the two service calls made by the request
(which manage the database transactions, etc. using Spring and
Hibernate) took 87 milliseconds, while the entire HTTP request took
60267 milliseconds. (FWIW, I'm using a tool called beet
(http://beet.sourceforge.com/) to track the duration of the
HTTP request and my high-level service calls.)
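To make the arithmetic explicit, here is a minimal Java sketch of how the
unaccounted time follows from the two measurements. The class and method
names are made up for illustration and are not part of beet's API.

```java
// Hypothetical helper, for illustration only (not part of beet).
class GapCalc {
    /** Time spent outside the measured service calls, in milliseconds. */
    static long unaccountedMillis(long httpRequestMillis, long serviceCallsMillis) {
        return httpRequestMillis - serviceCallsMillis;
    }

    public static void main(String[] args) {
        // Figures from the example above: 60267 ms total, 87 ms in the service calls.
        System.out.println(unaccountedMillis(60267L, 87L) + " ms unaccounted for");
    }
}
```

With the numbers above, that leaves 60180 ms spent somewhere other than
the service calls themselves.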
We're using RESTEasy 1.2.1.GA (although I saw this with 1.1.GA as
well), on a clustered WebSphere 6.1 server (4 nodes) that's behind an
IBM ODR Request Router. We're using Castor for
marshalling/unmarshalling with some custom logic for handling nested
relationships.
The affected calls usually return in under 500 milliseconds, and if a
call is repeated, it works fine. I'm not seeing any specific pattern
to when the slow calls happen (they aren't closely clustered, for
example, and they don't seem to be triggered by high load created
using load testing tools).
For additional information, we've also been having issues with the ODR
queuing requests for up to 60 seconds (as seen by comparing the
Apache HTTP Server request log timestamps with the beet-logged
timestamps for the same HTTP requests).
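The comparison described above can be sketched roughly as follows. This
assumes the Apache log uses the default %t timestamp format (for example
"10/Oct/2000:13:55:36 -0700") and that the application-side start time is
available as epoch milliseconds; both are assumptions, and the class and
method names are hypothetical. It also only means something if the clocks
on the two hosts are in sync.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

// Hypothetical helper, for illustration only.
class QueueDelay {
    /**
     * Milliseconds between the time Apache received the request and the time
     * the application started handling it.
     * apacheTimestamp: the text inside the [...] of an Apache %t field (assumed format).
     * appStartEpochMillis: application-side start time in epoch millis (assumed format).
     */
    static long queueDelayMillis(String apacheTimestamp, long appStartEpochMillis) {
        // SimpleDateFormat is not thread-safe, so create one per call.
        SimpleDateFormat fmt = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);
        try {
            long apacheMillis = fmt.parse(apacheTimestamp).getTime();
            return appStartEpochMillis - apacheMillis;
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable Apache timestamp: " + apacheTimestamp, e);
        }
    }
}
```

Note that the default Apache timestamp only has one-second resolution, so
differences of a second or so are noise; a 60-second gap would stand out
clearly.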
And this only happens in our QA environment. I've not yet seen it
happen in our Test environment, which is supposed to be set up the
same way, but we've consistently seen network issues in QA that don't
happen in Test (although the fact that QA is usually under slightly
more load may account for that).
All of this makes me suspect some network problem, but I'm at a loss
as to how to isolate the issue to prove or disprove this. It feels
almost like the reverse of the request router queuing the request:
something is causing the response to hang before completing.
Amusingly, we haven't found any evidence it ever happened during our
performance testing.
If anybody has any suggestions for how to isolate the cause of this
problem, it would be a huge help.
Thanks in advance for any help!
Sincerely,
Stevi