Re: [Osgmm-discuss] Setting low rank on <site> because we have not heard from condor_q
From: Peter D. <do...@cr...> - 2009-12-30 18:23:55
On Dec 29, 2009, at 22:34 PM, Mats Rynge wrote:

> Yes, the many jobs in the queue are causing condor_q to time out. I
> have made some improvements in later OSGMM versions, but the issue
> has not been fully solved. Occasionally seeing this message is fine,
> but if you see it all the time, that is a problem.

I'm running v0.8, and I was seeing the error all the time. I had 5000 short-running jobs queued up, and I think they were completing faster than OSGMM could match them. The OSGMM java process was at 100% CPU (but just using one of the four cores, of course) and only about 200 jobs were running concurrently.

> Not currently. I have been thinking about putting an explanation in
> the classad so that it could be seen with condor_grid_overview.
> Would you prefer that over having more information in the logs?

Either way, but some kind of understanding of where the rank number came from would be useful, especially for the sites under my administrative control, so that I know if something needs fixing. Most of the sites seem to have a rank of 1, 200, 800, or 1000 (give or take a few points), and I just haven't quite figured out exactly where the value comes from.

Cheers,
Peter