Re: [Osgmm-discuss] Condor Negotiator Crashing
From: Peter D. <do...@cr...> - 2009-06-23 16:13:57
The Negotiator started crashing again. I'm sure it's some kind of conflict with the OSG Match-Maker.

I turned on D_FULL_DEBUG on the Negotiator Log, but that didn't tell me much more. Here's the output of condor_status and condor_status -l. I wonder if it's some invalid class ad format. It's also interesting that at the top of condor_status you can see duplicates of some sites. I know some sites have two gatekeepers, but, for example, I know WQCG-Tuscany-OSG, which I helped set up, is just a simple setup: one CE, a few nodes, no SE, or anything else. I don't know why it's listed twice. Could this all be related to the ReSS changeover last week?

http://abitibi.sbgrid.org/condor_status.txt
http://abitibi.sbgrid.org/condor_status-l.txt

NegotiatorLog:

6/23 12:07:08 ---------- Finished Negotiation Cycle ----------
6/23 12:07:08 enter Matchmaker::updateCollector
6/23 12:07:08 Trying to update collector <10.0.10.39:9618>
6/23 12:07:08 Attempting to send update via UDP to collector abitibi.sbgrid.org <10.0.10.39:9618>
6/23 12:07:08 exit Matchmaker::UpdateCollector
6/23 12:07:33 ---------- Started Negotiation Cycle ----------
6/23 12:07:33 Phase 1: Obtaining ads from collector ...
6/23 12:07:33 Getting all public ads ...
6/23 12:07:33 Trying to query collector <10.0.10.39:9618>
6/23 12:07:33 Sorting 208 ads ...
6/23 12:07:33 Getting startd private ads ...
6/23 12:07:33 Trying to query collector <10.0.10.39:9618>
6/23 12:07:33 Got ads: 208 public and 123 private
6/23 12:07:33 Public ads include 2 submitter, 174 startd
6/23 12:07:33 Entering compute_significant_attrs()
6/23 12:07:33 Leaving compute_significant_attrs() - result=JobUniverse,LastCheckpointPlatform,NumCkpts,EnteredCurrentState
6/23 12:07:33 Phase 2: Performing accounting ...
6/23 12:07:33 ERROR "Assertion ERROR on (resource_hash.insert( ResourceName, ResourceAd ) == 0)" at line 785 in file Accountant.cpp

On Jun 22, 2009, at 4:27 PM, Alan De Smet wrote:

> I'm juggling a few things right now, but I'm taking a quick look
> at the negotiator ASSERTing. It may be that other layers are
> also malfunctioning, I'm not sure, but even so it should not
> cause the negotiator to fail in that way. I'll try to dig a bit
> deeper.
>
> If this happens again, the output of "condor_status -l" might
> prove helpful. My current hypothesis is that the collector is
> misbehaving and ends up sending nonsensical data to the
> negotiator, which throws its hands up in the air as a result.
>
> --
> Alan De Smet                 Condor Project Research
> ad...@cs...                  http://www.cs.wisc.edu/condor/
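
One way to test the duplicate-site suspicion against the saved condor_status -l output is to count how often each ad name occurs. The sketch below is only that: it assumes the long listing is a series of "Attribute = value" lines with a blank line between ads, and it assumes the Name attribute is what ends up as the key in the failing resource_hash.insert() call, which this thread does not confirm. The function names and the script name in the usage line are made up for illustration.

```python
import sys
from collections import Counter


def parse_ads(text):
    """Split `condor_status -l` style text into one dict per ad.

    Assumes ads are separated by blank lines and each attribute
    is written as `Attribute = value` on its own line.
    """
    ads, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the ad collected so far.
            if current:
                ads.append(current)
                current = {}
            continue
        # Split on the first "=" only; expressions may contain more.
        key, sep, value = line.partition("=")
        if sep:
            current[key.strip()] = value.strip().strip('"')
    if current:
        ads.append(current)
    return ads


def duplicate_names(ads, attr="Name"):
    """Return {value: count} for every `attr` value seen more than once."""
    counts = Counter(ad[attr] for ad in ads if attr in ad)
    return {name: n for name, n in counts.items() if n > 1}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Path to a saved listing, e.g. a local copy of condor_status-l.txt.
    with open(sys.argv[1]) as handle:
        ads = parse_ads(handle.read())
    for name, n in sorted(duplicate_names(ads).items()):
        print(f"{n}x {name}")
```

Run as, say, "python find_dup_ads.py condor_status-l.txt". Any name printed is carried by more than one ad; whether that is what actually trips the assertion in Accountant.cpp is still a guess until someone checks the source.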