Re: [Osgmm-discuss] Condor Negotiator Crashing
From: Peter D. <do...@cr...> - 2009-06-22 19:49:57
Okay, that eliminated the errors on the console when I launch the matchmaker. Thanks.

For the moment the negotiator has stopped crashing, but it stopped before I put the new jar file in, so I don't know what to make of that. At the moment there are no valid sites in the matchmaker, so I'm going to have to look into things further to see what's going on. It seems the verification runs didn't run this afternoon.

The matchmaker-related processes don't look right to me:

[root@abitibi var]# ps aux | grep osgmm
root     12992  0.0  0.0  101064  1336 pts/12 S    15:22   0:00 su -s /bin/sh osgmm -c . /opt/osg-shared/client-1.0/setup.sh && cd /opt/osg-shared/client-1.0/osg-match-maker/ && ./sbin/osgmm
osgmm    12993  0.0  0.0   63832  1220 ?      Ss   15:22   0:00 sh -c . /opt/osg-shared/client-1.0/setup.sh && cd /opt/osg-shared/client-1.0/osg-match-maker/ && ./sbin/osgmm
osgmm    13049  0.0  0.0   63836  1164 ?      S    15:22   0:00 /bin/sh ./sbin/osgmm
osgmm    13058  0.5  1.1 1906144 44880 ?      Sl   15:22   0:08 /opt/osg-shared/client-1.0/jdk1.5/bin/java -Xmx1500m -jar lib/osgmm-0.5.jar
osgmm    30889  0.0  0.0   63836  1124 ?      S    15:47   0:00 /bin/sh -e /tmp/shellwrapper-6820506812473286557.sh

Anyhow, looking in the osgmm.log file I noticed something interesting. It tries to track jobs by their job log files, and if it can't access a file, it tries to chmod 644 it.

a.) I don't know that I like the idea of the matchmaker trying to change permissions on files in people's home directories.
b.) If it can't read the file, the odds are pretty slim that it's going to be able to change permissions on the file.

But I guess this is why I have so many 0's and empty columns in the various fields of condor_grid_overview. If it can't access the log files, it can't display what jobs are running where and what their status is in the condor_grid_overview output. Is that correct?
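To make point b.) concrete: on POSIX systems only a file's owner (or root) may change its mode, so a process that cannot read another user's log file will generally not be able to chmod it either. Below is a minimal Python sketch of that check. The function name `chmod_outlook` is made up for illustration and is not part of osgmm, and the sketch assumes it runs as the same user as the matchmaker.

```python
import os
import stat


def chmod_outlook(path):
    """Report whether this process can read `path` and could chmod it.

    chmod(2) succeeds only for the file's owner or for root, so a
    non-owner that is denied read access will be denied chmod as well.
    """
    info = os.stat(path)  # needs search permission on the parent dirs only
    # Note: os.access checks with the real uid, which normally equals the
    # effective uid unless the process is running setuid.
    readable = os.access(path, os.R_OK)
    euid = os.geteuid()
    can_chmod = euid == 0 or info.st_uid == euid
    return {
        "mode": stat.filemode(info.st_mode),
        "owner_uid": info.st_uid,
        "readable": readable,
        "can_chmod": can_chmod,
    }
```

Run as the osgmm user against one of the job log paths from osgmm.log, a result with both `readable` and `can_chmod` set to False would confirm that the chmod attempt cannot succeed.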
--Peter

22 Jun 09 15:32:29 INFO Trying to chmod 644 /opt/osg-shared/macintel/ijstokes/sad2/3cny/2-fast/output-3cny/54/log
22 Jun 09 15:32:31 INFO Added job 292525_/opt/osg-shared/macintel/ijstokes/sad2/3cny/2-fast/output-3cny/09/log for tracking
22 Jun 09 15:32:31 ERROR Unable open logfile: /opt/osg-shared/macintel/ijstokes/sad2/3cny/2-fast/output-3cny/09/log (Permission denied)
22 Jun 09 15:32:31 INFO Trying to chmod 644 /opt/osg-shared/macintel/ijstokes/sad2/3cny/2-fast/output-3cny/09/log
22 Jun 09 15:32:34 INFO Added job 292526_/opt/osg-shared/macintel/ijstokes/sad2/3cny/2-fast/output-3cny/38/log for tracking
22 Jun 09 15:32:34 ERROR Unable open logfile: /opt/osg-shared/macintel/ijstokes/sad2/3cny/2-fast/output-3cny/38/log (Permission denied)
22 Jun 09 15:32:34 INFO Trying to chmod 644 /opt/osg-shared/macintel/ijstokes/sad2/3cny/2-fast/output-3cny/38/log

On Jun 22, 2009, at 3:18 PM, Mats Rynge wrote:

> Peter Doherty wrote:
>> I don't know what's going on here, but my jobs submitted to the
>> MatchMaker aren't being matched, and I found out the condor
>> negotiator keeps crashing. If I shut down osgmm, the negotiator
>> keeps running, but then if I start up osgmm, the negotiator
>> crashes when it starts to match one of my jobs.
>> Here are some of the errors I'm getting.
>> I'm not sure where to start with this.
>
> I haven't seen the negotiator crash before, but I have seen the
> hostname problem recently.
> Please try this preview of 0.6:
>
> http://www.renci.org/~rynge/osgmm-0.6.jar
>
> Replace the one you have in lib/
>
>> NegotiatorLog
>> 6/22 14:41:25 ******************************************************
>> 6/22 14:41:25 ** condor_negotiator (CONDOR_NEGOTIATOR) STARTING UP
>> 6/22 14:41:25 ** /opt/osg-shared/se/app/site/condor-7.2.1/sbin/condor_negotiator
>> 6/22 14:41:25 ** SubsystemInfo: name=NEGOTIATOR type=NEGOTIATOR(4) class=DAEMON(1)
>> 6/22 14:41:25 ** Configuration: subsystem:NEGOTIATOR local:<NONE> class:DAEMON
>> 6/22 14:41:25 ** $CondorVersion: 7.2.1 Feb 18 2009 BuildID: 133382 $
>> 6/22 14:41:25 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
>> 6/22 14:41:25 ** PID = 4322
>> 6/22 14:41:25 ** Log last touched 6/22 14:36:34
>> 6/22 14:41:25 ******************************************************
>> 6/22 14:41:25 Using config source: /opt/osg-shared/se/app/site/condor/etc/condor_config
>> 6/22 14:41:25 Using local config sources:
>> 6/22 14:41:25    /opt/osg-local/condor/condor_config.local
>> 6/22 14:41:25 DaemonCore: Command Socket at <10.0.10.39:51423>
>> 6/22 14:41:25 About to rotate ClassAd log /opt/osg-local/condor/spool/Accountantnew.log
>> 6/22 14:41:25 NEGOTIATOR_SOCKET_CACHE_SIZE = 16
>> 6/22 14:41:25 PREEMPTION_REQUIREMENTS = ( (CurrentTime - EnteredCurrentState) > (1 * (60 * 60)) && RemoteUserPrio > SubmittorPrio * 1.2 ) || (MY.NiceUser == True)
>> 6/22 14:41:25 ACCOUNTANT_HOST = None (local)
>> 6/22 14:41:25 NEGOTIATOR_INTERVAL = 25 sec
>> 6/22 14:41:25 NEGOTIATOR_TIMEOUT = 30 sec
>> 6/22 14:41:25 MAX_TIME_PER_SUBMITTER = 31536000 sec
>> 6/22 14:41:25 MAX_TIME_PER_PIESPIN = 31536000 sec
>> 6/22 14:41:25 PREEMPTION_RANK = (RemoteUserPrio * 1000000) - TARGET.ImageSize
>> 6/22 14:41:25 NEGOTIATOR_PRE_JOB_RANK = RemoteOwner =?= UNDEFINED
>> 6/22 14:41:25 NEGOTIATOR_POST_JOB_RANK = None
>> 6/22 14:41:25 ---------- Started Negotiation Cycle ----------
>> 6/22 14:41:25 Phase 1: Obtaining ads from collector ...
>> 6/22 14:41:25 Getting all public ads ...
>> 6/22 14:41:25 Sorting 175 ads ...
>> 6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
>> 6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
>> 6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
>> 6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
>> 6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
>> 6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
>> 6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
>> 6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
>> 6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
>> 6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
>> 6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
>> 6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
>> 6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
>> 6/22 14:41:25 Getting startd private ads ...
>> 6/22 14:41:25 Got ads: 175 public and 123 private
>> 6/22 14:41:25 Public ads include 6 submitter, 137 startd
>> 6/22 14:41:25 Phase 2: Performing accounting ...
>> 6/22 14:41:25 ERROR "Assertion ERROR on (resource_hash.insert( ResourceName, ResourceAd ) == 0)" at line 785 in file Accountant.cpp
>>
>> after starting up osgmm:
>> [root@abitibi condor]# /etc/init.d/osgmm start
>> Starting up OSGMM
>> [root@abitibi condor]# Exception in thread "Thread-1" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>>         at java.lang.String.substring(String.java:1768)
>>         at org.renci.osgmm.Site.getHostName(Site.java:141)
>>         at org.renci.osgmm.Sites.addSite(Sites.java:106)
>>         at org.renci.osgmm.ReSS.processReSSAd(ReSS.java:228)
>>         at org.renci.osgmm.ReSS.pullReSS(ReSS.java:178)
>>         at org.renci.osgmm.ReSS.run(ReSS.java:102)
>
> --
> Mats Rynge
> Renaissance Computing Institute <http://www.renci.org>
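
A note on the "hostname problem" in the stack trace above: the exception comes from String.substring being called with an index of -1 inside Site.getHostName. A common way this happens in Java (an assumption here, since the osgmm source is not shown in this thread) is that an indexOf search for a delimiter finds nothing and its -1 result is passed straight to substring. The Python sketch below illustrates the guard. The function `extract_host` and the "host:port/service" contact-string format are hypothetical, and the swap to Python matters: `str.find` also returns -1 on a miss, but slicing with it silently drops the last character instead of raising.

```python
def extract_host(contact):
    """Return the host part of a hypothetical 'host:port/service' string.

    Every delimiter lookup is checked for -1 before slicing, so a
    contact string with no port or service comes back unchanged rather
    than being cut at a bogus index.
    """
    # Keep only the delimiter positions that were actually found.
    cuts = [i for i in (contact.find(":"), contact.find("/")) if i != -1]
    if not cuts:
        return contact  # bare hostname, nothing to strip
    return contact[:min(cuts)]
```

For example, `extract_host("gk.example.org:2119/jobmanager-condor")` and `extract_host("gk.example.org")` both yield `gk.example.org` (example.org is a placeholder host, not one from the logs).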