osgmm-discuss Mailing List for OSGMM (Page 2)
Brought to you by:
mats_rynge
From: Peter D. <do...@cr...> - 2009-06-22 19:49:57
Okay, that eliminated the errors on the console when I launch the match maker. Thanks.

For the moment the negotiator stopped crashing, but it stopped before I put the new jar file in, so I don't know what to make of that. At the moment there are no valid sites in the matchmaker; I'm going to have to look into things further to see what's going on. It seems the verification runs didn't run this afternoon.

The matchmaker-related processes don't look right to me.

[root@abitibi var]# ps aux |grep osgmm
root     12992  0.0  0.0  101064  1336 pts/12  S   15:22  0:00 su -s /bin/sh osgmm -c . /opt/osg-shared/client-1.0/setup.sh && cd /opt/osg-shared/client-1.0/osg-match-maker/ && ./sbin/osgmm
osgmm    12993  0.0  0.0   63832  1220 ?       Ss  15:22  0:00 sh -c . /opt/osg-shared/client-1.0/setup.sh && cd /opt/osg-shared/client-1.0/osg-match-maker/ && ./sbin/osgmm
osgmm    13049  0.0  0.0   63836  1164 ?       S   15:22  0:00 /bin/sh ./sbin/osgmm
osgmm    13058  0.5  1.1 1906144 44880 ?       Sl  15:22  0:08 /opt/osg-shared/client-1.0/jdk1.5/bin/java -Xmx1500m -jar lib/osgmm-0.5.jar
osgmm    30889  0.0  0.0   63836  1124 ?       S   15:47  0:00 /bin/sh -e /tmp/shellwrapper-6820506812473286557.sh

Anyhow, looking in the osgmm.log file I noticed something interesting. It tries to track jobs by their job log files, and if it can't access the file, it tries to chmod 644 the log file.

a.) I don't know that I like the idea of the matchmaker trying to change permissions on files in people's home directories.
b.) If it can't read the file, the odds are pretty slim it's going to be able to change permissions on the file.

But I guess this is why I have so many 0's and empty columns in the various fields of condor_grid_overview. If it can't access the log files, it can't display what jobs are running where and what their status is in the condor_grid_overview output. Is that correct?

--Peter

22 Jun 09 15:32:29 INFO Trying to chmod 644 /opt/osg-shared/macintel/ijstokes/sad2/3cny/2-fast/output-3cny/54/log
22 Jun 09 15:32:31 INFO Added job 292525_/opt/osg-shared/macintel/ijstokes/sad2/3cny/2-fast/output-3cny/09/log for tracking
22 Jun 09 15:32:31 ERROR Unable open logfile: /opt/osg-shared/macintel/ijstokes/sad2/3cny/2-fast/output-3cny/09/log (Permission denied)
22 Jun 09 15:32:31 INFO Trying to chmod 644 /opt/osg-shared/macintel/ijstokes/sad2/3cny/2-fast/output-3cny/09/log
22 Jun 09 15:32:34 INFO Added job 292526_/opt/osg-shared/macintel/ijstokes/sad2/3cny/2-fast/output-3cny/38/log for tracking
22 Jun 09 15:32:34 ERROR Unable open logfile: /opt/osg-shared/macintel/ijstokes/sad2/3cny/2-fast/output-3cny/38/log (Permission denied)
22 Jun 09 15:32:34 INFO Trying to chmod 644 /opt/osg-shared/macintel/ijstokes/sad2/3cny/2-fast/output-3cny/38/log

On Jun 22, 2009, at 3:18 PM, Mats Rynge wrote:

> Peter Doherty wrote:
>> I don't know what's going on here, but my jobs submitted to the
>> MatchMaker aren't being matched, and I found out the condor
>> negotiator keeps crashing. If I shut down osgmm, the negotiator
>> keeps running, but then if I start up osgmm, the negotiator
>> crashes when it starts to match one of my jobs.
>> Here are some of the errors I'm getting.
>> I'm not sure where to start with this.
>
> I haven't seen the negotiator crash before, but I have seen the
> hostname problem recently. Please try this preview of 0.7:
>
> http://www.renci.org/~rynge/osgmm-0.6.jar
>
> Replace the one you have in lib/
>
> --
> Mats Rynge
> Renaissance Computing Institute <http://www.renci.org>

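Peter's point (b) can be checked without touching anyone's files. The following is a minimal sketch, not OSGMM code: it assumes you have a list of job log paths (for example, copied from the "Unable open logfile" lines in osgmm.log) and reports which ones the current user cannot read, and whether a chmod by that user would normally be permitted at all. On POSIX systems chmod is normally only allowed for the file's owner or root, which is why a daemon that cannot read another user's log is unlikely to be able to fix its permissions. The function names are hypothetical.

```python
import os


def unreadable_logs(paths):
    """Return the paths the current user cannot open for reading.

    os.access only reports on permissions (using the real uid/gid);
    it never modifies the file.
    """
    return [p for p in paths if not os.access(p, os.R_OK)]


def could_chmod(path):
    """Return True if a chmod by the current user would normally be allowed.

    Assumption: plain POSIX ownership rules (file owner or root),
    ignoring ACLs and Linux capabilities.
    """
    try:
        owner = os.stat(path).st_uid
    except OSError:
        # Missing file, or a parent directory we cannot search.
        return False
    uid = os.geteuid()
    return uid == 0 or uid == owner
```

Run as the osgmm user, `unreadable_logs([...])` would list the job logs the matchmaker cannot track, and `could_chmod` would show whether the logged "Trying to chmod 644" attempts had any chance of succeeding.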
From: Mats R. <ry...@re...> - 2009-06-22 19:18:37
Peter Doherty wrote:
> I don't know what's going on here, but my jobs submitted to the
> MatchMaker aren't being matched, and I found out the condor negotiator
> keeps crashing. If I shut down osgmm, the negotiator keeps running,
> but then if I start up osgmm, the negotiator crashes when it starts to
> match one of my jobs.
> Here are some of the errors I'm getting.
> I'm not sure where to start with this.

I haven't seen the negotiator crash before, but I have seen the hostname problem recently. Please try this preview of 0.7:

http://www.renci.org/~rynge/osgmm-0.6.jar

Replace the one you have in lib/

--
Mats Rynge
Renaissance Computing Institute <http://www.renci.org>

From: Peter D. <do...@cr...> - 2009-06-22 19:02:20
I don't know what's going on here, but my jobs submitted to the MatchMaker aren't being matched, and I found out the condor negotiator keeps crashing. If I shut down osgmm, the negotiator keeps running, but then if I start up osgmm, the negotiator crashes when it starts to match one of my jobs.

Here are some of the errors I'm getting. I'm not sure where to start with this.

Thanks

--Peter

NegotiatorLog

6/22 14:41:25 ******************************************************
6/22 14:41:25 ** condor_negotiator (CONDOR_NEGOTIATOR) STARTING UP
6/22 14:41:25 ** /opt/osg-shared/se/app/site/condor-7.2.1/sbin/condor_negotiator
6/22 14:41:25 ** SubsystemInfo: name=NEGOTIATOR type=NEGOTIATOR(4) class=DAEMON(1)
6/22 14:41:25 ** Configuration: subsystem:NEGOTIATOR local:<NONE> class:DAEMON
6/22 14:41:25 ** $CondorVersion: 7.2.1 Feb 18 2009 BuildID: 133382 $
6/22 14:41:25 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
6/22 14:41:25 ** PID = 4322
6/22 14:41:25 ** Log last touched 6/22 14:36:34
6/22 14:41:25 ******************************************************
6/22 14:41:25 Using config source: /opt/osg-shared/se/app/site/condor/etc/condor_config
6/22 14:41:25 Using local config sources:
6/22 14:41:25    /opt/osg-local/condor/condor_config.local
6/22 14:41:25 DaemonCore: Command Socket at <10.0.10.39:51423>
6/22 14:41:25 About to rotate ClassAd log /opt/osg-local/condor/spool/Accountantnew.log
6/22 14:41:25 NEGOTIATOR_SOCKET_CACHE_SIZE = 16
6/22 14:41:25 PREEMPTION_REQUIREMENTS = ( (CurrentTime - EnteredCurrentState) > (1 * (60 * 60)) && RemoteUserPrio > SubmittorPrio * 1.2 ) || (MY.NiceUser == True)
6/22 14:41:25 ACCOUNTANT_HOST = None (local)
6/22 14:41:25 NEGOTIATOR_INTERVAL = 25 sec
6/22 14:41:25 NEGOTIATOR_TIMEOUT = 30 sec
6/22 14:41:25 MAX_TIME_PER_SUBMITTER = 31536000 sec
6/22 14:41:25 MAX_TIME_PER_PIESPIN = 31536000 sec
6/22 14:41:25 PREEMPTION_RANK = (RemoteUserPrio * 1000000) - TARGET.ImageSize
6/22 14:41:25 NEGOTIATOR_PRE_JOB_RANK = RemoteOwner =?= UNDEFINED
6/22 14:41:25 NEGOTIATOR_POST_JOB_RANK = None
6/22 14:41:25 ---------- Started Negotiation Cycle ----------
6/22 14:41:25 Phase 1: Obtaining ads from collector ...
6/22 14:41:25 Getting all public ads ...
6/22 14:41:25 Sorting 175 ads ...
6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
6/22 14:41:25 Can't evaluate STARTD_AD_REEVAL_EXPR target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool, treating as TRUE
6/22 14:41:25 Getting startd private ads ...
6/22 14:41:25 Got ads: 175 public and 123 private
6/22 14:41:25 Public ads include 6 submitter, 137 startd
6/22 14:41:25 Phase 2: Performing accounting ...
6/22 14:41:25 ERROR "Assertion ERROR on (resource_hash.insert( ResourceName, ResourceAd ) == 0)" at line 785 in file Accountant.cpp

after starting up osgmm:

[root@abitibi condor]# /etc/init.d/osgmm start
Starting up OSGMM
[root@abitibi condor]# Exception in thread "Thread-1" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
        at java.lang.String.substring(String.java:1768)
        at org.renci.osgmm.Site.getHostName(Site.java:141)
        at org.renci.osgmm.Sites.addSite(Sites.java:106)
        at org.renci.osgmm.ReSS.processReSSAd(ReSS.java:228)
        at org.renci.osgmm.ReSS.pullReSS(ReSS.java:178)
        at org.renci.osgmm.ReSS.run(ReSS.java:102)

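The stack trace above ends in String.substring being called with index -1 from Site.getHostName. In Java that pattern usually means the result of a failed separator search was passed straight to substring. The OSGMM source is not shown in this thread, so the following is only an illustrative sketch, written in Python, of the defensive style of parsing that avoids this class of failure. The function name and the "host:port/service" contact-string format are assumptions, not taken from OSGMM.

```python
def host_from_contact(contact):
    """Extract the host part of a contact string such as 'host:port/service'.

    Returns None for empty or malformed input instead of raising, so a
    single bad site advertisement cannot take down the polling thread.
    """
    if not contact:
        return None
    # Cut at the first ':' or '/', whichever comes first. A separator
    # that is absent (find returns -1) is ignored rather than used as
    # an index.
    end = len(contact)
    for sep in (":", "/"):
        idx = contact.find(sep)
        if idx != -1:
            end = min(end, idx)
    host = contact[:end].strip()
    return host or None
```

A caller would skip (and log) any advertisement for which this returns None, rather than letting the exception propagate out of the thread.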
From: Mats R. <ry...@re...> - 2009-06-17 21:17:14
Peter Doherty wrote:
> What's the best way to determine why a site is getting a low rank?
> We have two CEs, and condor_grid_overview shows them like this:
>
> SBGrid-Harvard-East   0   0   0   0   0   0   0   952   100%
> SBGrid-Harvard-Exp    0   0   0   0   0   0   0     1   100%
>
> I had assumed that the Exp CE was getting a low rank because its
> queue has been full for several weeks. But the queue cleared up the
> past day, yet it still only has a rank of 1. But I know jobs can run
> successfully, so why does it have a low rank?

Hi Peter,

In this case, it is both issues.

> I looked in ~osgmm/var/verification-runs/SiteName and looked through
> the error files.
> fork.err shows:
> + echo 'More than 5G of $HOME used!'

Many sites have quotas on $HOME, so the idea is to disable the site if we are using more than 5GB of space. This test is a little bit Engage-specific, so maybe we should disable it. You can do that by editing libexec/verification-script.fork

> jm.err shows:
> ++ MANPATH=:/opt/osg-shared/wn-1.0/vdt/man
> ++ export MANPATH
> ++ . /opt/osg-shared/wn-1.0/vdt/etc/vdt-man-setup.sh
> /opt/osg-shared/wn/setup.sh: line 47: /opt/osg-shared/wn-1.0/vdt/etc/vdt-man-setup.sh: No such file or directory
>
> The second error is curious, the file is there.
>
> #ls -l /opt/osg-shared/wn-1.0/vdt/etc/vdt-man-setup.sh
> -rw-r--r-- 1 root root 51 May 13 2008 /opt/osg-shared/wn-1.0/vdt/etc/vdt-man-setup.sh

The file seems to exist on the head node, but not on the compute nodes. The WN install should be on a shared file system, as the purpose is for the tools to be available to the jobs.

--
Mats Rynge
Renaissance Computing Institute <http://www.renci.org>

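Mats describes the fork test as disabling a site when more than 5GB of $HOME is in use. The actual check lives in libexec/verification-script.fork, whose contents are not shown in this thread, so the following is only a rough sketch of an equivalent check for anyone who wants to reproduce the result by hand. The function names are hypothetical, the limit is assumed to mean 5 GiB, and the sizes are apparent file sizes, which can differ from what du or a quota system reports.

```python
import os


def dir_usage_bytes(root):
    """Sum the apparent sizes of all regular files and links under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                # lstat so symlinks count as links, not as their targets.
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # File vanished or is not accessible; skip it.
                continue
    return total


def over_limit(root, limit_bytes=5 * 1024**3):
    """Return True if root holds more than limit_bytes (default: 5 GiB)."""
    return dir_usage_bytes(root) > limit_bytes
```

Calling `over_limit(os.path.expanduser("~"))` as the account used for the verification jobs on the CE would indicate whether that account's home directory is the reason the fork test fails.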
From: Peter D. <do...@cr...> - 2009-06-17 20:43:58
Hi,

I know there aren't many people on the list, but I thought I'd post here nonetheless.

What's the best way to determine why a site is getting a low rank? We have two CEs, and condor_grid_overview shows them like this:

SBGrid-Harvard-East   0   0   0   0   0   0   0   952   100%
SBGrid-Harvard-Exp    0   0   0   0   0   0   0     1   100%

I had assumed that the Exp CE was getting a low rank because its queue has been full for several weeks. But the queue cleared up the past day, yet it still only has a rank of 1. But I know jobs can run successfully, so why does it have a low rank?

I looked in ~osgmm/var/verification-runs/SiteName and looked through the error files.

fork.err shows:
+ echo 'More than 5G of $HOME used!'

jm.err shows:
++ MANPATH=:/opt/osg-shared/wn-1.0/vdt/man
++ export MANPATH
++ . /opt/osg-shared/wn-1.0/vdt/etc/vdt-man-setup.sh
/opt/osg-shared/wn/setup.sh: line 47: /opt/osg-shared/wn-1.0/vdt/etc/vdt-man-setup.sh: No such file or directory

The second error is curious, the file is there.

#ls -l /opt/osg-shared/wn-1.0/vdt/etc/vdt-man-setup.sh
-rw-r--r-- 1 root root 51 May 13 2008 /opt/osg-shared/wn-1.0/vdt/etc/vdt-man-setup.sh

Is one of these errors why the site rank is low, or is it something else?

Thanks

--Peter

From: Alan De S. <ad...@cs...> - 2009-02-04 20:17:42
Thanks for the answers!

Mats Rynge <ry...@re...> wrote:
> Having the option for running OSGMM in its own account would be
> nice.

My goal is to support this. But ideally we'd like it to Just Work out of the box for users who don't have a dedicated account. We'll see what sort of options we can have.

Another case for using a non-dedicated account is where a user wants to run everything as himself, perhaps because he lacks root access. In this case, the security concern shouldn't be present.

--
Alan De Smet
Condor Project Research
ad...@cs...
http://www.cs.wisc.edu/condor/

From: Mats R. <ry...@re...> - 2009-02-04 19:34:04
Alan De Smet wrote:
> I have some questions about the OSG Match Maker as I'm working on
> adding it to the VDT.
>
> 0. Would you prefer I direct these questions and comments to
> osg...@li...?

Yes, that would be great. I have added the list to the CC line.

> 1. The documentation says the OSG Client software stack is a
> prerequisite. Which parts are needed? I'm assuming you're
> talking about the package known as "vo-client" and "OSG VO
> Client", which only adds vomsetc/vomses and glite/etc/vomses on
> top of the VDT install. Are these specifically needed? Are
> there other things beyond the basic VDT that are required?

Basic VDT should be fine. The requirements are Condor, Java, and VOMS clients.

> 2. I've been told that the MM requires a "full shell." What is
> meant by this? What does it do with the shell that it needs
> this? We're considering using the daemon account by default,
> and while you can run /bin/sh as daemon, you typically can't
> directly log in as the default shell is /bin/nologin or
> similar. Will that be good enough?

I need to do some testing here. OSGMM is shelling out to do a lot of tasks, and I'm not sure if /bin/nologin would have an impact on the shell-outs.

The other reason for having an account with a full shell is that if validation is enabled, the user OSGMM runs under needs a valid proxy. In the current instances we have of OSGMM we use voms-proxy-init to maintain the proxy. I guess the proxy could be maintained as another user and then copied and chowned to the daemon user.

> 3. Are there other MM-specific risks to running in the daemon
> account, or another shared account? I notice that your sample
> init script will kill anything running under the osgmm account,
> but I think we can deal with this.

The risk is if validation is enabled, and there is a user proxy for daemon. If some other process which is also running as daemon gets compromised, the proxy can easily be stolen. Having the option for running OSGMM in its own account would be nice.

> 4. We're considering running the MM under Condor as a top level
> Condor daemon. This would give it the same automatic
> monitoring and restart capabilities as any other Condor
> daemons. This includes the ability to hunt down and kill child
> processes when shutting down. Are there any potential problems
> with this that you immediately think of?

That would be great. The only issue is probably the daemon user issues from 2. and 3.

> 5. What versions of Java are acceptable? In particular, which of
> 4, 5, and 6 will work with the OSG MM?

5 or 6 will work.

--
Mats Rynge
Renaissance Computing Institute <http://www.renci.org>

From: Mats R. <ry...@re...> - 2008-03-10 22:25:17
--
Mats Rynge
Renaissance Computing Institute <http://www.renci.org>