osgmm-discuss Mailing List for OSGMM
Brought to you by:
mats_rynge
You can subscribe to this list here.
Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2008 | | | (1) | | | | | | | | | |
2009 | | (2) | | | | (20) | (2) | | | (4) | (1) | (3) |
From: Peter D. <do...@cr...> - 2009-12-30 18:23:55
|
On Dec 29, 2009, at 22:34 PM, Mats Rynge wrote:
>
> Yes, the many jobs in the queue are causing condor_q to time out. I have made some improvements in later OSGMM versions, but the issue has not been fully solved. Occasionally seeing this message is fine, but if you see it all the time, that is a problem.

I'm running v0.8, and I was seeing the error all the time. I had 5000 short-running jobs queued up, and I think they were completing faster than OSGMM could match them. The OSGMM java process was at 100% CPU (but just using one of the four cores, of course) and only about 200 jobs were running concurrently.

> Not currently. I have been thinking about putting an explanation in the classad so that it could be seen with condor_grid_overview. Would you prefer that over having more information in the logs?

Either way, but some kind of understanding of where the rank number came from would be useful, especially for the sites under my administrative control, so I know if something needs fixing. Most of the sites seem to have a rank of 1, 200, 800, or 1000 (give or take a few points), and I just haven't quite figured out exactly where the value comes from.

Cheers,
Peter |
From: Mats R. <rynge@ISI.EDU> - 2009-12-30 03:49:03
|
Peter Doherty wrote:
> I'm getting a lot of warnings in the osgmm.log file like this:
>
> Setting low rank on LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu because we have not heard from condor_q
>
> What is causing this? Is there a timeout built into OSGMM around condor_q?
> Our system has a lot of jobs in the queue right now, and just issuing condor_q takes about 10-15 seconds to return.

Yes, the many jobs in the queue are causing condor_q to time out. I have made some improvements in later OSGMM versions, but the issue has not been fully solved. Occasionally seeing this message is fine, but if you see it all the time, that is a problem.

> Is there any way to increase the logging of OSGMM so that I can try and figure out what parameters the OSGMM is using to set ranks on sites? It doesn't seem clear to me why certain sites got the rank they did sometimes.

Not currently. I have been thinking about putting an explanation in the classad so that it could be seen with condor_grid_overview. Would you prefer that over having more information in the logs?

--
Mats Rynge
USC/ISI <http://www.isi.edu> |
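To see how close condor_q comes to a timeout on a busy submit host, it can be timed from a small script. This is a minimal sketch, not OSGMM's code: the helper name `timed_run` and the 30-second limit used in the note below are assumptions for illustration, and the thread does not say what timeout OSGMM actually applies.

```python
import subprocess
import sys
import time


def timed_run(cmd, timeout_s):
    """Run cmd, discarding its output; return (elapsed_seconds, timed_out).

    timeout_s is a hypothetical limit chosen by the caller, not a value
    taken from OSGMM.
    """
    start = time.monotonic()
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,
        )
        timed_out = False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        timed_out = True
    return time.monotonic() - start, timed_out
```

On the submit host, `timed_run(["condor_q"], 30.0)` would report how long a single query takes and whether it exceeded the chosen limit. If the elapsed time is regularly near the limit, the "have not heard from condor_q" warning is likely to appear often.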
From: Peter D. <do...@cr...> - 2009-12-24 16:14:52
|
I'm getting a lot of warnings in the osgmm.log file like this:

Setting low rank on LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu because we have not heard from condor_q

What is causing this? Is there a timeout built into OSGMM around condor_q? Our system has a lot of jobs in the queue right now, and just issuing condor_q takes about 10-15 seconds to return.

Is there any way to increase the logging of OSGMM so that I can try and figure out what parameters the OSGMM is using to set ranks on sites? It doesn't seem clear to me why certain sites got the rank they did sometimes.

Thanks
--Peter Doherty |
From: Peter D. <do...@cr...> - 2009-11-23 16:38:02
|
I know the OSGMM queries ReSS to get a list of sites that support my VO, but what's the exact query? Is it looking at GlueCEAccessControlBaseRule? The site list in OSGMM doesn't correlate exactly with any of my manual queries to ReSS. I've got plenty of sites that I can run at but that aren't showing up in the MatchMaker, so I'm assuming those are improperly configured sites.

Thanks,
Peter |
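For comparing manual results against OSGMM's site list, a manual ReSS query can be written down explicitly. The sketch below only builds a condor_status command line. It is not OSGMM's actual query, which this thread does not state. The assumptions are: that GlueCEAccessControlBaseRule is a comma-separated string list, that VO entries have the form "VO:<name>", and that the pool host is known; `ress.example.org` is a placeholder, not a real server.

```python
def ress_query_command(vo, pool):
    """Build a condor_status command listing GlueSiteName for ads whose
    GlueCEAccessControlBaseRule contains "VO:<vo>".

    The "VO:<vo>" entry format and the string-list layout of the attribute
    are assumptions, not taken from OSGMM.
    """
    constraint = 'stringListMember("VO:%s", GlueCEAccessControlBaseRule)' % vo
    return [
        "condor_status",
        "-pool", pool,              # collector to query (placeholder host)
        "-constraint", constraint,  # ClassAd expression evaluated per ad
        "-format", "%s\n", "GlueSiteName",
    ]


# Hypothetical usage; pass the resulting list to subprocess.run on a host
# with the Condor client tools installed.
cmd = ress_query_command("sbgrid", "ress.example.org")
```

Diffing the output of such a query against the sites shown by condor_grid_overview would at least narrow down which sites are missing from the MatchMaker.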
From: Mats R. <ry...@re...> - 2009-10-20 22:23:31
|
Peter Doherty wrote:
> Here's the basic scenario I'm wondering about.
> I want to install the OSG-Client software on my linux machine, and then submit a job to our CE that is running the OSGMM and have it match to a resource and submit the job.
> Is this possible? If so, how? If not, is it even feasible?

What you are describing is very similar to Fermi's job forwarding gateway:

http://fermigrid.fnal.gov/matchmaking.html
http://osg-docdb.opensciencegrid.org/0006/000614/001/FermiGrid_OSG_Experience.ppt

They are not using OSGMM, but their solution is based on ReSS so there is a lot in common. I'm sure Keith Chadwick can provide more information.

But in general, you should be able to do this with a custom job manager on your CE. See globus/lib/perl/Globus/GRAM/JobManager/. Create a custom one which translates an incoming job to an OSGMM job. There are some issues to watch out for, such as staging and proxy renewals, but in the end, your users can then submit regular condor-g jobs from their own submit machines.

--
Mats Rynge
Renaissance Computing Institute <http://www.renci.org> |
From: Peter D. <do...@cr...> - 2009-10-20 21:40:28
|
Sorry, I should have been more specific.
Our CE also has the client software installed, and the OSGMM runs from the client install.
There are other researchers here that have access to their own cluster. Rather than create accounts for all of them on our submit machine, I was hoping to set up the osg-client on their cluster, and then let them submit jobs from there, but utilize the matchmaking features of our cluster. Running two instances of the MatchMaker just creates extra overhead for myself.

--Peter

On Oct 20, 2009, at 17:32 , Derek Weitzel wrote:
> Usually OSGMM is installed on the client machine, ie the same machine that OSG-Client is installed. OSGMM is a client software, and mostly useless on the CE. I suggest installing OSGMM along with the OSG-Client on your linux machine.
>
> Now for submitting jobs to another machine to in turn submit to grid machines is possible (confused?), but could get overly complicated.
>
> Derek Weitzel
> Graduate Research Assistant
> University of Nebraska Holland Computing Center
>
> On Oct 20, 2009, at 4:05 PM, Peter Doherty wrote:
>
>> Here's the basic scenario I'm wondering about.
>> I want to install the OSG-Client software on my linux machine, and then submit a job to our CE that is running the OSGMM and have it match to a resource and submit the job.
>> Is this possible? If so, how? If not, is it even feasible?
>>
>> Thanks,
>> Peter |
From: Derek W. <dj...@gm...> - 2009-10-20 21:32:54
|
Usually OSGMM is installed on the client machine, ie the same machine that OSG-Client is installed. OSGMM is a client software, and mostly useless on the CE. I suggest installing OSGMM along with the OSG-Client on your linux machine.

Now for submitting jobs to another machine to in turn submit to grid machines is possible (confused?), but could get overly complicated.

Derek Weitzel
Graduate Research Assistant
University of Nebraska Holland Computing Center

On Oct 20, 2009, at 4:05 PM, Peter Doherty wrote:
> Here's the basic scenario I'm wondering about.
> I want to install the OSG-Client software on my linux machine, and then submit a job to our CE that is running the OSGMM and have it match to a resource and submit the job.
> Is this possible? If so, how? If not, is it even feasible?
>
> Thanks,
> Peter |
From: Peter D. <do...@cr...> - 2009-10-20 21:05:26
|
Here's the basic scenario I'm wondering about. I want to install the OSG-Client software on my linux machine, and then submit a job to our CE that is running the OSGMM and have it match to a resource and submit the job. Is this possible? If so, how? If not, is it even feasible? Thanks, Peter |
From: Peter D. <do...@cr...> - 2009-07-02 19:16:04
|
Thanks Alan,

Yes, since the updated version of the MatchMaker that Mats sent to me, things have been working a lot better. I'm glad you were able to find the problem, and I hope it wasn't too big of a headache for you.

Cheers,
Peter

On Jul 2, 2009, at 2:08 PM, Alan De Smet wrote:
> Although with Mats's change it's moot, I do want to note that the upcoming Condor 7.2.5 will contain a fix for this. Such duplicate ads will be treated as two distinct resources, and matching done on both. While perhaps surprising, it matches Condor's normal behavior with Machine ads.
>
> --
> Alan De Smet Condor Project Research
> ad...@cs... http://www.cs.wisc.edu/condor/ |
From: Alan De S. <ad...@cs...> - 2009-07-02 18:08:06
|
Although with Mats's change it's moot, I do want to note that the upcoming Condor 7.2.5 will contain a fix for this. Such duplicate ads will be treated as two distinct resources, and matching done on both. While perhaps surprising, it matches Condor's normal behavior with Machine ads. -- Alan De Smet Condor Project Research ad...@cs... http://www.cs.wisc.edu/condor/ |
From: Mats R. <ry...@re...> - 2009-06-29 19:00:50
|
https://sourceforge.net/projects/osgmm/

Version 0.7
===========
- Removed compute-env.{sh|csh} - race conditions when a VO has multiple OSGMM instances. These files can be managed from the extra maintenance scripts instead.
- Fixed an issue with sites not advertising GlueSiteName correctly
- Improved the tailing of job log files
- Added the ability to run local tests against remote resources, see libexec/verification-script.local
- Fixed a bug where the ReSS query constraint was not set correctly for some configurations
- Memory usage improvements
- Replaced advanced job example in the documentation
- Replaced the init.d control of OSGMM with condor_master (documentation change)

Version 0.6
===========
- Added OSGMM_Extra_Requirements in order to be able to add to the Requirements attribute
- Fixed verification / maintenance jobs bug when using OSGMM_DoNotAdvertise
- Made verification / maintenance jobs use timestamped script file to prevent the file from being cached at the sites
- Changed max classad line length to 20000 because the new software attributes can be pretty long in ReSS
- Fixed bug in hostname parsing

--
Mats Rynge
Renaissance Computing Institute <http://www.renci.org> |
From: Mats R. <ry...@re...> - 2009-06-26 14:54:47
|
Mats Rynge wrote:
> Alan De Smet wrote:
>> Are the duplicate ads (two ads with the same Name, but different IpAddrs) intended? This is causing Condor problems, although it shouldn't. As an immediate workaround, if you need these duplicate ads, can you add something unique (like IpAddr) into the Name?
>
> The match maker does not set IpAddr. Are there other attributes that could cause this issue?

Alan and Peter,

I think I have found the cause for the duplicate ads. The ReSS service moved to a different server, and it seems like ReSS sets the MyAddress and StartdIpAddr attributes to the local server IP (131.225.110.152 and 131.225.107.219). I have changed OSGMM to replace those with 127.0.0.1 to take away the confusion.

Peter, a new version can be found at:

http://www.renci.org/~rynge/osgmm-0.7-20090626.jar

Once that is deployed, it will take a while for the old ads to expire. But in the end, condor_status should show only one ad per site.

--
Mats Rynge
Renaissance Computing Institute <http://www.renci.org> |
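The address rewrite described above can be illustrated on the text form of an ad. This is only a sketch of the idea, not the change made inside OSGMM: the function name `neutralize_addresses` is hypothetical, and it assumes ads in "Attribute = value" form, one attribute per line, with the IP embedded in the value.

```python
import re

# Attributes named in the message above; everything else is left untouched.
ADDR_ATTRS = ("MyAddress", "StartdIpAddr")
IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def neutralize_addresses(ad_text):
    """Replace any IPv4 address in the MyAddress and StartdIpAddr lines
    with 127.0.0.1, so ads for the same site no longer differ by the
    server they came from."""
    out = []
    for line in ad_text.splitlines():
        name = line.split("=", 1)[0].strip()
        if name in ADDR_ATTRS:
            line = IPV4.sub("127.0.0.1", line)
        out.append(line)
    return "\n".join(out)
```

With the addresses made identical, two copies of a site's ad received through the old and new ReSS servers would carry the same values for these attributes rather than looking like two different resources.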
From: Mats R. <ry...@re...> - 2009-06-26 00:05:55
|
Alan De Smet wrote:
> Thanks to the information Peter provided, I can reproduce the problem from Condor's side. Whatever else may or may not be wrong, Condor is definitely doing something wrong and we'll see about fixing it.
>
> Are the duplicate ads (two ads with the same Name, but different IpAddrs) intended? This is causing Condor problems, although it shouldn't. As an immediate workaround, if you need these duplicate ads, can you add something unique (like IpAddr) into the Name?

The match maker does not set IpAddr. Are there other attributes that could cause this issue?

--
Mats Rynge
Renaissance Computing Institute <http://www.renci.org> |
From: Peter D. <do...@cr...> - 2009-06-25 14:11:08
|
I don't care about the duplicate ads. The MatchMaker must have done that. At the moment there don't appear to be duplicate ads. The Negotiator still crashes. I'd be happy to help test out a prerelease version of the negotiator if it would help.

--Peter

On Jun 24, 2009, at 7:03 PM, Alan De Smet wrote:
> Thanks to the information Peter provided, I can reproduce the problem from Condor's side. Whatever else may or may not be wrong, Condor is definitely doing something wrong and we'll see about fixing it.
>
> Are the duplicate ads (two ads with the same Name, but different IpAddrs) intended? This is causing Condor problems, although it shouldn't. As an immediate workaround, if you need these duplicate ads, can you add something unique (like IpAddr) into the Name?
>
> Peter, I hope to have a fix soon, but it won't be available generally until 7.2.5. (Sadly it's too late for 7.2.4.) In the meanwhile, would you want a prerelease version of the negotiator?
>
> --
> Alan De Smet Condor Project Research
> ad...@cs... http://www.cs.wisc.edu/condor/ |
From: Alan De S. <ad...@cs...> - 2009-06-24 23:03:55
|
Thanks to the information Peter provided, I can reproduce the problem from Condor's side. Whatever else may or may not be wrong, Condor is definitely doing something wrong and we'll see about fixing it.

Are the duplicate ads (two ads with the same Name, but different IpAddrs) intended? This is causing Condor problems, although it shouldn't. As an immediate workaround, if you need these duplicate ads, can you add something unique (like IpAddr) into the Name?

Peter, I hope to have a fix soon, but it won't be available generally until 7.2.5. (Sadly it's too late for 7.2.4.) In the meanwhile, would you want a prerelease version of the negotiator?

--
Alan De Smet Condor Project Research
ad...@cs... http://www.cs.wisc.edu/condor/ |
From: Alan De S. <ad...@cs...> - 2009-06-23 23:16:31
|
(Resending to include osg...@li.... My apologies for the duplicate, Peter.)

Peter Doherty <do...@cr...> wrote:
> Here's the output of condor_status and condor_status -l

I'm seeing some suspicious things; in particular I'm surprised to see multiple ads with the same Name. As I currently understand the design, the collector should only allow one ad with a given Name. If this constraint is violated, the negotiator will ASSERT, as observed. I'll dig deeper to see why the collector isn't behaving as I expect. Your ads make testing this easier. Thank you.

--
Alan De Smet Condor Project Research
ad...@cs... http://www.cs.wisc.edu/condor/ |
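Duplicate Names of the kind described above can be spotted in saved "condor_status -l" output with a few lines of Python. This is a minimal sketch under the assumption that the long listing contains one "Attribute = value" pair per line; the function name `duplicate_names` is hypothetical.

```python
from collections import Counter


def duplicate_names(long_output):
    """Return the sorted list of Name values that occur in more than one ad
    in the text of a `condor_status -l` listing."""
    names = []
    for line in long_output.splitlines():
        key, sep, value = line.partition("=")
        # Only the attribute called exactly "Name" counts; expressions that
        # merely mention Name on the right-hand side are ignored.
        if sep and key.strip() == "Name":
            names.append(value.strip().strip('"'))
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)
```

Running this over a file such as the condor_status-l.txt linked in this thread would list the site names that the collector is holding more than once.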
From: Alan De S. <ad...@cs...> - 2009-06-23 23:16:23
|
I'm juggling a few things right now, but I'm taking a quick look at the negotiator ASSERTing. It may be that other layers are also malfunctioning, I'm not sure, but even so it should not cause the negotiator to fail in that way. I'll try to dig a bit deeper. If this happens again, the output of "condor_status -l" might prove helpful. My current hypothesis is that the collector is misbehaving and ends up sending nonsensical data to the negotiator, which throws its hands up in the air as a result. -- Alan De Smet Condor Project Research ad...@cs... http://www.cs.wisc.edu/condor/ |
From: Peter D. <do...@cr...> - 2009-06-23 22:46:58
|
Were the ads with the same name the ones from the Match-Maker?
Since I've restarted the matchmaker (after clearing out some of its directories) and then restarted Condor, things seem to be behaving better.

It's odd, because the Match Maker seemed to be running fine for the past couple weeks, but then things just stopped working well late last week. We had a run of jobs, totaling about 13,000 jobs in the queue since last Wednesday, and that just finished last night. But I have been having trouble getting the OSG Matchmaker to match jobs since Friday.

--Peter

On Jun 23, 2009, at 6:28 PM, Alan De Smet wrote:
> (Resending to include osg...@li.... My apologies for the duplicate, Peter.)
>
> Peter Doherty <do...@cr...> wrote:
>> Here's the output of condor_status and condor_status -l
>
> I'm seeing some suspicious things; in particular I'm surprised to see multiple ads with the same Name. As I currently understand the design, the collector should only allow one ad with a given Name. If this constraint is violated, the negotiator will ASSERT, as observed. I'll dig deeper to see why the collector isn't behaving as I expect. Your ads make testing this easier. Thank you.
>
> --
> Alan De Smet Condor Project Research
> ad...@cs... http://www.cs.wisc.edu/condor/ |
From: Mats R. <ry...@re...> - 2009-06-23 21:54:10
|
Peter Doherty wrote:
>
> It turns out this version breaks the verification runs for me. Are there updated scripts to go along with it? For example the fork.condor file in ~osgmm/var/verification-runs/SITE-NAME listed the executable as "fork.script.123591332490" but that executable didn't exist anywhere.

fork.script.$ts is just a copy of the libexec/fork.script. I don't think the location changed from 0.5, but I might be wrong.

> But I'm having trouble figuring out why the verification tests aren't working right anymore. The Ranks for all the sites are low (1 or 3) although the Success score is 100%. And several sites aren't even being tested. It would really be helpful to me to get more logging information showing why a site was dropped from the list, and why a test can complete with TEST SUCCESSFUL, but the site Rank is still 1.
> Like our site SBGrid-Harvard-East is no longer in my list from condor_grid_overview, and since it doesn't have a directory under verification-runs, I can't see the output from the tests.
> Restarting the MatchMaker seems to clear out the osgmm.log file without rolling it over. So after a few restarts this afternoon I now have a huge gap in the log files, and perhaps that's where the answer is why the East site was dropped.

I'm trying to re-create your problem on a test machine here. It is running Condor 7.2.1 and is configured for the sbgrid VO. So far I have not seen the negotiator crash.

I think it would be useful for me to poke around in your OSGMM install instead of trying to figure things out over email. I think I used to have an account on abitibi. If the account is still there, can I have the password reset (and sent in a private email of course)?

--
Mats Rynge
Renaissance Computing Institute <http://www.renci.org> |
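The timestamped copy described above ("fork.script.$ts is just a copy of the libexec/fork.script") can be sketched as follows. This is an illustration of the idea only, not OSGMM's implementation: the function name, the use of whole seconds for the timestamp, and the destination directory are assumptions.

```python
import shutil
import time
from pathlib import Path


def stage_timestamped_script(source, dest_dir, now=None):
    """Copy `source` into `dest_dir` as '<name>.<timestamp>' and return the
    new path. A fresh file name per run keeps a site from reusing a cached
    copy of an older script.

    `now` (seconds since the epoch) is only there to make the result
    predictable; by default the current time is used.
    """
    ts = int(time.time() if now is None else now)
    dest = Path(dest_dir) / f"{Path(source).name}.{ts}"
    shutil.copy2(source, dest)  # copies contents and permission bits
    return dest
```

If a submit file names an executable such as fork.script.<timestamp> but no such file exists in the run directory, the copy step is the first place to look.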
From: Peter D. <do...@cr...> - 2009-06-23 21:26:34
|
On Jun 22, 2009, at 3:18 PM, Mats Rynge wrote:
>
> I haven't seen the negotiator crash before, but I have seen the hostname problem recently. Please try this preview of 0.7:
>
> http://www.renci.org/~rynge/osgmm-0.6.jar
>
> Replace the one you have in lib

It turns out this version breaks the verification runs for me. Are there updated scripts to go along with it? For example the fork.condor file in ~osgmm/var/verification-runs/SITE-NAME listed the executable as "fork.script.123591332490" but that executable didn't exist anywhere.

I've reverted back to the older version for the time being. I cleared out everything in ~osgmm/var/final-ads, verification-runs, maintenance-runs and restarted the match maker. And then I restarted Condor. This helped.

But I'm having trouble figuring out why the verification tests aren't working right anymore. The Ranks for all the sites are low (1 or 3) although the Success score is 100%. And several sites aren't even being tested. It would really be helpful to me to get more logging information showing why a site was dropped from the list, and why a test can complete with TEST SUCCESSFUL, but the site Rank is still 1.

Like our site SBGrid-Harvard-East is no longer in my list from condor_grid_overview, and since it doesn't have a directory under verification-runs, I can't see the output from the tests.

Restarting the MatchMaker seems to clear out the osgmm.log file without rolling it over. So after a few restarts this afternoon I now have a huge gap in the log files, and perhaps that's where the answer is why the East site was dropped.

--Peter |
From: Peter D. <do...@cr...> - 2009-06-23 16:13:57
|
The Negotiator started crashing again. I'm sure it's some kind of conflict with the OSG Match-Maker. I turned on D_FULL_DEBUG on the Negotiator Log, but that didn't tell me much more.

Here's the output of condor_status and condor_status -l

I wonder if it's some invalid class ad format. It's also interesting that at the top of condor_status you can see duplicates of some sites. I know some sites have two gatekeepers, but, for example, I know WQCG-Tuscany-OSG, which I helped set up, is just a simple setup, one CE, a few nodes, no SE, or anything else. I don't know why it's listed twice. Could this all be related to the ReSS changeover last week?

http://abitibi.sbgrid.org/condor_status.txt
http://abitibi.sbgrid.org/condor_status-l.txt

NegotiatorLog:

6/23 12:07:08 ---------- Finished Negotiation Cycle ----------
6/23 12:07:08 enter Matchmaker::updateCollector
6/23 12:07:08 Trying to update collector <10.0.10.39:9618>
6/23 12:07:08 Attempting to send update via UDP to collector abitibi.sbgrid.org <10.0.10.39:9618>
6/23 12:07:08 exit Matchmaker::UpdateCollector
6/23 12:07:33 ---------- Started Negotiation Cycle ----------
6/23 12:07:33 Phase 1: Obtaining ads from collector ...
6/23 12:07:33 Getting all public ads ...
6/23 12:07:33 Trying to query collector <10.0.10.39:9618>
6/23 12:07:33 Sorting 208 ads ...
6/23 12:07:33 Getting startd private ads ...
6/23 12:07:33 Trying to query collector <10.0.10.39:9618>
6/23 12:07:33 Got ads: 208 public and 123 private
6/23 12:07:33 Public ads include 2 submitter, 174 startd
6/23 12:07:33 Entering compute_significant_attrs()
6/23 12:07:33 Leaving compute_significant_attrs() - result=JobUniverse,LastCheckpointPlatform,NumCkpts,EnteredCurrentState
6/23 12:07:33 Phase 2: Performing accounting ...
6/23 12:07:33 ERROR "Assertion ERROR on (resource_hash.insert( ResourceName, ResourceAd ) == 0)" at line 785 in file Accountant.cpp

On Jun 22, 2009, at 4:27 PM, Alan De Smet wrote:
> I'm juggling a few things right now, but I'm taking a quick look at the negotiator ASSERTing. It may be that other layers are also malfunctioning, I'm not sure, but even so it should not cause the negotiator to fail in that way. I'll try to dig a bit deeper.
>
> If this happens again, the output of "condor_status -l" might prove helpful. My current hypothesis is that the collector is misbehaving and ends up sending nonsensical data to the negotiator, which throws its hands up in the air as a result.
>
> --
> Alan De Smet Condor Project Research
> ad...@cs... http://www.cs.wisc.edu/condor/ |
From: Mats R. <ry...@re...> - 2009-06-22 20:18:12
|
Peter Doherty wrote:
>
> The osgmm.log file is entirely filled with the permission change attempts. 4000 jobs in the queue, and it checks every 2 seconds on a different job.
> the osgmm.log.1.gz is a day old, and the osgmm.log.1 file has binary data in it. Is that normal?

No, I have never seen binary data in there. Maybe related is the ReSS server changing IP address, that is why you have so many "site has been dropped from ReSS".

> Hmm... suddenly I'm wondering if that's a clue. I used 'strings' on the osgmm.log.1 file, and the last entry is at 10:42 this morning. That's about when things started to go wrong. I'll have to check if I did something that would have caused something like that. I've restarted the match maker a couple times with no success.
> I'm attaching the log file anyhow. It's 5MB. It got rejected by the mailing list...
> so it's available here:
> http://abitibi.sbgrid.org/osgmm.log.1

The permissions problem might lead to lower success rates. If OSGMM can't read the log files, some errors will not be picked up and you might end up sending a lot of jobs to a broken site.

--
Mats Rynge
Renaissance Computing Institute <http://www.renci.org> |
From: Peter D. <do...@cr...> - 2009-06-22 20:11:45
|
On Jun 22, 2009, at 3:57 PM, Mats Rynge wrote:
>
> What version of Condor do you have? Can you provide the var/log/osgmm.log?

# condor_version
$CondorVersion: 7.2.1 Feb 18 2009 BuildID: 133382 $
$CondorPlatform: X86_64-LINUX_RHEL5 $

The osgmm.log file is entirely filled with the permission change attempts. 4000 jobs in the queue, and it checks every 2 seconds on a different job.
the osgmm.log.1.gz is a day old, and the osgmm.log.1 file has binary data in it. Is that normal?

Hmm... suddenly I'm wondering if that's a clue. I used 'strings' on the osgmm.log.1 file, and the last entry is at 10:42 this morning. That's about when things started to go wrong. I'll have to check if I did something that would have caused something like that. I've restarted the match maker a couple times with no success.

I'm attaching the log file anyhow. It's 5MB. It got rejected by the mailing list...
so it's available here:
http://abitibi.sbgrid.org/osgmm.log.1

Thanks for your help Mats. I've got meetings that will probably fill the rest of my day, so I'll have to wait until tomorrow to do much else.

> I thought we have already removed the sudo/chmod feature (it is not a great way to do it - I will remove the code for 0.7). The preferred way to do this is to have a pre script that fixes the permissions. See local-pre-job in http://osgmm.sourceforge.net/ar01s03.html#job

Okay, I'll look into that.

--Peter |
From: Peter D. <do...@cr...> - 2009-06-22 20:08:44
|
On Jun 22, 2009, at 3:57 PM, Mats Rynge wrote:
>
> What version of Condor do you have? Can you provide the var/log/osgmm.log?

# condor_version
$CondorVersion: 7.2.1 Feb 18 2009 BuildID: 133382 $
$CondorPlatform: X86_64-LINUX_RHEL5 $

The osgmm.log file is entirely filled with the permission change attempts. 4000 jobs in the queue, and it checks every 2 seconds on a different job.
the osgmm.log.1.gz is a day old, and the osgmm.log.1 file has binary data in it. Is that normal?

Hmm... suddenly I'm wondering if that's a clue. I used 'strings' on the osgmm.log.1 file, and the last entry is at 10:42 this morning. That's about when things started to go wrong. I'll have to check if I did something that would have caused something like that. I've restarted the match maker a couple times with no success.

I'm attaching the log file anyhow. It's 5MB. I'll put it up on some web space somewhere if your mail server or mine rejects it.

Thanks for your help Mats. I've got meetings that will probably fill the rest of my day, so I'll have to wait until tomorrow to do much else.

> I thought we have already removed the sudo/chmod feature (it is not a great way to do it - I will remove the code for 0.7). The preferred way to do this is to have a pre script that fixes the permissions. See local-pre-job in http://osgmm.sourceforge.net/ar01s03.html#job

Okay, I'll look into that.

--Peter |
From: Mats R. <ry...@re...> - 2009-06-22 20:02:25
|
Peter Doherty wrote:
> Okay, that eliminated the errors on the console when I launch the match maker. Thanks.
> For the moment the negotiator stopped crashing, but it stopped before I put the new jar file in, so I don't know what to make of that.
> At the moment there are no valid sites in the matchmaker, I'm going to have to look into things further to see what's going on. It seems the verification runs didn't run this afternoon.
> The matchmaker related processes don't look right to me.

What version of Condor do you have? Can you provide the var/log/osgmm.log?

> Anyhow, looking in the osgmm.log file I noticed something interesting. It tries to track jobs by their job log files, and if it can't access the file, it tries to chmod 644 the log file.
> a.) I don't know that I like the idea of the matchmaker trying to change permissions on files in people's home directories.
> b.) if it can't read the file, the odds are pretty slim it's going to be able to change permissions on the file.
>
> But I guess this is why I have so many 0's and empty columns in the various fields of condor_grid_overview. If it can't access the log files, it can't display what jobs are running where and what their status is in the condor_grid_overview output. Is that correct?

I thought we have already removed the sudo/chmod feature (it is not a great way to do it - I will remove the code for 0.7). The preferred way to do this is to have a pre script that fixes the permissions. See local-pre-job in http://osgmm.sourceforge.net/ar01s03.html#job

--
Mats Rynge
Renaissance Computing Institute <http://www.renci.org> |
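Whether unreadable job logs explain the empty columns can be checked directly, without changing any permissions. This is a minimal sketch, not part of OSGMM and not the local-pre-job mechanism itself (see the linked documentation for that); the function name `unreadable_logs` is hypothetical, and it has to be run as the account the matchmaker runs under for the result to be meaningful.

```python
import os


def unreadable_logs(paths):
    """Return the subset of `paths` that the current user cannot read.

    Missing files are reported as unreadable too, since a log that does
    not exist cannot be tailed either.
    """
    return [p for p in paths if not os.access(p, os.R_OK)]
```

Feeding it the job log paths of the queued jobs would show which users' logs the matchmaker account cannot open, and therefore which submissions need a pre script that makes the log readable.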