README for Platform Community Scheduler Framework
March, 2005
=========================
Introduction
=========================
------------------------
1.1 What is the Platform Community Scheduler Framework?
------------------------
The Platform Community Scheduler Framework (CSF) is the industry's
first comprehensive and OGSI-compliant metascheduling framework
built upon the Globus Toolkit(R) 4.0 (GT4). As part of our commitment
to the future of Grid, we are contributing CSF back to the Globus
Project(TM) to be included in all future versions of the Globus Toolkit.
------------------------
1.2 Platform CSF metascheduling services
------------------------
The Platform CSF consists of the following services:
o Job Service, for submitting, controlling and monitoring jobs
o Reservation Service, for reserving resources on the Grid
o Queuing Service, which provides a basic scheduling capability in CSF.
=========================
Building and Installing
=========================
------------------------
1 Supported Platforms:
------------------------
This release supports the following platforms:
x86 systems running Linux: Kernel 2.4.x, compiled with glibc 2.2
Tested on RedHat Linux 9.0
------------------------
2 Required packages
------------------------
1. Additional tools (required also by GT4)
- JDK 1.4 or higher (http://java.sun.com/j2se/1.4.1/index.html)
- Ant 1.5 (http://ant.apache.org/)
- JUnit 3.8.1 (http://www.junit.org/index.htm)
See the GT4 installation documentation for information about
downloading and installing these tools:
http://www-unix.globus.org/toolkit/downloads/development/
2. Platform's LSF 6.0
3. CSF packages
- CSF4.0 source package: csf-src-4.0.tar.gz
------------------------
3 System requirements
------------------------
- CPU: 1 GHz processor or faster.
- At least 512 MB of RAM and 1 GB of free disk space.
------------------------
4 Installing the CSF4.0 packages
------------------------
To install the CSF4.0 packages, complete the following steps.
1. Install LSF for CSF
o Login as root
o Uncompress and untar lsf6.0_linux2.4-glibc2.2-x86.tar.Z and follow
the steps in README
o Set up your LSF environment by sourcing either cshrc.lsf or
profile.lsf.
2. Make sure JAVA_HOME, ANT_HOME are set to the correct
installation location.
Add JAVA_HOME/bin and ANT_HOME/bin to PATH variable.
3. Install the newest GT4 packages.
See the GT4 installation guide:
http://www-unix.globus.org/toolkit/docs/development/3.9.5/admin/
4. Set up GT4.0 environment
o Make sure JAVA_HOME, ANT_HOME are set to the correct installation location.
o Add JAVA_HOME/bin and ANT_HOME/bin to PATH variable.
o cd <gt4_install_location>
o export GLOBUS_LOCATION=`pwd`
o source either etc/globus-devel-env.csh or etc/globus-devel-env.sh
5. Install CSF4.0:
o If you checked out the source from CVS:
o Change to the packaging/ directory and run this command:
. ./make-src-package
This creates the file csf-4.0-src.tar.gz.
o Run these commands as the Globus admin user (the user who installed GT4):
gpt-build csf-4.0-src.tar.gz
gpt-postinstall
o If you downloaded the package csf-4.0-src.tar.gz:
o Run these commands as the Globus admin user (the user who installed GT4):
gpt-build csf-4.0-src.tar.gz
gpt-postinstall
6. Make sure you set up the security configuration for GT4.0.
You can get certificates from Globus or set up your own CA. Certificates
are required for every user and for the host that runs GT4.0. We recommend
getting certificates from Globus, as setting up your own CA could
take some time.
For details, see section "Security Configuration" at
http://www-unix.globus.org/toolkit/docs/development/3.9.5/admin/docbook/
7. Sanity checking
o Verify your certificate by executing the following commands
% $GLOBUS_LOCATION/bin/grid-proxy-init
220:blu@dev04 /usr/local/bingfeng/gt395> grid-proxy-init
Your identity: /O=Grid/O=Globus/OU=lsf.platform.com/CN=Bingfeng Lu
Enter GRID pass phrase for this identity:
Creating proxy ........................................... Done
Your proxy is valid until: Wed Mar 9 05:02:14 2005
% $GLOBUS_LOCATION/bin/grid-proxy-info
222:blu@dev04 /usr/local/bingfeng/gt395> grid-proxy-info
subject : /O=Grid/O=Globus/OU=lsf.platform.com/CN=Bingfeng Lu/CN=880442514
issuer : /O=Grid/O=Globus/OU=lsf.platform.com/CN=Bingfeng Lu
identity : /O=Grid/O=Globus/OU=lsf.platform.com/CN=Bingfeng Lu
type : Proxy draft compliant impersonation proxy
strength : 512 bits
path : /tmp/x509up_u30107
timeleft : 11:59:36
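The timeleft line above can also be checked from a script. The following
is a minimal sketch and not part of CSF: proxy_seconds_left is a
hypothetical helper name, the sample text is copied from the
grid-proxy-info output above, and the one-hour threshold is an arbitrary
assumption.

```shell
# Convert the "timeleft" line of grid-proxy-info output (HH:MM:SS)
# to seconds. Reads the grid-proxy-info output on stdin.
# proxy_seconds_left is a hypothetical helper, not a Globus command.
proxy_seconds_left() {
    awk '$1 == "timeleft" {
        n = split($NF, t, ":")
        if (n == 3) print t[1] * 3600 + t[2] * 60 + t[3]
    }'
}

# Sample lines taken from the grid-proxy-info output shown above.
sample='strength : 512 bits
path : /tmp/x509up_u30107
timeleft : 11:59:36'

left=$(printf '%s\n' "$sample" | proxy_seconds_left)

# The 3600-second threshold is an assumption; pick your own limit.
if [ "${left:-0}" -gt 3600 ]; then
    echo "proxy OK: ${left}s left"
else
    echo "proxy expires soon; rerun grid-proxy-init"
fi
```

In real use, pipe the command output into the helper instead of the
sample text: grid-proxy-info | proxy_seconds_left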
=========================
Configuring
=========================
You must configure the following files:
- $GLOBUS_LOCATION/etc/metascheduler/resourcemanager-config.xml
- $GLOBUS_LOCATION/etc/metascheduler/metascheduler-config.xml
Those files include some example settings for you to follow.
------------------------
1. Configuring Resource Manager Factory Service
------------------------
Edit $GLOBUS_LOCATION/etc/metascheduler/resourcemanager-config.xml and specify:
name = name of cluster installed in step (1) of installation
type = currently must have value of "LSF" (without quotes)
host = host running gabd, may be same as LSF master
port = port number specified in gabd configuration (i.e., ga.conf)
The gabd configuration file is located at $LSF_ENVDIR/ga.conf and
specifies the port number. The default port number for gabd is 1966.
Note: the example resource manager configuration in the file
is commented out. You need to define one after line "-->".
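As an illustration of the four values listed above, the sketch below
prints a cluster entry from shell arguments. It is an assumption-laden
example: cluster_entry is a hypothetical helper, the element names are
modeled on the cluster entries shown in the multi-hosting section of this
README, and the values cluster1 and grid5 are placeholders. Compare the
result with the commented-out example in your own file before using it.

```shell
# Print a <cluster> entry for resourcemanager-config.xml.
# Arguments: cluster name, gabd host, gabd port (defaults to 1966,
# the default gabd port mentioned above).
# cluster_entry is a hypothetical helper, not a CSF command.
cluster_entry() (
    name=$1
    host=$2
    port=${3:-1966}
    cat <<EOF
<cluster>
<name>$name</name>
<type>LSF</type>
<host>$host</host>
<port>$port</port>
</cluster>
EOF
)

# Placeholder values; substitute your own cluster name and gabd host.
cluster_entry cluster1 grid5
```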
------------------------
2. Configuring CSF
------------------------
Edit $GLOBUS_LOCATION/etc/metascheduler/metascheduler-config.xml and specify:
GISHandle = handle of Index Service
registryHandle = handle of container registry
NOTE: Do not use localhost or the 127.0.0.1 loopback address; use the
actual IP address of the host.
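A quick way to catch a leftover loopback address is to search the file
for it. The sketch below is only an illustration: check_no_loopback is a
hypothetical helper, and the sample line written to the temporary file is
made up for the demonstration.

```shell
# Print every line of a file that mentions a loopback address.
# Exit status 0 means no loopback address was found.
# check_no_loopback is a hypothetical helper, not a CSF command.
check_no_loopback() {
    if grep -nE 'localhost|127\.0\.0\.1' "$1"; then
        return 1
    fi
    return 0
}

# Demonstration on a temporary file with a made-up handle.
tmp=$(mktemp)
printf '%s\n' 'http://localhost:8080/wsrf/services/DefaultIndexService' > "$tmp"
check_no_loopback "$tmp" || echo "replace the loopback address with the host IP"
rm -f "$tmp"
```

To check the real configuration, pass
$GLOBUS_LOCATION/etc/metascheduler/metascheduler-config.xml as the
argument.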
=========================
Testing
=========================
------------------------
1. Documentation
------------------------
NOTE: Testing was performed on GT 3.9.5 (GT395).
CSF documentation can be found in directory $GLOBUS_LOCATION/docs/metascheduler/:
- examples: contains examples for job service and reservation service
- config: configuration template for resource manager and CSF.
- api: Java API documents in Javadoc format:
$GLOBUS_LOCATION/bin/
------------------------
2. Starting LSF
-----------------------
o Login as root
o Set up LSF environment by sourcing either cshrc.lsf or
profile.lsf
o lsadmin limstartup
o lsadmin resstartup
o badmin hstartup
o $LSF_SERVERDIR/gabd -d $LSF_ENVDIR
------------------------
3. Starting GT4.0
------------------------
1. Set up GT4.0 environment
- Make sure JAVA_HOME, ANT_HOME are set to corresponding install location.
- Add JAVA_HOME/bin and ANT_HOME/bin to PATH variable.
- cd <gt4_install_location>
- export GLOBUS_LOCATION=`pwd`
- source either etc/globus-devel-env.csh or etc/globus-devel-env.sh
2. Start service container
% globus-start-container
NOTE: control will not return to the terminal for "globus-start-container",
so you will need another window for testing CSF.
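Before starting the container, it can help to confirm that the variables
from step 1 are set in the current shell. This is a minimal sketch;
missing_env is a hypothetical helper name and the check only tests that
the variables are non-empty, not that they point to valid installations.

```shell
# Report which of the variables named in step 1 are unset or empty.
# Prints the missing names; exit status is 0 only when all are set.
# missing_env is a hypothetical helper, not a Globus command.
missing_env() {
    missing=""
    for var in JAVA_HOME ANT_HOME GLOBUS_LOCATION; do
        eval "value=\${$var:-}"
        [ -n "$value" ] || missing="$missing $var"
    done
    [ -z "$missing" ] && return 0
    echo "unset:$missing"
    return 1
}

if missing_env; then
    echo "environment looks complete"
fi
```

Run the check in the same shell in which you intend to run
globus-start-container.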
------------------------
4. Using the Reservation Service
------------------------
The Reservation Service requires a resource manager that supports advance
reservation. By default, only LSF cluster administrators can make
reservations in the cluster. You can allow every user to make
reservations by defining advance reservation policies in your
LSF cluster.
See the chapter "Advance Reservation" in "Administering Platform LSF"
for more information.
o Creating a reservation by agreement
o Use the template agreement file
${GLOBUS_LOCATION}/docs/csf/example/agreement.xml,
in which you must change:
- agreement term values (ResReq or hostTerm, cpuTerm, userTerm,
startTime, endTime)
o java com.platform.metascheduler.client.ReservationServiceClient
<ReservationService_handle> createService agreement
<agreement_file> [reservation_name]
o On success, a corresponding reservation is created in the LSF
cluster. Use the "brsvs" command to query existing reservations.
NOTE: On success, this returns a handle to the reservation name.
On failure, this throws a fault.
If reservation_name is not specified, a hash code is returned.
The hash code can be used in place of the reservation_name.
o Creating a reservation by RSL
o You can use the template RSL file csf/example/rsv.xml, but
you must change:
- schema location specified in "schemaLocation" attribute.
- values for hosts, number ... etc
o java com.platform.metascheduler.client.ReservationServiceClient
<ReservationService_handle> createService rsl <rsl_file>
[reservation_name]
o java com.platform.metascheduler.client.ReservationServiceClient
<reservation_name> submit
o On success, a corresponding reservation is created in the LSF
cluster. Use the "brsvs" command to query existing reservations.
o Modifying a reservation
o java com.platform.metascheduler.client.ReservationServiceClient
<reservation_name> modify <rsl_file>
o This operation modifies the reservation request. It can only
be performed while the reservation is in CREATED status.
o Canceling a reservation
o java com.platform.metascheduler.client.ReservationServiceClient
<reservation_name> cancel
o Querying Reservation Data
o java com.platform.metascheduler.client.ReservationServiceClient
<reservation_name> getRsvData
The output is an XML document. At the top of it, you should see the
reservation information:
<GridReservation
ID="{http://10.60.39.113:8080/wsrf/services/metascheduler/ReservationService}ID_20050310_3" Status="RESERVED" UserID="bingfeng">
This reservation can be used by a job. For details, see the next section.
o Querying Reservation Status
o java com.platform.metascheduler.client.ReservationServiceClient
<reservation_name> getStatus
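The reservation ID and status reported by getRsvData can be pulled out of
the XML with standard tools. The following is a sketch under assumptions:
rsv_attr is a hypothetical helper, the sample text is the GridReservation
element shown above, and the parsing relies on grep -o and on attribute
values that contain no double quotes.

```shell
# Print the value of one attribute of the first <GridReservation> tag.
# Argument: attribute name. Reads the getRsvData XML on stdin.
# rsv_attr is a hypothetical helper, not a CSF command.
rsv_attr() {
    tr '\n' ' ' |
        grep -o '<GridReservation[^>]*>' |
        head -n 1 |
        sed -n "s/.* $1=\"\([^\"]*\)\".*/\1/p"
}

# Sample element copied from the getRsvData output shown above.
sample='<GridReservation
ID="{http://10.60.39.113:8080/wsrf/services/metascheduler/ReservationService}ID_20050310_3" Status="RESERVED" UserID="bingfeng">'

printf '%s\n' "$sample" | rsv_attr Status
printf '%s\n' "$sample" | rsv_attr ID
```

The ID value printed by the second command is the grid reservation id
that a job RSL file refers to.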
------------------------
5. Using the Job Service
------------------------
For jobs forwarded by the Platform Resource Manager Adapter, you can
verify if job service is working by checking the job status in
your LSF cluster
o Creating a job:
o java com.platform.metascheduler.client.JobServiceClient
<JobService_handle> createService rsl <rsl_file> [job_name]
RETURN: true or false, and the job_name
(Use this job_name to control the job.)
NOTE: Users need to call submit or start after the creation
in order for the job to run; otherwise, the job service
will not perform any further action. Users should also
destroy the job after it is finished.
If job_name is not specified, a hash code is returned.
The hash code can be used in place of the job_name.
The createService option "id <ID>" is not supported yet.
o Starting a job in a cluster controlled by resource manager:
o java com.platform.metascheduler.client.JobServiceClient
<job_name> start <ResourceManagerFactoryService_handle> <cluster_name>
RETURN: true or false
o java com.platform.metascheduler.client.JobServiceClient
<job_name> start <cluster_name>
RETURN: true or false
o Stopping a job:
o java com.platform.metascheduler.client.JobServiceClient
<JobServiceClient_handle> <job_name> stop
RETURN: true or false
o Resuming a job:
o java com.platform.metascheduler.client.JobServiceClient
<JobServiceClient_handle> <job_name> resume
RETURN: true or false
o Canceling a job:
o java com.platform.metascheduler.client.JobServiceClient
<JobServiceClient_handle> <job_name> cancel
RETURN: true or false
o Checking job information:
o java com.platform.metascheduler.client.JobServiceClient
<JobServiceClient_handle> <job_name> getJobData
RETURN: job data in xml format
o java com.platform.metascheduler.client.JobServiceClient
<JobServiceClient_handle> <job_name> getStatus
RETURN: job status in xml format
o Queuing a job:
The queuing service must be configured successfully, as described
in "6. Using the Queuing Service"
% java com.platform.metascheduler.client.JobServiceClient
<job_name> submit [queue_name]
If queue_name is not specified, the default queue that is
defined in the job service section is picked by job service.
o Creating a job with a grid reservation id
o Make a reservation by using Reservation service.
o Get grid reservation id
o Specify the reservation in the job's rsl_file, for example:
<metascheduler:gridReservationId>
<rsl:string>
<GridReservation ID="{http://10.60.39.113:8080/wsrf/services/metascheduler/ReservationService}ID_2005030_10"/>
</rsl:string>
</metascheduler:gridReservationId>
Queuing service dispatches the job to the cluster where the reservation is
made if grid reservation id is included. See the next section for details
of queuing service.
If the job is started by using the job service client, the clusterName and
gridReservationId defined in the job's RSL file are ignored.
o Creating a job with Gram:
For example:
$ csf-job-create rsl docs/metascheduler/examples/gram_job.xml job1
Service location:https://10.60.39.4:8443/wsrf/services/metascheduler/JobService
CreateJob Successfully: job1
o Starting a job with Gram:
For example:
$ csf-job-start job1 https://10.60.39.4:8443/wsrf/services/ManagedJobFactoryService Fork
begin to invoke start!
start(https://10.60.39.4:8443/wsrf/services/ManagedJobFactoryService, GRAMFORK) =>
GramJob
NOTE: Fork can be replaced with LSF, PBS, Multi, or Condor, but LSF,
PBS, or Condor must be installed before it can be used.
o Checking the job information:
For example:
$ csf-job-status job1
getStatus() => Pdone
$ cat /tmp/stdout
abc 34 this is an example_string Globus was here
abc 34 this is an example_string Globus was here
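The create, start and status commands above can be tied together in a
script that waits for a job to finish. The sketch below makes several
assumptions: job_state and wait_for_job are hypothetical helpers, the
status line format and the Pdone state are taken from the csf-job-status
example above, and no other state names are assumed.

```shell
# Extract the state from a line such as "getStatus() => Pdone".
# job_state and wait_for_job are hypothetical helpers, not CSF commands.
job_state() {
    sed -n 's/^getStatus() => //p'
}

# Poll csf-job-status until the job reports Pdone or the attempts run out.
# Arguments: job name, number of attempts, seconds between attempts.
wait_for_job() {
    name=$1
    tries=$2
    pause=$3
    while [ "$tries" -gt 0 ]; do
        state=$(csf-job-status "$name" | job_state)
        [ "$state" = "Pdone" ] && return 0
        tries=$((tries - 1))
        sleep "$pause"
    done
    return 1
}

# Parsing demonstration on the status line shown above.
echo 'getStatus() => Pdone' | job_state
```

For example, wait_for_job job1 20 30 would check job1 up to 20 times at
30-second intervals.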
------------------------
6. Using the Queuing Service
------------------------
o Configuring queues:
$GLOBUS_LOCATION/etc/metascheduler/metascheduler-config.xml
section queuingConfig
Each queue has its own configuration section, which includes:
- plugin: name of a class that must implement the interface
com.platform.metascheduler.impl.schedPlugin
The default plugin is always loaded even without defining plugin.
If the plugin specified does not exist or does not implement the
schedPlugin interface, it will not be loaded.
The optional throttle plugin should be configured.
- scheduleInterval: interval in seconds between scheduling
sessions. Its value is an integer between 5 and 600.
This parameter is optional. If not defined, the default value (30
seconds) is used.
- throttle: maximum number of jobs that can be dispatched in each
scheduling cycle. Its value is an integer greater than 0.
To test this, set throttle to a small value and scheduleInterval
to a larger value, then submit more jobs than the throttle allows.
You can also check how many jobs are forwarded to the LSF cluster(s).
Turn on debug in $GLOBUS_LOCATION/log4j.properties
log4j.category.com.platform.metascheduler.impl.schedThrottle=DEBUG
You should see a message like the following:
[java] 438867 [Thread-60] DEBUG com.platform.metascheduler.impl.schedThrottle
- Keep only the first 3 decisions
By default, there is no queue configured. Any job submission to a
queue will fail.
o Creating a queue - only queues that are configured in the
configuration file can be created. Any user can create queues and
any queue can be used by all users. If you submit a job to a
queue that does not already exist, the queue is created automatically.
Syntax:
% java com.platform.metascheduler.client.QueuingServiceClient
<QueuingService_handle> create <queue_name>
Return: handle to the created queue if operation is successful.
o Checking one specific queue:
o Get queue data through queuing service client
Syntax:
% java com.platform.metascheduler.client.QueuingServiceClient
<QueuingService_handle> getQueueData <queue_name>
Return: queue name and status
o Get queue configuration through queuing service client:
Syntax:
% java com.platform.metascheduler.client.QueuingServiceClient
<QueuingService_handle> getQueueConfigParams <queue_name>
Return: queue configuration parameters.
o Queuing a job
o Jobs specifying a cluster name. The queuing service honors
the ClusterName specified in the job RSL file and dispatches it.
If it is incorrect, the job is not scheduled.
1 - create a job by using job service
2 - submit the job to the queue
o Jobs without specifying any cluster. The queuing service assigns
a cluster for the job and dispatches it.
1 - create a job by using job service
2 - submit the job to the queue
See "Using the Job Service" for job creation and job
submission to queue
o Jobs with reservation. Queuing service dispatches the job
to the cluster where the first reservation is made.
If the job does not specify clusterName or gridReservationId in
its RSL file, the queuing service dispatches the job to the
available Resource Manager Factory Service and Gram factories
in round-robin order.
o Debugging the queuing service - turn on the following flags in
$GLOBUS_LOCATION/container-log4j.properties :
- log4j.category.com.platform.metascheduler.impl.QueuingServiceImpl=DEBUG
- log4j.category.com.platform.metascheduler.impl.schedPluginDefault=DEBUG
- log4j.category.com.platform.metascheduler.impl.schedThrottle=DEBUG
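When choosing throttle and scheduleInterval values for a test, a rough
estimate of the dispatch time is useful. The sketch below is an
illustration only: valid_interval and dispatch_seconds are hypothetical
helpers, the 5 to 600 range is treated as inclusive, and the estimate
assumes that exactly "throttle" jobs leave the queue in every scheduling
cycle.

```shell
# Succeed only if a scheduleInterval value is an integer from 5 to 600.
# The inclusive bounds are an assumption based on the description above.
valid_interval() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge 5 ] && [ "$1" -le 600 ]
}

# Estimate the seconds needed to dispatch a backlog of jobs.
# Arguments: job count, throttle, scheduleInterval (default 30).
dispatch_seconds() {
    njobs=$1
    throttle=$2
    interval=${3:-30}
    # Round up: a partly filled last cycle still takes a full interval.
    cycles=$(( (njobs + throttle - 1) / throttle ))
    echo $(( cycles * interval ))
}

# 10 queued jobs, throttle 3, scheduleInterval 60:
# ceil(10 / 3) = 4 cycles, so about 4 * 60 = 240 seconds.
dispatch_seconds 10 3 60
```

With these numbers the debug log should show the throttle keeping only
the first 3 decisions in each cycle.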
=========================
CSF Commands & APIs
=========================
In order to make CSF easy for end users to use, the CSF package
includes several sample commands to wrap around clients for Job
Service, Reservation Service, and Queuing Service.
All the commands are located in $GLOBUS_LOCATION/bin. For
command usage, use "help" as an argument.
------------------------
1. Job service commands
------------------------
% csf-job-cancel: cancel a job service instance
% csf-job-create: create a job service instance
% csf-job-data: get detailed job information
% csf-job-resume: resume a suspended job
% csf-job-start: start a job in a specific resource manager
% csf-job-status: get job status
% csf-job-stop: suspend a job
% csf-job-submit: submit the job to queuing service
------------------------
2. Reservation service commands
------------------------
% csf-rsv-cancel cancel an existing reservation
% csf-rsv-create create a reservation service instance
% csf-rsv-data get reservation information
% csf-rsv-status get reservation status
% csf-rsv-submit submit a created reservation to start reservation booking
------------------------
3. Queuing service commands
------------------------
% csf-queue-conf get queue configuration
% csf-queue-create create a queuing service instance
% csf-queue-data get queue information
Job service, reservation service and queuing service are Grid
services. Here we list published APIs of those services. For details,
see the WSDL files under directory
$GLOBUS_LOCATION/schema/metascheduler and the Javadoc information
under directory ${GLOBUS_LOCATION}/docs/csf/api.
------------------------
4. Job service APIs
------------------------
- String start(String factoryUri, String clusterName, String rmType, String rmRsvId)
Start a job in a specific cluster
- boolean cancel()
Cancel a job
- boolean resume()
Resume a suspended job
- boolean stop()
Suspend a job
- String getJobData()
Get detailed job information
- String getStatus()
Get job status
------------------------
5. Reservation service APIs
------------------------
- String getRsvData()
Get reservation information
- boolean submit()
Submit a reservation request for reservation booking
- boolean modify(String rsl)
Modify a reservation request before the request is submitted
- String getStatus()
Get reservation status
------------------------
6. Queuing Service APIs
------------------------
- QueueDataType getQueueData()
Get queue information
- QueueConfigParamsType getQueueConfigParams()
Get queue configuration
- String submit(QueuingRequestType submitRequest)
Submit a job to current queue
- boolean remove(QueuingRequestType removeRequest)
Remove a job from current queue
=========================
Support for Multiple GT4.0 Hosting Environments
=========================
------------------------
1. The concept of multiple GT4.0 hosting environments
------------------------
A multiple GT4.0 hosting environment consists of several Globus
environments (each host with the GT software installed counts as one
Globus environment) that trust the same CA, which signs all of their
certificates. We call such a group of Globus environments a "Community"
or a "Virtual Organization" (VO).
The multi-hosting collaboration of CSF allows hostA to send its jobs to
hostB, where they run in the RM (resource manager) of hostB's Globus
environment.
Each community needs one or more Center Index Servers. The Center Index
Server stores the "epr" entries that describe the RMs of the community.
Note that CSF cannot be installed on the Center Index Server.
------------------------
2. Required configuration
------------------------
For instance: Grid10 (IP: 10.60.39.3) and Grid5 (IP: 10.60.39.113) are the
host computers of a Community, and Grid7 (IP: 10.60.39.4) is the Center
Index Server of the Community.
o Configure Grid10:
o In the file $GLOBUS_LOCATION/etc/metascheduler/metascheduler-config.xml
(in the ReservationConfig section), configure the Community Index Service handle.
For instance:
<reservation:CommunityGISHandle value=
" https://10.60.39.3:8443/wsrf/services/DefaultIndexService"/>
o In the file $GLOBUS_LOCATION/etc/globus_wsrf_mds_index/hierarchy.xml,
configure the VO Index Service handle.
For instance:
<upstream>https://10.60.39.4:8443/wsrf/services/DefaultIndexService</upstream>
o Configure Grid5:
The configuration process is the same as for Grid10.
o In the file $GLOBUS_LOCATION/etc/globus_wsrf_mds_index/hierarchy.xml,
configure the Index Service handles of the other hosts in the VO.
For instance:
<downstream>https://10.60.39.3:8443/wsrf/services/DefaultIndexService</downstream>
<downstream>https://10.60.39.113:8443/wsrf/services/DefaultIndexService</downstream>
------------------------
3. Submitting a job in the multiple GT4.0 hosting environment
------------------------
o Configure Grid5's $GLOBUS_LOCATION/etc/metascheduler/resourcemanager-config.xml
For instance:
<cluster>
<name> cluster1 </name>
<type> LSF </type>
<host> grid5 </host>
<port> 1966</port>
</cluster>
o Configure Grid10's $GLOBUS_LOCATION/etc/metascheduler/resourcemanager-config.xml
For instance:
<cluster>
<name> cluster2 </name>
<type> LSF </type>
<host> grid10 </host>
<port> 1976</port>
</cluster>
NOTE: Grid5 and Grid10 must be configured with different cluster names.
o Configure Grid5's job.xml
For instance:
<metascheduler:clusterName>
<rsl:string>
<rsl:stringElement value="cluster2"/>
</rsl:string>
</metascheduler:clusterName>
</metascheduler:job>
NOTE: Grid5 is configured with cluster1, but job.xml assigns the job to
cluster2 (configured on Grid10), so a job created on Grid5 runs on Grid10
when it is submitted.
o Create job1 on Grid5:
[globus@grid5 gt395]$ csf-job-create rsl /tmp/job.xml job1
Service location:https://10.60.39.113:8443/wsrf/services/metascheduler/JobService
CreateJob Successfully: job1
o Submit job1, which then runs on Grid10:
[globus@grid5 gt395]$ csf-job-submit job1 normal
submit(normal) => https://10.60.39.113:8443/wsrf/services/metascheduler/JobService
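Because the two hosts must use different cluster names, it can be worth
comparing their configuration files before submitting jobs. The sketch
below is an assumption-based illustration: cluster_names and
distinct_clusters are hypothetical helpers, the <name> element follows
the cluster entries shown above, and names inside commented-out examples
are also picked up, so remove or ignore those when reading the result.

```shell
# Print the cluster names declared in a resourcemanager-config.xml file,
# one per line, with surrounding whitespace trimmed.
# cluster_names and distinct_clusters are hypothetical helpers.
cluster_names() {
    sed -n 's/.*<name>[[:space:]]*\([^<]*[^<[:space:]]\)[[:space:]]*<\/name>.*/\1/p' "$1"
}

# Succeed only if the two files share no cluster name.
distinct_clusters() {
    a=$(cluster_names "$1" | sort -u)
    b=$(cluster_names "$2" | sort -u)
    dup=$(printf '%s\n%s\n' "$a" "$b" | sort | uniq -d)
    [ -z "$dup" ]
}

# Demonstration with copies of the two entries shown above.
dir=$(mktemp -d)
printf '<cluster>\n<name> cluster1 </name>\n</cluster>\n' > "$dir/grid5.xml"
printf '<cluster>\n<name> cluster2 </name>\n</cluster>\n' > "$dir/grid10.xml"
if distinct_clusters "$dir/grid5.xml" "$dir/grid10.xml"; then
    echo "cluster names are distinct"
fi
rm -rf "$dir"
```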
------------------------
4. Starting a job in an LSF cluster through Resource Manager Factory service
------------------------
o Create a job in the GT4.0 environment on hostA:
% csf-job-create rsl job.xml job1
o Start the job in an LSF cluster through Resource Manager Factory
service:
% csf-job-start job1 http://hostB:8080/wsrf/services/metascheduler/ResourceManagerFactoryService clustername
=========================
Uninstallation
=========================
To undeploy and uninstall the CSF package:
Run this command in the directory $GLOBUS_LOCATION as globus admin user
ant -f share/globus_wsrf_common/build-packages.xml undeployGar -Dgar.id=metascheduler
rm -rf etc/gpt/packages/setup/csf
rm -rf etc/gpt/packages/csf
=========================
Contact Information
=========================
Please send all questions and comments to support@platform.com,
or phone toll-free 1-877-444-4LSF (+1 877 444 4573)
=========================
Copyright
=========================
Copyright 1994-2003 Platform Computing Corporation.
All rights reserved.
Although the information in this document has been carefully
reviewed, Platform Computing Corporation ("Platform") does not
warrant it to be free of errors or omissions. Platform reserves the
right to make corrections, updates, revisions or changes to the
information in this document.
UNLESS OTHERWISE EXPRESSLY STATED BY PLATFORM, THE PROGRAM DESCRIBED
IN THIS DOCUMENT IS PROVIDED "AS IS" AND WITHOUT WARRANTY OF ANY
KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE. IN NO EVENT WILL PLATFORM COMPUTING BE LIABLE TO
ANYONE FOR SPECIAL, COLLATERAL, INCIDENTAL, OR CONSEQUENTIAL
DAMAGES, INCLUDING WITHOUT LIMITATION ANY LOST PROFITS, DATA OR
SAVINGS, ARISING OUT OF THE USE OF OR INABILITY TO USE THIS PROGRAM.
LSF is a registered trademark of Platform Computing Corporation in
the United States and in other jurisdictions.
ACCELERATING INTELLIGENCE, THE BOTTOM LINE IN DISTRIBUTED COMPUTING,
PLATFORM COMPUTING, and the PLATFORM and LSF logos are trademarks of
Platform Computing Corporation in the United States and in other jurisdictions.
UNIX is a registered trademark of The Open Group.
Other products or services mentioned in this document are the
trademarks of their respective owners.
=========================
Security Considerations
=========================
[the Security_Considerations_Frag.html is embedded below.]
=========================
Troubleshooting
=========================
o If the client or server reports the error "connect denied", check
whether LSF is started.
o If you cannot create a queue, check whether you have
configured the queue in the file
$GLOBUS_LOCATION/etc/metascheduler/metascheduler-config.xml
==============================
END of README
Last update: Wednesday March 30 2005
==============================