This README provides documentation on adding Graphics Support to a High
Throughput Computing Environment managed by Condor.
=================================
CONTENTS
=================================
/condor_config.local (Sample condor_config.local file)
/README (This README)
/samples/ (Sample Condor submission files)
/gpuQuery.submit
/gpuQueryLogs/ (Contains log files for gpuQuery.submit)
/tests/ (Graphics card tests provided by CUDA)
/deviceQuery (test obtaining information through CUDA)
(params: none)
/matrixMul (test performing matrix multiplication)
(params: size of matrix x, size of matrix y)
/script/ (Contains the script and executable needed to identify graphics cards)
/gpu.sh (script to get information about Graphics Cards)
/cudaQuery (executable to obtain information through CUDA)
=================================
INSTALLATION
=================================
_______LINUX_________
ADDING GRAPHICS CARD DISCOVERY TO CONDOR
1. Test the script, which is located at "script/gpu.sh". In order to obtain
information about graphics cards, the condor user must be able to run the
lspci command. For detailed information about NVIDIA CUDA-capable graphics
cards, the condor user must also be granted read and write access to the
graphics card device files, which are located at /dev/nvidiactl and
/dev/nvidia*
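Before running the script, the prerequisites can be checked by hand (a
minimal sketch; it assumes the daemon runs as the "condor" account and that
gpu.sh was placed in /opt/condor-gpu, adjust both to your setup):

    # confirm the condor user can run lspci and see the cards
    su -s /bin/sh -c "lspci | grep -i nvidia" condor

    # confirm the condor user can read/write the NVIDIA device files
    ls -l /dev/nvidiactl /dev/nvidia*

    # run the discovery script by hand and compare with the sample below
    su -s /bin/sh -c "/opt/condor-gpu/gpu.sh" condor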
Sample script output:
HasGpu = True
NGpu = 2
Gpu0 = "Quadro FX 3700"
Gpu0CudaCapable = True
Gpu0Mem = 536150016
Gpu0Procs = 14
Gpu0Cores = 112
Gpu1 = "Quadro FX 3700"
Gpu1CudaCapable = True
Gpu1Mem = 536608768
Gpu1Procs = 14
Gpu1Cores = 112
HasCuda = True
CudaRelease = V1.1
CudaVersion = V0.2.1221
2. "condor_config.local" contains code to add cronjob into the machine's
condor local configuration file. Copy the cronjob code into the condor
local configuration file, which is located by default at:
/var/lib/condor/condor_config.local
Place gpu.sh script and cudaQuery binary into location accessible to condor
user.
Cronjob code:
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST), UPDATEGPUINFO
STARTD_CRON_UPDATEGPUINFO_EXECUTABLE = /DIRECTORY/TO/SCRIPT/gpu.sh
STARTD_CRON_UPDATEGPUINFO_PERIOD = 1m
STARTD_CRON_UPDATEGPUINFO_MODE = Periodic
STARTD_CRON_UPDATEGPUINFO_KILL = True
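One possible way to stage the files and point the cron job at them (a
sketch only; /opt/condor-gpu is an assumed install directory, not part of
this package):

    # copy the discovery tools somewhere readable by the condor user
    mkdir -p /opt/condor-gpu
    cp script/gpu.sh script/cudaQuery /opt/condor-gpu/
    chmod 755 /opt/condor-gpu/gpu.sh /opt/condor-gpu/cudaQuery

    # then append the cron job code above to the local configuration and set
    #   STARTD_CRON_UPDATEGPUINFO_EXECUTABLE = /opt/condor-gpu/gpu.sh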
3. Restart condor daemons on local machine with command:
`/sbin/service condor restart`
Restarting the daemons activates the cron job, and information about the
GPUs will then be published in the machine's class ad.
4. To check that information is in the condor_collector run the command:
`condor_status -constraint HasGpu`
This command will display the machines whose class ads satisfy the
constraint HasGpu.
Note: It may take a few minutes for the machine's class-ad to be sent.
5. To view the class-ads, type the command:
`condor_status (MACHINE ADDRESS) -long`
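condor_status can also print just the GPU attributes of interest rather than
the full class ad (a sketch; the attribute names are the ones set by gpu.sh
in the sample output above):

    condor_status -constraint 'HasGpu && Gpu0CudaCapable' \
        -format "%s  " Machine \
        -format "%s  " Gpu0 \
        -format "%d\n" Gpu0Mem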
PREPARING CONDOR TO RUN CUDA JOBS
In order to run CUDA jobs in the Condor environment, submitting/running
users must be granted access to read/write the devices.
The devices that need to be accessed are located in /dev/nvidia*
These users could be:
Nobody (open access)
Controlled by Unix group (limited users; see the sketch after this list)
Integrated with Condor user control (slot users)
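As an example of the Unix-group approach (a minimal sketch; the group name
"gpuusers", the user "alice", and the udev rule path are placeholders,
adapt them to your site's policy):

    # create a group and add the users allowed to run CUDA jobs
    groupadd gpuusers
    usermod -a -G gpuusers alice

    # grant the group read/write access to the NVIDIA device files
    chgrp gpuusers /dev/nvidiactl /dev/nvidia*
    chmod 660 /dev/nvidiactl /dev/nvidia*

    # to keep the permissions across reboots, a udev rule such as
    # /etc/udev/rules.d/90-nvidia.rules can set them automatically:
    #   KERNEL=="nvidia*", GROUP="gpuusers", MODE="0660"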
_______MAC OS X______
To be implemented...
_______WINDOWS_______
To be implemented...
=================================
JOB SUBMISSION TESTING
=================================
To submit jobs to Condor, users must first be granted access to the GPU
devices (see PREPARING CONDOR TO RUN CUDA JOBS).
Tests are provided in the samples directory. The "gpuQuery.submit" file
contains requirements that locate machines on which GPU identification has
been successfully set up. Prior to testing, ensure that these requirements
match the attributes actually advertised in your cluster.
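The requirements expression to check and adjust looks roughly like the one
in this sketch (placeholder paths and settings, not the exact contents of
"gpuQuery.submit"):

    universe     = vanilla
    executable   = tests/deviceQuery
    requirements = (HasGpu == True) && (Gpu0CudaCapable == True)
    output       = gpuQueryLogs/out.$(Cluster).$(Process)
    error        = gpuQueryLogs/err.$(Cluster).$(Process)
    log          = gpuQueryLogs/log.$(Cluster)
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue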
For your first test, I recommend running the "deviceQuery" test in order to
confirm that your Condor setup is working. If the submitting user truly has
access to the GPU, the name of the graphics card will be printed
successfully. If the user has not been properly granted access to the
graphics card, the reported device name will instead indicate that the GPU
is being emulated on the CPU.
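A typical test run might look like this (the output file name is a guess,
check "gpuQuery.submit" for the actual output path):

    # submit the test job and watch it in the queue
    condor_submit samples/gpuQuery.submit
    condor_q

    # once it finishes, look for the reported device name in the output
    cat samples/gpuQueryLogs/*.out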
=================================
EXAMPLES
=================================