ThunderstormDistributor Code

Distribute jobs to compute nodes on dynamic clusters

Status: Beta

Brought to you by: dtulga

File	Date	Author	Commit
TSDistributor	2014-03-10	David Tulga	[a171fc] ThunderstormDistributor 0.6 Beta
TSDistributor-Release	2014-03-10	David Tulga	[a171fc] ThunderstormDistributor 0.6 Beta
INSTALL	2014-03-10	David Tulga	[a171fc] ThunderstormDistributor 0.6 Beta
LICENSE	2014-03-10	David Tulga	[a171fc] ThunderstormDistributor 0.6 Beta
README	2014-03-10	David Tulga	[a171fc] ThunderstormDistributor 0.6 Beta
TSDistributor_0.6_2014_3_10.tar.gz	2014-03-10	David Tulga	[a171fc] ThunderstormDistributor 0.6 Beta

Read Me



ThunderstormDistributor 0.6 Beta (2014-3-9)
by David Tulga
Originally designed for the Wall Lab at Harvard CBMI


See INSTALL for custom installation instructions (for use with RainforestCluster, no additional installation is required)
See LICENSE for license text (Open Source MIT-style license)


== Description ==

ThunderstormDistributor is a queuing system that distributes jobs and computational workload across dynamic clusters in the cloud. It manages the assignment of jobs to maximize CPU and memory usage and prevent oversubscription of compute nodes. It also performs advanced statistics collection on individual compute nodes and jobs to graph the distribution of disk, network, CPU, and memory usage over time, which facilitates the advanced optimization and tuning of computational workflows.

This version of ThunderstormDistributor includes job queuing across multiple machines on dynamic clusters, including user-specific execution, memory, CPU, real time, and wall time reservations and limits, and full statistics collection functionality. It also supports dynamic resource allocation, to prevent node oversubscription, and job management, including login, cancellation, and termination. RainforestCluster can set up ThunderstormDistributor for a custom cluster on Amazon's EC2 cloud service. ThunderstormDistributor can also run on other clusters, including local or static clusters; however, manual setup is required (see INSTALL for custom install instructions). It does not have any Amazon dependencies, only requiring the Qt library and Linux (Kernel 2.6.18+). As it is in beta, debug logging is turned on, and it only supports one queue.


See: http://www.davidtulga.com/thunderstorm.htm for full information and a statistics viewer demo



== Usage ==


TSDistributor
Use this command to start the supervisor and worker processes
It is also the command that is called for the management aliases
(Global Modes Are: --supervisor, --worker, --submit, --jobs, --hosts, --cancel, --kill, --admin)
(Also available: --version, --help, and <global_mode_option> --help)

TSDistributor <global_mode_option> <additional_options>

For all commands, additional options available are:
	--config --config_file <config_file_path> : Load a different config file than the default
	--no_config : Don't load the default config file
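
As a minimal sketch of the command shape above, the following Python helper assembles a TSDistributor argument list without running it. The helper name build_tsdistributor_argv and the config file path are hypothetical; only the flags themselves come from this README.

```python
import shlex
from typing import List, Optional, Sequence


def build_tsdistributor_argv(
    mode: str,
    options: Sequence[str] = (),
    config_file: Optional[str] = None,
    no_config: bool = False,
) -> List[str]:
    """Assemble: TSDistributor <global_mode_option> <additional_options>."""
    argv = ["TSDistributor", mode, *options]
    if config_file is not None:
        # --config <path>: load a different config file than the default
        argv += ["--config", config_file]
    if no_config:
        # --no_config: don't load the default config file
        argv.append("--no_config")
    return argv


# Hypothetical example: list running jobs using an alternate config file.
argv = build_tsdistributor_argv("--jobs", ["--running"], config_file="/tmp/ts.conf")
print(shlex.join(argv))
```

The resulting list could be passed to subprocess.run on a machine where TSDistributor is installed; the sketch only builds and prints the command.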

--supervisor --super -s
Start in supervisor mode (ONLY ONE PER CLUSTER!)
Also: tssupervisor tssuper
Supervisor Mode Options:
	--cluster_name : Set the cluster name
	--super_worker_on : Start a worker on the supervisor node (Default)
	--super_worker_off : Don't start a worker on the supervisor node
	--network_failure_timeout <seconds> : Set the network timeout until a node is determined to have gone down, Default is 60 seconds
	--workers_die_on_disconnect_on : Workers die on network disconnect (Default)
	--workers_die_on_disconnect_off : Workers do not die on network disconnect
	--ram_limit --memory_limit -R : Set the default memory max per job for all jobs, Default is to set to the max ram limit of the smallest node in the cluster
	--ram_limit_absolute --memory_limit_absolute -Ra : Set a maximum absolute memory usage for all jobs (individually)
	--wall_limit -W : Set the default cpu wall time max per job for this worker, Default is no limit
	--wall_limit_absolute -Wa : Set a maximum absolute cpu wall time limit for all jobs
	--time_limit -T : Set the default real time max per job for this worker, Default is no limit
	--time_limit_absolute -Ta : Set a maximum absolute real time limit for all jobs


Also, for all commands except supervisor, either in the config file or by command line:
	--supervisor_location : Set the supervisor location to connect to; can be either an /etc/hosts name, DNS domain name, or IP address

--worker --w
Start in worker mode (Recommended one per node; the supervisor process already starts one automatically)
Also: tsworker tsnode
Worker Mode Options (Also for worker on supervisor if enabled):
	--machine -m <node_name> : Set the machine name, otherwise auto-set to the hosts/shell name
	--n_cores -n : Set the number of cores (normally auto-detected to the total # of cores on the machine)
	--total_ram_limit --total_memory_limit -Rt : Set the maximum total memory for all jobs running on this worker, Default is 90% of the RAM on the machine or all memory - 1 GiB, whichever is less (No jobs requesting more memory than this limit can be started on this worker)
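
The default for --total_ram_limit (90% of the machine's RAM or all memory - 1 GiB, whichever is less) can be worked through in a few lines of Python. This is a sketch only: the function name is hypothetical, and MiB units with round-down integer arithmetic are assumptions not stated in this README.

```python
GIB_IN_MIB = 1024


def default_total_ram_limit_mib(machine_ram_mib: int) -> int:
    """Smaller of 90% of RAM and (RAM - 1 GiB), in MiB.

    Behaviour on machines with 1 GiB of RAM or less is not described in the
    README, so this sketch does not special-case it.
    """
    ninety_percent = machine_ram_mib * 9 // 10  # rounding down is an assumption
    all_minus_one_gib = machine_ram_mib - GIB_IN_MIB
    return min(ninety_percent, all_minus_one_gib)


for ram in (4096, 10240, 16384):
    print(ram, default_total_ram_limit_mib(ram))
```

Under these assumptions, the "all memory - 1 GiB" rule is the smaller value below 10 GiB of RAM, and the 90% rule is the smaller value above it.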



== Cluster and Job Management ==

tsub : Submit a job
(Also: tssub)
(Alias for TSDistributor --submit)
tsub <options> <command line>
tsub <options> '<command line>'
tsub <options> "<command line>"
Options:
	--machine -m <node_name> : Execute only on a given machine or machine group; Default is any; Special are: any, supervisor, super, workers
	--n_cores -n <# of cores> : Request this many cores be dedicated to this job, Default is 1
	--ram --ram_est --memory --memory_est -r <memory/ram usage estimate in MiB> : Request this much memory be dedicated to this job, Default is 0 (no estimate)
	--ram_limit --memory_limit -R <memory/ram hard limit in MiB> : Set a maximum memory usage for this job, where it will be terminated if it exceeds this, Default is memory estimate * 2 if set, otherwise set to the default job max for the worker
	--wall --wall_est --time --time_est -w -We <cpu wall time estimate in minutes> : Request this much cpu time for the job, Default is 0 (no estimate)
	--wall_limit --time_limit -W <cpu wall time hard limit in minutes> : Set a maximum cpu wall time for this job, where it will be terminated if it exceeds this, Default is wall time estimate * 2 if set, otherwise set to the default job max for the worker
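
The default rule for the hard limits above (the estimate * 2 if an estimate is set, otherwise the worker's default job max) can be sketched in Python. The function and parameter names are hypothetical, and any further capping by the supervisor's absolute limits is not modeled here.

```python
from typing import Optional


def effective_limit(explicit_limit: Optional[int], estimate: int, worker_default: int) -> int:
    """Resolve a job's hard limit (memory in MiB or wall time in minutes)."""
    if explicit_limit is not None:
        # A limit given on the command line (-R or -W) is used as-is.
        return explicit_limit
    if estimate > 0:
        # An estimate of 0 means "no estimate"; otherwise default to twice the estimate.
        return estimate * 2
    # No limit and no estimate: fall back to the worker's default job max.
    return worker_default


print(effective_limit(None, 2048, 8192))  # estimate only
print(effective_limit(None, 0, 8192))     # neither given
print(effective_limit(3000, 2048, 8192))  # explicit limit given
```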

tjobs : List jobs
(Also: tsjobs)
(Alias for: TSDistributor --jobs)
tjobs <options> <job_id>
	--job -j <job_id> : List information about only job id or job group name given, this is assumed as the first argument even without an option flag
	--machine -m <node_name> : List jobs only on a given machine, special groups work here as well (also includes jobs queued for a given machine / group only?)
	--running -R : List only running jobs
	--queued -Q : List only queued jobs
	--long --detail -l : List extra information in long format

thosts : List machines
(Also: thost tshost tshosts)
(Alias for: TSDistributor --hosts)
thosts <options> <host name(s)>
	--machine -m <node_name> : List only the given machine, or group, just writing the host or group name without this option works too
	--job -j <job_id> : List information about the machine running the job id or job group name given
	--running -R : List only hosts executing jobs
	--inactive --idle -I : List only hosts not executing anything
	--long --detail -l : List extra information in long format

tcancel : Cancel a Job (Only works when job is queued, not when running or suspended)
(Also: tscancel)
(Alias for: TSDistributor --cancel)
tcancel <job_id>
	--job -j <job_id> : Cancel the specified job id (assumed as the first argument even without the flag); any or all cancels all queued jobs
	--machine -m <node_name> : Cancel jobs queued only for the given machine/group, if the job_id is omitted, will cancel all jobs on this machine/group (any or all cancels all queued jobs)
(Will only cancel the job if it is queued, otherwise returns failure)

tkill : Kill/forcibly terminate a Job
(Also: tskill)
(Alias for: TSDistributor --kill)
tkill <job_id>
	--job -j <job_id> : Kill the specified job id (assumed as the first argument even without the flag); any or all kills all jobs
	--machine -m <node_name> : Kill jobs queued only for the given machine/group, if the job_id is omitted, will kill all jobs on this machine/group (any or all kills all jobs)

tadmin : Administer the cluster
(Also: tsadmin)
(Alias for: TSDistributor --admin)
tadmin <options>
	--shutdown_cluster : Shut down the entire cluster
	--shutdown : Shut down the given machine/worker (will shut down the super worker if super is given; use --shutdown_cluster or --shutdown all to shut everything down)


any limit = 0 always means no limit
job_id = 0 is a wildcard
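
The two conventions above can be expressed as small Python predicates. This is an illustrative sketch with hypothetical function names, not code from ThunderstormDistributor itself.

```python
def exceeds_limit(usage: int, limit: int) -> bool:
    """A limit of 0 means no limit, so it can never be exceeded."""
    return limit != 0 and usage > limit


def job_matches(selector: int, job_id: int) -> bool:
    """A selector of 0 is a wildcard and matches every job id."""
    return selector == 0 or selector == job_id


print(exceeds_limit(5000, 0))     # no limit set
print(exceeds_limit(5000, 4096))  # over a 4096 limit
print(job_matches(0, 17))         # wildcard selector
print(job_matches(12, 17))        # different job id
```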
