From: Nathan T. <er...@cs...> - 2005-04-05 14:42:13
We would like to make OProfile work on Clustermatic, a type of Beowulf cluster (www.clustermatic.org). Such a cluster consists of a master node along with several *diskless* slave nodes, where the master node contains support for a global process space across all the nodes (BProc). For current purposes, the key point is that the slave nodes are diskless, with all file system support passing through a very small RAM disk. One way to handle large amounts of I/O is to have the master node (which has a disk) act as an NFS server for the slave nodes (where an NFS mount point would exist in each node's RAM disk).

Our problem is that OProfile currently invariably stores configuration information in /root/.oprofile/daemonrc and samples in /var/lib/oprofile.

We think we have a temporary workaround using a startup script that creates a symbolic link from /var/lib/oprofile to /NFS/oprofile/nodeXX, where /NFS is the NFS mount. Since the config file is small, we can temporarily just store it in the RAM disk.

To support Clustermatic in the long run, we would like to add a configuration option that allows samples and configuration information to be stored in a different 'base directory' (e.g. /NFS/oprofile/nodeXX). DCPI behaves in a similar fashion, and we have found the ability to choose the location of the profile database to be very useful. (We do know the OProfile databases can be moved *after* the fact using oparchive, but that doesn't address our core problem of diskless nodes.)

Since we'd ultimately like any work we do to make its way into the official OProfile sources, we wanted to get your comments and blessing on the proposal.

John Mellor-Crummey
Rob Fowler
Nathan Tallent
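The temporary symlink workaround described above might look something like the following sketch. The node-naming scheme and the ROOT staging variable are assumptions for illustration (ROOT lets the script be exercised without touching the real filesystem; on a real node it would be empty so the paths become /var/lib/oprofile and /NFS/oprofile):

```shell
#!/bin/sh
# Sketch of a per-node startup script that redirects OProfile's
# compiled-in sample directory onto the NFS mount before the daemon
# starts. ROOT and the node-naming scheme are illustrative assumptions.
ROOT="${ROOT:-/tmp/opdemo}"

# Derive a per-node name from the host name, e.g. host "n23" -> "node23"
NODE="node$( (hostname 2>/dev/null || uname -n) | tr -cd '0-9')"
NFS_BASE="$ROOT/NFS/oprofile"

mkdir -p "$NFS_BASE/$NODE"        # per-node sample directory on NFS
mkdir -p "$ROOT/var/lib"

rm -rf "$ROOT/var/lib/oprofile"   # drop the local sample directory...
ln -s "$NFS_BASE/$NODE" "$ROOT/var/lib/oprofile"   # ...and point it at NFS
```

On a real node this must run before opcontrol/oprofiled start, so the daemon follows the symlink from its first write onward.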
From: William C. <wc...@re...> - 2005-04-05 19:58:30
Nathan Tallent wrote:
>
> We would like to make OProfile work on Clustermatic, a type of beowulf
> cluster (www.clustermatic.org). Such a cluster consists of a master
> node along with several *diskless* slave nodes where the master node
> contains support for a global process space across all the nodes
> (BProc). For current purposes, the key is that the slave nodes are
> diskless, with all file system support passing through a very small RAM
> disk. One way to handle large amounts of I/O is to have the master node
> (which has a disk) act as a NFS server for the slave nodes (where a NFS
> mount point would exist in each node's RAM disk).
>
> Our problem is that OProfile currently invariably stores configuration
> information in /root/.oprofile/daemonrc and samples in /var/lib/oprofile.
>
> We think we have a temporary workaround using a start up script that
> creates a symbolic link from /var/lib/oprofile to /NFS/oprofile/nodeXX
> where /NFS is the NFS mount. Since the config file is small, we can
> temporarily just store it in the RAM disk.
>
> To support Clustermatic in the long run, we would like to add a
> configuration option that allows samples and configuration information
> to be stored in a different 'base directory' (e.g.
> /NFS/oprofile/nodeXX). DCPI behaves in a similar fashion and we have
> found the ability to choose the location of the profile database to be
> very useful. (We do know the OProfile databases can be moved *after*
> the fact using oparchive, but it doesn't address our core problem of
> diskless nodes.)

There would need to be some modifications in op_mangle_filename(), which converts the file name into a path for the sample file. Right now the path to the current sample directory is compiled in.

oparchive cheats by making a tree structure that mimics the original file tree on the machine taking the data. oparchive includes the needed binaries. Thus, it and the analysis tools just prepend the path to the tree.
For the clusters, the approach used by oparchive leaves something to be desired. It would be preferable that the software not make a copy of the executable for each of the nodes in the cluster. That would be a waste in a single-image environment.

Are the cluster's processors homogeneous? OProfile currently expects that all the processors in the machine have the same performance events. It would be quite possible to build heterogeneous clusters, e.g. Pentium M and Pentium 4. Even processors of the same architecture can have different clock rates. This would affect event selection and analysis.

What kind of analysis is being done on the collected data? Accumulating samples for a function across all the processors? Just looking at the performance of individual nodes? Finding which nodes were outliers with many more (or fewer) samples than the average?

Each node would need its own sample directory. How many nodes are in the clusters? I am just wondering if there are going to be issues with having tens of thousands of directories in a single directory and having lots of open file descriptors for the processing nodes. What about bandwidth issues of saving the sample files off the processing nodes?

How do you start up the tasks on the processor nodes? I would like to know how each process gets its unique directory. Does a node compute the name locally based on its processor name?

> Since we'd ultimately like any work we do to make its way into the
> official OProfile sources, we wanted to get your comments and blessing
> on the proposal.

It is not much fun maintaining divergent branches or patch sets to apply to existing packages. Making the work suitable to be included in the upstream package is much more desirable.

-Will
From: Rob F. <rj...@cs...> - 2005-04-05 22:04:38
My answers to Will's comments and questions are interspersed below.

-- Rob

William Cohen wrote:
> Nathan Tallent wrote:
>
>> We would like to make OProfile work on Clustermatic, a type of beowulf
>> cluster (www.clustermatic.org). Such a cluster consists of a master
>> node along with several *diskless* slave nodes where the master node
>> contains support for a global process space across all the nodes
>> (BProc). For current purposes, the key is that the slave nodes are
>> diskless, with all file system support passing through a very small
>> RAM disk. One way to handle large amounts of I/O is to have the
>> master node (which has a disk) act as a NFS server for the slave nodes
>> (where a NFS mount point would exist in each node's RAM disk).
>>
>> Our problem is that OProfile currently invariably stores configuration
>> information in /root/.oprofile/daemonrc and samples in /var/lib/oprofile.
>>
>> We think we have a temporary workaround using a start up script that
>> creates a symbolic link from /var/lib/oprofile to /NFS/oprofile/nodeXX
>> where /NFS is the NFS mount. Since the config file is small, we can
>> temporarily just store it in the RAM disk.
>>
>> To support Clustermatic in the long run, we would like to add a
>> configuration option that allows samples and configuration information
>> to be stored in a different 'base directory' (e.g.
>> /NFS/oprofile/nodeXX). DCPI behaves in a similar fashion and we have
>> found the ability to choose the location of the profile database to be
>> very useful. (We do know the OProfile databases can be moved *after*
>> the fact using oparchive, but it doesn't address our core problem of
>> diskless nodes.)
>
> There would need to be some modifications in the op_mangle_filename()
> that converts the file name into a path for the sample file. Right now
> the path to the current sample directory is compiled in.
> oparchive cheats by making a tree structure that mimics the original
> file tree on the machine taking the data. oparchive includes the needed
> binaries. Thus, it and the analysis tools just prepends the path to the
> tree.
>
> For the clusters the approach used by oparchive leaves something to be
> desired. It would be preferable that the software doesn't make a copy of
> the executable for each of the nodes in the cluster. That would be a
> waste for a single image environment.

For our existing tools, we've assumed that sources and binaries will be available, but not necessarily at the same paths that existed either when the application was built or when it was run. Our solution is to give paths explicitly to the tools and to provide substitution rules for replacing one path prefix with another.

> Are the cluster's processors homogeneous? OProfile currently expects
> that all the processors in the machine have the same processing events.
> It would be quite possible to build heterogeneous clusters, e.g. Pentium
> M and Pentium 4. Even with same processor architecture processors can
> have different clock rates. This would affect event selection and analysis.

BProc clusters are homogeneous. We could handle heterogeneous clusters by adding architecture/implementation specificity to the scripts that start the daemon.

> What kind of analysis is being done on the collected data? Accumulating
> samples for a function across all the processors? Just looking at the
> performance of individual nodes? Finding which nodes were outliers with
> many more (or fewer) samples than the average?

Here's a typical scenario we currently use on clusters with DCPI, PAPI, or OProfile on systems with local disks:

1) On each node of the job, the batch scheduler starts the daemon, runs the user's job, and stops the daemon. (Where there are local disks, optionally run preprocessing filters in parallel on compute nodes to extract data in our format.)
   A copy step, or the optional filtering, writes data to a place like /scratch/username/jobname.number/node_xx, i.e. one directory per node. The script creates these with the right ownership and permissions.

2) On non-PAPI systems, a high-level analysis is done to look at how time is spent on each node, with breakdowns for the application, DSOs loaded by the app, MPI threads, system, etc. We're looking for gross anomalies such as nodes running dramatically slowly, or processes that shouldn't be running on these nodes.

3) An "interesting" multi-profile (line-level profile with multiple metrics) is extracted from the data for each node.

4) Statistical analyses are applied to the collection of multi-profiles to identify groups of nodes that behave similarly. This clustering can be systematic, e.g., boundary vs. interior nodes, or there can be anomalies, e.g. load imbalances, speed/heat issues, etc. (This is a current research thrust.)

5) If there are problems, do detailed browsing/analysis of representatives of the major statistical clusters and of the outliers to diagnose and fix performance problems.

All of the processing in steps 1-4 can be automated (scripts), so the user/programmer can focus on the analysis/interpretation issues.

> Each node would need its own sample directory. How many nodes are in the
> clusters? I am just wondering if there are going to be issues with
> having tens of thousands directories in a single directory and having
> lots of open file descriptors for the processing nodes? What about
> bandwidth issues of saving the samples files off the processing nodes?

On the DCPI clusters the number of nodes can be hundreds or thousands, but each compute node should only have a few file descriptors open. Whether or not the parallel file system can handle this is another issue. We worry about this and the bandwidth issue, but don't intend to spend a large amount of time on them until we know how bad the problems are.
When the data-movement cost is incurred is an issue. If data is only moved at the end of the job, then the main concern is that the system not fall over. On the other hand, instrumentation overhead that competes with the application is a problem. One important mode of operation for big, long-running applications (colliding black holes, dinosaur-killing asteroids) is to collect data for a five-minute window every couple of hours and take a look to ensure that nothing horrible has happened to performance. Printfs within the application can detect the onset of problems, but looking at profiling data is necessary for diagnosis.

> How do you start up the tasks on the processor nodes? I would like to
> know how each process gets the unique directory. Does a node compute the
> name locally based on it's processor name?

Startup on conventional clusters is via a batch script. On BProc systems, the script runs on a head node and spawns parallel processes on the compute nodes. `hostname` on compute node 23 returns "n23", so /scratch/foo/bar/`hostname` would generate a unique path. Assuming that the driving script ensures that /scratch/foo/bar exists, is mounted, and that the owner/permissions are suitable, we would then propose to run, e.g., "oprofile --destdir /scratch/foo/bar/`hostname` ..." on each node.

>> Since we'd ultimately like any work we do to make its way into the
>> official OProfile sources, we wanted to get your comments and blessing
>> on the proposal.
>
> It is not much fun maintaining divergent branches or patch sets to apply
> to existing packages. Making the work suitable to be included in the
> upstream package is much more desirable.
>
> -Will
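The hostname-derived path scheme above can be sketched as a fragment of the driving batch script. The --destdir option is the *proposed*, hypothetical flag (it does not exist in stock OProfile), so this sketch only constructs and prints the command lines rather than executing them; the /scratch layout is illustrative:

```shell
#!/bin/sh
# Sketch of per-node startup under the proposal. "--destdir" is the
# hypothetical option under discussion, so the command lines are only
# printed here, not executed. Paths are examples.
NODE="$(hostname 2>/dev/null || uname -n)"   # e.g. "n23" on compute node 23
DEST="/scratch/foo/bar/$NODE"                # unique per-node sample directory

START_CMD="opcontrol --destdir $DEST --start"
STOP_CMD="opcontrol --shutdown"

echo "mkdir -p $DEST"       # driving script guarantees parent exists/mounted
echo "$START_CMD"
echo "# ... run the user's job ..."
echo "$STOP_CMD"
```

Because every node computes DEST locally from its own hostname, no coordination is needed to keep the per-node directories distinct.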
From: Rob F. <rj...@cs...> - 2005-04-05 22:41:38
A minor fix to my previous message -- I meant "opcontrol --destdir ...".

Rob Fowler wrote:
> My answers to Will's comments and questions are interspersed below.
>
> -- Rob
>
> William Cohen wrote:
From: John L. <le...@mo...> - 2005-04-06 15:36:46
On Tue, Apr 05, 2005 at 05:41:33PM -0500, Rob Fowler wrote:
> I meant "opcontrol --destdir ..."

I don't like the name. It needs to be --profile-dir or similar; we really want to keep complete_dump, devices, etc., in /var/lib/oprofile/. This might be a good time to move oprofiled.log into the samples dir, for obvious reasons.

Note that you can use session:/path/to/sample/files/current to get the data back out.

So the changes you need are (probably incomplete):

1) add --samples-dir to opcontrol
2) add --samples-dir to oprofiled
3) add --samples-dir to oparchive
4) make oparchive fix up the samples dir in the archive back to the default (so you don't have to specify session: as well as archive: when using opreport on an archive)

john
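Putting John's pieces together, a per-node collect-and-report cycle might look like the following sketch. --samples-dir is the proposed option from the list above and does not exist in stock OProfile, so the command lines are printed rather than executed; the session: specification for reading samples from an explicit path is existing OProfile syntax, per John's note:

```shell
#!/bin/sh
# Illustrative command lines only (printed, not executed): --samples-dir
# is John's proposed addition, not a real option; paths are examples.
NODE="$(hostname 2>/dev/null || uname -n)"
SAMPLES="/NFS/oprofile/$NODE/samples"

echo "opcontrol --samples-dir $SAMPLES --start"
echo "# ... workload runs ..."
echo "opcontrol --dump"
echo "opcontrol --shutdown"

# Reading the data back with the existing session: syntax:
REPORT_CMD="opreport session:$SAMPLES/current"
echo "$REPORT_CMD"
```

The config file (daemonrc, complete_dump, devices, etc.) would stay in /var/lib/oprofile per John's comment; only the sample database moves.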