From: Dan B. <da...@ma...> - 2009-06-23 17:02:09
I am using sshfs to mount some remote data on the Condor submit host of a university campus grid consisting of several hundred library and lab PCs. The execute hosts see the mounted data via Condor's remote I/O mechanism. The application (a hydrological model written in FORTRAN) runs without problems unless more than about 40 instances are running at the same time. When too many of these model instances run concurrently, some of them fail partway through their 6-hour run with "Operation not permitted" errors on opening files for reading. The files causing the problem are different each time, and (except for a few cases) no two model runs try to open the same input file. The job failures all happen within the space of a few minutes, leaving those that survive to run to completion.

I am running a test SSH server on the remote host on port 2022 with debugging enabled. This is the command I am using to mount the remote data (with $USER etc. replaced with appropriate values):

sshfs $USER@$REMOTE_HOST:$PATH $MOUNT_POINT -o port=2022,debug,sshfs_debug,LogLevel=DEBUG3,reconnect,workaround=all,allow_other,default_permissions,uid=$UID,gid=$GID,umask=0002

My first question is: are there any documents describing the sshfs and ssh debug messages, to help me understand what they are telling me? There are "Operation not permitted" errors, but it is hard to match these up with the FORTRAN errors because there are no time stamps in the sshfs/ssh debug messages, and the messages themselves do not specify any paths. Some of the "Operation not permitted" errors in the sshfs output relate to OPEN operations and some to SETATTR operations. Here is an example of an "Operation not permitted" message, with (what I presume to be) its corresponding "opcode" message (the two are typically separated by several lines in the sshfs/ssh output file):
unique: 21753707, opcode: OPEN (14), nodeid: 28085, insize: 48
unique: 21753707, error: -1 (Operation not permitted), outsize: 16

I don't know if any of these error messages match up with the FORTRAN errors produced when the model runs crash. For the above example I tried to find the file's path by looking for previous occurrences of "nodeid: 28085", but none of these messages contained a path either. The sshd debug message file on the remote data server seems to contain very similar information to the ssh part of the sshfs/ssh client debug output.

I thought the problem might be related to the periodic re-keying I have observed in the ssh debug messages. It occurred to me that there might be a short period of data unavailability during the re-keying process that could cause problems for applications reading the data. Is this at all likely? In other words, does ssh behave like that, or should it just carry on without delay or interruption to data transfers during re-keying? I assume the latter, because there doesn't seem to be any relationship between the re-keying messages and the "Operation not permitted" errors. In fact, the last time I ran the models the first "Operation not permitted" error occurred before the first batch of re-keying messages in the sshfs/ssh output file.

I am not certain that sshfs or ssh is to blame for the models crashing, but nothing else seems to be amiss on the local or remote systems. The volumes of data are not particularly large; the total amount of data transfer involved in a single model run is about 1.2 GB (including input and output), and only about half of that has been completed by the time a model crashes as a result of an "Operation not permitted" error.

Any suggestions, or explanations of debug messages, would be much appreciated. I think that sshfs could be very useful for this type of campus grid application, but only if it can handle the I/O for more than 40 simultaneous batch jobs.
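Since each reply in the debug output carries the same "unique" value as its request, one way to line the two up is to pair them on that field. The following is only a minimal sketch in Python, assuming every request and reply has the same form as the two example lines quoted above; the regular expressions and the function name pair_failed_requests are hypothetical and not part of sshfs or FUSE, and other versions may print the messages differently.

```python
import re

# Patterns modelled on the two example debug lines quoted in this message.
# They are an assumption; other FUSE/sshfs versions may format lines differently.
REQUEST_RE = re.compile(r"unique: (\d+), opcode: (\S+) \((\d+)\), nodeid: (\d+)")
REPLY_RE = re.compile(r"unique: (\d+), error: (-?\d+) \(([^)]*)\)")


def pair_failed_requests(lines):
    """Return one dict per request whose reply reports a non-zero error code."""
    pending = {}   # unique id -> (opcode name, nodeid) for requests awaiting a reply
    failures = []
    for line in lines:
        request = REQUEST_RE.search(line)
        if request:
            unique, opcode, _opcode_number, nodeid = request.groups()
            pending[int(unique)] = (opcode, int(nodeid))
        reply = REPLY_RE.search(line)
        if reply:
            unique, code, message = reply.groups()
            # A reply with no recorded request is still reported, with unknown fields.
            opcode, nodeid = pending.pop(int(unique), ("UNKNOWN", None))
            if int(code) != 0:
                failures.append({
                    "unique": int(unique),
                    "opcode": opcode,
                    "nodeid": nodeid,
                    "errno": int(code),
                    "message": message,
                })
    return failures
```

Calling pair_failed_requests on an open handle to the saved debug file would list every failed operation with its opcode and nodeid. It does not recover the file path, because the messages shown here do not contain one.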
Finally, here is some version information obtained from the client host.

[vxx05160@vicg2 coupled]$ sshfs -V
SSHFS version 2.2
FUSE library version: 2.7.4
fusermount version: 2.7.4
using FUSE kernel interface version 7.8

From the sshfs/ssh debug message file:

debug1: Remote protocol version 2.0, remote software version OpenSSH_5.1
debug1: Local version string SSH-2.0-OpenSSH_4.3

Regards,
Dan Bretherton

--
Mr. D.A. Bretherton
Reading e-Science Centre
Environmental Systems Science Centre
Harry Pitt Building
3 Earley Gate
University of Reading
Reading, RG6 6AL
UK
Tel. +44 118 378 7722
Fax: +44 118 378 6413