Let me give a bit of background first. We're using vblade 14 on four 5TB arrays, which we use to resell NAS storage to customers. Each array holds several dozen customers.
Normally everything works fine. However, when I started this job I inherited a problem with one of the arrays (we call it aoe6). It would continually disconnect, and any attempt to access the array, even a simple ls, failed with an input/output error. My monitoring script produces cron output like this:
rm: cannot remove `monitortest': Input/output error
/root/aoetest: line 4: monitortest: Input/output error
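To make clear what kind of check produces that output, here is a minimal sketch of a write-and-remove probe. It is not the actual /root/aoetest script; the mount point /mnt/aoe4 and the function names are made up for illustration, and the only thing taken from the output above is the monitortest file name.

```python
#!/usr/bin/env python3
"""Write-and-remove probe for a mounted array (illustrative sketch only)."""
import os
import sys
import time


def probe(mount_point, name="monitortest"):
    """Create and delete a small file; return (ok, message).

    Any OSError, including EIO (Input/output error), counts as a failure.
    """
    path = os.path.join(mount_point, name)
    try:
        with open(path, "w") as handle:
            handle.write("probe\n")
            handle.flush()
            os.fsync(handle.fileno())  # push the write through to the device
        os.remove(path)
    except OSError as err:
        return False, f"{path}: {err.strerror}"
    return True, f"{path}: ok"


def log_line(mount_point, ok, message):
    """Format one timestamped result line for a log file or cron mail."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    status = "OK" if ok else "FAIL"
    return f"{stamp} {status} {mount_point} {message}"


if __name__ == "__main__":
    # /mnt/aoe4 is a hypothetical mount point, not our real path.
    target = sys.argv[1] if len(sys.argv) > 1 else "/mnt/aoe4"
    ok, message = probe(target)
    print(log_line(target, ok, message))
    sys.exit(0 if ok else 1)
```

Run from cron every minute or so, a probe like this would at least timestamp each failure, so the failure times could be compared against the Samba, ftp and ssh logs.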
When this happens, I am unable to unmount the array and remount it - I have to reboot the network storage controller that connects to the 4 arrays in order to get the connection back.
Now, here's the fun part. Originally it was thought that there was something wrong with aoe6, so when I started, my job was to build a new aoe4 on new rackmount hardware, different from the old array. Then I had to migrate all of the data from aoe6 to aoe4 (it took 2 weeks). After all of the data was migrated and the customers were moved to aoe4, I broke the aoe6 array, reinitialized it and added it back to the controller server. It's been working fine ever since.
However, now aoe4 (the new array) is disconnecting regularly. Since the problem followed the customers to completely different hardware, this leads me to believe that a customer or their data is causing it. The other three arrays, aoe5, aoe6 and aoe7, are all happily connected and running. The customers have access to Samba shares, ftp, ssh/sftp and rsync.
My question is, has anyone here seen something like this? If so, do you know what causes it? If not, is there anything I can do to figure out which customer it is and what they're doing to cause it? It would take weeks, perhaps months, to move customers back to aoe6 one at a time, waiting for another disconnect between each move.