From: Steve D. <st...@us...> - 2002-05-17 15:25:55
On 17 May 2002, Ram Pai wrote:
> On 16 May 2002, Daniel McNeil wrote:
>
>>> Isn't the below example more a reason to save a local map of all
>>> EVMS volume information?  Having diskgroup info in a file does
>>> NOT tell you which disk contains your volume, just which disks
>>> might be missing.  If you saved all volume information (and
>>> maybe even diskgroup) then you could figure out which disk has
>>> the volume you are getting an error on.  On every boot or reprobe
>>> you could inform the administrator what is different (new or missing).

Remember that there is not always a 1-to-1 or 1-to-n correspondence between
volumes and disks.  If you use several disks to create an LVM group and then
have LVM carve that group into regions that are made into volumes, any of
the volumes can reside on several of the disks, resulting in an n-to-m
mapping.  Only LVM knows the mapping of disks to regions.

To maintain an accurate mapping, LVM would have to keep the parent and child
lists of the regions and disks up to date so that they reflect which regions
have pieces on which disks.  It's a little more complex, but I think the
Engine could handle it.

There is another issue to handle when an LVM region is shrunk or expanded.
A shrink or expand can alter the parent/child relationship between a region
and a disk.  A shrink may remove all of a region's pieces from a disk; an
expand may take storage from another disk to grow the region.  LVM would
have to keep the parent/child lists up to date, and the Engine would have to
be aware of these changes and update the map file accordingly.

When you bring up the Engine (e.g., run evmsgui), the plug-ins throw up
warnings when they don't find the pieces they need to build an object.  The
warnings actually provide more information than a volume-to-disk mapping
would:  they say which intermediate objects are missing or corrupt.  One
hole, though, is that objects/volumes whose children reside entirely on one
failed disk will not be discovered and hence will not be reported as errors.

> With my proposed solution, the engine would not be aware of missing
> volumes, because we do not track the volumes in the map file.  This may
> be fixed by caching in the map file all the volumes created from
> sharable storage.
>
> Now the engine would know the missing volumes, but won't be able to
> report the corresponding disks unless we cache the entire metadata in
> the map file.  Hmmm... but that's overkill.  The engine would, however,
> be able to give a rough estimate by reporting the missing disks.

Yes, caching all the metadata would be overkill.  The user ultimately deals
with volumes.  If a volume is not there because a disk failed, I as an
administrator would not be so much concerned with the intermediate objects
that are missing as with which disk died and which volumes are no longer
available.
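For what it's worth, here is a minimal sketch of what caching the
shared-storage volumes in the map file could look like.  Nothing below is
existing EVMS code; the file location, structure names, and record layout
are assumptions made only for illustration.

/*
 * Illustrative only: one map file record per volume created from sharable
 * storage, listing the disks the volume spans, so that a boot-time check
 * can at least report which disk died and which volumes it takes with it.
 */
#include <stdio.h>

#define MAP_FILE   "/var/evms/shared_volume.map"   /* assumed location */
#define MAX_NAME   128
#define MAX_DISKS  16

struct shared_volume_entry {
        char volume_name[MAX_NAME];        /* e.g. "vol1" (/dev/evms/vol1) */
        char diskgroup[MAX_NAME];          /* owning shared diskgroup      */
        int  disk_count;
        char disks[MAX_DISKS][MAX_NAME];   /* disks the volume spans       */
};

/* Append one record as a single line:  volume  diskgroup  disk [disk ...] */
int map_file_add(const struct shared_volume_entry *e)
{
        FILE *fp = fopen(MAP_FILE, "a");
        int i;

        if (fp == NULL)
                return -1;

        fprintf(fp, "%s %s", e->volume_name, e->diskgroup);
        for (i = 0; i < e->disk_count; i++)
                fprintf(fp, " %s", e->disks[i]);
        fprintf(fp, "\n");

        return fclose(fp);
}

int main(void)
{
        struct shared_volume_entry e = {
                "vol1", "shared_dg1", 2, { "sdc", "sdd" }
        };
        return map_file_add(&e) ? 1 : 0;
}

The Engine would have to rewrite these records whenever an expand, shrink,
or disk add/remove changes which disks a volume touches, which is the same
parent/child bookkeeping described above.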
>>>
>>> Questions:
>>>
>>> Does EVMS run any user-level engine stuff on boot that could potentially
>>> create the map and save it to a file?
>>> Or does EVMS only run the engine stuff on configuration changes?
>>> Or were you thinking about updating and checking the file from
>>> the kernel?

The Engine does not currently run at boot time.  The Engine is only run for
configuration changes.  I don't think we want to add boot time code to the
Engine.  Boot time code belongs in a separate boot program or script.

> Here is the sequence I imagine:
>
> During boot:
> 1) kernel discovers all the local disks, local diskgroups and creates
>    the local configuration.
> 2) The clustering software runs and joins the cluster.
> 3) the EVMS rc script then runs, which invokes the engine.
> 4) the engine runs in a restricted mode (I mean validation mode),
>    reads all the sharable disk metadata, verifies if any
>    disk is missing, discovers the volumes.  I also see a
>    necessity for the engine to talk to spawned engines on
>    other cluster nodes, agree upon the sharable
>    diskgroup configuration, and update its map file for any
>    disks that have not been updated in the map file.
> 5) informs the kernel to discover the configuration residing on
>    sharable disks.
> 6) if everything is fine, quits.

I don't see a reason why a boot program would have to do validation before
calling the kernel to discover volumes.  Why not just prod the kernel to
discover the volumes on the shared media and then compare the results
against the map file?  I see the steps as:

1) The kernel discovers all the local disks and local diskgroups and creates
   the local configuration.
2) The clustering software runs and joins the cluster.
3) The EVMS rc script then runs, which invokes the clustering boot program.
4) The clustering boot program issues an ioctl to the kernel to enable the
   shared disks and do a rediscover, taking the new disks into account.
5) The clustering boot program looks at the list of volumes discovered by
   the kernel, compares it against the map file to look for missing volumes,
   and displays warnings if there are any.  (The boot program could also do
   the comparison on the other nodes in the cluster.)

Question:  Is it important that the clustering software on one node knows
whether another node cannot access shared storage in the cluster?  If the
storage is shared, then all you need are locks to serialize access from the
various nodes.  It doesn't matter if another node can't access the disk.
The only reason I can see for one node to care whether another node sees the
shared storage is when an application uses the shared storage to communicate
with the other node.  In that case, wouldn't it be better for the
application to communicate via sockets instead of shared storage?  Or, at
least, if the application is going to communicate via the shared storage, it
would have to coordinate activity with the other node anyway, which is the
application's responsibility, not the responsibility of the cluster
software.  Or am I missing something?

> When a node is up and the system administrator wants to add a sharable
> disk to a diskgroup, the engine does the corresponding metadata updates
> on the disks and adds an entry for the disk in the map files on all the
> nodes.  There are cases where the map file may not be updatable on some
> nodes because they are not active cluster members.  In such cases it is
> the responsibility of the engine on that node to consult other nodes and
> update its map file when it becomes a cluster member.
>
> There are some problems though:
>
> 1. if the disks in local diskgroups are missing, the engine has no
>    opportunity to warn, because local diskgroups are imported during
>    boot by the kernel way before the engine gets a chance to run.
>    However, the kernel does detect that disks are missing and can spew
>    out a syslog message.  But it won't be able to pinpoint the exact
>    disk.

Again, I would vote that the Engine remain uninvolved during the boot
process.  Most of the code that Ram proposes would be in the kernel as part
of the discovery process, most of it provided by a clustering plug-in.  Any
user space requirements can be put into a new program that is run by the
init scripts.
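For concreteness, here is a minimal sketch of what the missing-volume check
in step 5 might look like in such a user space program.  It assumes the map
file layout from the sketch above and assumes that discovered volumes show
up as device nodes under /dev/evms; it leaves out step 4 (the enable shared
disks / rediscover ioctl), since that kernel interface is not pinned down
here.  None of this is existing code, just an illustration.

/*
 * Illustrative sketch of step 5: read the map file of shared-storage
 * volumes and warn about any volume the kernel did not discover.  The map
 * file layout (first field = volume name) and the /dev/evms node directory
 * are assumptions carried over from the sketch above.
 */
#include <stdio.h>
#include <unistd.h>

#define MAP_FILE  "/var/evms/shared_volume.map"   /* assumed location  */
#define DEV_DIR   "/dev/evms/"                    /* assumed node path */

int main(void)
{
        FILE *fp = fopen(MAP_FILE, "r");
        char line[1024];
        int missing = 0;

        if (fp == NULL) {
                perror(MAP_FILE);
                return 1;
        }

        while (fgets(line, sizeof(line), fp) != NULL) {
                char volume[256];
                char path[512];

                /* The first field of each record is the volume name. */
                if (sscanf(line, "%255s", volume) != 1)
                        continue;

                snprintf(path, sizeof(path), "%s%s", DEV_DIR, volume);
                if (access(path, F_OK) != 0) {
                        fprintf(stderr, "warning: shared volume %s was not "
                                "discovered (disk missing?)\n", volume);
                        missing++;
                }
        }
        fclose(fp);

        return missing ? 2 : 0;
}

The same check could be extended to report which of the disks listed for a
missing volume has disappeared, or be run against the map files on the other
nodes in the cluster.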
The boot program may have dependencies on the Engine and may even use Engine
services, so it would be logical to have the code in the Engine tree,
something like the devnode_fixup program is today.

Maybe it's just a matter of semantics.  When Ram says "the engine runs",
perhaps he more literally means "the clustering boot program (which uses the
Engine) runs".

>>> Daniel

> --
> Ram Pai
> lin...@us...
> 503-5783752
> Tieline: 7753752
> EVMS: http://www.sf.net/projects/evms

----------------------------------
Steve D.