From: Steve D. <st...@us...> - 2002-05-17 15:25:55
On 17 May 2002, Ram Pai wrote:
> On 16 May 2002, Daniel McNeil wrote:
>
>>> Isn't the below example more a reason to save a local map of all
>>> EVMS volume information?  Having diskgroup info in a file does
>>> NOT tell you which disk contains your volume, just which disks
>>> might be missing.  If you saved all volume information (and
>>> maybe even diskgroup) then you could figure out which disk has
>>> the volume you are getting an error on.  On every boot or reprobe
>>> you could inform the administrator what is different (new or missing).

Remember that there is not always a 1-to-1 or 1-to-n correspondence between
volumes and disks.  If you use several disks to create an LVM group and then
have LVM carve that group into regions that are made into volumes, any of
the volumes can reside on several of the disks, resulting in an n-to-m
mapping.  Only LVM knows the mapping of disks to regions.

To maintain an accurate mapping, LVM would have to keep the parent and child
lists of the regions and disks up to date so that they reflect which regions
have pieces on which disks.  It's a little more complex, but I think the
Engine could handle it.

There is another issue to handle when an LVM region is shrunk or expanded.
A shrink or expand can alter the parent/child relationship between a region
and a disk.  A shrink may remove all of a region's pieces from a disk; an
expand may take storage from another disk to grow the region.  LVM would
have to keep the parent/child lists up to date, and the Engine would have to
be aware of these changes and update the map file accordingly.

When you bring up the Engine (e.g., run evmsgui), the plug-ins throw up
warnings when they don't find the pieces they need to build an object.  The
warnings actually provide more information than a volume-to-disk mapping
would:  they say which intermediate objects are missing or corrupt.  One
hole, though, is that objects/volumes whose children reside entirely on one
failed disk will not be discovered and hence will not be reported as errors.

> With my proposed solution, the engine would not be aware of missing
> volumes, because we do not track the volumes in the map file.  This may
> be fixed by caching in the map file all the volumes created from
> sharable storage.
>
> Now the engine would know the missing volumes, but won't be able to
> report the corresponding disks unless we cache the entire metadata in
> the map file.  Hmmm... but that's overkill.  The engine would, however,
> be able to give a rough estimate by reporting the missing disks.

Yes, caching all the metadata would be overkill.  The user ultimately deals
with volumes.  If a volume is not there because a disk failed, I as an
administrator would not be so much concerned with the intermediate objects
that are missing as with which disk died and which volumes are no longer
available.
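For what it's worth, here is a minimal sketch of what caching the
shared-storage volumes in the map file could look like.  Nothing below is
existing EVMS code; the file location, structure names, and record layout
are assumptions made only for illustration.

/*
 * Illustrative only: one map file record per volume created from sharable
 * storage, listing the disks the volume spans, so that a boot-time check
 * can at least report which disk died and which volumes it takes with it.
 */
#include <stdio.h>

#define MAP_FILE   "/var/evms/shared_volume.map"   /* assumed location */
#define MAX_NAME   128
#define MAX_DISKS  16

struct shared_volume_entry {
        char volume_name[MAX_NAME];        /* e.g. "vol1" (/dev/evms/vol1) */
        char diskgroup[MAX_NAME];          /* owning shared diskgroup      */
        int  disk_count;
        char disks[MAX_DISKS][MAX_NAME];   /* disks the volume spans       */
};

/* Append one record as a single line:  volume  diskgroup  disk [disk ...] */
int map_file_add(const struct shared_volume_entry *e)
{
        FILE *fp = fopen(MAP_FILE, "a");
        int i;

        if (fp == NULL)
                return -1;

        fprintf(fp, "%s %s", e->volume_name, e->diskgroup);
        for (i = 0; i < e->disk_count; i++)
                fprintf(fp, " %s", e->disks[i]);
        fprintf(fp, "\n");

        return fclose(fp);
}

int main(void)
{
        struct shared_volume_entry e = {
                "vol1", "shared_dg1", 2, { "sdc", "sdd" }
        };
        return map_file_add(&e) ? 1 : 0;
}

The Engine would have to rewrite these records whenever an expand, shrink,
or disk add/remove changes which disks a volume touches, which is the same
parent/child bookkeeping described above.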
>>>
>>> Questions:
>>>
>>> Does EVMS run any user-level engine stuff on boot that could potentially
>>> create the map and save it to a file?
>>> Or does EVMS only run the engine stuff on configuration changes?
>>> Or were you thinking about updating and checking the file from
>>> the kernel?

The Engine does not currently run at boot time.  The Engine is only run for
configuration changes.  I don't think we want to add boot time code to the
Engine.  Boot time code belongs in a separate boot program or script.

> Here is the sequence I imagine:
>
> During boot:
> 1) kernel discovers all the local disks, local diskgroups and creates
>    the local configuration.
> 2) The clustering software runs and joins the cluster.
> 3) the EVMS rc script then runs, which invokes the engine.
> 4) the engine runs in a restricted mode (I mean validation mode),
>    reads all the sharable disk metadata, verifies if any
>    disk is missing, discovers the volumes.  I also see a
>    necessity for the engine to talk to spawned engines on
>    other cluster nodes, agree upon the sharable
>    diskgroup configuration, and update its map file for any
>    disks that have not been updated in the map file.
> 5) informs the kernel to discover the configuration residing on
>    sharable disks.
> 6) if everything is fine, quits.

I don't see a reason why a boot program would have to do validation before
calling the kernel to discover volumes.  Why not just prod the kernel to
discover the volumes on the shared media and then compare the results
against the map file?  I see the steps as:

1) The kernel discovers all the local disks and local diskgroups and creates
   the local configuration.
2) The clustering software runs and joins the cluster.
3) The EVMS rc script then runs, which invokes the clustering boot program.
4) The clustering boot program issues an ioctl to the kernel to enable the
   shared disks and do a rediscover, taking the new disks into account.
5) The clustering boot program looks at the list of volumes discovered by
   the kernel, compares it against the map file to look for missing volumes,
   and displays warnings if there are any.  (The boot program could also do
   the comparison on the other nodes in the cluster.)

Question:  Is it important that the clustering software on one node knows
whether another node cannot access shared storage in the cluster?  If the
storage is shared, then all you need are locks to serialize access from the
various nodes.  It doesn't matter if another node can't access the disk.
The only reason I can see for one node to care whether another node sees the
shared storage is when an application uses the shared storage to communicate
with the other node.  In that case, wouldn't it be better for the
application to communicate via sockets instead of shared storage?  Or, at
least, if the application is going to communicate via the shared storage, it
would have to coordinate activity with the other node anyway, which is the
application's responsibility, not the responsibility of the cluster
software.  Or am I missing something?

> When a node is up and the system administrator wants to add a sharable
> disk to a diskgroup, the engine does the corresponding metadata updates
> on the disks and adds an entry for the disk in the map files on all the
> nodes.  There are cases where the map file may not be updatable on some
> nodes because they are not active cluster members.  In such cases it is
> the responsibility of the engine on that node to consult other nodes and
> update its map file when it becomes a cluster member.
>
> There are some problems though:
>
> 1. if the disks in local diskgroups are missing, the engine has no
>    opportunity to warn, because local diskgroups are imported during
>    boot by the kernel way before the engine gets a chance to run.
>    However, the kernel does detect that disks are missing and can spew
>    out a syslog message.  But it won't be able to pinpoint the exact
>    disk.

Again, I would vote that the Engine remain uninvolved during the boot
process.  Most of the code that Ram proposes would be in the kernel as part
of the discovery process, most of it provided by a clustering plug-in.  Any
user space requirements can be put into a new program that is run by the
init scripts.
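For concreteness, here is a minimal sketch of what the missing-volume check
in step 5 might look like in such a user space program.  It assumes the map
file layout from the sketch above and assumes that discovered volumes show
up as device nodes under /dev/evms; it leaves out step 4 (the enable shared
disks / rediscover ioctl), since that kernel interface is not pinned down
here.  None of this is existing code, just an illustration.

/*
 * Illustrative sketch of step 5: read the map file of shared-storage
 * volumes and warn about any volume the kernel did not discover.  The map
 * file layout (first field = volume name) and the /dev/evms node directory
 * are assumptions carried over from the sketch above.
 */
#include <stdio.h>
#include <unistd.h>

#define MAP_FILE  "/var/evms/shared_volume.map"   /* assumed location  */
#define DEV_DIR   "/dev/evms/"                    /* assumed node path */

int main(void)
{
        FILE *fp = fopen(MAP_FILE, "r");
        char line[1024];
        int missing = 0;

        if (fp == NULL) {
                perror(MAP_FILE);
                return 1;
        }

        while (fgets(line, sizeof(line), fp) != NULL) {
                char volume[256];
                char path[512];

                /* The first field of each record is the volume name. */
                if (sscanf(line, "%255s", volume) != 1)
                        continue;

                snprintf(path, sizeof(path), "%s%s", DEV_DIR, volume);
                if (access(path, F_OK) != 0) {
                        fprintf(stderr, "warning: shared volume %s was not "
                                "discovered (disk missing?)\n", volume);
                        missing++;
                }
        }
        fclose(fp);

        return missing ? 2 : 0;
}

The same check could be extended to report which of the disks listed for a
missing volume has disappeared, or be run against the map files on the other
nodes in the cluster.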
The boot program may have dependencies on the Engine and may even use Engine
services, so it would be logical to have the code in the Engine tree,
something like the devnode_fixup program is today.

Maybe it's just a matter of semantics.  When Ram says "the engine runs",
perhaps he more literally means "the clustering boot program (which uses the
Engine) runs".

>>> Daniel

> --
> Ram Pai
> lin...@us...
> 503-5783752
> Tieline: 7753752
> EVMS: http://www.sf.net/projects/evms

----------------------------------
Steve D.