From: Jes S. <je...@sg...> - 2006-05-01 10:07:14
Chandra Seetharaman wrote:
> On Fri, 2006-04-28 at 15:29 +0200, Jes Sorensen wrote:
>
> Do you mean to say that the module will not be allowed to unload
> _until_ there is _a single_ task that the module has attached a
> notifier block with ? That doesn't sound appealing.
>
> Consider a filesystem example:
> - Administrator insmods a filesystem
> - administrator mounts a filesystem
> - random user chandra gets to the filesystem
> - chandra creates a task p1
> - p1 does something which makes the filesystem module interested in p1
> - filesystem module attaches a notifier block to p1
> - user chandra moves off the filesystem, but p1 remains alive
> - administrator unmounts the filesystem, succeeds as no user is in
>   the fs
> - administrator tries to rmmod the filesystem, and it fails due to
>   the above logic. How can the administrator fix the problem ?

Unregistering doesn't fail, it just becomes delayed. How likely is it
that someone will unregister a file system like that while the task
that accessed it is still live? The cost of adding locking is far worse
than the above, IMHO fairly artificial, case.

> Also, i do not think it is advisable to call unregister from the
> notifier call.

There is nothing wrong with that. You're in task context at the time,
so nothing else will conflict.

> even though the notifier block is per task, the notifier_head in the
> task is global. In other words, there can be many users (of the task
> notify mechanism) that can be doing one of the following at the
> _same time_ with a single notifier head:
> - register a new notifier block
> - unregister a notifier block
> - walk through the call chain in notifier_call_chain

Nobody will walk the notifier chain outside of task context; anything
doing that is broken. In addition, since the registration is done by
the task itself this isn't a problem, and unregistration is done at
exit time. We could use the RCU approach if it were really deemed
necessary to be able to add tasks from elsewhere.
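FWIW, the constraint I keep referring to can be sketched in user space
roughly like this. This is illustrative only, not the actual patch (the
type and function names here are made up); the point is that with
register, unregister and call_chain all confined to the owning task's
context, no lock is needed, and a block can safely unregister itself
from inside its own callback because the walker fetches ->next before
invoking it:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of a raw (lock-free) per-task notifier chain.
 * Safety relies entirely on the stated constraint: only the task that
 * owns the chain ever touches it, so there is no concurrent access. */

struct task_notifier_block {
	int (*call)(struct task_notifier_block *nb, unsigned long event);
	struct task_notifier_block *next;
};

struct task_notifier_head {
	struct task_notifier_block *head;
};

static void task_notifier_register(struct task_notifier_head *h,
				   struct task_notifier_block *nb)
{
	nb->next = h->head;
	h->head = nb;
}

static void task_notifier_unregister(struct task_notifier_head *h,
				     struct task_notifier_block *nb)
{
	struct task_notifier_block **p = &h->head;

	while (*p && *p != nb)
		p = &(*p)->next;
	if (*p)
		*p = nb->next;
}

/* Fetch ->next before running the callback, so a block may
 * unregister itself from within its own callback. */
static void task_notifier_call_chain(struct task_notifier_head *h,
				     unsigned long event)
{
	struct task_notifier_block *nb = h->head, *next;

	while (nb) {
		next = nb->next;
		nb->call(nb, event);
		nb = next;
	}
}
```
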
> Unless we have some kind of protection, how can we guarantee that
> they will not be running over each other, walking off the list,
> calling the notifier call of a removed module etc. ?

The walking-off-the-list argument doesn't make sense. The only way to
guard against that is by adding heavy locking, which we don't need if
we apply the simple constraints.

>> Again, why do you need a lock? Registration is inherited, so
>> something from userland creates it first and all children inherit
>> it. Look at how this is done with cpuset for instance.
>
> cpuset is not a valid comparison here. cpuset is the _only_ user of
> the data structures it plays with, and it knows how to protect them.
> Whereas the task notify mechanism will be used by many users, and
> hence a common protection mechanism is needed.

It's totally valid for the mechanism. If you wish to apply accounting
or resource control to the system, then it's reasonable to expect that
this is enabled by the init task and not at some random later point. It
doesn't mean the module has to be in use at all times, but once you
insmod your accounting, it's on for the whole system. Hence the cpusets
comparison makes sense IMHO.

>> Huh? Beware that semundo is special and requires a semaphore to be
>> installed if clone() is called with a specific flag. If the semundo
>> patch doesn't free the notifier block in the exit path then I
>> introduced a bug. It's the responsibility of the TN_EXIT handler to
>> clean up.
>
> As i pointed out above, it is not advisable to unregister from the
> callback, and that is the reason i am pointing this out as a problem,
> not specifically as a comment on how the semundo patch is written.

Please elaborate: you're in task context, so you have control of it.
What is wrong with calling the unregister function like this?

>>> 6 wastage of memory (notifier head and multiple notifier blocks
>>> for every task in the system).
>>
>> Sorry, this argument is bogus.
>> You're wasting a notifier head, but
>
> not just a notifier head, you need notifier blocks associated with
> each user of the task notifier mechanism.
>
> It is certainly not bogus: if a user of the task notify mechanism is
> interested in _all_ tasks, then it has to allocate a notifier block
> for _every_ task in the system. The process event connector is an
> example that would be interested in _all_ tasks.

For things that want to know about every task in the system it doesn't
make sense to register with a task notifier. It should rather be put in
with a fixed callout in the path, or there should be two levels of
notifier chains: global notifiers and task notifiers.

>> compared to what else is in the task struct this is nothing. In
>> fact, one could probably eliminate some elements in the task_struct
>> and tag them onto the notifier block so they are only allocated when
>> really needed. A notifier block is only allocated if something is
>> using it.
>
> According to your description above, you expect more people to start
> using the task notifier, which in turn means a notifier block for
> _each_ task for every such usage of the task notifier mechanism.

Yes, however that's adding 24 bytes per task on a 64-bit platform or 12
bytes per task on a 32-bit platform. Compared to the overhead of
running many fixed callouts, and the number of other data structures in
the task struct which are rarely used, this really isn't worth
counting.

>>> I understand there is "one function call overhead" at fork() for
>>> you and the MVFS folks, but having a global notifier block will be
>>> a lot simpler.
>>
>> Having a global notifier block will be a lot costlier because it
>> will require a global lock.
>
> My point is that we are not getting away from using a global lock in
> the per-task notifier_head case either.

I still haven't seen any argument for why we need it. Sure, a module
may not unload if it's still in use, but that's how it is for all
modules anyway.
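The 24/12-byte figure above is just the size of a notifier block: two
pointers plus an int priority, mirroring the layout of the kernel's
struct notifier_block. A quick user-space sanity check of the
arithmetic (this is a sketch with an illustrative struct, not kernel
code):

```c
#include <assert.h>
#include <stddef.h>

/* Back-of-envelope check of the "24 bytes on 64-bit, 12 bytes on
 * 32-bit" figure: a notifier block is essentially a callback pointer,
 * a next pointer, and an int priority. On LP64 that's 8 + 8 + 4,
 * padded to 24 for 8-byte alignment; on ILP32 it's 4 + 4 + 4 = 12. */
struct notifier_block_sketch {
	int (*notifier_call)(void *nb, unsigned long action, void *data);
	struct notifier_block_sketch *next;
	int priority;
};
```
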
>>> BTW, there is no locking involved during call_chain of atomic
>>> notifiers, so the overhead is minimal.
>>
>> Atomic notifiers make no sense in the fork path. An atomic notifier
>> implies that the service function cannot sleep, however that is not
>> the case during fork/exit etc. Either we go the full way with a
>> global lock, or we do it lock free using raw notifiers as I suggest.
>
> By global lock you mean a semaphore, i presume.

Any lock that would be taken for each task at the global level.

>> this is not a problem. Besides, how many cases do we have where a
>> module
>
> may not be many, but we don't want to be oopsing when they do, right ?
> As long as the kernel supports unloading a module, we should consider
> that as a valid usage.

It's not unsupported, just constrained. Task notifiers are not meant to
be a kitchen sink; they should be used carefully, like many other APIs
within the kernel.

>> is unloaded at runtime? A classical case for this is things like
>> accounting, and that's not a module that will be loaded and unloaded
>> at random.
>
> I will not conclude on that. We have heard in this channel that MVFS
> will use this feature. I am sure that many more will follow.

Pardon my ignorance, but what is MVFS? What does it do and how does it
rely on this feature?

> They may not be doing it at random, but they may do it, and IMO we
> should work correctly in those cases also.

The philosophy behind the task notifiers is that they are to be simple
and efficient. If we start catering for random corner cases in the hot
path it's going to be very expensive. Remember, this is Linux, not AIX
or IRIX :)))

>>> Individual module owners cannot protect the notifier chain when it
>>> is walked through during call_chain.
>>
>> Nobody should need to walk the notifier chain outside the specified
>> context.
>
> They won't be walking the list themselves, notifier_call_chain would
> be walking it, and it needs protection.
Since notifier_call_chain will only be called from within task context,
it doesn't need any locking, as long as we only allow walking the chain
from within task context.

>>> If any module wants to block, then we may have to provide blocking
>>> task notify call chains also.
>>
>> The task notification chain is allowed to sleep. We already use
>> GFP_KERNEL all over the place there.
>
> We can just go with a blocking semaphore.

And add another blocking semaphore to the hot path? I don't think
that's going to do anything good for scalability.

Regards,
Jes
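P.S. The scalability point can be made concrete with a user-space
sketch (illustrative names, not the actual patch). The blocking-
semaphore proposal means every single fork()/exit() on every CPU goes
through one global synchronization point; here the down()/up() pair is
stubbed out with a counter so that the global operation is visible,
whereas in a real kernel it would be a contended cache line:

```c
#include <assert.h>
#include <stddef.h>

struct nblock {
	int (*call)(struct nblock *nb, unsigned long ev);
	struct nblock *next;
};

/* Raw variant, as proposed: no lock at all. Correctness comes from
 * the constraint that only the owning task walks its own chain. */
static void raw_call_chain(struct nblock *head, unsigned long ev)
{
	struct nblock *nb;

	for (nb = head; nb; nb = nb->next)
		nb->call(nb, ev);
}

/* Blocking variant, as suggested in the thread: every walk goes
 * through one global semaphore. The counter below stands in for the
 * down(&chain_sem)/up(&chain_sem) pair, one global operation per
 * chain walk, i.e. per fork()/exit() in the hot path. */
static unsigned long global_sem_ops;

static void blocking_call_chain(struct nblock *head, unsigned long ev)
{
	struct nblock *nb;

	global_sem_ops++;	/* stand-in for down(&chain_sem) */
	for (nb = head; nb; nb = nb->next)
		nb->call(nb, ev);
	/* matching up(&chain_sem) elided */
}
```
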