From: Paul E. M. <pa...@us...> - 2006-01-17 17:26:39
On Thu, Jan 12, 2006 at 07:17:41AM -0800, Paul E. McKenney wrote:
> On Thu, Jan 12, 2006 at 06:50:34PM +1100, Keith Owens wrote:
> > "Paul E. McKenney" (on Wed, 11 Jan 2006 22:51:15 -0800) wrote:
> > >On Thu, Jan 12, 2006 at 05:19:01PM +1100, Keith Owens wrote:
> > >> OK, I have thought about it and ...
> > >>
> > >> notifier_call_chain_lockfree() must be called with preempt disabled.
> > >>
> > >Fair enough! A comment, perhaps? In a former life I would have also
> > >demanded debug code to verify that preemption/interrupts/whatever were
> > >actually disabled, given the very subtle nature of any resulting bugs...
> >
> > Comment - OK. Debug code is not required, the reference to
> > smp_processor_id() already does all the debug checks that
> > notifier_call_chain_lockfree() needs. CONFIG_PREEMPT_DEBUG is your
> > friend.
>
> Ah, debug_smp_processor_id(). Very cool!!!

One other thing -- given that you are requiring that the read side
have preemption disabled, another update-side option would be to
use synchronize_sched() to wait for all preemption-disabled code
segments to complete. This would allow you to remove the per-CPU
reference counters from the read side, but would require that the
update side either (1) be able to block or (2) be able to defer
the cleanup to process context.

Just another option...

							Thanx, Paul
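For concreteness, the update-side option Paul describes might look like
the sketch below. The chain layout and the unlink helper are hypothetical
(this is not the notifier_call_chain_lockfree() implementation); the
point is only that once readers run under preempt_disable(),
synchronize_sched() can replace per-CPU reference counting on the
update side:

	/* Read side: preemption disabled is the only protection;
	 * no per-CPU reference counts are taken. */
	static int call_chain(struct notifier_block *chain,
			      unsigned long event, void *data)
	{
		struct notifier_block *nb;

		preempt_disable();
		for (nb = chain; nb != NULL; nb = nb->next)
			nb->notifier_call(nb, event, data);
		preempt_enable();
		return 0;
	}

	/* Update side: unlink the entry (leaving the chain walkable for
	 * concurrent readers), then wait for every CPU to leave its
	 * preempt-disabled section. synchronize_sched() may block, so
	 * this must run in process context, or the kfree() must be
	 * deferred there. */
	static void remove_entry(struct notifier_block **chain,
				 struct notifier_block *nb)
	{
		unlink_from_chain(chain, nb);	/* hypothetical helper */
		synchronize_sched();
		kfree(nb);
	}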
From: Keith O. <ka...@sg...> - 2006-01-17 23:57:52
"Paul E. McKenney" (on Tue, 17 Jan 2006 09:26:17 -0800) wrote:
>On Thu, Jan 12, 2006 at 07:17:41AM -0800, Paul E. McKenney wrote:
>> On Thu, Jan 12, 2006 at 06:50:34PM +1100, Keith Owens wrote:
>> > "Paul E. McKenney" (on Wed, 11 Jan 2006 22:51:15 -0800) wrote:
>> > >On Thu, Jan 12, 2006 at 05:19:01PM +1100, Keith Owens wrote:
>> > >> OK, I have thought about it and ...
>> > >>
>> > >> notifier_call_chain_lockfree() must be called with preempt disabled.
>> > >>
>> > >Fair enough! A comment, perhaps? In a former life I would have also
>> > >demanded debug code to verify that preemption/interrupts/whatever were
>> > >actually disabled, given the very subtle nature of any resulting bugs...
>> >
>> > Comment - OK. Debug code is not required, the reference to
>> > smp_processor_id() already does all the debug checks that
>> > notifier_call_chain_lockfree() needs. CONFIG_PREEMPT_DEBUG is your
>> > friend.
>>
>> Ah, debug_smp_processor_id(). Very cool!!!
>
>One other thing -- given that you are requiring that the read side
>have preemption disabled, another update-side option would be to
>use synchronize_sched() to wait for all preemption-disabled code
>segments to complete. This would allow you to remove the per-CPU
>reference counters from the read side, but would require that the
>update side either (1) be able to block or (2) be able to defer
>the cleanup to process context.

Originally I looked at that code but the comment scared me off.
synchronize_sched() maps to synchronize_rcu() and the comment says that
this only protects against rcu_read_lock(), not against preempt_disable().

/**
 * synchronize_sched - block until all CPUs have exited any non-preemptive
 * kernel code sequences.
 *
 * This means that all preempt_disable code sequences, including NMI and
 * hardware-interrupt handlers, in progress on entry will have completed
 * before this primitive returns. However, this does not guarantee that
 * softirq handlers will have completed, since in some kernels
 *
 * This primitive provides the guarantees made by the (deprecated)
 * synchronize_kernel() API. In contrast, synchronize_rcu() only
 * guarantees that rcu_read_lock() sections will have completed.
 */
#define synchronize_sched() synchronize_rcu()
From: Paul E. M. <pa...@us...> - 2006-01-18 02:50:12
On Wed, Jan 18, 2006 at 10:57:47AM +1100, Keith Owens wrote:
> "Paul E. McKenney" (on Tue, 17 Jan 2006 09:26:17 -0800) wrote:

[ . . .]

> >One other thing -- given that you are requiring that the read side
> >have preemption disabled, another update-side option would be to
> >use synchronize_sched() to wait for all preemption-disabled code
> >segments to complete. This would allow you to remove the per-CPU
> >reference counters from the read side, but would require that the
> >update side either (1) be able to block or (2) be able to defer
> >the cleanup to process context.
>
> Originally I looked at that code but the comment scared me off.
> synchronize_sched() maps to synchronize_rcu() and the comment says that
> this only protects against rcu_read_lock(), not against preempt_disable().

Good point -- in mainline kernels (but not in some realtime variants),
the two guarantees happen to be one and the same. But the comment
certainly does not make this clear.

> /**
>  * synchronize_sched - block until all CPUs have exited any non-preemptive
>  * kernel code sequences.
>  *
>  * This means that all preempt_disable code sequences, including NMI and
>  * hardware-interrupt handlers, in progress on entry will have completed
>  * before this primitive returns. However, this does not guarantee that
>  * softirq handlers will have completed, since in some kernels
>  *
>  * This primitive provides the guarantees made by the (deprecated)
>  * synchronize_kernel() API. In contrast, synchronize_rcu() only
>  * guarantees that rcu_read_lock() sections will have completed.
>  */
> #define synchronize_sched() synchronize_rcu()

Does the following change help?

							Thanx, Paul

diff -urpNa -X dontdiff linux-2.6.15/include/linux/rcupdate.h linux-2.6.15-RCUcomment/include/linux/rcupdate.h
--- linux-2.6.15/include/linux/rcupdate.h	2006-01-02 19:21:10.000000000 -0800
+++ linux-2.6.15-RCUcomment/include/linux/rcupdate.h	2006-01-17 18:48:33.000000000 -0800
@@ -265,11 +265,14 @@ static inline int rcu_pending(int cpu)
  * This means that all preempt_disable code sequences, including NMI and
  * hardware-interrupt handlers, in progress on entry will have completed
  * before this primitive returns. However, this does not guarantee that
- * softirq handlers will have completed, since in some kernels
+ * softirq handlers will have completed, since in some kernels, these
+ * handlers can run in process context, and can block.
  *
  * This primitive provides the guarantees made by the (deprecated)
  * synchronize_kernel() API. In contrast, synchronize_rcu() only
  * guarantees that rcu_read_lock() sections will have completed.
+ * In "classic RCU", these two guarantees happen to be one and
+ * the same, but can differ in realtime RCU implementations.
  */
 #define synchronize_sched() synchronize_rcu()
From: Lee R. <rlr...@jo...> - 2006-01-18 02:55:11
On Tue, 2006-01-17 at 18:49 -0800, Paul E. McKenney wrote:
> - * softirq handlers will have completed, since in some kernels
> + * softirq handlers will have completed, since in some kernels, these
> + * handlers can run in process context, and can block.
>  *

I was under the impression that softirq handlers can run in process
context in all kernels. Specifically, in realtime variants softirqs
always run in process context, and in mainline this only happens under
high load.

Lee
From: Paul E. M. <pa...@us...> - 2006-01-18 06:29:52
On Tue, Jan 17, 2006 at 09:55:06PM -0500, Lee Revell wrote:
> On Tue, 2006-01-17 at 18:49 -0800, Paul E. McKenney wrote:
> > - * softirq handlers will have completed, since in some kernels
> > + * softirq handlers will have completed, since in some kernels, these
> > + * handlers can run in process context, and can block.
> >  *
>
> I was under the impression that softirq handlers can run in process
> context in all kernels. Specifically, in realtime variants softirqs
> always run in process context, and in mainline this only happens under
> high load.

We might be talking past each other on this one. If I am not getting
too confused, it is possible to configure a mainline kernel so that
the load cannot rise high enough to force softirqs into process
context. Although looking at 2.6.15, it appears that this would
require rebuilding after hand-editing the value of
MAX_SOFTIRQ_RESTART, which some might feel is too brutal a tweak to be
considered mere "configuration".

In any case, the key point of the comment is that synchronize_sched()
is not guaranteed to wait for all pending softirq handlers to complete.
Does the comment make that sufficiently clear?

							Thanx, Paul
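For reference, the mechanism under discussion looks roughly like the
following, paraphrased from 2.6-era kernel/softirq.c (exact details
vary by release): __do_softirq() reruns pending softirqs at most
MAX_SOFTIRQ_RESTART times, and only if work still remains does it wake
ksoftirqd, i.e. push the remainder into process context.

	#define MAX_SOFTIRQ_RESTART 10

	asmlinkage void __do_softirq(void)
	{
		int max_restart = MAX_SOFTIRQ_RESTART;
		__u32 pending = local_softirq_pending();

	restart:
		/* ... run the handler for each bit set in "pending" ... */

		pending = local_softirq_pending();
		if (pending && --max_restart)
			goto restart;

		if (pending)
			wakeup_softirqd();	/* remainder runs in ksoftirqd */
	}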
From: Matt H. <mat...@us...> - 2006-01-12 05:36:32
On Thu, 2006-01-12 at 14:29 +1100, Keith Owens wrote:
> John Hesterberg (on Wed, 11 Jan 2006 15:39:10 -0600) wrote:
> >On Wed, Jan 11, 2006 at 01:02:10PM -0800, Matt Helsley wrote:
> >> Have you looked at Alan Stern's notifier chain fix patch? Could that be
> >> used in task_notify?
> >
> >I have two concerns about an all-tasks notification interface.
> >First, we want this to scale, so don't want more global locks.
> >One unique part of the task notify is that it doesn't use locks.
>
> Neither does Alan Stern's atomic notifier chain. Indeed it cannot use
> locks, because the atomic notifier chains can be called from anywhere,
> including non-maskable interrupts. The downside is that Alan's atomic
> notifier chains require RCU.
>
> An alternative patch that requires no locks and does not even require
> RCU is in
> http://marc.theaimsgroup.com/?l=linux-kernel&m=113392370322545&w=2

Interesting. Might the 'inuse' flags suffer from bouncing due to false
sharing?

Cheers,
	-Matt Helsley
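False sharing here would mean the per-CPU 'inuse' flags being packed
into one array so that every CPU's update dirties a cache line other
CPUs are also writing. A standard mitigation is to pad each entry out
to its own cache line; a minimal sketch, under the assumption that the
flags are plain per-CPU counters (the names and layout below are
hypothetical, not taken from Keith's patch):

	/* Packed as "int inuse[NR_CPUS];", neighbouring CPUs' entries
	 * share cache lines and every update bounces them. Padding
	 * each entry to a full cache line trades memory for isolation: */
	struct inuse_flag {
		int ctr;
	} ____cacheline_aligned_in_smp;

	struct lockfree_notifier_head {
		struct notifier_block *head;
		struct inuse_flag inuse[NR_CPUS];
	};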
From: Matt H. <mat...@us...> - 2006-01-11 22:48:32
On Wed, 2006-01-11 at 15:39 -0600, John Hesterberg wrote:
> On Wed, Jan 11, 2006 at 01:02:10PM -0800, Matt Helsley wrote:
> > On Wed, 2006-01-11 at 05:36 -0500, Jes Sorensen wrote:
> > > >>>>> "Shailabh" == Shailabh Nagar <na...@wa...> writes:
> > >
> > > Shailabh> Jes Sorensen wrote:
> > > >> I am quite concerned about that lock your patches put into struct
> > > >> task_struct through struct task_delay_info. Have you done any
> > > >> measurements on how this impacts performance on highly threaded
> > > >> apps on larger systems?
> > >
> > > Shailabh> I don't expect the lock contention to be high. The lock is
> > > Shailabh> held for a very short time (across two
> > > Shailabh> additions/increments). Moreover, it gets contended only when
> > > Shailabh> the stats are being read (either through /proc or
> > > Shailabh> connectors). Since the reading of stats won't be that
> > > Shailabh> frequent (the utility of these numbers is to influence the
> > > Shailabh> I/O priority/rss limit etc. which won't be done at a very
> > > Shailabh> small granularity anyway), I wouldn't expect a problem.
> > >
> > > Hi Shailabh,
> > >
> > > When this is read through connectors, it's initiated by the connectors
> > > code which is called from the task's context hence we don't need
> > > locking for that. It's very similar to the task_notify code I am about
> > > to post and I think the connector code could fit into that
> > > framework. The main issue is /proc, but then one could even have a
> > > mechanism with a hook when the task exits that pushes the data to a
> > > storage point which is lock protected.
> > >
> > > Even if a lock isn't contended, you are still going to see the cache
> > > lines bounce around due to the writes. It may not show up on a 4-way
> > > box but what happens on a 64-way? We have seen some pretty nasty
> > > effects on the bigger SN2 boxes with locks that were taken far too
> > > frequently, to the point where it would prevent the box from booting
> > > (now I don't expect it to be that severe here, but I'd still like to
> > > explore an approach of doing it lock free).
> > >
> > > Shailabh> But it's better to take some measurements anyway. Any
> > > Shailabh> suggestions on a benchmark?
> > >
> > > >> IMHO it seems to make more sense to use something like Jack's
> > > >> proposed task_notifier code to lock-lessly collect the data into task
> > > >> local data structures and then take the data from there and ship
> > > >> off to userland through netlink or similar like you are doing?
> > > >>
> > > >> I am working on modifying Jack's patch to carry task local data and
> > > >> use it for not just accounting but other areas that need optional
> > > >> callbacks (optional in the sense that it's a feature that can be
> > > >> enabled or disabled). Looking at Shailabh's delayacct_blkio()
> > > >> changes it seems that it would be really easy to fit those into
> > > >> that framework.
> > > >>
> > > >> Guess I should post some of this code .....
> > >
> > > Shailabh> Please do. If this accounting can fit into some other
> > > Shailabh> framework, that's fine too.
> > >
> > > Ok, finally, sorry for the delay. My current code snapshot is
> > > available at http://www.trained-monkey.org/~jes/patches/task_notify/ -
> > > it's a modified version of Jack's task_notify code, and three example
> > > users of it (the SysV IPC semundo semaphore, the key infrastructure
> > > and SGI's JOB module). The patch order should be task_notify.diff,
> > > task-notify-keys.diff, task-notify-semundo.diff, and
> > > task_notify-job.diff last.
> >
> > I can already tell you I don't like the "magic" mechanism to identify
> > notifier blocks. The problem is that it's yet another space of id
> > numbers that we have to manage -- either manually, by having a person
> > hand the numbers out to developers, or automatically using the idr code.
> > You could use the notifier block's address and avoid an additional id
> > space.
> >
> > Also, even if this mechanism goes into task_notify it needs a better
> > name than "magic".
> >
> > > I think task_notify should be usable for statistics gathering as
> > > well, the only issue is how to attach it to the processes we wish to
> > > gather accounting for. Personally I am not a big fan of the current
> > > concept where statistics are gathered for all tasks at all times but
> > > just not exported until accounting is enabled.
> >
> > Have you looked at Alan Stern's notifier chain fix patch? Could that be
> > used in task_notify?
> >
> > If not, perhaps the patch could use the standard kernel list idioms.
> >
> > Another potential user for the task_notify functionality is the process
> > events connector. The problem is it requires the ability to receive
> > notifications about all processes. Also, there's a chance that future
> > CKRM code could use all-task and per-task notification. Combined with
> > John Hesterberg's feedback I think there is strong justification for an
> > all-tasks notification interface.
>
> I have two concerns about an all-tasks notification interface.
> First, we want this to scale, so don't want more global locks.
> One unique part of the task notify is that it doesn't use locks.

Are you against global locks in these paths based solely on principle,
or have you measured their impact on scalability in those locations?
Without measurement there are two conflicting principles here: code
complexity vs. scalability.

If you've made measurements I'm curious to know what kind of locks were
measured, where they were used, what the load was doing -- as a rough
characterization of the pattern of notifications -- and how much it
impacted scalability. This information might help suggest a better
solution.

> Second, in at least some of the cases we're familiar with,
> even when we might need all-tasks notification we still need task-local
> data. That's been a problem with some of the global mechanisms I've
> seen discussed.

The per-task notification can allow you to have task-local data. Would
registration of per-task notification from within an all-task
notification be sufficient?

Cheers,
	-Matt Helsley
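One rough illustration of the closing question -- registering per-task
notification from within an all-task notification -- using entirely
hypothetical task_notify names (none of these calls exist in the
posted patches): the all-task callback fires for every task, and
attaches a per-task block, carrying task-local data, only to tasks it
cares about.

	static int all_task_cb(struct task_notifier *tn, unsigned long event,
			       void *v)
	{
		struct task_struct *tsk = v;

		/* TN_FORK is a guess at the patch's event names. */
		if (event == TN_FORK && task_is_interesting(tsk))
			task_notify_register(tsk, alloc_my_block(tsk));
		return 0;
	}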
From: Jes S. <je...@tr...> - 2006-01-12 10:01:11
>>>>> "Matt" == Matt Helsley <mat...@us...> writes:

Matt> On Wed, 2006-01-11 at 15:39 -0600, John Hesterberg wrote:
>> I have two concerns about an all-tasks notification interface.
>> First, we want this to scale, so don't want more global locks. One
>> unique part of the task notify is that it doesn't use locks.

Matt> Are you against global locks in these paths based solely on
Matt> principle or have you measured their impact on scalability in
Matt> those locations?

Matt,

It all depends on the specific location of the locks and how often
they are taken. As long as something is taken by the same CPU all the
time it is not going to be a major issue, but if we end up with
anything resembling a global lock, even if it is only held for a very
short time, it is going to bite badly. On a 4-way you probably won't
notice, but go to a 64-way and it bites; on a 512-way it eats you
alive (we had a problem in the timer code quite a while back that
prevented the machine from booting - it wasn't a lock that was held
for a long time, just the fact that every CPU would take it every HZ
was enough).

Matt> Without measurement there are two conflicting principles here:
Matt> code complexity vs. scalability.

The two should never be mutually exclusive as long as we are careful.

Matt> If you've made measurements I'm curious to know what kind of
Matt> locks were measured, where they were used, what the load was
Matt> doing -- as a rough characterization of the pattern of
Matt> notifications -- and how much it impacted scalability. This
Matt> information might help suggest a better solution.

The one I mentioned above was in the timer interrupt path.

Cheers,
Jes
From: Matt H. <mat...@us...> - 2006-01-12 23:26:45
On Thu, 2006-01-12 at 05:01 -0500, Jes Sorensen wrote:
> >>>>> "Matt" == Matt Helsley <mat...@us...> writes:
>
> Matt> On Wed, 2006-01-11 at 15:39 -0600, John Hesterberg wrote:
> >> I have two concerns about an all-tasks notification interface.
> >> First, we want this to scale, so don't want more global locks. One
> >> unique part of the task notify is that it doesn't use locks.
>
> Matt> Are you against global locks in these paths based solely on
> Matt> principle or have you measured their impact on scalability in
> Matt> those locations?
>
> Matt,
>
> It all depends on the specific location of the locks and how often
> they are taken. As long as something is taken by the same CPU all the
> time it is not going to be a major issue, but if we end up with
> anything resembling a global lock, even if it is only held for a very
> short time, it is going to bite badly. On a 4-way you probably won't
> notice, but go to a 64-way and it bites; on a 512-way it eats you
> alive (we had a problem in the timer code quite a while back that
> prevented the machine from booting - it wasn't a lock that was held
> for a long time, just the fact that every CPU would take it every HZ
> was enough).

OK, so you've established that global locks in timer paths are Bad.
However you haven't established that timer paths are almost the same
as the task paths we're talking about. I suspect timer paths are used
much more frequently than fork, exec, or exit.

I've appended a small patch that adds a global lock to the task_notify
paths for testing purposes. I'm curious to know what kind of a
performance difference you would see on your 64 or 512-way if you were
to run with it.

Looking at the ideas for lockless implementations I'm curious how well
Keith's notifier chains patch would work in this case. It does not
acquire a global lock in the "call" path and, as Keith suggested,
probably can be modified to avoid cache bouncing.

Cheers,
	-Matt Helsley

> Matt> Without measurement there are two conflicting principles here:
> Matt> code complexity vs. scalability.
>
> The two should never be mutually exclusive as long as we are careful.
>
> Matt> If you've made measurements I'm curious to know what kind of
> Matt> locks were measured, where they were used, what the load was
> Matt> doing -- as a rough characterization of the pattern of
> Matt> notifications -- and how much it impacted scalability. This
> Matt> information might help suggest a better solution.
>
> The one I mentioned above was in the timer interrupt path.
>
> Cheers,
> Jes

Index: linux-2.6.15/kernel/fork.c
===================================================================
--- linux-2.6.15.orig/kernel/fork.c
+++ linux-2.6.15/kernel/fork.c
@@ -849,10 +849,12 @@ asmlinkage long sys_set_tid_address(int
 	current->clear_child_tid = tidptr;

 	return current->pid;
 }

+extern spinlock_t proposed_global_lock;
+
 /*
  * This creates a new process as a copy of the old one,
  * but does not actually start it yet.
  *
  * It copies the registers, and all the appropriate
@@ -1137,12 +1139,14 @@ static task_t *copy_process(unsigned lon
 		p->signal->tty = NULL;

 	nr_threads++;
 	total_forks++;
 	write_unlock_irq(&tasklist_lock);
+	spin_lock(&proposed_global_lock);
 	proc_fork_connector(p);
 	cpuset_fork(p);
+	spin_unlock(&proposed_global_lock);
 	retval = 0;

 fork_out:
 	if (retval)
 		return ERR_PTR(retval);
Index: linux-2.6.15/kernel/exit.c
===================================================================
--- linux-2.6.15.orig/kernel/exit.c
+++ linux-2.6.15/kernel/exit.c
@@ -784,10 +784,11 @@ static void exit_notify(struct task_stru
 	/* If the process is dead, release it - nobody will wait for it */
 	if (state == EXIT_DEAD)
 		release_task(tsk);
 }

+extern spinlock_t proposed_global_lock;
 fastcall NORET_TYPE void do_exit(long code)
 {
 	struct task_struct *tsk = current;
 	int group_dead;

@@ -844,10 +845,12 @@ fastcall NORET_TYPE void do_exit(long co
 	if (group_dead) {
 		del_timer_sync(&tsk->signal->real_timer);
 		exit_itimers(tsk->signal);
 		acct_process(code);
 	}
+	spin_lock(&proposed_global_lock);
+	spin_unlock(&proposed_global_lock);
 	exit_mm(tsk);

 	exit_sem(tsk);
 	__exit_files(tsk);
 	__exit_fs(tsk);
Index: linux-2.6.15/fs/exec.c
===================================================================
--- linux-2.6.15.orig/fs/exec.c
+++ linux-2.6.15/fs/exec.c
@@ -1121,10 +1121,12 @@ int search_binary_handler(struct linux_b
 	return retval;
 }

 EXPORT_SYMBOL(search_binary_handler);

+extern spinlock_t proposed_global_lock;
+
 /*
  * sys_execve() executes a new program.
  */
 int do_execve(char * filename,
 	char __user *__user *argv,
@@ -1190,10 +1192,12 @@ int do_execve(char * filename,
 	retval = copy_strings(bprm->argc, argv, bprm);
 	if (retval < 0)
 		goto out;

+	spin_lock(&proposed_global_lock);
+	spin_unlock(&proposed_global_lock);
 	retval = search_binary_handler(bprm,regs);
 	if (retval >= 0) {
 		free_arg_pages(bprm);

 		/* execve success */
Index: linux-2.6.15/init/main.c
===================================================================
--- linux-2.6.15.orig/init/main.c
+++ linux-2.6.15/init/main.c
@@ -440,10 +440,11 @@ void __init parse_early_param(void)
 /*
  *	Activate the first processor.
  */

+spinlock_t proposed_global_lock = SPIN_LOCK_UNLOCKED;
 asmlinkage void __init start_kernel(void)
 {
 	char * command_line;
 	extern struct kernel_param __start___param[], __stop___param[];

 /*
From: Jes S. <je...@tr...> - 2006-01-13 09:35:16
>>>>> "Matt" == Matt Helsley <mat...@us...> writes:

Matt> On Thu, 2006-01-12 at 05:01 -0500, Jes Sorensen wrote:
>> It all depends on the specific location of the locks and how often
>> they are taken. As long as something is taken by the same CPU all
>> the time it is not going to be a major issue, but if we end up with
>> anything resembling a global lock, even if it is only held for a
>> very short time, it is going to bite badly. On a 4-way you probably
>> won't notice, but go to a 64-way and it bites; on a 512-way it eats
>> you alive (we had a problem in the timer code quite a while back
>> that prevented the machine from booting - it wasn't a lock that was
>> held for a long time, just the fact that every CPU would take it
>> every HZ was enough).

Matt> OK, so you've established that global locks in timer paths are
Matt> Bad. However you haven't established that timer paths are
Matt> almost the same as the task paths we're talking about. I suspect
Matt> timer paths are used much more frequently than fork, exec, or
Matt> exit.

Hi Matt,

I wasn't trying to make it sound like this was an apples to apples
comparison; what I am saying is simply that locks aren't free. You are
totally right that fork/exec should be called a lot less frequently,
but the delay accounting data collection points will be in far more
places than that and they will all go for the lock.

Matt> I've appended a small patch that adds a global lock to the
Matt> task_notify paths for testing purposes. I'm curious to know what
Matt> kind of a performance difference you would see on your 64 or
Matt> 512-way if you were to run with it.

I don't have a 512-way to play with at the moment, but again I don't
think it makes sense to benchmark things which don't match what we are
looking at. If we can avoid the locks in the first place then there's
really no reason for not doing that.

Matt> Looking at the ideas for lockless implementations I'm curious
Matt> how well Keith's notifier chains patch would work in this
Matt> case. It does not acquire a global lock in the "call" path and,
Matt> as Keith suggested, probably can be modified to avoid cache
Matt> bouncing.

Yup, I was curious about that. I haven't had a chance to look at it
carefully yet.

Cheers,
Jes
From: Matt H. <mat...@us...> - 2006-01-14 07:34:53
On Fri, 2006-01-13 at 04:35 -0500, Jes Sorensen wrote:
> >>>>> "Matt" == Matt Helsley <mat...@us...> writes:
>
> Matt> On Thu, 2006-01-12 at 05:01 -0500, Jes Sorensen wrote:
> >> It all depends on the specific location of the locks and how often
> >> they are taken. As long as something is taken by the same CPU all
> >> the time it is not going to be a major issue, but if we end up with
> >> anything resembling a global lock, even if it is only held for a
> >> very short time, it is going to bite badly. On a 4-way you probably
> >> won't notice, but go to a 64-way and it bites; on a 512-way it eats
> >> you alive (we had a problem in the timer code quite a while back
> >> that prevented the machine from booting - it wasn't a lock that was
> >> held for a long time, just the fact that every CPU would take it
> >> every HZ was enough).
>
> Matt> OK, so you've established that global locks in timer paths are
> Matt> Bad. However you haven't established that timer paths are
> Matt> almost the same as the task paths we're talking about. I suspect
> Matt> timer paths are used much more frequently than fork, exec, or
> Matt> exit.
>
> Hi Matt,
>
> I wasn't trying to make it sound like this was an apples to apples
> comparison; what I am saying is simply that locks aren't free.

OK. I'm not sure what gave you the impression I thought they were.

> You are totally right that fork/exec should be called a lot less
> frequently, but the delay accounting data collection points will be
> in far more places than that and they will all go for the lock.

All of these places are highly likely to also be in the context of the
task that has the lock in its task->delays struct. This could be a
strongly recommended practice for taking the lock -- we can add a
comment next to the lock suggesting as much.

> Matt> I've appended a small patch that adds a global lock to the
> Matt> task_notify paths for testing purposes. I'm curious to know what
> Matt> kind of a performance difference you would see on your 64 or
> Matt> 512-way if you were to run with it.
>
> I don't have a 512-way to play with at the moment, but again I don't
> think it makes sense to benchmark things which don't match what we are
> looking at. If we can avoid the locks in the first place then there's
> really no reason for not doing that.

It could be good for establishing a lower bound on contention
resulting from locking implementations of task_notify. I think in that
respect it's comparing two species of apples.

We need to clarify what locking we're talking about -- locking in
task_notify to protect the notification list vs. locking in delayacct
to protect the accounting data consistency. I think that a simple
lockless approach may only be possible for task_notify.

Cheers,
	-Matt Helsley
From: Jes S. <je...@tr...> - 2006-01-11 10:36:42
>>>>> "Shailabh" == Shailabh Nagar <na...@wa...> writes:

Shailabh> Jes Sorensen wrote:
>> I am quite concerned about that lock your patches put into struct
>> task_struct through struct task_delay_info. Have you done any
>> measurements on how this impacts performance on highly threaded
>> apps on larger systems?

Shailabh> I don't expect the lock contention to be high. The lock is
Shailabh> held for a very short time (across two
Shailabh> additions/increments). Moreover, it gets contended only when
Shailabh> the stats are being read (either through /proc or
Shailabh> connectors). Since the reading of stats won't be that
Shailabh> frequent (the utility of these numbers is to influence the
Shailabh> I/O priority/rss limit etc. which won't be done at a very
Shailabh> small granularity anyway), I wouldn't expect a problem.

Hi Shailabh,

When this is read through connectors, it's initiated by the connectors
code which is called from the task's context, hence we don't need
locking for that. It's very similar to the task_notify code I am about
to post and I think the connector code could fit into that framework.
The main issue is /proc, but then one could even have a mechanism with
a hook when the task exits that pushes the data to a storage point
which is lock protected.

Even if a lock isn't contended, you are still going to see the cache
lines bounce around due to the writes. It may not show up on a 4-way
box but what happens on a 64-way? We have seen some pretty nasty
effects on the bigger SN2 boxes with locks that were taken far too
frequently, to the point where it would prevent the box from booting
(now I don't expect it to be that severe here, but I'd still like to
explore an approach of doing it lock free).

Shailabh> But it's better to take some measurements anyway. Any
Shailabh> suggestions on a benchmark?

>> IMHO it seems to make more sense to use something like Jack's
>> proposed task_notifier code to lock-lessly collect the data into task
>> local data structures and then take the data from there and ship
>> off to userland through netlink or similar like you are doing?
>>
>> I am working on modifying Jack's patch to carry task local data and
>> use it for not just accounting but other areas that need optional
>> callbacks (optional in the sense that it's a feature that can be
>> enabled or disabled). Looking at Shailabh's delayacct_blkio()
>> changes it seems that it would be really easy to fit those into
>> that framework.
>>
>> Guess I should post some of this code .....

Shailabh> Please do. If this accounting can fit into some other
Shailabh> framework, that's fine too.

Ok, finally, sorry for the delay. My current code snapshot is
available at http://www.trained-monkey.org/~jes/patches/task_notify/ -
it's a modified version of Jack's task_notify code, and three example
users of it (the SysV IPC semundo semaphore, the key infrastructure
and SGI's JOB module). The patch order should be task_notify.diff,
task-notify-keys.diff, task-notify-semundo.diff, and
task_notify-job.diff last.

I think task_notify should be usable for statistics gathering as well,
the only issue is how to attach it to the processes we wish to gather
accounting for. Personally I am not a big fan of the current concept
where statistics are gathered for all tasks at all times but just not
exported until accounting is enabled.

I just had a quick look at the connector code and I think it could
possibly be hooked into the task_notify code as well, as connectors
seem to be another optional feature.

The API for the task_notify code is not set in stone and we can add
more notifier hooks where needed. If someone has strong reasoning for
making changes to the API, then I'll be very open to that.

Anyway, let me know what you think?

Cheers,
Jes
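The exit-time hook Jes mentions ("pushes the data to a storage point
which is lock protected") could look something like the sketch below;
every name here is hypothetical. The task updates its own delay stats
locklessly while it runs, and the one lock is taken once, from the
task's own context, as it exits:

	/* Called from the exiting task's own context via task_notify. */
	static int delayacct_exit_cb(struct task_notifier *tn,
				     unsigned long event, void *v)
	{
		struct task_struct *tsk = v;

		spin_lock(&delayacct_store_lock);
		delayacct_store_add(tsk->delays);	/* fold into global store */
		spin_unlock(&delayacct_store_lock);
		return 0;
	}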
From: Matt H. <mat...@us...> - 2006-01-12 23:55:43
On Wed, 2006-01-11 at 05:36 -0500, Jes Sorensen wrote:
> >>>>> "Shailabh" == Shailabh Nagar <na...@wa...> writes:
>
> Shailabh> Jes Sorensen wrote:
> >> I am quite concerned about that lock your patches put into struct
> >> task_struct through struct task_delay_info. Have you done any
> >> measurements on how this impacts performance on highly threaded
> >> apps on larger systems?
>
> Shailabh> I don't expect the lock contention to be high. The lock is
> Shailabh> held for a very short time (across two
> Shailabh> additions/increments). Moreover, it gets contended only when
> Shailabh> the stats are being read (either through /proc or
> Shailabh> connectors). Since the reading of stats won't be that
> Shailabh> frequent (the utility of these numbers is to influence the
> Shailabh> I/O priority/rss limit etc. which won't be done at a very
> Shailabh> small granularity anyway), I wouldn't expect a problem.

Not just when the stats are being read, but when the stats of that
particular task (A) are being read by one task (B) AND updated by the
task itself (A).

> Hi Shailabh,
>
> When this is read through connectors, it's initiated by the connectors
> code which is called from the task's context, hence we don't need
> locking for that. It's very similar to the task_notify code I am about
> to post and I think the connector code could fit into that framework.
> The main issue is /proc, but then one could even have a mechanism with
> a hook when the task exits that pushes the data to a storage point
> which is lock protected.

Hmm, which task? The request for stats does not necessarily/usually
originate from the task we desire stats on, hence the synchronization.
In this way it's significantly different from lockless task_notify.

> Even if a lock isn't contended, you are still going to see the cache
> lines bounce around due to the writes. It may not show up on a 4-way
> box but what happens on a 64-way? We have seen some pretty nasty
> effects on the bigger SN2 boxes with locks that were taken far too
> frequently, to the point where it would prevent the box from booting
> (now I don't expect it to be that severe here, but I'd still like to
> explore an approach of doing it lock free).

You're referring to the performance impact of a global lock on a large
system. The lock Shailabh introduced is per-task. Those won't bounce
around nearly as much -- I think they bounce under the following
conditions:

	- The task in which the lock is embedded is in the cache of
	  processor A and the task reading the stats of that task is
	  on processor B.

	- The scheduler bounces a task between processor A and
	  processor B.

Am I missing any other circumstances under which it could bounce?

> Shailabh> But it's better to take some measurements anyway. Any
> Shailabh> suggestions on a benchmark?
>
> >> IMHO it seems to make more sense to use something like Jack's
> >> proposed task_notifier code to lock-lessly collect the data into task
> >> local data structures and then take the data from there and ship
> >> off to userland through netlink or similar like you are doing?
> >>
> >> I am working on modifying Jack's patch to carry task local data and
> >> use it for not just accounting but other areas that need optional
> >> callbacks (optional in the sense that it's a feature that can be
> >> enabled or disabled). Looking at Shailabh's delayacct_blkio()
> >> changes it seems that it would be really easy to fit those into
> >> that framework.
> >>
> >> Guess I should post some of this code .....
>
> Shailabh> Please do. If this accounting can fit into some other
> Shailabh> framework, that's fine too.
>
> Ok, finally, sorry for the delay. My current code snapshot is
> available at http://www.trained-monkey.org/~jes/patches/task_notify/ -
> it's a modified version of Jack's task_notify code, and three example
> users of it (the SysV IPC semundo semaphore, the key infrastructure
> and SGI's JOB module). The patch order should be task_notify.diff,
> task-notify-keys.diff, task-notify-semundo.diff, and
> task_notify-job.diff last.
>
> I think task_notify should be usable for statistics gathering as well,
> the only issue is how to attach it to the processes we wish to gather
> accounting for. Personally I am not a big fan of the current concept
> where statistics are gathered for all tasks at all times but just not
> exported until accounting is enabled.

The only potential problem I can see with using task_notify for
gathering statistics is that each section of code that increments
stats would have to look for its notify block in the task's list.

Cheers,
	-Matt Helsley
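The closing concern -- every stat-increment site having to find its
notify block in the task's list -- amounts to a list walk on each
accounted event. A sketch, with a hypothetical task_notifier layout
and field names (task_notify_list, delayacct_notify, delay_stats are
all invented for illustration):

	struct task_notifier {
		struct task_notifier *next;
		int (*notifier_call)(struct task_notifier *tn,
				     unsigned long event, void *v);
		void *data;		/* task-local client data */
	};

	/* Runs on every accounted event; the walk is the added cost. */
	static struct delay_stats *find_my_stats(struct task_struct *tsk)
	{
		struct task_notifier *tn;

		for (tn = tsk->task_notify_list; tn; tn = tn->next)
			if (tn->notifier_call == delayacct_notify)
				return tn->data;
		return NULL;
	}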
From: Matt H. <mat...@us...> - 2006-01-11 21:06:55
On Wed, 2006-01-11 at 05:36 -0500, Jes Sorensen wrote:
> >>>>> "Shailabh" == Shailabh Nagar <na...@wa...> writes:
>
> Shailabh> Jes Sorensen wrote:
> >> I am quite concerned about that lock your patches put into struct
> >> task_struct through struct task_delay_info. Have you done any
> >> measurements on how this impacts performance on highly threaded
> >> apps on larger systems?
>
> Shailabh> I don't expect the lock contention to be high. The lock is
> Shailabh> held for a very short time (across two
> Shailabh> additions/increments). Moreover, it gets contended only when
> Shailabh> the stats are being read (either through /proc or
> Shailabh> connectors). Since the reading of stats won't be that
> Shailabh> frequent (the utility of these numbers is to influence the
> Shailabh> I/O priority/rss limit etc. which won't be done at a very
> Shailabh> small granularity anyway), I wouldn't expect a problem.
>
> Hi Shailabh,
>
> When this is read through connectors, it's initiated by the connectors
> code which is called from the task's context, hence we don't need
> locking for that. It's very similar to the task_notify code I am about
> to post and I think the connector code could fit into that framework.
> The main issue is /proc, but then one could even have a mechanism with
> a hook when the task exits that pushes the data to a storage point
> which is lock protected.
>
> Even if a lock isn't contended, you are still going to see the cache
> lines bounce around due to the writes. It may not show up on a 4-way
> box but what happens on a 64-way? We have seen some pretty nasty
> effects on the bigger SN2 boxes with locks that were taken far too
> frequently, to the point where it would prevent the box from booting
> (now I don't expect it to be that severe here, but I'd still like to
> explore an approach of doing it lock free).
>
> Shailabh> But it's better to take some measurements anyway. Any
> Shailabh> suggestions on a benchmark?
>
> >> IMHO it seems to make more sense to use something like Jack's
> >> proposed task_notifier code to lock-lessly collect the data into task
> >> local data structures and then take the data from there and ship
> >> off to userland through netlink or similar like you are doing?
> >>
> >> I am working on modifying Jack's patch to carry task local data and
> >> use it for not just accounting but other areas that need optional
> >> callbacks (optional in the sense that it's a feature that can be
> >> enabled or disabled). Looking at Shailabh's delayacct_blkio()
> >> changes it seems that it would be really easy to fit those into
> >> that framework.
> >>
> >> Guess I should post some of this code .....
>
> Shailabh> Please do. If this accounting can fit into some other
> Shailabh> framework, that's fine too.
>
> Ok, finally, sorry for the delay. My current code snapshot is
> available at http://www.trained-monkey.org/~jes/patches/task_notify/ -
> it's a modified version of Jack's task_notify code, and three example
> users of it (the SysV IPC semundo semaphore, the key infrastructure
> and SGI's JOB module). The patch order should be task_notify.diff,
> task-notify-keys.diff, task-notify-semundo.diff, and
> task_notify-job.diff last.

I can already tell you I don't like the "magic" mechanism to identify
notifier blocks. The problem is that it's yet another space of id
numbers that we have to manage -- either manually, by having a person
hand the numbers out to developers, or automatically using the idr
code. You could use the notifier block's address and avoid an
additional id space.

Also, even if this mechanism goes into task_notify it needs a better
name than "magic".

> I think task_notify should be usable for statistics gathering as well,
> the only issue is how to attach it to the processes we wish to gather
> accounting for. Personally I am not a big fan of the current concept
> where statistics are gathered for all tasks at all times but just not
> exported until accounting is enabled.

Have you looked at Alan Stern's notifier chain fix patch? Could that
be used in task_notify?

If not, perhaps the patch could use the standard kernel list idioms.

Another potential user for the task_notify functionality is the
process events connector. The problem is it requires the ability to
receive notifications about all processes. Also, there's a chance that
future CKRM code could use all-task and per-task notification.
Combined with John Hesterberg's feedback I think there is strong
justification for an all-tasks notification interface.

If there was some way to quickly check for registration on an
all-tasks list you could do both all-task and per-task notification in
the same code. I took a shot at this a while back but the locking was
incomplete. Perhaps a per-cpu all-task notification list would satisfy
your performance-impact concerns.

<snip>

> The API for the task_notify code is not set in stone and we can add
> more notifier hooks where needed. If someone has strong reasoning for
> making changes to the API, then I'll be very open to that.

I'd like to see the all-task notification I mentioned above.
Notifications where uid/gids change could be useful for auditing and
the process events connector.

<snip>

Cheers,
	-Matt Helsley
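A sketch of the address-as-handle alternative suggested above, with
hypothetical task_notify function names: the caller allocates (or
declares) the block, keeps the pointer, and passes the same pointer to
deregister, so no separate id space is needed.

	static struct task_notifier my_block = {
		.notifier_call	= my_callback,
	};

	task_notify_register(tsk, &my_block);
	/* ... later, the same address identifies the block ... */
	task_notify_unregister(tsk, &my_block);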
From: Jes S. <je...@tr...> - 2006-01-12 09:51:36
>>>>> "Matt" == Matt Helsley <mat...@us...> writes:

Matt> On Wed, 2006-01-11 at 05:36 -0500, Jes Sorensen wrote:

Matt> I can already tell you I don't like the "magic" mechanism to
Matt> identify notifier blocks. The problem is that it's yet another
Matt> space of id numbers that we have to manage -- either manually,
Matt> by having a person hand the numbers out to developers, or
Matt> automatically using the idr code. You could use the notifier
Matt> block's address and avoid an additional id space.

Hi Matt,

I did debate with myself whether or not to take the 'magic' approach.
I don't think the allocation is a big issue, it's all compile time
determined and as long as people stick it in task_notifier.h it
doesn't matter if their magic number is changed when it goes into the
official tree.

Using the notifier block's address won't work; they are often
dynamically allocated, so a client searching for one will rarely know
the address.

The alternative is to search for a function pointer of one of the
callout functions; however, it requires that all users provide a
specific function pointer, or we need to add some flags for the search
function (eg. one of the TN_ ones could be passed in), but this would
complicate the search function and make it slower. This is the
approach Jack took in his original patch, however Jack's patch had
just a single notifier function and it was the user's responsibility
to write the code for the demultiplexing of the callouts. I do not
like this approach as it will be a lot more error prone with everyone
doing their own version.

Matt> Also, even if this mechanism goes into task_notify it needs a
Matt> better name than "magic".

Magic is fairly standard for this kind of feature in the kernel,
grep MAGIC include/linux/*.h ;-)

>> I think task_notify should be usable for statistics gathering as
>> well, the only issue is how to attach it to the processes we wish
>> to gather accounting for. Personally I am not a big fan of the
>> current concept where statistics are gathered for all tasks at all
>> times but just not exported until accounting is enabled.

Matt> Have you looked at Alan Stern's notifier chain fix patch?
Matt> Could that be used in task_notify?

No sorry, do you have a pointer?

Matt> If not, perhaps the patch could use the standard kernel list
Matt> idioms.

Matt> Another potential user for the task_notify functionality is
Matt> the process events connector. The problem is it requires the
Matt> ability to receive notifications about all processes. Also,
Matt> there's a chance that future CKRM code could use all-task and
Matt> per-task notification. Combined with John Hesterberg's feedback
Matt> I think there is strong justification for an all-tasks
Matt> notification interface.

I am wary of this. I don't see how we will effectively be able to do
an all-task notification without having to traverse the full task list
and take a ton of global locks. If anybody has any idea to get around
this problem I'd be very interested in hearing their suggestions.

Matt> If there was some way to quickly check for registration on an
Matt> all-tasks list you could do both all-task and per-task
Matt> notification in the same code. I took a shot at this a while
Matt> back but the locking was incomplete. Perhaps a per-cpu all-task
Matt> notification list would satisfy your performance-impact
Matt> concerns.

That could be the approach to take to get around it, but I must admit
I don't know the scheduler well enough to say whether we will be able
to assign all tasks to one single CPU at any given time, and then
there's the issue of locking when they migrate. But again, the latter
may already be handled by the scheduler?

>> The API for the task_notify code is not set in stone and we can add
>> more notifier hooks where needed. If someone has strong reasoning
>> for making changes to the API, then I'll be very open to that.

Matt> I'd like to see the all-task notification I mentioned above.
Matt> Notifications where uid/gids change could be useful for auditing
Matt> and the process events connector.

If anybody is willing to step up and propose a patch for this I'd be
interested in taking a look. I don't offhand see a simple solution
for it.

Regards,
Jes
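For contrast, the single-callback demultiplexing style from Jack's
original patch -- the one Jes argues is error prone when everyone
writes their own -- would look roughly like this. TN_FORK and TN_EXIT
are guesses at the patch's TN_ event values; the rest is hypothetical:

	static int my_task_notify(struct task_notifier *tn,
				  unsigned long event, void *v)
	{
		switch (event) {
		case TN_FORK:
			/* this client's fork handling */
			break;
		case TN_EXIT:
			/* this client's exit handling */
			break;
		default:
			break;
		}
		return 0;
	}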
From: Matt H. <mat...@us...> - 2006-01-12 23:07:08
On Thu, 2006-01-12 at 04:51 -0500, Jes Sorensen wrote:
> >>>>> "Matt" == Matt Helsley <mat...@us...> writes:
>
> Matt> On Wed, 2006-01-11 at 05:36 -0500, Jes Sorensen wrote:
>
> Matt> I can already tell you I don't like the "magic" mechanism to
> Matt> identify notifier blocks. The problem is that it's yet another
> Matt> space of id numbers that we have to manage -- either manually,
> Matt> by having a person hand the numbers out to developers, or
> Matt> automatically using the idr code. You could use the notifier
> Matt> block's address and avoid an additional id space.
>
> Hi Matt,
>
> I did debate with myself whether or not to take the 'magic' approach.
> I don't think the allocation is a big issue, it's all compile time
> determined and as long as people stick it in task_notifier.h it
> doesn't matter if their magic number is changed when it goes into the
> official tree.
>
> Using the notifier block's address won't work; they are often
> dynamically allocated, so a client searching for one will rarely know
> the address.

If the task_notify user is responsible for allocating the notifier
block then it could easily keep a copy of the pointer handy for the
corresponding deregistration.

> The alternative is to search for a function pointer of one of the
> callout functions; however, it requires that all users provide a
> specific function pointer, or we need to add some flags for the search
> function (eg. one of the TN_ ones could be passed in), but this would
> complicate the search function and make it slower. This is the
> approach Jack took in his original patch, however Jack's patch had
> just a single notifier function and it was the user's responsibility
> to write the code for the demultiplexing of the callouts. I do not
> like this approach as it will be a lot more error prone with everyone
> doing their own version.

I don't see how it will be error prone. Jack's interface was simple.
And if we're really worried about users messing it up we could
generate a set of macros that users must use to demultiplex the call.

One function pointer per event does bloat the notifier block
structure. Given that this notifier block may need to be replicated
per-task, this is a significant amount of space. Jack's demultiplexing
approach requires that the space be proportional to the number of
event functions instead of proportional to the number of notifier
blocks.

Furthermore, if new task events are added this structure would need to
be expanded. In contrast, Jack's approach only required the addition
of a new event value.

For these reasons I think using a single function pointer will be much
more effective.

> Matt> Also, even if this mechanism goes into task_notify it needs a
> Matt> better name than "magic".
>
> Magic is fairly standard for this kind of feature in the kernel,
> grep MAGIC include/linux/*.h ;-)

Using ids and pointers in the kernel is pretty standard too. I invite
you to grep for those and count them too ;)

Seriously, it's a nit, but I think the name could better reflect the
purpose of the field.

> >> I think task_notify should be usable for statistics gathering as
> >> well, the only issue is how to attach it to the processes we wish
> >> to gather accounting for. Personally I am not a big fan of the
> >> current concept where statistics are gathered for all tasks at all
> >> times but just not exported until accounting is enabled.
>
> Matt> Have you looked at Alan Stern's notifier chain fix patch?
> Matt> Could that be used in task_notify?
>
> No sorry, do you have a pointer?

No problem. Here it is:

http://marc.theaimsgroup.com/?l=linux-kernel&m=113407207418475&w=2

I think it would be ideal if task_notify could simply be a notifier
chain for notifying users of task events/changes.

> Matt> If not, perhaps the patch could use the standard kernel list
> Matt> idioms.
> Matt> Another potential user for the task_notify functionality is
> Matt> the process events connector. The problem is it requires the
> Matt> ability to receive notifications about all processes. Also,
> Matt> there's a chance that future CKRM code could use all-task and
> Matt> per-task notification. Combined with John Hesterberg's feedback
> Matt> I think there is strong justification for an all-tasks
> Matt> notification interface.
>
> I am wary of this. I don't see how we will effectively be able to do
> an all-task notification without having to traverse the full task list
> and take a ton of global locks. If anybody has any idea to get around
> this problem I'd be very interested in hearing their suggestions.

Why would we have to traverse the full task list?! Just add one
notifier block to a single, global list of notifiers. Keith's patch
demonstrates one potential method of avoiding a lock in the common
path: traversing the notification chain and making the calls.

> Matt> If there was some way to quickly check for registration on an
> Matt> all-tasks list you could do both all-task and per-task
> Matt> notification in the same code. I took a shot at this a while
> Matt> back but the locking was incomplete. Perhaps a per-cpu all-task
> Matt> notification list would satisfy your performance-impact
> Matt> concerns.
>
> That could be the approach to take to get around it, but I must admit
> I don't know the scheduler well enough to say whether we will be able
> to assign all tasks to one single CPU at any given time, and then
> there's the issue of locking when they migrate. But again, the latter
> may already be handled by the scheduler?

"All-task" means the notification block calls its function when any
task triggers a traversal of the notification lists. This does not
imply that registration/call/deregistration of an all-task
notification must traverse all tasks.

> >> The API for the task_notify code is not set in stone and we can add
> >> more notifier hooks where needed. If someone has strong reasoning
> >> for making changes to the API, then I'll be very open to that.
>
> Matt> I'd like to see the all-task notification I mentioned above.
> Matt> Notifications where uid/gids change could be useful for auditing
> Matt> and the process events connector.
>
> If anybody is willing to step up and propose a patch for this I'd be
> interested in taking a look. I don't offhand see a simple solution
> for it.

Keith posted an interesting patch for notification chains that I
believe could address your concerns -- though from what I've read it
needs to disable preemption. As far as adding notifications for
uid/gid change, that's relatively trivial. You might look at some
revisions of task_notify that Chandra Seetharaman and I have posted.

Cheers,
	-Matt
From: Jes S. <je...@tr...> - 2006-01-13 09:59:05
>>>>> "Matt" == Matt Helsley <mat...@us...> writes:

Matt> On Thu, 2006-01-12 at 04:51 -0500, Jes Sorensen wrote:
>> Using the notifier block's address won't work; they are often
>> dynamically allocated, so a client searching for one will rarely
>> know the address.

Matt> If the task_notify user is responsible for allocating the
Matt> notifier block then it could easily keep a copy of the pointer
Matt> handy for the corresponding deregistration.

And when they are dynamically allocated on a per-object level? Then
you'd have to keep linked lists around to keep track of them. Sorry,
doesn't work.

>> The alternative is to search for a function pointer of one of the
>> callout functions; however, it requires that all users provide a
>> specific function pointer, or we need to add some flags for the
>> search function (eg. one of the TN_ ones could be passed in), but
>> this would complicate the search function and make it slower. This
>> is the approach Jack took in his original patch, however Jack's
>> patch had just a single notifier function and it was the user's
>> responsibility to write the code for the demultiplexing of the
>> callouts. I do not like this approach as it will be a lot more
>> error prone with everyone doing their own version.

Matt> I don't see how it will be error prone. Jack's interface was
Matt> simple. And if we're really worried about users messing it up we
Matt> could generate a set of macros that users must use to
Matt> demultiplex the call.

Because experience shows that the more people copy the same code, the
more likely it is that someone will get it wrong.

Matt> One function pointer per event does bloat the notifier block
Matt> structure. Given that this notifier block may need to be
Matt> replicated per-task, this is a significant amount of
Matt> space. Jack's demultiplexing approach requires that the space be
Matt> proportional to the number of event functions instead of
Matt> proportional to the number of notifier blocks.

Matt> Furthermore, if new task events are added this structure would
Matt> need to be expanded. In contrast, Jack's approach only required
Matt> the addition of a new event value.

Matt> For these reasons I think using a single function pointer will
Matt> be much more effective.

So you add a few pointers per task compared to the code that does the
demultiplexing. We're talking 3-4 extra pointers in comparison to the
demultiplexing code *and* the extra function call. My approach is
consistent with what the rest of the kernel does for most structures,
struct file_operations etc. But if that's what makes the difference as
to whether we go for this type of API, then let's change it back. I am
not married to the API, but rather the functionality.

Matt> No problem. Here it is:
Matt> http://marc.theaimsgroup.com/?l=linux-kernel&m=113407207418475&w=2

Matt> I think it would be ideal if task_notify could simply be a
Matt> notifier chain for notifying users of task events/changes.

That's really what it was intended for, but of course only in the
task's own context.

>> That could be the approach to take to get around it, but I must
>> admit I don't know the scheduler well enough to say whether we will
>> be able to assign all tasks to one single CPU at any given time, and
>> then there's the issue of locking when they migrate. But again, the
>> latter may already be handled by the scheduler?

Matt> "All-task" means the notification block calls its function
Matt> when any task triggers a traversal of the notification
Matt> lists. This does not imply that registration/call/deregistration
Matt> of an all-task notification must traverse all tasks.

Well then the name is somewhat misleading ;-) Sounds almost like
something that could fit into Erik's Job stuff, but I guess it's a
question of what criteria those tasks are put onto that list with.

Being a little short on the history, are there any pointers to
examples or descriptions of what this would be used for? Curious to
understand the usage pattern.

Matt> I'd like to see the all-task notification I mentioned above.
Matt> Notifications where uid/gids change could be useful for auditing
Matt> and the process events connector.

>> If anybody is willing to step up and propose a patch for this I'd
>> be interested in taking a look. I don't offhand see a simple
>> solution for it.

Matt> Keith posted an interesting patch for notification chains that
Matt> I believe could address your concerns -- though from what I've
Matt> read it needs to disable preemption. As far as adding
Matt> notifications for uid/gid change, that's relatively trivial. You
Matt> might look at some revisions of task_notify that Chandra
Matt> Seetharaman and I have posted.

I did go through the archives a while back and I did notice that both
Jack and Erik pointed out a number of issues with those approaches. If
we are going to do the task-group-notify thing, then going with
Keith's approach is probably the best.

Cheers,
Jes
From: Jes S. <je...@tr...> - 2006-01-13 10:39:17
>>>>> "Matt" == Matt Helsley <mat...@us...> writes:

Matt> On Thu, 2006-01-12 at 04:51 -0500, Jes Sorensen wrote:

Matt> Have you looked at Alan Stern's notifier chain fix patch? Could
Matt> that be used in task_notify?
>> No sorry, do you have a pointer?

Matt> No problem. Here it is:
Matt> http://marc.theaimsgroup.com/?l=linux-kernel&m=113407207418475&w=2

Matt> I think it would be ideal if task_notify could simply be a
Matt> notifier chain for notifying users of task events/changes.

Ok, went back and looked at this. I think the core concept is fine,
but there are details such as having a data pointer associated with
the notifier block which is too important to leave out. Otherwise we
have to stick things into the task struct in many cases, which is a
waste of space. I also think it needs to be possible to search the
list for special slow path uses, to avoid us adding excessive amounts
of callbacks that are only used in one place by one client.

If we can cross-API it for task-group-notifiers then that should be
fine.

Cheers,
Jes
From: Matt H. <mat...@us...> - 2006-01-13 23:29:11
On Fri, 2006-01-13 at 05:38 -0500, Jes Sorensen wrote:
> >>>>> "Matt" == Matt Helsley <mat...@us...> writes:
>
> Matt> On Thu, 2006-01-12 at 04:51 -0500, Jes Sorensen wrote:
> Matt> Have you looked at Alan Stern's notifier chain fix patch? Could
> Matt> that be used in task_notify?
> >> No sorry, do you have a pointer?
>
> Matt> No problem. Here it is:
> Matt> http://marc.theaimsgroup.com/?l=linux-kernel&m=113407207418475&w=2
>
> Matt> I think it would be ideal if task_notify could simply be a
> Matt> notifier chain for notifying users of task events/changes.
>
> Ok, went back and looked at this. I think the core concept is fine,
> but there are details such as having a data pointer associated with
> the notifier block which is too important to leave out. Otherwise we
> have to stick things into the task struct in many cases, which is a
> waste of space. I also think it needs to be possible to search the
> list for special slow path uses, to avoid us adding excessive amounts
> of callbacks that are only used in one place by one client.

I agree that having a data pointer associated with the notifier block
is important. It helps us avoid increasing the size of task_struct for
each task_notify user and makes modularization of them possible.

Hmm, yes, excessive amounts of callbacks for those used by only one
client could be a problem. My first approach to that problem would be
to have one task_notify list per event rather than just a single list
for all events. This has ugly space implications. More importantly, I
don't think it's a problem yet. Until it's a problem we ought to go
with the simpler implementations. When it is a problem the task_notify
interface should insulate the users from such a change.

> If we can cross-API it for task-group-notifiers then that should be
> fine.
>
> Cheers,
> Jes

Yup.

Cheers,
	-Matt Helsley