Re: [perfmon2] [PATCH 1/2] perf_events: add cgroup support (v8)

* Peter Zijlstra <pe...@in...> [2011-02-02 13:46:32]:

> On Wed, 2011-02-02 at 17:20 +0530, Balbir Singh wrote:
> > * Peter Zijlstra <pe...@in...> [2011-02-02 12:29:20]:
> > 
> > > On Thu, 2011-01-20 at 15:39 +0100, Peter Zijlstra wrote:
> > > > On Thu, 2011-01-20 at 15:30 +0200, Stephane Eranian wrote:
> > > > > @@ -4259,8 +4261,20 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
> > > > >  
> > > > >         /* Reassign the task to the init_css_set. */
> > > > >         task_lock(tsk);
> > > > > +       /*
> > > > > +        * we mask interrupts to prevent:
> > > > > +        * - timer tick to cause event rotation which
> > > > > +        *   could schedule back in cgroup events after
> > > > > +        *   they were switched out by perf_cgroup_sched_out()
> > > > > +        *
> > > > > +        * - preemption which could schedule back in cgroup events
> > > > > +        */
> > > > > +       local_irq_save(flags);
> > > > > +       perf_cgroup_sched_out(tsk);
> > > > >         cg = tsk->cgroups;
> > > > >         tsk->cgroups = &init_css_set;
> > > > > +       perf_cgroup_sched_in(tsk);
> > > > > +       local_irq_restore(flags);
> > > > >         task_unlock(tsk);
> > > > >         if (cg)
> > > > >                 put_css_set_taskexit(cg); 
> > > > 
> > > > So you too need a callback on cgroup change there.. Li, Paul, any chance
> > > > we can fix this cgroup_subsys::exit callback? The scheduler code needs
> > > > to do funny things because it's in the wrong place as well.
> > > 
> > > cgroup guys? Shall I just fix this exit thing since the only user seems
> > > to be the scheduler and now perf, for both of which it's unfortunate at
> > > best?
> > 
> > Are you suggesting that the cgroup_exit on task_exit notification should be
> > pulled out?
> 
> 
> No, just fixed. The callback as it exists isn't useful and leads to
> hacks like the above.
>

OK

> 
> > > Balbir, memcontrol.c uses pre_destroy(), I pose that using this method
> > > is broken per definition since it makes the cgroup empty notification
> > > void.
> > >
> > 
> > We use pre_destroy() to reclaim, so that delete/rmdir() will be able
> > to clean up the node/group. I am not sure what you mean by it makes
> > the empty notification void and why pre_destroy() is broken?
> 
> A quick look at the code looked like it could return -EBUSY (and other
> errors), in that case the rmdir of the empty cgroup will fail.
> 
> Therefore it can happen that after the last task is removed, and we get
> the notification that the cgroup is empty, and we attempt the rmdir we
> will fail.
> 
> This again means that all such notification handlers must poll state,
> which is ridiculous.

The failure occurs because someone still holds an active reference to
the cgroup structure. In the case of memory, it used to be every
page_cgroup. The only reason a notification handler would have to
poll state is if

1. A notification is sent that there are no references, so the group
can be cleaned up
2. A new reference is acquired before the cleanup

1 followed by 2 is unlikely
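
To make that window concrete, here is a rough userspace sketch of what a
notification handler would have to do if a reference does show up
between the notification and the rmdir. It is not taken from the patch
or from any existing agent: the helper name remove_when_empty(), the
retry count and the 10 ms delay are all made up for illustration. It
only relies on rmdir(2) failing with EBUSY, as discussed above.

```c
#include <errno.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): called after an "empty"
 * notification to remove the cgroup directory at @path.
 *
 * If rmdir() fails with EBUSY, something took a reference between the
 * notification and the cleanup, so we wait a little and try again, up
 * to @max_tries attempts. Any other error is returned immediately.
 *
 * Returns 0 on success or a negative errno value on failure.
 */
static int remove_when_empty(const char *path, int max_tries)
{
	/* Arbitrary back-off between attempts: 10 ms. */
	struct timespec delay = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
	int i;

	for (i = 0; i < max_tries; i++) {
		if (rmdir(path) == 0)
			return 0;
		if (errno != EBUSY)
			return -errno;
		/* Still referenced: this is the "poll state" case. */
		nanosleep(&delay, NULL);
	}
	return -EBUSY;
}
```

If 1 followed by 2 really is rare, the loop exits on the first rmdir()
almost every time and the retry path is only a fallback.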

-- 
	Three Cheers,
	Balbir