Thread: [Lse-tech] wait_for_rcu implementation proposal for 2.4.7

[Lse-tech] wait_for_rcu implementation proposal for 2.4.7

From: Andrea A. <an...@su...> - 2001-07-31 08:19:51

I'm looking at the rcu patches, I checked those two:

	rcu_qsctr-2.4.6-02.patch
	rcu_sched-2.4.6-02.patch

The first one has a showstopper problem: it introduces a `call' in
the syscall entry point fast path. I don't think slowing down the
syscall entry point will pay off.

The second adds conditional branches to schedule() which is a fast path
too (even if certainly less important than the syscall entry point).

I'd prefer an approach with a zero performance impact on the fast paths
(and with a slower slow path of course, but as usual I believe improving
any slow path at the expense of any fast path is a very bad idea ;).

BTW: it is better if wait_for_rcu doesn't need to allocate any memory, so
it can be used on paths that must complete before memory can be released,
without risking deadlocks.

Both for simplicity of the implementation and to avoid any memory
allocation in the wait_for_rcu path, without adding any other thread,
I think we could use ksoftirqd for it (ok, abusing it a little bit, but
not too much). I mean it would be enough to do something like:

	for_each_cpu(cpu):
		if cpu == smp_processor_id():
			continue
		ksoftirqd_task(cpu)->need_resched = 1
		wake_ksoftirqd(cpu)
		while (ksoftirqd_task(cpu)->need_resched);

to implement wait_for_rcu in 2.4.7.  No changes are needed in any kernel
common code, so obviously there is no impact on any fast path. Comments?
Would you be interested in implementing a new rcu patch for 2.4.7 with the
above logic and no changes in the common code?

Andrea

PS. A micro-optimization over the above is to set need_resched on all
the other CPUs first, before starting to wait on any of them.
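
As a rough illustration of the loop above, including the two-phase
ordering from the PS, here is a single-threaded user-space model in C.
It is only a sketch: struct fake_cpu, fake_schedule(), fake_tick() and
wait_for_rcu_model() are hypothetical names invented for this example,
not kernel APIs, and the model assumes that once a woken ksoftirqd has
run on a CPU, that CPU has gone through a context switch and
need_resched has been cleared.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_FAKE_CPUS 4            /* hypothetical CPU count for the model */

/* Hypothetical stand-in for one CPU and its ksoftirqd task. */
struct fake_cpu {
    bool need_resched;            /* models ksoftirqd_task(cpu)->need_resched */
    bool woken;                   /* models a pending wake_ksoftirqd(cpu) */
    unsigned long switches;       /* scheduler passes that ran ksoftirqd */
};

static struct fake_cpu fake_cpus[NR_FAKE_CPUS];

/* One scheduler pass on a CPU: a woken ksoftirqd gets to run, so the
 * CPU has context-switched, and the flag is cleared. */
static void fake_schedule(int cpu)
{
    if (fake_cpus[cpu].woken) {
        fake_cpus[cpu].woken = false;
        fake_cpus[cpu].need_resched = false;
        fake_cpus[cpu].switches++;
    }
}

/* Advance simulated time so the waiter's loop can make progress. */
static void fake_tick(void)
{
    for (int cpu = 0; cpu < NR_FAKE_CPUS; cpu++)
        fake_schedule(cpu);
}

/* Two-phase variant from the PS: flag every other CPU first, then wait.
 * Returns the number of CPUs waited on. */
int wait_for_rcu_model(int self)
{
    int waited = 0;

    assert(self >= 0 && self < NR_FAKE_CPUS);

    /* Phase 1: request a reschedule on every CPU except our own. */
    for (int cpu = 0; cpu < NR_FAKE_CPUS; cpu++) {
        if (cpu == self)
            continue;
        fake_cpus[cpu].need_resched = true;
        fake_cpus[cpu].woken = true;
    }

    /* Phase 2: wait until each of those CPUs has rescheduled. */
    for (int cpu = 0; cpu < NR_FAKE_CPUS; cpu++) {
        if (cpu == self)
            continue;
        while (fake_cpus[cpu].need_resched)
            fake_tick();          /* the real loop would simply spin */
        waited++;
    }
    return waited;
}
```

Calling wait_for_rcu_model(0) flags CPUs 1 to 3 and returns once each
has rescheduled. Because every flag is set before any waiting starts,
the other CPUs can reschedule concurrently rather than one after
another.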

Re: [Lse-tech] wait_for_rcu implementation proposal for 2.4.7

From: Dipankar S. <dip...@se...> - 2001-07-31 10:48:37

Hi Andrea,

On Tue, Jul 31, 2001 at 10:20:19AM +0200, Andrea Arcangeli wrote:
> I'm looking at the rcu patches, I checked those two:
> 
> 	rcu_qsctr-2.4.6-02.patch
> 	rcu_sched-2.4.6-02.patch
> 
> The first one has a showstopper problem: it introduces a `call' in
> the syscall entry point fast path. I don't think slowing down the
> syscall entry point will pay off.

This is definitely not the optimal solution. An ideal solution would
have been to increment the syscall counter in the cpu-local structure
directly in the syscall entry code, but there doesn't seem to be a
way of automatically generating structure sizes and field offsets
to be used in assembly code. I can't hard-code those because my
cacheline aligned structures may have different sizes depending
on the CPU type. A workaround, however, could be to keep these
as initialized data, accessed as globals from the syscall entry
point. Interestingly, I measured the syscall entry code with both
'call' and assembly code to increment rcu-plocal-syscallctr and
it showed no difference. It may also have been a telling indictment of
my knowledge of P-III micro-architecture and my inability to write
good assembly code for it :-) I would still use assembly here if I can
figure out a good way to do that.

This is also a sign that we need a PDA that can be accessed quickly.
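
To make the size problem concrete, here is a small C sketch of a
cacheline-aligned cpu-local counter structure. Everything in it is an
assumption made for illustration: DEMO_CACHE_BYTES, struct
demo_rcu_plocal and its fields are hypothetical names, and 32 bytes is
only one possible line size. The point is that the per-CPU stride
changes with the alignment, which is why it cannot be hard-coded in
assembly.

```c
#include <assert.h>
#include <stddef.h>

#ifndef DEMO_CACHE_BYTES
#define DEMO_CACHE_BYTES 32       /* assumption: line size varies by CPU type */
#endif

#define DEMO_NR_CPUS 4            /* hypothetical CPU count */

/* Hypothetical cpu-local structure; aligning the first member pads the
 * whole structure out to a multiple of the cache line size. */
struct demo_rcu_plocal {
    _Alignas(DEMO_CACHE_BYTES) unsigned long syscallctr;
    unsigned long ctxswitchctr;
};

static struct demo_rcu_plocal demo_plocal_area[DEMO_NR_CPUS];

/* C equivalent of the increment the syscall entry path would perform. */
static inline void demo_count_syscall(int cpu)
{
    demo_plocal_area[cpu].syscallctr++;
}

/* Distance in bytes between two CPUs' structures. */
size_t demo_plocal_stride(void)
{
    return sizeof(struct demo_rcu_plocal);
}
```

Rebuilding with a different DEMO_CACHE_BYTES changes the value returned
by demo_plocal_stride(), so any assembly that indexes this array needs
the stride supplied at build time or link time.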

> 
> The second adds conditional branches to schedule() which is a fast path
> too (even if certainly less important than the syscall entry point).

Another thing Andi pointed out was the mod operation and that scheduler
folks might not be happy about it.

> 
> I'd prefer an approach with a zero performance impact on the fast paths
> (and with a slower slow path of course, but as usual I believe improving
> any slow path at the expense of any fast path is a very bad idea ;).

This is certainly a noble goal. We will keep that in mind.

> BTW: it is better if wait_for_rcu doesn't need to allocate any memory, so
> it can be used on paths that must complete before memory can be released,
> without risking deadlocks.

Yes, I agree. This is a big issue with the current implementations,
and the recommendation is not to wait until the last minute
before destruction to allocate the rcu_head.

> 
> Both for simplicity of the implementation and to avoid any memory
> allocation in the wait_for_rcu path, without adding any other thread,
> I think we could use ksoftirqd for it (ok, abusing it a little bit, but
> not too much). I mean it would be enough to do something like:
> 
> 	for_each_cpu(cpu):
> 		if cpu == smp_processor_id():
> 			continue
> 		ksoftirqd_task(cpu)->need_resched = 1
> 		wake_ksoftirqd(cpu)
> 		while (ksoftirqd_task(cpu)->need_resched);
> 
> to implement wait_for_rcu in 2.4.7.  No changes are needed in any kernel
> common code, so obviously there is no impact on any fast path. Comments?
> Would you be interested in implementing a new rcu patch for 2.4.7 with the
> above logic and no changes in the common code?

I will try to write a 2.4.7 patch and do some measurements. Stay tuned :-)

Thanks
Dipankar
-- 
Dipankar Sarma  <dip...@se...> Project: http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

Re: [Lse-tech] wait_for_rcu implementation proposal for 2.4.7

From: Christoph H. <hc...@ca...> - 2001-07-31 11:11:29

Hi Dipankar,

On Tue, Jul 31, 2001 at 04:23:30PM +0530, Dipankar Sarma wrote:
> > The first one has a showstopper problem: it introduces a `call' in
> > the syscall entry point fast path. I don't think slowing down the
> > syscall entry point will pay off.
> 
> This is definitely not the optimal solution. An ideal solution would
> have been to increment the syscall counter in the cpu-local structure
> directly in the syscall entry code,

Agreed.

> but there doesn't seem to be a
> way of automatically generating structure sizes and field offsets
> to be used in assembly code.

Some ports actually do this kind of thing.
A good example is arch/sparc/kernel/check_asm.sh.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
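
One general way to generate such constants, sketched here in C and not
claimed to be what check_asm.sh itself does, is to have a small helper
compute each offset with offsetof() and emit assembler directives that
are included at build time. struct demo_task, the TS_* labels and
format_asm_offset() are hypothetical names used only for this
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the few task_struct fields entry.S reads. */
struct demo_task {
    long state;
    unsigned long flags;
    int sigpending;
    unsigned long addr_limit;
    void *exec_domain;
    long need_resched;
};

/* Write ".equ LABEL,offset" plus a ".globl LABEL" line into buf.
 * Returns the snprintf() result (length the text needs). */
int format_asm_offset(char *buf, size_t len, const char *label, size_t offset)
{
    return snprintf(buf, len, "\t.equ\t%s,%zu\n\t.globl\t%s\n",
                    label, offset, label);
}

/* Let the compiler compute the offset, so layout changes are picked up. */
#define FORMAT_STRUCT_OFFSET(buf, len, label, type, member) \
    format_asm_offset((buf), (len), (label), offsetof(type, member))
```

A generator built on this would call FORMAT_STRUCT_OFFSET() once per
field and write the results to a file included by the assembly source,
so the offsets follow the C structure definition automatically.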

Re: [Lse-tech] wait_for_rcu implementation proposal for 2.4.7

From: Andrew M. <ak...@zi...> - 2001-07-31 11:24:36

Dipankar Sarma wrote:
> 
> An ideal solution would
> have been to increment the syscall counter in the cpu-local structure
> directly in the syscall entry code, but there doesn't seem to be a
> way of automatically generating structure sizes and field offsets
> to be used in assembly code.

Here's a little trick which does that - you turn the structure
offset into a globally visible label at the assembler level:

	.equ	OFFSET_OF_MY_FIELD,36

and in assembly code, let the linker resolve it for you:

	movl	OFFSET_OF_MY_FIELD(%ebx),%eax


An example:


--- linux-2.4.0-test6-pre2/kernel/ksyms.c	Sun Jul 30 02:28:18 2000
+++ linux-akpm/kernel/ksyms.c	Sat Aug  5 12:09:15 2000
@@ -540,3 +540,21 @@
 
 EXPORT_SYMBOL(tasklist_lock);
 EXPORT_SYMBOL(pidhash);
+
+
+#define OFFSETOF(struct_name, item) ((unsigned long)&(((struct struct_name *)0)->item))
+
+#define EXPORT_ASM_CONSTANT(asm_label, value)					\
+	__asm__ __volatile__("\t.equ " #asm_label ",%c0" :: "i" (value) );	\
+	__asm__ __volatile__("\t.globl " #asm_label)
+
+#define EXPORT_ASM_STRUCT_OFFSET(asm_label, struct_name, item)			\
+		EXPORT_ASM_CONSTANT(asm_label, OFFSETOF(struct_name, item))
+
+
+static void wrapper()
+{
+	EXPORT_ASM_STRUCT_OFFSET(TS_SIGPENDING, task_struct, sigpending);
+	EXPORT_ASM_STRUCT_OFFSET(TS_NEED_RESCHED, task_struct, need_resched);
+}
+
--- linux-2.4.0-test6-pre2/arch/i386/kernel/entry.S	Tue Jul 11 22:21:12 2000
+++ linux-akpm/arch/i386/kernel/entry.S	Sat Aug  5 12:11:20 2000
@@ -73,10 +73,8 @@
  */
 state		=  0
 flags		=  4
-sigpending	=  8
 addr_limit	= 12
 exec_domain	= 16
-need_resched	= 20
 tsk_ptrace	= 24
 processor	= 52
 
@@ -215,9 +213,9 @@
 	jne   handle_softirq
 	
 ret_with_reschedule:
-	cmpl $0,need_resched(%ebx)
+	cmpl $0,TS_NEED_RESCHED(%ebx)
 	jne reschedule
-	cmpl $0,sigpending(%ebx)
+	cmpl $0,TS_SIGPENDING(%ebx)
 	jne signal_return
 restore_all:
 	RESTORE_ALL

Re: [Lse-tech] wait_for_rcu implementation proposal for 2.4.7

From: Dipankar S. <dip...@se...> - 2001-07-31 11:48:50

Hi Andrew,

On Tue, Jul 31, 2001 at 09:30:02PM +1000, Andrew Morton wrote:
> Dipankar Sarma wrote:
> > 
> > An ideal solution would
> > have been to increment the syscall counter in the cpu-local structure
> > directly in the syscall entry code, but there doesn't seem to be a
> > way of automatically generating structure sizes and field offsets
> > to be used in assembly code.
> 
> Here's a little trick which does that - you turn the structure
> offset into a globally visible label at the assembler level:
> 
> 	.equ	OFFSET_OF_MY_FIELD,36
> 
> and in assembly code, let the linker resolve it for you:
> 
> 	movl	OFFSET_OF_MY_FIELD(%ebx),%eax
> 

Yes, this seems to be the only way out. I was just
planning to use C code to initialize global locations holding
the size and field offsets of the cpu-local structure, and to
use those from assembly.

Thanks for the hint.
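
For comparison, the global-location approach mentioned above might look
roughly like the following C sketch. The names (struct demo_cpu_area,
demo_area_size, demo_syscallctr_off, demo_syscallctr_addr) are
hypothetical, and the helper function only mirrors in C the address
arithmetic that the assembly would perform.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cpu-local structure. */
struct demo_cpu_area {
    unsigned long pad[3];
    unsigned long syscallctr;
};

/* Initialized data that assembly code could load by symbol name. */
const unsigned long demo_area_size = sizeof(struct demo_cpu_area);
const unsigned long demo_syscallctr_off =
    offsetof(struct demo_cpu_area, syscallctr);

/* C mirror of the assembly: base + cpu * size + field offset. */
unsigned long *demo_syscallctr_addr(void *base, int cpu)
{
    assert(cpu >= 0);
    return (unsigned long *)((char *)base
                             + (size_t)cpu * demo_area_size
                             + demo_syscallctr_off);
}
```

The cost of this variant is an extra memory load for each value on the
fast path, whereas a .equ constant is encoded directly in the
instruction.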

Dipankar
-- 
Dipankar Sarma  <dip...@se...> Project: http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.