From: Andrea A. <an...@su...> - 2001-07-31 08:19:51
|
I'm looking at the rcu patches, I checked those two:

	rcu_qsctr-2.4.6-02.patch
	rcu_sched-2.4.6-02.patch

The first one has a showstopper problem that it introduced a `call' in
the syscall entry point fast path. I don't think slowing down the
syscall entry point will pay off.

The second adds conditional branches to schedule() which is a fast path
too (even if certainly less important than the syscall entry point).

I'd prefer an approach with a zero performance impact on the fast paths
(and with a slower slow path of course, but as usual I believe improving
any slow path at the expense of any fast path is a very bad idea ;).

BTW: it is better if wait_for_rcu doesn't need to allocate any memory so
it can be used before being able to release memory without risking
running into deadlocks.

Both for simplicity of the implementation and to avoid any memory
allocation in the wait_for_rcu path but without adding any other thread,
I think we could use ksoftirqd for it (ok, abusing it a little bit, but
not too much), I mean it would be enough to do something like:

	for_each_cpu(cpu):
		if cpu == smp_processor_id():
			continue
		ksoftirqd_task(cpu)->need_resched = 1
		wake_ksoftirqd(cpu)
		while (ksoftirqd_task(cpu)->need_resched);

to implement wait_for_rcu in 2.4.7. No changes needed in any kernel
common code, so obviously no impact on any fast path. Comments? Would
you be interested to implement a new rcu patch for 2.4.7 with the above
logic and no changes in the common code?

Andrea

PS. a micro optimization in the implementation against the above is to
set all the need_resched before starting waiting on them.
|
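The loop Andrea proposes, together with the PS optimization (flag every CPU first, then wait), can be modelled in plain user-space C. This is only a single-threaded sketch under stated assumptions: `struct sim_task`, `sim_schedule()` and `sim_wait_for_rcu()` are hypothetical names, there is no real ksoftirqd or wakeup, and the "remote CPU" is simulated by calling `sim_schedule()` from inside the spin loop, where a real kernel would simply spin until the other CPU rescheduled on its own.

```c
#include <assert.h>

#define NR_CPUS 4   /* illustrative CPU count, not a kernel constant */

/* Hypothetical stand-in for one CPU's ksoftirqd task. */
struct sim_task {
	volatile int need_resched;   /* set by the waiter, cleared on reschedule */
	unsigned long switches;      /* reschedules this simulated CPU has done */
};

static struct sim_task ksoftirqd[NR_CPUS];

/* Models a CPU passing through schedule(), which clears the flag. */
static void sim_schedule(int cpu)
{
	ksoftirqd[cpu].switches++;
	ksoftirqd[cpu].need_resched = 0;
}

/* Returns the number of other CPUs that were waited for. */
static int sim_wait_for_rcu(int self)
{
	int cpu, waited = 0;

	assert(self >= 0 && self < NR_CPUS);

	/* Pass 1 (the PS optimization): flag every other CPU up front. */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu == self)
			continue;
		ksoftirqd[cpu].need_resched = 1;
		/* the kernel version would wake that CPU's ksoftirqd here */
	}

	/* Pass 2: wait until each flagged CPU has rescheduled. */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu == self)
			continue;
		while (ksoftirqd[cpu].need_resched)
			sim_schedule(cpu);   /* simulation only: the remote CPU does this itself */
		waited++;
	}
	return waited;
}
```

Calling `sim_wait_for_rcu(0)` on a freshly zeroed table waits for the three other simulated CPUs and leaves CPU 0 untouched. Setting all flags before waiting lets the remote reschedules overlap instead of being serialized one CPU at a time.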
From: Dipankar S. <dip...@se...> - 2001-07-31 10:48:37
|
Hi Andrea,

On Tue, Jul 31, 2001 at 10:20:19AM +0200, Andrea Arcangeli wrote:
> I'm looking at the rcu patches, I checked those two:
>
> 	rcu_qsctr-2.4.6-02.patch
> 	rcu_sched-2.4.6-02.patch
>
> The first one has a showstopper problem that it introduced a `call' in
> the syscall entry point fast path. I don't think slowing down the
> syscall entry point will pay off.

This is definitely not the optimal solution. An ideal solution would
have been to increment the syscall counter in the cpu-local structure
directly in the syscall entry code, but there doesn't seem to be a
way of automatically generating structure sizes and field offsets
to be used in assembly code. I can't hard-code those because my
cacheline aligned structures may have different sizes depending on
the CPU type. A workaround could however be to keep these as
initialized data, accessed as globals from the syscall entry point.

Interestingly, I measured the syscall entry code with both 'call' and
assembly code to increment rcu-plocal-syscallctr and it showed no
difference. It may also have been a telling indictment of my knowledge
of P-III micro-architecture and my inability to write good assembly
code for it :-) I would still use assembly here if I can figure out a
good way to do that. This is also a sign that we need a PDA that can be
quickly accessed.

>
> The second adds conditional branches to schedule() which is a fast path
> too (even if certainly less important than the syscall entry point).

Another thing Andi pointed out was the mod operation and that scheduler
folks might not be happy about it.

>
> I'd prefer an approach with a zero performance impact on the fast paths
> (and with a slower slow path of course, but as usual I believe improving
> any slow path at the expense of any fast path is a very bad idea ;).

This is certainly a noble goal. We will keep that in mind.

> BTW: it is better if wait_for_rcu doesn't need to allocate any memory so
> it can be used before being able to release memory without risking
> running into deadlocks.

Yes, I agree. This is a big issue with the current implementations, and
the recommendation is not to wait until the last minute before
destruction to allocate the rcu_head.

>
> Both for simplicity of the implementation and to avoid any memory
> allocation in the wait_for_rcu path but without adding any other thread,
> I think we could use ksoftirqd for it (ok, abusing it a little bit, but
> not too much), I mean it would be enough to do something like:
>
> 	for_each_cpu(cpu):
> 		if cpu == smp_processor_id():
> 			continue
> 		ksoftirqd_task(cpu)->need_resched = 1
> 		wake_ksoftirqd(cpu)
> 		while (ksoftirqd_task(cpu)->need_resched);
>
> to implement wait_for_rcu in 2.4.7. No changes needed in any kernel
> common code, so obviously no impact on any fast path. Comments? Would
> you be interested to implement a new rcu patch for 2.4.7 with the above
> logic and no changes in the common code?

I will try to write a 2.4.7 patch and do some measurements. Stay tuned :-)

Thanks
Dipankar
--
Dipankar Sarma <dip...@se...> Project: http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
|
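For readers unfamiliar with the counter scheme discussed in this message, here is a rough user-space sketch of a cacheline-aligned per-CPU syscall counter. It is not the code from rcu_qsctr-2.4.6-02.patch: every name (`rcu_plocal_sketch`, `note_syscall`, `snapshot_counters`, `all_cpus_advanced`) and the 64-byte line size are assumptions, and the snapshot comparison is only a guess at how such counters would presumably be consumed. It also shows why the structure size cannot be hard-coded in assembly: it follows the cache line size chosen for the CPU type.

```c
#include <assert.h>

#define NR_CPUS 4              /* illustrative CPU count */
#define CACHE_LINE_BYTES 64    /* assumption: differs between CPU types */

/* Hypothetical cpu-local structure, padded out to one cache line. */
struct rcu_plocal_sketch {
	unsigned long syscallctr;
} __attribute__((aligned(CACHE_LINE_BYTES)));

static struct rcu_plocal_sketch plocal[NR_CPUS];

/* What the syscall entry path would do, via `call' or inline assembly. */
static void note_syscall(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	plocal[cpu].syscallctr++;
}

/* Record every CPU's counter at the start of a waiting period. */
static void snapshot_counters(unsigned long snap[NR_CPUS])
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		snap[cpu] = plocal[cpu].syscallctr;
}

/* Nonzero once every CPU's counter has moved past its snapshot. */
static int all_cpus_advanced(const unsigned long snap[NR_CPUS])
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (plocal[cpu].syscallctr == snap[cpu])
			return 0;
	return 1;
}
```

With this layout `sizeof(struct rcu_plocal_sketch)` equals `CACHE_LINE_BYTES`, so each CPU's counter sits in its own cache line and the increment in the fast path does not bounce a shared line between CPUs.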
From: Christoph H. <hc...@ca...> - 2001-07-31 11:11:29
|
Hi Dipankar,

On Tue, Jul 31, 2001 at 04:23:30PM +0530, Dipankar Sarma wrote:
> > The first one has a showstopper problem that it introduced a `call' in
> > the syscall entry point fast path. I don't think slowing down the
> > syscall entry point will pay off.
>
> This is definitely not the optimal solution. An ideal solution would
> have been to increment the syscall counter in the cpu-local structure
> directly in the syscall entry code,

Agreed.

> but there doesn't seem to be a
> way of automatically generating structure sizes and field offsets
> to be used in assembly code.

Some ports actually do this kind of thing. A good example is
arch/sparc/kernel/check_asm.sh.

	Christoph

--
Of course it doesn't work. We've performed a software upgrade.
|
From: Andrew M. <ak...@zi...> - 2001-07-31 11:24:36
|
Dipankar Sarma wrote:
>
> An ideal solution would
> have been to increment the syscall counter in the cpu-local structure
> directly in the syscall entry code, but there doesn't seem to be a
> way of automatically generating structure sizes and field offsets
> to be used in assembly code.

Here's a little trick which does that - you turn the structure
offset into a globally visible label at the assembler level:

	.equ OFFSET_OF_MY_FIELD,36

and in assembly code, let the linker resolve it for you:

	movl OFFSET_OF_MY_FIELD(%ebx),%eax

An example:

--- linux-2.4.0-test6-pre2/kernel/ksyms.c Sun Jul 30 02:28:18 2000
+++ linux-akpm/kernel/ksyms.c Sat Aug 5 12:09:15 2000
@@ -540,3 +540,21 @@
 EXPORT_SYMBOL(tasklist_lock);
 EXPORT_SYMBOL(pidhash);
 
+
+
+#define OFFSETOF(struct_name, item) ((unsigned long)&(((struct struct_name *)0)->item))
+
+#define EXPORT_ASM_CONSTANT(asm_label, value) \
+	__asm__ __volatile__("\t.equ " #asm_label ",%c0" :: "i" (value) ); \
+	__asm__ __volatile__("\t.globl " #asm_label)
+
+#define EXPORT_ASM_STRUCT_OFFSET(asm_label, struct_name, item) \
+	EXPORT_ASM_CONSTANT(asm_label, OFFSETOF(struct_name, item))
+
+
+static void wrapper()
+{
+	EXPORT_ASM_STRUCT_OFFSET(TS_SIGPENDING, task_struct, sigpending);
+	EXPORT_ASM_STRUCT_OFFSET(TS_NEED_RESCHED, task_struct, need_resched);
+}
+
--- linux-2.4.0-test6-pre2/arch/i386/kernel/entry.S Tue Jul 11 22:21:12 2000
+++ linux-akpm/arch/i386/kernel/entry.S Sat Aug 5 12:11:20 2000
@@ -73,10 +73,8 @@
  */
 state = 0
 flags = 4
-sigpending = 8
 addr_limit = 12
 exec_domain = 16
-need_resched = 20
 tsk_ptrace = 24
 processor = 52
 
@@ -215,9 +213,9 @@
 	jne handle_softirq
 
 ret_with_reschedule:
-	cmpl $0,need_resched(%ebx)
+	cmpl $0,TS_NEED_RESCHED(%ebx)
 	jne reschedule
-	cmpl $0,sigpending(%ebx)
+	cmpl $0,TS_SIGPENDING(%ebx)
 	jne signal_return
 restore_all:
 	RESTORE_ALL
|
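A different route to the same constants, shown here only as a hedged sketch, is to have a small C helper compute the offsets with the standard `offsetof` macro and write the `.equ` lines out as text that the assembly source can then include. This is not what the patch above does (it emits the `.equ` from inline asm), and `struct demo_task`, `format_equ()` and `emit_demo_offsets()` are hypothetical names; the structure is deliberately not the kernel's `task_struct`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical structure for illustration; not the kernel's task_struct. */
struct demo_task {
	long state;
	unsigned long flags;
	int sigpending;
	int need_resched;
};

/* Formats ".equ LABEL,value\n" into buf; returns snprintf's result. */
static int format_equ(char *buf, size_t len, const char *label, size_t value)
{
	return snprintf(buf, len, ".equ %s,%lu\n", label, (unsigned long)value);
}

/* Writes one .equ line per exported field to `out`. */
static void emit_demo_offsets(FILE *out)
{
	char line[64];

	format_equ(line, sizeof line, "DT_SIGPENDING",
		   offsetof(struct demo_task, sigpending));
	fputs(line, out);
	format_equ(line, sizeof line, "DT_NEED_RESCHED",
		   offsetof(struct demo_task, need_resched));
	fputs(line, out);
}
```

Calling `emit_demo_offsets(stdout)` from a build-time helper prints the two `.equ` lines with whatever offsets the compiler actually chose, so the assembly side never carries hand-maintained numbers like the `sigpending = 8` lines removed in the diff.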
From: Dipankar S. <dip...@se...> - 2001-07-31 11:48:50
|
Hi Andrew,

On Tue, Jul 31, 2001 at 09:30:02PM +1000, Andrew Morton wrote:
> Dipankar Sarma wrote:
> >
> > An ideal solution would
> > have been to increment the syscall counter in the cpu-local structure
> > directly in the syscall entry code, but there doesn't seem to be a
> > way of automatically generating structure sizes and field offsets
> > to be used in assembly code.
>
> Here's a little trick which does that - you turn the structure
> offset into a globally visible label at the assembler level:
>
> 	.equ OFFSET_OF_MY_FIELD,36
>
> and in assembly code, let the linker resolve it for you:
>
> 	movl OFFSET_OF_MY_FIELD(%ebx),%eax
>

Yes, this seems to be the only way out. I was just planning to use
C code to initialize the global locations to store the size/field offset
of cpu-local structure/fields and use them in assembly.

Thanks for the hint.

Dipankar
--
Dipankar Sarma <dip...@se...> Project: http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
|
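The alternative Dipankar mentions here, globals initialized from C that the assembly reads at run time, might look roughly like the sketch below. All names (`cpu_local_sketch`, `cpu_local_size`, `cpu_local_syscallctr_off`, `bump_syscallctr`) and the 64-byte alignment are assumptions made for illustration, and `bump_syscallctr()` is only a C rendering of the address arithmetic the entry code would have to perform.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4              /* illustrative CPU count */
#define CACHE_LINE_BYTES 64    /* assumption: depends on the CPU type */

/* Hypothetical cpu-local structure. */
struct cpu_local_sketch {
	unsigned long other_field;   /* placeholder for unrelated per-CPU data */
	unsigned long syscallctr;
} __attribute__((aligned(CACHE_LINE_BYTES)));

static struct cpu_local_sketch cpu_local[NR_CPUS];

/* Initialized data the entry code would load instead of hard-coded numbers. */
const unsigned long cpu_local_size = sizeof(struct cpu_local_sketch);
const unsigned long cpu_local_syscallctr_off =
	offsetof(struct cpu_local_sketch, syscallctr);

/* base + cpu * size + offset, as the assembly would compute it. */
static void bump_syscallctr(int cpu)
{
	unsigned char *base = (unsigned char *)cpu_local;
	unsigned long *ctr;

	assert(cpu >= 0 && cpu < NR_CPUS);
	ctr = (unsigned long *)(base + (unsigned long)cpu * cpu_local_size
				+ cpu_local_syscallctr_off);
	(*ctr)++;
}
```

The cost of this variant is two extra memory loads and a multiply on every entry, which is why the `.equ` approach, where the offset is a constant known when the code is assembled and linked, is the more attractive one for a fast path.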