|
From: Jake H. <jh...@an...> - 2004-10-16 20:25:38
Attachments:
atomic_fixes.diff.bz2
|
As discussed recently on syllable-developer (in the "LiveCD 0.5.4 Aterm problems" thread), the atomic primitives in Syllable badly needed updating for SMP safety. Based on a suggestion by Daniel Gryniewicz, I replaced the primitives in atheos/atomic.h with versions based on Linux 2.6.8's include/asm-i386/atomic.h. Many locations in the kernel needed to be patched, but the patches were generally minor. Also, I replaced occurrences of atomic_add(&foo, 1) and atomic_add(&foo, -1) with atomic_inc(&foo) and atomic_dec(&foo), which are slightly faster and more self-explanatory. I also fixed a few typos and changed kernel/debug.c to use ints instead of floats. Amazingly, the new kernel appears to run okay on my VMware setup, although I don't have an SMP installation to test on, and so this should be considered highly experimental. I still need to patch the drivers and there may be other code that is broken by this change. Fortunately, the size of atomic_t stayed the same so new and old code should interoperate. If you have any trouble with the bz2'd attachment, I can make the patches available for download. Enjoy! -- Jake Hamby |
|
From: Daniel G. <da...@fp...> - 2004-10-16 22:39:56
|
On Sat, 2004-10-16 at 13:25 -0700, Jake Hamby wrote:
> As discussed recently on syllable-developer (in the "LiveCD 0.5.4 Aterm
> problems" thread), the atomic primitives in Syllable badly needed
> updating for SMP safety. Based on a suggestion by Daniel Gryniewicz, I
> replaced the primitives in atheos/atomic.h with versions based on Linux
> 2.6.8's include/asm-i386/atomic.h. Many locations in the kernel needed
> to be patched, but the patches were generally minor. Also, I replaced
> occurrences of atomic_add(&foo, 1) and atomic_add(&foo, -1) with
> atomic_inc(&foo) and atomic_dec(&foo), which are slightly faster and
> more self-explanatory.
>
> I also fixed a few typos and changed kernel/debug.c to use ints instead
> of floats. Amazingly, the new kernel appears to run okay on my VMware
> setup, although I don't have an SMP installation to test on, and so this
> should be considered highly experimental. I still need to patch the
> drivers and there may be other code that is broken by this change.
> Fortunately, the size of atomic_t stayed the same so new and old code
> should interoperate. If you have any trouble with the bz2'd attachment,
> I can make the patches available for download.
>
> Enjoy!
> --
> Jake Hamby
Great work. I especially like the atomic_add(..., -1) -> atomic_dec().
Some of those old uses were quite scary... I'm not surprised we had SMP
issues.
I didn't notice anything wrong with the patch. You might want to
consider a few new local variables, where consecutive atomic_read()'s
occure on the same variable. For example, this hunk:
@@ -242,14 +243,14 @@
for ( ; psTmp <= psPage; psTmp++ )
{
- if ( 0 != psTmp->p_nCount )
+ if ( 0 != atomic_read( &psTmp->p_nCount ) )
{
- printk( "Page %d present on free list
with count of %d\n", psTmp->p_nPageNum, psTmp->p_nCount );
+ printk( "Page %d present on free list
with count of %d\n", psTmp->p_nPageNum,
atomic_read( &psTmp->p_nCount ) );
}
- atomic_add( &psTmp->p_nCount, 1 );
+ atomic_inc( &psTmp->p_nCount );
psTmp->p_psNext = NULL;
g_nAllocatedPages++;
- g_sSysBase.ex_nFreePageCount--;
+ atomic_dec( &g_sSysBase.ex_nFreePageCount );
}
break;
}
It seems quite possible that p_nCount gets changed between the reads.
It would only cause printing problems, but could make debugging harder,
and should be easy to fix.
Reading this patch made me think I should be re-visiting the atomic
usage in the irq code... <sigh>
Great work, thank you much.
Daniel
|
|
From: Jake H. <jh...@an...> - 2004-10-17 00:47:09
Attachments:
atomic_driver_fixes.diff.bz2
|
Daniel Gryniewicz wrote: > > Great work. I especially like the atomic_add(..., -1) -> atomic_dec(). > Some of those old uses were quite scary... I'm not surprised we had SMP > issues. > > I didn't notice anything wrong with the patch. You might want to > consider a few new local variables, where consecutive atomic_read()'s > occure on the same variable. For example, this hunk: > > It seems quite possible that p_nCount gets changed between the reads. > It would only cause printing problems, but could make debugging harder, > and should be easy to fix. Yeah, I noticed that too, but didn't want to change too much at once. Anyway, here are my patches to the kernel/drivers directory for the new primitives (install the new headers from my last patch with "make dist" before compiling). One nice thing was that several of the drivers were Linux ports that used the atomic_*() macros properly, so all I had to do for them was comment out the compatibility macros in "linuxcomp.h". I also fixed all the compiler warnings except for a single "implicit declaration of function 'ffs'" warning in sound/emu10k1/src/ac97_codec.c since I didn't know what header file to include and I'm not even sure whether ffs() can be called by drivers since it doesn't seem to be exported by the kernel. Actually, I'm a bit suspicious of the emu10k1 driver since I had to add atomic versions of test_bit() and set_bit() from Linux to emu_wrapper.h. The current version was using atomic_and() in efxmgr.c when it should have been using test_bit(). Unfortunately, I don't have access to an EMU10K1 based sound card so I have no way of testing my changes. -- Jake |
|
From: Kristian V. D. V. <va...@li...> - 2004-10-17 13:46:55
|
On Saturday 16 October 2004 9:25 pm, Jake Hamby wrote: > As discussed recently on syllable-developer (in the "LiveCD 0.5.4 Aterm > problems" thread), the atomic primitives in Syllable badly needed > updating for SMP safety. > > <snip> > > Amazingly, the new kernel appears to run okay on my VMware > setup, although I don't have an SMP installation to test on, and so this > should be considered highly experimental. I'm afraid we seem to be on the wrong track; the patches do not fix the instability on SMP systems. The symptoms are identical to the current 0.5.4 kernel when compiled with Gcc 3.3 and -O2 optimisations E.g. random panics during boot. So far the only thing which has made any difference is to compile with -O1 instead of -O2, so it must be an over-agresive optimisation going on somewhere. It just doesn't appear to be in the atomic code. I may still accept these atomic patches as it's probably a good idea to move away from direct access of atomic_t types as the kernel and drivers do now, and I see no harm in moving to a more Linux-centric way of doing things. It will make drivers easier to port at least. But sadly it does not fix the immediate problem of SMP crashes. Back to scratching our heads it seems. -- Vanders http://syllable.sourceforge.net/ http://www.liqwyd.com |
|
From: Jake H. <jh...@an...> - 2004-10-17 17:21:45
|
Kristian Van Der Vliet wrote: > I'm afraid we seem to be on the wrong track; the patches do not fix the > instability on SMP systems. The symptoms are identical to the current 0.5.4 > kernel when compiled with Gcc 3.3 and -O2 optimisations E.g. random panics > during boot. That's too bad. Anyway, here's my final set of atomic patches for the appserver and filesystem code. The FS patch also fixes a few compiler warnings having to do with missing newlines at the end of some files and a commented-out section with spaces following the "\" token, and adds the "-fno-strict-aliasing" flag to get rid of some warnings. The appserver patch fixes <util/locker.h> and <gui/font.h> which must be installed to "/ainc" before compiling. It also inlines the ddriver_mmx.c code into ddriver.cpp (the only place where it's used), which adds a couple of new "use of memory input without lvalue in asm operand x is deprecated" warnings which appear to be harmless as they also occur when compiling the MPlayer code from which ddriver_mmx.c was taken. Finally, the atomic_gthr_default_h.diff patch must be applied in /usr/gcc/include/c++/3.3.4/i586-pc-syllable/bits before compiling the appserver. In CVS it should be applied to the "gthr-syllable.h" files in system/apps/utils/Builder/packages/gcc-*/patches/gcc. I notice that I missed an opportunity to change "atomic_add( &mutex->count, -1 );" to "atomic_dec( &mutex->count );" in one location so that could also be done for a tiny performance improvement. Like I said in my post to syllable-developer, it would be nice if we had an UPDATING file in CVS to alert users of changes like this where manual intervention is needed. Anyway, I'm sorry to hear that these changes didn't fix your problem, but I'm not too surprised. They do seem to be a necessary precondition for figuring out what the real problem is. If I see anything else suspicious, I'll be sure to let you know. -- Jake |
|
From: Jake H. <jh...@an...> - 2004-10-17 18:33:25
|
Kristian Van Der Vliet wrote: > So far the only thing which has made any difference is to compile with -O1 > instead of -O2, so it must be an over-agresive optimisation going on > somewhere. It just doesn't appear to be in the atomic code. Can you figure out which file(s) need to be compiled with -O1 for the kernel to work? You should be able to do a binary search: compile half of the files with -O1 and keep dividing by two until you've isolated the files which are responsible. Then post your results and I'll look at the assembly output to see if anything pops out. Hopefully the problem is isolated to one or two specific source files. -- Jake |
|
From: Jake H. <jh...@an...> - 2004-10-23 20:03:23
|
Kristian Van Der Vliet wrote: > > I may still accept these atomic patches as it's probably a good idea to move > away from direct access of atomic_t types as the kernel and drivers do now, > and I see no harm in moving to a more Linux-centric way of doing things. It > will make drivers easier to port at least. But sadly it does not fix the > immediate problem of SMP crashes. > > Back to scratching our heads it seems. > What's the status on importing my patches for the atomic primitives? I have a whole slew of new patches that I'm working on and it would be a lot easier for me to create diffs from CVS if the atomic patches were checked in. Also, since the new primitives require updating a file in the GCC include directory before compiling C++ apps, it's best if we make the change now and let everyone on syllable-developer know what's going on, so it'll get plenty of testing before 0.5.5. I don't plan to suggest any more changes that would break source compatibility like this, but in this case, I think the new primitives are definitely the way to go from a performance standpoint. Anyway, I'm still looking at the SMP code. Can you give me a better idea of how things are failing when compiled with -O3? So far, the only real error message I got was from the original poster, William Rose, a "Divide error" when opening Terminal. What are the other symptoms? Here's what I've been hacking on: * Inlining a number of assembly routines from intel.s, such as cli()/sti(), get/put_cpu_flags(), isa_read*(), isa_write*(), flush_tlb(), save/load_fpu_state(), etc. I was particularly concerned with the first two pairs, as they are used whenever interrupts are disabled, such as before acquiring a spinlock. * Change kmalloc() to warn whenever allocating 128K or larger. The current memory allocator (from Linux 2.0.x) is very inefficient when a power-of-two size is requested, as it has to round up to the next higher power-of-two due to a 16-byte overhead subtracted from each block. In other words, a 128K allocation would use up a 256K block, and so on. Even worse, the entire block must consistent of adjacent pages, so memory fragmentation is a real issue. Ultimately I want to import the new slab allocator introduced in Linux 2.2.x, which is more efficient but also much more complex and is still only intended for <128K allocations. In the meantime, introducing the warning allowed me to discover and fix the code in bcache.c to use create_area() instead of kmalloc() for its hash table, which is an exact power-of-two. Another case for potentially large kmalloc()'s is in copy_arg_list() when a _very_ long command line is passed. I'll have to see how Linux handles these types of scenarios as it doesn't allow kmalloc() over 128K at all. * Simplify array.c by removing code to calculate the values nTabCount, nAvgCount, and nMaxCount, which are never used anywhere and so can be safely removed from the structure in inc/array.h. * Implement lazy FP context switching. Instead of spending the time to save and restore the FPU state on every context switch, a flag can be set to throw an exception the next time the FPU is used. For threads that never touch the FPU, nothing has to be saved, and when a thread accesses the FPU, the exception handler saves the FPU state for the last thread that was using it, then restores the FPU state for the current thread. If the current thread is also the last thread that was using the FPU, then nothing has to be saved and the FPU is simply enabled. This goes hand-in-hand with my plan to extend save_fpu_state() to use FXSAVE instead of FSAVE on Pentium III and above (including Athlon), which saves the SSE/SSE2 context as well as the FP/MMX context. This would allow us to set the bit to enable SSE/SSE2 support so Pentium3/4 and Athlon optimized vector instructions could be used. -Jake |
|
From: Kristian V. D. V. <va...@li...> - 2004-10-24 11:13:34
|
On Saturday 23 October 2004 9:03 pm, Jake Hamby wrote: > Kristian Van Der Vliet wrote: > > I may still accept these atomic patches as it's probably a good idea to > > move away from direct access of atomic_t types as the kernel and drivers > > do now, and I see no harm in moving to a more Linux-centric way of doing > > things. It will make drivers easier to port at least. But sadly it does > > not fix the immediate problem of SMP crashes. > > > > Back to scratching our heads it seems. > > What's the status on importing my patches for the atomic primitives? I'll get on with checking them in. Hopefully they'll be in CVS today; I'll let you know as soon as I'm done. > Anyway, I'm still looking at the SMP code. Can you give me a better > idea of how things are failing when compiled with -O3? So far, the only > real error message I got was from the original poster, William Rose, a > "Divide error" when opening Terminal. What are the other symptoms? I'll capture a few kernel logs from both an -O3 and -O2 kernel; the symptoms appear to be identical. > * Inlining a number of assembly routines from intel.s, such as > cli()/sti(), get/put_cpu_flags(), isa_read*(), isa_write*(), > flush_tlb(), save/load_fpu_state(), etc. Sounds sensible. > * Implement lazy FP context switching. Instead of spending the time to > save and restore the FPU state on every context switch, a flag can be > set to throw an exception the next time the FPU is used. For threads > that never touch the FPU, nothing has to be saved, and when a thread > accesses the FPU, the exception handler saves the FPU state for the last > thread that was using it, then restores the FPU state for the current > thread. If the current thread is also the last thread that was using > the FPU, then nothing has to be saved and the FPU is simply enabled. Sounds like a great idea to me. Anything that can save a few ticks on a context switch it always a good idea :) > This goes hand-in-hand with my plan to extend save_fpu_state() to use > FXSAVE instead of FSAVE on Pentium III and above (including Athlon), > which saves the SSE/SSE2 context as well as the FP/MMX context. This > would allow us to set the bit to enable SSE/SSE2 support so Pentium3/4 > and Athlon optimized vector instructions could be used. Would this allow MMX/SSE/SSE2 instructions in the kernel? Someone was asking about this a little while back, although to be honest I forget why.. -- Vanders http://syllable.sourceforge.net/ http://www.liqwyd.com |
|
From: Kristian V. D. V. <va...@li...> - 2004-10-24 13:41:04
Attachments:
kernel-smp-logs.tar.gz
|
On Monday 25 October 2004 2:12 am, Kristian Van Der Vliet wrote: > On Saturday 23 October 2004 9:03 pm, Jake Hamby wrote: >> Anyway, I'm still looking at the SMP code. Can you give me a better >> idea of how things are failing when compiled with -O3? So far, the only >> real error message I got was from the original poster, William Rose, a >> "Divide error" when opening Terminal. What are the other symptoms? > > I'll capture a few kernel logs from both an -O3 and -O2 kernel; the > symptoms appear to be identical. Logs attached. I did six runs; three with a kernel compiled with -O3 and three with a kernel compiled with -O2 -mprefered-stack-boundry=2 Let me know if you need anything else. -- Vanders http://syllable.sourceforge.net/ http://www.liqwyd.com |
|
From: Jake H. <jh...@an...> - 2004-10-24 17:47:44
|
Kristian Van Der Vliet wrote: > On Saturday 23 October 2004 9:03 pm, Jake Hamby wrote: >> >>What's the status on importing my patches for the atomic primitives? > > I'll get on with checking them in. Hopefully they'll be in CVS today; I'll > let you know as soon as I'm done. Thanks! I assume you checked in the driver and fs patches as well. > I'll capture a few kernel logs from both an -O3 and -O2 kernel; the symptoms > appear to be identical. Got the logs. I'll keep poking around and let you know if I find anything. >>* Implement lazy FP context switching. Instead of spending the time to > > Sounds like a great idea to me. Anything that can save a few ticks on a > context switch it always a good idea :) Definitely! I'm playing around with the lmbench-3.0-a4 microbenchmarks right now, trying to figure out why Syllable is so freaking slow under VMware!! I tried running the LiveCD on my desktop machine and Syllable is an order of magnitude faster on a real machine than under VMware. Unfortunately, I couldn't do a native install because my hard drives are too big and I get a "Drive uses 48bit addressing" error. It looks like I can enable support for that with a kernel flag but I want to look at the code a little and build a more recent kernel for the install CD before I try that. My laptop hard drive is 60GB, which may be small enough to do a native install there for comparison benchmarks. However, I really enjoy being able to hack on Syllable in VMware while browsing the code in Visual Studio (even though I can't build it, VS has great code browsing and search capabilities), listening to music, and watching the debug output in a terminal window. On a laptop, especially, having good OS support for things like power management and 802.11g is very important as well. Speaking of 802.11g, I spent many hours a few months back adapting the latest version of the madwifi Linux driver for Atheros chipsets to FreeBSD. Once I've finished my current crop of projects, I definitely intend to try hacking the 802.11 and Atheros driver code into Syllable. I'm also interested in tackling ACPI, at least well enough to enable Hyperthreading so I have at least a simulation of SMP (though not in VMware of course) to test on. >>This goes hand-in-hand with my plan to extend save_fpu_state() to use >>FXSAVE instead of FSAVE on Pentium III and above (including Athlon), >>which saves the SSE/SSE2 context as well as the FP/MMX context. This >>would allow us to set the bit to enable SSE/SSE2 support so Pentium3/4 >>and Athlon optimized vector instructions could be used. > > Would this allow MMX/SSE/SSE2 instructions in the kernel? Someone was asking > about this a little while back, although to be honest I forget why.. Here's the deal. MMX uses the same registers as the FPU so there is no additional OS support needed. A side effect is that you can't use MMX and FP at the same time but have to switch between them. All this is handled by GCC if you use the built-in vector instructions. SSE/SSE2, on the other hand, uses an additional set of 8 128-bit registers plus a few additional control and status words, so there is additional OS support needed and so these instructions are disabled until the OS signals to the CPU that it can handle them. The changes are not too complicated: basically you have to set aside a 512-byte area per thread to save the FP state with FXSAVE instead of the 128-byte area needed for FSAVE. There's an additional exception, vector 19 (#XF) used to signal SIMD floating-point exceptions, much as vector 16 (#MF) is used for regular floating-point exceptions. The good news is that you can intermix SSE/SSE2 and FP or MMX instructions in the regular registers. For some very basic information on using the vector instructions from GCC without having to write inline assembly code, see: http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html GCC also supports the <xmmintrin.h> header file for compatibility with MSFT and Intel C++ compilers, which is good because it's a lot easier to use. See here for a simple tutorial with links to other sites: http://www.codeproject.com/cpp/sseintro.asp In addition, I believe that compiling with "-march=pentium3", "-march=pentium4", or "-march=athlon-xp" will potentially use the MMX or SSE instructions automatically when beneficial, although I have not seen this first-hand. It will definitely use SSE for floating-point if you add "-mfpmath=sse". I've attached some very short test programs, one using MMX and the other using SSE, that you can use to test for example Linux vs. Syllable. Right now Syllable runs the MMX one but kills the SSE one with an **Illegal instruction**, which makes sense as SSE is not enabled yet. -- Jake |
|
From: Kristian V. D. V. <va...@li...> - 2004-10-24 16:11:45
|
On Saturday 23 October 2004 9:03 pm, Jake Hamby wrote: > What's the status on importing my patches for the atomic primitives? All patches have now been checked in to CVS. They may take a little while to propogate to the anonymous CVS though. -- Vanders http://syllable.sourceforge.net/ http://www.liqwyd.com |
|
From: Kristian V. D. V. <va...@li...> - 2004-10-17 22:02:31
Attachments:
smp.zip
|
On Sunday 17 October 2004 7:33 pm, Jake Hamby wrote: > Kristian Van Der Vliet wrote: >> So far the only thing which has made any difference is to compile with >> -O1 instead of -O2, so it must be an over-agresive optimisation going on >> somewhere. It just doesn't appear to be in the atomic code. > > Can you figure out which file(s) need to be compiled with -O1 for the > kernel to work? You should be able to do a binary search: compile half > of the files with -O1 and keep dividing by two until you've isolated the > files which are responsible. Then post your results and I'll look at > the assembly output to see if anything pops out. Hopefully the problem > is isolated to one or two specific source files. Attached is the result of tonights work. I've identified smp.c as the culprit (Makes sense, at least) I was working with the stock 0.5.4 source which did not include your previous patches for -O2, so I've used -O3 and -O1, although I will confirm tommorrow with -O1 and -O2 just to be 100% sure. The zip contains the following files: smp.o.O1 - smp.c compiled with the following flags: -D__KERNEL__ -D__BUILD_KERNEL__ -no-fPIC -c -g -O1 -fno-strict-aliasing -Wall -march=i586 -mcpu=i586 smp.o.O3 - smp.c compiled with the following flags: -D__KERNEL__ -D__BUILD_KERNEL__ -no-fPIC -c -g -O3 -fno-strict-aliasing -Wall -march=i586 -mcpu=i586 smp.s.O1 - smp.c compiled with the following flags: -D__KERNEL__ -D__BUILD_KERNEL__ -no-fPIC -s -g -O1 -fno-strict-aliasing -Wall -march=i586 -mcpu=i586 smp.s.O3 - smp.c compiled with the following flags: -D__KERNEL__ -D__BUILD_KERNEL__ -no-fPIC -s -g -O3 -fno-strict-aliasing -Wall -march=i586 -mcpu=i586 smp.O1.dump - Output from objdump -ds smp.o.O1 smp.O3.dump - Output from objsump -ds smp.o.O3 If you need anything else please give me a shout and I'll get see what I can do. -- Vanders http://syllable.sourceforge.net/ http://www.liqwyd.com |
|
From: Daniel G. <da...@fp...> - 2004-10-18 17:15:36
|
On Mon, 2004-10-18 at 05:03 +0100, Kristian Van Der Vliet wrote: > Attached is the result of tonights work. I've identified smp.c as the culprit > (Makes sense, at least) I was working with the stock 0.5.4 source which did > not include your previous patches for -O2, so I've used -O3 and -O1, although > I will confirm tommorrow with -O1 and -O2 just to be 100% sure. I'm not an asm guru, and the results of those two compiles are extremely different. Hardly surprising. :) I did read over the source in smp.c, and I didn't see anything that jumped out at me. One obvious (temprorary) solution is to just build this file with -O1 rather than -O2 or -O3. That would get smp working, and I don't think the optimizations will make any difference in this case. FWIW, the Linux kernel is built with -O2, and says not to use anything higher. Daniel |
|
From: Jake H. <jh...@an...> - 2004-10-19 01:53:51
Attachments:
kernel_cpu_id.diff
|
Kristian Van Der Vliet wrote: > > Attached is the result of tonights work. I've identified smp.c as the culprit > (Makes sense, at least) I was working with the stock 0.5.4 source which did > not include your previous patches for -O2, so I've used -O3 and -O1, although > I will confirm tommorrow with -O1 and -O2 just to be 100% sure. Well, there is at least one small bug which could break a -O3 build: the read_cpu_id() function is modifying the %ebx register but doesn't tell GCC that it is. I've attached a patch. That still doesn't explain why a -O2 build is also failing. I'll keep looking at it... -- Jake |
|
From: Kristian V. D. V. <va...@li...> - 2004-10-20 16:29:10
|
On Tuesday 19 October 2004 2:53 am, Jake Hamby wrote: > Kristian Van Der Vliet wrote: >> Attached is the result of tonights work. I've identified smp.c as the >> culprit (Makes sense, at least) I was working with the stock 0.5.4 >> source which did not include your previous patches for -O2, so I've used >> -O3 and -O1, although I will confirm tommorrow with -O1 and -O2 just to >> be 100% sure. > > Well, there is at least one small bug which could break a -O3 build: > the read_cpu_id() function is modifying the %ebx register but doesn't > tell GCC that it is. I've attached a patch. That still doesn't explain > why a -O2 build is also failing. I'll keep looking at it... Even though I don't pretend to fully understand it, the patch looks sensible enough and has been applied. It does not fix an -O3 kernel but it doesn't break anything either. -- Vanders http://syllable.sourceforge.net/ http://www.liqwyd.com |