From: Jake H. <jh...@an...> - 2004-10-23 20:03:23
Kristian Van Der Vliet wrote:

> I may still accept these atomic patches as it's probably a good idea to move
> away from direct access of atomic_t types as the kernel and drivers do now,
> and I see no harm in moving to a more Linux-centric way of doing things. It
> will make drivers easier to port at least. But sadly it does not fix the
> immediate problem of SMP crashes.
>
> Back to scratching our heads it seems.

What's the status on importing my patches for the atomic primitives? I have a
whole slew of new patches that I'm working on, and it would be a lot easier for
me to create diffs from CVS if the atomic patches were checked in.

Also, since the new primitives require updating a file in the GCC include
directory before compiling C++ apps, it's best if we make the change now and
let everyone on syllable-developer know what's going on, so it'll get plenty of
testing before 0.5.5. I don't plan to suggest any more changes that would break
source compatibility like this, but in this case I think the new primitives are
definitely the way to go from a performance standpoint.

Anyway, I'm still looking at the SMP code. Can you give me a better idea of how
things are failing when compiled with -O3? So far, the only real error message
I got was from the original poster, William Rose: a "Divide error" when opening
Terminal. What are the other symptoms?

Here's what I've been hacking on:

* Inlining a number of assembly routines from intel.s, such as cli()/sti(),
get/put_cpu_flags(), isa_read*(), isa_write*(), flush_tlb(),
save/load_fpu_state(), etc. I was particularly concerned with the first two
pairs, as they are used whenever interrupts are disabled, such as before
acquiring a spinlock.

* Change kmalloc() to warn whenever allocating 128K or larger. The current
memory allocator (from Linux 2.0.x) is very inefficient when a power-of-two
size is requested, as it has to round up to the next higher power of two due
to a 16-byte overhead subtracted from each block.
In other words, a 128K allocation would use up a 256K block, and so on. Even
worse, the entire block must consist of adjacent pages, so memory
fragmentation is a real issue. Ultimately I want to import the new slab
allocator introduced in Linux 2.2.x, which is more efficient but also much
more complex, and is still only intended for <128K allocations. In the
meantime, introducing the warning allowed me to discover and fix the code in
bcache.c to use create_area() instead of kmalloc() for its hash table, which
is an exact power of two. Another case for potentially large kmalloc() calls
is in copy_arg_list() when a _very_ long command line is passed. I'll have to
see how Linux handles these types of scenarios, as it doesn't allow kmalloc()
over 128K at all.

* Simplify array.c by removing the code that calculates the values nTabCount,
nAvgCount, and nMaxCount, which are never used anywhere and so can be safely
removed from the structure in inc/array.h.

* Implement lazy FP context switching. Instead of spending the time to save
and restore the FPU state on every context switch, a flag can be set to throw
an exception the next time the FPU is used. For threads that never touch the
FPU, nothing has to be saved, and when a thread accesses the FPU, the
exception handler saves the FPU state for the last thread that was using it,
then restores the FPU state for the current thread. If the current thread is
also the last thread that was using the FPU, nothing has to be saved and the
FPU is simply enabled. This goes hand-in-hand with my plan to extend
save_fpu_state() to use FXSAVE instead of FSAVE on the Pentium III and above
(including the Athlon), which saves the SSE/SSE2 context as well as the
FP/MMX context. This would allow us to set the bit that enables SSE/SSE2
support, so Pentium 3/4 and Athlon optimized vector instructions could be
used.

-Jake