From: Jake H. <jh...@po...> - 2005-02-01 03:12:15
Daniel Gryniewicz wrote:
>
> I can easily sync up and give you a new sched02 patch. I'll do that.
> Note that, after updating CVS today, I'm having tons on warnings on my
> linux box. These will likely show up with Vanders' experimental GCC, so
> we should probably fix those too. (The <atheos/typedefs.h> ->
> "inc/typedefs.h" move had me fooled for a minute...)

Cool. I'll test Vanders' new GCC and see how stable it is for kernel
builds.

I've been reading _The Unabridged Pentium 4 IA32 Processor Genealogy_ at
Safari.oreilly.com and I found some good tips on getting the best
performance with HT. Are you keeping in mind the differences between
logical and physical CPUs in your scheduling logic?

Here are the techniques that Intel recommends to optimize performance
for HT-enabled processors.

* The OS scheduler should schedule threads to be executed on logical
  processors within different physical processors before scheduling
  threads to be executed on both of the logical processors within the
  same physical processor.

* Eliminate spin-wait loops wherever possible. I need to add a PAUSE
  instruction to our spinlock() function to keep the loop from spinning
  too fast on P4 systems and wasting power (on laptops) and cycles that
  could be used by the other logical CPU. On older CPUs, PAUSE is a NOP.

* The OS scheduler should attempt to balance the load on each logical
  processor.

* Attempt to share code and data between threads executing on each
  logical CPU within a physical processor (the L1 data cache and L2/L3
  caches are shared). Two threads in the same process running the same
  code and accessing the same data set will run faster when executing on
  the same physical processor. This is related to the optimization of
  giving threads an affinity to prefer running on the same CPU as they
  last executed on, only in this case the logical CPU doesn't matter and
  the affinity should be tied to the physical CPU.
* Eliminate or decrease the amount of code and data sharing between
  threads executing on different physical processors. The ideal
  situation for things like semaphores is that they should be in
  separate cache lines from each other and from the data they're
  protecting. On P6 processors, cache line size is 32 bytes. On P4
  processors, the L2 and L3 cache line size is 128 bytes and the L1 data
  cache line is 64 bytes. I know the Linux kernel has some macros to pad
  data structures based on the cache line size of the CPU(s) that you
  have compiled it for, but we haven't really optimized that aspect of
  Syllable yet.

After I get SMP working on my P4 system, I'll update the startup code to
store logical CPU information in the ProcessorInfo_s so you'll be able
to distinguish between logical and physical processors. In the meantime
I wanted to bring up the topic so that you can plan ahead.

-Jake
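
To make the first scheduling rule concrete, here is a rough sketch of how a scheduler might pick a logical CPU once it can tell logical and physical processors apart. Everything in it is hypothetical: struct demo_cpu, its physical_id and busy fields, and demo_pick_cpu() are made up for illustration and are not the real ProcessorInfo_s layout or anything from the sched02 patch.

```c
#include <stddef.h>

/* Hypothetical per-logical-CPU record; NOT the real ProcessorInfo_s. */
struct demo_cpu {
    int physical_id;   /* physical package this logical CPU belongs to */
    int busy;          /* nonzero while a thread is running on it */
};

/*
 * Return the index of the logical CPU a new thread should run on:
 *   1. an idle logical CPU whose physical package is completely idle,
 *   2. otherwise any idle logical CPU (its HT sibling is busy),
 *   3. otherwise -1 (every logical CPU is busy).
 */
static int demo_pick_cpu(const struct demo_cpu *cpus, size_t n)
{
    int fallback = -1;

    for (size_t i = 0; i < n; i++) {
        int sibling_busy = 0;

        if (cpus[i].busy)
            continue;

        /* Is another logical CPU in the same package already busy? */
        for (size_t j = 0; j < n; j++) {
            if (j != i && cpus[j].busy &&
                cpus[j].physical_id == cpus[i].physical_id)
                sibling_busy = 1;
        }

        if (!sibling_busy)
            return (int)i;       /* whole package idle: best choice */
        if (fallback < 0)
            fallback = (int)i;   /* remember the first idle sibling */
    }
    return fallback;
}
```

The same physical_id comparison could also drive affinity: prefer any logical CPU in the package the thread last ran on, rather than one specific logical CPU.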
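
For the PAUSE change, this is roughly what I have in mind. It is only a sketch written with C11 atomics so it can be tried in user space; demo_spinlock_t, demo_cpu_relax() and the demo_spin_*() functions are hypothetical names, not the existing Syllable spinlock() code.

```c
#include <stdatomic.h>

/* Hypothetical lock type for illustration; not Syllable's own. */
typedef struct {
    atomic_int locked;   /* 0 = free, 1 = held */
} demo_spinlock_t;

/* Hint to the CPU that we are in a spin-wait loop. */
static inline void demo_cpu_relax(void)
{
#if defined(__i386__) || defined(__x86_64__)
    /* PAUSE is encoded as REP NOP, so pre-P4 CPUs execute it as a NOP. */
    __asm__ __volatile__("pause" ::: "memory");
#else
    /* Non-x86 build: nothing better to do than a compiler barrier. */
    atomic_signal_fence(memory_order_seq_cst);
#endif
}

/* Try once to take the lock; returns nonzero on success. */
static inline int demo_spin_trylock(demo_spinlock_t *l)
{
    return atomic_exchange_explicit(&l->locked, 1,
                                    memory_order_acquire) == 0;
}

static inline void demo_spin_lock(demo_spinlock_t *l)
{
    while (!demo_spin_trylock(l)) {
        /* Wait on a plain read, pausing between polls, so the other
         * logical CPU gets the execution resources while we spin. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            demo_cpu_relax();
    }
}

static inline void demo_spin_unlock(demo_spinlock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The inner read-only loop means the atomic exchange is retried only after the lock looks free, and the PAUSE sits in the part of the loop that actually spins.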
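
On the cache line point, a padding scheme could look something like the sketch below. It assumes GCC's aligned attribute and uses the 128-byte P4 L2 line size mentioned above; DEMO_CACHE_LINE, DEMO_CACHE_ALIGNED and struct demo_queue are hypothetical names, and a real version would pick the line size from the CPU the kernel is built for.

```c
#include <stddef.h>

/* Assumption: build targets a P4, so pad to the 128-byte L2 line. */
#define DEMO_CACHE_LINE    128
#define DEMO_CACHE_ALIGNED __attribute__((aligned(DEMO_CACHE_LINE)))

/* Hypothetical structure: a lock word and the data it protects. */
struct demo_queue {
    int lock DEMO_CACHE_ALIGNED;   /* lock word gets its own cache line */
    int head DEMO_CACHE_ALIGNED;   /* protected data starts on the next */
    int tail;                      /* shares the data line with head */
};

/* Fail the build if the layout is not what the comments claim. */
_Static_assert(offsetof(struct demo_queue, head) == DEMO_CACHE_LINE,
               "protected data must not share the lock's cache line");
_Static_assert(sizeof(struct demo_queue) % DEMO_CACHE_LINE == 0,
               "array elements must not share cache lines");
```

The cost is memory: this structure grows from 12 bytes to 256, so it only makes sense for locks that are actually contended between physical processors.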