Thread: RE: [lc-devel] WKdm to kernel mode
From: Scott K. <sfk...@am...> - 2006-04-24 16:38:02
Nitin,

> -- current WKdm uses Wk_word temp[1200] to store unpacked 10-bits i.e.
> 1 word (32/64-bit) per 10-bits. Why doesn't it use
> short (16-bit) to store these 10-bit patterns in unpacked
> state? Although the 2k memory saving is insignificant, why
> waste it for free? Was it done for some optimization purpose?

The goal was efficiency. Some architectures don't handle half-words as quickly as full words. For the x86, there may be no difference. I'd handle this as an empirical issue: compile it using shorts and test the speed; change it back to ints/longs and see if it's any faster. If not, keep it as 16-bit values.

> -- WKdm_compress() does a kmalloc()/kfree() pair thrice per
> call to this function. Is it too much overhead? Should these
> temp buffers be made global and use locking to serialize
> requests for them (this isn't performance friendly either)?

The original code, which ran only in user-land, did no heap allocating. Given that the compressor will be called very frequently, it needs to be fast. If you can allocate these spaces statically or on the stack, then do it and save the malloc/free operations.

> If the kernel preempts in this ceil(), will the value of 'x' remain
> valid (the kernel does not save and restore the floating point
> processor's state)?

The problem is worse than that: since the kernel does not preserve FP registers, use of FP values in the kernel may clobber legitimate user-land values. You can't do that. You have to change the use of doubles to some kind of integer for kernel code.

Scott
From: Scott K. <sfk...@am...> - 2006-04-25 12:52:56
Nitin,

> Currently I have some problem getting timing information
> needed to see how much time de/compress() is taking.

Can't you turn off power-saving CPU features and use rdtscll()?

> Kernel mode stack is only 4k/8k. So, it is not possible to
> allocate these temporary buffers (which are about 4k when
> shorts are used and about 7k when ints/longs are used) on the
> stack.

Good point.

> I was just thinking to bring down temp buffers to <4k
> so that I can de/allocate a single (0 order) page for these buffers.

Why not allocate them statically? Then you don't have to do the work of allocating and freeing even these 0-order pages. Even if it is fast, it's going to be frequently repeated work that can easily be avoided. In any case, either approach is going to be better than kmalloc/kfree.

> ceil() is now replaced with simple integer operations to get the
> ceiling value.

Nice!

Scott
From: Nitin G. <nit...@gm...> - 2006-04-26 11:16:55
Hello Scott Kaplan,

>> Currently I have some problem getting timing information
>> needed to see how much time de/compress() is taking.
>
> Can't you turn off power-saving CPU features and use rdtscll()?

I was trying get_jiffies_64(), following how printk() gets timing info when kernel debugging is turned on. Surprisingly, I was getting the same time before and after compress(), even when it gives time in nanoseconds -- so there is certainly some kind of problem with how I'm using it. I will now try your suggestion.

>> I was just thinking to bring down temp buffers to <4k
>> so that I can de/allocate a single (0 order) page for these buffers.
>
> Why not allocate them statically? Then you don't have to do the work of
> allocating and freeing even these 0-order pages. Even if it is fast,
> it's going to be frequently repeated work that can easily be avoided.
> In any case, either approach is going to be better than kmalloc/kfree.

Static allocation will require calls to compress() to be serialized using locks, since the same buffer space will be used by every call. This will prevent more than one compress() from happening simultaneously, like:

compress (page 1) {
  lock temp_buffers()
  --- pre-empt ---
  compress (page 2) {
    wait for lock on buffers()
    do some work()
    unlock temp_buffers()
  }
  // got lock
  do some work()
  unlock temp_buffers()
}

This waiting for the lock is not required if the buffers are dynamically allocated, since then each call has its own buffer space to work in.

Isn't static allocation a special case of global allocation where the variable name is hidden in function scope? Is there any other difference between static and global?

Best Regards,
Nitin
From: Scott K. <sfk...@am...> - 2006-04-26 15:07:21
Nitin,

Nitin wrote:
> I was trying get_jiffies_64()

From what I can tell, this function returns a simple count of the jiffies (1/100ths of a second) as a 64-bit integer. It's not in nanoseconds, and a (de)compression occurs within much less than one jiffy. I think you want sched_clock(), which not only returns nanoseconds, but uses rdtscll() to obtain the time from the cycle counter (unless it's a NUMA machine, which I assume is not your first concern here).

> Static allocation will require calls to compress() to be
> serialized [...]

Good point! Stick to your idea. :-)

> Isn't static allocation a special case of global allocation
> where variable name is hidden in function scope?

I'd call it the other way around: both global variables and local variables annotated with `static' result in statically allocated space that becomes part of the executable's image. Only one requires the language-level keyword `static', but both are statically allocated and ultimately located in the same region of memory.

Scott
From: <fla...@gm...> - 2006-04-26 16:33:45
Hi,

> > Static allocation will require calls to compress() to be
> > serialized [...]
>
> Good point! Stick to your idea. :-)

May I raise some points from an uninformed POV? ;-)

- maybe this serialization is not that bad? What order of time is compress() expected to take?
- if the system is low on (free) RAM, isn't using kmalloc likely to double the pages to swap out?
- are you sure you're not neglecting the overhead of kmalloc?

Cheers,
Flávio

PS. What about changing the default of the list to reply-to-list? ;-)
From: Rodrigo S de C. <ro...@te...> - 2006-04-26 16:54:22
Flávio Etrusco wrote:
> PS. What about changing the default of the list to reply-to-list? ;-)

Done. I hope it works from now on :-)

Rodrigo
From: Nitin G. <nit...@gm...> - 2006-04-27 12:31:46
Hi,

Flávio Etrusco wrote:
> - maybe this serialization is not that bad? What order of time is
> compress() expected to take?

I haven't yet measured how much time compress() actually takes (I expect on the order of a few 100 ns for WKdm). But even if WKdm is fast, I don't expect LZO to be fast enough to be serialized. But then we can keep WKdm serialized and LZO kmalloc()ed. I am not sure of any single solution that will be good for systems of different speeds. Maybe we are overestimating the cost of kmalloc()s... I don't know. Anyway, both these methods are worth proper benchmarks -- maybe then we'll be able to see.

> - if the system is low on (free) RAM, isn't using kmalloc likely to
> double the pages to swap out?

If the system is so low on memory that it can't even allocate a few KBs without swapping, then ccache had better start freeing pages :) It should be allowed to add pages only when free RAM is above a certain threshold.

> - are you sure you're not neglecting the overhead of kmalloc?

I am not sure if kmalloc overhead is negligible or not compared to serialization. Also, which of these should be faster:

kmalloc(sizeof(WK_word)*300, GFP_KERNEL);
kmalloc(sizeof(WK_word)*300, GFP_KERNEL);
kmalloc(sizeof(unsigned short)*1200, GFP_KERNEL);

or a single kmalloc allocating the sum of the above three sizes?

Cheers,
Nitin
From: Nitin G. <nit...@gm...> - 2006-04-24 19:10:47
Hello Scott Kaplan,

>> -- current WKdm uses Wk_word temp[1200] to store unpacked 10-bits i.e.
>> 1 word (32/64-bit) per 10-bits. Why doesn't it use
>> short (16-bit) to store these 10-bit patterns in unpacked
>> state? Although the 2k memory saving is insignificant, why
>> waste it for free? Was it done for some optimization purpose?
>
> The goal was efficiency. Some architectures don't handle half-words as
> quickly as full words. For the x86, there may be no difference. I'd
> handle this as an empirical issue: compile it using shorts and test the
> speed; change it back to ints/longs and see if it's any faster. If not,
> keep it as 16-bit values.

Currently I have some problem getting timing information needed to see how much time de/compress() is taking. I will definitely compare speeds for these cases. Based on the results, it may be useful to have arch-specific compile directives -- the code needs minimal changes for the short to int/long conversion, so it would not be a problem to have some arch-specific optimization too.

>> -- WKdm_compress() does a kmalloc()/kfree() pair thrice per
>> call to this function. Is it too much overhead? Should these
>> temp buffers be made global and use locking to serialize
>> requests for them (this isn't performance friendly either)?
>
> The original code, which ran only in user-land, did no heap allocating.
> Given that the compressor will be called very frequently, it needs to be
> fast. If you can allocate these spaces statically or on the stack, then
> do it and save the malloc/free operations.

The kernel mode stack is only 4k/8k, so it is not possible to allocate these temporary buffers (which are about 4k when shorts are used and about 7k when ints/longs are used) on the stack. I was just thinking to bring the temp buffers down to <4k so that I can de/allocate a single (0 order) page for them. De/allocating 0-order pages is fast. This will eliminate the need for the triple kmalloc()/kfree(). Or maybe allocating a double page is still more efficient than the triple kmalloc/kfree... I'll benchmark these cases.

>> If the kernel preempts in this ceil(), will the value of 'x' remain
>> valid (the kernel does not save and restore the floating point
>> processor's state)?
>
> The problem is worse than that: since the kernel does not preserve FP
> registers, use of FP values in the kernel may clobber legitimate
> user-land values. You can't do that. You have to change the use of
> doubles to some kind of integer for kernel code.

ceil() is now replaced with simple integer operations to get the ceiling value.

Best Regards,
Nitin