From: Carl L. <ce...@us...> - 2008-06-10 23:22:43
|
Arnd and company: There are issues with using the cpu buffers for storing the SPU data. I agree with you on some of your comments about the high complexity of the per cpu patch. It seemed like a good idea at the start but in the end I am not all that happy with it. I have another approach to solving this bug. I will lay out the approach and then try to give the pros/cons of the proposed approach. Hopefully everyone will help out with the design so we can come up with something that will address the issues that we have seen with the previous approaches. I think we are getting closer to an acceptable solution. Maybe the fourth try will be the one. :-)

New approach, call it Single system wide buffer.

In arch/powerpc/oprofile/cell/spu_profiler.c,
- Create a function to return the per cpu buffer size.
- Create a single system wide buffer (call it SPU_BUFFER) to store the SPU context switch and SPU PC data. The buffer would be an array of unsigned longs. Currently the cpu buffer is a struct containing two unsigned longs. The function would be called in function start_spu_profiling() as part of starting an SPU profiling session. The size of the buffer would be based on the size of the per cpu buffers * a scale factor for handling all of the spu data.
- Create a function in arch/powerpc/oprofile/cell/spu_profiler.c that would be periodically called via a delayed workqueue call to flush SPU_BUFFER to the kernel buffer. It would have to call the routine to get the kernel buffer mutex, grab the lock protecting writes to the SPU_BUFFER, flush the data to the kernel buffer and then release the locks.
- Create a function to store data into the SPU_BUFFER. It will be managed by a lock to ensure atomic access. If the buffer fills, data will be dropped on the floor as is currently done when the per cpu buffer fills. The cpu_buf->sample_lost_overflow stats would be incremented, as if we were using them, so the user would know that the buffer was not big enough.
We would keep the fact that we really are not using per cpu buffers transparent to the user. The idea is the buffer size management would all look the same to the user.

In the drivers/oprofile/buffer_sync.c file,
- Create a function to obtain the kernel buffer mutex lock.
- Create a second function to release the kernel buffer mutex lock.
- Make these two functions visible to the architecture specific code by exporting them in include/linux/oprofile.h.

Pros of the single system wide buffer:
1) It would not require changing existing architecture independent code to add the special data to a CPU buffer, move the special data to the kernel buffer, or remove the code that was added so that the SPU profiling code does not enable the creation of the per cpu buffers. Summary: minimal changes to architecture independent code.
2) The SPU_BUFFER size would be managed using the same user controls and feedback as the per cpu buffers. The goal would be to make management look like we actually used cpu buffers.
3) I think the overall kernel data storage space would be less. Currently, I have to increase the size of the cpu buffers, done in user land, by 8 times to handle the SPU data. Since each entry in the SPU_BUFFER would be one unsigned long instead of two unsigned longs for the cpu buffer, we would save 2x here. We would not have to have an escape sequence entry before each data sample, which would save an additional 2x in space.
4) Do not have to manage mapping of SPUs to CPU buffers. We don't care what the cpu or spu numbers are.
5) I think the memory storage could be adjusted better for the SPU event profiling that I am working on.

Cons:
1) Currently OProfile does not support CPU hotplug. Not sure how easy it would be or what issues there may be in supporting CPU hotplug with this approach. Can't really say if it would be easier/harder than with per cpu buffers.
2) I think the cpu_buf stats are allocated when the module loads.
I think it is just the per cpu buffer allocation that is done dynamically when the daemon starts/shuts down. I have not investigated this enough to say for sure. If this is not the case, then we may need to tweak the generic oprofile code a bit more.
3) There is probably some other gotcha out there I have not thought of yet.

Any additional ideas, comments, concerns??? It would be nice if we could flush out as many issues as possible now. Thanks for your time and thought on this.

Carl Love |
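Carl's store/flush scheme can be sketched as a small userspace model (all names are hypothetical, and pthread mutexes stand in for the kernel's spinlock and the kernel buffer mutex; this is an illustration of the locking and overflow behavior he describes, not the actual kernel code):

```c
#include <pthread.h>
#include <stddef.h>

#define SPU_BUFFER_SIZE 1024          /* illustrative; the real size would be
                                         derived from the cpu buffer size and
                                         a scale factor */

static unsigned long spu_buffer[SPU_BUFFER_SIZE];
static size_t spu_buffer_head;        /* next free slot */
static unsigned long samples_lost;    /* analogue of sample_lost_overflow */
static pthread_mutex_t spu_buffer_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t kernel_buffer_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Store one value; drop it on the floor if the buffer is full,
 * counting the loss so the user knows the buffer was too small. */
void spu_buffer_add(unsigned long value)
{
        pthread_mutex_lock(&spu_buffer_lock);
        if (spu_buffer_head < SPU_BUFFER_SIZE)
                spu_buffer[spu_buffer_head++] = value;
        else
                samples_lost++;
        pthread_mutex_unlock(&spu_buffer_lock);
}

/* Periodic flush (delayed-workqueue callback in the real design):
 * take the kernel buffer mutex first, then the SPU_BUFFER lock,
 * copy everything out, and release both.  Returns the number of
 * entries flushed. */
size_t spu_buffer_flush(void (*sink)(unsigned long))
{
        size_t i, n;

        pthread_mutex_lock(&kernel_buffer_mutex);
        pthread_mutex_lock(&spu_buffer_lock);
        n = spu_buffer_head;
        for (i = 0; i < n; i++)
                sink(spu_buffer[i]);
        spu_buffer_head = 0;
        pthread_mutex_unlock(&spu_buffer_lock);
        pthread_mutex_unlock(&kernel_buffer_mutex);
        return n;
}
```

The lock ordering (kernel buffer mutex, then SPU_BUFFER lock) follows the flush routine Carl outlines; producers only ever take the inner lock, so they are never blocked on the kernel buffer mutex.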
From: Arnd B. <ar...@ar...> - 2008-06-11 12:53:05
|
On Wednesday 11 June 2008, Carl Love wrote:
> Arnd and company:
>
> There are issues with using the cpu buffers for storing the SPU data. I agree with you on some of your comments about the high complexity of the per cpu patch. It seemed like a good idea at the start but in the end I am not all that happy with it.

Right, I agree. I assumed it would be enough to always write to the local CPU's buffer, but as you explained, that caused other problems that you had to work around.

Thanks for your detailed outline of your ideas, I think we're getting much closer to a solution now!

> New approach, call it Single system wide buffer.

I like the approach in general, but would prefer to have it more generic, so that other subsystems can use the same infrastructure if necessary:

> In arch/powerpc/oprofile/cell/spu_profiler.c,
> - Create a function to return the per cpu buffer size.
>
> - Create a single system wide buffer (call it SPU_BUFFER) to store the SPU context switch and SPU PC data. The buffer would be an array of unsigned longs. Currently the cpu buffer is a struct containing two unsigned longs. The function would be called in function start_spu_profiling() as part of starting an SPU profiling session. The size of the buffer would be based on the size of the per cpu buffers * a scale factor for handling all of the spu data.

I think the dependency on the size of the per cpu buffers is a bit artificial. How about making this independently tunable with another oprofilefs file and a reasonable default? You could either create the extra files in oprofile_create_files or in oprofile_ops.create_files.

> - Create a function in arch/powerpc/oprofile/cell/spu_profiler.c that would be periodically called via a delayed workqueue call to flush SPU_BUFFER to the kernel buffer. It would have to call the routine to get the kernel buffer mutex, grab the lock protecting writes to the SPU_BUFFER, flush the data to the kernel buffer and then release the locks.
> - Create a function to store data into the SPU_BUFFER. It will be managed by a lock to ensure atomic access. If the buffer fills, data will be dropped on the floor as is currently done when the per cpu buffer fills. The cpu_buf->sample_lost_overflow stats would be incremented, as if we were using them, so the user would know that the buffer was not big enough. We would keep the fact that we really are not using per cpu buffers transparent to the user. The idea is the buffer size management would all look the same to the user.
>
> In the drivers/oprofile/buffer_sync.c file,
> - Create a function to obtain the kernel buffer mutex lock.
> - Create a second function to release the kernel buffer mutex lock.
> - Make these two functions visible to the architecture specific code by exporting them in include/linux/oprofile.h.

As a general design principle, I'd always try to have locks close to the data. What this means is that I think the SPU_BUFFER should be in drivers/oprofile, similar to the cpu_buffer. In that case, it may be better to change the name to something more generic, e.g. aux_buffer.

Similarly, moving data into the aux_buffer can be done under a spinlock that is defined in the same file, with an interface like the oprofile_add_value() function from your V3 patch, but moving the add_value_lock in there.

> Pros of the single system wide buffer:
> 1) It would not require changing existing architecture independent code to add the special data to a CPU buffer, move the special data to the kernel buffer, or remove the code that was added so that the SPU profiling code does not enable the creation of the per cpu buffers. Summary: minimal changes to architecture independent code.

I don't think that's a requirement. I don't mind doing changes to spufs or oprofile common code at all, as long as the change is reasonable. This is especially true when there are potentially other users that can benefit from the change.
> 2) The SPU_BUFFER size would be managed using the same user controls and feedback as the per cpu buffers. The goal would be to make management look like we actually used cpu buffers.

As mentioned above, I'd see this as a disadvantage.

> 3) I think the overall kernel data storage space would be less. Currently, I have to increase the size of the cpu buffers, done in user land, by 8 times to handle the SPU data. Since each entry in the SPU_BUFFER would be one unsigned long instead of two unsigned longs for the cpu buffer, we would save 2x here. We would not have to have an escape sequence entry before each data sample, which would save an additional 2x in space.

right

> 4) Do not have to manage mapping of SPUs to CPU buffers. We don't care what the cpu or spu numbers are.

right

> 5) I think the memory storage could be adjusted better for the SPU event profiling that I am working on.

right.

> Cons:
> 1) Currently OProfile does not support CPU hotplug. Not sure how easy it would be or what issues there may be in supporting CPU hotplug with this approach. Can't really say if it would be easier/harder than with per cpu buffers.

I think it would be harder, but if oprofile doesn't support it anyway, let's not worry about it.

> 2) I think the cpu_buf stats are allocated when the module loads. I think it is just the per cpu buffer allocation that is done dynamically when the daemon starts/shuts down. I have not investigated this enough to say for sure. If this is not the case, then we may need to tweak the generic oprofile code a bit more.

Both the event_buffer and the cpu_buffers are allocated in oprofile_setup when the event buffer file is opened. When you have another buffer besides the cpu_buffers, you can either change the oprofile_setup function to allocate it, or do it in your setup() callback.

> 3) There is probably some other gotcha out there I have not thought of yet.

None that I can think of.

	Arnd <>< |
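Arnd's suggestion — a generic aux_buffer living in drivers/oprofile with the lock kept next to the data, sized by a tunable with a sane default — might look roughly like the following userspace model (function and variable names are guesses at the shape of the interface, with a pthread mutex standing in for the spinlock; only oprofile_add_value() comes from Carl's V3 patch):

```c
#include <stdlib.h>
#include <pthread.h>

/* Hypothetical generic "aux buffer": the buffer, its size, and its
 * lock are all private to this file, close to the data. */

static unsigned long *aux_buffer;
static size_t aux_buffer_size = 4096;   /* reasonable default; in the real
                                           design this would be settable via
                                           a new oprofilefs file */
static size_t aux_buffer_pos;
static pthread_mutex_t aux_lock = PTHREAD_MUTEX_INITIALIZER;

/* Would be called from oprofile_setup() or the arch's setup() callback. */
int aux_buffer_alloc(size_t size)
{
        aux_buffer = calloc(size, sizeof(*aux_buffer));
        if (!aux_buffer)
                return -1;
        aux_buffer_size = size;
        aux_buffer_pos = 0;
        return 0;
}

/* Interface in the spirit of oprofile_add_value(), but with the
 * add_value_lock moved in here.  Returns -1 when the value had to
 * be dropped, so the caller can count the lost sample. */
int oprofile_add_value(unsigned long value)
{
        int ret = 0;

        pthread_mutex_lock(&aux_lock);
        if (aux_buffer_pos < aux_buffer_size)
                aux_buffer[aux_buffer_pos++] = value;
        else
                ret = -1;
        pthread_mutex_unlock(&aux_lock);
        return ret;
}
```

The point of the shape is that callers in arch code never touch the buffer or its lock directly; they only see the add function, which is what lets the common layer swap the storage strategy without touching its users.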
From: Maynard J. <may...@us...> - 2008-06-11 14:56:30
|
Arnd Bergmann wrote:
> On Wednesday 11 June 2008, Carl Love wrote:
>> Arnd and company:
>>
>> There are issues with using the cpu buffers for storing the SPU data. I agree with you on some of your comments about the high complexity of the per cpu patch. It seemed like a good idea at the start but in the end I am not all that happy with it.
>
> Right, I agree. I assumed it would be enough to always write to the local CPU's buffer, but as you explained, that caused other problems that you had to work around.
>
> Thanks for your detailed outline of your ideas, I think we're getting much closer to a solution now!
>
>> New approach, call it Single system wide buffer.
>
> I like the approach in general, but would prefer to have it more generic, so that other subsystems can use the same infrastructure if necessary:

I believe Carl's thoughts on implementation would result in model-specific buffer create/destroy routines, to be invoked by the generic routine. This would give the flexibility for architectures to implement their own unique buffering mechanism. I don't think it would be worthwhile to go beyond this level in trying to build some generic mechanism, since the generic mechanism already exists and has been sufficient for all other architectures.

>> In arch/powerpc/oprofile/cell/spu_profiler.c,
>> - Create a function to return the per cpu buffer size.
>>
>> - Create a single system wide buffer (call it SPU_BUFFER) to store the SPU context switch and SPU PC data. The buffer would be an array of unsigned longs. Currently the cpu buffer is a struct containing two unsigned longs. The function would be called in function start_spu_profiling() as part of starting an SPU profiling session. The size of the buffer would be based on the size of the per cpu buffers * a scale factor for handling all of the spu data.
>
> I think the dependency on the size of the per cpu buffers is a bit artificial.
> How about making this independently tunable with another oprofilefs file and a reasonable default? You could either create the extra files in oprofile_create_files or in oprofile_ops.create_files.

With the right scaling factor, we should be able to avoid the buffer overflows for the most part. Adding a new file in oprofilefs for an arch-specific buffer size would have to be documented on the user side, as well as providing a new option on the opcontrol command to adjust the value. Granted, the system-wide buffer that Carl is proposing for Cell clashes with the "cpu buffer" construct, but some other architecture in the future may want to make use of these new model-specific buffer management routines, and perhaps their buffers *would* be per-cpu. I'm not sure there would be much value add in exposing this implementation detail to the user as long as we can effectively use the existing user interface mechanism for managing buffer size.

>> - Create a function in arch/powerpc/oprofile/cell/spu_profiler.c that would be periodically called via a delayed workqueue call to flush SPU_BUFFER to the kernel buffer. It would have to call the routine to get the kernel buffer mutex, grab the lock protecting writes to the SPU_BUFFER, flush the data to the kernel buffer and then release the locks.
>>
>> - Create a function to store data into the SPU_BUFFER. It will be managed by a lock to ensure atomic access. If the buffer fills, data will be dropped on the floor as is currently done when the per cpu buffer fills. The cpu_buf->sample_lost_overflow stats would be incremented, as if we were using them, so the user would know that the buffer was not big enough. We would keep the fact that we really are not using per cpu buffers transparent to the user. The idea is the buffer size management would all look the same to the user.
>>
>> In the drivers/oprofile/buffer_sync.c file,
>> - Create a function to obtain the kernel buffer mutex lock.
>> - Create a second function to release the kernel buffer mutex lock.
>> - Make these two functions visible to the architecture specific code by exporting them in include/linux/oprofile.h.
>
> As a general design principle, I'd always try to have locks close to the data. What this means is that I think the SPU_BUFFER should be in drivers/oprofile, similar to the cpu_buffer. In that case, it may be better to change the name to something more generic, e.g. aux_buffer.

I agree -- in principle -- however, I really don't think a system-wide buffer is something we want to put into generic code. It is not generally useful. Also, putting this new code in drivers/oprofile would not fit well with the general technique of adding model-specific function pointers to 'struct oprofile_operations'.

> Similarly, moving data into the aux_buffer can be done under a spinlock that is defined in the same file, with an interface like the oprofile_add_value() function from your V3 patch, but moving the add_value_lock in there.
>
>> Pros of the single system wide buffer:
>> 1) It would not require changing existing architecture independent code to add the special data to a CPU buffer, move the special data to the kernel buffer, or remove the code that was added so that the SPU profiling code does not enable the creation of the per cpu buffers. Summary: minimal changes to architecture independent code.
>
> I don't think that's a requirement. I don't mind doing changes to spufs or oprofile common code at all, as long as the change is reasonable. This is especially true when there are potentially other users that can benefit from the change.

This assumes we can get the oprofile maintainer to review and comment on the new patch, which has been difficult to do lately. I agree that we should not avoid making changes to common code when there's a good justification for it, but I would say that this is not such a case.
-Maynard

>> 2) The SPU_BUFFER size would be managed using the same user controls and feedback as the per cpu buffers. The goal would be to make management look like we actually used cpu buffers.
>
> As mentioned above, I'd see this as a disadvantage.
>
>> 3) I think the overall kernel data storage space would be less. Currently, I have to increase the size of the cpu buffers, done in user land, by 8 times to handle the SPU data. Since each entry in the SPU_BUFFER would be one unsigned long instead of two unsigned longs for the cpu buffer, we would save 2x here. We would not have to have an escape sequence entry before each data sample, which would save an additional 2x in space.
>
> right
>
>> 4) Do not have to manage mapping of SPUs to CPU buffers. We don't care what the cpu or spu numbers are.
>
> right
>
>> 5) I think the memory storage could be adjusted better for the SPU event profiling that I am working on.
>
> right.
>
>> Cons:
>> 1) Currently OProfile does not support CPU hotplug. Not sure how easy it would be or what issues there may be in supporting CPU hotplug with this approach. Can't really say if it would be easier/harder than with per cpu buffers.
>
> I think it would be harder, but if oprofile doesn't support it anyway, let's not worry about it.
>
>> 2) I think the cpu_buf stats are allocated when the module loads. I think it is just the per cpu buffer allocation that is done dynamically when the daemon starts/shuts down. I have not investigated this enough to say for sure. If this is not the case, then we may need to tweak the generic oprofile code a bit more.
>
> Both the event_buffer and the cpu_buffers are allocated in oprofile_setup when the event buffer file is opened. When you have another buffer besides the cpu_buffers, you can either change the oprofile_setup function to allocate it, or do it in your setup() callback.
>
>> 3) There is probably some other gotcha out there I have not thought of yet.
>
> None that I can think of.
>
> Arnd <>< |
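The model-specific create/destroy hooks Maynard describes could take roughly this shape: the arch registers optional callbacks and the generic layer dispatches through them, falling back to the default per-cpu buffers when they are unset. This is a hypothetical userspace sketch; the real 'struct oprofile_operations' has different fields, and all names here are illustrative:

```c
#include <stddef.h>

/* Hypothetical extension of the oprofile_operations idea with
 * model-specific buffer hooks; not the real kernel struct. */
struct oprofile_operations {
        int  (*buffer_create)(void);    /* optional, arch-specific */
        void (*buffer_destroy)(void);   /* optional, arch-specific */
};

static int default_cpu_buffers_used;

/* Generic routine: prefer the model-specific buffer if provided,
 * otherwise fall back to the stock per-cpu buffers. */
static int oprofile_buffer_create(struct oprofile_operations *ops)
{
        if (ops->buffer_create)
                return ops->buffer_create();
        default_cpu_buffers_used = 1;
        return 0;
}

/* Example arch hook, standing in for a Cell-style system-wide buffer. */
static int cell_buffer_created;

static int cell_buffer_create(void)
{
        cell_buffer_created = 1;
        return 0;
}
```

The appeal of this style is that the generic code never learns whether the arch buffer is system-wide or per-cpu; that decision stays behind the function pointer, which is Maynard's argument for keeping the Cell-specific buffer out of drivers/oprofile.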
From: Arnd B. <ar...@ar...> - 2008-06-12 12:35:52
|
On Wednesday 11 June 2008, Maynard Johnson wrote:
>> I like the approach in general, but would prefer to have it more generic, so that other subsystems can use the same infrastructure if necessary:
>
> I believe Carl's thoughts on implementation would result in model-specific buffer create/destroy routines, to be invoked by the generic routine. This would give the flexibility for architectures to implement their own unique buffering mechanism. I don't think it would be worthwhile to go beyond this level in trying to build some generic mechanism, since the generic mechanism already exists and has been sufficient for all other architectures.

Well, the amount of code should be the same either way. We currently have an interface between the cpu_buffer and its users, while the interface between the event_buffer and the cpu_buffer is private to the oprofile layer. The logical extension of this would be to have another buffer in common oprofile code, and have its users in the architecture code, rather than interface the architecture with the private event_buffer.

>> I think the dependency on the size of the per cpu buffers is a bit artificial. How about making this independently tunable with another oprofilefs file and a reasonable default? You could either create the extra files in oprofile_create_files or in oprofile_ops.create_files.
>
> With the right scaling factor, we should be able to avoid the buffer overflows for the most part. Adding a new file in oprofilefs for an arch-specific buffer size would have to be documented on the user side, as well as providing a new option on the opcontrol command to adjust the value. Granted, the system-wide buffer that Carl is proposing for Cell clashes with the "cpu buffer" construct, but some other architecture in the future may want to make use of these new model-specific buffer management routines, and perhaps their buffers *would* be per-cpu.
> I'm not sure there would be much value add in exposing this implementation detail to the user as long as we can effectively use the existing user interface mechanism for managing buffer size.

But we're already exposing the implementation detail for the cpu buffers, which are similar enough, just not the same. Since the file and opcontrol option are called cpu-buffer-size, it's rather unobvious how this should control the size of a global buffer. As I mentioned, with a reasonable default, you should not need to tune this anyway. Changing opcontrol is a real disadvantage, but if a user needs to tune this without a new opcontrol tool, it's always possible to directly write to the file system.

>> As a general design principle, I'd always try to have locks close to the data. What this means is that I think the SPU_BUFFER should be in drivers/oprofile, similar to the cpu_buffer. In that case, it may be better to change the name to something more generic, e.g. aux_buffer.
>
> I agree -- in principle -- however, I really don't think a system-wide buffer is something we want to put into generic code. It is not generally useful. Also, putting this new code in drivers/oprofile would not fit well with the general technique of adding model-specific function pointers to 'struct oprofile_operations'.
>
> This assumes we can get the oprofile maintainer to review and comment on the new patch, which has been difficult to do lately. I agree that we should not avoid making changes to common code when there's a good justification for it, but I would say that this is not such a case.

You have to change the oprofile code either way, because the existing interface is not sufficient. However, there is no point in trying to do a minimal change to the existing code when a larger change clearly gives us a better resulting code.

	Arnd <>< |
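The tunable Arnd proposes — a ulong with a reasonable default that users can set by writing the oprofilefs file directly, even without a new opcontrol option — reduces to a store handler along these lines. This is a userspace model with hypothetical names, not the oprofilefs API; it just illustrates the parse-with-default behavior:

```c
#include <stdlib.h>

/* Model of the proposed buffer-size tunable: a ulong with a sane
 * default, writable directly (the real file might live somewhere like
 * /dev/oprofile/, path and name hypothetical). */

static unsigned long spu_buffer_size = 16384;   /* reasonable default */

/* Handle a write to the tunable, rejecting zero and non-numeric input
 * so a bad write cannot disable the buffer entirely. */
int spu_buffer_size_store(const char *buf)
{
        char *end;
        unsigned long val = strtoul(buf, &end, 0);

        if (end == buf || (*end != '\0' && *end != '\n') || val == 0)
                return -1;
        spu_buffer_size = val;
        return 0;
}
```

With this in place, tuning without opcontrol support is just a shell redirect into the file, which is the fallback Arnd mentions.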