Thread: [pygccxml-development] Comments on module_builder cache functionality
Brought to you by:
mbaas,
roman_yakovenko
From: Roman Y. <rom...@gm...> - 2006-08-28 20:03:04
|
Hi. I review the functionality you added to module builder, also I like the idea, I don't agree with implementation. 1. You can not use "print" statements within the code. Please change them to self.logger.debug, except loading and save messages. In my opinion they should be info. 2. pygccxml class defines cache classes, why you can not reuse one of them? 3. Why do you think it could be useful to have cache and module_cache arguments? In my opinion, if you want to provide simple interface leave only one of them - cache and make it work with pygccxml. -- Roman Yakovenko C++ Python language binding http://www.language-binding.net/ |
From: Allen B. <al...@vr...> - 2006-08-28 20:23:17
|
Roman Yakovenko wrote: >Hi. I review the functionality you added to module builder, also I >like the idea, >I don't agree with implementation. > >1. You can not use "print" statements within the code. Please change them to > self.logger.debug, except loading and save messages. In my opinion > they should be info. > > Agreed. those should not be there. I will fix it. >2. pygccxml class defines cache classes, why you can not reuse one of them? > > I use the signature methods from there, but the cache classes from there are based on the concept of mapping from a key to a cached value. In this case we just want to load/dump the state of a single object that we know ahead of time and we only want to load it if we know the key matches. This is a different set of requirements so I just did it locally. >3. Why do you think it could be useful to have cache and module_cache > arguments? In my opinion, if you want to provide simple interface >leave only one > of them - cache and make it work with pygccxml. > > > I did it this way so I didn't break the old API in any way. I have no problem with changing it though if you support the change. -Allen |
From: Allen B. <al...@vr...> - 2006-08-29 13:37:50
|
Roman Yakovenko wrote: > On 8/29/06, *Allen Bierbaum* <al...@vr... > <mailto:al...@vr...>> wrote: > > As requested, I have backed out these changes. The implementation is > now available for people to use in goodies.goodies_perf_overrides.py. > Just import that file and you will get the full module caching as well > as the create_identifier override. > > > I don't like the way you do it. There was good reason to remove it - > critical bug. > I did not see the code you added to goodies, but I assume it contains > same bug. > > Bug description: > > Lets say you have 2 header files: > implementation_details.h > ... > > and > > to_be_exported.h: > > #include "implementation_details.h" > .... > > Py++ code: > > mb = module_builder_t( "to_be_exported.h" ) > > The problem with Allen implementation is that you can change > "implementation_details.h" > file, but "cache" will remain valid. In this case module_builder_t > should rebuild the cache, otherwise > Py++ will generate wrong code. This is a very critical bug. pygccxml > cache classes know > to deal with this situation. Actually, if you change any text in implementation_details.h it will rebuild the cache because the md5 hash will catch it. This is the same with the entire list of header files. I have been using it and I can say without a doubt this is how it works and it works exactly the same as the pygccxml decl cache. As I said on IM though there are changes that both of them miss. The thing that both cache implementations miss is that if a file included by to_be_exported.h changes, then the caches will be invalid but they will not know it. The module cache misses this and the existing pygccxml cache misses it. (if you doubt this, take a look at the code. Although pygccxml passing a set of included files into the cache update method those files are not used for the key signature. Thus they are not used to validate the cache entry on load). I have seen this problem before in the pygccxml cache and there is really no good way around it. The only way that pygccxml has to get a full list of dependent header files (ie. build a recursive list of includes) is to either a) run pygccxml on the file or b) add a scanner that does this internally. Option a is out because it would make the cache worthless since we would be doing the exact work we want to cache. Option b doesn't exist in the code base and in my case I would want the use to be optional because it would take significant time to run. I haven't tested for sure but I would guess that in my case it would have to find and scan well over 10,000 files if not more. What both of these bugs mean is that it is possible to make a change in your code and have it not get picked up. That is definitely true and I don't see a great way around it. One hybrid solution that may work is that we could add an optional parameter to the module builder that is a list of "dependent" headers that the user cares about checking. This list could be scanned and made part of the signature. Still not perfect, but it would allow the user to tailor their caching needs. > > There was another bug: module_builder_t.__init__ can take not only > file paths but also > text, that contains valid C++ code. The Allen's code does not deal > with them at all. I will have to look at this part. Does this code come in through the files parameter? As far as I can tell the handling of this parameter is still the same as it was before. If there is a type of "file" that I am missing then I may have to update the signature calcuation but that should not be too hard. > > I proposed to Allen to create and implement module_builder_t cache class > in terms of pygccxml classes. I don't have time to rewrite the existing cache code to make it so it could be reusable for the case I need. I already am using all the parts that are reusable (the file and config signature methods). But the cache can not be used or extended as is for several reasons: 1) The current signature keys are based on 1 file. I have multiple files that make up the key. 2) The current pygccxml cache is based on the idea of loading a map from key to data. These cache entries are marked as dirty or not before being flushed. I don't need anything like this. I just need a cache that has a signature to look at as the first entry in the pickled data. And then the next entry needs to be the tree. The pygccxml cache can't really work this way because it has multiple entries. So the two needs are enough different that I don't see the commonality for reuse beyond what I have done. -Allen > > I am pretty busy this days, so I don't have time to fix the bug. Thus > the code has been removed. > > > -- > Roman Yakovenko > C++ Python language binding > http://www.language-binding.net/ |
From: Allen B. <al...@vr...> - 2006-08-28 22:10:07
|
As requested, I have backed out these changes. The implementation is now available for people to use in goodies.goodies_perf_overrides.py. Just import that file and you will get the full module caching as well as the create_identifier override. -Allen Allen Bierbaum wrote: >Roman Yakovenko wrote: > > > >>Hi. I review the functionality you added to module builder, also I >>like the idea, >>I don't agree with implementation. >> >>1. You can not use "print" statements within the code. Please change them to >> self.logger.debug, except loading and save messages. In my opinion >> they should be info. >> >> >> >> >Agreed. those should not be there. I will fix it. > > > >>2. pygccxml class defines cache classes, why you can not reuse one of them? >> >> >> >> >I use the signature methods from there, but the cache classes from there >are based on the concept of mapping from a key to a cached value. In >this case we just want to load/dump the state of a single object that we >know ahead of time and we only want to load it if we know the key >matches. This is a different set of requirements so I just did it locally. > > > >>3. Why do you think it could be useful to have cache and module_cache >> arguments? In my opinion, if you want to provide simple interface >>leave only one >> of them - cache and make it work with pygccxml. >> >> >> >> >> >I did it this way so I didn't break the old API in any way. I have no >problem with changing it though if you support the change. > >-Allen > > > >------------------------------------------------------------------------- >Using Tomcat but need to do more? Need to support web services, security? >Get stuff done quickly with pre-integrated technology to make your job easier >Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo >http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642 >_______________________________________________ >pygccxml-development mailing list >pyg...@li... >https://lists.sourceforge.net/lists/listinfo/pygccxml-development > > > |
From: Roman Y. <rom...@gm...> - 2006-08-29 05:14:35
|
On 8/29/06, Allen Bierbaum <al...@vr...> wrote: > > As requested, I have backed out these changes. The implementation is > now available for people to use in goodies.goodies_perf_overrides.py. > Just import that file and you will get the full module caching as well > as the create_identifier override. > I don't like the way you do it. There was good reason to remove it - critical bug. I did not see the code you added to goodies, but I assume it contains same bug. Bug description: Lets say you have 2 header files: implementation_details.h ... and to_be_exported.h: #include "implementation_details.h" .... Py++ code: mb = module_builder_t( "to_be_exported.h" ) The problem with Allen implementation is that you can change "implementation_details.h" file, but "cache" will remain valid. In this case module_builder_t should rebuild the cache, otherwise Py++ will generate wrong code. This is a very critical bug. pygccxml cache classes know to deal with this situation. There was another bug: module_builder_t.__init__ can take not only file paths but also text, that contains valid C++ code. The Allen's code does not deal with them at all. I proposed to Allen to create and implement module_builder_t cache class in terms of pygccxml classes. I am pretty busy this days, so I don't have time to fix the bug. Thus the code has been removed. -- Roman Yakovenko C++ Python language binding http://www.language-binding.net/ |
From: Allen B. <al...@vr...> - 2006-08-29 14:09:16
|
Allen Bierbaum wrote: >Roman Yakovenko wrote: > > > >>On 8/29/06, *Allen Bierbaum* <al...@vr... >><mailto:al...@vr...>> wrote: >> >> As requested, I have backed out these changes. The implementation is >> now available for people to use in goodies.goodies_perf_overrides.py. >> Just import that file and you will get the full module caching as well >> as the create_identifier override. >> >> >>I don't like the way you do it. There was good reason to remove it - >>critical bug. >>I did not see the code you added to goodies, but I assume it contains >>same bug. >> >>Bug description: >> >>Lets say you have 2 header files: >>implementation_details.h >> ... >> >>and >> >>to_be_exported.h: >> >> #include "implementation_details.h" >> .... >> >>Py++ code: >> >>mb = module_builder_t( "to_be_exported.h" ) >> >>The problem with Allen implementation is that you can change >>"implementation_details.h" >>file, but "cache" will remain valid. In this case module_builder_t >>should rebuild the cache, otherwise >>Py++ will generate wrong code. This is a very critical bug. pygccxml >>cache classes know >>to deal with this situation. >> >> > >Actually, if you change any text in implementation_details.h it will >rebuild the cache because the md5 hash will catch it. This is the same >with the entire list of header files. I have been using it and I can >say without a doubt this is how it works and it works exactly the same >as the pygccxml decl cache. > >As I said on IM though there are changes that both of them miss. > >The thing that both cache implementations miss is that if a file >included by to_be_exported.h changes, then the caches will be invalid >but they will not know it. The module cache misses this and the >existing pygccxml cache misses it. (if you doubt this, take a look at >the code. Although pygccxml passing a set of included files into the >cache update method those files are not used for the key signature. >Thus they are not used to validate the cache entry on load). > >I have seen this problem before in the pygccxml cache and there is >really no good way around it. The only way that pygccxml has to get a >full list of dependent header files (ie. build a recursive list of >includes) is to either a) run pygccxml on the file or b) add a scanner >that does this internally. Option a is out because it would make the >cache worthless since we would be doing the exact work we want to >cache. Option b doesn't exist in the code base and in my case I would >want the use to be optional because it would take significant time to >run. I haven't tested for sure but I would guess that in my case it >would have to find and scan well over 10,000 files if not more. > >What both of these bugs mean is that it is possible to make a change in >your code and have it not get picked up. That is definitely true and I >don't see a great way around it. > >One hybrid solution that may work is that we could add an optional >parameter to the module builder that is a list of "dependent" headers >that the user cares about checking. This list could be scanned and made >part of the signature. Still not perfect, but it would allow the user >to tailor their caching needs. > > > I just committed a change to the module builder override in goodies that implements this feature. I have started using it for my projects and it seems to work well so far. -Allen |
From: Roman Y. <rom...@gm...> - 2006-08-29 14:34:48
|
On 8/29/06, Allen Bierbaum <al...@vr...> wrote: > >Actually, if you change any text in implementation_details.h it will > >rebuild the cache because the md5 hash will catch it. Your code does not contains the check. > > > >As I said on IM though there are changes that both of them miss. > > > >The thing that both cache implementations miss is that if a file > >included by to_be_exported.h changes, then the caches will be invalid > >but they will not know it. This is an open source project. I claim that pygccxml.parser.file_cache_t class will not miss. > >I have seen this problem before in the pygccxml cache and there is > >really no good way around it. The only way that pygccxml has to get a > >full list of dependent header files (ie. build a recursive list of > >includes) GCCXML does it. > >What both of these bugs mean is that it is possible to make a change in > >your code and have it not get picked up. pygccxml does not contain this bug. > >One hybrid solution that may work is that we could add an optional > >parameter to the module builder that is a list of "dependent" headers > >that the user cares about checking. In pygccxml this workds out of the box. -- Roman Yakovenko C++ Python language binding http://www.language-binding.net/ |
From: Allen B. <al...@vr...> - 2006-08-29 16:30:43
|
Roman Yakovenko wrote: > On 8/29/06, Allen Bierbaum <al...@vr...> wrote: > >> >Actually, if you change any text in implementation_details.h it will >> >rebuild the cache because the md5 hash will catch it. > > > Your code does not contains the check. It doesn't contain a check for implicit includes, no. I have temporarily added the ability for the user to pass in a list of files they know should be used for the signature. This is not a great solution, but it will help out until a better solution is found (see below). >> > >> >As I said on IM though there are changes that both of them miss. >> > >> >The thing that both cache implementations miss is that if a file >> >included by to_be_exported.h changes, then the caches will be invalid >> >but they will not know it. > > > This is an open source project. I claim that pygccxml.parser.file_cache_t > class will not miss. > >> >I have seen this problem before in the pygccxml cache and there is >> >really no good way around it. The only way that pygccxml has to get a >> >full list of dependent header files (ie. build a recursive list of >> >includes) > > > GCCXML does it. > >> >What both of these bugs mean is that it is possible to make a change in >> >your code and have it not get picked up. > > > pygccxml does not contain this bug. Okay. Roman is right. Sorry for doubting you. I just wish I could have found the code quicker. I tracked down where this test is coming from. For those of you that are interested, this check is on line 218 of declarations_cache.py. The test doesn't test the signature by computing a signature against the files that should be included during the current run, instead it does something that works equally as well. It checks to see if any of the files included when the cache was built have changed. If they have then it says the signature is invalid. >> >One hybrid solution that may work is that we could add an optional >> >parameter to the module builder that is a list of "dependent" headers >> >that the user cares about checking. > > > In pygccxml this workds out of the box. > So now I have to figure out how to get this to work with module builder cache and there is a problem. Namely, in module builder the call to parse_declarations does not have access to the list of files that gccxml had to include. The only place where these files seem to be available is in source_reader.py on line 237. It does not appear that this information remains in the pygccxml tree but I could be wrong. Also, the files list is only used to update the decl cache so it is not available later during execution. Does anybody know a way to either a) get the computed file list from source_reader back into the module builder or b) get a list of files that were included from a pygccxml decl tree? Note: Modifying declarations_cache.file_cache_t would not eliminate this need. No matter how the cache is held I need a way to get this list of files in module builder so I can use it to update the cache. -Allen |
From: Roman Y. <rom...@gm...> - 2006-08-29 17:03:01
|
On 8/29/06, Allen Bierbaum <al...@vr...> wrote: > So now I have to figure out how to get this to work with module builder > cache and there is a problem. Namely, in module builder the call to > parse_declarations does not have access to the list of files that gccxml > had to include. The only place where these files seem to be available > is in source_reader.py on line 237. It does not appear that this > information remains in the pygccxml tree but I could be wrong. Also, > the files list is only used to update the decl cache so it is not > available later during execution. > > Does anybody know a way to either a) get the computed file list from > source_reader back into the module builder You don't need this > or b) get a list of files > that were included from a pygccxml decl tree? pygccxml.declarations.declarations_files But you also don't need this functionality My idea was next: class mbcache_t( pygccxml.parser.base_cache_t ): def __init__( self, *args ): pygccxml.parser.base_cache_t.__init__( self ) test whether there is a need to load from cache on disk self.decls_cache = pygccxml.parser.file_cache_t( file_name ) def update( self, .... ): self.decls_cache.update( self, .... ) def flush( self, ... ) self.decls_cache.flush() write to disk files passed to module_builder_t Thus, mbcache_t only needs to warry about files list that was passed to module_builder_t I think this is pretty simple idea, that will take few hours to implement. What do you think? -- Roman Yakovenko C++ Python language binding http://www.language-binding.net/ |
From: Allen B. <al...@vr...> - 2006-08-29 19:43:42
|
Roman Yakovenko wrote: > On 8/29/06, Allen Bierbaum <al...@vr...> wrote: > >> So now I have to figure out how to get this to work with module builder >> cache and there is a problem. Namely, in module builder the call to >> parse_declarations does not have access to the list of files that gccxml >> had to include. The only place where these files seem to be available >> is in source_reader.py on line 237. It does not appear that this >> information remains in the pygccxml tree but I could be wrong. Also, >> the files list is only used to update the decl cache so it is not >> available later during execution. >> >> Does anybody know a way to either a) get the computed file list from >> source_reader back into the module builder > > > You don't need this > >> or b) get a list of files >> that were included from a pygccxml decl tree? > > > pygccxml.declarations.declarations_files > But you also don't need this functionality How would I call update then in the interface you propose below? I need the list of include files if that part of the key check is going to work. > > My idea was next: > > class mbcache_t( pygccxml.parser.base_cache_t ): > def __init__( self, *args ): > pygccxml.parser.base_cache_t.__init__( self ) > test whether there is a need to load from cache on disk > self.decls_cache = pygccxml.parser.file_cache_t( file_name ) > > def update( self, .... ): > self.decls_cache.update( self, .... ) > > def flush( self, ... ) > self.decls_cache.flush() > write to disk files passed to module_builder_t > > Thus, mbcache_t only needs to warry about files list that was passed to > module_builder_t > > I think this is pretty simple idea, that will take few hours to > implement. Maybe I am missing something, but what I don't see is how the interface provides the capabilities I need. It would not be difficult for me to write a little helper class that wrapped the cache file and loading if that is the issue, but I don't see a good way to reuse the file_cache_t code or really a good reason to do so. What I need is different and smaller. I need something like: class mbcache_t(object): def __init__(self, cache_file) self.cache_file = file(cache_file) def getCachedTree(self, sourceFiles, configuration): # Read and test prefix signature against sourceFiles and configuration # If no match, early exit # Read decl tree # Read signature and list of files used to create cache # If one of those files has changed, then return None # Else return tree def updateCachedTree(self, sourceFiles, configuration, includedFiles, declTree): # Open cache file for writing # Compute prefix sig and dump it # Dump the decl tree # Compute and dump signature for and includeFile list # Save and close the file I can easily create a class like this and use it if you think that will make the code more modular. But I hope you can see this is quite a bit different from the cache map that we use in pygccxml. There is no map, there is no need for dirty/used flag, it works for multiple files, it is smart enough to only load if needed. I would reuse the signature code, but the rest would be unique. But, back to the question at hand. How would I get the "includedFiles" for the updateCacheTree call? Note that this call would need to happen builder.__init__ where all I have access to is the self.__global_ns decl tree that I am caching. The gccxml scanner that is used to pass the parameters to the file_cache_t.update() method is not available at this level in the code. Any ideas? -Allen > > What do you think? > |