Re: [pygccxml-development] Comments on module_builder cache functionality
Brought to you by:
mbaas,
roman_yakovenko
From: Allen B. <al...@vr...> - 2006-08-29 13:37:50
|
Roman Yakovenko wrote: > On 8/29/06, *Allen Bierbaum* <al...@vr... > <mailto:al...@vr...>> wrote: > > As requested, I have backed out these changes. The implementation is > now available for people to use in goodies.goodies_perf_overrides.py. > Just import that file and you will get the full module caching as well > as the create_identifier override. > > > I don't like the way you do it. There was good reason to remove it - > critical bug. > I did not see the code you added to goodies, but I assume it contains > same bug. > > Bug description: > > Lets say you have 2 header files: > implementation_details.h > ... > > and > > to_be_exported.h: > > #include "implementation_details.h" > .... > > Py++ code: > > mb = module_builder_t( "to_be_exported.h" ) > > The problem with Allen implementation is that you can change > "implementation_details.h" > file, but "cache" will remain valid. In this case module_builder_t > should rebuild the cache, otherwise > Py++ will generate wrong code. This is a very critical bug. pygccxml > cache classes know > to deal with this situation. Actually, if you change any text in implementation_details.h it will rebuild the cache because the md5 hash will catch it. This is the same with the entire list of header files. I have been using it and I can say without a doubt this is how it works and it works exactly the same as the pygccxml decl cache. As I said on IM though there are changes that both of them miss. The thing that both cache implementations miss is that if a file included by to_be_exported.h changes, then the caches will be invalid but they will not know it. The module cache misses this and the existing pygccxml cache misses it. (if you doubt this, take a look at the code. Although pygccxml passing a set of included files into the cache update method those files are not used for the key signature. Thus they are not used to validate the cache entry on load). I have seen this problem before in the pygccxml cache and there is really no good way around it. The only way that pygccxml has to get a full list of dependent header files (ie. build a recursive list of includes) is to either a) run pygccxml on the file or b) add a scanner that does this internally. Option a is out because it would make the cache worthless since we would be doing the exact work we want to cache. Option b doesn't exist in the code base and in my case I would want the use to be optional because it would take significant time to run. I haven't tested for sure but I would guess that in my case it would have to find and scan well over 10,000 files if not more. What both of these bugs mean is that it is possible to make a change in your code and have it not get picked up. That is definitely true and I don't see a great way around it. One hybrid solution that may work is that we could add an optional parameter to the module builder that is a list of "dependent" headers that the user cares about checking. This list could be scanned and made part of the signature. Still not perfect, but it would allow the user to tailor their caching needs. > > There was another bug: module_builder_t.__init__ can take not only > file paths but also > text, that contains valid C++ code. The Allen's code does not deal > with them at all. I will have to look at this part. Does this code come in through the files parameter? As far as I can tell the handling of this parameter is still the same as it was before. If there is a type of "file" that I am missing then I may have to update the signature calcuation but that should not be too hard. > > I proposed to Allen to create and implement module_builder_t cache class > in terms of pygccxml classes. I don't have time to rewrite the existing cache code to make it so it could be reusable for the case I need. I already am using all the parts that are reusable (the file and config signature methods). But the cache can not be used or extended as is for several reasons: 1) The current signature keys are based on 1 file. I have multiple files that make up the key. 2) The current pygccxml cache is based on the idea of loading a map from key to data. These cache entries are marked as dirty or not before being flushed. I don't need anything like this. I just need a cache that has a signature to look at as the first entry in the pickled data. And then the next entry needs to be the tree. The pygccxml cache can't really work this way because it has multiple entries. So the two needs are enough different that I don't see the commonality for reuse beyond what I have done. -Allen > > I am pretty busy this days, so I don't have time to fix the bug. Thus > the code has been removed. > > > -- > Roman Yakovenko > C++ Python language binding > http://www.language-binding.net/ |