Thread: [CEDET-devel] Why project-specific caches?

Why does every project type I've seen maintain its own cache of
projects, usually managed in the implementation of the load-type slot,
instead of just relying on the global one managed by auto.el?

On 03/17/2014 09:10 PM, Daniel Colascione wrote:
> Why does every project type I've seen maintain its own cache of
> projects, usually managed in the implementation of the load-type slot,
> instead of just relying on the global one managed by auto.el?

For some projects, it is necessary, such as ede-project-root.  For 
others it is primarily for performance.

If a project was already detected, you can save a bunch of time by 
testing against existing projects.

Some projects can only be detected from the root of the project.  For 
such a project EDE will not see your project unless it checks the roots 
of previously found projects of the same type.

If you have a long list of different kinds of projects, there is no 
sense testing projects not of the same type you are in.

Some of it is historical too.  The independence between the projects has 
something to do with it.  In retrospect I have also thought it would be 
better to search the core list only once from the core instead of asking 
each project to do it one at a time.

Eric

On 03/21/2014 07:41 PM, Eric M. Ludlam wrote:
> On 03/17/2014 09:10 PM, Daniel Colascione wrote:
>> Why does every project type I've seen maintain its own cache of
>> projects, usually managed in the implementation of the load-type slot,
>> instead of just relying on the global one managed by auto.el?
> 
> For some projects, it is necessary, such as ede-project-root.  For
> others it is primarily for performance.
> 
> If a project was already detected, you can save a bunch of time by
> testing against existing projects.
> 
> Some projects can only be detected from the root of the project.  For
> such a project EDE will not see your project unless it checks the roots
> of previously found projects of the same type.
> 
> If you have a long list of different kinds of projects, there is no
> sense testing projects not of the same type you are in.
> 
> Some of it is historical too.  The independence between the projects has
> something to do with it.  In retrospect I have also thought it would be
> better to search the core list only once from the core instead of asking
> each project to do it one at a time.

Isn't that what the code in auto.el does now? In ede-load-project-file,
we see whether we have a project in ede-projects corresponding to a
directory; if we don't, we call into ede-auto-load-project, which builds
a project and adds it to ede-projects via ede-add-project-to-global-list.

So why does each project type have to redundantly maintain its own list
of projects? We already have a global list.

All this complexity is very confusing when trying to create a new
project type. The latest problem I'm having is that there are weird
state dependencies, and sometimes detection fails with
ede-object-project being nil and sometimes with both ede-object-project
and ede-object-root-project being nil. (My project type has no
subprojects, and ede-project-root-directory works fine.) I wish there
were a much simpler way to just wire up a simple project (for a type for
which we don't have some kind of existing XXX-root thing pre-built).

On 03/24/2014 04:53 PM, Daniel Colascione wrote:
> On 03/21/2014 07:41 PM, Eric M. Ludlam wrote:
>> On 03/17/2014 09:10 PM, Daniel Colascione wrote:
>>> Why does every project type I've seen maintain its own cache of
>>> projects, usually managed in the implementation of the load-type slot,
>>> instead of just relying on the global one managed by auto.el?
>>
>> For some projects, it is necessary, such as ede-project-root.  For
>> others it is primarily for performance.
>>
>> If a project was already detected, you can save a bunch of time by
>> testing against existing projects.
>>
>> Some projects can only be detected from the root of the project.  For
>> such a project EDE will not see your project unless it checks the roots
>> of previously found projects of the same type.
>>
>> If you have a long list of different kinds of projects, there is no
>> sense testing projects not of the same type you are in.
>>
>> Some of it is historical too.  The independence between the projects has
>> something to do with it.  In retrospect I have also thought it would be
>> better to search the core list only once from the core instead of asking
>> each project to do it one at a time.
>
> Isn't that what the code in auto.el does now? In ede-load-project-file,
> we see whether we have a project in ede-projects corresponding to a
> directory; if we don't, we call into ede-auto-load-project, which builds
> a project and adds it to ede-projects via ede-add-project-to-global-list.
>
> So why does each project type have to redundantly maintain its own list
> of projects? We already have a global list.

Hi Daniel,

I agree that project loading is a bit confusing.  A vast majority of the 
complexity is due to performance optimization.  As EDE started providing 
facilities to other operations, such as Semantic for finding header 
files, the poor performance forced performance optimizations.

I also spent time trying to get EDE to function where I work, and we 
have network file systems there, and the old behavior actually started 
to cause our filers to crash due to too many Emacs users querying for 
project files that didn't exist at the filer root where the automounter 
kicks in.

What this means is that EDE now ONLY checks the current directory for 
the different projects.  It doesn't scan upward for a project root if no 
project is found.

Some projects, such as the one that leaves Project.ede files around will 
allow the auto-loader to identify a project, and only THEN will it scan 
upward for the root.  This too was pretty slow, and the local variable 
for the root project was added to speed that up.

Anyway, this means that any project that ONLY has an identifying project 
file at the root needs to handle the case where the user opens a file in 
a subdirectory.  It used to be this was the minority case, so it was 
handled only in the project definitions.  I think the ratios have since 
changed as new project styles I've used have in a majority of cases only 
had a unique identifier at the root.

> All this complexity is very confusing when trying to create a new
> project type.

I agree.  I think it would be worthwhile to take this common case and 
pull some of the logic up into the core of EDE.  That is bound to 
simplify creating new projects.

Fortunately, after initial project identification is done, everything is 
cached internal to EDE and your code won't be called anymore except in 
new directories.

> The latest problem I'm having is that there are weird
> state dependencies, and sometimes detection fails with
> ede-object-project being nil and sometimes with both ede-object-project
> and ede-object-root-project being nil. (My project type has no
> subprojects, and ede-project-root-directory works fine.) I wish there
> were a much simpler way to just wire up a simple project (for a type for
> which we don't have some kind of existing XXX-root thing pre-built).

If this happens while you are in the middle of testing changes in your 
EDE project, you may be encountering cached results from a previous test 
run.  You can use ede-flush-directory-hash to clear out any pesky caches.

You can also use ede-flush-project-hash to clear out data from any calls 
that use ede-locate.  That seems like an unlikely cause here though.

Eric

On 03/24/2014 05:47 PM, Eric M. Ludlam wrote:
> On 03/24/2014 04:53 PM, Daniel Colascione wrote:
>> On 03/21/2014 07:41 PM, Eric M. Ludlam wrote:
>>> On 03/17/2014 09:10 PM, Daniel Colascione wrote:
>>>> Why does every project type I've seen maintain its own cache of
>>>> projects, usually managed in the implementation of the load-type slot,
>>>> instead of just relying on the global one managed by auto.el?
>>>
>>> For some projects, it is necessary, such as ede-project-root.  For
>>> others it is primarily for performance.
>>>
>>> If a project was already detected, you can save a bunch of time by
>>> testing against existing projects.
>>>
>>> Some projects can only be detected from the root of the project.  For
>>> such a project EDE will not see your project unless it checks the roots
>>> of previously found projects of the same type.
>>>
>>> If you have a long list of different kinds of projects, there is no
>>> sense testing projects not of the same type you are in.
>>>
>>> Some of it is historical too.  The independence between the projects has
>>> something to do with it.  In retrospect I have also thought it would be
>>> better to search the core list only once from the core instead of asking
>>> each project to do it one at a time.
>>
>> Isn't that what the code in auto.el does now? In ede-load-project-file,
>> we see whether we have a project in ede-projects corresponding to a
>> directory; if we don't, we call into ede-auto-load-project, which builds
>> a project and adds it to ede-projects via ede-add-project-to-global-list.
>>
>> So why does each project type have to redundantly maintain its own list
>> of projects? We already have a global list.
> 
> Hi Daniel,
> 
> I agree that project loading is a bit confusing.  A vast majority of the
> complexity is due to performance optimization.  As EDE started providing
> facilities to other operations, such as Semantic for finding header
> files, the poor performance forced performance optimizations.

I think it'd help to build abstractions for the complexity. Right now,
the complexity is scattered throughout the code, which hurts understanding.

> Some projects, such as the one that leaves Project.ede files around will
> allow the auto-loader to identify a project, and only THEN will it scan
> upward for the root.  This too was pretty slow, and the local variable
> for the root project was added to speed that up.
> 
> Anyway, this means that any project that ONLY has an identifying project
> file at the root needs to handle the case where the user opens a file in
> a subdirectory.  It used to be this was the minority case, so it was
> handled only in the project definitions.  I think the ratios have since
> changed as new project styles I've used have in a majority of cases only
> had a unique identifier at the root.

Yes. Lots of other tools, like git, scan upwards as well. The "normal",
default case should just be scanning upward for a project root every
time, for simplicity's sake. It's going to be fast enough on most
systems, and the statelessness of the system will go a long way toward
simplifying understanding of the code and building new projects.
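
To make the stateless upward scan concrete, here is a minimal sketch in Python. It is not EDE code: the function name and the idea of a single marker file are assumptions made for the illustration. (Emacs itself provides locate-dominating-file for this kind of walk.)

```python
import os


def find_project_root(start_dir, marker):
    """Return the nearest ancestor of start_dir (inclusive) that contains
    an entry named marker, or None if no ancestor does.

    Hypothetical helper for illustration; nothing is cached between calls.
    """
    current = os.path.abspath(start_dir)
    while True:
        if os.path.exists(os.path.join(current, marker)):
            return current
        parent = os.path.dirname(current)
        if parent == current:
            # dirname of the filesystem root is the root itself: stop here.
            return None
        current = parent
```

Every call touches the filesystem once per directory level, which is the cost the rest of the thread argues about.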

If you need a stateful cache, please build it as an optional add-on.

Still, what I'm asking about specifically are caches specific to project
types, like ede-cpp-root-project-list. I don't understand why *this
specific* variable needs to exist at all, and why cpp-root.el has to
have its own cache. Anything the cpp-root specific cache can do, an
overload of ede-dir-to-projectfile can do, yes?

> 
>> All this complexity is very confusing when trying to create a new
>> project type.
> 
> I agree.  I think it would be worthwhile to take this common case and
> pull some of the logic up into the core of EDE.  That is bound to
> simplify creating new projects.
> 
> Fortunately, after initial project identification is done, everything is
> cached internal to EDE and your code won't be called anymore except in
> new directories.
> 
> You can also use ede-flush-project-hash to clear out data from any calls
> that use ede-locate.  That seems like an unlikely cause here though.

How are these flush functions supposed to know about private caches
maintained by individual project type classes?

On 03/24/2014 09:17 PM, Daniel Colascione wrote:
> On 03/24/2014 05:47 PM, Eric M. Ludlam wrote:
>> Some projects, such as the one that leaves Project.ede files around will
>> allow the auto-loader to identify a project, and only THEN will it scan
>> upward for the root.  This too was pretty slow, and the local variable
>> for the root project was added to speed that up.
>>
>> Anyway, this means that any project that ONLY has an identifying project
>> file at the root needs to handle the case where the user opens a file in
>> a subdirectory.  It used to be this was the minority case, so it was
>> handled only in the project definitions.  I think the ratios have since
>> changed as new project styles I've used have in a majority of cases only
>> had a unique identifier at the root.
>
> Yes. Lots of other tools, like git, scan upwards as well. The "normal",
> default case should just be scanning upward for a project root every
> time, for simplicity's sake. It's going to be fast enough on most
> systems, and the statelessness of the system will go a long way toward
> simplifying understanding of the code and building new projects.
>
> If you need a stateful cache, please build it as an optional add-on.

EDE used to do searches that way, and while 'fast enough' for 
identification of a file, the number of other functions that kept asking 
for the location of the project root made that check far too slow, 
requiring caches.  Note that the cache I am talking about here is NOT the 
same as the per-project-type list of projects you might be thinking of.

>>
>> You can also use ede-flush-project-hash to clear out data from any calls
>> that use ede-locate.  That seems like an unlikely cause here though.
>
> How are these flush functions supposed to know about private caches
> maintained by individual project type classes?

The directory hash tracks directories and their associated projects so 
classic searching isn't needed.
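
The general shape of such a directory hash can be sketched as follows, in Python rather than Emacs Lisp. This is only an illustration under stated assumptions: the class name, the detect callback, and the choice to remember "no project here" answers are all made up for the example and are not taken from EDE's source.

```python
class DirectoryProjectCache:
    """Remember which project (if any) was found for each directory."""

    def __init__(self, detect):
        # detect is a hypothetical, possibly slow callback:
        # directory name -> project object, or None if nothing was found.
        self._detect = detect
        self._by_directory = {}

    def project_for(self, directory):
        # Only directories not seen before reach the detector; negative
        # results are stored too (an assumption of this sketch).
        if directory not in self._by_directory:
            self._by_directory[directory] = self._detect(directory)
        return self._by_directory[directory]

    def flush(self):
        # Forget everything, so the next lookup runs detection again.
        self._by_directory.clear()
```

A flush of this table says nothing about any separate list a project type keeps for itself, which is the distinction drawn in this message.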

The project hash uses the locator database, usually something like the 
unix system "locate", or perhaps GNU Global to find files using a short 
name more quickly.

These have nothing to do with the lists of projects maintained in 
individual project classes like ede-cpp-root.

Eric

On 03/24/2014 07:08 PM, Eric M. Ludlam wrote:
> On 03/24/2014 09:17 PM, Daniel Colascione wrote:
>> On 03/24/2014 05:47 PM, Eric M. Ludlam wrote:
>>> Some projects, such as the one that leaves Project.ede files around will
>>> allow the auto-loader to identify a project, and only THEN will it scan
>>> upward for the root.  This too was pretty slow, and the local variable
>>> for the root project was added to speed that up.
>>>
>>> Anyway, this means that any project that ONLY has an identifying project
>>> file at the root needs to handle the case where the user opens a file in
>>> a subdirectory.  It used to be this was the minority case, so it was
>>> handled only in the project definitions.  I think the ratios have since
>>> changed as new project styles I've used have in a majority of cases only
>>> had a unique identifier at the root.
>>
>> Yes. Lots of other tools, like git, scan upwards as well. The "normal",
>> default case should just be scanning upward for a project root every
>> time, for simplicity's sake. It's going to be fast enough on most
>> systems, and the statelessness of the system will go a long way toward
>> simplifying understanding of the code and building new projects.
>>
>> If you need a stateful cache, please build it as an optional add-on.
>
> EDE used to do searches that way, and while 'fast enough' for
> identification of a file, the number of other functions that kept asking
> for the location of the project root made that check far too slow,
> requiring caches.  Note that the cache I am talking about here is NOT the
> same as the per-project-type list of projects you might be thinking of.

The choice doesn't have to be between walking the filesystem for each
call and caching everything in global data structures forever. You can
reference count projects --- use filesystem traversal to find a project
for a buffer, then cache that project object in a buffer-local variable.
Instead of just keeping that project on a list forever, add a reference
for each buffer using it, and delete the project object when the last
buffer associated with a project disappears. This way, the global state
problem is mitigated and the mental modeling of state becomes a lot simpler.
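
A rough sketch of the reference-counting idea, written in Python for brevity. All names here (ProjectRegistry, attach, detach, load_project) are hypothetical; in Emacs the detach step would presumably run from something like kill-buffer-hook, and the returned project would live in a buffer-local variable.

```python
class ProjectRegistry:
    """Keep a project object alive only while some buffer refers to it."""

    def __init__(self, load_project):
        self._load_project = load_project  # hypothetical: root -> project
        self._projects = {}                # root -> project object
        self._users = {}                   # root -> set of buffer names
        self._root_of = {}                 # buffer name -> root

    def attach(self, buffer, root):
        """Associate buffer with the project at root, loading it if needed."""
        if self._root_of.get(buffer) == root:
            return self._projects[root]
        self.detach(buffer)  # a buffer refers to at most one project
        if root not in self._projects:
            self._projects[root] = self._load_project(root)
            self._users[root] = set()
        self._users[root].add(buffer)
        self._root_of[buffer] = root
        return self._projects[root]

    def detach(self, buffer):
        """Drop buffer's reference; discard the project when none remain."""
        root = self._root_of.pop(buffer, None)
        if root is None:
            return
        self._users[root].discard(buffer)
        if not self._users[root]:
            del self._users[root]
            del self._projects[root]

    def live_roots(self):
        """Roots of the projects that currently have at least one user."""
        return sorted(self._projects)
```

The trade-off is that a project is reloaded if its last buffer is closed and another is opened later.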

If you want to cache more aggressively than that, you should do it by
providing alternate implementations of filesystem functions instead of
using Emacs primitives that turn directly into system calls. I really
don't see why EDE *core*, for example, has to know anything about
inodes. The logic gets in the way of trying to understand both the
actual flow of the code and the intended method of operation.

>>> You can also use ede-flush-project-hash to clear out data from any calls
>>> that use ede-locate.  That seems like an unlikely cause here though.
>>
>> How are these flush functions supposed to know about private caches
>> maintained by individual project type classes?
>
> The directory hash tracks directories and their associated projects so
> classic searching isn't needed.
>
> The project hash uses the locator database, usually something like the
> unix system "locate", or perhaps GNU Global to find files using a short
> name more quickly.

Fair enough. So why is it a hash mapping file shortnames to full paths?
Why doesn't each project just maintain a list of files belonging to that
project --- if we want to find a file not on that list, we can find that
file the hard way (using locate or whatever) and update the list as we
go. Implementing the existing short-name-to-full-path mapping using this
list is trivial.
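
As an illustration of that proposal, the following Python sketch keeps a per-project file list and falls back to a slow lookup only for names it has not seen. The class name and the locate callback are hypothetical stand-ins (for "locate or whatever"); nothing here is EDE's actual API.

```python
import os


class ProjectFileList:
    """A per-project list of known files, extended lazily on lookup misses."""

    def __init__(self, files, locate):
        self._files = list(files)  # full paths already known to the project
        # locate is a hypothetical slow fallback:
        # short name -> full path, or None if the file cannot be found.
        self._locate = locate

    def expand(self, short_name):
        """Map a short file name to a full path, learning new files as we go."""
        for path in self._files:
            if os.path.basename(path) == short_name:
                return path
        found = self._locate(short_name)
        if found is not None:
            self._files.append(found)  # the next lookup is answered locally
        return found
```

This sketch returns the first match, so two files with the same short name in different subdirectories are not distinguished.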

> These have nothing to do with the lists of projects maintained in
> individual project classes like ede-cpp-root.

So why do these individual lists exist? I don't understand what purpose
they serve. What would go wrong if we just got rid of, say,
ede-emacs-project-list?

On 03/24/2014 10:42 PM, Daniel Colascione wrote:
> On 03/24/2014 07:08 PM, Eric M. Ludlam wrote:
>> On 03/24/2014 09:17 PM, Daniel Colascione wrote:
>>> On 03/24/2014 05:47 PM, Eric M. Ludlam wrote:
>> EDE used to do searches that way, and while 'fast enough' for
>> identification of a file, the number of other functions that kept asking
>> for the location of the project root made that check far too slow,
>> requiring caches.  Note that the cache I am talking about here is NOT the
>> same as the per-project-type list of projects you might be thinking of.
>
> The choice doesn't have to be between walking the filesystem for each
> call and caching everything in global data structures forever. You can
> reference count projects --- use filesystem traversal to find a project
> for a buffer, then cache that project object in a buffer-local variable.
> Instead of just keeping that project on a list forever, add a reference
> for each buffer using it, and delete the project object when the last
> buffer associated with a project disappears. This way, the global state
> problem is mitigated and the mental modeling of state becomes a lot simpler.

Hi Daniel,

Sure - there are of course more than two ways to do this.  The EDE 
mechanism for matching a file to a project is not just a simple hash 
match either.

Projects are looked up for a bunch of different reasons.  If you just want 
to know the project for a buffer, that is buffer-local, as you suggest.
If you want to know the project for a new buffer when it first gets 
created, we check to see if that directory has been matched up to a 
project yet.  If so, it is a nice fast answer.  If it hasn't been 
matched up yet, we go through a process of trying to detect a project on 
disk for it.

> If you want to cache more aggressively than that, you should do it by
> providing alternate implementations of filesystem functions instead of
> using Emacs primitives that turn directly into system calls. I really
> don't see why EDE *core*, for example, has to know anything about
> inodes. The logic gets in the way of trying to understand both the
> actual flow of the code and the intended method of operation.

The inode thing is in the core because while I was profiling, that was 
the fastest way to resolve symlinks I found.  Many folks were plagued 
by symlinks, automounter problems, and EDE matching files to the 
wrong projects.  Once I resolved that with inodes, all has been 
peaceful.  I originally used file-truename, which is very slow, 
especially on networked file systems, and quite abusive to automounter 
systems.
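
The idea of comparing directories by inode instead of by resolved path name can be sketched like this in Python. The helper names are made up, and pairing the inode with the device number is an assumption of the sketch; the thread does not say exactly what EDE compares.

```python
import os


def directory_identity(path):
    """Return a key that is equal for two paths naming the same directory.

    os.stat follows symlinks, so a symlink and its target give the same key
    without resolving each path component by name.
    """
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def same_directory(a, b):
    """True if a and b refer to the same directory on disk."""
    return directory_identity(a) == directory_identity(b)
```

The comparison costs one stat call per path, however many symlinks the path passes through.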

>> The project hash uses the locator database, usually something like the
>> unix system "locate", or perhaps GNU Global to find files using a short
>> name more quickly.
>
> Fair enough. So why is it a hash mapping file shortnames to full paths?

There is a hash between fully qualified directory names, and already 
found projects, which is what I was mostly talking about above.

There is a second hash in the locator subsystem to speed up cases where 
programmatic use keeps pinging for the same files many times in a row, 
usually header files during a smart completion operation.

It has taken me an extra round on this thread to realize the combination 
of your questions is identifying a flaw where different sub-directories 
in your project should identify short-names (i.e. a header file) 
differently based on location.  I think that is a real problem I hadn't 
encountered before.  It will require some restructuring, or just 
ignoring this hash to make that work correctly.

> Why doesn't each project just maintain a list of files belonging to that
> project --- if we want to find a file not on that list, we can find that
> file the hard way (using locate or whatever) and update the list as we
> go. Implementing the existing short-name-to-full-path mapping using this
> list is trivial.

Not all projects have a mechanism for quickly creating the list of 
files, and some projects can have an external tool for managing that 
list.  The ones that do maintain a list do implement expand-file-name 
as you suggest, just by scanning it quickly.

The extra 'locate' stuff is a handy way for users to combine a tool they 
have (i.e. locate) and some other independent project they use together 
to get a feature.

>> These have nothing to do with the lists of projects maintained in
>> individual project classes like ede-cpp-root.
>
> So why do these individual lists exist? I don't understand what purpose
> they serve. What would go wrong if we just got rid of, say,
> ede-emacs-project-list?

Their existence is a side-effect of the history behind the code.  I am 
not opposed to getting rid of the list if a suitable replacement of the 
behavior is proposed.   History tells me that performance will be a key 
test when evaluating any updated system.

I do not actually like most of the code you have been challenging.  It 
started out quite simple and has evolved into something I have a hard 
time fixing bugs in.   It is, however, quite fast for what it does, and 
has a good feature set that is important in making smart completion work 
in the Semantic package which is what most people use it for.  I would 
be glad to accept patches and I would help advise testing strategies 
based on my experience for any good ideas that would help simplify it 
and make it easier to extend.

Eric
