From: Marc C. <cou...@gm...> - 2007-08-24 17:47:50
On Thursday 23 August 2007 04:02:07 Dirk Bartley wrote:
> On Wed, 2007-08-22 at 10:11 -0400, Dan Langille wrote:
> > On 21 Aug 2007 at 20:07, Dirk Bartley wrote:
> > > On Mon, 2007-08-20 at 08:04 -0400, Dan Langille wrote:
> > > > > There are basically three queries.  One to get a list of
> > > > > directories, another to get a list of files and one to get a
> > > > > list of versions for a specific file.
> > > > >
> > > > > 1. To get a list of directories given a list of jobs.  Here is
> > > > > an example:
> > > > >
> > > > > SELECT DISTINCT Path.Path AS Path, File.PathId AS PathId FROM File
> > > > > INNER JOIN Path ON (File.PathId=Path.PathId) WHERE
> > > > > File.FilenameId=3 AND File.Jobid IN
> > > > > (441,440,439,438,437,436,435,434,433,432,431,430,429,428,427,426,
> > > > > 425,424,423,422,421,420,419,418,416,415,413,412,411,410,409,408,
> > > > > 407,406,405,404,403,402,400,399,398,397,396,395,394,393,392,391,
> > > > > 385,384,383,382,381,379,378,376,375,374,373,372,371) ORDER BY Path
> > > > >
> > > > > Time: 3988.458 ms
> > > >
> > > > That is four seconds.  I'd have to see the explain analyse output.
> > > Unique  (cost=338755.18..340527.31 rows=6297 width=36) (actual time=4000.332..4000.437 rows=16 loops=1)
> > >   ->  Sort  (cost=338755.18..339345.89 rows=236283 width=36) (actual time=4000.330..4000.359 rows=204 loops=1)
> > >         Sort Key: path.path, file.pathid
> > >         ->  Hash Join  (cost=16509.58..307281.74 rows=236283 width=36) (actual time=3932.884..3994.564 rows=204 loops=1)
> > >               Hash Cond: (file.pathid = path.pathid)
> > >               ->  Index Scan using file_filenameid_jobid2 on file  (cost=0.00..280519.50 rows=236283 width=4) (actual time=130.950..3289.826 rows=204 loops=1)
> > >                     Filter: (jobid = ANY ('{2629,2616,2598,2584,2570,2556,2537,2519,2503,2488,2473,2458,2440,2425,2410,2395,2380,2363,2348,2333,2318,2303}'::integer[]))
> > >               ->  Hash  (cost=8713.59..8713.59 rows=366159 width=36) (actual time=453.745..453.745 rows=366951 loops=1)
> > >                     ->  Seq Scan on path  (cost=0.00..8713.59 rows=366159 width=36) (actual time=0.021..179.985 rows=366951 loops=1)
> > > Total runtime: 4000.694 ms
> > > (10 rows)
> > >
> > > This shows the database choosing to use the partial index when both
> > > exist.  My impression of the partial index's advantages is that it
> > > is less expensive to maintain.  Maintenance of the index will only
> > > be triggered when a record in file is modified that contains the
> > > filenameid.  The disadvantage is that it could not be created until
> > > after the first job is done and the filenameid where name='' can be
> > > queried for.
>
> > Partial indexes are usually only useful if what you are looking for
> > is a constant.
>
> My hope and current understanding is that the filenameid which is
> returned from the query
>
>   SELECT FilenameId FROM Filename WHERE name=''
>
> is "relatively" constant and will not change until the last job that
> has a directory is purged from the database.

It is constant, but different from install to install.
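For reference, here is a minimal sketch of the kind of partial index being discussed. The index name is taken from the plan above; the indexed column list and the literal `3` are my assumptions, since the FilenameId for name='' differs from install to install (which is exactly why it cannot be created before the first job runs):

```sql
-- Hedged sketch: the literal 3 must be replaced with the result of
--   SELECT FilenameId FROM Filename WHERE Name = '';
-- on the install in question.
CREATE INDEX file_filenameid_jobid2
    ON File (JobId)
    WHERE FilenameId = 3;
```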
> > I think you are proposing creating an index on-the-fly.
>
> On the fly is a term that I would use if the index were being
> re-created on a regular basis; this index is not.
>
> > I think this is not the way to go.  I think we can/should find
> > another way to improve this query's performance.
>
> If there is another way, that would be most excellent.  All the ways I
> can imagine involve changes in the schema: for example, if there were a
> client column in the file table, an index could be put on it and the
> interface could be modified to not worry about jobs, only the client.

A change in the schema could also mean distinct tables (prefixed, or in
a different schema)...

> > If you can put your database on a website somewhere, I'll download it
> > and try it out and work on the query that way.
>
> I will respectfully choose not to do that at this time.  I fear there
> may be reprisals from management if it were ever found out that I made
> corporate data describing our entire file structure available.  Not to
> mention using a gig's worth of our company's bandwidth, and that we
> have no web server to place the file on.
>
> > With this, we'll both be working with the same dataset, and I can
> > show you how to make things faster.
>
> I think we can do this with separate data as long as you have a
> reasonably large data set to be comparable.
>
>   SELECT DISTINCT Path.Path AS Path, File.PathId AS PathId FROM File
>   INNER JOIN Path ON (File.PathId=Path.PathId) WHERE
>   File.FilenameId=YVAL AND File.Jobid IN (XLIST) ORDER BY Path
>
> Just use the above and replace YVAL with the result of:
>
>   SELECT FilenameId FROM Filename WHERE name=''
>
> and replace XLIST with the result of:
>
>   SELECT Job.Jobid AS Id FROM Job INNER JOIN Client ON
>   (Job.ClientId=Client.ClientId) INNER JOIN FileSet ON
>   (Job.FileSetId=FileSet.FileSetId) WHERE Client.Name='srvalum1-fd'
>   ORDER BY Job.EndTime DESC LIMIT 20
>
> Of course, replace srvalum1-fd with your favorite client.
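For what it's worth, the two substitutions described above can also be folded into the main query as subqueries, so the whole thing runs in one round trip. This is only a sketch, using the table and column names from the messages above; 'srvalum1-fd' is the example client name and should be replaced with your own:

```sql
-- Sketch: YVAL and XLIST replaced with subqueries.
SELECT DISTINCT Path.Path AS Path, File.PathId AS PathId
FROM File
INNER JOIN Path ON (File.PathId = Path.PathId)
WHERE File.FilenameId = (SELECT FilenameId FROM Filename WHERE Name = '')
  AND File.JobId IN (SELECT Job.JobId
                     FROM Job
                     INNER JOIN Client ON (Job.ClientId = Client.ClientId)
                     WHERE Client.Name = 'srvalum1-fd'
                     ORDER BY Job.EndTime DESC
                     LIMIT 20)
ORDER BY Path;
```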
> Thank you very much for your willingness to take a look at the indexes
> I am proposing.

I still have the same issues I discussed with you a few days ago: even
with your new index, everything will look good as long as the SELECT
DISTINCT returns only a very small amount of data (a few hundred to a
few thousand records).  It will be unbearable from a GUI point of view
if you have to retrieve hundreds of thousands of directories at once (or
even millions, as the job list takes the last 20 jobs, even if they are
full jobs).  So here we are talking about hundreds of thousands of index
scans/seeks for the database (because the dir entries are completely
mixed in with the file entries in the File table, you probably won't
have more than one or a few dirs per database page).

My point is, if you really want to make it blazing fast (or at least
reasonably fast), there are, I think, very few methods:

- The dir entries should not be drowned in the bulk of the file entries
  (i.e. you need a new table to store them and only them, or you need to
  sort the table physically on disk so that directories are all stored
  together on a few contiguous pages...)

- Even better, retrieve as few entries as possible each time, which
  means being able to retrieve only the subdirectories of the current
  directory.  This isn't feasible with the current schema either, as you
  need a way to link directories with subdirectories (bi-directionally
  if possible).
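To make the second point concrete, here is a purely hypothetical sketch (the table and column names are mine, not part of the current schema) of a parent-pointer table that would let the GUI fetch only the subdirectories of the current directory:

```sql
-- Hypothetical: one row per directory, pointing at its parent directory.
CREATE TABLE PathHierarchy (
    PathId  integer NOT NULL PRIMARY KEY,  -- the directory (Path.PathId)
    PPathId integer NOT NULL               -- its parent's Path.PathId
);
CREATE INDEX pathhierarchy_ppathid ON PathHierarchy (PPathId);

-- Fetch only the immediate subdirectories of one directory: a handful
-- of index seeks instead of touching every path of the last 20 jobs.
SELECT Path.PathId, Path.Path
FROM PathHierarchy
INNER JOIN Path ON (Path.PathId = PathHierarchy.PathId)
WHERE PathHierarchy.PPathId = 1234;  -- PathId of the current directory
```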