From: Kern S. <ke...@si...> - 2007-09-02 10:48:30
On Thursday 30 August 2007 20:44, Marc Cousin wrote:
> > > If we separate the current File data into Dirs and Files, first, we
> > > have reduced the amount of data that we need to look through to find
> > > the directory tree for a job by about a factor of 100 on most typical
> > > Unix systems. That is already good. Then, in the Dirs table we can
> > > have the following columns:
> > >
> > > JobId
> > > ParentId
> > > PathId
> > >
> > > If ParentId is NULL (or perhaps zero), we know it is a top level
> > > directory. That is one directly mentioned in a FileSet. Otherwise, it
> > > is nested down. So for a given JobId we can quickly find all the top
> > > level directories and parse them any way we want.
> >
> > So it means that the client has to find all the root directories of the
> > backup, then calculate their parent directories if required, to display
> > them. Is the root directory the one set up in the FileSet?
> > Is there no risk of missing some 'root directories' from incremental
> > backups, where the real root directory has not been modified? (I
> > honestly don't know, I'm asking...)
> > BTW, avoid NULL at all costs if you want to be able to use the index to
> > retrieve your records: NULL values aren't indexed.
>
> I've been thinking about it again. What I don't like about chaining Ids
> in the Dirs table is that it links some Dir records to dirs of another
> JobId (in the case of incremental backups). And when we do the
> incremental, we may not even know which ParentId to link to, as there
> will be several of them available.

Actually, it is worse than what you describe above. The problem is that I
did not consider in my proposal what happens in an Incremental backup. As
proposed, it simply will not work.

> Maybe it would be easier to add a ParentId in the Path table.

This is probably a nice solution that will help improve performance a lot.

> Of course it means we don't restrict links to the ones that should be
> displayed for a given server... But this is then easily filtered by
> matching data from the Dirs or File table, and it saves a lot of space.
> Of course, it defeats the purpose of having an easy way to recognise
> 'root' directories, as the info isn't there anymore... Maybe then this
> info should be stored in another place? Something like having more
> metadata in the Job table (or another table describing all the root
> directories associated with a particular job, or anything of this sort,
> I really don't know).

Yes, I think we will need some new table. If this interests you, I would be
interested in what you could come up with as a proposal. The problem I have
in designing this is that I have never written code like your brestore, so
I would probably not get the table structure right the first time (missing
minor points).

Now is probably the time to implement the tables, even if we cannot
implement all the Bacula core code necessary, since the next release of
Bacula will probably be around the end of the year and will be version
3.0.0. That will be the first version free of the OpenSSL license
constraints, and it would be a good time to make any database
reorganization, such as adding a new field to the Path table and adding
new tables.
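To make the Path table idea concrete, here is a rough SQL sketch of what
that new field might look like. The column name, type, and index name are
my own illustrative guesses, not an agreed schema:

   -- Hypothetical: record each directory's parent directly in the Path
   -- table, so the hierarchy is stored once, independent of any job.
   ALTER TABLE Path ADD COLUMN ParentId INTEGER NOT NULL DEFAULT 0;

   -- Using 0 instead of NULL for "no recorded parent" keeps the column
   -- usable by an index, per Marc's warning about NULLs above.
   CREATE INDEX path_parentid_idx ON Path (ParentId);

   -- Example: list all recorded subdirectories of one directory
   -- (1234 is an assumed PathId for the parent).
   SELECT PathId, Path FROM Path WHERE ParentId = 1234;

As Marc notes, restricting such a lookup to the directories actually
present in one job would still require a join against the Dirs (or File)
table for that JobId.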
> Having a table/set of tables describing precisely how a backup was done
> may be very interesting, compared to storing a big amount of useless
> data in these tables: it seems better to pay a fixed cost to save the
> full metadata of each backup than to waste 4 bytes per dir to save the
> ParentId of every dir we back up?

Yes, my intention was never to store a whole lot of extra data, and
certainly we need to be careful in doing Diff and Inc backups that we
don't store a whole directory tree.

On the other side of that, you mentioned somewhere that you didn't see the
need for the PathId in the File record if we have a Dirs table (which I
still think is a good idea, since it separates Files and Directories). The
reason I kept the PathId was twofold:

1. It means that none of the existing code that uses PathId needs to be
changed. That doesn't mean that we cannot change it if we find better
mechanisms, but it takes some of the pressure off.

2. I don't like multiple SQL links. I.e. I don't much like that to get the
Path from a File record, you have to do a lookup in the Dirs table and
then a lookup in the Path table.

Actually, in my proposal, I didn't think that adding a DirId was really
necessary in the File table, but it is probably desirable.

Given that my proposal did not take into account Diff and Inc backups, I
think it needs to be totally reworked. Ideas that I like:

1. Separate the current File table into Files and Dirs.
2. Add a ParentId link in the Path table.
3. Add a new table (or two) that provides the rest of the information that
a "browser" needs for efficient lookups.
4. Try to design it so that most, if not all, of the entries can be
created during the backup.
5. Try to design it so that users have some flexibility (probably via
Bacula scripts) as to which indexes are created. I.e. if they do not
browse, maybe some indexes can be eliminated, and if someone browses a
lot, then have some scripts that will easily add the indexes for the user.

> > > In the Files table, in *addition* to the existing columns (Path and
> > > Filename), if we need it we can have a DirId, which points to the
> > > Dirs record for the given Path and Filename.
> >
> > If we have the DirId, we don't need the PathId anymore, I guess, as it
> > would be in the Dirs table.
>
> Then comes another doubt :)
> What happens if the DirId isn't there anymore? (We have made an
> incremental backup of a file, and the full it refers to doesn't exist
> anymore.)

> > > To do the above, we
> > > 1. Split the File table into Dirs and Files
> > > 2. Add one new column to Files, which is DirId (if necessary)
> > > 3. Delete the FilenameId from the Dirs record (i.e. it is identical
> > > to the current File record less the FilenameId column).
> > > 4. Add one new ParentId column to the Dirs table.
> >
> > Here I've got a question: will you calculate the ParentId at insert
> > time? (We must avoid updates; they have a big performance impact on
> > all transactional DBMSs.) I really don't know how much it will cost,
> > but it may slow down database insertions by a big amount... The parent
> > dir may not even be in the same job...
> >
> > The point of brestore's method is to calculate as few of these links
> > as possible, thanks to the hierarchy table: the links between dirs and
> > their parents are not correlated with jobs in our case, so we do the
> > links once for all the jobs, except in the case of a new directory.
>
> If we store ParentPathId in the Path table, it becomes much less
> costly... but then again, we don't have the root directories information
> at hand anymore.

I think having the top level directories information is important. There
is probably no reason why we cannot add it for each JobId.
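Something along the following lines, perhaps. To be clear, the table name
RootDirs and its exact layout are only a strawman of mine for discussion,
not a worked-out design; all the ids in the examples are invented:

   -- Hypothetical per-job root directory table: one row for each top
   -- level directory saved by a job.
   CREATE TABLE RootDirs (
      JobId  INTEGER NOT NULL,      -- references Job.JobId
      PathId INTEGER NOT NULL,      -- references Path.PathId
      PRIMARY KEY (JobId, PathId)   -- permits several roots per job
   );

   -- A job that backs up two separate trees would simply get two rows:
   INSERT INTO RootDirs VALUES (5678, 100);   -- PathId of the first root
   INSERT INTO RootDirs VALUES (5678, 200);   -- PathId of the second root

   -- Listing the top level directories of a job becomes a single join:
   SELECT Path.Path
     FROM RootDirs, Path
    WHERE RootDirs.JobId = 5678
      AND RootDirs.PathId = Path.PathId;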
I'm sure you are fully aware of it, but some developers may not be: such a
table would need to hold multiple top level directory entries for a given
JobId. For Linux, all the top level directories are ultimately accessible
through /, but for Windows, there can actually be multiple roots.

As noted above, if we can come up with a new table structure in, say, the
next month, I believe that we can get it into the 3.0.0 release, even if
we don't have all the backend code implemented. Once the table structure
is implemented (i.e. in 3.0.0), we can, if necessary, add additional
functionality in subreleases (i.e. start using the new columns).

Best regards,

Kern