From: Josip A. <jo...@vr...> - 2012-07-15 21:32:21
Thanks Kern, that makes it rather clear; I'll go hashing. Vacation first
though, I think I may do this in August.

Regards...

Kern Sibbald wrote:
> Hello Josip,
>
> OK for dropping the temp SQL table. It is something I would do only if
> memory becomes an issue, as it will someday, but right now it is not
> the main limiting factor.
>
> Either the rb tree or the hash list will work. In my opinion, each
> will consume approximately the same space. The hash list uses space
> for its table, but its links are single links. The rb tree has no
> table, but uses double links.
>
> I think you are missing one critical point: you must match both the
> FileIndex and the JobId. The FileIndex is four bytes and the JobId is
> also four bytes. So the FileIndex and the JobId could be put into one
> of our 64 bit hash tables, but they could also be composed as a
> string. For the rb tree, you will need to compose the FileIndex and
> JobId into a string. In both cases, the payload is a void *, which is
> sufficient to hold a node pointer.
>
> The coding of the two is essentially identical, with the composed
> string being just slightly more complicated (a couple of sprintfs for
> the composition).
>
> It is true that a node does not currently know that it is a hard
> linked file, but I explained last time how to tell it. There are still
> 2 bits left to add two new booleans, so making it aware will not add
> any additional memory -- the only hard part (joking) is to figure out
> how to distinguish the current hard_link flag from the new one you
> must create.
>
> Best regards,
> Kern
>
> On 07/15/2012 06:00 AM, Josip Almasi wrote:
>> Hello Kern,
>>
>> (avoiding nesting mail but keeping it all so I can forward to
>> colleagues)
>> Yep, it all makes sense.
>>
>> A temp table indexed on JobId and FileIndex was the first thing I
>> thought of. Then again, it seems like more work than an in-memory
>> table:
>> - DBs have different temp table syntax
>> - creating and releasing the temp table
>> - more source files to patch
>> - can't beat an in-memory solution.
>> So if I can make it in-memory by patching only two files, I'd go for
>> it.
>>
>> A hash table would surely be faster than an rbtree.
>> But how about memory consumption?
>> I think an rblist would use only 4 bytes for each hardlink pointer,
>> so 4MB for a million hardlinks. (right?)
>> As for complexity, log(n) is waaay better than the current dir
>> traversal. And 1M * log2(1M) is only some 20M comparisons --
>> milliseconds, who cares.
>>
>> As far as I've seen in the code, a node does know it's a hardlink --
>> it's encoded in the stat, as you described.
>> Just keep in mind, I'm not talking about the FD, I'm talking about
>> the restore side, where mark/unmark lasts 80 mins.
>> See, it hurts even more than a slow backup/restore would: we can't
>> even show it in a presentation ;)
>> AFAIK backup/restore works fine, at least my colleagues have no
>> objections. (They may have some once we fix mark/unmark; all in due
>> time.)
>>
>> So, I think my 5 points just need a search-and-replace of rblist with
>> hashtable. Well, maybe. It's your decision -- a resource usage &
>> performance estimate. I'd prefer rblist, and if mark runs for 30 secs
>> or so, leave it be. If mark lasts 5 mins, I'd switch to hashtable,
>> since production servers have 5 times more hardlinks than our test.
>>
>> Regards...
>>
>> Kern Sibbald wrote:
>>> Hello Josip,
>>>
>>> I am posting this at the top to avoid nesting the email too much.
>>>
>>> OK, you got my attention: 1. You are willing to code. 2. You are
>>> working for a partner. 3. You pointed me right to the problem.
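To make the key composition concrete: a minimal sketch of the two
variants Kern describes above, a packed 64-bit key and a composed
string key. Nothing below is Bacula API; the function names and the
":" separator are made up for illustration.

    #include <cstdint>
    #include <cstdio>

    /* Pack FileIndex and JobId (4 bytes each) into one 64-bit hash key. */
    static uint64_t hardlink_key(int32_t JobId, int32_t FileIndex)
    {
       return ((uint64_t)(uint32_t)JobId << 32) | (uint32_t)FileIndex;
    }

    /* Or compose them as a string key, e.g. for the rb tree. */
    static void hardlink_skey(char *buf, size_t len, int32_t JobId, int32_t FileIndex)
    {
       snprintf(buf, len, "%d:%d", (int)JobId, (int)FileIndex);
    }

Either key uniquely identifies the backed-up file a link points to; the
payload stored under it would be the TREE_NODE pointer.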
>>>
>>> If the SQL queries are not the problem, then there is no need to
>>> store the LinkFI in each node, which would increase the node size
>>> but eliminate the call to db_get_file_attributes_record(). So I
>>> assume that the real problem is the "linear" search through the tree
>>> looking for the hard link, which is uniquely identified by LinkFI
>>> and JobId (by the way, LinkFI is the "pointer" I was talking about
>>> -- a somewhat loose use of the word pointer).
>>>
>>> About the only way I could see to speed this up would be to keep a
>>> hash table of all the links that hashes on LinkFI and JobId and
>>> stores the node address as the result. It could be quite fast, but
>>> it also increases the memory usage, which will already be high
>>> because of the total number of files in the tree.
>>>
>>> Bacula already has hash code, so it wouldn't be hard to add the hash
>>> table as an item in the root of the tree (needs confirmation by
>>> looking at the code). This assumes that the node can know that it is
>>> a hard link. I think it can know, but am not sure: if its link count
>>> in the stat packet is greater than 1 and the LinkFI is zero, then it
>>> is a file that has other files that point to its FileIndex -- i.e.
>>> those hard linked files all have their LinkFI set to the FileIndex
>>> of the file actually backed up. If that is not the case, then some
>>> new code will be needed when the FD processes a hard linked file.
>>> Basically, if I remember right, during the backup process when the
>>> FD sees a file that is hard linked, it saves that file and then puts
>>> it into a hash list; if any other file is linked to a file that is
>>> already in the hash list, that file is not saved a second time, but
>>> written as a hard link with LinkFI "pointing" to the file that was
>>> actually backed up. So all such files must somehow be flagged so
>>> that they can be put into the hash table during restore.
>>>
>>> Another way of doing it that would not require memory would be to
>>> create a temporary SQL table indexed on LinkFI and JobId, with the
>>> value being the node memory address in the tree. That would push the
>>> work off to SQL rather than Bacula's hash routines.
>>>
>>> Without too much thought, I would probably use a hash table (or
>>> something similar) and, if that ran into memory problems, provide
>>> the SQL temp table as an alternative.
>>>
>>> I don't think this is quite what you had in mind with your 5 points
>>> shown below, but I think it would work. Does that make any sense to
>>> you?
>>>
>>> Best regards,
>>> Kern
>>>
>>> On 07/14/2012 04:10 PM, Josip Almasi wrote:
>>>> Kern Sibbald wrote:
>>>>> Hello,
>>>>>
>>>>> On 07/13/2012 08:05 AM, Josip Almasi wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Colleagues ran into performance issues during mark/unmark.
>>>>>> We narrowed it down to hardlinks: mark and unmark both call
>>>>>> set_extract, which traverses the entire directory tree for each
>>>>>> hardlink.
>>>>>>
>>>>>> Let's have a look at hardlinks [1] on a casual RHEL 6 system:
>>>>>>
>>>>>> root@joe:~/tmp> find / ! -type d -links +1 -ls|sort -n>hardlinks.txt
>>>>>> root@joe:~/tmp> wc -l hardlinks.txt
>>>>>> 23298 hardlinks.txt
>>>>>>
>>>>>> Most of these are pyc & pyo, and yum things.
>>>>>>
>>>>>> On our test mail server: 392577 hardlinks in /var/spool/imap.
>>>>>> It's how cyrus lmtpd works, and it's good.
>>>>>> However, mark runs for well over an hour.
>>>>>> And on a production mail server: 1459466 hardlinks in imap
>>>>>> folders.
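Picking up Kern's test above for recognizing the file that was actually
backed up: a tiny sketch, with the check taken from his description and
the function name invented here.

    #include <cstdint>
    #include <sys/stat.h>

    /* A node holds the real copy of a hard-linked file if its stat
     * link count is > 1 and its LinkFI is 0, i.e. other files point
     * to *its* FileIndex rather than it pointing elsewhere. */
    static bool is_hardlink_original(const struct stat *st, int32_t LinkFI)
    {
       return st->st_nlink > 1 && LinkFI == 0;
    }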
>>>>>>
>>>>>> Turning off hardlinks in bacula is not a solution, and neither is
>>>>>> turning them off in cyrus.
>>>>>> Bottom line, we can't sell bacula for mail servers.
>>>>>>
>>>>>> So, what can we do about it?
>>>>>>
>>>>>> An rblist could be used for lookup instead of dir tree traversal:
>>>>>
>>>>> The tree code was written a long time ago, and from what I
>>>>> remember, it already uses a red black tree structure, so I am not
>>>>> sure without looking at the code exactly what you want to do
>>>>> differently. As best I can tell, you want a separate hard-link
>>>>> list.
>>>>
>>>> Sure, I'd keep a hardlink list, but I'd use rblist for log(n)
>>>> complexity.
>>>>
>>>>>> 1) TREE_ROOT would contain a hardlinks rblist member (tree.h)
>>>>>> 2) A comparator function needs to be implemented (ua_tree.c?)
>>>>>>    It would compare items based on JobId and FileIndex.
>>>>>> 3) insert_tree_handler needs to maintain the hardlink rblist
>>>>>>    (ua_tree.c) -- only the first occurrences, the ones marked ok.
>>>>>>    This would build up the hardlinks rblist along with the
>>>>>>    directory tree.
>>>>>> 4) set_extract would use that rblist for hardlink lookup
>>>>>>    (ua_tree.c)
>>>>>> 5) What about bookkeeping? I see rblist internally allocates and
>>>>>>    frees, and free_tree would release it all, right?
>>>>>>
>>>>>> Did I miss something?
>>>>>
>>>>> I am sorry, I cannot really say if that is the right thing to do.
>>>>> I am not even sure you have nailed down the problem. It seems to
>>>>> me that Bacula keeps a "pointer" to the link, and the problem I
>>>>> have previously seen is that for each file that is a hard link,
>>>>> Bacula must do an SQL query to find the file -- when you start
>>>>> doing millions of SQL queries, things can get slow unless you are
>>>>> a DBA and do the appropriate DB tuning, which is not evident.
>>>>
>>>> In fact, I've done some measurement.
>>>> First, I enabled postgres query timing:
>>>>
>>>> log_min_duration_statement = 0  # -1 is disabled, 0 logs all statements
>>>>                                 # and their durations, > 0 logs only
>>>>                                 # statements running at least this number
>>>>                                 # of milliseconds
>>>>
>>>> Queries show up in /var/lib/pgsql/data/pg_log along with their
>>>> duration in milliseconds. Duration was 0.1 - 0.5 ms, as fast as it
>>>> gets. For each file, bacula issues two queries, plus one more for
>>>> each dir.
>>>>
>>>> The total number of queries during mark() on our test box with
>>>> 392577 hardlinks was
>>>>
>>>> [root@sandbox pg_log]# grep SELECT postgresql-Wed.log |wc -l
>>>> 639067
>>>>
>>>> The total number of unique queries was
>>>>
>>>> [root@sandbox pg_log]# grep SELECT postgresql-Wed.log |awk --field-separator='SELECT ' '!a[$2]++'|wc -l
>>>> 377958
>>>>
>>>> meaning we issued 69% more queries than needed.
>>>>
>>>> But the bottleneck was not the database; it was dird, spending some
>>>> 90% CPU for some 80 mins.
>>>>
>>>> That led me to the tree traversal in set_extract:
>>>>
>>>> static int set_extract(UAContext *ua, TREE_NODE *node, TREE_CTX *tree, bool extract)
>>>> ...
>>>>    /*
>>>>     * If we point to a hard linked file, traverse the tree to
>>>>     * find that file, and mark it to be restored as well. It
>>>>     * must have the Link we just obtained and the same JobId.
>>>>     */
>>>>    if (LinkFI) {
>>>>       for (n=first_tree_node(tree->root); n; n=next_tree_node(n)) {
>>>>       ...
>>>>
>>>> An in-memory linear search for the first occurrence of a specific
>>>> hardlinked file.
>>>>
>>>> The queries seem to result from db_get_file_attributes_record()
>>>> calls within the same function.
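For illustration, here is a sketch of what that hot path could become
with the (JobId, FileIndex) lookup table discussed above.
std::unordered_map stands in for Bacula's htable, TREE_NODE is cut down
to a stub, and every name here is illustrative rather than actual
Bacula code.

    #include <cstdint>
    #include <unordered_map>

    struct TREE_NODE {          /* stub; the real node lives in tree.h */
       int32_t JobId;
       bool extract;
    };

    /* Pack JobId and FileIndex into one 64-bit hash key. */
    static uint64_t hardlink_key(int32_t JobId, int32_t FileIndex)
    {
       return ((uint64_t)(uint32_t)JobId << 32) | (uint32_t)FileIndex;
    }

    /* Filled once during tree build, e.g. from insert_tree_handler():
     * the first occurrence of each hard-linked file. */
    static std::unordered_map<uint64_t, TREE_NODE *> hardlinks;

    static void remember_hardlink(int32_t JobId, int32_t FileIndex, TREE_NODE *node)
    {
       hardlinks.emplace(hardlink_key(JobId, FileIndex), node);
    }

    /* In set_extract(): O(1) lookup replaces the full tree traversal. */
    static void mark_link_target(int32_t JobId, int32_t LinkFI)
    {
       if (LinkFI) {
          auto it = hardlinks.find(hardlink_key(JobId, LinkFI));
          if (it != hardlinks.end()) {
             it->second->extract = true;   /* mark the linked-to file too */
          }
       }
    }

That would turn each of the ~300K hard-link marks into a constant-time
lookup instead of a walk over the whole tree, which is where the 80
minutes go.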
>>>> I'm not sure how I'd get duplicate queries there, but that's of
>>>> secondary importance now: the lines above mean 300K full dir tree
>>>> traversals.
>>>>
>>>> As for the pointer to the link, I've seen it only in filed, and
>>>> used only on Solaris. Not sure what it means though. But I surely
>>>> can't simply reuse it.
>>>>
>>>>> So, are you going to implement this and submit it as a
>>>>> contribution?
>>>>
>>>> That's the plan, assuming I'm on the right track.
>>>>
>>>>> If so, please let me know and I will look at the code a bit more
>>>>> when I get back from vacation in August and tell you what I think.
>>>>
>>>> Please do.
>>>> I'm hoping to use some vacation too :) So August is just fine for
>>>> me.
>>>>
>>>>> Hard links have always been a problem for Bacula
>>>>
>>>> Not only for bacula; I think tivoli, for example, simply ignores
>>>> them and treats them as separate files.
>>>> Bad tivoli, bad -- can't restore /var/spool/imap :>
>>>>
>>>>> -- and by the way, if you create too many of them, you may not be
>>>>> able to boot your system, since during the (non-Bacula) boot
>>>>> process, from what I understand, all the hard links found must fit
>>>>> within the available boot memory.
>>>>
>>>> I don't think that applies to non-boot partitions.
>>>> Regardless, we do have zillions of hardlinks on working mail
>>>> servers.
>>>>
>>>>> If not, are you asking us to implement something for free so that
>>>>> you can sell it and make money off of Bacula? It might be possible
>>>>> for the Enterprise version, but right now I am pretty overloaded
>>>>> doing bug fixes and other implementations for the community
>>>>> version.
>>>>
>>>> Oh, I work for a Bacula partner, Nimium (in public I use my open
>>>> source mail address).
>>>> I think we already made some money off bacula, and once we solve
>>>> this, we're about to make some more ;)
>>>> My boss can provide more details on that; I deal with the code.
>>>>
>>>> Regards...